Ethan B. Holland

Over 54,400 manually organized AI links and counting

Multimodal: AI News Week Ending 02/06/2026

February 6, 2026

Image created with gemini-3.1-flash-image-preview with claude-sonnet-4-5. Image prompt: Flat cartoon illustration of a friendly coral-red lobster mascot centered on dark charcoal background holding white speech bubble containing ‘MULTIMODALITY’ text in Helvetica font, surrounded by floating minimal icons of music note, camera, document, and microphone in cyan and white, kawaii mascot style with clean geometric shapes and high contrast.

New milestone: we trained a robot foundation model on a world model backbone, and enabled zero-shot, open-world prompting capability for new verbs, nouns, and environments. If the world model can “dream” the right future in pixels, then the robot can execute well in motors. We… https://x.com/DrJimFan/status/2019112603637920237

Eleven v3 — Most Expressive AI Voice Model https://elevenlabs.io/v3

ElevenLabs CEO: Voice is the next interface for AI | TechCrunch https://techcrunch.com/2026/02/05/elevenlabs-ceo-voice-is-the-next-interface-for-ai/

ElevenLabs raises $500M Series D at $11B valuation https://elevenlabs.io/blog/series-d

Voxtral transcribes at the speed of sound. | Mistral AI https://mistral.ai/news/voxtral-transcribe-2

📢 New paper from GEAR team @NVIDIARobotics We released DreamZero, a World Action Model that turns video world models into zero-shot robot policies. Built on a pretrained video diffusion backbone, it jointly predicts future video frames and actions. 🌐… https://x.com/yukez/status/2019096072690553112

Introducing NVIDIA Cosmos Policy for Advanced Robot Control https://huggingface.co/blog/nvidia/cosmos-policy-for-robot-control

DreamZero: World Action Models are Zero-shot Policies
https://dreamzero0.github.io/

Jim Fan on X: “The Second Pre-training Paradigm” / X
https://x.com/DrJimFan/status/2018754323141054786

Website: https://t.co/2YwjQs3JMC Robot execution demos across various verbs, nouns, and environments: https://t.co/loUZXZODcR The model is open-source! https://x.com/DrJimFan/status/2019112605315637451

🔊 Debug Voice Agents with LangSmith 🔊 The STT → Agent → TTS “sandwich” is a standard voice agent pattern. It’s easy to get started, tough to build reliable systems. 😵‍💫 Learn how to debug voice agents: We created a voice agent with Pipecat and sent traces to LangSmith to… https://x.com/LangChain/status/2019846811997942219
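A minimal Python sketch of the STT → Agent → TTS "sandwich" described above, with a per-stage trace of the kind one might send to an observability tool. This does not use the Pipecat or LangSmith APIs; `VoicePipeline`, `Span`, and the three `fake_*` stages are hypothetical stand-ins invented for illustration.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Span:
    """One traced stage of a voice turn (hypothetical record format)."""
    name: str
    input: str
    output: str
    seconds: float


@dataclass
class VoicePipeline:
    """Chains speech-to-text, an agent, and text-to-speech, tracing each stage."""
    stt: Callable[[bytes], str]
    agent: Callable[[str], str]
    tts: Callable[[str], bytes]
    trace: List[Span] = field(default_factory=list)

    def _run(self, name, fn, arg):
        # Time the stage and record its input and output for later debugging.
        start = time.perf_counter()
        result = fn(arg)
        elapsed = time.perf_counter() - start
        self.trace.append(Span(name, repr(arg), repr(result), elapsed))
        return result

    def handle_turn(self, audio_in: bytes) -> bytes:
        text_in = self._run("stt", self.stt, audio_in)
        reply = self._run("agent", self.agent, text_in)
        return self._run("tts", self.tts, reply)


# Stand-in stages: real systems would call actual STT, LLM, and TTS services.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_agent(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")
```

Calling `VoicePipeline(fake_stt, fake_agent, fake_tts).handle_turn(b"hello")` fills `trace` with one span per stage, which is where a wrong transcript or a slow stage would show up.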

Waypoint-1.1 is live, and we’re kicking off weekly updates. This release crosses an important line from impressive short rollouts to local, real-time world models that are coherent, controllable, and playable. New model. Better prompting. Smoother rollouts.… https://x.com/overworld_ai/status/2019109415023178208

Planning is one of the most exciting uses of world models, but existing planners struggle on long horizons. Introducing GRASP: a fast gradient-based planner for world models that outperforms prior methods on long-horizon tasks. Two key ideas: 1. jointly optimize actions and… https://x.com/_amirbar/status/2019903658792497482

tl;dr New planner for world models! GRASP: gradient-based, stochastic, parallelized. Long range planning for world models has always been an issue. 0th order methods like CEM/MPPI dominate, but have degrading performance at longer contexts or higher-dimensional actions. We… https://x.com/michaelpsenka/status/2019870377032503595
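For readers unfamiliar with the zeroth-order baselines named above, here is a small Python sketch of the cross-entropy method (CEM) planning over a toy one-dimensional world model. It is not GRASP and not taken from the paper; the dynamics, cost function, and all parameter values are assumptions chosen only to keep the example short.

```python
import random
import statistics


def rollout_cost(actions, start=0.0, goal=5.0):
    """Toy world model (assumed): x_{t+1} = x_t + a_t.

    Cost is the squared distance to the goal at the end of the rollout
    plus a small penalty on action magnitude.
    """
    x = start
    for a in actions:
        x += a
    return (x - goal) ** 2 + 0.01 * sum(a * a for a in actions)


def cem_plan(horizon=10, iters=30, pop=200, n_elite=20, seed=0):
    """Cross-entropy method: sample plans, keep the best, refit, repeat.

    No gradients of the world model are used, only cost evaluations,
    which is what makes the method zeroth-order.
    """
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [1.0] * horizon
    for _ in range(iters):
        # Sample candidate action sequences from the current Gaussian.
        samples = [
            [rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(pop)
        ]
        # Keep the lowest-cost candidates as elites.
        samples.sort(key=rollout_cost)
        elites = samples[:n_elite]
        # Refit the per-timestep mean and spread to the elites.
        for t in range(horizon):
            column = [e[t] for e in elites]
            mean[t] = statistics.fmean(column)
            std[t] = statistics.pstdev(column) + 1e-3
    return mean
```

The search space grows with `horizon` times the action dimension, so the number of samples needed to find good elites grows with it; that is the scaling pressure the tweet attributes to CEM/MPPI.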

Robbyant has announced LingBot-VLA: an open-source Vision-Language-Action model – Pretrained on ~20k hours of real-world dual-arm robot data – Strong generalization across 9 embodiments – Improves consistently with more data – Claims outperformance over π₀.₅, GR00T N1.6 &… https://x.com/TheHumanoidHub/status/2017337216054575513

World Model meets robot policy! Robbyant’s LingBot-VA: unifies video world modeling and robotic policy learning. – A single model generates both future video and the actions to make it real. – Long-term memory enables long-horizon tasks. – Claims significant outperformance over… https://x.com/TheHumanoidHub/status/2017638555741552672

self-driving <as a 2D robot with a low-dim action space that focused mostly on avoidance rather than interaction> will reach real-world impact faster than anything else. the really cool part is that the world model isn’t just about videos; it’s about modeling continuous,… https://x.com/sainingxie/status/2019841784990351381

Accelerating Creation, Powered by Roblox’s Cube Foundation Model | Roblox https://about.roblox.com/newsroom/2026/02/accelerating-creation-powered-roblox-cube-foundation-model

A useful tool: VoxCPM – a tokenizer-free text-to-speech system for realistic voices. It’s like diffusion meets autoregressive speech generation, without discrete tokens. It generates continuous speech representations directly from text, removing the bottleneck that limits… https://x.com/TheTuringPost/status/2017719802375393616

A week after PaddleOCR-VL-1.5 took the top spot on OmniDocBench, *another* 0.9B model dethrones it! GLM-OCR shows SOTA results on doc parsing benchmarks and it’s apparently 50-100% faster https://x.com/jerryjliu0/status/2018713059359899729

Radiologists are a good example: a job we were promised since 2016 would soon disappear. The lesson is that even if the core tasks underlying a job can be done with AI, that doesn’t mean the human expert isn’t still needed. https://x.com/fchollet/status/2019610588612292834

The Helix team at Figure spent the last 12 months hitting wall after wall on what seemed like a simple problem: How do we give our AI model, Helix, control of the entire humanoid body (pixels in; motor torques out)? Core to our belief is shipping fully autonomous robots that… https://x.com/adcock_brett/status/2016743751088263238

Congrats to @MistralAI on releasing Voxtral Mini 4B Realtime! 🎉 Day-0 support in vLLM! A 4B streaming ASR model achieving <500ms latency while matching offline model accuracy, supporting 13 languages. vLLM’s new Realtime API `/v1/realtime` provides audio streaming – optimized… https://x.com/vllm_project/status/2019106596794814894

FlashAI Voice Agents – FlashLabs https://www.flashlabs.ai/flashai-voice-agents?AHA_ORDER_ID=69708acfa57db72d632e9528&AHA_CAMPAIGN_ID=44046&AHA_SOURCE=linkedin

We’re introducing WorldVQA, a new benchmark to measure atomic vision-centric world knowledge in Multimodal Large Language Models. Current evaluations often conflate visual knowledge retrieval with reasoning. In contrast, WorldVQA decouples these capabilities to strictly measure… https://x.com/Kimi_Moonshot/status/2018697552456257945

ollama pull glm-ocr. All local. You own your data. GLM-OCR delivers state-of-the-art performance for document understanding. Use it for recognizing text, tables, and figures, or output to a specific JSON format. Drag and drop images into the terminal, script it or access via… https://x.com/ollama/status/2018525802057396411
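One way the "script it" option above might look in Python, calling Ollama's local `/api/generate` endpoint after the model has been pulled. This is a sketch under assumptions: it presumes an Ollama server on the default port 11434, and the prompt wording is a placeholder, not a prompt documented for GLM-OCR. The helper names `build_ocr_request` and `run_ocr` are hypothetical.

```python
import base64
import json
import urllib.request

# Default address of a locally running Ollama server (assumed unchanged).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_ocr_request(image_bytes: bytes,
                      prompt: str = "Extract the text from this image.") -> dict:
    """Build the JSON body for a single non-streaming generate call.

    The prompt default is a placeholder; check the model card for the
    prompts GLM-OCR actually expects.
    """
    return {
        "model": "glm-ocr",
        "prompt": prompt,
        # Ollama accepts images as a list of base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def run_ocr(image_path: str) -> str:
    """Send one image to the local server and return the generated text."""
    with open(image_path, "rb") as f:
        payload = build_ocr_request(f.read())
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

`run_ocr("invoice.png")` would return the model's text for a hypothetical file `invoice.png`; nothing leaves the machine, which matches the "all local" point in the announcement.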

Nemotron ColEmbed V2: Raising the Bar for Multimodal Retrieval with ViDoRe V3’s Top Model https://huggingface.co/blog/nvidia/nemotron-colembed-v2

Reinforcement Learning for Active Perception in Autonomous Navigation. [📍GitHub & Paper] Most robots navigate as if their cameras were nailed in place. But perception is not passive. Animals move their heads and eyes constantly to decide where to go next. Robots should do… https://x.com/IlirAliu_/status/2018762226170016109

Robust humanoid perceptive locomotion is still underexplored. Especially when different cameras see different terrains, paths get narrow, and payloads disturb balance… Introduce RPL, tackling this with one unified policy: • Challenging terrains (slopes, stairs and stepping… https://x.com/Yuanhang__Zhang/status/2019092752240181641

Tired of teleoperation? One human video → 1,000s of robot demos. (📍GitHub) Scaling Robot Data Without Dynamics Simulation or Robot Hardware. Real2Render2Real (R2R2R) is a new way to scale robot data without physics simulation or hardware. You take a phone scan + a single… https://x.com/IlirAliu_/status/2017884655869976975

PaperBanana: Automating Academic Illustration for AI Scientists https://dwzhu-pku.github.io/PaperBanana/

Grok Imagine rank 1 https://x.com/elonmusk/status/2019164163906629852

Introducing Grok Imagine 1.0, our biggest leap yet. 1.0 unlocks 10-second videos, 720p resolution, and dramatically better audio. Imagine has generated 1.245 billion videos in the last 30 days alone. Try it now: https://x.com/xai/status/2018164753810764061?s=20

⚡ Deploy GLM-OCR from @Zai_org on Novita with ease! GLM-OCR is a state‑of‑the‑art multimodal OCR model for real‑world business OCR at scale. Key Features: ✨ #1 on OmniDocBench V1.5 (94.62) ✨ Excels at tables, formulas, code, seals & complex layouts ✨ 0.9B params for… https://x.com/novita_labs/status/2018565896013574225

🎉 Congrats to @Zai_org on releasing GLM-OCR! We have day-0 support ready for GLM-OCR in SGLang! GLM-OCR is a multimodal OCR model built for complex document understanding 🏆 SOTA document understanding: #1 on OmniDocBench v1.5 (94.62), strong on tables, formulas & IE 📄 Built… https://x.com/lmsysorg/status/2018521181146751486

🚀 Congrats to @Zai_org on the GLM-OCR release! vLLM has day-0 support ready 💪 Great work on pushing OCR for complex document understanding forward with only 0.9B params! PR: https://t.co/eurZfbac8e Try it now:… https://x.com/vllm_project/status/2018582480518091083

zai-org/GLM-OCR · Hugging Face https://huggingface.co/zai-org/GLM-OCR