Image created with gemini-3.1-flash-image-preview and claude-sonnet-4-5. Image prompt: Animation cel style image of a muscular blue genie with friendly expression emerging from a golden brass oil lamp, five distinct objects floating in an arc around him connected by glowing cyan magical wisps: a camera lens, vintage microphone, handwritten paper, musical staff, and painter’s palette, warm Arabian Nights gradient background, Disney quality 2D animation aesthetic with clean lines and volumetric magical effects, horizontal composition with clear space for text overlay across top third.

We’ve launched the first official extension to MCP. MCP Apps lets tools return interactive interfaces instead of just plain text. Live in Claude today across a range of tools. https://x.com/alexalbert__/status/2015854375051428111

Your work tools are now interactive in Claude. Draft Slack messages, visualize ideas as Figma diagrams, or build and see Asana timelines. https://x.com/claudeai/status/2015851783655194640

MCP Apps – Bringing UI Capabilities To MCP Clients | Model Context Protocol Blog https://blog.modelcontextprotocol.io/posts/2026-01-26-mcp-apps/

Interactive tools in Claude | Claude https://claude.com/blog/interactive-tools-in-claude
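To ground the announcement above, here is a minimal sketch of the MCP Apps pattern using FastMCP from the official Python MCP SDK: a server publishes an HTML template as a ui:// resource alongside an ordinary tool, and an Apps-capable client renders that template instead of the tool's plain text. The resource URI, tool, and payload shape here are illustrative assumptions; the normative tool-to-UI metadata wiring is defined in the MCP Apps spec linked above.

```python
# Hedged sketch of the MCP Apps pattern; not the normative spec wiring.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("timeline-demo")

@mcp.resource("ui://timeline/view")  # assumed URI; Apps UIs live at ui:// schemes
def timeline_ui() -> str:
    """HTML template an MCP Apps-capable client can render for the tool below."""
    return "<html><body><div id='timeline'></div></body></html>"

@mcp.tool()
def get_timeline(project: str) -> dict:
    """Structured data for the rendered UI to visualize (hypothetical shape)."""
    return {"project": project, "tasks": [{"name": "draft", "week": 1}]}

if __name__ == "__main__":
    mcp.run()  # stdio transport; the client pairs the tool with its UI resource
```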

I got early access to Project Genie from @GoogleDeepMind ✨ It’s unlike any realtime world model I’ve tried – you generate a scene from text or a photo, and then design the character who gets to explore it. I tested dozens of prompts. Here are the standout features 👇 https://x.com/venturetwins/status/2016919922727850333

HOLY FUCK Genie 3 is the craziest thing I’ve tried in a long time Just… wow. Watch this. https://x.com/mattshumer_/status/2017058981286396001

Project Genie is an impressive demonstration of what world models can do. But there’s a difference between seeing the future and being able to build with it today. This is what running locally looks like. https://x.com/overworld_ai/status/2017298592919392717

Here’s how it works: 🔵 Design your world and character using text and visual prompts. 🔵 Nano Banana Pro makes an image preview that you can adjust. 🔵 Our Genie 3 world model generates the environment in real-time as you move through. 🔵 Remix existing worlds or discover new… https://x.com/GoogleDeepMind/status/2016919762924949631

Project Genie is a prototype web app powered by Genie 3, Nano Banana Pro + Gemini that lets you create your own interactive worlds. I’ve been playing around with it a bit and it’s…out of this world:) Rolling out now for US Ultra subscribers. https://x.com/sundarpichai/status/2016979481832067264

5/ Building responsibly 🛡️ Building AI responsibly is core to our mission. As an experimental @GoogleLabs prototype, Project Genie is still in development. This means you might encounter 60-second generation limits, control latency, or physics that don’t always perfectly adhere… https://x.com/Google/status/2016972686208225578

Project Genie: AI world model now available for Ultra users in U.S. https://blog.google/innovation-and-ai/models-and-research/google-deepmind/project-genie/

Thrilled to launch Project Genie, an experimental prototype of the world’s most advanced world model. Create entire playable worlds to explore in real-time just from a simple text prompt – kind of mindblowing really! Available to Ultra subs in the US for now – have fun exploring! https://x.com/demishassabis/status/2016925155277361423

Introducing Project Genie: An experimental research prototype powered by Genie 3, our world model, that lets you prompt an interactive world into existence — and then step inside 🌎 https://x.com/Google/status/2016926928478089623

Project Genie is rolling out for AI Ultra members in the USA. It’s an experimental tool that allows you to create and explore infinite virtual worlds, and I’ve never seen anything like this. It’s still early, but it’s already unreal. Nano Banana Pro + Project Genie = My low-poly… https://x.com/joshwoodward/status/2016921839038255210

Step inside Project Genie: our experimental research prototype that lets you create, edit, and explore virtual worlds. 🌎 https://x.com/GoogleDeepMind/status/2016919756440240479

Project Genie is rolling out to @Google AI Ultra subscribers in the U.S. (18+). With this prototype, we want to learn more about immersive user experiences to advance our research and help us better understand the future of world models. See the details → https://x.com/GoogleDeepMind/status/2016919765713826171

I’ve written 250k+ lines of game engine code. Here’s why Genie 3 isn’t what people think it is: World models are something genuinely new. A third category of media we don’t have a name for yet. Near-term they’re too slow and expensive for consumers. But for training robots? https://x.com/jsnnsa/status/2017276112561422786

Introducing Agentic Vision in Gemini 3 Flash https://blog.google/innovation-and-ai/technology/developers-tools/agentic-vision-gemini-3-flash/

Introducing Agentic Vision — a new frontier AI capability in Gemini 3 Flash that converts image understanding from a static act into an agentic process. By combining visual reasoning with code execution, one of the first tools supported by Agentic Vision, the model grounds… https://x.com/GoogleAI/status/2016267526330601720

Google launches Agentic Vision in Gemini 3 Flash https://www.testingcatalog.com/google-launches-agentic-vision-in-gemini-3-flash/

This paper puts a multimodal agent (using Gemini 2.5) into a realistic medical sim used to train physicians: "The AI agent matches or exceeds [14,000] medical students in case completion rates and secondary outcomes such as time and diagnostic accuracy" https://x.com/emollick/status/2016641414713704957

If NotebookLM was a web browser | AI Focus https://aifoc.us/if-notebooklm-was-a-web-browser/

[AINews] Moonshot Kimi K2.5 – Beats Sonnet 4.5 at half the cost, SOTA Open Model, first Native Image+Video, 100 parallel Agent Swarm manager https://www.latent.space/p/ainews-moonshot-kimi-k25-beats-sonnet

One-shot "Video to code" result from Kimi K2.5. It not only clones a website, but also all the visual interactions and UX designs. No need to describe it in detail; all you need to do is take a screen recording and ask Kimi: "Clone this website with all the UX designs." https://x.com/KimiProduct/status/2016081756206846255

Apple acquires Israeli startup Q.ai https://www.cnbc.com/2026/01/29/apple-acquires-israeli-startup-qai-.html

Apple buys Israeli startup Q.ai as the AI race heats up | TechCrunch https://techcrunch.com/2026/01/29/apple-buys-israeli-startup-q-ai-as-the-ai-race-heats-up/

World Models | Ankit Maloo https://ankitmaloo.com/world-models/

D4RT: Unified, Fast 4D Scene Reconstruction & Tracking | Google DeepMind https://deepmind.google/blog/d4rt-teaching-ai-to-see-the-world-in-four-dimensions/

Crazy results but the speed is what makes this incredible! https://x.com/Almorgand/status/2014615608545915168

🚀 DeepSeek-OCR 2 — introducing Visual Causal Flow from @deepseek_ai, learning to read documents the way humans do — now running on vLLM ⚡ with vllm==0.8.5 day-0 support. 🧠 Replaces fixed raster scanning with learned causal token reordering via DeepEncoder V2. 📄 16× visual… https://x.com/vllm_project/status/2016065526058090967

New DeepSeek-OCR-2 model! 1. Utilizes Qwen2 500M as a vision encoder instead of ViT 300M 2. Adds a causal mask alongside a non-causal mask 3. Accuracy boost of 3.73 points, to 91.09% from 87.36% 4. Edit distance 0.100 vs 0.129 for OCR v1. And we added DS-OCR-2 fine-tuning support in Unsloth! https://x.com/danielhanchen/status/2016043326760485313
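Given the day-0 vLLM support claimed above, an offline OCR pass would look roughly like the sketch below. The model id, prompt template, and `<image>` placeholder are assumptions to check against the model card, not confirmed details from these posts.

```python
# Hedged sketch: offline OCR with vLLM's multimodal API (vllm==0.8.5 per the post).
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-OCR-2", trust_remote_code=True)  # assumed id

image = Image.open("invoice.png").convert("RGB")
outputs = llm.generate(
    {
        "prompt": "<image>\nTranscribe this document to markdown.",  # assumed template
        "multi_modal_data": {"image": image},
    },
    SamplingParams(temperature=0.0, max_tokens=2048),
)
print(outputs[0].outputs[0].text)
```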

Excited to launch Agentic Vision in Gemini 3 Flash, a new capability that combines visual reasoning with code execution to ground answers in visual evidence. Activate `code_execution` and it will make use of it. – Delivers 5-10% quality boost across vision benchmarks. – Zooms,… https://x.com/_philschmid/status/2016225242394296773

With Agentic Vision, Gemini can better understand images by analyzing them in new and different ways: • Planning: Gemini thinks about your prompt and image and creates a multi-step plan to analyze it. • Zooming: when Gemini sees fine details in an image, it zooms in so that it… https://x.com/GeminiApp/status/2016914637523210684

Introducing Agentic Vision, a new capability in Gemini 3 Flash. Agentic Vision makes Gemini even better at analyzing complex images, enabling it to more accurately and consistently read fine details, like serial numbers or text on a complex diagram. See what it can do. 🧵 https://x.com/GeminiApp/status/2016914275886125483

Agentic Vision is rolling out now in the Gemini app when you select “Thinking” from the model drop-down. Learn more about Agentic Vision in Gemini 3 Flash: https://x.com/GeminiApp/status/2016914638861193321
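For API users, the activation step described above amounts to enabling the code-execution tool. A minimal sketch with the google-genai SDK follows; the model id is an assumption, and whether a given request actually triggers zooming or cropping is up to the model.

```python
# Hedged sketch: enabling code execution so Agentic Vision can crop/zoom/inspect.
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

with open("diagram.png", "rb") as f:
    image = types.Part.from_bytes(data=f.read(), mime_type="image/png")

response = client.models.generate_content(
    model="gemini-3-flash-preview",  # assumed id; check the current docs
    contents=[image, "Read the serial number printed on the label."],
    config=types.GenerateContentConfig(
        tools=[types.Tool(code_execution=types.ToolCodeExecution())],
    ),
)
print(response.text)
```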

Recursive Self-Aggregation (RSA) + Gemini 3 Flash scores 59.31% at only 1/10th the cost of Gemini Deep Think on the public ARC-AGI-2 evals. Insane. https://x.com/kimmonismus/status/2015717203362926643
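The RSA recipe behind that number is simple to state: keep a population of candidate solutions, repeatedly aggregate random subsets into improved candidates, and answer from the final population. A minimal sketch, with `llm` as a hypothetical text-in/text-out callable and illustrative population sizes:

```python
# Hedged sketch of Recursive Self-Aggregation; prompts and sizes are illustrative.
import random

def rsa(llm, question: str, n: int = 8, k: int = 4, rounds: int = 3) -> str:
    # Round 0: n independent candidate solutions.
    population = [llm(f"Solve:\n{question}") for _ in range(n)]
    for _ in range(rounds):
        new_population = []
        for _ in range(n):
            subset = random.sample(population, k)  # random k-subset to aggregate
            joined = "\n\n---\n\n".join(subset)
            new_population.append(llm(
                f"Question:\n{question}\n\nCandidate solutions:\n{joined}\n\n"
                "Combine their strengths, fix their errors, and write one "
                "improved solution."
            ))
        population = new_population
    # Final step: aggregate the last population into a single answer.
    joined = "\n\n---\n\n".join(population)
    return llm(f"Question:\n{question}\n\nSolutions:\n{joined}\n\n"
               "Return the single best final answer.")
```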

8 most illustrative VLA (Vision-Language-Action) models: ▪️ Gemini Robotics ▪️ π0 ▪️ SmolVLA ▪️ Helix ▪️ ChatVLA-2 (with MoE design) ▪️ ACoT-VLA (Action Chain-of-Thought) ▪️ VLA-0 ▪️ Rho-alpha (ρα) – the newest VLA+ model from Microsoft. Here you can explore what these models… https://x.com/TheTuringPost/status/2015016772043452834

finally paid for a Gemini Ultra sub, and tried it out for an unsponsored unsolicited review. it has obvious flaws but… it’s here! realtime playable video world model!! here’s "arid desert with little tiny human towns here and there and big cliffs and lots of terrain to walk…" https://x.com/swyx/status/2017111381456400603

AI Mode in Google Search and AI Overviews get Gemini upgrades https://blog.google/products-and-platforms/products/search/ai-mode-ai-overviews-updates/

The next evolution: VLA+ models Just yesterday @MSFTResearch released Rho-alpha (ρα) – their first robotics model, built on the Phi family. While most Vision-Language-Action (VLA) models stop at vision and language, Rho-alpha adds: ▪️ Tactile sensing to feel objects during… https://x.com/TheTuringPost/status/2014284149872644351

Kimi K2.5 Tech Blog: Visual Agentic Intelligence https://www.kimi.com/blog/kimi-k2-5.html#footnotes

Kimi K2.5 tech report just dropped! Quick hits: – Joint text-vision training: pretrained with 15T vision-text tokens, zero-vision SFT (text-only) to activate visual reasoning – Agent Swarm + PARL: dynamically orchestrated parallel sub-agents, up to 4.5× lower latency, 78.4% on… https://x.com/Kimi_Moonshot/status/2017249233775260021
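The latency claim is easiest to see structurally: an orchestrator that fans sub-tasks out to concurrent sub-agents pays roughly the cost of the slowest branch rather than the sum. The sketch below only illustrates that fan-out shape; it is not Moonshot's PARL implementation, and `run_subagent` is a hypothetical stand-in.

```python
# Illustrative fan-out only; not Moonshot's Agent Swarm / PARL code.
import asyncio

async def run_subagent(task: str) -> str:
    await asyncio.sleep(1.0)  # stand-in for one sub-agent's model/tool calls
    return f"result for {task!r}"

async def orchestrate(tasks: list[str]) -> list[str]:
    # Sequential execution would take ~len(tasks) seconds; gather takes ~1s.
    return list(await asyncio.gather(*(run_subagent(t) for t in tasks)))

print(asyncio.run(orchestrate(["search", "browse", "summarize"])))
```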

Kimi K2.5 having fully multimodal understanding including video was not on my bingo card. I love it! https://x.com/kimmonismus/status/2016120251717714273

🧠👀 @Kimi_Moonshot just shipped Kimi-K2.5 with multimodality. Behind this big step lies a deeper question: what kind of multimodal model actually matters? Zhihu contributor & Moonshot AI researcher Lechatelia: ✨K2.5 is not "just another VLM." I came from CV → VL → VLM, and… https://x.com/ZhihuFrontier/status/2016438778030850059

🚨BREAKING: Kimi K2.5 Thinking by @Kimi_Moonshot is the #1 open model for Vision Arena! Highlights: – #1 open model in Vision (+40pt over the next open model) – #6 overall (Qwen3-vl-235b-a22b-instruct is next open model at #18) This is the only open model in the Top 15. https://x.com/arena/status/2016984335380001268

Kimi K2.5 Technical Report: "early fusion with a lower vision ratio yields better results given a fixed total vision-text token budget" – "Visual RL Improves Text Performance" – "joint multimodal RL paradigm during Kimi K2.5’s post-training. Departing from conventional…" https://x.com/scaling01/status/2017255763400364049

Any guess why the Kimi team calls Kimi K2.5 ‘Native Multimodal’, and how it differs from Kimi-VL? In response to this question on HF, the Kimi team’s reply was: "It is an upgraded version compared to Kimi-VL, especially featuring video understanding. Will release more details…" https://x.com/thefirehacker/status/2016223118738764081

K2.5 technical report suggests that early fusion of vision tokens is best, but they start from the K2 checkpoint and then train for 15T more tokens. Did I miss something, or does this mean they’re still kind of doing late fusion anyway? https://x.com/andrew_n_carr/status/2017304411345981518

K2.5 is a V3 generation model, explicitly built on V3 architecture. It’s not frontier within Moonshot’s own portfolio. They just pushed continued training further than anyone. V4 is all but guaranteed to do vastly better. Its competition will come from K3, GLM-5. Next gen. https://x.com/teortaxesTex/status/2016956019239272717

LiDAR-free object detection and tracking 👀 https://x.com/bilawalsidhu/status/2016357045717414010

"WildRayZer: Self‑supervised Large View Synthesis in Dynamic Environments" TL;DR: self‑supervised NVS model that disentangles motion from static structure to render clean static novel views from dynamic video without 3D supervision. https://x.com/Almorgand/status/2014754835740958788

"Motion 3-to-4: 3D Motion Reconstruction for 4D Synthesis" TL;DR: feed-forward 4D mesh synthesis from a single monocular video (+ optional reference mesh) by predicting per-frame vertex trajectories for temporally coherent motion. https://x.com/Almorgand/status/2014391936178643447

"GR3EN: Generative Relighting for 3D Environments" TL;DR: generative relighting method that distills video-to-video relighting diffusion outputs into 3D scene relighting, enabling controllable lighting changes in large room-scale reconstructions. https://x.com/Almorgand/status/2016202951908274228

damn. spatial intelligence can be FAST asf! "D4RT can continuously understand what’s moving while running 18x-300x faster than previous methods – processing a 1-minute video in roughly 5 seconds on a single TPU chip." https://x.com/bilawalsidhu/status/2014490782506356998

Here’s how to turn the world around you into a story you can tell with Gen-4.5 Image to Video. https://x.com/runwayml/status/2017238025982427316

Runway Gen 4.5 has two new features. Motion Sketch and Character Swap are now in-built apps for the tool. Sketch camera motion with annotations on a start frame to control the movement. Swap a character with two images before the video step. Here’s how! https://x.com/jerrod_lew/status/2016816309762486423

Take inspiration from your world. Then turn it into a story you want to tell. With Gen-4.5 Image to Video. Simply take a photo, load it into Runway, then ask for what you want. A Day at the Museum. Generated with AI. Made by Áron. Full how to video coming soon. Get started at… https://x.com/runwayml/status/2016882344427147275

This is happening live. It’s a realtime AI video model. We are so cooked. https://x.com/bilawalsidhu/status/2015993354576634235

World models are going to take over in 2026. While video generation models are impressive dreamers, they aren’t world simulators. Generative video models can produce stunning clips, but they hallucinate pixel transitions based on statistical correlations. This leads to… https://x.com/dair_ai/status/2016881546909929775

"Human3R: Everyone Everywhere All at Once" TL;DR: unified feed-forward 4D reconstruction from monocular video: joint multi-person SMPL-X, scene geometry, and camera trajectories in real time (~15 FPS). https://x.com/Almorgand/status/2016546477569429544

Whenever I see demos about gaming world models, this is what I expect. Anything else is video generation, not gaming. https://x.com/sethkarten/status/2017322251385745570

Thrilled to share our new Grok Imagine release 🚀 It is the highest quality, fastest, and most cost-effective video generation model yet. Comes with 720P, video editing and better audio! We listened closely to your feedback and moved fast. Just six months ago, we had almost… https://x.com/EthanHe_42/status/2016749123198673099

Grok Imagine is also #1 in the Artificial Analysis Image to Video Leaderboard! https://x.com/ArtificialAnlys/status/2016749790907027726

LingBot-World from Ant Group: an open-source world simulator built on video generation with real-time interactivity. Maintains high fidelity across diverse environments with minute-level consistency and <1s latency at 16 FPS. https://x.com/HuggingPapers/status/2016787043028746284

fal is proud to partner with @xai as Grok Imagine’s day-0 platform partner xAI’s latest image & video gen + editing model ✨ Stunning photorealistic images/videos from text ⚡ Lightning-fast generation 🎥 Dynamic animations with precise control 🎨 Edit elements, styles & more https://x.com/fal/status/2016746472931283366

.@xai’s new Grok video generation model is so freaking good. And even more important: the price/performance ratio is next level. https://x.com/kimmonismus/status/2017252078272553396

🎉 Congrats @Alibaba_Qwen on the Qwen3-ASR release — vLLM has day-0 support. 52 languages, 2000x throughput on the 0.6B model, singing voice recognition, and SOTA accuracy on the 1.7B. Serve it now in vLLM! 🚀 Learn more: https://x.com/vllm_project/status/2016865238323515412

Qwen3-ASR and Qwen3-ForcedAligner are now open source — production-ready speech models designed for messy, real-world audio, with competitive performance and strong robustness. ● 52 languages & dialects with auto language ID (30 languages + 22 dialects/accents) ● Robust in… https://x.com/Alibaba_Qwen/status/2016858705917075645

Qwen3-ASR is out🚀 https://t.co/pVnuuNPMEL ✨ 0.6B & 1.7B – Apache2.0 ✨ 30 languages + 22 Chinese dialects, plus English accents across regions ✨ Single model for language ID + ASR (no extra pipeline stitching) ✨ Qwen3-ForcedAligner-0.6B, a strong forced aligner https://x.com/AdinaYakup/status/2016865634559152162

Qwen3-ASR is the first open-source LLM-based ASR in the industry with native streaming support. Demo: https://t.co/y2X1slCMcs vLLM Example: https://x.com/Alibaba_Qwen/status/2016900512478875991

Big thanks to vLLM for providing Day 0 support for Qwen3-ASR. https://x.com/Alibaba_Qwen/status/2016905051395260838
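Putting the serving claims together, a request against a local vLLM deployment would look roughly like this. It assumes you started the server with something like `vllm serve Qwen/Qwen3-ASR-1.7B` and that the model is exposed through vLLM's OpenAI-compatible transcription route; the model id and route are assumptions from these announcements, not verified details.

```python
# Hedged sketch: transcribing audio via vLLM's OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("meeting.wav", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        model="Qwen/Qwen3-ASR-1.7B",  # assumed model id
        file=audio,
    )
print(transcript.text)
```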

A generative world for general-purpose robotics & embodied AI learning. Genesis is a physics platform designed for general-purpose Robotics/Embodied AI/Physical AI applications. 📍GitHub: https://t.co/1WkYOD8Djm https://x.com/IlirAliu_/status/2015710305368605022

Foundation models are enough to solve robotics! Unfortunately, this is not true. We keep hearing that Vision-Language-Action (VLA) models struggle because of the gap between static training and the dynamic real world. A German startup (@SereactAI) just released a solution that… https://x.com/IlirAliu_/status/2016228327103574326

High-speed food packaging only makes sense if automation actually changes the economics. At Anı Bisküvi A.Ş., a robotic box-filling system from Robentex now runs two lines at a combined 800 products per minute. They moved to a tray-and-lid concept instead of classic display… https://x.com/IlirAliu_/status/2015863130341941749

Today we’re introducing Helix 02. Dancing robots are trivial; the hard part is intelligent control. This is our most powerful model to date – able to work across complex tasks & long time horizons. https://x.com/adcock_brett/status/2016207851891667395
