Multimodal: AI News Week Ending 12/19/2025

Image created with gemini-2.5-flash-image with claude-sonnet-4-5. Image prompt: Photorealistic 35mm cinema shot of child aged 7 in cozy bedroom with warm peach lighting, viewed from side angle sitting on plush cream rug, surrounded by panoramic arc of TV screens each showing same content in different modalities (text transcription, audio waveform, image with captions, silent video), shallow depth of field, scattered newspapers and multimodal props including braille book, music box with sheet music, headphones on sketchbook, soft blue-white screen glow contrasting warm pastels, bold text MULTIMODALITY at top, tender intimate atmosphere, realistic fabric and skin texture

GROK JUST TURNED VOICE AI INTO A REAL PRODUCT, FAST, AND EVERYWHERE. xAI just opened Grok Voice to developers, and this isn’t some early experiment dressed up as a launch. It’s the same system already running inside millions of Teslas, now exposed through an API that actually https://x.com/MarioNawfal/status/2001472484869329288

Grok Voice Agent API | xAI https://x.ai/news/grok-voice-agent-api

Today, we’re excited to launch the Grok Voice Agent API, empowering developers to build voice agents that speak dozens of languages, call tools, and search realtime data. https://x.com/xai/status/2001385958147752255

Took less than an hour for Grok Voice Agent by @xai to be ported to Reachy Mini thanks to @atariorbit! https://x.com/ClementDelangue/status/2001410494528213481

Tinker is now generally available. We also added support for advanced vision input models, Kimi K2 Thinking, and a simpler way to sample from models. https://x.com/thinkymachines/status/1999543421631946888

Tinker: General Availability and Vision Input – Thinking Machines Lab https://thinkingmachines.ai/blog/tinker-general-availability/

Today we are releasing Tinker to everyone, and now with vision input! You can now finetune a frontier Qwen3-VL-235B on your own image+text data, bringing your own algorithm (sft, RL, something else?). We’ll take care of the GPU infra. Full update: https://x.com/rown/status/1999544121984245872

“Gemini 3, create a really novel and clever and funny Venn diagram. think hard. do not do research.” So close to coming together (I am not sure the center works for all three, illustrations are odd), but also better than I expected. https://x.com/emollick/status/2000805347590856822

“Gemini 3, please provide the rail/subway map for Middle Earth in the third age, with accurate stops and taking into account natural barriers, alliances, and so on.” Not bad. I do like the “service suspended – Balrog” note at Moria. https://x.com/emollick/status/1999930443001737700

Gemini can now illustrate a visual report https://blog.google/products-and-platforms/products/gemini/visual-reports/

Google Antigravity https://antigravity.google/

Google expands Gemini with NotebookLM integration https://www.testingcatalog.com/google-expands-gemini-with-notebooklm-integration/

Gemini 3.0 Flash is an absolutely fantastic release. Consider this: It costs a quarter (1/4) of what Gemini 3.0 Pro costs and achieves similar results to the Pro model in almost all benchmarks, such as HLE and ARC-AGI 2. In other benchmarks, it even outperforms the more https://x.com/kimmonismus/status/2001326181875154983

Introducing Gemini 3 Flash: Benchmarks, global availability https://blog.google/products-and-platforms/products/gemini/gemini-3-flash/

Starting today, Gemini can serve up local results in a rich, visual format. See photos, ratings, and real-world info from @GoogleMaps, right where you need them. https://x.com/GeminiApp/status/1999631529379791121

🔉 Introducing SAM Audio, the first unified model that isolates any sound from complex audio mixtures using text, visual, or span prompts. We’re sharing SAM Audio with the community, along with a perception encoder model, benchmarks and research papers, to empower others to https://x.com/AIatMeta/status/2000980784425931067

SAM Audio https://ai.meta.com/samaudio/

sam-audio – a facebook Collection https://huggingface.co/collections/facebook/sam-audio

Molmo 2 | Complex video question answering – YouTube https://www.youtube.com/watch?v=Ej3Hb3kRiac

Molmo 2 | Counting objects and actions – YouTube https://www.youtube.com/watch?v=fvYfPTTTZ_w

Molmo 2 | Video Tracking – YouTube https://www.youtube.com/watch?v=uot140v_h08

Molmo 2: State-of-the-art video understanding, pointing, and tracking | Ai2 https://allenai.org/blog/molmo2

GPT-5.2 exceeded a trillion tokens in the API on its first day of availability and is growing fast! https://x.com/sama/status/1999624463013544024

I have found GPT-5.2 Thinking to be a surprisingly deep second-opinion/fact checker. I gave it a dense paragraph with a few correct claims, a couple errors that required research to find, and some things that needed interpretation It found and gently corrected all the problems https://x.com/emollick/status/2000666007010971787

Introducing GPT-5.2-Codex | OpenAI https://openai.com/index/introducing-gpt-5-2-codex/

GPT-5.2 is here and it’s the best model out there for everyday professional work. On GDPval, the thinking model beats or ties human experts on 70.9% of common professional tasks like spreadsheets, presentations, and document creation. It’s also better at general intelligence, https://x.com/fidjissimo/status/1999183159356006450

Today I ran two complex tasks through Codex with GPT 5.2 Extra High The first ran for 2 hours 30 minutes The second ran for 1 hours 45 minutes Both resulted in: – all acceptance criteria resolved – all test coverage complete – zero broken or non-working code Amazing https://x.com/nummanali/status/2000228337030152347

Whoa. This new GDPval score is a very big deal. Probably the most economically relevant measure of AI ability suggesting that in head-to-head competition with human experts on tasks that require 4-8 hours for a human to do, GPT-5.2 wins 71% of the time as judged by other humans https://x.com/emollick/status/1999189828756263359

🎨 Qwen-Image-Layered is LIVE — native image decomposition, fully open-sourced! ✨ Why it stands out ✅ Photoshop-grade layering Physically isolated RGBA layers with true native editability ✅ Prompt-controlled structure Explicitly specify 3-10 layers — from coarse layouts to https://x.com/Alibaba_Qwen/status/2002034611229229388

🚨 Qwen Image Layered is live on fal! ✨ Photoshop-grade layering – Native Decomposition 👑 Physically isolated RGBA layers with true native editability 🎨 Explicitly specify layers, from coarse layouts to fine-grained details https://x.com/fal/status/2002055913390195137

Runway unveiled three GWM-1 models that generate video frame-by-frame so scenes stay consistent as the camera moves and can react instantly to user inputs. GWM Worlds makes navigable scenes, GWM Robotics simulates robot viewpoints for planning/data, and GWM Avatars creates https://x.com/DeepLearningAI/status/2001834874487861352

WorldPlay Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling https://x.com/_akhaliq/status/2001286164469227555

“MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos” TL;DR: 3 learnable modules+lightweight IK stage: a Reference Prompt Encoder that distills per-joint queries from the asset’s skeleton, mesh, and rendered image set; (1/4) https://x.com/Almorgand/status/1999530563607122271

Introducing Real-time Transcription with Speakers! – Step change in accuracy, surpassing top cloud APIs – Faster than real-time on Mac and iPhone – Still under 3 watts when all features are enabled Available in Argmax SDK 2.0 for early access! Benchmarks and details in comments. https://x.com/argmax/status/2001296557556040028

Dolphin-v2 🐬 new document parsing model released by @ByteDanceOSS ✨ 3B – MIT license ✨ Works on any document: PDFs, scans, photos ✨ Understands 21 types of content: text, tables, code, formulas, figures & more ✨ Pixel-level precision via absolute coordinate prediction https://x.com/AdinaYakup/status/1999462500551786692

Gemini 3 Pro continues to be SOTA at multimodal understanding and generation : ) cc @bcaine for the great example https://x.com/OfficialLoganK/status/1999270402712023158

Gemini 3 Pro playing Pokémon vs 2.5 Pro (we used to all be impressed by 2.5 Pro) https://x.com/OfficialLoganK/status/2000728193599226187

Google Translate gets new Gemini AI translation models https://blog.google/products-and-platforms/products/search/gemini-capabilities-translation-upgrades/

🚨BREAKING: Leaderboard updates for Text, Vision & WebDev Gemini-3-Flash by @GoogleDeepMind is now ranked top 5 across Text, Vision, and WebDev, making it the most cost-efficient frontier model (input $0.5 and output $3/MTokens). Gemini-3-Flash highlights: 🔹 Top 5 across Text, https://x.com/arena/status/2001322123730788698

🗣️ “Help me build an app…” That’s all it takes. Watch Gemini 3 Flash turn a single voice prompt into a functional prototype in the @GeminiApp. https://x.com/Google/status/2002123256854425918

Congrats to the Gemini team on the great release and exceptional SWE-bench Verified numbers! 76.2% (3 Pro) vs. 78% (3 Flash), +6 task instances – a whole lot in the realm of the last quarter of SWE-bench. mini-SWE-agent + Gemini 3 Flash coming soon! https://x.com/jyangballin/status/2001336879120363639

Gemini 3 Flash across different test-time compute levels (green line below) represents a new score/cost Pareto frontier on ARC-AGI-2. Congrats to @demishassabis and @sundarpichai on the launch! https://x.com/fchollet/status/2001330643423449409

Gemini 3 Flash is out ⚡️- and we built a CLI agent powered by this latest model to perform work over your filesystem 🤖 Basically all the file capabilities within Claude Code in a lighter form factor. Shoutout to @itsclelia for the launch demo, check it out! Repo: https://x.com/jerryjliu0/status/2001335494534402521

“how can flash beat pro??” -> the answer is RL! flash is not just a distilled pro. we’ve had lots of exciting research progress on agentic RL which made its way into flash but was too late for pro. can’t wait to finally bring them to pro👀 https://x.com/ankesh_anand/status/2002017859443233017

Introducing Gemini 3 Flash ⚡️Performance close to Gemini 3 Pro, with great multimodal and tool use quality ⚡️3x faster than Gemini 2.5 Pro, while cheaper and better at most benchmarks ⚡️LMArena score of 1477 (top 3 model) The time to build is now (and yes, there’s a free tier) https://x.com/osanseviero/status/2001323721232163053

Introducing Gemini 3 Flash, our frontier intelligence model, available at scale for everyone. It excels at coding, tool calling, and is stronger than 2.5 Pro across most metrics!! ⚡️ Available in the API at $0.50 in / 1M tokens and $3.00 out / 1M tokens across. https://x.com/OfficialLoganK/status/2001322275656835348
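
For a rough sense of what those per-token prices mean per request, here is a minimal sketch in Python. The two prices come from the announcement above; the function name and the example workload (10,000 input tokens, 2,000 output tokens) are hypothetical, and the sketch ignores any free tier, caching discounts, or other pricing rules that may apply.

```python
# Prices quoted in the announcement above (USD per 1M tokens).
INPUT_USD_PER_MTOK = 0.50
OUTPUT_USD_PER_MTOK = 3.00


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request from its token counts (hypothetical helper)."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


# Hypothetical workload: 10,000 input tokens and 2,000 output tokens.
cost = request_cost_usd(10_000, 2_000)
print(f"${cost:.4f} per request")
```

Output tokens cost six times as much as input tokens at these prices, so in this example the 2,000 output tokens account for a bit more than half of the total.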

Introducing Gemini 3 Flash! ⚡️⚡️⚡️ Frontier intelligence built for speed at a fraction of the cost. Here’s ~4 minutes of demos. https://x.com/addyosmani/status/2001324727504359745

Speed test: Gemini 3 Flash vs. Gemini 2.5 Pro ⏱️ We put our new Gemini 3 Flash model (left) up against Gemini 2.5 Pro (right) in @GoogleAIStudio, so you can watch the difference in near real-time. Watch them go head-to-head ↓ https://x.com/Google/status/2001397324551946523

Study with help from Gemini 3 Flash. Upload an audio recording of yourself explaining a difficult concept and Gemini will identify knowledge gaps, create a custom quiz, and provide instant assessments and explanations for each question. https://x.com/GeminiApp/status/2001351746338329063

Today, we’re releasing an updated Gemini 2.5 Flash Native Audio model. Now available via the Live API 🗣 https://x.com/googleaidevs/status/1999539531826036973

Watch Gemini 3 Flash vs Gemini 3 Pro playing Pokemon Crystal : ) https://x.com/OfficialLoganK/status/2001428651121025391

We’re back in a Flash ⚡ Gemini 3 Flash is our latest model with frontier intelligence built for lightning speed, and pushing the Pareto Frontier of performance and efficiency. It outperforms 2.5 Pro while being 3x faster at a fraction of the cost. With this release, Gemini 3’s https://x.com/sundarpichai/status/2001326061787942957

We’re expanding the Gemini 3 family with the launch of Gemini 3 Flash. This model: — Combines Gemini 3’s Pro-grade reasoning with Flash-level latency, efficiency, and cost — Delivers frontier-level performance on PHD-level reasoning and knowledge benchmarks — Is our most https://x.com/googleai/status/2001323069105692914

we’re going live at 11:30am PT with the team for a deep dive on gemini 3 flash hosted by @OfficialLoganK, @joshwoodward, @tulseedoshi and more post your questions below ⬇️ https://x.com/GoogleAIStudio/status/2001330099841556490

We’ve pushed out the Pareto frontier of efficiency vs. intelligence again. With Gemini 3 Flash ⚡️, we are seeing reasoning capabilities previously reserved for our largest models, now running at Flash-level latency. This opens up entirely new categories of near real-time https://x.com/JeffDean/status/2001323132821569749

With Gemini 3 Flash, you can quickly build fun, useful apps from scratch using your voice without any prior coding knowledge. Just dictate to Gemini on the go, and it can transform your unstructured thoughts into a functioning app in minutes. https://x.com/GeminiApp/status/2001760080518353261

Realtime speech to speech translation powered by Gemini, available in Google Translate now, coming to developers early next year : ) https://x.com/OfficialLoganK/status/1999994009452962073

🚀 The GeoAI QGIS Plugin is here 🔥 You can run Moondream vision-language models, object detection, image segmentation (SAM 3), and even train your own geospatial segmentation model end-to-end. Website: https://x.com/giswqs/status/1999536028282179721

Meta just released sam-audio https://x.com/_akhaliq/status/2001000836017844296

Mistral OCR 3 sets new benchmarks in both accuracy and efficiency, outperforming enterprise document processing solutions as well as AI-native OCR. https://x.com/MistralAI/status/2001669583296712970

Very happy to announce the release of our latest Mistral OCR, which significantly outperforms existing solutions! A lot of effort was done to improve handwritten content, low quality scans, and complex tables & forms commonly found in enterprise documents. https://x.com/GuillaumeLample/status/2001719413649617404

Introducing Mistral OCR 3 | Mistral AI https://mistral.ai/news/mistral-ocr-3

Introducing Mistral OCR 3, a new frontier in document intelligence! 🧵👇 https://x.com/MistralAI/status/2001669581275033741

A nice lateral thinking addition to the Sparks unicorn. Ask Opus 4.5 to make a TikZ unicorn, and it not only draws the unicorn in TikZ, but then compiles it in LaTeX, turns that into a PDF, turns the PDF into a PNG and then gives me the PNG image. Also cute little hearts & stars https://x.com/emollick/status/1999716324520464582

great text separation, even when some of it is in the front and some in the back https://x.com/linoy_tsaban/status/2002073701941121174

Most vision systems don’t fail in the lab. They fail after deployment… the real world❗️ Latency creeps in. Calibration drifts. Someone adds a PC. Someone adds a cloud dependency… Suddenly, the system that WORKED at demo day becomes fragile in the field. That’s why this new https://x.com/IlirAliu_/status/1999481112905531585

Snappy: Vision-Grounded PDF Search, Powered by Qdrant Snappy is an open-source project that brings vision-first, multimodal retrieval to PDFs and scanned documents, especially useful for real-world files where layout, tables, diagrams, and visuals matter as much as text. Made by https://x.com/qdrant_engine/status/2001170495987966132

Introducing Wan2.6 – A native multimodal model that turns your ideas into breathtaking videos and images! · Starring: Cast characters from reference videos into new scenes. Support human or human-like figures, enabling complex multi-person and human-object interactions with https://x.com/Alibaba_Wan/status/2000930078037827972?s=20

Last year Molmo set SOTA on image benchmarks + pioneered image pointing. Millions of downloads later, Molmo 2 brings Molmo’s grounded multimodal capabilities to video 🎥–and leads many open models on challenging industry video benchmarks. 🧵 https://x.com/allen_ai/status/2000962068774588536

Molmo 2 sets new sota in image and video tasks in open models 🔥 > comes in 3 sizes, based on SigLIP2 + Qwen3 > separate 4B model for video pointing/counting (sota!) > 💗 Apache 2.0 licensed 💗 > image + video datasets are out as well! https://x.com/mervenoyann/status/2000965892230815756

Multimodal serving pain: vision encoder work can stall text prefill/decode and make tail latency jittery. We built Encoder Disaggregation (EPD) in vLLM: run the encoder as a separate scalable service, pipeline it with prefill/decode, and reuse image embeddings via caching. This https://x.com/vllm_project/status/2000535421642502335
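
The "reuse image embeddings via caching" idea in the post above can be illustrated with a minimal, generic sketch. This is not vLLM's API or implementation: the `EmbeddingCache` class and `toy_encoder` function are hypothetical stand-ins, and the only assumption is that identical image bytes can share one encoder result.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """Content-addressed cache: identical image bytes are encoded only once."""

    def __init__(self, encode: Callable[[bytes], List[float]]) -> None:
        self._encode = encode
        self._store: Dict[str, List[float]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, image_bytes: bytes) -> List[float]:
        # Key on a hash of the content so the same image hits the cache
        # no matter which request it arrives in.
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        embedding = self._encode(image_bytes)
        self._store[key] = embedding
        return embedding


def toy_encoder(image_bytes: bytes) -> List[float]:
    # Hypothetical stand-in for a real vision encoder.
    return [float(len(image_bytes)), float(sum(image_bytes) % 256)]


cache = EmbeddingCache(toy_encoder)
cache.get(b"img-a")  # miss: encoder runs
cache.get(b"img-a")  # hit: embedding reused
cache.get(b"img-b")  # miss: different content
print(cache.hits, cache.misses)
```

A production cache would also need eviction and a size bound; the sketch leaves those out to keep the reuse logic visible.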

TurboDiffusion Accelerating Video Diffusion Models by 100-205 Times https://x.com/_akhaliq/status/2001342606450774299

Vision Bridge Transformer (ViBT) – a large-scale model based on Brownian Bridge Models for conditional generation. It’s a new kind of model that learns direct data-to-data trajectories for fast, high-quality image/video editing and stylization. We scale ViBT to 1.3B and 20B https://x.com/TheTuringPost/status/2000313966648844447

Even if it is just an X algorithm issue on my end, I find it surprising that I’m not seeing many long-context impressions of GPT-5.2. I’ve been using it consistently for long-context work since the initial release, and in my use cases it’s been delivering results I prefer over https://x.com/Hangsiin/status/2002015892654502158

Even without the ability to do new things like output polished files, GPT-5.2 feels like the biggest upgrade we’ve had in a long time. Curious to hear what you think! https://x.com/sama/status/1999185220680012207

Finally had time to test GPT-5.2 Pro. On my tasks Extended Thinking is a VERY significant improvement over 5.1 Pro – feels roughly on the order of o1 Pro -> o3 Pro jump. https://x.com/MParakhin/status/2000079349706539442

gpt 5.2 has been amazing for my daily work it’s sharper and more dependable on the hard stuff, things that would’ve sounded crazy two years ago and yeah, i’m genuinely convinced this tech is going to change the world. once this kind of help is normal, ppl are going to move way https://x.com/slow_developer/status/2001178044535316571

GPT-5.2 Is Frontier Only For The Frontier https://thezvi.substack.com/p/gpt-52-is-frontier-only-for-the-frontier

GPT-5.2 is here! Available today in ChatGPT and the API. It is the smartest generally-available model in the world, and in particular is good at doing real-world knowledge work tasks. https://x.com/sama/status/1999184337460428962

GPT-5.2 Pro for mathematical research: https://x.com/gdb/status/2000687002799194246

GPT-5.2-Codex is more cyber-capable than GPT-5.1-Codex-Max, and we expect future models to continue on this trajectory. This helps strengthen cybersecurity at scale by giving defenders more powerful tools, but also raises new dual-use risks that require careful deployment. https://x.com/OpenAIDevs/status/2001723693496775167

Had early access to GPT-5.2. Its an impressive model. Here is GPT 5.2 Pro’s version of “create a visually interesting shader that can run in twigl-dot-app make it like an infinite city of neo-gothic towers partially drowned in a stormy ocean with large waves,” single shot. https://x.com/emollick/status/1999185085719887978

Looks like @OpenAI has added an even MORE powerful version of Pro mode… you can now ask GPT-5.2 Pro to think even longer than before. Starting to test this… I have high expectations here. https://x.com/mattshumer_/status/1999905708238880895

OK, I think GPT 5.2 Pro is actually a step change in usefulness for my applications (algebraic geometry/number theory research). https://x.com/littmath/status/2000636724574302478

GPT-5.2 xhigh reasoning scores 89.3 on the Extended NYT Connections benchmark, compared with 77.9 for GPT-5.2 high reasoning. GPT-5.2 Pro scores lower (86.7) but above GPT-5 Pro (83.9). https://x.com/LechMazur/status/1999582591905583256

Ok GPT-5.2 is *much* stronger at proof-writing. It notices BS previous models wrote immediately (I like to test this between model iterations to see if they notice what I notice). It also has better sense for what problems seem more tractable, and makes further progress. https://x.com/AcerFur/status/1999314476320063546

Real user feedback matters in model evaluation. ✨GPT-5.2 Instant, meant for everyday work, is #1 on @yupp_ai’s Text Leaderboard while GPT-5.2 (High) is #1 on our SVG Leaderboard. @openai’s strategy of releasing model variants suited to the task looks sound. Congrats @openai! 🎉 https://x.com/lintool/status/2000368978708119958

Yeah it’s over AI explained specified that this GPT-5.2 result was with reasoning effort xhigh aka 100k tokens spent thinking https://x.com/scaling01/status/1999535536130662576

ITS LIVE photoshop-grade layering physically isolated RGBA layers with native editability 🤯 https://x.com/linoy_tsaban/status/2002038877511377393

Robots learning from human videos used to be a hard research problem. It turns out scale changes that. A new result from @physical_int shows an emergent property of large VLAs like π0.5. As pre training scales, the model naturally aligns human egocentric video and robot data https://x.com/IlirAliu_/status/2001216734850646410

“Efficiently Reconstructing Dynamic Scenes One 🎯 D4RT at a Time” TL;DR: self-attention encoder transforms the input video into the latent Global Scene Representation; decoder can query 3D position P of any given 2D point (u, v) from the source timestep at target timestep 1/2 https://x.com/Almorgand/status/1999138551972221358

Great paper on why RL actually works for LLM reasoning. Apparently, “aha moments” during training aren’t random. They’re markers of something deeper. Researchers analyzed RL training dynamics across eight models, including Qwen, LLaMA, and vision-language models. The findings https://x.com/omarsar0/status/1999483394963701911

🚨 New Open Models @Zai_org’s GLM-4.6V and GLM-4.6V-Flash are now available in the Arena. The latest open source releases adds native function calling, larger context windows, and improved coding and reasoning, marking the next step in the GLM vision model lineup. Try them https://x.com/arena/status/2000610761371267350