“🚨 Gemini 2.5 Pro Exp dropped and it’s now #1 across SEAL leaderboards: 🥇 Humanity’s Last Exam 🥇 VISTA (multimodal) 🥇 (tie) Tool Use 🥇 (tie) MultiChallenge (multi-turn) 🥉 (tie) Enigma (puzzles) Congrats to @demishassabis @sundarpichai & team! 🔗 https://x.com/alexandr_wang/status/1904590438469951873
“Gemini 2.5 Pro Experimental on Livebench 🤯🥇 https://x.com/OfficialLoganK/status/1904925675892728179
“Introducing Gemini 2.5 Pro Experimental! 🎉 Our newest Gemini model has stellar performance across math and science benchmarks. It’s an incredible model for coding and complex reasoning, and it’s #1 on the @lmarena_ai leaderboard by a drastic 40 ELO margin. Only a handful of https://x.com/OriolVinyalsML/status/1904583691566727361
“Gemini 2.5 Pro is an awesome state-of-the-art model, no.1 on LMArena by a whopping +39 ELO points, with significant improvements across the board in multimodal reasoning, coding & STEM. You can try it out now in AI Studio https://x.com/demishassabis/status/1904587103805006218
“Today we are launching 2.5 Pro! I think it’s the best model in the world. State-of-the-art reasoning and great vibes (+39 ELO gap on lmsys!) 2.5 Pro improves in coding, stem, multimodal, instruction following, and lots more. Available in AI Studio & the Gemini App! https://x.com/jack_w_rae/status/1904583894458110218
“Introducing Gemini 2.5 Pro, the world’s most powerful model, with unified reasoning capabilities + all the things you love about Gemini (long context, tools, etc) Available as experimental and for free right now in Google AI Studio + API, with pricing coming very soon! https://x.com/OfficialLoganK/status/1904580368432586975
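A minimal sketch of trying the experimental model through the API, assuming the Python `google-generativeai` SDK; the model ID `gemini-2.5-pro-exp-03-25` is the launch-time experimental name and may change once priced versions ship:

```python
# Minimal sketch: call the experimental Gemini 2.5 Pro model via the
# google-generativeai SDK (pip install google-generativeai).
# NOTE: the model ID below is the launch-time experimental name (assumption);
# check AI Studio for the current identifier.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # key from https://aistudio.google.com

model = genai.GenerativeModel("gemini-2.5-pro-exp-03-25")
response = model.generate_content(
    "Explain, step by step, why the sum of two odd integers is always even."
)
print(response.text)
```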
Gemini 2.5: Our newest Gemini model with thinking https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/
“Gemini 2.5 now does a much better job than most AIs writing plots & characters, thanks in part to its ability to “think” through details in advance. A fun test: “play a game of Fiasco (think a Coen Brothers movie, but as an RPG), make it interesting in terms of plot and prose” https://x.com/emollick/status/1904656593083396541
Google is rolling out Gemini’s real-time AI video features | The Verge https://www.theverge.com/news/634480/google-gemini-live-video-screen-sharing-astra-features-rolling-out
“Deep Dive Video: Complex image editing used to take hours — now Google’s Gemini 2.0 turns advanced ComfyUI & Photoshop workflows into simple text prompts. Here’s exactly how to try it (completely free). Chapters: 00:00 Conversational Editing with Google’s Multimodal AI 00:53 https://x.com/bilawalsidhu/status/1903172149764034605
“Google’s new Gemini 2.5 Pro Experimental takes the #1 position across a range of our evaluations that we have run independently Gemini 2.5 Pro is a reasoning model, it ‘thinks’ before answering questions. Google has released it as an experimental API in AI Studio only, and has https://x.com/ArtificialAnlys/status/1904923020604641471
“BREAKING: Gemini 2.5 Pro is now #1 on the Arena leaderboard – the largest score jump ever (+40 pts vs Grok-3/GPT-4.5)! 🏆 Tested under codename “nebula”🌌, Gemini 2.5 Pro ranked #1🥇 across ALL categories and UNIQUELY #1 in Math, Creative Writing, Instruction Following, Longer https://x.com/lmarena_ai/status/1904581128746656099
AI diagnoses major cancer with near perfect accuracy | Charles Darwin University https://www.cdu.edu.au/news/ai-diagnoses-major-cancer-near-perfect-accuracy
“Sure, you could use an annotation tool to create bounding boxes of objects in images for you… or you can ask a multimodal AI to do it freehand. https://x.com/emollick/status/1904028116063822141
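A hedged sketch of the freehand-annotation idea above: prompt a multimodal model for boxes directly instead of drawing them. The model ID and the exact JSON schema are assumptions; Gemini's documented convention returns `[ymin, xmin, ymax, xmax]` scaled to 0-1000, and outputs should still be validated before use.

```python
# Sketch: ask a multimodal model for bounding boxes as JSON instead of hand-annotating.
# Assumptions: google-generativeai SDK, launch-time experimental model ID, and the
# documented Gemini box convention [ymin, xmin, ymax, xmax] normalized to 0-1000.
import json
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.5-pro-exp-03-25")  # assumed model ID

image = Image.open("street_scene.jpg")  # any local image (hypothetical filename)
prompt = (
    "Return bounding boxes for every car in this image as a JSON list of objects "
    'like {"label": str, "box_2d": [ymin, xmin, ymax, xmax]} with coordinates scaled 0-1000.'
)
response = model.generate_content(
    [prompt, image],
    generation_config={"response_mime_type": "application/json"},  # request raw JSON back
)
for box in json.loads(response.text):
    print(box["label"], box["box_2d"])
```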
“This is a more significant paper than people seem to realize. You can give the AI a novel picture of a location and it can, with reasonable accuracy, tell you where it was taken even if it hasn’t “seen” that picture before. This is a finding with a lot of real-world implications” https://x.com/emollick/status/1903135115334594871
“do not sleep on RF-DETR 🔥 it’s a sota, real-time and most importantly, open-source object detector released by @roboflow @skalskip92 my vibe tests on very noisy images and videos have passed 👏 currently being integrated to transformers 🤗 https://x.com/mervenoyann/status/1905318925325173000
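A rough usage sketch, assuming the `rfdetr` Python package exposes an `RFDETRBase` class whose `predict()` takes an image and confidence threshold and returns supervision-style detections, as in Roboflow's published examples; check the repo for the current API before relying on this.

```python
# Hypothetical sketch of running RF-DETR on a local image.
# Assumptions: pip install rfdetr pillow, pretrained COCO weights are downloaded
# automatically, and predict() accepts a PIL image plus a threshold.
from PIL import Image
from rfdetr import RFDETRBase

model = RFDETRBase()                    # pretrained real-time detector
image = Image.open("noisy_frame.jpg")   # any local image (hypothetical filename)
detections = model.predict(image, threshold=0.5)
print(detections)                       # boxes, class ids, confidences
```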
“✨ Excited to share QVQ-Max, our visual reasoning model that’s still evolving We’ve been experimenting with this approach for a while – try it out on Qwen Chat! (https://t.co/FmQ0B9tiE7) 🚀 Just upload any image or video, ask away, and hit the “Thinking” button to see how it https://x.com/Alibaba_Qwen/status/1905342260100956210
“Introducing Together Chat! Use DeepSeek R1 (hosted in North America) & other top open source models to do web search, coding, image generation, & image analysis. Available today for free! https://x.com/togethercompute/status/1904204860217500123
“Since o1, and especially since DeepSeek-R1, improved “reasoning” has basically become the standard for new LLMs. Speaking of which, Gemini 2.5 Pro just came out as the latest reasoning model offering, which ends up at the top of most benchmarks (notably Humanity’s Last Exam). https://x.com/rasbt/status/1904940955192418555
“2.5 Pro will come to Advanced users in the @GeminiApp. 💬 Simply select it in the model dropdown on desktop and mobile. It will also soon be available on @GoogleCloud’s #VertexAI platform. Developers and enterprises can start experimenting now in @Google AI Studio → https://x.com/GoogleDeepMind/status/1904581166755123463
“Gemini models, both 2.5 Pro and 2.0 Flash, have the fastest output speed compared to leading models https://x.com/ArtificialAnlys/status/1904923026820653435
“Gemini 2.5 Pro is live in the WebDev Arena! You can even see Gemini’s progress against its past. ⚔️ ✏️ “Animate layered triangles or polygons rotating in opposite directions with changing hues, creating a mesmerizing kaleidoscope effect”. https://x.com/lmarena_ai/status/1905002138172112971
“Gemini 2.5 Pro is a very good model, seems like a real step forward, in both metrics and practical use. I think because it is labelled 2.5 and was sort of quietly rolled out, people may miss how big a jump it is, but discussions are making me think others are feeling similarly.” https://x.com/emollick/status/1904756949284983200
“Introducing Gemini 2.5 Pro Experimental. The 2.5 series marks a significant evolution: Gemini models are now fundamentally thinking models. This means the model reasons before responding, to maximize accuracy — and it’s our best Gemini model yet. Blog -” https://x.com/NoamShazeer/status/1904581813215125787
“Built on our state-of-the-art Gemma models, TxGemma can understand and predict the properties of small molecules, chemicals, proteins and more. This could help scientists identify promising targets faster, predict clinical trial outcomes, and reduce overall costs. https://x.com/GoogleDeepMind/status/1905274926665208043
“Think you know Gemini? 🤔 Think again. Meet Gemini 2.5: our most intelligent model 💡 The first release is Pro Experimental, which is state-of-the-art across many benchmarks – meaning it can handle complex problems and give more accurate responses. Try it now → https://x.com/GoogleDeepMind/status/1904579660740256022
“Gemini 2.5 is a very impressive model so far. For example, it is only the third LLM (after Grok 3 and o3-mini-high) to be able to pull off “create a visually interesting shader that can run in twigl-dot-app make it like the ocean in a storm,” and I think the best so far. https://x.com/emollick/status/1904700257822540076
“Gemini 2.5 Pro is the most skilled model in the world, please let us know if you have any issues using it! ♊️♊️♊️” https://x.com/zacharynado/status/1904641052096754156
“Gemini 2.5 Pro Released and it’s beating every other model across the board🤯 💡Includes reasoning with Chain of Thought 💻Top 1 on Aider code editing leaderboard 🆓The model is free for everyone to use https://x.com/casper_hansen_/status/1904590489128440163
“Today we’re releasing an experimental version of Gemini 2.5 Pro. 💡2.5 Pro shows strong reasoning and improved code capabilities, with state-of-the-art performance across a range of benchmarks. 📈It’s topped @lmarena_ai’s leaderboard by a huge margin https://x.com/Google/status/1904581629017735261
“Google is winning by so much. We have always been the best in AI. 🔥 Gemini 2.5 pro is the best model in the world 👇” https://x.com/YiTayML/status/1904598794278494272
“Google JUST dropped Gemini 2.5 Pro their “most intelligent AI model yet”. This new “thinking model” tops LMArena by a significant margin and crushes DeepSeek-R1, Grok 3 and Claude 3.7 in math, science and coding benchmarks. Ships with 1M token context (2M coming), can process https://x.com/bilawalsidhu/status/1904579366232695128
“This is cool. Qwen is the solid leader on open source multimodality. https://x.com/teortaxesTex/status/1904950082480279943
“is your vision LM in prod even safe? 👀 ShieldGemma 2 is the first ever safety model for multimodal vision LMs in production by @GoogleDeepMind, came with Gemma 3 🔥 I saw confusion around how to use it, so I put together a notebook and a demo, find it in the next one ⤵️ https://x.com/mervenoyann/status/1904870106594750508
“Gemma 3 Technical Report 💎 We introduce Gemma3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters This version introduces vision understanding abilities, a wider coverage of languages and longer context – at https://x.com/reach_vb/status/1905315180235338116
“wait a sec. look at the content — did y’all actually go this route? This looks way too plausible, and honestly the most practical approach on multimodal gen rn (based on my own experience with students). So, not pure AR, but an LLM + a diffusion “renderer” on the compressed” https://x.com/sainingxie/status/1904643929724645453
Map Features in OpenStreetMap with Computer Vision https://blog.mozilla.ai/map-features-in-openstreetmap-with-computer-vision/
“LETS GOO! @kyutai_labs just released MoshiVis – an end-to-end low-latency Vision Speech Model, CC-BY license 🔥 > Only adds 206M parameters via lightweight cross-attention (CA) modules to integrate visual inputs from a frozen PaliGemma2-3B-448 vision encoder > Uses a learnable https://x.com/reach_vb/status/1903126742954377445
“Yes, vision LLMs are pretty good geo-guessers (I had wondered about this), easily beating humans. Bigger models are more accurate, and data leakage (having seen the exact picture in training) does not seem to be a big problem. The models are better at urban & more developed areas https://x.com/emollick/status/1902946590597239038
“Cross-attention is a fundamental idea that is heavily used by multi-modal LLMs. Let’s learn how it works from the ground up… The original transformer architecture has two components: an encoder and a decoder. The encoder and decoder contain repeated blocks of: 1. https://x.com/cwolferesearch/status/1904890395466883567
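A minimal PyTorch sketch of the idea in that thread (not the author's own code): queries come from one stream, such as a text decoder's hidden states, while keys and values come from another, such as image-encoder features, which is how cross-attention lets a multimodal LLM condition text generation on visual inputs.

```python
# Minimal cross-attention sketch: query = text tokens, key/value = image features.
# Shapes and dimensions are illustrative assumptions, not from any specific model.
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, text_states: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # Each text position attends over all image positions.
        out, _ = self.attn(query=text_states, key=image_feats, value=image_feats)
        return out

text = torch.randn(2, 16, 512)    # (batch, text_tokens, d_model)
image = torch.randn(2, 196, 512)  # (batch, image_patches, d_model)
print(CrossAttention(512, 8)(text, image).shape)  # torch.Size([2, 16, 512])
```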




