Image created with GPT Image 1. Image prompt: high-contrast monochrome portrait silhouette on cream backdrop, Low-Life monochrome palette, minimalist graphic design inspired by New Order’s ‘Low-Life’, metaphor for sensor fusion prisms refracting data, flat color, subtle texture, 1980s Saville typography style
Amazon launched Nova Sonic, a real-time speech-to-speech model with bidirectional streaming, tool calling, and RAG support, handling real-time, interactive conversations and delivering low-latency, expressive voice output at top-tier price-performance. https://x.com/rohanpaul_ai/status/1920972570595127640
Gemini 2.5 Pro (05-06) is SOTA at most video understanding tasks (by a large margin) 📽️. Lots of work by the Gemini multimodal team to make this happen, excited to see developers push this capability in new ways. More details below! https://x.com/OfficialLoganK/status/1920863634374172853
Advancing the frontier of video understanding with Gemini 2.5 – Google Developers Blog https://developers.googleblog.com/en/gemini-2-5-video-understanding/
BTW, Gemini one-shotted these chapter summaries with amazing accuracy. I just pointed it at the YouTube video. First time I’ve seen a model do this. https://x.com/HamelHusain/status/1922119981526880515
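For a concrete sense of how this is used, here is a minimal sketch of video Q&A through the google-genai Python SDK, assuming access to a Gemini 2.5 Pro preview model; the model id, video URL, and prompt are placeholders, not part of the announcement.

```python
# Hypothetical sketch: ask Gemini 2.5 Pro about a YouTube video with the
# google-genai SDK. Model id, URL, and prompt are placeholders.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-2.5-pro-preview-05-06",  # assumed preview model id
    contents=[
        types.Part(file_data=types.FileData(
            file_uri="https://www.youtube.com/watch?v=VIDEO_ID")),
        types.Part(text="Generate chapter titles with timestamps and a "
                        "one-sentence summary for each chapter."),
    ],
)
print(response.text)
```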
Is this... AGI? 😮 Meet any-to-any models on @huggingface: models that take in and output multiple modalities (e.g. a model that takes image + text input and responds with speech!). We’ve shipped a beginner-friendly doc on everything you need to know (linked in the thread). https://x.com/mervenoyann/status/1923053505704493311
NEW: up to 8x faster Whisper transcription on just a single L4, powered by @vllm_project 💥 You can now deploy blazingly fast Whisper endpoints directly via HF Endpoints, all for under $0.80/hour. Enjoy! 🤗 https://x.com/reach_vb/status/1922324889593102584
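A minimal sketch of calling such a deployed endpoint with huggingface_hub’s InferenceClient, assuming the endpoint exposes the standard automatic-speech-recognition task API (the vLLM-backed deployment may instead expose an OpenAI-compatible transcription route); the endpoint URL, token, and audio file are placeholders.

```python
# Hypothetical sketch: transcribe audio against a dedicated Whisper
# Inference Endpoint. Endpoint URL, token, and file are placeholders.
from huggingface_hub import InferenceClient

client = InferenceClient(
    model="https://YOUR-ENDPOINT.endpoints.huggingface.cloud",  # assumed endpoint URL
    token="hf_xxx",
)

# Accepts a local path, raw bytes, or a URL to an audio file
result = client.automatic_speech_recognition("meeting_recording.flac")
print(result.text)
```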
Meta AI dropped Meta Perception Language Model, an open & reproducible vision-language AI for challenging visual tasks. It can watch videos and extract details like what a person is doing in the content and how they are doing it. https://x.com/rowancheung/status/1920384499583459776
Microsoft announced X-REASONER: Towards Generalizable Reasoning Across Modalities and Domains https://x.com/_akhaliq/status/1920752791405863000
New SOTA open-source depth estimation: Marigold IID 🌼 It produces normal maps and depth maps of scenes & faces, plus albedo (true color) and BRDF (texture) maps of scenes; they even release a depth-to-3D-printer-format demo 😮 All models and demos are linked in the thread. https://x.com/mervenoyann/status/1923318140965990814
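As an illustration of the Marigold family’s ergonomics, here is a minimal depth-estimation sketch with diffusers’ MarigoldDepthPipeline; the checkpoint id and input image are assumptions, and the newer IID (albedo/BRDF) checkpoints may require their own pipelines or the demos linked in the thread.

```python
# Hypothetical sketch: monocular depth with diffusers' Marigold pipeline.
# Checkpoint id and input image are assumptions for illustration.
import torch
from diffusers import MarigoldDepthPipeline
from diffusers.utils import load_image

pipe = MarigoldDepthPipeline.from_pretrained(
    "prs-eth/marigold-depth-lcm-v1-0", torch_dtype=torch.float16
).to("cuda")

image = load_image("scene.jpg")
depth = pipe(image)

# Colorized visualization of the predicted depth map
vis = pipe.image_processor.visualize_depth(depth.prediction)
vis[0].save("scene_depth.png")
```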
VLMs 2025 UPDATE 🔥 We just shipped a blog covering everything new in vision language models, including 🤖 GUI agents, agentic VLMs, omni models 📑 multimodal RAG ⏯️ video LMs 🤏🏻 smol models ...and more! Find it in the thread. https://x.com/mervenoyann/status/1921962750353301986
UC Berkeley researchers announced VideoMimic, a real-to-sim-to-real pipeline that trains robots from phone videos. It mines videos, reconstructs the humans and the environment, and produces policies for humanoids, enabling skills like climbing stairs. https://x.com/adcock_brett/status/1921597176028733566
Could AI translate animal sounds into words? Tech experts hope so | Science, Climate & Tech News | Sky News https://news.sky.com/story/could-ai-translate-animal-sounds-into-words-tech-experts-hope-so-13363743
8x faster/cheaper @openai Whisper API thanks to Hugging Face Inference Endpoints & @vllm_project! https://x.com/ClementDelangue/status/1922383289408491629
Just launched: 8x faster Whisper transcription endpoints on @huggingface 🗣️ Powered by @vllm_project and optimized for NVIDIA GPUs. Same accuracy, way better performance! https://x.com/freddy_alfonso_/status/1922313983006056607
Blazingly fast whisper transcriptions with Inference Endpoints https://x.com/_akhaliq/status/1922315470478139537
ByteDance just dropped Seed1.5-VL on Hugging Face. It achieves top performance with a relatively modest architecture (a 532M vision encoder and a 20B-active-parameter MoE LLM), delivering state-of-the-art results on 38 of 60 public VLM benchmarks and demonstrating broad competence. https://x.com/_akhaliq/status/1922318117385932993
Google pushed an update to its Gemini 2.0 Flash image generation model. The release promises improved quality of generations with better text rendering and fewer content restrictions. https://x.com/rowancheung/status/1920384567162060980
I don’t think we’ve fully appreciated how wild natively multimodal image generation is with GPT-4o and Gemini. This was one prompt. It used to be a whole ComfyUI workflow with a variable hit rate; now it just works. https://x.com/bilawalsidhu/status/1920277002935755135
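For reference, a one-prompt generation like this can be reproduced with the OpenAI Images API and gpt-image-1; the prompt and size below are illustrative, not the author’s.

```python
# Hypothetical sketch: single-prompt image generation with gpt-image-1.
# Prompt and size are illustrative; expects OPENAI_API_KEY in the env.
import base64
from openai import OpenAI

client = OpenAI()

result = client.images.generate(
    model="gpt-image-1",
    prompt="Flat minimalist poster, cream background, monochrome silhouette, 1980s typography",
    size="1024x1024",
)

with open("poster.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```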
llama.cpp has vision language model support now! ❤️🔥 Get started with SOTA VLMs (Gemma 3, Qwen2.5-VL, InternVL3 & more) and serve them wherever you want 🤩 https://x.com/mervenoyann/status/1921471242852331719
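A minimal sketch of querying a locally served VLM through llama.cpp’s OpenAI-compatible chat endpoint; the launch command, model repo, port, and image path are assumptions about a typical setup, not prescribed by the announcement.

```python
# Hypothetical sketch: chat with a vision model served by llama.cpp.
# Assumes something like `llama-server -hf ggml-org/gemma-3-4b-it-GGUF`
# is running on localhost:8080; the image path is a placeholder.
import base64
import requests

with open("chart.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```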
The latest models (Gemini 2.5 Pro, GPT-4.1) are cracked at document parsing and traditional OCR is dead. They’re not 100% accurate though – they still struggle on hard data. For any job where you’re relying on LLMs/LVMs for automation, you need to have the UX for human review https://x.com/jerryjliu0/status/1921621794265665749
Making complex text understandable: Minimally-lossy text simplification with Gemini https://research.google/blog/making-complex-text-understandable-minimally-lossy-text-simplification-with-gemini/
Gemma just passed 150 million downloads and over 70k variants on Hugging Face 🚀🚀🚀 What would you like to see in the next Gemma versions? https://x.com/osanseviero/status/1921636582873800746
Video Understanding! 📽️ Gemini 2.5 Pro (05-06) is changing how we will work with videos! You can now share video recordings of what the model should change in your code, or process up to 6 hours of video in a single request (at lower resolution). 😮 https://x.com/_philschmid/status/1921838835735867533
OpenVision, a fully open vision encoder family, offering 25+ models (5.9M–632M params) that outperform or match OpenAI’s CLIP and Google’s SigLIP on 9+ multimodal benchmarks. This matters because it’s completely open (training data, code, and weights included), unlike CLIP/SigLIP. https://x.com/rohanpaul_ai/status/1920974917866057913
Vision Language Models (Better, faster, stronger) https://huggingface.co/blog/vlms-2025
Salesforce introduces BLIP3-o: A Family of Fully Open Unified Multimodal Models (Architecture, Training and Dataset). “We introduce a novel approach that employs a diffusion transformer to generate semantically rich CLIP image features, in contrast to conventional VAE-based …” https://x.com/iScienceLuvr/status/1922843713514193076
Trying out llama.cpp’s new vision support https://simonwillison.net/2025/May/10/llama-cpp-vision/
Multimodal on-device! Llama.cpp does vision now https://x.com/fdaudens/status/1921211454453088620
Helium 1: a modular and multilingual LLM https://kyutai.org/2025/04/30/helium.html
Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning https://x.com/_akhaliq/status/1922326980680138925
Vision-Language-Action framework from AGIBot. https://x.com/teortaxesTex/status/1921774079834529862
Seed1.5-VL Technical Report: “Seed1.5-VL is composed with a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks …” https://x.com/iScienceLuvr/status/1922226964599095740
Notion: BIG one today. Introducing AI Meeting Notes (never take notes again), Enterprise Search (find answers across all your tools), Research Mode (auto-draft polished docs), a model picker (chat with GPT-4.1 & Claude 3.7 directly), and all-in-one pricing (AI now included on the Business plan). https://x.com/NotionHQ/status/1922318308893708557
Salesforce just dropped BLIP3-o on Hugging Face: A Family of Fully Open Unified Multimodal Models (Architecture, Training and Dataset). https://x.com/_akhaliq/status/1923001183804764391
🚀 Introducing HunyuanCustom: an open-source, multimodal-driven architecture for customized video generation, powered by HunyuanVideo-13B. Outperforming existing open-source models, it rivals top closed-source solutions! 🎥 Highlights: ✅ Subject Consistency: maintains identity … https://x.com/TencentHunyuan/status/1920679422379913330
GitHub 👨‍🔧: Scalable Multi-modal RAG → ingests diverse unstructured data (PDFs, video, text) with intelligent parsing and automatic chunking/embedding → implements advanced Retrieval-Augmented Generation (RAG) using multi-modal embeddings (ColPali) and integrated knowledge … https://x.com/rohanpaul_ai/status/1922276643520811308
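To make the retrieval step concrete, here is a minimal, generic sketch: page embeddings are scored against a query embedding and the top pages are handed to a VLM as context. The vectors below are toy random stand-ins for a multimodal embedder such as ColPali, and plain cosine similarity stands in for ColPali’s late-interaction scoring.

```python
# Generic sketch of multimodal RAG retrieval. Toy random vectors stand in
# for real page/query embeddings from a multimodal embedder (e.g. ColPali,
# which actually scores with late interaction rather than plain cosine).
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def retrieve(query_vec: np.ndarray, page_vecs: dict, k: int = 3) -> list:
    """Return ids of the k pages most similar to the query."""
    scored = sorted(page_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [page_id for page_id, _ in scored[:k]]

# Toy index: one vector per PDF page (stand-ins for real page embeddings)
rng = np.random.default_rng(0)
pages = {f"report.pdf#page={i}": rng.normal(size=128) for i in range(1, 6)}
query = rng.normal(size=128)

print(retrieve(query, pages))  # ids of pages to pass to the VLM as context
```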
X-REASONER: Towards Generalizable Reasoning Across Modalities and Domains. “General-domain text-based post-training can enable such strong generalizable reasoning.” “We introduce X-REASONER, a vision-language model post-trained solely on general-domain text for generalizable …” https://x.com/iScienceLuvr/status/1920435270824178089
[2505.09568v1] BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset https://arxiv.org/abs/2505.09568v1
From zero to hero on all things Vision Language Models: from multimodal to reasoning to MoEs to benchmarks AND more 🔥 One definitive blog post to put you up to speed on all things VLMs. Enjoy! 🤗 https://x.com/reach_vb/status/1921974792242016591
Releasing the OpenAI to Z Challenge: using o3/o4-mini and GPT-4.1 models to discover previously unknown archaeological sites. https://x.com/gdb/status/1923105670464782516
Announcing the OpenAI to Z Challenge: use OpenAI o3, o4-mini, or GPT-4.1 to find previously unknown archaeological sites in the Amazon. Use #OpenAItoZ to share your progress. https://x.com/OpenAIDevs/status/1923062948060168542
OpenAI to Z Challenge | OpenAI https://openai.com/openai-to-z-challenge/
our new system trains humanoid robots using data from cell phone videos, enabling skills such as climbing stairs and sitting on chairs in a single policy (w/ @redstone_hong @junyi42 @davidrmcall) https://x.com/arthurallshire/status/1920187086860116339
Mass General Brigham’s researchers introduced FaceAge, an AI tool that can estimate cancer survival outcomes with facial photos The AI estimates biological age from photos, helping teams guide their treatment levels accordingly https://x.com/rowancheung/status/1922201339318206495
Did you know your face can reveal your biological age? @MGBResearchNews has developed FaceAge, an #AI algorithm that predicts biological age and survival outcomes for patients with cancer using a single photo. Patients with cancer appeared five years older than their actual age. https://x.com/MassGenBrigham/status/1920607240865698080
Current radiology report models lack expert-like structured reasoning. They fail to link visual findings to precise anatomical locations, hindering clinical trust. BoxMed-RL addresses this with a two-phase framework that first instills radiologist-like thinking. https://x.com/rohanpaul_ai/status/1921511349978632479