Image created with GPT Image 1. Image prompt: high-contrast monochrome portrait silhouette on cream backdrop, Low-Life monochrome palette, minimalist graphic design inspired by New Order’s ‘Low-Life’, metaphor for sensor fusion prisms refracting data, flat color, subtle texture, 1980s Saville typography style
Amazon launched Nova Sonic, a real-time speech-to-speech model with bidirectional streaming, tool calling, and RAG support, handling real-time, interactive conversations and delivering low-latency, expressive voice output at top-tier price-performance. https://x.com/rohanpaul_ai/status/1920972570595127640
Gemini 2.5 Pro (05-06) is SOTA at most video understanding tasks (by a large margin) 📽️. Lots of work by the Gemini multimodal team to make this happen, excited to see developers push this capability in new ways. More details below! https://x.com/OfficialLoganK/status/1920863634374172853
Advancing the frontier of video understanding with Gemini 2.5 – Google Developers Blog https://developers.googleblog.com/en/gemini-2-5-video-understanding/
BTW, Gemini one-shotted these chapter summaries with amazing accuracy. I just pointed it at the YouTube video. First time I’ve seen a model do this. https://x.com/HamelHusain/status/1922119981526880515
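For a concrete sense of how this is used, here is a minimal sketch of video Q&A through the google-genai Python SDK, assuming access to a Gemini 2.5 Pro preview model; the model id, video URL, and prompt are placeholders, not part of the announcement.

```python
# Hypothetical sketch: ask Gemini 2.5 Pro about a YouTube video with the
# google-genai SDK. Model id, URL, and prompt are placeholders.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-2.5-pro-preview-05-06",  # assumed preview model id
    contents=[
        types.Part(file_data=types.FileData(
            file_uri="https://www.youtube.com/watch?v=VIDEO_ID")),
        types.Part(text="Generate chapter titles with timestamps and a "
                        "one-sentence summary for each chapter."),
    ],
)
print(response.text)
```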
Is this... AGI? 😮 Meet any-to-any models on @huggingface: models that take in and output multiple modalities (e.g. a model that takes image + text input and responds with speech!). We’ve shipped a beginner-friendly doc on everything you need to know (linked in the thread). https://x.com/mervenoyann/status/1923053505704493311
NEW: up to 8x faster Whisper transcription on just a single L4, powered by @vllm_project 💥 You can now deploy blazingly fast Whisper endpoints directly via HF Endpoints, all for under $0.80/hour. Enjoy! 🤗 https://x.com/reach_vb/status/1922324889593102584
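A minimal sketch of calling such a deployed endpoint with huggingface_hub’s InferenceClient, assuming the endpoint exposes the standard automatic-speech-recognition task API (the vLLM-backed deployment may instead expose an OpenAI-compatible transcription route); the endpoint URL, token, and audio file are placeholders.

```python
# Hypothetical sketch: transcribe audio against a dedicated Whisper
# Inference Endpoint. Endpoint URL, token, and file are placeholders.
from huggingface_hub import InferenceClient

client = InferenceClient(
    model="https://YOUR-ENDPOINT.endpoints.huggingface.cloud",  # assumed endpoint URL
    token="hf_xxx",
)

# Accepts a local path, raw bytes, or a URL to an audio file
result = client.automatic_speech_recognition("meeting_recording.flac")
print(result.text)
```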
Meta AI dropped Meta Perception Language Model, an open & reproducible vision-language AI for challenging visual tasks. It can watch videos and extract details like what a person is doing in the content and how they are doing it. https://x.com/rowancheung/status/1920384499583459776
Microsoft announced X-REASONER: Towards Generalizable Reasoning Across Modalities and Domains https://x.com/_akhaliq/status/1920752791405863000
New SOTA open-source depth estimation: Marigold IID 🌼 It produces normal maps and depth maps of scenes & faces, plus albedo (true color) and BRDF (texture) maps of scenes; they even release a depth-to-3D-printer-format demo 😮 All models and demos are linked in the thread. https://x.com/mervenoyann/status/1923318140965990814
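As an illustration of the Marigold family’s ergonomics, here is a minimal depth-estimation sketch with diffusers’ MarigoldDepthPipeline; the checkpoint id and input image are assumptions, and the newer IID (albedo/BRDF) checkpoints may require their own pipelines or the demos linked in the thread.

```python
# Hypothetical sketch: monocular depth with diffusers' Marigold pipeline.
# Checkpoint id and input image are assumptions for illustration.
import torch
from diffusers import MarigoldDepthPipeline
from diffusers.utils import load_image

pipe = MarigoldDepthPipeline.from_pretrained(
    "prs-eth/marigold-depth-lcm-v1-0", torch_dtype=torch.float16
).to("cuda")

image = load_image("scene.jpg")
depth = pipe(image)

# Colorized visualization of the predicted depth map
vis = pipe.image_processor.visualize_depth(depth.prediction)
vis[0].save("scene_depth.png")
```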
VLMs 2025 UPDATE 🔥 We just shipped a blog covering everything new in vision language models, including 🤖 GUI agents, agentic VLMs, omni models 📑 multimodal RAG ⏯️ video LMs 🤏🏻 smol models ...and more! Find it in the thread. https://x.com/mervenoyann/status/1921962750353301986
UC Berkeley researchers announced VideoMimic, a real-to-sim-to-real pipeline that trains robots from phone videos. It mines videos, reconstructs the humans and the environment, and produces policies for humanoids, enabling skills like climbing stairs. https://x.com/adcock_brett/status/1921597176028733566
Could AI translate animal sounds into words? Tech experts hope so | Science, Climate & Tech News | Sky News https://news.sky.com/story/could-ai-translate-animal-sounds-into-words-tech-experts-hope-so-13363743
8x faster/cheaper @openai Whisper API thanks to Hugging Face Inference Endpoints & @vllm_project! https://x.com/ClementDelangue/status/1922383289408491629
Just launched: 8x faster Whisper transcription endpoints on @huggingface 🗣️ Powered by @vllm_project and optimized for NVIDIA GPUs. Same accuracy, way better performance! https://x.com/freddy_alfonso_/status/1922313983006056607
Blazingly fast whisper transcriptions with Inference Endpoints https://x.com/_akhaliq/status/1922315470478139537
ByteDance just dropped Seed1.5-VL on Hugging Face. It achieves top performance with a relatively modest architecture (a 532M vision encoder and a 20B-active-parameter MoE LLM), delivering state-of-the-art results on 38 of 60 public VLM benchmarks and demonstrating broad competence. https://x.com/_akhaliq/status/1922318117385932993
Google pushed an update to its Gemini 2.0 Flash image generation model. The release promises improved quality of generations with better text rendering and fewer content restrictions. https://x.com/rowancheung/status/1920384567162060980
I don’t think we’ve fully appreciated how wild natively multimodal image generation is with GPT-4o and Gemini. This was one prompt. It used to be a whole ComfyUI workflow with a variable hit rate; now it just works. https://x.com/bilawalsidhu/status/1920277002935755135
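For reference, a one-prompt generation like this can be reproduced with the OpenAI Images API and gpt-image-1; the prompt and size below are illustrative, not the author’s.

```python
# Hypothetical sketch: single-prompt image generation with gpt-image-1.
# Prompt and size are illustrative; expects OPENAI_API_KEY in the env.
import base64
from openai import OpenAI

client = OpenAI()

result = client.images.generate(
    model="gpt-image-1",
    prompt="Flat minimalist poster, cream background, monochrome silhouette, 1980s typography",
    size="1024x1024",
)

with open("poster.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```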
llama.cpp has vision language model support now! ❤️🔥 Get started with SOTA VLMs (Gemma 3, Qwen2.5-VL, InternVL3 & more) and serve them wherever you want 🤩 https://x.com/mervenoyann/status/1921471242852331719
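A minimal sketch of querying a locally served VLM through llama.cpp’s OpenAI-compatible chat endpoint; the launch command, model repo, port, and image path are assumptions about a typical setup, not prescribed by the announcement.

```python
# Hypothetical sketch: chat with a vision model served by llama.cpp.
# Assumes something like `llama-server -hf ggml-org/gemma-3-4b-it-GGUF`
# is running on localhost:8080; the image path is a placeholder.
import base64
import requests

with open("chart.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```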
The latest models (Gemini 2.5 Pro, GPT-4.1) are cracked at document parsing and traditional OCR is dead. They’re not 100% accurate though – they still struggle on hard data. For any job where you’re relying on LLMs/LVMs for automation, you need to have the UX for human review https://x.com/jerryjliu0/status/1921621794265665749
Making complex text understandable: Minimally-lossy text simplification with Gemini https://research.google/blog/making-complex-text-understandable-minimally-lossy-text-simplification-with-gemini/
Gemma just passed 150 million downloads and over 70k variants on Hugging Face 🚀🚀🚀 What would you like to see in the next Gemma versions? https://x.com/osanseviero/status/1921636582873800746
Video Understanding! 📽️ Gemini 2.5 Pro (05-06) is changing how we will work with videos! You can now share video recordings of what the model should change in your code, or process up to 6 hours of video in a single request (at lower resolution). 😮 https://x.com/_philschmid/status/1921838835735867533
OpenVision, a fully open vision encoder family, offering 25+ models (5.9M–632M params) that outperform or match OpenAI’s CLIP and Google’s SigLIP on 9+ multimodal benchmarks. This matters because it’s completely open (training data, code, and weights included), unlike CLIP/SigLIP. https://x.com/rohanpaul_ai/status/1920974917866057913
Vision Language Models (Better, faster, stronger) https://huggingface.co/blog/vlms-2025
Salesforce introduces BLIP3-o: A Family of Fully Open Unified Multimodal Models (Architecture, Training and Dataset). “We introduce a novel approach that employs a diffusion transformer to generate semantically rich CLIP image features, in contrast to conventional VAE-based …” https://x.com/iScienceLuvr/status/1922843713514193076
Trying out llama.cpp’s new vision support https://simonwillison.net/2025/May/10/llama-cpp-vision/
Multimodal on-device! Llama.cpp does vision now https://x.com/fdaudens/status/1921211454453088620
Helium 1: a modular and multilingual LLM https://kyutai.org/2025/04/30/helium.html
Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning https://x.com/_akhaliq/status/1922326980680138925
Vision-Language-Action framework from AGIBot. https://x.com/teortaxesTex/status/1921774079834529862
Seed1.5-VL Technical Report: “Seed1.5-VL is composed with a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks …” https://x.com/iScienceLuvr/status/1922226964599095740
Notion: BIG one today. Introducing AI Meeting Notes (never take notes again), Enterprise Search (find answers across all your tools), Research Mode (auto-draft polished docs), a model picker (chat with GPT-4.1 & Claude 3.7 directly), and all-in-one pricing (AI now included on the Business plan). https://x.com/NotionHQ/status/1922318308893708557
Salesforce just dropped BLIP3-o on Hugging Face: A Family of Fully Open Unified Multimodal Models (Architecture, Training and Dataset). https://x.com/_akhaliq/status/1923001183804764391
🚀 Introducing HunyuanCustom: an open-source, multimodal-driven architecture for customized video generation, powered by HunyuanVideo-13B. Outperforming existing open-source models, it rivals top closed-source solutions! 🎥 Highlights: ✅ Subject Consistency: maintains identity … https://x.com/TencentHunyuan/status/1920679422379913330
GitHub 👨‍🔧: Scalable Multi-modal RAG → ingests diverse unstructured data (PDFs, video, text) with intelligent parsing and automatic chunking/embedding → implements advanced Retrieval-Augmented Generation (RAG) using multi-modal embeddings (ColPali) and integrated knowledge … https://x.com/rohanpaul_ai/status/1922276643520811308
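To make the retrieval step concrete, here is a minimal, generic sketch: page embeddings are scored against a query embedding and the top pages are handed to a VLM as context. The vectors below are toy random stand-ins for a multimodal embedder such as ColPali, and plain cosine similarity stands in for ColPali’s late-interaction scoring.

```python
# Generic sketch of multimodal RAG retrieval. Toy random vectors stand in
# for real page/query embeddings from a multimodal embedder (e.g. ColPali,
# which actually scores with late interaction rather than plain cosine).
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def retrieve(query_vec: np.ndarray, page_vecs: dict, k: int = 3) -> list:
    """Return ids of the k pages most similar to the query."""
    scored = sorted(page_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [page_id for page_id, _ in scored[:k]]

# Toy index: one vector per PDF page (stand-ins for real page embeddings)
rng = np.random.default_rng(0)
pages = {f"report.pdf#page={i}": rng.normal(size=128) for i in range(1, 6)}
query = rng.normal(size=128)

print(retrieve(query, pages))  # ids of pages to pass to the VLM as context
```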
X-REASONER: Towards Generalizable Reasoning Across Modalities and Domains. “General-domain text-based post-training can enable such strong generalizable reasoning.” “We introduce X-REASONER, a vision-language model post-trained solely on general-domain text for generalizable …” https://x.com/iScienceLuvr/status/1920435270824178089
[2505.09568v1] BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset https://arxiv.org/abs/2505.09568v1
From zero to hero on all things Vision Language Models: from multimodal to reasoning to MoEs to benchmarks AND more 🔥 One definitive blog post to put you up to speed on all things VLMs. Enjoy! 🤗 https://x.com/reach_vb/status/1921974792242016591
Releasing the OpenAI to Z Challenge: using o3/o4-mini and GPT-4.1 models to discover previously unknown archaeological sites. https://x.com/gdb/status/1923105670464782516
Announcing the OpenAI to Z Challenge: use OpenAI o3, o4-mini, or GPT-4.1 to find previously unknown archaeological sites in the Amazon. Use #OpenAItoZ to share your progress. https://x.com/OpenAIDevs/status/1923062948060168542
OpenAI to Z Challenge | OpenAI https://openai.com/openai-to-z-challenge/
our new system trains humanoid robots using data from cell phone videos, enabling skills such as climbing stairs and sitting on chairs in a single policy (w/ @redstone_hong @junyi42 @davidrmcall) https://x.com/arthurallshire/status/1920187086860116339
Mass General Brigham’s researchers introduced FaceAge, an AI tool that can estimate cancer survival outcomes with facial photos The AI estimates biological age from photos, helping teams guide their treatment levels accordingly https://x.com/rowancheung/status/1922201339318206495
Did you know your face can reveal your biological age? @MGBResearchNews has developed FaceAge, an #AI algorithm that predicts biological age and survival outcomes for patients with cancer using a single photo. Patients with cancer appeared five years older than their actual age. https://x.com/MassGenBrigham/status/1920607240865698080
Current radiology report models lack expert-like structured reasoning. They fail to link visual findings to precise anatomical locations, hindering clinical trust. BoxMed-RL addresses this with a two-phase framework that first instills radiologist-like thinking. https://x.com/rohanpaul_ai/status/1921511349978632479