Ethan B. Holland

Over 54,900 manually organized AI links and counting

Multimodal: AI News Week Ending 05/09/2025

May 9, 2025

Image created with GPT Image 1. Image prompt: A nonbinary muse glides in a patchwork gown of kente, latex, and optical fiber, one eye augmented with glowing glass, the other lined in gold; surrounded by a chromatic blur of soundwaves and swirling binary, on a mirrored runway blending Harlem Renaissance jazz club and cyberpunk Tokyo, shot on tilt-shift lens for layered focus.

Sam Patterson https://sampatt.com/blog/2025-04-28-can-o3-beat-a-geoguessr-master

Introducing Meta Perception Language Model (PLM): an open & reproducible vision-language model tackling challenging visual tasks. Learn more about how PLM can help the open source community build more capable computer vision systems. Read the research paper, and download the https://x.com/AIatMeta/status/1920153975921521018

“Chatting” with LLM feels like using an 80s computer terminal. The GUI hasn’t been invented yet, but imo some properties of it can start to be predicted. 1 it will be visual (like GUIs of the past) because vision (pictures, charts, animations, not so much reading) is the 10-lane https://x.com/karpathy/status/1917920257257459899

o3 now cracks new Harvard Business School cases from the PDF, in one shot. I blurred the figures to not ruin the case, but I asked the AI to figure out financials, which incorporates data scattered throughout the case. More interesting, I asked it to compare to the case’s answer. https://x.com/emollick/status/1918355078253027802

🏆 With our new Parakeet model (parakeet-tdt-0.6b-v2), we have achieved a new standard for automatic speech recognition (ASR) with an 👀 industry-best 6.05% Word Error Rate on the @HuggingFace Open-ASR-Leaderboard. 🦜 Parakeet V2 takes performance to the next level with https://x.com/NVIDIAAIDev/status/1917976429939351944
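
For readers unfamiliar with the metric quoted above: Word Error Rate is conventionally the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the model's output, divided by the number of reference words. The sketch below is a minimal illustration under that standard definition. It is not the leaderboard's scoring code, it skips the text normalization a real benchmark would likely apply, and the function name and sample sentences are made up for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical sample: one dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Under this definition, a 6.05% WER corresponds to roughly six word errors per hundred reference words, averaged over the evaluation sets.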

o3: “I want you to make a map of the lighthouses of the great lakes. I want the map in “dark mode” but each lighthouse marker should be aesthetically sized so it covers the distance it can be seen on an average night and is the color of the light” Few rounds of feedback later… https://x.com/emollick/status/1918888777826676738

Amazon released Nova Premier, its most advanced AI model yet —Handles complex tasks and also acts as a “teacher” to fine-tune smaller models —Multimodal with 1M-token context window —Excels at orchestrating multi-agent workflows https://x.com/adcock_brett/status/1919060306879226349

Created an agent with Google ADK that reads the script and creates blog posts from YouTube videos. #aiagents #buildingagents @GeminiApp @Google @GoogleIndia https://x.com/fuzzzypan/status/1914660041782657330

Microsoft dropped three reasoning-focused Phi-4 models —Flagship Phi-4-reasoning (14B) beats OpenAI’s o1-mini —Smaller 3.8B param Phi-4-mini-reasoning matches 7B models on math —Suited for small devices like phones —Open-source with permissive licenses https://x.com/adcock_brett/status/1919060284078997565

How can smaller LLMs achieve strong reasoning? By combining data curation with supervised fine-tuning (SFT) and targeted reinforcement learning (RL). Microsoft released their first open reasoning/thinking models with Phi-4-reasoning distilled from OpenAI o3-mini. Implementation https://x.com/_philschmid/status/1918216082231320632

Gemini 2.5 (new) does a really nice job with “create a visually interesting shader that can run in twigl app make it like the ocean in a storm” (though it had an error that it corrected). It is the current leader. You can’t see the occasional flashes of lightning in this video. https://x.com/emollick/status/1919938304822124979

RAG is the number 1 use-case of LLMs in Enterprises, but so far primarily limited to text-only. How can we bridge the modality gap, and make it understand and use complex visual information like graphs & charts? In this talk @Nils_Reimers will outline how to build an https://x.com/jxnlco/status/1919830678524289263

The Phi-4-reasoning tech report is a real tour de force in both rigour and pragmatism. The main lessons for me are: > Most gains come from careful SFT, with RL the 🍒 on top > Filter the data for the most “teachable” prompts, ie not too easy for the model you want to tune. https://x.com/_lewtun/status/1917947747195298086

this has been one of my helicopter moments too (Sam Altman on Geoguessing): https://x.com/sama/status/1918741036702044645

You can finally do the Blade Runner Esper Machine, thanks to o3. “Zoom, enhance.” https://x.com/emollick/status/1919254629637849316

You can just do things. Ridiculous things. https://x.com/emollick/status/1918164116188909668

Microsoft releases Phi-4-reasoning – 14B param SFT of Phi-4 on demonstrations from o3-mini – Phi-4-reasoning-plus is RL-trained – outperforms DeepSeek-R1-Distill-Llama-70B model, approaches the performance levels of full DeepSeek-R1 model https://x.com/iScienceLuvr/status/1917742817914544355

We’ve been cooking… a new open weights 14B Phi-4 reasoning model, SFT’d on ~1.4M carefully curated reasoning demonstrations from o3-mini and RL’d for a tiny bit. This model is a little beast. https://x.com/DimitrisPapail/status/1917731614899028190

AMIE gains vision: A research AI agent for multimodal diagnostic dialogue https://research.google/blog/amie-gains-vision-a-research-ai-agent-for-multi-modal-diagnostic-dialogue/

We published a new sample app for building multimodal agents with the @OpenAIDevs Agents SDK! https://x.com/DKundel/status/1909993072039260255

just gave my chatbots a massive upgrade: they can now generate audio from text, modify images — you name it. Here’s how: The @Gradio team shipped MCP support. That means you can plug any AI app built with it into Claude or Cursor using the Model Context Protocol (MCP) — think of https://x.com/fdaudens/status/1917932360474960162

3/ @DKundel + team published a new sample app for building multimodal agents with the @OpenAIDevs Agents SDK 🤖 how to wrap your existing agents in a VoicePipeline 🎙️ capture/play audio in a React 🔌 send the audio between Python & your frontend https://x.com/AtomSilverman/status/1919066831190470933

Nvidia just open sourced Parakeet TDT 0.6B – the BEST Speech Recognition model on Open ASR Leaderboard 🔥 Can transcribe 60 minutes of audio in 1 second 🤯 600M parameters, with CC-BY-4.0 license (commercially permissive) Congrats Nvidia on the brilliant release and beating https://x.com/reach_vb/status/1919422953256587376

Introducing Meta Perception Encoder: a vision encoder setting new standards in image & video tasks. It excels in zero-shot classification & retrieval, surpassing existing models. Learn more about Meta Perception Encoder, read the research paper, and download the code and dataset https://x.com/AIatMeta/status/1919829024173654260

icymi we started a new multimodal AI community. Join, share what you build, or new models, anything you find interesting https://x.com/mervenoyann/status/1920472958071369742

v4.5 (Suno) just dropped for Pro & Premier subscribers 🔥 A wider range of genres, richer vocals, & enhanced prompt understanding for songs that match your vision. What’s New: 🙌 Expanded genres & smarter mashups: More genre options — Blends like midwest emo + neosoul or EDM + folk https://x.com/SunoMusic/status/1917979468699931113

Ming-Lite-Uni just dropped on Hugging Face: Advancements in Unified Architecture for Natural Multimodal Interaction https://x.com/_akhaliq/status/1919677117337395359

you can easily fine-tune, quantize, play with sota vision LM InternVL3 now 🔥 we have recently merged InternVL3 to @huggingface transformers and released converted checkpoints 🤗 find the model collection and a notebook to get started on the next one ⤵️ https://x.com/mervenoyann/status/1918340027219603683

o3 is the first model to actually play Magic the Gathering decently well, through a combination of quite good vision & the ability to zoom in on details. A few little weird choices, but the anticipation of the other player’s action & lines of action are good. Hits a wall eventually https://x.com/emollick/status/1919955250040959360

Introducing ERNIE X1 Turbo & ERNIE 4.5 Turbo! Building on the success of ERNIE X1 and 4.5, the upgraded ERNIE X1 Turbo and 4.5 Turbo deliver results faster and cheaper. Both models stand out for their multimodal capabilities, strong reasoning and low costs. For X1 Turbo, input https://x.com/Baidu_Inc/status/1915603080336597310

[2505.02583] Towards Cross-Modality Modeling for Time Series Analytics: A Survey in the LLM Era https://arxiv.org/abs/2505.02583

RADIO – a nvidia Collection https://huggingface.co/collections/nvidia/radio-669f77f1dd6b153f007dd1c6

Ace Studio dropped ACE-Step v1-3.5B, an ultra-fast, open-source music generation model It can generate 4 minutes of music in 20s (15× faster than LLMs) with support for several genres and structure control https://x.com/rowancheung/status/1920018927670685914

YAYYY! MSFT released Phi 4 Reasoning & Reasoning plus on Hugging Face🔥 Architecture: > Dense decoder-only Transformer > 14B params > 32k context (extendable to 64k) Training: > SFT + RL on 16B tokens (8.3B unique) > 32 H100-80G GPUs for 2.5 days Benchmarks: > AIME 2025: https://x.com/reach_vb/status/1917852036369916081
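
To put the training figures quoted above in perspective, here is a back-of-the-envelope calculation using only the numbers in that post (32 GPUs, 2.5 days, 16B tokens of which 8.3B are unique). Treating total tokens divided by unique tokens as an average number of passes over the data is an assumption made for this sketch, not something the post states, and the function names are illustrative.

```python
def gpu_hours(num_gpus: int, days: float) -> float:
    """Total accelerator time in GPU-hours."""
    return num_gpus * days * 24


def average_passes(total_tokens: float, unique_tokens: float) -> float:
    """Average number of times each unique token is seen (assumed reading)."""
    return total_tokens / unique_tokens


if __name__ == "__main__":
    hours = gpu_hours(num_gpus=32, days=2.5)
    passes = average_passes(total_tokens=16e9, unique_tokens=8.3e9)
    print(f"GPU-hours: {hours:.0f}")
    print(f"Average passes over unique data: {passes:.2f}")
```

That works out to 1,920 GPU-hours and a little under two passes over the unique data, which fits the posts above describing a fine-tune built mostly on SFT with a small amount of RL.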

Phi models are frustrating. I guess MSFT internal tests are also very impressive, but they lack some way to make sure it’s generally robust. https://x.com/teortaxesTex/status/1918389360439013535

Proof that human videos can be converted into context-aware humanoid locomotion. VideoMimic is a real-to-sim-to-real pipeline that reconstructs humans and scenes from video, retargets motion to a humanoid, and trains a single policy in sim immediately deployable on real robots. https://x.com/TheHumanoidHub/status/1920178537803493409

A real-time object detector much faster and more accurate than YOLO with Apache 2.0 license just landed in @huggingface transformers 🔥 D-FINE is the sota real-time object detector that runs on T4 (free Colab) 🤩 Keep reading for the paper explainer, notebooks & demo 👀 https://x.com/mervenoyann/status/1919431751689998348