Image created with OpenAI GPT-Image-1. Image prompt: vintage Sly & the Family Stone album-cover style, warm pastel picnic scene beside lake, candid family vibe featuring collage of text, image & audio icons overlapping; grainy retro print texture, vibrant 60s funk color palette, high-resolution
H Company released Holo-1: 3B and 7B GUI-action Vision Language Models for a range of web and computer agent tasks 🤗 Holo-1 is Apache 2.0 licensed and has @huggingface transformers support 🔥 more details in their blog post (next ⤵️) https://x.com/mervenoyann/status/1929896423765500358
Introducing Conversational AI 2.0 Build voice agents with: • New state-of-the-art turn-taking model • Language switching • Multicharacter mode • Multimodality • Batch calls • Built-in RAG Now fully enterprise-ready with HIPAA compliance, EU data residency, and robust … https://x.com/elevenlabsio/status/1928527751956308004
Introducing Eleven v3 (alpha) – the most expressive Text-to-Speech model ever. Supporting 70+ languages, multi-speaker dialogue, and audio tags such as [excited], [sighs], [laughing], and [whispers]. Now in public alpha and 80% off in June. https://x.com/elevenlabsio/status/1930689774278570003
New native audio capabilities in Gemini 2.5 enable text-to-speech in over 24 languages. 🔊Voices are more natural and expressive, and you can seamlessly switch between languages. https://x.com/i/web/status/1929960513779204198
Introducing Mirage Studio. Powered by our proprietary omni-modal foundation model. Generate expressive videos at scale, with actors that actually look and feel alive. Our actors laugh, flinch, sing, rap — all of course, per your direction. Just upload an audio, describe the … https://x.com/getcaptionsapp/status/1929554635544461727
Meta aims to fully automate advertising with AI by 2026, WSJ reports | Reuters https://www.reuters.com/business/media-telecom/meta-aims-fully-automate-advertising-with-ai-by-2026-wsj-reports-2025-06-02/
AGI Is Not Multimodal https://thegradient.pub/agi-is-not-multimodal/
ColQwen2 just landed in @huggingface transformers main 😍 use the state-of-the-art visual document retrieval model ColQwen2 for your PDF retrieval or RAG pipelines 🎉 link to notebook and model on the next one ⤵️ https://x.com/mervenoyann/status/1929563866658218316
It’s VLA day with open-source model releases today from both @hcompany_ai & @huggingface @LeRobotHF 🦾🦾🦾 VLA is short for Vision, Language, Action models. These are the models that allow modern robots to see, hear, understand & take action thanks to AI. It’s GPT but for … https://x.com/i/web/status/1929927844227899841
Physical AI announced knowledge insulation, a way to train vision-language-action models 7.5x faster with diffusion output. This lets the model inherit better language following from the VLM, leading to better results https://x.com/adcock_brett/status/1929207285374411199
Verified Auto Labeling: Smarter Annotation at Scale – June 24, 2025 https://voxel51.com/events/verified-auto-labeling-smarter-annotation-at-scale-june-24-2025
Impromptu VLA https://impromptu-vla.c7w.tech/