Image created with gemini-2.5-flash-image; prompt written with claude-sonnet-4-5. Image prompt: Photorealistic wide shot of six Ionic limestone columns on a grassy quad with red-brick buildings behind, late afternoon golden light, topped with a classical stone entablature bearing the carved inscription MULTIMODALITY in centered Roman serif capitals, flanked by shallow limestone relief carvings of a scroll, framed portrait, lyre, and theatrical mask on either side, all carved into the same beige stone surface, sharp architectural detail, natural shadows, clear blue sky.
EdgeTAM, Meta's real-time segment tracker, is now in @huggingface transformers under an Apache-2.0 license 🔥 It runs 22x faster than SAM 2, hitting 16 FPS on an iPhone 15 Pro Max with no quantization, and supports single-, multi-, and refined-point prompting as well as bounding-box prompts. https://x.com/mervenoyann/status/1986785795424788812
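For flavor, here is a minimal sketch of what single-point prompting could look like, assuming EdgeTAM follows the SAM-style image API already in transformers. The checkpoint id and the Auto* class resolution below are assumptions, so check the model card for the real identifiers before running.

```python
# Minimal single-point segmentation sketch, assuming a SAM-style API.
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

checkpoint = "facebook/edgetam"  # hypothetical checkpoint id
processor = AutoProcessor.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

image = Image.open("frame.jpg").convert("RGB")
# One foreground click at pixel (450, 600); the nesting
# (batch -> points -> (x, y)) follows the SAM convention in transformers.
input_points = [[[450, 600]]]

inputs = processor(images=image, input_points=input_points, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Upscale the low-resolution mask logits back to the original image size.
masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu(),
)
print(masks[0].shape)  # (num_points, num_candidate_masks, H, W)
```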
Any2Track: Galbot’s whole-body motion tracking system. https://x.com/TheHumanoidHub/status/1985219089015664676
Google Maps is getting a powerful boost with Gemini, making navigation smarter and easier. ✨ Learn more about the new features ↓ https://x.com/Google/status/1986164830588248463
Google Maps launches Gemini features, including landmark navigation https://blog.google/products/maps/gemini-navigation-features-landmark-lens/
We wanted to share more information about Gemma in AI Studio: First, to clarify the distinction between our AI products. Our Gemma models are a family of open models built specifically for the developer and research community. They are not meant for factual assistance or for … https://x.com/NewsFromGoogle/status/1984412221531885853
VCode: A Multimodal Coding Benchmark with SVG as Symbolic Visual Representation https://x.com/_akhaliq/status/1986073575216824650
SAM 2++: Tracking Anything at Any Granularity. TL;DR: unifies video tracking across masks, boxes, and points; uses task-specific prompts, a unified decoder, and a task-adaptive memory to track at any granularity. Backed by a new large-scale dataset. https://x.com/Almorgand/status/1986112315050369103
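The unifying trick, as the TL;DR describes it, is a single prompt pathway for every granularity. A toy illustration of that idea (entirely hypothetical, not the paper's code; all names are made up):

```python
# Toy sketch: one tracker entry point that accepts mask, box, or point
# prompts and normalizes them into a shared encoding. In the real model a
# unified decoder and task-adaptive memory consume this shared pathway.
from dataclasses import dataclass
from typing import Union
import numpy as np

@dataclass
class MaskPrompt:
    mask: np.ndarray   # (H, W) binary mask

@dataclass
class BoxPrompt:
    xyxy: tuple        # (x1, y1, x2, y2)

@dataclass
class PointPrompt:
    xy: tuple          # (x, y) single click

Prompt = Union[MaskPrompt, BoxPrompt, PointPrompt]

def encode_prompt(p: Prompt) -> np.ndarray:
    """Map any prompt granularity into one space (here: a normalized box,
    the common denominator of all three prompt types)."""
    if isinstance(p, MaskPrompt):
        ys, xs = np.nonzero(p.mask)
        box = (xs.min(), ys.min(), xs.max(), ys.max())
    elif isinstance(p, BoxPrompt):
        box = p.xyxy
    else:  # a point becomes a degenerate box
        box = (p.xy[0], p.xy[1], p.xy[0], p.xy[1])
    return np.array(box, dtype=np.float32)

mask = np.zeros((64, 64)); mask[10:20, 30:40] = 1
for prompt in (MaskPrompt(mask), BoxPrompt((30, 10, 39, 19)), PointPrompt((35, 15))):
    print(type(prompt).__name__, encode_prompt(prompt))
```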
When Visualizing Is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought https://x.com/_akhaliq/status/1986075520962793672
ByteDance released BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration https://x.com/_akhaliq/status/1986058046876070109
MAI-Image-1 has shipped 🚢 Try it now at … https://x.com/mustafasuleyman/status/1985777196460622327
Whisper Into This AI-Powered Smart Ring to Organize Your Thoughts | WIRED https://www.wired.com/story/sandbar-stream-smart-ring/
How we built a custom vision LLM to improve document processing at Grab https://engineering.grab.com/custom-vision-llm-at-grab
what is ocr? – introspection ft. harsehaj https://harsehaj.substack.com/p/ocr-models-explained
#1 on the MTEB multilingual leaderboard. https://x.com/fdaudens/status/1984541314063446191
This is the letter that caused Gemma to be pulled from AI Studio. Thread by @AndrewCurran_ on Thread Reader App. https://threadreaderapp.com/thread/1984995011482755085.html
Vidu Q2 launches at #8 on the Artificial Analysis Text to Video Leaderboard, surpassing standard Sora 2 and Wan 2.5! It is also one of the first models to support video generation with multiple reference images, enabling more controllable results by using multiple angles of the subject. https://x.com/ArtificialAnlys/status/1985781760236630305
Inworld TTS 1 Max is the new leader on the Artificial Analysis Speech Arena Leaderboard, surpassing MiniMax's Speech-02 series and OpenAI's TTS-1 series. The Artificial Analysis Speech Arena ranks leading Text to Speech models based on human preferences: in the arena, users listen to paired clips and vote for the one they prefer. https://x.com/ArtificialAnlys/status/1986464484492447801
Bats inspire WPI researchers to develop drones using echolocation – The Robot Report https://www.therobotreport.com/wpi-researchers-create-bat-inspired-search-rescue-drones/
Introducing the world's largest, most capable robotic foundation model. https://x.com/E0M/status/1985760232170209583
This is a great achievement from the SGLang team! I was immediately asked by a few people what vLLM's plan is. Multimodal generation with omni models is something we have been working on with the community and model vendors. As a sneak peek, Hunyuan-image 3.0 (that beats … https://x.com/rogerw0108/status/1986919399346106490