Image created with Flux Pro v1.1 Ultra. Image prompt: Multimodality, poolside lounge, notebook cover embossed with simple icons for text, audio, and image, shallow depth of field, photorealistic, editorial, minimal, landscape, vacation, no text overlays

Connecting the world of bits and atoms has never been easier. This is Sony’s “Hawk-Eye” system that fuses dozens of cameras placed around a stadium to build a spatial 4D understanding of the action in real time! It can accurately track the flight path of a ball, determining… https://x.com/bilawalsidhu/status/1964053826878517556

Why add sensors and complex systems when physics can do the job? This production line sorts products using only weight and controlled bursts of air.
✅ No cameras or vision models
✅ No expensive integration
✅ Just reliable, repeatable separation at scale
It’s a reminder that… https://x.com/IlirAliu_/status/1963869227019845865

LLMs do many things, to different levels of quality: the “jagged frontier” of ability that my coauthors and I discussed in 2023. One weak part of multimodal LLMs has been seeing fine visual details, so this is an interesting benchmark to watch to follow progress in this area. https://x.com/emollick/status/1964758268930379794

The map is not the territory. So how can you derive the territory purely from the map? Language is a limited symbolic map of reality. Or as Fei-Fei puts it – it’s a purely generative signal. A crude abstraction. To model reality we must sample it directly. https://x.com/bilawalsidhu/status/1965881373027438683

Gemini, just like everybody else. From a fascinating blog post about AI agents assigned to play web games, and failing, in large part because vision and computer use tools aren’t good enough: https://x.com/emollick/status/1963968533051617322

fun fact: when you prompt for a clock showing a specific time, nano banana will always just show 10:10. like, consistently 10/10. no notes https://x.com/fabianstelzer/status/1965001753059057925

4B OCR with Apache-2.0 license outperforming Mistral OCR 🔥 Tencent released POINTS-Reader, a new model first trained on Qwen2.5-VL annotations and then self-trained on real data. In many benchmarks it performs better than Qwen2.5-VL and Mistral OCR! https://x.com/mervenoyann/status/1966176133894098944
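
The tweet compresses the recipe, so here is a rough schematic of the two-stage training loop it describes: a supervised warm-up on pages annotated by Qwen2.5-VL, then self-training on real documents using the model’s own filtered predictions. Every function below is a stub standing in for Tencent’s actual pipeline, and the number of self-training rounds is an assumed knob, not a published detail.

```python
# Schematic of the two-stage recipe described above; every function is a
# stub standing in for Tencent's actual pipeline, not a real API.

def annotate_with_teacher(doc: str) -> str:
    """Stage-1 labels: Qwen2.5-VL transcribes synthetic pages (stubbed)."""
    return f"<markdown for {doc}>"

def finetune(model: dict, pairs: list) -> dict:
    """Stand-in for a supervised fine-tuning step on (image, text) pairs."""
    return {**model, "seen": model.get("seen", 0) + len(pairs)}

def transcribe(model: dict, doc: str) -> str:
    """Stand-in for running the current model on a real document image."""
    return f"<prediction for {doc}>"

def passes_filters(pred: str) -> bool:
    """Keep only self-labels that survive format/consistency checks (stubbed)."""
    return pred.startswith("<")

def train_points_reader(synthetic_docs, real_docs, rounds=3):
    model = {}
    # Stage 1: warm-start on teacher-annotated synthetic documents.
    model = finetune(model, [(d, annotate_with_teacher(d)) for d in synthetic_docs])
    # Stage 2: distillation-free self-training, where the model labels real
    # pages itself and the filtered labels are fed back as training data.
    for _ in range(rounds):
        pseudo = [(d, transcribe(model, d)) for d in real_docs]
        model = finetune(model, [(d, y) for d, y in pseudo if passes_filters(y)])
    return model

print(train_points_reader(["synthetic.png"], ["real_scan.png"]))
```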

📢 Getting Started with VLM on Jetson Nano. Tiny Vision Language Models (VLMs) like Moondream2, LiquidAI’s LFM2-VL, Apple’s FastVLM, and Hugging Face’s SmolVLM2 are bringing vision-language capabilities to the edge. In this tutorial, LearnOpenCV demonstrates how to deploy and run… https://x.com/LearnOpenCV/status/1965769149646540880
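
For readers who want to try one of these tiny VLMs before working through the full tutorial, here is a minimal sketch of SmolVLM2 inference using transformers’ standard chat-template API. The model id and prompt format are from the SmolVLM2 release, not the LearnOpenCV post, and the Jetson-specific setup (JetPack, CUDA wheels) is not reproduced here.

```python
# Minimal SmolVLM2 sketch using transformers' chat-template API.
# Swap in a smaller checkpoint (e.g. the 500M variant) for tighter
# edge-device budgets.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to("cuda" if torch.cuda.is_available() else "cpu")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True,
    tokenize=True, return_dict=True, return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

out = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```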

V4 is multimodal embeddings, but V4-GGUF wasn’t—until now. We’ve finally cracked how to generate multimodal embeddings using llama.cpp & GGUF. We fixed two main issues. First, in the language model part, we corrected the attention mask in the transformer block so it properly… https://x.com/JinaAI_/status/1965836110371893353
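
The tweet is about fixes inside the converted model, but even the text side of a GGUF embedding model trips people up, so here is a minimal llama-cpp-python sketch for pulling sentence embeddings out of a GGUF file. The filename is a placeholder, the snippet mean-pools per-token vectors itself in case the converted model does not bake in a pooling layer, and the image/multimodal path (llama.cpp’s mmproj machinery) is out of scope here.

```python
# Minimal sketch: text embeddings from a GGUF model via llama-cpp-python.
import numpy as np
from llama_cpp import Llama

llm = Llama(
    model_path="jina-embeddings-v4-text.gguf",  # placeholder filename
    embedding=True,   # run in embedding mode instead of text generation
    n_ctx=2048,
    verbose=False,
)

def embed(text: str) -> np.ndarray:
    vec = np.asarray(llm.embed(text), dtype=np.float32)
    if vec.ndim == 2:          # per-token embeddings: mean-pool ourselves
        vec = vec.mean(axis=0)
    return vec / np.linalg.norm(vec)

a = embed("The map is not the territory.")
b = embed("Language is a symbolic abstraction of reality.")
print("cosine similarity:", float(a @ b))
```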

an underrated skill in ML is deriving great insight from small wiggles in the graphs https://x.com/gdb/status/1964801175066456435

fan-favorite vision LM Florence-2 is now officially supported in @huggingface transformers 🤗 find all the models in florence-community org 🫡 https://x.com/mervenoyann/status/1966122522723725420
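
Florence-2’s prompt-token interface is worth seeing once: each task is selected with a special token like <OD> or <CAPTION>, and the raw generation is post-processed per task. This sketch follows the long-standing trust_remote_code API from the original model card; the natively supported checkpoints in the florence-community org should load without that flag, but check the model cards for the exact class names.

```python
# Florence-2 sketch via the original trust_remote_code API.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-base"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
image = Image.open(requests.get(url, stream=True).raw)

task = "<OD>"  # object detection; other task tokens: <CAPTION>, <OCR>, ...
inputs = processor(text=task, images=image, return_tensors="pt")

generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=512,
    num_beams=3,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# Each task has structured post-processing (boxes + labels for <OD>).
print(processor.post_process_generation(
    generated_text, task=task, image_size=(image.width, image.height)
))
```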

POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion https://huggingface.co/tencent/POINTS-Reader
