Ethan B. Holland

Over 54,400 manually organized AI links and counting

Multimodal: AI News Week Ending 03/27/2026

March 27, 2026

Image created with gemini-3.1-flash-image-preview with claude-sonnet-4-5. Image prompt: Using the provided reference image, preserve the exact crate construction with horizontal dark reddish-brown wooden slats, iron hardware, weathered paint, and three-panel layout with hand-painted black stencil text, but replace the original address text with ‘MULTIMODALITY’ in the same loose confident brushstroke style. Place the crate on a weathered wooden dock in early spring dawn light, with vintage objects representing different senses–leather book, binoculars, old radio, lavender bundle, ceramic bowl–resting against and spilling from the partially-open crate, all rendered in soft focus except the crate face.

Cohere Transcribe: state-of-the-art speech recognition https://cohere.com/blog/transcribe

A foundation model of vision, audition, and language for in-silico neuroscience | Research – AI at Meta https://ai.meta.com/research/publications/a-foundation-model-of-vision-audition-and-language-for-in-silico-neuroscience/

Ai2 just released MolmoPoint GUI on Hugging Face. A specialized VLM for GUI automation that points using grounding-tokens instead of coordinates, reaching 61.1 on ScreenSpotPro.
https://x.com/HuggingPapers/status/2036101402477404284

🎉 Congrats to @Cohere on releasing Cohere Transcribe, a 2B speech recognition model (Apache 2.0, 14 languages). Day-0 support in vLLM. Cohere contributed encoder-decoder serving optimizations to vLLM: variable-length encoder batching and packed attention for the decoder. Up to
https://x.com/vllm_project/status/2037197243111895066

Introducing: Cohere Transcribe – a new state-of-the-art in open source speech recognition.
https://x.com/cohere/status/2037159129345614174

New SOTA transcription model from @cohere! – #1 on accuracy on the Open ASR Leaderboard. – Open Source (Apache 2.0) – 14 Languages (English, French, Arabic, German, Italian, Spanish, Portuguese, Greek, Dutch, Polish, Chinese, Japanese, Korean, Vietnamese).
https://x.com/JayAlammar/status/2037172878165053951

🔥 Introducing LongCat-Next: A Discrete Native Autoregressive Multimodal Model. LongCat-Next integrates language, vision, and audio into a unified discrete autoregressive model, extending Next-Token Prediction to native multimodality and delivering industrial-strength performance
https://x.com/Meituan_LongCat/status/2036861293140054510

Introducing the MiniMax Token Plan: First All-Modality API Subscription. Flat-rate API access to MiniMax’s leading text, speech, music, video, and image models. Stop juggling multiple unpredictable bills for different modalities. One key. One predictable bill. All modalities.
https://x.com/MiniMax_AI/status/2036123727373672910

Meituan: «unified latent space in which all tokens, textual, visual, and acoustic, are processed through a single modality-agnostic pathway» derived from their model with N-gram embedding. interesting report even irrespective of the model (seems… meh for imagen).
https://x.com/teortaxesTex/status/2036896514157502749

Victoria Slocum: “If you’re building a PDF RAG pipeline: Should you be using OCR and text-based retrieval methods, or just embed images directly using late interaction models? This paper says the answer might actually be both. My colleagues at Weaviate https://t.co/iNQOR56nnU”
https://x.com/victorialslocum/status/2037113651174199778