Multimodal: AI News Week Ending 08/22/2025

Image created with Flux Pro v1.1 Ultra. Image prompt: Studio control room blending audio waves, image frames, and caption tracks; the word “Multimodality” printed on a patch panel label in monospace; arrows link modes on a central timeline; crisp, prism spectrum accents

AI agent that taps and types on your iPhone across apps https://x.com/tom_doerr/status/1955655887684591829

Nvidia announced Cosmos Reason 7B, an open-source VLM to enable robots to see, reason, and act in the physical world, solving multistep tasks. The company also made Isaac Sim 5.0 and Isaac Lab 2.2 generally available https://x.com/adcock_brett/status/1957111085481242892

when you engage “hovercraft mode” on your new whip (made w/ nvidia cosmos) https://x.com/bilawalsidhu/status/1956160140404777142

NVIDIA ON A ROLL! Canary 1B and Parakeet TDT (0.6B) SoTA ASR models – Multilingual, Open Source 🔥 – 1B and 600M parameters – 25 languages – automatic language detection and translation – word and sentence timestamps – transcribe up to 3 hours of audio in one go – trained on 1 https://x.com/reach_vb/status/1957148807562723809

Testing out the new Helix walking controller. it’s unstoppable https://x.com/adcock_brett/status/1958193476639826383

BeyondMimic: From Motion Tracking to Versatile Humanoid Control via Guided Diffusion https://beyondmimic.github.io/

📢DINOv3: Scaling Self-Supervised Learning for Vision Foundation Models (Meta AI) DINOv3 is a next-generation vision foundation model trained purely with self-supervised learning. It introduces innovations that allow robust dense feature learning at scale with models reaching 7B https://x.com/OpenCVUniverse/status/1957426189477482558

DINOv3 https://ai.meta.com/dinov3/

how does DINOv3 perceive objects? 👀 I dropped a mini visualizer: you can upload images, click on objects and check > patch similarities > object boundaries > most similar other objects 🤗 live on @huggingface Spaces https://x.com/mervenoyann/status/1956694798519161118

Meta introduced two new models: —TRIBE: An AI that predicts the brain’s response to visual content, without running physical scans —DINOv3: A general-purpose computer vision model that learns from unlabeled data to power detection, segmentation + more https://x.com/adcock_brett/status/1957111038396043624

WOW! 🤯 DINOv3 can run locally on your phone… from the browser! This unlocks endless possibilities for AI-powered web apps. 🤏 Model is tiny (only 15MB at 4-bit quantization) 🧠 Delivers powerful, high-resolution image features ✨ Works completely offline Try it yourself 👇 https://x.com/xenovacom/status/1956763976080970071

Introducing ComputerRL, a framework for autonomous desktop intelligence that enables agents to operate complex digital workspaces skillfully. https://x.com/Zai_org/status/1958175133706891613

We are super excited to release OpenCUA — the first from 0 to 1 computer-use agent foundation model framework and open-source SOTA model OpenCUA-32B, matching top proprietary models on OSWorld-Verified, with full infrastructure and data. 🔗 [Paper] https://x.com/xywang626/status/1956400403911962757

The GeoGuessr powered by GLM-4.5V! GLM-4.5V skipped class and learned everything from staring at photos.👀 No Google, no maps, just visual reasoning. Drop your weirdest landscape or street photo and see if GLM-4.5V can guess where on Earth (or in the multiverse) it is! https://x.com/Zai_org/status/1956353661397094890

Ant Group just released UI-Venus on @huggingface It’s a native UI agent achieving SOTA in grounding & navigation tasks from just screenshots. Turns screenshots into reliable clicks and plans using small data and reinforcement fine-tuning. The usual way, supervised fine https://x.com/rohanpaul_ai/status/1956777729304711639

Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory https://m3-agent.github.io/

Adapt3R: Adaptive 3D Scene Representation for Domain Transfer in Imitation Learning https://www.pair.toronto.edu/Adapt3R/

introducing Halo X, always listening AI glasses. vibe think. never use your brain again. pre-order now. https://x.com/AnhPhuNguyen1/status/1958199821048705312

pairlab/Adapt3R: Official implementation of Adapt3R: Adaptive 3D Scene Representation for Domain Transfer in Imitation Learning https://github.com/pairlab/Adapt3R

Granary addresses the scarcity of high-quality speech data for low-resource languages by consolidating multiple datasets under a unified framework: https://x.com/reach_vb/status/1957149812849066448

mlx-vlm v0.3.3 is here. New models: – @LiquidAI_ LFM2-VL – @Zai_org GLM-4.5V – @cohere Command-A-Vision Changes: – New kernel for grid_sample – Fix bicubic interpolate kernel compatibility with macOS < 15 – Fix config inheritance Thank you very much to all the amazing https://x.com/Prince_Canuma/status/1958469233622327785

Big upgrade coming to Helix https://x.com/adcock_brett/status/1957526592793838038

You can now share your camera in conversations with Gemini Live to chat back-and-forth about what you see and get real-time advice. @GeminiApp can also point things out too, like what glasses are the best shape for your face¹ 😎 #MadeByGoogle https://x.com/madebygoogle/status/1958216279300403670

RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation https://x.com/_akhaliq/status/1958635243197325625

🎨✨ From simple sketches to stunning 3D interiors — powered by Qwen-Image-Edit! All designs are community contributions, showcasing how AI transforms architectural visions into realistic, stylish, and precise creations. Try it now: https://x.com/Alibaba_Qwen/status/1958744976772198825

📸 Just showed Qwen Chat Vision Understanding how to “see” and understand a meal — and it didn’t just identify the food, it analyzed what, where, weight and even how many calories! From a simple photo, we extracted detailed insights: ✅ Object detection ✅ Weight estimation ✅ https://x.com/Alibaba_Qwen/status/1956618027769971070

🖼️ 🚨 Image Edit Leaderboard Update: Qwen-Image-Edit is now the #1 open model for Image Edit in the Arena (Apache 2.0). The model by @alibaba_qwen debuts at #6 overall on the Image Edit leaderboard tied with Gemini 2.0 Flash Preview. https://x.com/lmarena_ai/status/1958206842657743270

🖼️ Image Edit Model Update Qwen-Image-Edit, developed by @Alibaba_Qwen, is now available in the Arena. This model brings image editing capabilities, and we encourage you to test it with your most complex prompts. https://x.com/lmarena_ai/status/1957878222986821711

🚀 Excited to introduce Qwen-Image-Edit! Built on 20B Qwen-Image, it brings precise bilingual text editing (Chinese & English) while preserving style, and supports both semantic and appearance-level editing. ✨ Key Features ✅ Accurate text editing with bilingual support ✅ https://x.com/Alibaba_Qwen/status/1957500569029079083

🚀 Small but mighty update to Vision Understanding in Qwen Chat — now with native 128K context and stronger performance across vision, video, and 3D tasks! 🔥 Key Upgrades: ✅ Significant boost in math & reasoning ✅ More accurate object recognition ✅ OCR support for 30+ https://x.com/Alibaba_Qwen/status/1956289523421470855

NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale “Autoregressive models—generating content step-by-step like reading a sentence—excel in language but struggle with images. Traditionally, they either depend on costly diffusion models or https://x.com/iScienceLuvr/status/1956321483183329436

Qwen Image Edit works too well with lightx2v LoRA to run with just 8 and 4 steps, wtf? in my experience, 8 steps keeps the quality of the edits at the same level as the original model, at a 12x speedup 💨 (ofc i built a demo for it) https://x.com/multimodalart/status/1958217824629092568

Qwen-Image Edit in ComfyUI https://x.com/Alibaba_Qwen/status/1957991583649001555

Qwen-Image-Edit is out in anycoder for image editing in your vibe coded apps Built on 20B Qwen-Image, it brings precise bilingual text editing (Chinese & English) while preserving style, and supports both semantic and appearance-level editing. https://x.com/_akhaliq/status/1957519569016238268

Qwen-Image-Edit is the new open weights leader in Image Editing, with quality comparable to GPT-4o and FLUX.1 Kontext [max] Qwen-Image-Edit is the image editing variant of the recent Qwen-Image release from Alibaba, also released under the Apache 2.0 license with weights https://x.com/ArtificialAnlys/status/1958712568731902241

Qwen-Image-Edit: Image Editing with Higher Quality and Efficiency | Qwen https://qwenlm.github.io/blog/qwen-image-edit/

Relighting images with Qwen Edit: impressive directional control and color temperature manipulation w/o additional finetuning. Crazy how we needed a dedicated model for this not long ago https://x.com/linoy_tsaban/status/1958176756185325931

Thank you! Qwen-Image-Edit is now available in anycoder! https://x.com/Alibaba_Qwen/status/1957709912202682588

👀🚨 Vision Leaderboard update! Two new models have entered the Vision Top 20 this week: 🔸Qwen-vl-max-2025 by @alibaba_qwen lands at #10 (tied with gemini-1.5-pro & gpt-5-nano-high) 🔸Step 3 by @StepFun_ai ranks at #19 (tied with step-1o-turbo) Congrats to both 🎉 this is https://x.com/lmarena_ai/status/1958957107946168470

Wow — Qwen-Image-Edit just debuted at #2 in the Image Editing Arena 🏆 ELO 1098, with performance on par with GPT-4o — and all at open weights under Apache 2.0. Thanks to @ArtificialAnlys Try it now: https://x.com/Alibaba_Qwen/status/1958725835818770748

Farewell Microsoft Lens – popular mobile PDF scanner app set to be ditched soon | TechRadar https://www.techradar.com/pro/microsoft-is-killing-off-its-well-loved-lens-pdf-scanner-app-in-favor-of-ai

🚀 Big update: Open ASR goes multilingual! We’re kicking off with 🇩🇪🇫🇷🇮🇹🇪🇸🇵🇹 — German, French, Italian, Spanish & Portuguese. English ASR has reached a strong level of maturity, so we’re exploring new languages 🌍 More languages coming soon… Which one should we add next? https://x.com/Tu7uruu/status/1956354974226456794

🚀 New in Weave: Content API. Log any media your AI apps use and analyze it in traces. Inspect, evaluate, and compare images, audio, video, markdown, PDFs, and even HTML. Works across all of your AI agents and apps. One place for all of your multimodal debugging needs. https://x.com/weave_wb/status/1956412035647815735

Is text-only information enough for LLM/VLM Web Agents? 🤔 Clearly not. 🙅‍♂️ The modern web is a rich tapestry of text, images 🖼️, and videos 🎥. To truly assist us, agents need to understand it all. That’s why we built MM-BrowseComp. 🌐 We’re introducing MM-BrowseComp 🚀, a new https://x.com/GeZhang86038849/status/1958381269617955165

Introducing Chat Mode. You can now build text-only conversational agents. Ideal for: – Customers that prefer typing to speaking. – Precise inputs like order IDs or email addresses. – Solving simple issues, handing off to our voice agents for complex tasks. https://x.com/elevenlabsio/status/1957820056387166413

New from S-Lab, Nanyang Technological University & SenseTime Research: Next Visual Granularity Generation (NVG)! This novel framework progressively refines images from global layout to fine details, offering fine-grained control over generation. It outperforms the VAR series in https://x.com/HuggingPapers/status/1957836902020612180

NFL and Microsoft expand partnership to bring Copilot to the sidelines and beyond – Source https://news.microsoft.com/source/2025/08/20/nfl-and-microsoft-expand-partnership-to-bring-copilot-to-the-sidelines-and-beyond/

Has GPT-5 Achieved Spatial Intelligence? GPT-5 sets SoTA but not human‑level spatial intelligence. My notes below: https://x.com/omarsar0/status/1957885032716177415

Well, @jianyuan_wang of VGGT fame simply plugged in DINOv3 into his pipeline and off-handedly got a new SotA 3D model out. Seems promising enough? https://x.com/maxseitzer/status/1956029433329922116

HUGE RELEASE! Nvidia just dropped: > Granary: the largest open-source speech dataset for European languages 🗣️🇪🇺 > Canary-1b-v2: 25 languages, ASR + En↔X translation > Parakeet-tdt-0.6b-v3: SOTA multilingual ASR You can now train your ASR model to understand European https://x.com/Tu7uruu/status/1956350036343701583

Nvidia Parakeet v3 is out! Enjoy Day 0 support with Argmax SDK – What changed from v2? – How do I use it? – Should I upgrade to this model right away? Answers in comments https://x.com/argmaxinc/status/1956385793892917288

Capabilities of GPT-5 on Multimodal Medical Reasoning https://arxiv.org/pdf/2508.08224