Image created with OpenAI GPT-Image-1. Image prompt: Cheesy late-night infomercial freeze-frame—split-screen combo of image, audio, text icons branded “MULTIMODALITY MIX-MASTER™”; rainbow gels, CRT artifacts, high-resolution
(BRAVEHEART) Microsoft’s new Copilot Vision can ‘see’ your apps on Windows | The Verge https://www.theverge.com/news/685963/microsoft-copilot-vision-windows-launch
Apple refreshed its Apple Foundation Models (AFM) with new versions for on-device and server use, aiming to improve performance in tasks like image understanding and multilingual reasoning. The company also released a Foundation Models API for developers to integrate the new models into their apps. https://x.com/DeepLearningAI/status/1936121879552537056
Our vision is for AI that uses world models to adapt in new and dynamic environments and efficiently learn new skills. We’re sharing V-JEPA 2, a new world model with state-of-the-art performance in visual understanding and prediction. V-JEPA 2 is a 1.2-billion-parameter model. https://x.com/AIatMeta/status/1932808881627148450
#NVIDIAIsaac Sim 5.0 and Isaac Lab 2.2 are now available in early developer preview on GitHub. 🎉 These releases give #Robotics developers early access to cutting-edge tools to simulate, train, and validate robots in a physics-based simulation environment. What’s new? https://x.com/NVIDIARobotics/status/1934768379652665403
Record mode is rolling out today in ChatGPT to Pro, Enterprise, and Edu users. Available on the macOS desktop app. https://x.com/OpenAI/status/1935419375600926971
#CVPR2025 Picks #3 Alibaba just released VideoRefer-VideoLLaMA3 (2B & 7B video LLMs with an Apache 2.0 license!) These models can understand videos, segment objects, and answer questions about them throughout the video at the same time 🤯 see it in action ⤵️ https://x.com/mervenoyann/status/1935739721772081336
more experiments on letting agents on @heyglif generate longer vids with Flux Ultra, Kling 2.1, MMAudio, and automated stitching. Prompt was “roman legionnaire travel log”. Abundantly clear IMO that authorship will move from creating films to creating agents that create films https://x.com/fabianstelzer/status/1935038388782113197
AI spots heart disease warning signs in routine chest scans – Earth.com https://www.earth.com/news/ai-spots-heart-disease-warning-signs-in-routine-chest-scans/
We benchmarked Apple’s new On-Device model: it trails most comparable Gemma and Qwen on-device models but is still very useful. Its GPQA Diamond performance trailed models suitable for on-device use, such as the smaller Gemma models (3n E4B, 4B, 12B) and Qwen3 models (1.7B, 4B, 8B). https://x.com/ArtificialAnlys/status/1936141541023924503
🚀 Introducing Cosmos-Predict2! Our most powerful open video foundation model for Physical AI. Cosmos-Predict2 significantly improves upon Predict1 in visual quality, prompt alignment, and motion dynamics—outperforming popular open-source video foundation models. https://x.com/qsh_zh/status/1933024567011995865
Honda-backed Helm.ai unveils vision system for self-driving cars https://tech.yahoo.com/transportation/articles/honda-backed-helm-ai-unveils-100605030.html
We just shipped video FPS support in the Gemini API, so you can dynamically customize how many frames per second you want the model to see, unlocking lots of interesting new video use cases! 📹 https://x.com/OfficialLoganK/status/1935444350374125983
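To make the FPS knob concrete, here is a minimal sketch of the underlying sampling idea: given a clip length and a target sampling rate, compute which frame timestamps a model would see. This is an illustrative helper, not the Gemini API itself; the function name and rounding behavior are assumptions.

```python
def sample_timestamps(duration_s: float, fps: float) -> list[float]:
    """Return the timestamps (in seconds) sampled from a clip at a given FPS.

    A lower fps means fewer frames (cheaper, coarser); a higher fps captures
    fast motion at more cost. This mirrors the trade-off the FPS setting exposes.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    step = 1.0 / fps          # seconds between sampled frames
    n = int(duration_s * fps)  # total frames sampled from the clip
    return [round(i * step, 3) for i in range(n)]

# A 10-second clip sampled at 0.5 FPS yields one frame every 2 seconds:
print(sample_timestamps(10, 0.5))  # [0.0, 2.0, 4.0, 6.0, 8.0]
```

At 0.5 FPS a one-minute video costs only 30 frames, while 5 FPS would cost 300 — the kind of dial that unlocks both long-video summarization and fast-action analysis.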
1X World Model Scaling Evaluation for Robots https://x.com/1x_tech/status/1934634700758520053
Apple (AAPL) Targets Spring 2026 for Release of Delayed Siri AI Upgrade – Bloomberg https://www.bloomberg.com/news/articles/2025-06-12/apple-targets-spring-2026-for-release-of-delayed-siri-ai-upgrade
7/ Omi, the world’s leading open-source AI wearable, captures conversations, gives summaries and action items, and takes actions for you. Simply connect Omi to your mobile device and enjoy automatic, high-quality transcriptions of meetings + life @kodjima33 https://x.com/AtomSilverman/status/1932988268
Under-rated privacy risk of LLMs is that they are great at finding gems in large piles of language & images. Our world is increasingly recorded for social media, etc., but it didn’t matter too much because no one could sort through all that content. Now everyone can with an AI. https://x.com/emollick/status/1933544585961353237
Microsoft just announced that Copilot Vision is now free to try on mobile The feature, much like Gemini Live, reads the camera feed in real time to help with tasks like fixing a bicycle Accessible via Copilot’s voice mode https://x.com/rowancheung/status/1933072107631509910
Vision sees what you see and helps in real time. 🙌 Try it for free. 👀 https://x.com/Copilot/status/1932840025349370064
First-of-its-kind technology helps man with ALS ‘speak’ in real time https://health.ucdavis.edu/news/headlines/first-of-its-kind-technology-helps-man-with-als-speak-in-real-time/2025/06
Robotic heart transplant surgery performed at Baylor St. Luke’s Medical Center | BCM https://www.bcm.edu/news/robotic-heart-transplant-surgery-performed-at-baylor-st-lukes-medical-center
Humanoid robots are the ultimate deployment vector for AGI https://x.com/adcock_brett/status/1935394565286154595
I wonder if Apple’s approach (on-device LLM with LoRAs & stateless cloud LLM to preserve privacy) makes sense in 2026. It was an alright bet last year, but that isn’t where the market has gone. We got a world with complex multimodal chatbots that people form ongoing bonds with. https://x.com/emollick/status/1933565955092746483
Introducing the V-JEPA 2 world model and new benchmarks for physical reasoning https://ai.meta.com/blog/v-jepa-2-world-model-benchmarks/
China unveils a new brain for humanoids. The Beijing Academy of Artificial Intelligence (BAAI) has announced RoboBrain 2.0, an open-source general-purpose AI model designed to power humanoids and other general-purpose robots. BAAI is already collaborating with 20 Chinese robotics companies. https://x.com/TheHumanoidHub/status/1934326215382569132
The Beijing Academy of Artificial Intelligence dropped RoboBrain 2.0, an open-source AI for humanoids/robots It ingests multi-image and long videos as inputs and delivers capabilities like spatial perception, temporal perception, and scene reasoning https://x.com/rowancheung/status/1934518213687029851
Paradromics Ready First Human Brain-Computer Interface (BCI) https://cannadelics.com/2025/06/05/paradromics-brain-computer-interface-implant/
6/ LlamaFS is a self-organizing file manager. It automatically renames and organizes your files based on their content and well-known conventions (e.g., time). It supports many file types, including images and audio. Super cool project by @AlexReibman! https://x.com/AtomSilverman/status/1932988830904365480
Gemini 2.5 models are sparse mixture-of-experts (MoE) transformers with native multimodal support for text, vision, and audio inputs. https://x.com/_philschmid/status/1935017208343634032
stop using VLMs blindly ✋🏻 compare different VLM outputs on a huge variety of inputs (from reasoning to OCR!) 🔥 > has support for multiple VLMs: Gemma 3, Qwen2.5VL, Llama4 > recommend us new models or inputs, we’ll add 🫡 https://x.com/mervenoyann/status/1935708014645784713
it’s been raining OCR models 🔥 here’s a new app to compare the newest OCR models on various inputs (handwriting, charts, LaTeX & more!) 📑 https://x.com/mervenoyann/status/1936033324977266874
Cover’s gen-2 hardware can now detect weapons hidden under clothes or inside bags The image here is from an early prototype – it shows a gun detected under a hoodie Over time, this kind of technology needs to be in every major public venue around the world https://x.com/adcock_brett/status/1936100934880538903
Grok cannot see videos. It also cannot tell whether an image is AI. It would be good for the X team to have the model clarify this when asked, because people keep asking Grok for its opinion on fake images and videos. https://x.com/emollick/status/1934773392156279201
Let’s goooo! @kyutai_labs just dropped SoTA speech-to-text transcription models – CC-BY-4.0 licensed 🔥 > kyutai/stt-1b-en_fr (1B params, 500ms delay, English & French) > kyutai/stt-2.6b-en (2.6B params, 2.5s delay, English-only, higher accuracy) > Capable of 400 real-time streams https://x.com/reach_vb/status/1935655403024498814
RT @LerrelPinto: We have developed a new tactile sensor, called e-Flesh, with a simple working principle: measure deformations in 3D printa… https://x.com/ylecun/status/1935466674242666831
Training robots without robots: Smart glasses capture first-person task demos https://techxplore.com/news/2025-06-robots-smart-glasses-capture-person.html
Sydney team develop AI model to identify thoughts from brainwaves – ABC News https://www.abc.net.au/news/2025-06-16/mind-reading-ai-brain-computer-interface/105376164
Human-like object concept representations emerge naturally in multimodal large language models https://arxiv.org/pdf/2407.01067
ChatGPT Record | OpenAI Help Center https://help.openai.com/en/articles/11487532-chatgpt-record
it changed to MANGO: Meta Anthropic Netflix Google OpenAI https://x.com/jxmnop/status/1934370318027460635
RT @andimarafioti: 📢 A new open-source OCR model is breaking the internet: Nanonets-OCR-s! Nanonets understands context and semantic struc… https://x.com/ClementDelangue/status/1934974278287479182
RT @kyutai_labs: Kyutai Speech-To-Text is now open-source! It’s streaming, supports batched inference, and runs blazingly fast: perfect for… https://x.com/clefourrier/status/1935701954358890806




