Image created with OpenAI GPT-Image-1. Image prompt: rich crimson, bright ivory, deep navy Independence-Day palette, vibrant, celebratory, wholesome, authentic, photorealistic Coast Guard cutter lined with flags scene featuring an AR overlay mixing text, audio, and image icons in the air; natural lighting, subtle film grain, high detail

The race is on for the LLM “cognitive core” – a few-billion-parameter model that maximally sacrifices encyclopedic knowledge for capability. It lives always-on, by default, on every computer as the kernel of LLM personal computing.
Its features are slowly crystallizing:

– Natively multimodal text/vision/audio at both input and output.
– Matryoshka-style architecture allowing a dial of capability up and down at test time.
– Reasoning, also with a dial. (system 2)
– Aggressively tool-using.
– On-device finetuning LoRA slots for test-time training, personalization and customization.
– Delegates and double checks just the right parts with the oracles in the cloud if internet is available.
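
The “Matryoshka-style” dial above echoes Matryoshka Representation Learning, where nested prefixes of a single representation remain usable on their own, so capability can be traded for compute at test time. A minimal toy sketch of that idea (the 8-d embedding and the dimension choices here are made up purely for illustration):

```python
import math

def truncate_and_renorm(embedding, dims):
    """Keep the first `dims` coordinates of a Matryoshka-style
    embedding and re-normalize the prefix to unit length."""
    prefix = embedding[:dims]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix]

def cosine(a, b):
    """Cosine similarity of two unit vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))

# A toy 8-d "full capability" embedding; shorter prefixes act as
# cheaper, lower-capability views of the same representation.
full = [0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.01]
low = truncate_and_renorm(full, 4)   # capability dialed down
high = truncate_and_renorm(full, 8)  # full capability

print(len(low), len(high))
```

The point is that nothing is retrained when the dial moves: the same stored vector serves both the cheap and the full-capability path.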

It doesn’t know that William the Conqueror’s reign ended on September 9, 1087, but it vaguely recognizes the name and can look up the date. It can’t recite the SHA-256 of the empty string as e3b0c442…, but it can calculate it quickly should you really want it.

What LLM personal computing lacks in broad world knowledge and top-tier problem-solving capability it will make up for in super-low interaction latency (especially as multimodal matures), direct/private access to data and state, offline continuity, and sovereignty (“not your weights, not your brain”) – i.e., many of the same reasons we like, use, and buy personal computers instead of having thin clients access a cloud via remote desktop. https://x.com/karpathy/status/1938626382248149433
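
That SHA-256 example is exactly the kind of fact a small model should compute rather than memorize; with Python’s standard library it is a one-liner:

```python
import hashlib

# SHA-256 of the empty byte string – a constant no model needs to memorize
digest = hashlib.sha256(b"").hexdigest()
print(digest)
# e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
```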

Amazon deploys over 1 million robots and launches new AI foundation model https://www.aboutamazon.com/news/operations/amazon-million-robots-ai-foundation-model

(6/9) Generalization: LeVERB sees “take a seat”, “sit down”, or “sit on blue chair” and knows they mean the same thing. It also reasons about space: if the chair is in front, it turns first, then sits. https://x.com/HaoruXue/status/1937216472872550894

Real and Sim Demos Diverse Humanoid Behavior Conditioned on Visual-Language https://ember-lab-berkeley.github.io/LeVERB-Website/


NVIDIA just dropped a blog on their 8B VLM Llama Nemotron Nano VL 📖 ICYMI, they also released an OCR leaderboard with this release 📑 https://x.com/mervenoyann/status/1938713088020136218

Welcome the NVIDIA Llama Nemotron Nano VLM to Hugging Face Hub
https://huggingface.co/blog/nvidia/llama-nemotron-nano-vl

Mayo Clinic’s AI tool identifies 9 dementia types, including Alzheimer’s, with one scan  – Mayo Clinic News Network https://newsnetwork.mayoclinic.org/discussion/mayo-clinics-ai-tool-identifies-9-dementia-types-including-alzheimers-with-one-scan/

Mandelbrot in x86 assembly by Claude https://simonwillison.net/2025/Jul/2/mandelbrot-in-x86-assembly-by-claude/
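
For context on what the assembly in that post has to implement: the Mandelbrot set is just the escape-time iteration z ← z² + c. This is not Claude’s x86 code, merely the standard algorithm in a few lines of Python (grid bounds and iteration cap chosen arbitrarily):

```python
def mandelbrot_iters(c, max_iters=50):
    """Return how many iterations of z <- z*z + c it takes |z| to
    exceed 2, or max_iters if c appears to be in the set."""
    z = 0 + 0j
    for i in range(max_iters):
        z = z * z + c
        if abs(z) > 2:
            return i
    return max_iters

# Coarse ASCII render: '#' marks points that never escaped
for im in range(-10, 11, 2):
    row = ""
    for re in range(-20, 11):
        c = complex(re / 10, im / 10)
        row += "#" if mandelbrot_iters(c) == 50 else "."
    print(row)
```

The whole kernel is a multiply, an add, and a magnitude test per pixel, which is why it makes such a popular target for hand-written assembly.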

Integrate Tableau with any AI agent or application using Tableau MCP, an implementation of the Model Context Protocol released by @AnthropicAI. Leverage the MCP host of your choice (ex. Agentforce Agents, Claude, or Cursor) to perform ad-hoc data analysis that extends the… https://x.com/tableau/status/1937554515659550920

Introducing Document Extraction as an MCP Server ✂️📑 A huge use case for AI agents is being able to extract out items from a diverse set of complex documents in a repeatable manner – whether it’s legal contracts, invoices, financial statements, passports, and more. In this… https://x.com/jerryjliu0/status/1940209573585199234

Berkeley AI introduced LeVERB, the first latent whole-body humanoid VLA, trained on sim data It saw an 80% zero-shot success on simple navigation tasks, and 58.5% across the board. This is 7.8 times better than a basic hierarchical VLA implementation https://x.com/adcock_brett/status/1939354156264808758

Four teams of humanoid robots went head-to-head in China’s first 3-on-3 soccer matches. Booster T1 robots were programmed by four university teams to play fully autonomously. The event served as a preview of the upcoming World Humanoid Robot Games, set for August in Beijing. https://x.com/TheHumanoidHub/status/1939734335927509363

RT @gs_ai_: Today, we’re launching Genesis AI — a global physical AI lab and full-stack robotics company — to build generalist robots and u… https://x.com/dchaplot/status/1940061390678733010

What an incredible start to automatica 2025! 🙏 The reveal of our 4NE1 Gen 3 marks a new chapter in human-robot collaboration and the excitement during the live unveiling was unforgettable. 🦾 A huge thank you to our team for making this day possible. See you tomorrow! https://x.com/NEURARobotics/status/1937571815234089328

We’re bringing powerful AI directly onto robots with Gemini Robotics On-Device. 🤖 It’s our first vision-language-action model to help make robots faster, highly efficient, and adaptable to new tasks and environments – without needing a constant internet connection. 🧵 https://x.com/GoogleDeepMind/status/1937511515768176966

how we accidentally solved robotics by watching 1 million hours of YouTube | atharva’s blog https://ksagar.bearblog.dev/vjepa/

This is what efficient AI looks like: Gemma 3n just dropped – a natively multimodal model that runs entirely on your device. No cloud. No API calls. 🧠 Text, image, audio, and video – handled locally. ⚡️Only needs 2B in GPU memory to run 🤯 First sub-10B model to hit 1300+ Elo https://x.com/fdaudens/status/1938304519344992493

🚀 Meet Qwen-TTS – now live via the Qwen API! Trained on millions of hours of speech, it delivers ultra-natural, expressive audio with smart prosody, pacing, and emotion. 🗣️ Supports 3 Chinese dialects: Beijing, Shanghai, Sichuan 🎙️ 7 bilingual voices: Cherry, Ethan, Chelsie… https://x.com/Alibaba_Qwen/status/1939553252166836457

🚨 NEW LABS EXPERIMENT 🚨 Introducing Doppl, a new mobile app that lets you upload a photo or screenshot of an outfit and then creates a video of you wearing the clothes to help you find your ✨aesthetic✨ Available on iOS and Android in the US to users 18+, download the… https://x.com/GoogleLabs/status/1938284886277951916

Try on looks and discover your style with Doppl
https://blog.google/technology/google-labs/doppl/

Multi-modal researcher with Gemini 2.5 Generate reports + custom podcasts on any topic w/ LangGraph + Gemini 2.5: 📽️ YouTube video processing 🔍 Real-time Google Search integration 🗣️ Multi-speaker text-to-speech 💻: https://x.com/LangChainAI/status/1940064813054582995

Gemma 3N quirks! 1. Vision NaNs on float16 2. Conv2D weights are large FP16 overflows to infinity 3. Large activations fixed vs Gemma 3 4. 6-7 training losses: normal for multimodal? 5. Large nums in msfa_ffn_pw_proj 6. NaNs fixed in @UnslothAI Details: https://x.com/danielhanchen/status/1940073369648734571

🚨 MCP (Model Context Protocol) is all the rage these days, yet when it comes to multimodal tool-calling the current infrastructure is, quite frankly, abysmal. 💪 That’s why we built @vlmrun MCP – giving any AI agent the power to see, understand, and automate visual content… https://x.com/sudeeppillai/status/1940040251176886600

I built Claude Tasks inside @CodeGuidedev 2 days back. Now I added vision to this, using Browser MCP. It can: – open all pages – click on any button – take screenshots – test all features And feed the report back to Claude Code. It fixed a major issue, let me explain below: https://x.com/cjzafir/status/1940072438739738663

One of the most useful MCP tools out there: ultra-precise background removal (use it directly from your chat) with this MCP https://not-lain-background-removal.hf.space/gradio_api/mcp/sse https://x.com/abidlabs/status/1939778684388614303

Okay it works! 🤯 I built an MCP server for Premiere Pro. It helps you edit videos using just prompts! Here’s a demo of me editing a TikTok video in just a few sentences 👇. https://x.com/itstundealao/status/1940098517394932109

Announcing the Open Source Release of the ERNIE 4.5 Model Family | ERNIE Blog https://yiyan.baidu.com/blog/posts/ernie4.5/

Baidu just released the weights for multiple ERNIE 4.5 variants including multimodal models https://x.com/scaling01/status/1939509144903422131

MASSIVE release from Baidu – Ernie 4.5 VLMs & LLMs, Models beat DeepSeek v3, Qwen 235B and competitive to OpenAI O1 (for VLM) – Apache 2.0 licensed 💥 https://x.com/reach_vb/status/1939569283111235645

The ERNIE 4.5 series is now officially open source. This family of models includes 10 variants—from MoE models with 47B and 3B active parameters, the largest having 424B total parameters, to a 0.3B dense model—all available now to the global AI community for open research and… https://x.com/Baidu_Inc/status/1939724778157511126

Wait, that’s JUST a 3B multimodal understanding and generation model AND Apache 2.0 licensed 🔥 https://x.com/reach_vb/status/1939627598830559644

Galbot G1: Our First Embodied AI Product – Embodying the Future of Industrial and Home Autopilot. Introducing Galbot G1, our groundbreaking first-generation robot, designed for generalizable long-duration real-world robotic operations with its advanced Sim2Real technologies and… https://x.com/GalbotRobotics/status/1802735719821299987

One of the why questions we’re always asking is how can we help operations employees access inventory more efficiently? We’re taking a big leap forward by introducing DeepFleet, an exciting new AI model that makes our one million+ robot fleet work smarter. Think of DeepFleet as… https://x.com/ajassy/status/1940119199139209389

More humanoid robots from China! Beijing-based industrial robot maker ROKAE Robotics has unveiled two humanoid robot models. https://x.com/TheHumanoidHub/status/1937926026514088103

New video of CL-3 humanoid by Chinese company LimX Dynamics. https://x.com/TheHumanoidHub/status/1940431073827328010

K-Bot is the world’s first open-source humanoid robot that is affordable, available and made in America. Robots should serve people and empower anyone to build the future, not just big corporations. ➡️Order now https://x.com/kscalelabs/status/1940108075064865126

Your Personal Open-Source Humanoid Robot for $8,999 — JX Mo, K-Scale Labs – YouTube https://www.youtube.com/watch?v=BS92RdBvI90

Okay Wispr Flow is actually amazing. Order of magnitude faster to write & review docs vs. using something like superwhisper on mac. Is there anything local that comes close to the user experience of Flow? Because rn I think the tradeoff of cloud processing seems totally worth… https://x.com/bilawalsidhu/status/1940550340144775251

The study asks if models like GPT‑4o truly understand images. It finds they juggle many jobs yet still trail task‑focused vision tools. Past tests could not match chat models with pixel specialists fairly. The authors turn every benchmark into quick yes‑no image checks any API… https://x.com/rohanpaul_ai/status/1941086082679951554

Want to learn about the research behind Gemma 3n? Altup – https://x.com/osanseviero/status/1940127957730959494

Nano is a depth-aware atmospheric haze plugin that uses ML depth estimation to add physically accurate fog and light scattering to your footage. Works *best* on log footage with visible light sources – it analyzes scene highlights then creates airlight (atmospheric scatter) and… https://x.com/bilawalsidhu/status/1938421841753772434

visual reasoning is now in @huggingface transformers 🔥 GLM-4.1V-Thinking is just released and merged into transformers, we gave it a vibe test run 🤠 it’s very good, comes with 64k context length and MIT license 😍 it supports 4k image tokens and any aspect ratio as well! https://x.com/mervenoyann/status/1940358096552902675

FlashDepth: Real-time Streaming Video Depth Estimation at 2K Resolution. TL;DR: depth estimation on 2044×1148 streaming video at 24 FPS; with careful modifications of pretrained single-image depth models, these capabilities are enabled with relatively little data and training. https://x.com/Almorgand/status/1939724839004037617

Optimus vision in the latest Tesla Impact Report: Optimus, our autonomous humanoid robot, will give people back more time to do impactful work and enjoy their lives by automating time-consuming, unsafe, and repetitive tasks at work and in the home. https://x.com/TheHumanoidHub/status/1940151959501382082


Discover more from Ethan B. Holland
