Image created with gemini-3.1-flash-image-preview with claude-sonnet-4-5. Image prompt: Using the provided reference image, keep the tight left-anchored profile composition, deep blue-purple cinematic lighting, wispy atmospheric smoke bleeding rightward, and glitter scatter catching light, but replace the central subject with a figure in profile holding a glass prism refracting fragments of images, text glyphs, and sound waveforms into the haze, maintaining the emotionally weighted post-party stillness and HBO prestige drama aesthetic, with ‘multimodality’ in thin lowercase white Helvetica Neue Light on the right two-thirds.
Today, we are launching our collaboration with @nomic_ai to help AI agents understand complex PDF documents more effectively and efficiently. Nomic’s new nomic-layout-v1 model allows your AI agents to parse documents locally, so sensitive documents never leave your machine.
https://x.com/usemuna/status/2041879769332216009
we just shipped layout models that run entirely on your laptop with @usemuna. no server. no API key. no cost per page. an agent can now parse a 500-page PDF the same way it reads a text file
https://x.com/andriy_mulyar/status/2041893915347812710
Gemma 4 E2B on iPhone 17 Pro Max in AI Edge Gallery! Using skills to query Wikipedia. 🔥 App link below. [cr: @mweinbach]
https://x.com/_philschmid/status/2041171039598543064
Insane. I’m running Gemma 4 on my iPhone 16 Pro Max. Vibe coded the app in under 1h. Singularity is here
https://x.com/enjojoyy/status/2040563245925151229
Gemma 4 E4B is impressive for an on-device LLM. GPT-4ish quality, though expect hallucinations. Here is: “List five sociological theories starting with u and what they are. Then describe them in a rhyming verse” It’s in real time; the last one is a little bit of a stretch, but not bad!
https://x.com/emollick/status/2040851723774808310
People are asking what’s the difference between Falcon Perception and SAM3, so here’s my opinion: SAM3:
https://t.co/KVRbuHm8H1 Falcon Perception:
https://t.co/QDgMlOBvDH First, sam3 does “promptable concept segmentation”: simple noun phrases (like “yellow bus”, “red apple”) +
https://x.com/dahou_yasser/status/2041474094252933195
Today we’re releasing WildDet3D, an open model for monocular 3D object detection in the wild. It works with text, clicks, or 2D boxes, and on zero-shot evals it nearly doubles the best prior scores. 🧵
https://x.com/allen_ai/status/2041545111151022094
I noticed there wasn’t anything like this out there, so I wrote a tiny visual blog for those wanting to introduce themselves to Dynamic Gaussian Splatting and its current methods 🖼️ Feel free to check it out; these are some of the visuals taken from it https://t.co/6W2qx2yI1K
https://x.com/pabloadaw/status/2041650303804555278
We’re excited to be rolling out two model updates today! Marble 1.1: Improves lighting and contrast, with a major reduction in visual artifacts. Marble 1.1-Plus: Our new model built for scale. Create larger, more complex environments than ever before.
https://x.com/theworldlabs/status/2041554646561677701
Generate 3D models and interactive charts with the Gemini app
https://blog.google/innovation-and-ai/products/gemini-app/3d-models-charts/
Google quietly launched an AI dictation app that works offline | TechCrunch
Google’s Gemma 4 E2B running on-device on iPhone 17 Pro. Gemma 4 is built from the same research as Gemini 3, has image understanding capabilities, and can reason if needed. Running at ~40 tk/s with MLX optimized for Apple Silicon.
https://x.com/adrgrondin/status/2040512861953270226
Lots of people want Gemma 4! Google AI Edge is #8 on the iOS App Store for productivity apps.
https://x.com/OfficialLoganK/status/2040874501777317982
Gemma 2 Release – a google Collection
https://huggingface.co/collections/google/gemma-2-release
Gemma 3 Release – a google Collection
https://huggingface.co/collections/google/gemma-3-release
Gemma 4 – a google Collection
https://huggingface.co/collections/google/gemma-4
Gemma 4 is now available in the Gemini API and Google AI Studio. Use `gemma-4-26b-a4b-it` and `gemma-4-31b-it` with the same `google-genai` SDK as Gemini. 📝 Text generation with `generate_content`. 🧭 System instruction + Function Calling example. 🖼️ Image understanding example.
https://x.com/_philschmid/status/2041532358969446596
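For reference, the call pattern looks roughly like this with the `google-genai` SDK; the model id comes from the post above, while the client setup and prompt are illustrative (assumes a `GEMINI_API_KEY` environment variable):

```python
# Minimal sketch: text generation with Gemma 4 through the Gemini API.
from google import genai

client = genai.Client()  # picks up GEMINI_API_KEY from the environment

response = client.models.generate_content(
    model="gemma-4-26b-a4b-it",  # model id from the announcement above
    contents="Explain mixture-of-experts routing in two sentences.",
)
print(response.text)
```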
Breaking: @AIatMeta just released Muse Spark — now live across @ScaleAILabs leaderboards. Here’s how it stacks up: Tied for 🥇on SWE-Bench Pro Tied for 🥇on HLE Tied for 🥇on MCP Atlas Tied for 🥇on PR Bench – Legal Tied for 🥈on SWE Atlas Test Writing 🥈on PR Bench – Finance
https://x.com/scale_AI/status/2041934840879358223
Introducing Muse Spark, the first in the Muse family of models developed by Meta Superintelligence Labs. Muse Spark is a natively multimodal reasoning model with support for tool-use, visual chain of thought, and multi-agent orchestration. Muse Spark is available today at
https://x.com/AIatMeta/status/2041910285653737975
NEW: Meta announces Muse Spark. All you need to know: * It’s their new multi-modal reasoning model. * Strong at multi-agent orchestration and multi-modal reasoning. * Contemplating mode orchestrates multiple agents that reason in parallel. Helps to compete with models such
https://x.com/omarsar0/status/2041919769536770247
To spend more test-time reasoning without drastically increasing latency, we can scale the number of parallel agents that collaborate to solve hard problems. While standard test-time scaling has a single agent think for longer, scaling Muse Spark with multi-agent thinking enables
https://x.com/AIatMeta/status/2041926297216282639
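To make that concrete, here is a toy sketch of the general pattern, not Meta’s implementation: several agents reason over the same problem concurrently, and a simple majority vote aggregates their final answers, so the extra test-time compute adds little wall-clock latency.

```python
# Hedged sketch of parallel multi-agent test-time scaling.
import asyncio
from collections import Counter

async def solve(agent_id: int, problem: str) -> str:
    # Stands in for one agent's full reasoning pass (a model call).
    await asyncio.sleep(0)
    return "42"  # each agent returns its own final answer

async def parallel_think(problem: str, n_agents: int = 8) -> str:
    # All agents run concurrently: latency ~ one agent, compute ~ n agents.
    answers = await asyncio.gather(*(solve(i, problem) for i in range(n_agents)))
    # Aggregate with a majority vote over final answers.
    return Counter(answers).most_common(1)[0][0]

print(asyncio.run(parallel_think("hard geometry problem")))
```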
Meta is back! Muse Spark scores 52 on the Artificial Analysis Intelligence Index, behind only Gemini 3.1 Pro, GPT-5.4, and Claude Opus 4.6. Muse Spark is the first new release since Llama 4 in April 2025 and also Meta’s first release that is not open weights Muse Spark is a new
https://x.com/ArtificialAnlys/status/2041913043379220801
try muse spark via the Meta AI app or
https://t.co/DipeeIuXm2! check out this simulation i made:
https://x.com/alexandr_wang/status/2041953243895623913
1/ today we’re releasing muse spark, the first model from MSL. nine months ago we rebuilt our ai stack from scratch. new infrastructure, new architecture, new data pipelines. muse spark is the result of that work, and now it powers meta ai. 🧵
https://x.com/alexandr_wang/status/2041909376508985381
The new model from Meta, Muse Spark, is pretty good at converting images to code!
https://x.com/skirano/status/2041920891072700631
Excited to share what we’ve been building at Meta Superintelligence Labs! We just released Muse Spark, our first AI model. It’s a natively multimodal reasoning model and the first step on our path to personal superintelligence. We’ve overhauled our entire stack to support
https://x.com/shengjia_zhao/status/2041909050728931581
Introducing Muse Spark: Scaling Towards Personal Superintelligence
https://ai.meta.com/blog/introducing-muse-spark-msl/
Meta is back in the game! It’s been fun to test out Muse Spark. Beyond benchmarks, it’s actually a good day to day model… surprisingly good at technical problems and making arcade games. Never bet against @alexandr_wang @natfriedman @danielgross
https://x.com/matthuang/status/2041911766586945770
Meta just released a frontier model, Muse Spark. It takes the #3 spot on our Vals Index.
https://x.com/ValsAI/status/2041922037745381389
try muse spark yourself! download the Meta AI app or go to
https://x.com/alexandr_wang/status/2042024651610861657
We had pre-release access to Meta’s new Muse Spark model and evaluated it on FrontierMath. It scored 39% on Tiers 1-3 and 15% on Tier 4. This is competitive with several recent frontier models, though behind GPT-5.4.
https://x.com/EpochAIResearch/status/2041947954202988757
To build personal superintelligence, our model’s capabilities should scale predictably and efficiently. Below, we share how we study and track Muse Spark’s scaling properties along three axes: pretraining, reinforcement learning, and test-time reasoning. 🧵👇 Let’s start with
https://x.com/AIatMeta/status/2041926291142930899
I showed you SAM 3 all week. This is a 0.6B model that outperforms it. Falcon Perception. Type “detect the plane” and it segments every plane in the frame. Pixel-accurate masks from natural language. Fighter jets. Fire. Crowds. All on a MacBook via MLX. No cloud.
https://x.com/MaziyarPanahi/status/2040776481673281936
An open-source Python library for structured data extraction – LangExtract from Google It turns unstructured text into grounded, verifiable structured outputs using LLMs. Every extraction is mapped back to the source, fully traceable and verifiable. LangExtract: – Combines
https://x.com/TheTuringPost/status/2040097129759445439
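A minimal sketch following LangExtract’s documented few-shot pattern; the task, example data, and model id below are my own illustrative choices, not from the post:

```python
import langextract as lx

# One worked example teaches the model the desired output schema.
examples = [
    lx.data.ExampleData(
        text="Ada Lovelace wrote the first algorithm in 1843.",
        extractions=[
            lx.data.Extraction(
                extraction_class="person",
                extraction_text="Ada Lovelace",
                attributes={"year": "1843"},
            )
        ],
    )
]

result = lx.extract(
    text_or_documents="Grace Hopper popularized compilers in the 1950s.",
    prompt_description="Extract people and the years associated with them.",
    examples=examples,
    model_id="gemini-2.5-flash",
)

# Each extraction stays grounded in the source text, which is what makes
# the outputs traceable and verifiable.
for e in result.extractions:
    print(e.extraction_class, e.extraction_text, e.attributes)
```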
There were some exceptionally cool demos from @ollama and omlx using MLX to run Qwen 3.5 and Gemma 4 on Apple silicon. The capabilities of local LLMs and the surrounding ecosystem have come a long way in the past couple years.
https://x.com/awnihannun/status/2042456446122803275
Gemma-4 finetuning: 2B, 4B, 26B, 31B all work in Unsloth! We also fixed a few issues: 1. Grad accumulation no longer causes losses to explode 2. Index Error for 26B and 31B during inference 3. use_cache=False produced gibberish for E2B, E4B 4. float16 audio: -1e9 overflows in float16
https://x.com/danielhanchen/status/2041516671119327590
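For context, a typical Unsloth QLoRA setup looks roughly like the sketch below; the Gemma 4 repo id is a placeholder guess, so check Unsloth’s Hugging Face organization for the real names:

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-4-4b-it",  # hypothetical repo id
    max_seq_length=4096,
    load_in_4bit=True,  # 4-bit base weights keep memory low
)

# Attach LoRA adapters; only these small matrices are trained,
# which is why the larger 26B/31B variants fit on modest GPUs.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```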
Introducing Gemma 4, our series of open weight (Apache 2.0 licensed) models, which are byte for byte the most capable open models in the world! Gemma 4 is built to run on your hardware: phones, laptops, and desktops. Frontier intelligence with a 26B MoE and a 31B dense model!
https://x.com/OfficialLoganK/status/2039735606268314071
People underestimate the level of collaboration that needs to happen for a model such as Gemma 4 to land Before the launch, we worked with HF, VLLM, llama.cpp, Ollama, NVIDIA, Unsloth, Cactus, SGLang, Docker, CloudFlare, and so many others This ecosystem is amazing 🔥
https://x.com/osanseviero/status/2041154555530932578
Gemma 4 31B, quantized and evaluated. Instruction following evals are live on our NVFP4 and FP8-block model cards. Results look great. Reasoning and vision evals coming later this week. NVFP4:
https://t.co/GIc7y1Abkc FP8:
https://x.com/RedHat_AI/status/2040766645480628589
Gemma 4 is #1 on @huggingface!
https://x.com/ClementDelangue/status/2040911131108069692
Gemma 4 is a beast.
https://x.com/Yampeleg/status/2040495537598648357
Speculative decoding for Gemma 4 31B (EAGLE-3) A 2B draft model predicts tokens ahead; the 31B verifier validates them. Same output, faster inference. Early release. vLLM main branch support is in progress (PR #39450). Reasoning support coming soon.
https://x.com/RedHat_AI/status/2042660544797110649
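The draft/verify loop is easy to see in miniature. Below is a toy greedy-verification sketch of speculative decoding in general, not the EAGLE-3 or vLLM code (EAGLE-3 drafts from the verifier’s hidden states rather than running a separate autoregressive model):

```python
def speculative_step(draft_model, verifier, prefix, k=4):
    """One round: cheap draft proposes k tokens, big model verifies them."""
    # 1) The small draft model proposes k tokens autoregressively.
    draft, ctx = [], list(prefix)
    for _ in range(k):
        token = draft_model(ctx)
        draft.append(token)
        ctx.append(token)

    # 2) The verifier scores all k positions in a single forward pass.
    verified = verifier(prefix, draft)  # verifier's greedy token per position

    # 3) Keep the agreed prefix; on the first mismatch, take the
    #    verifier's token and stop. Output matches verifier-only decoding.
    accepted = []
    for proposed, correct in zip(draft, verified):
        if proposed == correct:
            accepted.append(proposed)
        else:
            accepted.append(correct)
            break
    # (Real systems also emit one bonus verifier token when all k match.)
    return prefix + accepted
```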
Gemma 4 is the #1 trending model on @huggingface 🤗
https://x.com/GlennCameronjr/status/2040529333794824456
Falcon Perception is unbelievable! Look at the demo video!
https://x.com/ivanfioravanti/status/2040886300971004270
Know3D: Prompting 3D Generation with Knowledge from Vision-Language Models. TL;DR: injects vision-language knowledge into diffusion-based 3D generation to make unseen regions controllable and semantically consistent
https://x.com/Almorgand/status/2040420958532514067
We always need more visuals! Check out this one for dynamic Gaussian splatting
https://x.com/Almorgand/status/2041773431524302968
Projects in the @GeminiApp are now live, with a fun twist… Notebooks! Enjoy the NotebookLM-inspired experience.
https://x.com/OfficialLoganK/status/2042025888053702911
We taught a 1.3M parameter model to play DOOM. It outperforms LLMs up to 92,000x its size. Happy Easter Monday! Here’s our Easter egg release: SauerkrautLM-Doom-MultiVec-1.3M. 17.8 average points per episode. We benchmarked our tiny model against GPT-4o-mini (via OpenAI API),
https://x.com/DavidGFar/status/2041063368656585002
Seems like a good model from Meta that is still trailing the current series of releases. The most important thing to note is that it is not open weights. That was the main reason that Meta’s models were so important. Without that, it is a lot harder to predict the value of Spark
https://x.com/emollick/status/2041924282964394085
try for yourself!
https://t.co/DipeeIuXm2 or download Meta AI app
https://x.com/alexandr_wang/status/2041985846950424760
Our first model from MSL, Muse Spark, is now available on
https://t.co/qBMQ6BPVgP! This is an efficient all-rounder model. It supports fast responses, deeper thinking, visual chain of thought, and a higher-inference “Contemplating” mode. Plus, it’s natively multimodal. 1/
https://x.com/jack_w_rae/status/2041925332631183421
1/ It’s been so fun working with @shengjia_zhao, @alexandr_wang and the team to build muse spark from scratch. It is early and has rough edges, but excited to continue our research velocity. I especially love that we’re doubling down on the fundamental science. We’re focused on
https://x.com/ananyaku/status/2041913147842556390
1/ Muse Spark is live, and alongside it, our new Advanced AI Scaling Framework which details how we evaluate and prepare for advanced AI. We tested across bio, chem, cyber, and loss of control risks before and after mitigations. Muse Spark achieves a 98% bioweapons refusal rate
https://x.com/summeryue0/status/2041956901769113948
Check out Muse Spark, our first milestone in the quest for personal superintelligence! Scaling this with the team has been a total blast. Give it a spin and let us know what you think! 🥑
https://x.com/ren_hongyu/status/2041922484040298796
try muse spark on
https://x.com/alexandr_wang/status/2041956770864885870
Excited to share Muse Spark! It’s a strong natively multimodal model with many surprising properties that emerged. Here, the model is able to use Python tools to make a playable Sudoku game on the web from an image input of the board. ✨
https://x.com/mattdeitke/status/2041915503795671056
.@MicrosoftAI just dropped a full next-gen media stack: fast, efficient SOTA models at an affordable cost ▪️ MAI-Transcribe-1, speech → text ▪️ MAI-Voice-1, text → speech ▪️ MAI-Image-2, text → image All ready to build with in Microsoft Foundry and MAI Playground We tested
https://x.com/TheTuringPost/status/2039720786722951624
Three models. Three top-tier results. All shipped within just a few months by the @MicrosoftAI team. – MAI-Transcribe-1 dropped today, the most accurate transcription model in the world across 25 languages according to FLEURS WER benchmark. – MAI-Voice-1 sets a new standard for
https://x.com/mustafasuleyman/status/2039704624006148195
For all my lifters: computer vision app to measure back curvature during deadlift! main technical highlights: — RF-DETR (Roboflow) to segment the person (great performance out-the-box with no additional training!) — YOLO11n (Ultralytics) for bounding box prediction around the
https://x.com/IlirAliu_/status/2041939917673017855
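One plausible way to turn pose keypoints into a back-angle number, sketched with Ultralytics’ off-the-shelf pose model; note the author’s pipeline uses RF-DETR for segmentation and YOLO11n for boxes, so the pose variant and the shoulder-hip angle below are illustrative substitutions:

```python
import math
from ultralytics import YOLO

model = YOLO("yolo11n-pose.pt")  # COCO keypoints: 5/6 shoulders, 11/12 hips

results = model("deadlift_frame.jpg")   # hypothetical input frame
kpts = results[0].keypoints.xy[0]       # (17, 2) keypoints, first person

# Midpoints of the shoulders and hips define a torso line.
shoulder = (kpts[5] + kpts[6]) / 2
hip = (kpts[11] + kpts[12]) / 2

# Torso angle vs. horizontal (image y grows downward, so mind the sign).
angle = math.degrees(math.atan2(float(shoulder[1] - hip[1]),
                                float(shoulder[0] - hip[0])))
print(f"torso angle: {angle:.1f} degrees")
```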
Ace Step 1.5 XL is here. Suno 5+ quality at home, open source & fine-tuneable
https://x.com/multimodalart/status/2041563576876327048
etn. & @ElevenLabs at 10 Downing Street
https://x.com/lukeknight/status/2042221068425785526?s=20
v5.5 is the best music model on the planet. Here’s why.
https://x.com/suno/status/2041541160015937995
We spent weeks testing text vs. image retrieval for RAG. The winner? Neither. Our recent publication, IRPAPERS, compares text-based retrieval (OCR + vector, keyword, and hybrid search) and image-based retrieval (multimodal late
https://x.com/weaviate_io/status/2041897318367060054
Multimodal Embedding & Reranker Models with Sentence Transformers
https://huggingface.co/blog/multimodal-sentence-transformers
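The encode pattern in that post builds on what Sentence Transformers has exposed for CLIP-style checkpoints for a while; a minimal cross-modal similarity example (the image path is hypothetical):

```python
from PIL import Image
from sentence_transformers import SentenceTransformer

# A CLIP checkpoint that maps text and images into one embedding space.
model = SentenceTransformer("clip-ViT-B-32")

img_emb = model.encode([Image.open("photo.jpg")])     # shape (1, dim)
txt_emb = model.encode(["a dog catching a frisbee"])  # shape (1, dim)

# Cosine similarity works across modalities because the space is shared.
print(model.similarity(img_emb, txt_emb))
```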
For Olmo 3, we moved from a synchronous RL setup to an asynchronous one. This made our code 4x faster in terms of throughput (tokens/second). I wrote about the changes in the paper, but I finally found the time to go deeper on what was involved:
https://x.com/finbarrtimbers/status/2041176604961878271
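A toy illustration of the sync-to-async shift, not the Olmo 3 code: actors keep pushing rollouts into a queue while the learner trains on whatever is ready, so generation and optimization overlap instead of alternating.

```python
import queue
import threading

rollouts: queue.Queue = queue.Queue(maxsize=64)

def actor() -> None:
    # Stands in for continuous rollout generation (model.generate loops).
    while True:
        rollouts.put("trajectory")

def learner(steps: int = 5, batch_size: int = 8) -> None:
    for step in range(steps):
        batch = [rollouts.get() for _ in range(batch_size)]
        # A train_step(batch) would run here; generation never waits on it.
        print(f"step {step}: trained on {len(batch)} rollouts")

threading.Thread(target=actor, daemon=True).start()
learner()
```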
Common Failure Modes Break VLM-Powered OCR in Production. 🔁 Repetition Loops — model spirals into infinite whitespace, exhausts resources, cascades latency across your system 🛑 Recitation Errors — safety filters hard-stop legitimate extractions as “copyright violations”
https://x.com/llama_index/status/2041923086719631780
Hooked up GPT 5.4 to auto-translate our docs via automatic GitHub trigger; way better than what Google Translate gives you. (also takes significantly longer, so happens once a day)
https://x.com/steipete/status/2040831898620932397
Robots can now reconstruct 3D scenes in real time from a single RGB camera. [📍 Projects page + paper] No depth sensor. No retraining. 30 FPS. Researchers at Imperial College London introduced KV-Tracker, a training-free method that makes heavy models like π³ and Depth
https://x.com/IlirAliu_/status/2041062366025031787
Researchers just taught a robot to play tennis. From just clips of a few amateur players performing basic forehands, backhands, and shuffles… …a robot learned one of the fastest, most coordinated physical skills there is. Insane!
https://x.com/rowancheung/status/2040085788256506190
Robotics pre-training *from scratch* has been a heretical idea for the last two years. That “there’s no internet of robotics data” has led to two prevailing conclusions: 1) we need to use pretrained model backbones and 2) we need to scale robotics data. The first conclusion in
https://x.com/xiao_ted/status/2041547335935853025
Nvidia’s answer to Tesla’s data advantage in self-driving Ali Kani, who has been at NVIDIA Automotive for almost 8 years, explains ↓ Watch the full video to see a test of Nvidia’s driving system on real streets and explore how they plan to bring self-driving to every car:
https://x.com/TheTuringPost/status/2041089313388343530