Ethan B. Holland

Over 56,100 manually organized AI links and counting

Multimodal: AI News Week Ending 05/08/2026

May 8, 2026

Image created with gemini-3.1-flash-image-preview with claude-opus-4.7. Image prompt: Using the trail_landscape.jpg environment and trail_sign.jpg sign construction as references, render a photorealistic McDowell Preserve trail junction where the brown wooden post holds a weathered ranger sign reading ‘MULTIMODALITY’ in bold all-caps with entries like ‘← Vision Overlook 1.2’, ‘Audio Wash 0.8 →’, and ‘← Text Mesa 2.4’, with a small camera-eye-and-soundwave emblem replacing the WP3 medallion; a curve-billed thrasher perches singing on a volcanic boulder beside the post while an open field notebook with a pressed palo verde leaf rests against the base, saguaros and the hazy Scottsdale valley stretching behind under bright partly-cloudy Arizona sky.

You can now enable Claude to use your computer to complete tasks. It opens your apps, navigates your browser, fills in spreadsheets–anything you’d do sitting at your desk. Research preview in Claude Cowork and Claude Code, macOS only.
https://x.com/claudeai/status/2036195789601374705?s=20

Anthropic Orbit leaked: Orbit, a proactive assistant for Claude Cowork that auto-generates briefings and insights from Gmail, Slack, GitHub, Calendar, Drive, and Figma, no prompting required. Users can also deploy and pin “Orbit apps” for quick access. It’s Anthropic’s answer to
https://x.com/kimmonismus/status/2051618156385366305

New in Claude Managed Agents: dreaming, outcomes, and multiagent orchestration | Claude
https://claude.com/blog/new-in-claude-managed-agents

Gemini API File Search is now multimodal
https://blog.google/innovation-and-ai/technology/developers-tools/expanded-gemini-api-file-search-multimodal-rag/

Good news for AI builders: the File Search tool in the Gemini API is now multi-modal 🗃️, powered by our Gemini Embedding 2 model, + support for custom metadata & inline citations : ) File Search comes with storage and embedding generation at query time free of charge!
https://x.com/OfficialLoganK/status/2051728186824904743

The Gemini API’s File Search tool now supports multimodal retrieval. Use `gemini-embedding-2` as the embedding model to build a true multimodal RAG system for PDFs and images with a single call. How it works: 1. Create a store with `gemini-embedding-2` as the embedding model 2.
https://x.com/_philschmid/status/2052060912425546050
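
The quoted steps are cut off after the first one, so here is only a rough sketch of the general idea behind multimodal retrieval: PDFs and images are embedded into one shared vector space, and a query is matched against all of them by similarity. This is not the Gemini API. `ToyStore`, the file names, and the hand-written three-dimensional vectors are all hypothetical stand-ins for what an embedding model such as `gemini-embedding-2` would produce.

```python
import math
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    modality: str          # "pdf" or "image"
    vector: list[float]    # hypothetical embedding, hand-written for illustration


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyStore:
    """In-memory stand-in for a file search store (not a real API)."""

    def __init__(self) -> None:
        self.entries: list[Entry] = []

    def add(self, name: str, modality: str, vector: list[float]) -> None:
        self.entries.append(Entry(name, modality, vector))

    def search(self, query_vector: list[float], top_k: int = 2) -> list[Entry]:
        # Rank every entry, regardless of modality, by similarity to the query.
        ranked = sorted(
            self.entries,
            key=lambda e: cosine(query_vector, e.vector),
            reverse=True,
        )
        return ranked[:top_k]


store = ToyStore()
store.add("trail_map.pdf", "pdf", [0.9, 0.1, 0.0])
store.add("trail_sign.jpg", "image", [0.8, 0.2, 0.1])
store.add("invoice.pdf", "pdf", [0.0, 0.1, 0.9])

# Pretend this vector is the embedding of a text query about hiking trails.
query = [1.0, 0.0, 0.0]
hits = store.search(query, top_k=2)
results = [hit.name for hit in hits]
modalities = {hit.modality for hit in hits}
print(results, modalities)
```

Because both file types live in the same vector space, a single query can surface a PDF and an image together, which is the property the post describes as multimodal RAG "with a single call".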

MolmoAct 2: An open foundation for robots that work in the real world | Ai2
https://allenai.org/blog/molmoact2

Real-world robots that actually work in messy kitchens, labs, or factories… fully open-source and ready to deploy today: A model that can do toy demos? Nope. AI2 just dropped MolmoAct 2: a bimanual action reasoning model that handles real chores (washing dishes, sorting items)
https://x.com/IlirAliu_/status/2051934034935128446

// OCR-Memory // Well this is a unique approach to store memory for long-horizon agents. Most of the agent memory systems compress trajectories into text summaries and hope the model remembers what matters. But that’s where the information loss hides. Long-horizon agents need
https://x.com/dair_ai/status/2049957482811056307

GLM-5V-Turbo Tech Report: Toward a Native Foundation Model for Multimodal Agents. This report summarizes the main improvements behind GLM-5V-Turbo across model design, multimodal training, reinforcement learning, toolchain expansion, and integration with agent frameworks. These
https://x.com/Zai_org/status/2052426777654387168

“Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation” TL;DR: combines LLM planning with vision-guided refinement to generate physically plausible and coherent 3D scenes from text
https://x.com/Almorgand/status/2051320217674870795

World’s first native color LiDAR sensor. I want one of these puppies.
https://x.com/bilawalsidhu/status/2051657980181934298

👀 DeepSeek briefly released (then deleted) a vision model tech report — what did it reveal? 👀 Zhihu contributor 刘聪NLP breaks it down: Core idea: 👉 A new multimodal reasoning framework that embeds spatial pointers (boxes & points) directly into the chain-of-thought • The
https://x.com/ZhihuFrontier/status/2050238000433659958

DeepSeek V4–almost on the frontier, a fraction of the price
https://simonwillison.net/2026/Apr/24/deepseek-v4/

A fully open source mocap system that works with cheap webcams: The FreeMoCap Project. A free-and-open-source, hardware-and-software-agnostic, minimal-cost, research-grade, motion capture system and platform for decentralized scientific research, education, and training:
https://x.com/IlirAliu_/status/2050484464220827774