Image created with gemini-3.1-flash-image-preview with claude-opus-4.7. Image prompt: Using the trail_landscape.jpg environment and trail_sign.jpg sign construction as references, render a photorealistic McDowell Preserve trail junction where the brown wooden post holds a weathered ranger sign reading ‘MULTIMODALITY’ in bold all-caps with entries like ‘← Vision Overlook 1.2’, ‘Audio Wash 0.8 →’, and ‘← Text Mesa 2.4’, with a small camera-eye-and-soundwave emblem replacing the WP3 medallion; a curve-billed thrasher perches singing on a volcanic boulder beside the post while an open field notebook with a pressed palo verde leaf rests against the base, saguaros and the hazy Scottsdale valley stretching behind under bright partly-cloudy Arizona sky.
You can now enable Claude to use your computer to complete tasks. It opens your apps, navigates your browser, fills in spreadsheets: anything you’d do sitting at your desk. Research preview in Claude Cowork and Claude Code, macOS only.
https://x.com/claudeai/status/2036195789601374705?s=20
Anthropic Orbit leaked: Orbit, a proactive assistant for Claude Cowork that auto-generates briefings and insights from Gmail, Slack, GitHub, Calendar, Drive, and Figma, no prompting required. Users can also deploy and pin “Orbit apps” for quick access. It’s Anthropic’s answer to
https://x.com/kimmonismus/status/2051618156385366305
New in Claude Managed Agents: dreaming, outcomes, and multiagent orchestration | Claude
https://claude.com/blog/new-in-claude-managed-agents
Gemini API File Search is now multimodal
https://blog.google/innovation-and-ai/technology/developers-tools/expanded-gemini-api-file-search-multimodal-rag/
Good news for AI builders: the File Search tool in the Gemini API is now multi-modal 🗃️, powered by our Gemini Embedding 2 model, + support for custom metadata & inline citations : ) File Search comes with storage and embedding generation at query time free of charge!
https://x.com/OfficialLoganK/status/2051728186824904743
The Gemini API’s File Search tool now supports multimodal retrieval. Use `gemini-embedding-2` as the embedding model to build a true multimodal RAG system for PDFs and images with a single call. How it works: 1. Create a store with `gemini-embedding-2` as the embedding model 2.
https://x.com/_philschmid/status/2052060912425546050
MolmoAct 2: An open foundation for robots that work in the real world | Ai2
https://allenai.org/blog/molmoact2
Real-world robots that actually work in messy kitchens, labs, or factories… fully open-source and ready to deploy today: A model that can do toy demos? Nope. AI2 just dropped MolmoAct 2: a bimanual action reasoning model that handles real chores (washing dishes, sorting items)
https://x.com/IlirAliu_/status/2051934034935128446
// OCR-Memory // Well this is a unique approach to store memory for long-horizon agents. Most of the agent memory systems compress trajectories into text summaries and hope the model remembers what matters. But that’s where the information loss hides. Long-horizon agents need
https://x.com/dair_ai/status/2049957482811056307
GLM-5V-Turbo Tech Report: Toward a Native Foundation Model for Multimodal Agents. This report summarizes the main improvements behind GLM-5V-Turbo across model design, multimodal training, reinforcement learning, toolchain expansion, and integration with agent frameworks. These
https://x.com/Zai_org/status/2052426777654387168
“Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation” TL;DR: combines LLM planning with vision-guided refinement to generate physically plausible and coherent 3D scenes from text
https://x.com/Almorgand/status/2051320217674870795
World’s first native color LiDAR sensor. I want one of these puppies.
https://x.com/bilawalsidhu/status/2051657980181934298
👀 DeepSeek briefly released (then deleted) a vision model tech report — what did it reveal? 👀 Zhihu contributor 刘聪NLP breaks it down: Core idea: 👉 A new multimodal reasoning framework that embeds spatial pointers (boxes & points) directly into the chain-of-thought • The
https://x.com/ZhihuFrontier/status/2050238000433659958
DeepSeek V4: almost on the frontier, a fraction of the price
https://simonwillison.net/2026/Apr/24/deepseek-v4/
A fully open source mocap system that works with cheap webcams: The FreeMoCap Project. A free-and-open-source, hardware-and-software-agnostic, minimal-cost, research-grade, motion capture system and platform for decentralized scientific research, education, and training:
https://x.com/IlirAliu_/status/2050484464220827774




