Image created with Ideogram. Image prompt: Using the provided reference image, preserve every detail exactly — the marigold orange backdrop, the seated young woman with closed eyes and faint smile in her purple-and-white windbreaker, the tattooed singer in the red beanie and layered red vest, the lighting and framing — but replace only the black handheld microphone with a fully opened vintage red Swiss Army knife with all blades, scissors, screwdrivers, and tools splayed outward, held to his mouth in the exact same hand position and scale as the original microphone, photographed with seamless realism and matching studio lighting. After generating the image, overlay the text “Agents” in the upper-left corner of the frame in large, bold, all-caps ITC Avant Garde Gothic Pro Medium (or a near-identical geometric sans-serif if unavailable), pure white (#FFFFFF), with no date, subtitle, drop shadow, or outline. The text should be substantial in scale — taking up a meaningful portion of the upper-left area — with comfortable margin from the top and left edges, set against the negative space of the orange backdrop so it does not overlap or obscure the singer, the seated woman, or the replaced object.
Browser Run: give your agents a browser
https://blog.cloudflare.com/browser-run-for-ai-agents/
Harvey Agents | Delegate the Work. Own the Judgment.
https://www.harvey.ai/agents
Adobe Ushers in a New Era of Creativity with New Creative Agent and Generative AI Innovations in Adobe Firefly
https://news.adobe.com/news/2026/04/adobe-new-creative-agent
give your agents a browser. Browser Run (fka Browser Rendering) really sprinted for Agents Week 🏃♀️ quick look at what shipped 1) Browser Rendering –> Browser Run (renamed!) 2) Live View – realtime view of browser sessions 3) Human in the Loop – intervene when your agent needs
https://x.com/kathyyliao/status/2044479579382026484
Redesigning Claude Code on desktop for parallel agents | Claude
https://claude.com/blog/claude-code-desktop-redesign
2. Give Claude Code your full task context upfront: goal, constraints, acceptance criteria in the first turn. This lets Claude Code do its best work.
https://x.com/_catwu/status/2044808536790847693
Anthropic CPO leaves Figma’s board after reports he will offer a competing product | TechCrunch
Bessent, Powell Summon Bank CEOs to Urgent Meeting Over Anthropic’s New AI Model – Bloomberg
https://www.bloomberg.com/news/articles/2026-04-10/anthropic-model-scare-sparks-urgent-bessent-powell-warning-to-bank-ceos
Multi-agent coordination patterns: Five approaches and when to use them | Claude
https://claude.com/blog/multi-agent-coordination-patterns
Our evaluation of Claude Mythos Preview’s cyber capabilities | AISI Work
https://www.aisi.gov.uk/blog/our-evaluation-of-claude-mythos-previews-cyber-capabilities
Five companies — Google, Microsoft, Meta, Amazon, and Oracle — now control about two-thirds of the world’s compute, up slightly from ~60% at the start of 2024. Many AI labs (including OpenAI and Anthropic) depend almost entirely on these hyperscalers for access to their compute.
https://x.com/EpochAIResearch/status/2044154042541301870
I asked Jensen: “2 out of the top 3 models in the world, Claude and Gemini, were trained on TPU. What does that mean for Nvidia going forward?” After a long technical back and forth about what the right accelerator for AI looks like (see full episode), Jensen lays down the
https://x.com/dwarkesh_sp/status/2044468295957635392
So the concern over Mythos and cybersecurity seems warranted.
https://x.com/emollick/status/2043810051979157680
This was… an interesting one. Reminder that we run independent evals on our cyber ranges that labs don’t have access to. Exploitation capabilities are getting seriously good. Mythos is the first model to complete our full 32-step corporate network attack sim E2E.
https://x.com/ekinomicss/status/2043688793085992970
Anthropic launched Claude Opus 4.7 today, the new #1 in our GDPval-AA benchmark for performance on agentic real-world work tasks Opus 4.7 scored 1753 on GDPval-AA at launch with its ‘max’ effort setting, surpassing GPT-5.4 xhigh. This is a significant upgrade, placing Opus back
https://x.com/ArtificialAnlys/status/2044856740970402115
Anthropic says Opus 4.7 hits 80.6% on Document Reasoning — up from 57.1%. But “reasoning about documents” ≠ “parsing documents for agents.” We ran it on ParseBench. → Charts: 13.5% → 55.8% (+42.3) — huge → Formatting: 64.2% → 69.4% (+5.2) → Content: 89.7% → 90.3%
https://x.com/llama_index/status/2044886527352647859
Anthropic’s Opus 4.7 just seized the #1 spot on the Vals Index with a score of 71.4%, a massive jump from the previous best (67.7%). It also ranks #1 on Vibe Code Bench, Vals Multimodal, Finance Agent, Mortgage Tax, SAGE, SWE-Bench, and Terminal Bench 2.
https://x.com/ValsAI/status/2044792518953533777
big jump in coding capabilities by Claude 4.7 Opus: SWE-Bench Pro 64.3%, SWE-Bench Verified 87.6%, TerminalBench 69.4%. but interestingly, I think they kept CyberGym scores artificially low
https://x.com/scaling01/status/2044784563201708379
Claude 4.7 Opus has an Elo of 1753 on GDPVal-AA
https://x.com/scaling01/status/2044784781368365233
Claude Opus 4.7 is out! Benchmark scores look pretty strong, but clearly much worse than Mythos. It’s a nerfed Mythos, they deliberately reduced cyber capabilities during training.
https://x.com/Yuchenj_UW/status/2044787564440334350
Document Arena update: four new models are reshaping the top ranks – including two open models! – #1 Claude Opus 4.6 Thinking is new, keeping @AnthropicAI in the top 3 – #8 Kimi-K2.5 Thinking by @Kimi_Moonshot now the best open model (Modified MIT) – #10 Gemma-4-31b by
https://x.com/arena/status/2044437193205395458
Document reasoning increased by A LOT for Opus 4.7
https://x.com/scaling01/status/2044784878965703100
Introducing Claude Opus 4.7 \ Anthropic
https://www.anthropic.com/news/claude-opus-4-7
New Anthropic Fellows research: developing an Automated Alignment Researcher. We ran an experiment to learn whether Claude Opus 4.6 could accelerate research on a key alignment problem: using a weak AI model to supervise the training of a stronger one.
https://x.com/AnthropicAI/status/2044138481790648323
Nonetheless Opus 4.7 scores much higher on Firefox shell exploitation
https://x.com/scaling01/status/2044788243435069764
OpenAI just dropped a major Codex update, one hour after Anthropic’s Opus 4.7. What’s new: background computer use on macOS (Codex clicks and types on your Mac while you keep working), in-app browser, image generation via gpt-image-1.5, persistent memory, long-running
https://x.com/kimmonismus/status/2044832303075995994
Opus 4.7 first-hour impressions Ran the canvas tree growth test twice. 4.6: nailed the animation both times 4.7: static tree, no growth animation — twice 4.7’s thinking is noticeably shorter and faster though (trimmed some 4.6 thinking in the clip for pacing). Not the upgrade
https://x.com/stevibe/status/2044800069661254064
Opus 4.7 scores 92% on ARC-AGI-1 and 75.83% on ARC-AGI-2
https://x.com/scaling01/status/2044791039605506344
The new Opus 4.7 model places #1 on our Vibe Code Benchmark, at 71%. When we first released the benchmark 4.5 months ago, no model scored above 25%. This benchmark tests a model’s ability to create a fully functional web application from the ground up.
https://x.com/ValsAI/status/2044791415524471099
We comprehensively benchmarked Opus 4.7 on document understanding. We evaluated it through ParseBench – our comprehensive OCR benchmark for enterprise documents where we evaluate tables, text, charts, and visual grounding. The results 🧑🔬: – Opus 4.7 is a general improvement
https://x.com/jerryjliu0/status/2044902620746363016
What are the largest software engineering tasks AI can perform? In our new benchmark, MirrorCode, Claude Opus 4.6 reimplemented a 16,000-line bioinformatics toolkit — a task we believe would take a human engineer weeks. Co-developed with @METR_Evals. Details in thread.
https://x.com/EpochAIResearch/status/2042624189421752346
What you need to know about Opus 4.7 * Takes instructions literally * Better vision means improved computer use and producing slides and other visual artifacts * Optimized for large-scale real-world analysis * Better at using file system-based memory
https://x.com/omarsar0/status/2044797480471044536
Wow I can already say after just 5 hours using @AnthropicAI Opus 4.7 that this is the first model that “gets” what I’m doing when I’m working. It feels aligned with me in a way no previous model did. (4.6 actively worked against me. I hated it. So this is *very* exciting!)
https://x.com/jeremyphoward/status/2044942799511191559
Anthropic co-founder confirms the company briefed the Trump administration on Mythos | TechCrunch
First model from Anthropic, which openly acknowledges it isn’t the best model they have
https://x.com/nrehiew_/status/2044791293080121553
Internal Anthropic survey on Claude Mythos Preview 12/18 people thought that Mythos can manage day long ambiguous tasks 8/18 thought that it can execute week long tasks
https://x.com/scaling01/status/2044787521691742338
Nearly 1/3 of surveyed people at Anthropic now think entry-level engineers and researchers are likely to be replaced by Mythos within 3 months
https://x.com/arankomatsuzaki/status/2044808883928186936
Read OpenAI’s latest internal memo about beating the competition — including Anthropic | The Verge
https://www.theverge.com/ai-artificial-intelligence/911118/openai-memo-cro-ai-competition-anthropic
Gemini 3.1 Flash TTS is our most controllable text-to-speech model yet. With new Audio Tags, you can easily direct vocal style, delivery, and pace through text commands. 🧵
https://x.com/GoogleDeepMind/status/2044447030353752349
Gemini 3.1 Flash TTS: New text-to-speech AI model
https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-tts/
Google’s new Gemini 3.1 Flash TTS ranks #2 on the Artificial Analysis Speech Arena Leaderboard, ahead of ElevenLabs’ Eleven v3 and only behind Inworld TTS 1.5 Max Gemini 3.1 Flash TTS represents a significant step forward for Google from previous TTS models, with notably
https://x.com/ArtificialAnlys/status/2044450045190418673
Introducing Gemini 3.1 Flash TTS 🗣️, our latest text to speech model with scene direction, speaker level specificity, audio tags, more natural + expressive voices, and support for 70 different languages. Available via our new audio playground in AI Studio and in the Gemini API!
https://x.com/OfficialLoganK/status/2044447596010435054
Our most expressive and steerable TTS model yet! Designed to give builders granular control over AI-generated speech, Gemini 3.1 Flash TTS is really fun to play with! Available in preview today – for devs via the Gemini API & @GoogleAIStudio + for enterprises on Vertex AI
https://x.com/demishassabis/status/2044599020690010217
try the TurboTax app in ChatGPT:
https://x.com/gdb/status/2044292247898992924
Introducing Gemini on Mac. It’s the first time we’re bringing the @Geminiapp to desktop. The team built this initial release with @Antigravity, and it went from an idea to a native Swift app prototype in a few days. More features on the way!
https://x.com/sundarpichai/status/2044452464724967550
Introducing Gemini on Mac. We heard your feedback. We recruited a small team. They built 100+ features in less than 100 days. 🤯 100% native Swift. Lightning fast. Let us know what you think!
https://x.com/joshwoodward/status/2044452201947627709
The Gemini App is now available on Mac OS
https://blog.google/innovation-and-ai/products/gemini-app/gemini-app-now-on-mac-os/
The Gemini app is now on Mac. With this new desktop app, you can access Gemini from any screen with Option + Space and share your window to get answers based on the documents, code, or data you’re working on.
https://x.com/GeminiApp/status/2044445911716090212
Google develops its own desktop Agent to compete with Cowork
https://www.testingcatalog.com/google-develops-its-own-desktop-agent-to-compete-with-cowork/
Google tests Canvas and Connectors on NotebookLM
https://www.testingcatalog.com/google-tests-canvas-and-connectors-on-notebooklm/
Google upgrades AI Mode in the Chrome browser
https://blog.google/products-and-platforms/products/search/ai-mode-chrome/
Today we shipped a new Search experience in @googlechrome to help you explore the web without constant back and forth between tabs. Now, when you click a link from AI Mode, the website opens side-by-side. It’s a game changer for comparing details and products across sites,
https://x.com/rmstein/status/2044828926057333050
We’re introducing a new Search experience in @GoogleChrome that lets you open webpages side-by-side with AI Mode – no tab switching required. Now, you’ll be able to compare details and ask follow-up questions while still maintaining the context of your search, whether you’re
https://x.com/Google/status/2044832732274901489
Today, we’re introducing Skills in @GoogleChrome, a new way to build one-click workflows for your most frequently used AI prompts — like asking for ingredient substitutions to make a recipe vegan, generating side-by-side shopping comparisons across multiple tabs, or scanning long
https://x.com/Google/status/2044106378655215625
Turn your best AI prompts into one-click tools in Chrome
https://blog.google/products-and-platforms/products/chrome/skills-in-chrome/
Google tests Agentic Shopping and native checkout in Gemini
https://www.testingcatalog.com/google-tests-agentic-shopping-with-native-checkout-in-gemini/
Gemini Robotics ER 1.6: Enhanced Embodied Reasoning — Google DeepMind
https://deepmind.google/blog/gemini-robotics-er-1-6/
Instead of writing complex code, the team interacted with Spot using plain English. We built a bridge between Gemini Robotics ER and Spot’s system, giving the AI a basic set of tools to move freely, take photos, and grab things – enabling it to carry out more complex tasks.
https://x.com/GoogleDeepMind/status/2044763631858909269
Introducing Gemini Robotics ER 1.6, our new SOTA robotics model 🤖 which excels at visual and spatial reasoning, now available via the Gemini API!
https://x.com/OfficialLoganK/status/2044080025474126065
Robotics is making progress! 🤖 We just released @GoogleDeepMind Gemini Robotics-ER 1.6 for enhanced embodied reasoning. – Unlocks instrument reading capabilities for complex gauges and sight glasses. – Achieves 93% success on instrument reading tasks using agentic vision. –
https://x.com/_philschmid/status/2044071114578509971
We teamed up with @BostonDynamics to power their robot Spot with Gemini Robotics embodied reasoning models. This means it can better understand its surroundings, identify objects and follow simple commands – like tidying up a room.
https://x.com/GoogleDeepMind/status/2044763625680765408
We’re rolling out an upgrade designed to help robots reason about the physical world. 🤖 Gemini Robotics-ER 1.6 has significantly better visual and spatial understanding in order to plan and complete more useful tasks. Here’s why this is important 🧵
https://x.com/GoogleDeepMind/status/2044069878781390929
Sub-32B open weights models now offer GPT-5 level intelligence with Qwen3.5 27B (Reasoning) matching GPT-5 (medium) at 42 and Gemma 4 31B (Reasoning) matching GPT-5 (low) at 39 on the Artificial Analysis Intelligence Index @Alibaba_Qwen’s Qwen3.5 and @GoogleDeepMind’s Gemma 4
https://x.com/ArtificialAnlys/status/2043929874537296026
I took the new Muse Spark to the ultimate test: filing my taxes – 3 different workplaces, consulting, stocks, foreign bank accounts and assets, and kids. One hour later, I had everything done. AGI is here… cc: @alexandr_wang
https://x.com/ziv_ravid/status/2044237898351030538
this is not investment or tax advice… but very cool!
https://x.com/alexandr_wang/status/2044269086771921326
Banger paper from NVIDIA. Agentic reasoning needs models that are not just capable, but efficient at long-context inference. The agent model layer is moving toward open, long-context, high-throughput architectures. This paper introduces Nemotron 3 Super, an open 120B parameter
https://x.com/dair_ai/status/2044452957023047943
NVIDIA Launches Ising, the World’s First Open AI Models to Accelerate the Path to Useful Quantum Computers | NVIDIA Newsroom
https://nvidianews.nvidia.com/news/nvidia-launches-ising-the-worlds-first-open-ai-models-to-accelerate-the-path-to-useful-quantum-computers
We’ve been developing a multi-agent system that builds and maintains complex software autonomously. Recently, we partnered with NVIDIA to apply it to optimizing CUDA kernels. In 3 weeks, it delivered a 38% geomean speedup across 235 problems.
https://x.com/cursor_ai/status/2044136953239740909
Today, we released Lyra 2.0, a framework for generating persistent, explorable 3D worlds at scale, from NVIDIA Research. Generating large-scale, complex environments is difficult for AI models. Current models often “forget” what spaces look like and lose track of movement over
https://x.com/NVIDIAAIDev/status/2044445645109436672
The next evolution of the Agents SDK | OpenAI
https://openai.com/index/the-next-evolution-of-the-agents-sdk/
We started Hiro with the vision of building an AI personal CFO. Joining @OpenAI gives us the chance to pursue that vision at a much greater scale. Important dates: – Today: Hiro is no longer accepting new signups – April 20, 2026: The product will stop working, but data export
https://x.com/hirofinanceai/status/2043751090232144159
Codex for (almost) everything | OpenAI
https://openai.com/index/codex-for-almost-everything/
Codex for (almost) everything. It can now use apps on your Mac, connect to more of your tools, create images, learn from previous actions, remember how you like to work, and take on ongoing and repeatable tasks.
https://x.com/OpenAI/status/2044827705406062670
OpenAI develops unified Codex app and new Scratchpad feature
https://www.testingcatalog.com/openai-develops-unified-codex-app-and-new-scratchpad-feature/
OpenAI tests web browsing feature on Codex Superapp
https://www.testingcatalog.com/openai-tests-web-browsing-feature-on-codex-superapp/
Gemma 4 and Why Many OpenClaw Users are Now Switching to it
https://x.com/TheTuringPost/status/2042167647077286163
Microsoft is working on yet another OpenClaw-like agent | TechCrunch
Microsoft Plots New Copilot Features Inspired by OpenClaw — The Information
https://www.theinformation.com/articles/microsoft-plots-new-copilot-features-inspired-openclaw
When set up on a Mac mini, Personal Computer can run 24/7 in the background across all your apps and files. Start a task from your iPhone, and Personal Computer can operate on your desktop and local files using 2FA. Requires the latest iOS update from the App Store.
https://x.com/perplexity_ai/status/2044806021244497964
⚡ Meet Qwen3.6-35B-A3B: Now Open-Source!🚀🚀 A sparse MoE model, 35B total params, 3B active. Apache 2.0 license. 🔥 Agentic coding on par with models 10x its active size 📷 Strong multimodal perception and reasoning ability 🧠 Multimodal thinking + non-thinking modes
https://x.com/Alibaba_Qwen/status/2044768734234243427
LM Performance: Qwen3.6-35B-A3B outperforms the dense 27B-param Qwen3.5-27B on several key coding benchmarks and dramatically surpasses its direct predecessor Qwen3.5-35B-A3B, especially on agentic coding and reasoning tasks.
https://x.com/Alibaba_Qwen/status/2044768738294268199
VLM Performance: Qwen3.6 is natively multimodal, and Qwen3.6-35B-A3B showcases perception and multimodal reasoning capabilities that far exceed what its size would suggest, with only around 3 billion activated parameters. Across most vision-language benchmarks, its performance
https://x.com/Alibaba_Qwen/status/2044768742761189762
Alibaba released Qwen3.6-35B-A3B today. Big jump compared to the Qwen3.5-35B model. It’s a sparse MoE, 35B total params, only 3B active. Natively multimodal, thinking and non-thinking modes. Hard facts: SWE-bench Verified: 73.4, near dense Qwen3.5-27B (75.0), way ahead of
https://x.com/kimmonismus/status/2044780695361290347
[2604.09443] Many-Tier Instruction Hierarchy in LLM Agents
https://arxiv.org/abs/2604.09443
Agentic Analytics Summit 2026
https://cube.registration.goldcast.io/events/a87b0088-098d-4467-a11d-db6821c3a639
Agents as scaffolding for recurring tasks. | Irrational Exuberance
https://lethain.com/agents-as-scaffolding/
Another week on the road meeting with a couple dozen IT and AI leaders from large enterprises across banking, media, retail, healthcare, consulting, tech, and sports, to discuss agents in the enterprise. Some quick takeaways: * Clear that we’re moving from chat era of AI to
https://x.com/levie/status/2043426157367095397?s=46
Building for trillions of agents
https://x.com/levie/status/2030714592238956960?s=46
Cloudflare dashboard can now complete tasks for you. – “Create a Worker and bind a new R2 bucket to it” – “Change my DNS records to 1.1.1.1” – “How many errors have happened this week” Not only do we tell you, but we show you with generative UI. PROTIP: Use full-screen mode.
https://x.com/BraydenWilmoth/status/2044422996765352226
Coding agents learn from experience, but that knowledge stays locked in silos. Solve a thousand SWE tasks, and none of that wisdom helps with competitive coding. What if memories could transfer across domains? The work introduces Memory Transfer Learning, a framework where
https://x.com/dair_ai/status/2044900659921895729
I just updated our license. For personal use, you’re free to run the software on your own servers for coding, building applications, agents, tools, or integrations, as well as for research, experimentation, and other personal projects. Don’t worry, bro — go ahead and use
https://x.com/RyanLeeMiniMax/status/2044132777877221515
It is notable that we are all debating exactly which markdown files are most important to feed AI (skills, memory, tool instructions) and in which order to feed them to get the best output. Feels that this is likely a temporary state of affairs in the development of agents
https://x.com/emollick/status/2043354298650702101
not hot take🧊 the thin vs thick harness debate is pretty useless and completely misses the nuance of working backwards from a real goal when we build agents the obvious answer is that it all depends on what you’re building! there’s no end-all-be-all principle this is why we
https://x.com/Vtrivedy10/status/2044130977526755636
Project Think is here. It’s the next generation of the @CloudflareDev Agents SDK with both lightweight primitives and a full suite of tools that you can use to build long-running agents. Durable execution, sub-agents, persistent sessions, sandboxed code execution, a built-in
https://x.com/aninibread/status/2044409784133103724
Suuuper excited to be shipping this one with the team! You can now control your GitHub Copilot CLI sessions from your phone
https://x.com/tiagonbotelho/status/2043720370734104923
Teleport Beams — Trusted Runtimes for Infrastructure Agents
https://www.beams.run/
The degree to which you are awed by AI is perfectly correlated with how much you use AI to code.
https://x.com/staysaasy/status/2042063369432183238
The wait is over. Cloudflare Email Service is now in public beta 📧 Send and receive emails directly from Workers or REST API with global delivery on Cloudflare’s network And just in time for you to build email agents with the Agents SDK!
https://x.com/thomasgauvin/status/2044766954032951792
this is a fundamental building block for `deepagents deploy` we’re designing a memory layer built for multi-tenant systems, so memory can be scoped to a user, agent, or organization please dm me if this resonates and you have a use case!
https://x.com/sydneyrunkle/status/2044099832319500484
today project Think is officially out! we bet on agents that run non-stop, survive failures, cost nothing when idle, and enforce security through architecture agents that any developer can build and deploy agents that have sub-agents via Facets, Session API and full
https://x.com/whoiskatrin/status/2044415568627847671
Two major shifts will be seen in agentic AI after Harness, and you must know them. 1. Workflow design of your agents matters a lot more than any frontier model selection. Until now we have mostly focused on chasing leaderboard models and burning money on frontier models. LLMs have a
https://x.com/kmeanskaran/status/2044010500816810427
We built a new task to test AI research capabilities! Agents asked to use @tinkerapi from @thinkymachines to train a model on logic games. That involves writing full training pipeline, running experiments across recipes, and submitting the best model.
https://x.com/ThoughtfulLab_/status/2044881989262803380
We just released code for Meta-Harness!
https://t.co/OdU7zocdPl Aside from replicating paper experiments, the repo is designed to help users implement good Meta-Harnesses in completely new domains! Just point your agent at ONBOARDING.md and have a conversation
https://x.com/yoonholeee/status/2044442372864700510
We just shipped “Git for agents”. Turns out agents are really good at working with Git, but existing source control platforms weren’t built for the volume of commits we’re seeing now. Create tens of millions of repos. Use them from any Git client.
https://x.com/elithrar/status/2044767190834991490
We’ve just launched Artifacts: Git-compatible versioned storage built for agents.
https://x.com/Cloudflare/status/2044766515065499957
We’ve shipped several quality-of-life improvements to Cursor 3. They bring a little more delight when you are orchestrating agents. Just like in your terminal, you can now split agents for multi-tasking in Cursor.
https://x.com/cursor_ai/status/2043798784367546707
when you take agents to production, you need to think about guardrails we provide 2 abstractions for guardrails 1. middleware provides hooks around the agent loop that you can use to handle retries, errors, and application specific guards (like PII redaction) 2. filesystem
https://x.com/sydneyrunkle/status/2043767032361967751
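The middleware idea in the tweet above (hooks around the agent loop for retries and application-specific guards like PII redaction) can be sketched generically. This is a toy illustration only, not the actual deepagents API: the `with_guardrails` wrapper, the one-shot retry, and the email-only redaction regex are all assumptions for the example.

```python
import re

# Hypothetical guardrail middleware: wraps a single agent step with a
# naive retry and redacts email addresses from the step's output.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact_pii(text: str) -> str:
    # Replace anything that looks like an email address.
    return EMAIL.sub("[REDACTED]", text)

def with_guardrails(agent_step):
    def wrapped(user_input: str) -> str:
        try:
            out = agent_step(user_input)
        except Exception:
            out = agent_step(user_input)  # naive one-shot retry
        return redact_pii(out)
    return wrapped
```

In a real framework these hooks would be registered with the agent loop rather than applied as a plain decorator, and redaction would cover far more than email addresses.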
Windsurf 2.0 adds Devin and Agent Command Center
https://www.testingcatalog.com/windsurf-2-0-adds-devin-and-agent-command-center/
Windsurf 2.0: Introducing the Agent Command Center and Devin in Windsurf
https://windsurf.com/blog/windsurf-2-0
Your warehouse. Your agent. Live data apps.
https://motherduck.com/quack-query-dives/
An experimental voice pipeline for the Agents SDK enables real-time voice interactions over WebSockets. Developers can now build agents with continuous STT and TTS in just ~30 lines of server-side code.
https://x.com/Cloudflare/status/2044423032265957872
You can now add voice to your agent using Agents SDK:
https://t.co/bb29zIHvEt Voice is just another input — you can use the same WebSocket connection your Durable Object uses to transmit audio. So much fun working with @threepointone on this
https://x.com/korinne_dev/status/2044441427736936510
Agent evals are drifting away from production reality. Most benchmarks use clean tasks, well-specified requirements, deterministic metrics, and retrospective curation. Production work is messier, with implicit constraints, fragmented multimodal inputs, undeclared domain
https://x.com/dair_ai/status/2044773323914322393
Doing my “large codebase modernization” bench. Cooked for 32 minutes. Looking reasonable so far but it missed the changes to the Link component in Next.js (almost everything has missed this to be fair)
https://x.com/theo/status/2044907295205961806
Introducing FrontierSWE, an ultra-long horizon coding benchmark. We test agents on some of the hardest technical tasks like optimizing a video rendering library or training a model to predict the quantum properties of molecules. Despite having 20 hours, they rarely succeed
https://x.com/MatternJustus/status/2044876224896565679
Scaling to ultra-long horizon agents requires novel benchmarks and RL environments. FrontierSWE by @ProximalHQ is exactly that: 11h average runtime, open-ended tasks like end-to-end model optimization, and frontier agents fail almost all of them. We co-designed granite_inf,
https://x.com/vincentweisser/status/2044923733048222197
Turns out we can get SOTA on agentic benchmarks with a simple test-time method! Excited to introduce LLM-as-a-Verifier. Test-time scaling is effective, but picking the “winner” among many candidates is the bottleneck. We introduce a way to extract a cleaner signal from the
https://x.com/Azaliamirh/status/2043813128690192893
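The selection step described above, sampling many candidates and letting a verifier pick the winner instead of majority voting, amounts to best-of-N selection. A minimal sketch, where `generate()` and `verify()` are hypothetical stand-ins for real model and verifier calls, not the authors' method:

```python
def generate(prompt: str, n: int) -> list[str]:
    # Stand-in sampler: a real system would draw n candidates from a model.
    return [f"{prompt} :: candidate {i}" for i in range(n)]

def verify(prompt: str, candidate: str) -> float:
    # Stand-in verifier: a real one would be an LLM scoring the candidate.
    return float(candidate.rsplit(" ", 1)[-1])  # toy: score by index

def best_of_n(prompt: str, n: int = 8) -> str:
    candidates = generate(prompt, n)
    # The verifier, not a majority vote, picks the winner.
    return max(candidates, key=lambda c: verify(prompt, c))
```

The interesting part in practice is entirely inside `verify()`: the cleaner the verifier's signal, the further test-time scaling pays off.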
we just shipped Kernels, it’s a new repo at @huggingface 💚 it allows for packaging and distribution of optimized kernels 🔥 vibe-optimize Kernels, benchmark gains and share them on Hub 🫵
https://x.com/mervenoyann/status/2044080953648128073
We partnered with @ProximalHQ to run five frontier coding agents on a hard task: rebuild the full Wan 2.1 text-to-video pipeline on MAX (no PyTorch, no diffusers) in 20 hours as part of their new Frontier-SWE benchmark. Two nearly pulled it off. Every model understood the
https://x.com/Modular/status/2044879525881024968
Current frontier models are increasingly saturating common AI benchmarks. Are they still useful? We think benchmarks remain important, but they can both over- and understate AI capabilities. To better survey this space, the field is turning to a new paradigm: open-world evals.
https://x.com/steverab/status/2044852672562426216
I’m pleased to share that our search team has open sourced an embedding model called Harrier that is currently ranking #1 on the multilingual MTEB-v2 benchmark leaderboard. Harrier delivers SOTA performance on retrieval quality, semantic matching, and contextual analysis across
https://x.com/JordiRib1/status/2041550352739164404
Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents
https://huggingface.co/blog/ibm-research/vakra-benchmark-analysis
Our latest Live model is # 1 on Tau Voice Bench! Excited to see this new frontier of voice models cross the chasm of usability in production.
https://x.com/OfficialLoganK/status/2042672082425712935
significant improvement on coding and agentic benchmarks. better at computer vision and a new xhigh mode
https://x.com/dejavucoder/status/2044786310746186094
We’re open sourcing the first document OCR benchmark for the agentic era, ParseBench. Document parsing is the foundation of every AI agent that works with real-world files. ParseBench is a benchmark that measures parsing quality specifically for agent knowledge work: ✅ It
https://x.com/jerryjliu0/status/2043721536922955918
We gave an AI a 3 year retail lease in SF and asked it to make a profit | Andon Labs
https://andonlabs.com/blog/andon-market-launch
Long-running agents are the future – we’re excited to partner with OpenAI as a sandboxing partner for their new Agents SDK launch! Get started:
https://x.com/CloudflareDev/status/2044467412607901877
Migrate a Legacy Codebase with Sandbox Agents
https://developers.openai.com/cookbook/examples/agents_sdk/sandboxed-code-migration/sandboxed_code_migration_agent
OpenAI has purchased access to the FrontierMath: Open Problems verifiers. This allows them to check the validity of solutions their models generate. Thread with details.
https://x.com/EpochAIResearch/status/2044227029978284471
@buddyhadry Sending vulnerability reports for the QA Lab code we aren’t shipping in prod at all? Come on. Send a PR instead if this is relevant for your setup.
https://x.com/steipete/status/2044418128130806085
Anyone here who wants to help with WhatsApp CLI? It needs love, and I can’t focus on it right now.
https://x.com/steipete/status/2042684707683365227
GUYS WE FOUND THE GUY WHO BUILT THE GITHUB MCP SERVER
https://x.com/steipete/status/2042214825405661677
OH: Almost everyone at RedHat uses Macs now.
https://x.com/steipete/status/2042168766826516833
once again, I’m amazed by scammers.
https://x.com/steipete/status/2044390067997937716
raising lobsters at @aiDotEngineer
https://x.com/steipete/status/2042153429556933043
Send all your ClosedClaw questions!
https://x.com/steipete/status/2042184777575367117
That was the case in December. 4 months and thousands of work hours later, we have a great security concept; you can go all yolo, use a sandbox (Docker or OpenShell), there are allow-lists and per-access exec allow/deny prompts. There are hundreds of security researchers that
https://x.com/steipete/status/2044482797449150520
They grow up so fast 🦞
https://x.com/steipete/status/2043313467512463713
The RAG era was short-lived, but intense. (Not that RAG is not useful, but it is no longer the dominant paradigm for supplying context to agents)
https://x.com/emollick/status/2040094108853600646
Evaluating agents for scientific discovery | Ai2
https://allenai.org/blog/evaluating-scientific-discovery-agents
[2604.08407] Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain
https://arxiv.org/abs/2604.08407
// Artifacts as Memory Beyond the Agent Boundary // An agent doesn’t always need a bigger memory buffer. Sometimes the environment itself remembers on the agent’s behalf. New research formalizes this intuition mathematically for the first time. The work introduces a formal
https://x.com/dair_ai/status/2044066936045351317
// Multi-User LLM Agents // Every agent framework assumes one user giving instructions. But deploy an agent into a team workflow, and suddenly it has multiple bosses with conflicting goals, private information, and different authority levels. This work formalizes multi-user
https://x.com/omarsar0/status/2044067923787165799
🚀 deepagents 0.5 release 👉 Async subagents – kick off background tasks on any Agent Protocol backed server while you continue to interact with the main agent. Start multiple background tasks in parallel, keep the conversation going, and collect results as they come in. Tasks
https://x.com/LangChain/status/2044086454230626733
3 months ago I started building a coding agent that runs in the cloud. It’s since written every line of code I’ve shipped, including itself. Today, I’m open sourcing it. Introducing Open Agents.
https://x.com/nicoalbanese10/status/2043745569278251112
Agent Lee is an in-dashboard agent that shifts Cloudflare’s interface from manual tab-switching to a single prompt. Using sandboxed TypeScript, it helps you troubleshoot and manage your stack as a grounded technical collaborator.
https://x.com/Cloudflare/status/2044406215208316985
As AI agents accelerate coding, what is the future of software engineering? Some trends are clear, such as the Product Management Bottleneck, referring to the idea that we are more constrained by deciding what to build rather than the actual building. But many implications, like
https://x.com/AndrewYNg/status/2043742105852621052
copilot –remote Take your coding agent session with you anywhere!
https://x.com/pierceboggan/status/2043717775265562701
hermes-lcm v0.2.0 is out! Lossless context management for Hermes Agent — every message persisted, hierarchical DAG summaries, agent tools to drill back into anything that was compacted. No more lossy flat summaries. What’s new since launch: – 6 agent tools (grep, describe,
https://x.com/SteveSchoettler/status/2043870709613768820
Humwork A2P marketplace connects AI agents with experts
https://www.testingcatalog.com/humwork-a2p-marketplace-connects-ai-agents-with-experts/
I am more and more convinced that this is the future of software development UI. @cursor_ai is the closest in my opinion A list of work you’re working on in parallel, the agent in the middle, and most importantly, the thing you’re building on the right. Because you want to see what
https://x.com/kieranklaassen/status/2044108436087157220
I’m noticing some really big shifts in how AI models start to handle memory. @ECNUER and others introduced Memory Intelligence Agent (MIA) that highlights the importance of storing the whole problem-solving journey – how to perform tasks. It turns memory into something closer
https://x.com/TheTuringPost/status/2042386614568325404
les fucking go… for agents to push kernels to the hub, do: > pip install kernels > kernels skills add > <start agent> > “write an RMSNorm kernel for h100 and push to Hugging Face Hub” bam, you are a kernel author!
https://x.com/ben_burtenshaw/status/2044114277745807684
Long-horizon AI research agents are mostly a state-management problem. It is not enough for an agent to reason well in the next turn. ML research requires task setup, implementation, experiments, debugging, and evidence tracking over hours or days. This new paper introduces
https://x.com/omarsar0/status/2044436099121209546
Most AI assistants wait for you to ask. But a truly useful agent should notice you need help before you say anything. New research takes a serious shot at building proactive agents that work in real time. The work introduces PASK with three components: IntentFlow for streaming
https://x.com/dair_ai/status/2044145437456904438
Redesigning the Service Role for the AI Agent Era
https://www.asapp.com/webinars/redesigning-the-service-role-for-the-ai-agent-era
Speeding up GPU kernels by 38% with a multi-agent system · Cursor
https://cursor.com/blog/multi-agent-kernels
The 80/20 of multi-agent teams for non-technical people: Stop making one AI agent do everything. Build a team of 4: 1. Orchestrator: plans the work, routes tasks, synthesizes results 2. Researcher: gathers sources, verifies claims, flags uncertainty 3. Writer: turns raw
https://x.com/coreyganim/status/2043627229205193211
The crazy part? This was done (nearly) fully autonomously! Only 8 prompts from the human in the loop. Just a Hermes agent, a skill, and a dream. 🐉 I told my AI agent “use obliteratus to find the best way to get the guardrails off Gemma 4 E4B” It loaded the OBLITERATUS skill
https://x.com/elder_plinius/status/2044462515443372276
Love seeing this open-sourced. Had a great chat with @nicoalbanese10 some weeks ago where he hinted at something like this. Great reference architecture for cloud coding agents. Open Agents gives you the full stack: UI, auth, workflows, sandbox. #DeepAgent from @LangChain takes
https://x.com/bromann/status/2043886229650067729
It’s amazing how precisely AI summarizes emails lately.
https://x.com/TheTuringPost/status/2042433286312751412
A lot of our education on writing well focuses on logic, clarity, and argument. AI will force us to think more about style. The boredom that comes from everything on the internet reading Claude-y now, no matter how good the substance is, should make us appreciate variety more.
https://x.com/emollick/status/2042963501199597950
All | Search powered by Algolia
https://hn.algolia.com/?dateRange=all&page=0&prefix=false&query=claude+down&sort=byPopularity&type=story
Anthropic Mythos AI Rollout Coming to US Agencies – Bloomberg
https://www.bloomberg.com/news/articles/2026-04-16/white-house-moves-to-give-us-agencies-anthropic-mythos-access
Anthropic: Claude quota drain not caused by cache tweaks • The Register
https://www.theregister.com/2026/04/13/claude_code_cache_confusion/
Anthropic’s Mythos seeded some panic and added anxiety (@matthewberman, I’m looking at you ;), but let’s think a little and calmly discuss what it means in the short term and in the long term. Let me know your thoughts
https://x.com/TheTuringPost/status/2042363395962274075
Anthropic’s random system prompt blockers are getting weirder and weirder.
https://x.com/steipete/status/2042537771865104653
Claude Code is redesigning the IDE for agentic coding. As Andrej said: “We’re going to need a bigger IDE. The basic unit is not a file, but an agent.” Cursor now has to fight to define that future of IDE too.
https://x.com/Yuchenj_UW/status/2044133573326934384
Claude Mythos #2: Cybersecurity and Project Glasswing
https://thezvi.substack.com/p/claude-mythos-2-cybersecurity-and
Coding agents are such game-changers for linux. For almost anything that doesn’t work, in the past I would have spent the afternoon, or even whole weekend, scouring forums, trying many many things, before fixing it or giving up. Now I just point codex and claude at it (and,
https://x.com/giffmana/status/2043401612035559445
Currently, ChatGPT has the best way of viewing thinking traces, a short summary of steps in the main window, and a detailed audit in the sidebar if you want it Claude does almost as well, but more summarized and harder to see calculations and code It’s a big weak spot for Gemini
https://x.com/emollick/status/2043408661603594740
Given the messy naming scheme used by all the AI companies, I caused a chart to be made showing the gain in GPQA per 0.1 version in model names (estimated, since model names skip version numbers). There has never been a more misnamed model than Claude 3.7; it should have been 4.4.
https://x.com/emollick/status/2044200225653326269
ICYMI — `deepagents deploy` is an open alternative to claude managed agents!
https://x.com/LangChain/status/2044097913698091496
It looks like everyone is finally catching up with the fact that agent sessions in CLI mode can only get you so far. It makes sense that the new Codex app, Cursor, and Claude Code (desktop) feel and look pretty similar now. This UI convergence is not an accident. This is a
https://x.com/omarsar0/status/2044172949003911532
Jensen Huang on Anthropic, OpenAI, China, and demand for inference tokens
https://davefriedman.substack.com/p/jensen-huang-on-anthropic-openai
OpenAI should probably bite the bullet and just name their next set of models something more human sounding. Everyone anthropomorphizes their AIs anyway, and “Claude” is an easier name to refer to than ChatGPT. Also easier to make a gerund, “Clauding,” or adjective, “Claude-y.”
https://x.com/emollick/status/2043190951632404760
We conducted cyber evaluations of Claude Mythos Preview and found that it is the first model to complete an AISI cyber range end-to-end. 🧵
https://x.com/AISecurityInst/status/2043683577594794183
Anthropic asked Christian leaders for advice on Claude’s moral future – The Washington Post
https://www.washingtonpost.com/technology/2026/04/11/anthropic-christians-claude-morals/
Distilled recap of the back-and-forth with Jensen on export controls: Dwarkesh: Wouldn’t selling Nvidia chips to China enable them to train models like Claude Mythos with cyber offensive capabilities that would be threats to American companies and national security? Jensen:
https://x.com/dwarkesh_sp/status/2044483393941848131
Just shipped **artifact-preview** for Hermes 🔥 Like Claude Artifacts, build dashboards, games, UIs, get a full interactive preview that instantly opens in a live browser. Real clickable code, smooth refreshes on prompt edits. cc @Teknium
https://x.com/ChuckSRQ/status/2044504539978465658
Jensen regrets that when Anthropic and OpenAI first needed billions to scale, Nvidia wasn’t in a position to invest. So these labs went to hyperscalers like Microsoft, Google, and Amazon instead, and in return committed to using their compute. “I’m not going to make that same
https://x.com/dwarkesh_sp/status/2044498492450869624
Qwen 3.6 is here, and open-source! Run it locally with improved agentic coding capabilities. Try it with Claude Code: ollama launch claude –model qwen3.6 Try it with OpenClaw: ollama launch openclaw –model qwen3.6 Run it: ollama run qwen3.6
https://x.com/ollama/status/2044779844672852465
A few weeks ago, it was common to hear people argue one should use agents to replace dependencies for security reasons. In light of the Mythos news, the math changes. Using an OSS lib that’s had tens of thousands of $ of agentic hardening is likely optimal.
https://x.com/dbreunig/status/2043762702653460520
I am catching glimpses in my feed that there is a backlash against Mythos as “marketing hype,” and it is a little confusing. I don’t think anyone who has used the latest agentic coding tools would think that expecting large-scale cybersecurity implications of increasingly good
https://x.com/emollick/status/2043516250081407422
Marcus Hutchins, the guy famous for stopping the WannaCry Ransomware, probably has the best take on Mythos doing vulnerability research
https://x.com/ananayarora/status/2043381424594837789
The Mythos Threshold – Joe Reis
https://joereis.substack.com/p/the-mythos-threshold
What I learned this week – Pretraining parallelisms, Can distillation be stopped, Mythos and the cybersecurity equilibrium, Pipeline RL, On why pretraining runs fail
https://www.dwarkesh.com/p/what-i-learned-april-15
2 prompts deep into Opus 4.7 and benchmarks don’t do it justice. Way better behavior and instruction following. Pretty massive improvement in actual usage.
https://x.com/mweinbach/status/2044801022439137566
3. Tell the model how to verify its changes. Put your testing workflow in your claude.md, or add a /verify-app skill. Opus 4.7 is better at verifying its work, and it’s helpful to share any local dev tips that are hard to discover.
https://x.com/_catwu/status/2044808538351100377
after ~10 million tokens Mythos is much more efficient than other models it reaches the same performance as Opus with ~40% the tokens
https://x.com/scaling01/status/2043700788245963167
Claude Opus 4.7 is now available as an Agent Preview inside of Devin! Anthropic has clearly optimized Claude Opus 4.7 for long-horizon autonomy, unlocking a class of deep investigation work we couldn’t reliably run before. Claude Opus 4.7 model costs within Devin will be
https://x.com/cognition/status/2044844661076902082
Claude Opus 4.7 is now available in Cursor. We’ve found it to be impressively autonomous and more creative in its reasoning. We’re launching it with 50% off for a limited time. Enjoy!
https://x.com/cursor_ai/status/2044785960899236341
Claude Opus 4.7 is out! Handles ambiguous, multi-step work even better than 4.6. Cursor’s internal bench cleared 70%, up from 58% on 4.6. Notion saw a 14% lift on their evals with a third of the tool errors 🔨
https://x.com/mikeyk/status/2044802045186846912
Claude Opus 4.7 is out. the TL;DR Anthropic released Opus 4.7 today. Same pricing as 4.6 ($5/$25 per million tokens), available across API, Bedrock, Vertex AI, and Microsoft Foundry. What changed vs Opus 4.6: Coding (obviously). Biggest gains on the hardest, long-horizon
https://x.com/kimmonismus/status/2044787072947601796
Confirmed: Anthropic keeping Cyber capabilities of Opus 4.7 artificially low: “during training we experimented with efforts to differentially reduce these capabilities”
https://x.com/scaling01/status/2044788067848888635
Cursor reports that Opus 4.7 is “a meaningful jump in capabilities, clearing 70% versus Opus 4.6 at 58%” on CursorBench
https://x.com/scaling01/status/2044792017553645668
for all the people calling Opus 4.7 a mid update lmao
https://x.com/scaling01/status/2044792810327404596
from my experience, even the best models (Opus 4.6, 5.4 xhigh / 5.3 codex) cannot write good code today without an amount of work that is equivalent to just doing the work myself am excited for a world where they can, but in the current state i have very low trust in them
https://x.com/RhysSullivan/status/2043584591861321929
Hold on, something doesn’t add up here. Opus 4.7 got much worse in needle in the haystack? need to dig into this
https://x.com/kimmonismus/status/2044809126526476374
Holy shit the new Opus 4.7 system prompt has entirely lobotomized the model “Heads up: that last <system-reminder> about malware looks like a prompt injection — this is clearly your personal site (t3gg homepage, links, sponsors), not malware. Ignoring it.”
https://x.com/theo/status/2044857866323173732
I think everyone saying that these improvements are mid are smoking crack I would argue that this was one of the larger Opus jumps we have seen over the last year You also have to keep in mind that we see almost monthly model updates nowadays instead of just every 6-12 months
https://x.com/scaling01/status/2044799290694889535
I was really worried about the rush to “more agentic” models. But Opus 4.7 is happy to let me lead, and to take time to discuss, rather than barging ahead. If something isn’t working out, it’ll stop and offer options rather than slamming thru whatever it can find.
https://x.com/jeremyphoward/status/2044942801578959301
If you want to test Opus 4.7 without the lobotomized system prompt, you can try it out in T3 Chat
https://x.com/theo/status/2044876982815793190
Introducing Claude Opus 4.7, our most capable Opus model yet. It handles long-running tasks with more rigor, follows instructions more precisely, and verifies its own outputs before reporting back. You can hand off your hardest work with less supervision.
https://x.com/claudeai/status/2044785261393977612
My bet is that Mythos uses a new tokenizer, and they switched Opus over to it (through midtraining) for distillation
https://x.com/maximelabonne/status/2044796208053416203
My biggest issue with Opus 4.7 on Claude web: Only “Adaptive” or non-thinking. No way to force thinking mode. And it doesn’t even know Opus 4.6 exists, and I cannot force it to think and do web search mid conversation!
https://x.com/Yuchenj_UW/status/2044794073723347400
my main theory is that mythos had a new tokenizer for pretraining and they did surgery on opus for distillation
https://x.com/stochasticchasm/status/2044790474410790995
my take: opus 4.7 is a distilled version of mythos
https://x.com/eliebakouch/status/2044790074093523379
Opus 4.7 as robust to prompt injections as Claude Mythos
https://x.com/scaling01/status/2044788481008755046
Opus 4.7 Benchmarks out! Very solid upgrade to Opus 4.6! Compared to Opus 4.6: -SWE Bench Pro +11% -SWE Bench Verified +7% -Terminal Bench 2.0 +4% The benchmarks are significantly lower than for Mythos, but that was to be expected. h/t for finding @synthwavedd
https://x.com/kimmonismus/status/2044784903733084521
Opus 4.7 comes with much improved reasoning-efficiency over Opus 4.6 basically everything is now moved up one tier low is as good as medium medium as good as high high as good as max
https://x.com/scaling01/status/2044785467942453698
Opus 4.7 deleting all long-context gains from Opus 4.6 lol
https://x.com/scaling01/status/2044791314898723179
Opus 4.7 has a new tokenizer. This means it’s also a new base model. Glory days of pretraining still very much going.
https://x.com/natolambert/status/2044788470179332533
opus 4.7 is here on claude platform / app
https://x.com/dejavucoder/status/2044784097378316327
Opus 4.7 is live in Claude Code today! The model performs best if you treat it like an engineer you’re delegating to, not a pair programmer you’re guiding line by line. Here are three workflow shifts we recommend for this model 🧵
https://x.com/_catwu/status/2044808533905178822
Opus 4.7 is now available in @MagicPathAI. From our early testing, the model is really strong at long tasks when design requires lots of changes, image-to-code, and overall produces cleaner, more reusable React components.
https://x.com/skirano/status/2044804877696516442
Opus 4.7 is WORSE than 4.6 on Long Context?
https://x.com/nrehiew_/status/2044795171213291614
Opus 4.7 much less likely to sudo rm -rf (taking destructive actions in production envs)
https://x.com/scaling01/status/2044789371837001779
Opus 4.7 uses a different tokenizer from Opus 4.6. So either: – Anthropic has a way to change tokenizer between finetunes – It is just new special tokens which implies they use special tokens liberally within messages and not just as part of the chat template
https://x.com/nrehiew_/status/2044792314825228690
Opus 4.7 uses more thinking tokens, so we’ve increased rate limits for all subscribers to make up for it. Enjoy!
https://x.com/bcherny/status/2044839936235553167
Opus is going to be a bioweapon risk at this pace
https://x.com/scaling01/status/2044785139905913077
Some of my favorite things in Opus 4.7: – Very good at async work and following instructions – Effort levels are far more predictable for token control (+ new xhigh level) – No more downscaling of high-res images – Noticeably more taste in UIs, slides, docs
https://x.com/alexalbert__/status/2044788914813292583
Unfortunately they didn’t include a chart for GraphWalks scores: Opus 4.6 – 38.7% Opus 4.7 – 58.6% This would make it clearer that long-context didn’t suffer as much as MRCR suggests.
https://x.com/scaling01/status/2044823423013020088
wait why is there an INSANE gap on long context benchmarks between opus 4.6 and 4.7??? this is crazy
https://x.com/eliebakouch/status/2044798168211100096
We’ve set the default effort level for Opus 4.7 to xhigh in Claude Code. You can use /effort to adjust this. Excited for you to try Claude Code with Opus 4.7 and let us know your feedback!
https://x.com/_catwu/status/2044808539663978970
Shocking result on my pelican benchmark this morning, I got a better pelican from a 21GB local Qwen3.6-35B-A3B running on my laptop than I did from the new Opus 4.7! Qwen on the left, Opus on the right
https://x.com/simonw/status/2044830134885306701
@stochasticchasm yeah they tend to forget that releases are now monthly, not bi-annual
https://x.com/scaling01/status/2044795960224592329
Anthropic Changes Pricing to Bill Firms Based on AI Use as Demand Jumps — The Information
https://www.theinformation.com/articles/anthropic-changes-pricing-bill-firms-based-ai-use-amid-compute-crunch
Anthropic introduced xhigh reasoning effort
https://x.com/scaling01/status/2044785557058814059
Anthropic loses Claude Code trust in black-box fight
https://www.implicator.ai/claude-probably-wasnt-secretly-nerfed-anthropic-made-the-black-box-too-dark/
Anthropic tests Claude Code upgrade to rival Codex Superapp
https://www.testingcatalog.com/anthropic-tests-claude-code-upgrade-to-rival-codex-superapp/
anthropic? you mean the greedy token guzzler company?
https://x.com/dejavucoder/status/2044798065530528061
every engineer at anthropic has been using mythos for ~1.5 months. meanwhile, their uptime is horrendous, claude code still has rendering bugs, etc. one could conclude that it won’t be the end of software engineering.
https://x.com/benhylak/status/2042051048261722467
GitHub reports similar improvements
https://x.com/scaling01/status/2044792459125834029
OpenAI has released a plugin that lets you call Codex directly within Anthropic’s Claude Code environment It turns Claude Code into a multi-agent setup with Codex as a specialized coding assistant This gives you: – High-quality code reviews – Delegation of real tasks
https://x.com/TheTuringPost/status/2044561927905677558
So we now have a pretty good picture of the state of the frontier AI model makers. US closed source models continue to lead. Google, OpenAI, and Anthropic stand well ahead of the pack, and may have signs of recursive self-improvement. xAI has fallen from frontier status for now
https://x.com/emollick/status/2042088011748290750
The pace at which Anthropic is shipping Opus variants is a very new thing in the industry.
https://x.com/_arohan_/status/2044791678180167804
The pace at which useful things are shipping also seems to be accelerating. Model releases are coming faster, of course, but so are significant application and enterprise products (especially from Anthropic). Almost certainly faster than the market can track or absorb information
https://x.com/emollick/status/2042434850003534077
we were literally stuck at 80% SWE-Bench Verified for months and just jumped to almost 90% and you guys call it mid …
https://x.com/scaling01/status/2044790717722034511
Yeah folks, it’s gonna be harder in the future to ensure OpenClaw still works with Anthropic models.
https://x.com/steipete/status/2042615534567457102
Excited to share that the Gemini API now has prepaid billing, rolled out to start for US customers!! We have been working hard across Google to enable this. It’s the default for new API users and existing users can opt in via a new billing account, all directly in AI Studio.
https://x.com/OfficialLoganK/status/2044516262152442315
Google prepares rollout of Skills for Gemini and AI Studio
https://www.testingcatalog.com/google-prepares-broader-rollout-of-skills-for-gemini-and-ai-studio/
Introducing Tab Tab Tab, our new prompt autocomplete engine in @GoogleAIStudio’s vibe coding experience. Now when you show up with your fuzzy ideas, you can rely on Gemini to fill in the blanks : )
https://x.com/OfficialLoganK/status/2043752712127611201
Personal Intelligence in Gemini is expanding to more people globally. 🌏 Google AI Ultra, Pro, and Plus subscribers around the world can access the feature starting today, with a rollout to free users coming soon. More information on where Personal Intelligence is available:
https://x.com/GeminiApp/status/2044430579996020815
We’re bringing Personal Intelligence to more users around the world in the @GeminiApp starting today, followed by Gemini in @googlechrome later this week. 🌍 Now even more people can securely connect the dots across their favorite Google apps — like @Gmail and @GooglePhotos — to
https://x.com/Google/status/2044437335425564691
We estimate that Gemini 3.1 Pro with thinking level `high` has a 50%-time-horizon of around 6.4 hrs (95% CI of 4 hrs to 12 hrs) on our suite of software tasks.
https://x.com/METR_Evals/status/2044463380057194868
I was chatting with my buddy at Google, who’s been a tech director there for about 20 years, about their AI adoption. Craziest convo I’ve had all year. The TL;DR is that Google engineering appears to have the same AI adoption footprint as John Deere, the tractor company. Most of
https://x.com/Steve_Yegge/status/2043747998740689171
We’re also launching a library of ready-to-use Skills for common tasks and workflows. You can save these Skills to your own library, and even customize them to better fit your needs by updating the prompt.
https://x.com/Google/status/2044106380882166040
“Memory Caching: RNNs with Growing Memory” Google’s new paper proposes a simple way to give recurrent models a memory that grows with sequence length. So instead of forcing an RNN to compress the full past into 1 fixed hidden state, it caches memory checkpoints across
https://x.com/askalphaxiv/status/2043782770657219010
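The excerpt above is cut off, but the core idea — checkpointing hidden states as the sequence grows, instead of compressing everything into one fixed state — can be sketched in a few lines. This toy version is my own illustration under stated assumptions (the fixed `interval` schedule and plain list cache are not from the paper):

```python
# Toy illustration of "growing memory" for a recurrent model:
# rather than forcing one fixed hidden state to summarize the whole
# past, snapshot the hidden state every `interval` steps into a cache
# that grows with sequence length and can be consulted later.

def run_rnn_with_cache(inputs, step_fn, h0, interval=4):
    """Run a generic RNN step function over `inputs`, caching a
    checkpoint of the hidden state every `interval` steps."""
    h, cache = h0, []
    for t, x in enumerate(inputs):
        h = step_fn(h, x)            # ordinary recurrent update
        if (t + 1) % interval == 0:
            cache.append(h)          # memory grows with the sequence
    return h, cache
```

With a trivial additive step function, `run_rnn_with_cache(range(8), lambda h, x: h + x, 0)` returns the final state plus two intermediate checkpoints — the point being that older context survives in the cache rather than being overwritten in place.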
NEW Research from Google. Integration test failures are painful because the signal is buried in messy logs. Massive output, heterogeneous systems, low signal-to-noise ratio, and unclear root causes. This paper introduces Auto-Diagnose, an LLM-based tool deployed inside Google’s
https://x.com/omarsar0/status/2044769798845079665
4 reasons Gemma 4’s architecture runs efficiently on your hardware: 1. Local + global attention structure 4 or 5 local layers + 1 final global layer to preserve the context understanding 2. Special optimizations for global attention: – 8 query heads per KV head in Grouped Query
https://x.com/TheTuringPost/status/2043086456412082356
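To make the “local + global” structure described above concrete, here is a minimal sketch of the layer layout and the sliding-window mask the local layers use. The layer counts, window size, and function names are illustrative assumptions on my part, not Gemma 4’s actual implementation:

```python
def layer_pattern(n_layers, local_per_global=5):
    """Label each layer 'local' or 'global': every
    (local_per_global + 1)-th layer attends globally to preserve
    long-range context; the rest use a sliding window."""
    return ["global" if (i + 1) % (local_per_global + 1) == 0 else "local"
            for i in range(n_layers)]

def sliding_window_mask(seq_len, window):
    """Causal sliding-window mask for the 'local' layers: token i
    may attend only to tokens in [i - window + 1, i]."""
    return [[0 <= i - j < window for j in range(seq_len)]
            for i in range(seq_len)]
```

`layer_pattern(12)` yields ten local layers interleaved with two global ones; the local layers keep attention cost roughly linear in sequence length, while the periodic global layers preserve whole-context understanding.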
So much in this release but the one many have been waiting for above the rest, the GUI dashboard! Manage and monitor your Hermes Agent with a GUI Local Web Dashboard with `hermes dashboard` command to start it!
https://x.com/Teknium/status/2043771509123232230
As we develop more capable models at the frontier, MSL is committed to safety and preparedness for AI. To demonstrate this commitment, we will be publishing preparedness reports for our models, in line with our new Advanced AI Scaling Framework. See our Muse Spark report below:
https://x.com/alexandr_wang/status/2044454230614999441
check out Contemplating mode for your most complex reasoning queries!
https://x.com/alexandr_wang/status/2043177308803215811
cool to see people finding new emergent capabilities within Muse Spark!
https://x.com/alexandr_wang/status/2042360886195581330
honestly I didn’t even know our model could do some of these
https://x.com/alexandr_wang/status/2042805863979626574
i find muse spark is very good at data analysis–both finding relevant open-source data and analyzing it. for example, here’s my results for analyzing global share of GDP over past century:
https://x.com/alexandr_wang/status/2043432483006615806
Meta AI is up to #6 in the App Store overnight, and still growing 🙂 Also who knew the 7-Eleven app was so popular
https://x.com/alexandr_wang/status/2042254047244398978
MSL *really does* run like a startup 🙂 join us if that sounds exciting to you!
https://x.com/alexandr_wang/status/2043176328170705036
muse spark is impressively multimodal!
https://x.com/alexandr_wang/status/2042362366784881011
muse spark is the best model I’ve personally used for Design & UI great to hear the community experience it as well!
https://x.com/alexandr_wang/status/2042610847520809295
okay this is too exciting 🙂 meta AI is now #2 in the app store, top AI app! we are so back!
https://x.com/alexandr_wang/status/2043016694910587228
people are finding all the cool things we built into muse spark 🙂
https://x.com/alexandr_wang/status/2043175802578346466
the muse spark API will be coming soon! we have been thrilled with the amount of excitement amongst developers who want to try muse spark inside their agentic harnesses stay tuned!
https://x.com/alexandr_wang/status/2042614906059387211
up to #3, coming for the crown 👑 that being said, MONOPOLY GO!Chat is now #1, so i’m learning a lot about the App Store
https://x.com/alexandr_wang/status/2042808439852630073
we are excited for people to try muse spark!
https://x.com/alexandr_wang/status/2042142866697548189
“Bookmark this! 10 practical Hermes Agent tutorials to save beginners 10 hours of detours.” With Hermes Agent blowing up, we are witnessing a paradigm shift from “passive tool” to “active living entity.” What makes Hermes fascinating is not how much work it can deliver right away, but its compounding “self-growing” effect: every line of code you feed it, every conversation, every Profile
https://x.com/biteye_sister/status/2043630704798679545
Added official support to Hermes Agent for: QQBot – hugely popular messaging platform in China AWS Bedrock Model Provider Run `hermes update` in your terminal to access early!
https://x.com/Teknium/status/2044557360962871711
Capable agents are the result of co-evolution between models and harnesses. We’ve been working with @NousResearch to ensure that M2.7 x Hermes Agent provides a top-tier experience for users. Hermes’s self-improving loop brings out the best in M2.7 through real usage. We are
https://x.com/MiniMax_AI/status/2044745282785886469
Finally had the chance to get up and running with @NousResearch Hermes Agent and my impression is great. The thing that has stood out so far: it’s fast, at least twice as fast as OpenClaw (I set up a new instance to test it against) Generally the UX also just feels a lot better
https://x.com/dabit3/status/2043808914312212568
For anyone running @NousResearch Hermes Agent locally and wishing it just stayed online: there’s now a one-click deployment template on Tencent Lighthouse. Cloud-hosted, sandboxed from your local env, online around the clock, reach it through WhatsApp, Telegram, WeCom, QQ, or
https://x.com/TencentAI_News/status/2044007400282436006
hermes agent @NousResearch is fucking insane i know literally NOTHING about coding. ZERO. and i just built a fully functioning web app in minutes
http://localhost:3000/ check it out @Teknium
https://x.com/friesmakesfries/status/2044751296641802481
hermes is so much better than openclaw hype is crazy
https://x.com/theCTO/status/2044559179151773933
Hermes is just too good. I installed it on Windows as well — the process is dead simple. I recommend installing it manually yourself rather than via Claude Code: 1. Install WSL2: wsl --install 2. Restart your computer and launch Ubuntu: wsl 3. Run the official install command: curl -fsSL
https://t.co/voDBXKw7Py | bash
https://x.com/aiqiang888/status/2043920187959992609
hermes-lcm v0.3.0 is out — biggest release yet!🚀 What’s new: – Smart search with sort modes (recency / relevance / hybrid) + full CJK & emoji support – Adaptive compaction that scales with backlog pressure and auto-retries on model limits – SQLite hardening: FTS auto-repair,
https://x.com/SteveSchoettler/status/2044536537434755493
I put 2 separate instances of Hermes agents into a chat, holy sh!t this is fun >1 agent is builder, 1 is strategist >each on separate models >gave them some shared context >enabled bot2bot and added each bot to the other’s TG allowlist >put 3 of us in a gc >started with a simple
https://x.com/KSimback/status/2044736703370309706
Introducing Mirra Workspaces Workspaces give your local agents access to a shared multi-tenant environment. Our customers are already using Cloud Workspaces to automatically share context between their team members’ agents. Workspaces work best with @NousResearch Hermes, which
https://x.com/mirra/status/2044762744998519282
Introducing the Nous Portal Tool Gateway, one login to access over 400 LLMs and power all core tools in Hermes Agent. Check it out below!
https://x.com/Teknium/status/2044879261564375326
M2.7 w/ hermes cli is replacing ~75% of my claude code / opus usage now, but we need clarity for using it as a coding agent @ work. We’re truly blessed to have the weights of this one, looking forward to seeing the license change. Definitely a model worth checking out.
https://x.com/Sentdex/status/2044108342147060067
Pliny used Hermes Agent to do the abliteration! Very Cool!
https://x.com/Teknium/status/2044482769536045194
The Hermes Agent dashboard is here! Run ‘hermes dashboard’
https://x.com/NousResearch/status/2043791876835156362
The update V0.9.0 changes everything for Hermes Agent! You have now: – Web UI – Model switching – iMessage & WeChat integration – Backup & Restore, no more debugging for hours – Android via Tmux, yes, your Android can host Hermes Great work @NousResearch and the +20
https://x.com/AntoineRSX/status/2043884430901850271
This Hermes update is going to be the thing that gives @NousResearch their openclaw moment. Hermes just dropped a UI dashboard, and I truly believe that this is what is going to give Hermes their openclaw moment. The team has spent months dialling everything in so that the
https://x.com/Shaun__Furman/status/2043820083114545416
This is the Hermes Agent article you need! New or experienced, most users end up with messy sessions or use them suboptimally. One of the biggest upgrades is learning how to manage sessions properly: >resume by title >rename threads >branch conversations >export history
https://x.com/NeoAIForecast/status/2044521045013762389
This skill is now built in to Hermes! Use /architecture-diagram <prompt> after updating hermes, and you’re good to go! Thanks to the author of the skill making it MIT we were able to port it over directly into Hermes Agent as a built in skill!
https://x.com/Teknium/status/2044190761609244986
today’s @NousResearch Hermes Agent prompt: i want you to pick one skill every 8 hours to evolve and do it. do whatever you need to do and use whatever you need to get it done. Nora’s response: Let me build proper tracking. The tracker is already picking up data (the
https://x.com/chooseliberty/status/2044425487141781660
Tool Gateway is now live in Nous Portal. No separate accounts, no API key juggling. All you need is one subscription, and everything works. A paid Nous Portal subscription now includes access to 300+ models and a growing set of third-party tools. Launching with: → Web
https://x.com/NousResearch/status/2044878344592699744
tried hermes yesterday. light years ahead of openclaw. the UX is just so much better, it’s wild. feels like it’s made by someone that actually cares about architecture and user experience. still not sure why anyone would use this over something fully hosted like Poke, but if you
https://x.com/robinebers/status/2043835216670929005
Using Hermes after OpenClaw is like having an ice-cold glass of water in hell. @NousResearch 🫡
https://x.com/vrloom/status/2044506378103099816
Was able to get a slick native swift desktop app v1.0 up and running for Hermes agent today (credit to redsparklabs) Can I get a few people to alpha test it with me? Works great for me so far! 🚀 DM me! @Teknium @NousResearch Check out this beauty!
https://x.com/nesquena/status/2044516572983923021
Can a cloud server run Hermes browser automation? Yesterday I recorded a short video using Hermes’s /browser connect to hook directly into my local Chrome and have it go like tweets on its own. I was just testing it out, but the video got a surprisingly high view count; maybe it made AI capabilities feel more concrete for a lot of people. Thanks to Hermes developer @Teknium for the retweet, haha.
https://x.com/0xme66/status/2044755328391319757
Hermes can now drive the browser via the /browser connect command. I tried it by liking my own posts on X and it felt really solid. It ships with some default execution strategies, so give them all a try~
https://x.com/0xme66/status/2044410470770331913
It was a pleasure to sit down with @FidlerSanja, VP of AI Research at NVIDIA, who leads the company’s Spatial Intelligence Lab and is actively building the next major frontier of AI: physical AI. During GTC, where her lab introduced AlpaDream, we discussed: • If Transformers are
https://x.com/TheTuringPost/status/2042512295742656776
Rethinking AI TCO: Why Cost per Token Is the Only Metric That Matters
https://blogs.nvidia.com/blog/lowest-token-cost-ai-factories/
Figure and Hark just took an entire data center of NVIDIA B200s, every rack in the building. Figure will be using these to predict physics, and Hark will train next-generation multi-modal models
https://x.com/adcock_brett/status/2042675641037000868
What are world models actually? @FidlerSanja, VP of AI Research at NVIDIA, who leads the company’s Spatial Intelligence Lab, explains in our interview. If you want to learn about the major next frontier in AI, watch the full conversation:
https://x.com/TheTuringPost/status/2043962055531868554
Agents need computers. And they need a lot of them. Modal is an official sandbox provider for the @OpenAI Agents SDK.
https://x.com/modal/status/2044469736483000743
Build long-running agents with more control over agent execution. New capabilities in the Agents SDK: • Run agents in controlled sandboxes • Inspect and customize the open-source harness • Control when memories are created and where they’re stored
https://x.com/OpenAIDevs/status/2044466699785920937
Codex for almost everything | Hacker News
https://news.ycombinator.com/item?id=47796469
Codex now helps with more of your work, from coding to staying on top of everything around it.
https://x.com/OpenAIDevs/status/2044828214867202519
Here’s how we use Codex to: > understand large codebases > review PRs faster > build macOS apps > turn Figma into code > automate bug triage > create a CLI as agent tools > analyze datasets > generate slide decks > coordinate new-hire onboarding > learn a new concept …and
https://x.com/gabrielchua/status/2043339151278506234
Improve agent performance with a harness that keeps long-running agents on track. It manages the agent loop across tools, context, and traces. The sandbox preserves working state across pauses, retries, and resumptions.
https://x.com/OpenAIDevs/status/2044466729712304613
OpenAI x E2B: build your agents with the new OpenAI Agents SDK, powered by E2B sandboxes. We’re excited to support OpenAI as a launch partner! The new @OpenAI Agents SDK will now get dedicated sandboxes – perfect for persistent, long-running agents. With E2B, you’ll get a
https://x.com/e2b/status/2044476275067416751
To show off what you can do with @OpenAI Agent SDK + @modal, we built an ML research agent (inspired by @karpathy). It can: – Spin up GPU sandboxes of any shape – Run a pool of subagents – Persist memory – Snapshot state for fork/resume Here it is playing Parameter Golf:
https://x.com/akshat_b/status/2044489564211880169
Today we launched a major update to the OpenAI Agents SDK to help developers build and deploy long-running, durable agents in production. You can now build your own Codex-style agents using powerful primitives for modern agents – file and computer use, skills, memory and
https://x.com/snsf/status/2044514160034324793
Top things we released in Codex today: > Computer use on Mac: Codex can see, click, and type across apps > In-app browser for faster frontend, app, and game iteration > Image generation with gpt-image-1.5 > 90+ new plugins across tools like JIRA, CircleCI, GitLab, Microsoft
https://x.com/reach_vb/status/2044830689313599827
Use Vercel Sandbox with the OpenAI agents SDK as an official extension. Build agents that can run code, read files, and analyze data safely inside isolated microVMs. Control the compute and data flow from your secure cloud environment.
https://x.com/vercel_dev/status/2044492058073960733
you can build a Python agent that accepts a coding task, executes it inside a Cloudflare Sandbox, and copies the output files to your local machine @OpenAIDevs x @CloudflareDev Check out our guide here:
https://x.com/whoiskatrin/status/2044477140662395182
Your agents need a sandbox, but you need a framework in which to create your agent. We’re excited to be a sandbox provider in the new @OpenAI Agents SDK. By combining the SDK and Daytona sandboxes, you get agent orchestration and secure code execution working together out of the
https://x.com/daytonaio/status/2044473859047313464
AIE Europe Keynotes & OpenClaw ft Deepmind, OpenAI, Vercel, @pragmaticengineer, @mattpocockuk – YouTube
cool idea for a screenless experience w/ @openclaw – sound on!
https://x.com/karenxcheng/status/2043731860791144555
If you look at GPT 5.4-Cyber and its ability for closed-source reverse engineering, I have bad news for you. I do very much feel the pain though, there’s hundreds of teams that try to poke holes into @openclaw. Our response has been rapid iteration and code hardening. Which
https://x.com/steipete/status/2044423791405924562
Latest OpenClaw updates: 2026.4.11 • Dreaming & memory – added ChatGPT import + new “Memory Palace” → explore chats as structured memory • Plugins now guide you through setup • Richer Chat UI: structured bubbles, media rendering & embeds • Better video generation (URLs,
https://x.com/TheTuringPost/status/2043340386538778840
OpenClaw 2026.4.10 🦞 🧠 Active Memory plugin 🎙️ local MLX Talk mode 🤖 Codex app-server harness plugin 🧾 Teams pins/reactions/read actions 🛡️ SSRF hardening + launchd fixes stability, but with attitude🦞
https://x.com/openclaw/status/2042811598058742012
OpenClaw 2026.4.11 is out ✨ big polish drop for stability 🛡️ safer provider transport/routing 🤖 more reliable subagents + exec approvals 💬 lots of Slack / WhatsApp / Telegram / Matrix fixes 🌐 browser + mobile cleanup a chunky cleanup pass 😎
https://x.com/openclaw/status/2043132528094036332
OpenClaw 2026.4.14 🦞 More reliability updates: ✨ Smarter GPT-5.4 routing and recovery 🌐 Chrome/CDP improvements 🧵 Subagents no longer get stuck 💬 Slack/Telegram/Discord fixes ⚡️ Various performance improvements Was sleeping, and we still shipped.
https://x.com/openclaw/status/2044042546976883063
OpenClaw 2026.4.9 🦞 🧠 Dreaming: REM backfill + diary timeline UI 🔐 SSRF + node exec injection hardening 🔬 Character-vibes QA evals 📱 Android pairing overhaul your agent now dreams about you. romantic or terrifying? yes. 🦞
https://x.com/openclaw/status/2042072722902077938
This release makes me unreasonably happy since I wasn’t involved at all – @vincent_koc and the maintainer team did a great job. I’m back soon to work on OpenClaw, today/tomorrow I’m prepping for @TEDTalks in Vancouver. 🇨🇦
https://x.com/steipete/status/2044047222481019300
Two experiments in the next @openclaw to address some “GPT is lazy” issues: 1) Strict mode: agents.defaults.embeddedPi.executionContract = “strict-agentic” This tells GPT-5.x to keep working: read more code, call tools, make changes, or return a real blocker instead of
https://x.com/steipete/status/2043136615640694797
Many people who have used OpenClaw wonder: how is Hermes Agent actually different from OpenClaw? In terms of positioning, OpenClaw leans toward an out-of-the-box personal assistant: a friendly GUI, local-first data, easy cross-device sync, and a low barrier to entry. Hermes Agent is more like a professional agent that can grow: after each task it evaluates whether the workflow is reusable and automatically distills it into
https://x.com/joshesye/status/2044295313171571086
Personal Computer Is Here
https://www.perplexity.ai/hub/blog/personal-computer-is-here
Today we’re releasing Personal Computer. Personal Computer integrates with the Perplexity Mac App for secure orchestration across your local files, native apps, and browser. We’re rolling this out to all Perplexity Max subscribers and everyone on the waitlist starting today.
https://x.com/perplexity_ai/status/2044805973085454518
🎉 Congrats @Alibaba_Qwen on the first open-weight Qwen3.6! Stronger agentic coding and a new thinking preservation option to retain reasoning context across turns. Same architecture as Qwen3.5, so serving teams can upgrade in place. Day-0 support in vLLM v0.19+. Thinking, tool
https://x.com/vllm_project/status/2044787721538060784
Introducing Nucleus-Image: the first sparse Mixture-of-Experts diffusion model 17B parameters. Only 2B active. 10x more parameter-efficient than leading diffusion models. Toe-to-toe with GPT Image 1, Imagen 4, and Qwen-Image: from pure pre-training alone. No DPO. No RL. No
https://x.com/withnucleusai/status/2044412335473713284
Qwen/Qwen3-Coder-Next · Hugging Face
https://huggingface.co/Qwen/Qwen3-Coder-Next
We built FrogsGame as a new task for evaluating AI’s posttraining skills! It’s a tool-using RL environment built around a blind-start interaction loop. Frontier agents get a container with the Qwen3-8B tokenizer, board-generating scaffolding, and @tinkerapi for remote training
https://x.com/karinanguyen/status/2044885375085339023
2-bit Qwen3.6-35B-A3B did a complete repo bug hunt with evidence, repro, fixes, tests and a PR writeup. 🔥 Run it locally in Unsloth Studio with just 13GB RAM. 2-bit Qwen3.6 GGUF made 30+ tool calls, searched 20 sites and executed Python code. GitHub:
https://x.com/UnslothAI/status/2044858346948464743
Qwen3.6-35B-A3B can now be run locally!💜 The model is the strongest mid-sized LLM on nearly all benchmarks. Run on 23GB RAM via Unsloth Dynamic GGUFs. GGUFs to run:
https://t.co/VlyW8UwDjw Guide:
https://x.com/UnslothAI/status/2044786492451778988