Image created with Ideogram. Image prompt: Using the provided reference image, preserve every detail exactly — the marigold orange backdrop, the seated young woman with closed eyes and faint smile in her purple-and-white windbreaker, the tattooed singer in the red beanie and layered red vest, the lighting and framing — but replace only the black handheld microphone with a fully opened vintage red Swiss Army knife with all blades, scissors, screwdrivers, and tools splayed outward, held to his mouth in the exact same hand position and scale as the original microphone, photographed with seamless realism and matching studio lighting. After generating the image, overlay the text “Agents” in the upper-left corner of the frame in large, bold, all-caps ITC Avant Garde Gothic Pro Medium (or a near-identical geometric sans-serif if unavailable), pure white (#FFFFFF), with no date, subtitle, drop shadow, or outline. The text should be substantial in scale — taking up a meaningful portion of the upper-left area — with comfortable margin from the top and left edges, set against the negative space of the orange backdrop so it does not overlap or obscure the singer, the seated woman, or the replaced object.

Browser Run: give your agents a browser
https://blog.cloudflare.com/browser-run-for-ai-agents/

Harvey Agents | Delegate the Work. Own the Judgment.
https://www.harvey.ai/agents

Adobe Ushers in a New Era of Creativity with New Creative Agent and Generative AI Innovations in Adobe Firefly
https://news.adobe.com/news/2026/04/adobe-new-creative-agent

give your agents a browser. Browser Run (fka Browser Rendering) really sprinted for Agents Week 🏃‍♀️ quick look at what shipped 1) Browser Rendering –> Browser Run (renamed!) 2) Live View – realtime view of browser sessions 3) Human in the Loop – intervene when your agent needs
https://x.com/kathyyliao/status/2044479579382026484

Redesigning Claude Code on desktop for parallel agents | Claude
https://claude.com/blog/claude-code-desktop-redesign

2. Give Claude Code your full task context upfront: goal, constraints, acceptance criteria in the first turn. This lets Claude Code do its best work.
https://x.com/_catwu/status/2044808536790847693

Anthropic CPO leaves Figma’s board after reports he will offer a competing product | TechCrunch

Bessent, Powell Summon Bank CEOs to Urgent Meeting Over Anthropic’s New AI Model – Bloomberg
https://www.bloomberg.com/news/articles/2026-04-10/anthropic-model-scare-sparks-urgent-bessent-powell-warning-to-bank-ceos

Multi-agent coordination patterns: Five approaches and when to use them | Claude
https://claude.com/blog/multi-agent-coordination-patterns

Our evaluation of Claude Mythos Preview’s cyber capabilities | AISI Work
https://www.aisi.gov.uk/blog/our-evaluation-of-claude-mythos-previews-cyber-capabilities

Five companies — Google, Microsoft, Meta, Amazon, and Oracle — now control about two-thirds of the world’s compute, up slightly from ~60% at the start of 2024. Many AI labs (including OpenAI and Anthropic) depend almost entirely on these hyperscalers for access to their compute.
https://x.com/EpochAIResearch/status/2044154042541301870

I asked Jensen: “2 out of the top 3 models in the world, Claude and Gemini, were trained on TPU. What does that mean for Nvidia going forward?” After a long technical back and forth about what the right accelerator for AI looks like (see full episode), Jensen lays down the
https://x.com/dwarkesh_sp/status/2044468295957635392

So the concern over Mythos and cybersecurity seems warranted.
https://x.com/emollick/status/2043810051979157680

This was… an interesting one. Reminder that we run independent evals on our cyber ranges that labs don’t have access to. Exploitation capabilities are getting seriously good. Mythos is the first model to complete our full 32-step corporate network attack sim E2E.
https://x.com/ekinomicss/status/2043688793085992970

Anthropic launched Claude Opus 4.7 today, the new #1 in our GDPval-AA benchmark for performance on agentic real-world work tasks Opus 4.7 scored 1753 on GDPval-AA at launch with its ‘max’ effort setting, surpassing GPT-5.4 xhigh. This is a significant upgrade, placing Opus back
https://x.com/ArtificialAnlys/status/2044856740970402115

Anthropic says Opus 4.7 hits 80.6% on Document Reasoning — up from 57.1%. But "reasoning about documents" ≠ "parsing documents for agents." We ran it on ParseBench. → Charts: 13.5% → 55.8% (+42.3) — huge → Formatting: 64.2% → 69.4% (+5.2) → Content: 89.7% → 90.3%
https://x.com/llama_index/status/2044886527352647859

Anthropic’s Opus 4.7 just seized the #1 spot on the Vals Index with a score of 71.4%, a massive jump from the previous best (67.7%). It also ranks #1 on Vibe Code Bench, Vals Multimodal, Finance Agent, Mortgage Tax, SAGE, SWE-Bench, and Terminal Bench 2.
https://x.com/ValsAI/status/2044792518953533777

big jump in coding capabilities by Claude 4.7 Opus SWE-Bench Pro 64.3% SWE-Bench Verified 87.6% TerminalBench 69.4% but interestingly, I think they kept CyberGym scores artificially low
https://x.com/scaling01/status/2044784563201708379

Claude 4.7 Opus has an Elo of 1753 on GDPVal-AA
https://x.com/scaling01/status/2044784781368365233

Claude Opus 4.7 is out! Benchmark scores look pretty strong, but clearly much worse than Mythos. It’s a nerfed Mythos, they deliberately reduced cyber capabilities during training.
https://x.com/Yuchenj_UW/status/2044787564440334350

Document Arena update: four new models are reshaping the top ranks – including two open models! – #1 Claude Opus 4.6 Thinking is new, keeping @AnthropicAI in the top 3 – #8 Kimi-K2.5 Thinking by @Kimi_Moonshot now the best open model (Modified MIT) – #10 Gemma-4-31b by
https://x.com/arena/status/2044437193205395458

Document reasoning increased by A LOT for Opus 4.7
https://x.com/scaling01/status/2044784878965703100

Introducing Claude Opus 4.7 \ Anthropic
https://www.anthropic.com/news/claude-opus-4-7

New Anthropic Fellows research: developing an Automated Alignment Researcher. We ran an experiment to learn whether Claude Opus 4.6 could accelerate research on a key alignment problem: using a weak AI model to supervise the training of a stronger one.
https://x.com/AnthropicAI/status/2044138481790648323

Nonetheless Opus 4.7 scores much higher on Firefox shell exploitation
https://x.com/scaling01/status/2044788243435069764

OpenAI just dropped a major Codex update, one hour after Anthropic’s Opus 4.7. What’s new: background computer use on macOS (Codex clicks and types on your Mac while you keep working), in-app browser, image generation via gpt-image-1.5, persistent memory, long-running
https://x.com/kimmonismus/status/2044832303075995994

Opus 4.7 first-hour impressions Ran the canvas tree growth test twice. 4.6: nailed the animation both times 4.7: static tree, no growth animation — twice 4.7’s thinking is noticeably shorter and faster though (trimmed some 4.6 thinking in the clip for pacing). Not the upgrade
https://x.com/stevibe/status/2044800069661254064

Opus 4.7 scores 92% on ARC-AGI-1 and 75.83% on ARC-AGI-2
https://x.com/scaling01/status/2044791039605506344

The new Opus 4.7 model places #1 on our Vibe Code Benchmark, at 71%. When we first released the benchmark 4.5 months ago, no model scored above 25%. This benchmark tests a model’s ability to create a fully functional web application from the ground up.
https://x.com/ValsAI/status/2044791415524471099

We comprehensively benchmarked Opus 4.7 on document understanding. We evaluated it through ParseBench – our comprehensive OCR benchmark for enterprise documents where we evaluate tables, text, charts, and visual grounding. The results 🧑‍🔬: – Opus 4.7 is a general improvement
https://x.com/jerryjliu0/status/2044902620746363016

What are the largest software engineering tasks AI can perform? In our new benchmark, MirrorCode, Claude Opus 4.6 reimplemented a 16,000-line bioinformatics toolkit — a task we believe would take a human engineer weeks. Co-developed with @METR_Evals. Details in thread.
https://x.com/EpochAIResearch/status/2042624189421752346

What you need to know about Opus 4.7 * Takes instructions literally * Better vision means improved computer use and producing slides and other visual artifacts * Optimized for large-scale real-world analysis * Better at using file system-based memory
https://x.com/omarsar0/status/2044797480471044536

Wow I can already say after just 5 hours using @AnthropicAI Opus 4.7 that this is the first model that “gets” what I’m doing when I’m working. It feels aligned with me in a way no previous model did. (4.6 actively worked against me. I hated it. So this is *very* exciting!)
https://x.com/jeremyphoward/status/2044942799511191559

Anthropic co-founder confirms the company briefed the Trump administration on Mythos | TechCrunch

First model from Anthropic, which openly acknowledges it isn’t the best model they have
https://x.com/nrehiew_/status/2044791293080121553

Internal Anthropic survey on Claude Mythos Preview 12/18 people thought that Mythos can manage day-long ambiguous tasks 8/18 thought that it can execute week-long tasks
https://x.com/scaling01/status/2044787521691742338

Nearly 1/3 of surveyed people at Anthropic now think entry-level engineers and researchers are likely to be replaced by Mythos within 3 months
https://x.com/arankomatsuzaki/status/2044808883928186936

Read OpenAI’s latest internal memo about beating the competition — including Anthropic | The Verge
https://www.theverge.com/ai-artificial-intelligence/911118/openai-memo-cro-ai-competition-anthropic

Gemini 3.1 Flash TTS is our most controllable text-to-speech model yet. With new Audio Tags, you can easily direct vocal style, delivery, and pace through text commands. 🧵
https://x.com/GoogleDeepMind/status/2044447030353752349

Gemini 3.1 Flash TTS: New text-to-speech AI model
https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-tts/

Google’s new Gemini 3.1 Flash TTS ranks #2 on the Artificial Analysis Speech Arena Leaderboard, ahead of ElevenLabs’ Eleven v3 and only behind Inworld TTS 1.5 Max Gemini 3.1 Flash TTS represents a significant step forward for Google from previous TTS models, with notably
https://x.com/ArtificialAnlys/status/2044450045190418673

Introducing Gemini 3.1 Flash TTS 🗣️, our latest text to speech model with scene direction, speaker level specificity, audio tags, more natural + expressive voices, and support for 70 different languages. Available via our new audio playground in AI Studio and in the Gemini API!
https://x.com/OfficialLoganK/status/2044447596010435054

Our most expressive and steerable TTS model yet! Designed to give builders granular control over AI-generated speech, Gemini 3.1 Flash TTS is really fun to play with! Available in preview today – for devs via the Gemini API & @GoogleAIStudio + for enterprises on Vertex AI
https://x.com/demishassabis/status/2044599020690010217

try the TurboTax app in ChatGPT:
https://x.com/gdb/status/2044292247898992924

Introducing Gemini on Mac. It’s the first time we’re bringing the @Geminiapp to desktop. The team built this initial release with @Antigravity, and it went from an idea to a native Swift app prototype in a few days. More features on the way!
https://x.com/sundarpichai/status/2044452464724967550

Introducing Gemini on Mac. We heard your feedback. We recruited a small team. They built 100+ features in less than 100 days. 🤯 100% native Swift. Lightning fast. Let us know what you think!
https://x.com/joshwoodward/status/2044452201947627709

The Gemini App is now available on Mac OS
https://blog.google/innovation-and-ai/products/gemini-app/gemini-app-now-on-mac-os/

The Gemini app is now on Mac. With this new desktop app, you can access Gemini from any screen with Option + Space and share your window to get answers based on the documents, code, or data you’re working on.
https://x.com/GeminiApp/status/2044445911716090212

Google develops its own desktop Agent to compete with Cowork
https://www.testingcatalog.com/google-develops-its-own-desktop-agent-to-compete-with-cowork/

Google tests Canvas and Connectors on NotebookLM
https://www.testingcatalog.com/google-tests-canvas-and-connectors-on-notebooklm/

Google upgrades AI Mode in the Chrome browser
https://blog.google/products-and-platforms/products/search/ai-mode-chrome/

Today we shipped a new Search experience in @googlechrome to help you explore the web without constant back and forth between tabs. Now, when you click a link from AI Mode, the website opens side-by-side. It’s a game changer for comparing details and products across sites,
https://x.com/rmstein/status/2044828926057333050

We’re introducing a new Search experience in @GoogleChrome that lets you open webpages side-by-side with AI Mode – no tab switching required. Now, you’ll be able to compare details and ask follow-up questions while still maintaining the context of your search, whether you’re
https://x.com/Google/status/2044832732274901489

Today, we’re introducing Skills in @GoogleChrome, a new way to build one-click workflows for your most frequently used AI prompts — like asking for ingredient substitutions to make a recipe vegan, generating side-by-side shopping comparisons across multiple tabs, or scanning long
https://x.com/Google/status/2044106378655215625

Turn your best AI prompts into one-click tools in Chrome
https://blog.google/products-and-platforms/products/chrome/skills-in-chrome/

Google tests Agentic Shopping and native checkout in Gemini
https://www.testingcatalog.com/google-tests-agentic-shopping-with-native-checkout-in-gemini/

Gemini Robotics ER 1.6: Enhanced Embodied Reasoning — Google DeepMind
https://deepmind.google/blog/gemini-robotics-er-1-6/

Instead of writing complex code, the team interacted with Spot using plain English. We built a bridge between Gemini Robotics ER and Spot’s system, giving the AI a basic set of tools to move freely, take photos, and grab things – enabling it to carry out more complex tasks.
https://x.com/GoogleDeepMind/status/2044763631858909269

Introducing Gemini Robotics ER 1.6, our new SOTA robotics model 🤖 which excels at visual and spatial reasoning, now available via the Gemini API!
https://x.com/OfficialLoganK/status/2044080025474126065

Robotics is making progress! 🤖 We just released @GoogleDeepMind Gemini Robotics-ER 1.6 for enhanced embodied reasoning. – Unlocks instrument reading capabilities for complex gauges and sight glasses. – Achieves 93% success on instrument reading tasks using agentic vision. –
https://x.com/_philschmid/status/2044071114578509971

We teamed up with @BostonDynamics to power their robot Spot with Gemini Robotics embodied reasoning models. This means it can better understand its surroundings, identify objects and follow simple commands – like tidying up a room.
https://x.com/GoogleDeepMind/status/2044763625680765408

We’re rolling out an upgrade designed to help robots reason about the physical world. 🤖 Gemini Robotics-ER 1.6 has significantly better visual and spatial understanding in order to plan and complete more useful tasks. Here’s why this is important 🧵
https://x.com/GoogleDeepMind/status/2044069878781390929

Sub-32B open weights models now offer GPT-5 level intelligence with Qwen3.5 27B (Reasoning) matching GPT-5 (medium) at 42 and Gemma 4 31B (Reasoning) matching GPT-5 (low) at 39 on the Artificial Analysis Intelligence Index @Alibaba_Qwen’s Qwen3.5 and @GoogleDeepMind’s Gemma 4
https://x.com/ArtificialAnlys/status/2043929874537296026

I took the new Muse Spark to the ultimate test: filing my taxes – 3 different workplaces, consulting, stocks, foreign bank accounts and assets, and kids. One hour later, I had everything done. AGI is here… cc: @alexandr_wang
https://x.com/ziv_ravid/status/2044237898351030538

this is not investment or tax advice… but very cool!
https://x.com/alexandr_wang/status/2044269086771921326

Banger paper from NVIDIA. Agentic reasoning needs models that are not just capable, but efficient at long-context inference. The agent model layer is moving toward open, long-context, high-throughput architectures. This paper introduces Nemotron 3 Super, an open 120B parameter
https://x.com/dair_ai/status/2044452957023047943

NVIDIA Launches Ising, the World’s First Open AI Models to Accelerate the Path to Useful Quantum Computers | NVIDIA Newsroom
https://nvidianews.nvidia.com/news/nvidia-launches-ising-the-worlds-first-open-ai-models-to-accelerate-the-path-to-useful-quantum-computers

We’ve been developing a multi-agent system that builds and maintains complex software autonomously. Recently, we partnered with NVIDIA to apply it to optimizing CUDA kernels. In 3 weeks, it delivered a 38% geomean speedup across 235 problems.
https://x.com/cursor_ai/status/2044136953239740909

Today, we released Lyra 2.0, a framework for generating persistent, explorable 3D worlds at scale, from NVIDIA Research. Generating large-scale, complex environments is difficult for AI models. Current models often “forget” what spaces look like and lose track of movement over
https://x.com/NVIDIAAIDev/status/2044445645109436672

The next evolution of the Agents SDK | OpenAI
https://openai.com/index/the-next-evolution-of-the-agents-sdk/

We started Hiro with the vision of building an AI personal CFO. Joining @OpenAI gives us the chance to pursue that vision at a much greater scale. Important dates: – Today: Hiro is no longer accepting new signups – April 20, 2026: The product will stop working, but data export
https://x.com/hirofinanceai/status/2043751090232144159

Codex for (almost) everything | OpenAI
https://openai.com/index/codex-for-almost-everything/

Codex for (almost) everything. It can now use apps on your Mac, connect to more of your tools, create images, learn from previous actions, remember how you like to work, and take on ongoing and repeatable tasks.
https://x.com/OpenAI/status/2044827705406062670

OpenAI develops unified Codex app and new Scratchpad feature
https://www.testingcatalog.com/openai-develops-unified-codex-app-and-new-scratchpad-feature/

OpenAI tests web browsing feature on Codex Superapp
https://www.testingcatalog.com/openai-tests-web-browsing-feature-on-codex-superapp/

Gemma 4 and Why Many OpenClaw Users are Now Switching to it
https://x.com/TheTuringPost/status/2042167647077286163

Microsoft is working on yet another OpenClaw-like agent | TechCrunch

Microsoft Plots New Copilot Features Inspired by OpenClaw — The Information
https://www.theinformation.com/articles/microsoft-plots-new-copilot-features-inspired-openclaw

When set up on a Mac mini, Personal Computer can run 24/7 in the background across all your apps and files. Start a task from your iPhone, and Personal Computer can operate on your desktop and local files using 2FA. Requires the latest iOS update from the App Store.
https://x.com/perplexity_ai/status/2044806021244497964

⚡ Meet Qwen3.6-35B-A3B: Now Open-Source!🚀🚀 A sparse MoE model, 35B total params, 3B active. Apache 2.0 license. 🔥 Agentic coding on par with models 10x its active size 📷 Strong multimodal perception and reasoning ability 🧠 Multimodal thinking + non-thinking modes
https://x.com/Alibaba_Qwen/status/2044768734234243427

LM Performance: Qwen3.6-35B-A3B outperforms the dense 27B-param Qwen3.5-27B on several key coding benchmarks and dramatically surpasses its direct predecessor Qwen3.5-35B-A3B, especially on agentic coding and reasoning tasks.
https://x.com/Alibaba_Qwen/status/2044768738294268199

VLM Performance: Qwen3.6 is natively multimodal, and Qwen3.6-35B-A3B showcases perception and multimodal reasoning capabilities that far exceed what its size would suggest, with only around 3 billion activated parameters. Across most vision-language benchmarks, its performance
https://x.com/Alibaba_Qwen/status/2044768742761189762

Alibaba released Qwen3.6-35B-A3B today. Big jump compared to the Qwen3.5-35B model. It’s a sparse MoE, 35B total params, only 3B active. Natively multimodal, thinking and non-thinking modes. Hardfacts: SWE-bench Verified: 73.4, near dense Qwen3.5-27B (75.0), way ahead of
https://x.com/kimmonismus/status/2044780695361290347

[2604.09443] Many-Tier Instruction Hierarchy in LLM Agents
https://arxiv.org/abs/2604.09443

Agentic Analytics Summit 2026
https://cube.registration.goldcast.io/events/a87b0088-098d-4467-a11d-db6821c3a639

Agents as scaffolding for recurring tasks. | Irrational Exuberance
https://lethain.com/agents-as-scaffolding/

Another week on the road meeting with a couple dozen IT and AI leaders from large enterprises across banking, media, retail, healthcare, consulting, tech, and sports, to discuss agents in the enterprise. Some quick takeaways: * Clear that we’re moving from chat era of AI to
https://x.com/levie/status/2043426157367095397?s=46

Building for trillions of agents
https://x.com/levie/status/2030714592238956960?s=46

Cloudflare dashboard can now complete tasks for you. – “Create a Worker and bind a new R2 bucket to it” – “Change my DNS records to 1.1.1.1” – “How many errors have happened this week” Not only do we tell you, but we show you with generative UI. PROTIP: Use full-screen mode.
https://x.com/BraydenWilmoth/status/2044422996765352226

Coding agents learn from experience, but that knowledge stays locked in silos. Solve a thousand SWE tasks, and none of that wisdom helps with competitive coding. What if memories could transfer across domains? The work introduces Memory Transfer Learning, a framework where
https://x.com/dair_ai/status/2044900659921895729

I just updated our license. For personal use, you’re free to run the software on your own servers for coding, building applications, agents, tools, or integrations, as well as for research, experimentation, and other personal projects. Don’t worry, bro — go ahead and use
https://x.com/RyanLeeMiniMax/status/2044132777877221515

It is notable that we are all debating exactly which markdown files are most important to feed AI (skills, memory, tool instructions) and in which order to feed them to get the best output. Feels that this is likely a temporary state of affairs in the development of agents
https://x.com/emollick/status/2043354298650702101

not hot take🧊 the thin vs thick harness debate is pretty useless and completely misses the nuance of working backwards from a real goal when we build agents the obvious answer is that it all depends on what you’re building! there’s no end all be all principle this is why we
https://x.com/Vtrivedy10/status/2044130977526755636

Project Think is here. It’s the next generation of the @CloudflareDev Agents SDK with everything from lightweight primitives to a full suite of tools that you can use to build long-running agents. Durable execution, sub-agents, persistent sessions, sandboxed code execution, a built-in
https://x.com/aninibread/status/2044409784133103724

Suuuper excited to be shipping this one with the team! You can now control your GitHub Copilot CLI sessions from your phone
https://x.com/tiagonbotelho/status/2043720370734104923

Teleport Beams — Trusted Runtimes for Infrastructure Agents
https://www.beams.run/

The degree to which you are awed by AI is perfectly correlated with how much you use AI to code.
https://x.com/staysaasy/status/2042063369432183238

The wait is over. Cloudflare Email Service is now in public beta 📧 Send and receive emails directly from Workers or REST API with global delivery on Cloudflare’s network And just in time for you to build email agents with the Agents SDK!
https://x.com/thomasgauvin/status/2044766954032951792

this is a fundamental building block for `deepagents deploy` we’re designing a memory layer built for multi-tenant systems, so memory can be scoped to a user, agent, or organization please dm me if this resonates and you have a use case!
https://x.com/sydneyrunkle/status/2044099832319500484

today project Think is officially out! we bet on agents that run non-stop, survive failures, cost nothing when idle, and enforce security through architecture agents that any developer can build and deploy agents that have sub-agents via Facets, Session API and full
https://x.com/whoiskatrin/status/2044415568627847671

Two major shifts will be seen in Agentic AI after Harness and YOU MUST KNOW. 1. Workflow design of your agents matters a lot more than any frontier model selection. Till now we have mostly focused on chasing leaderboard models and burning money on frontier models. LLMs have a
https://x.com/kmeanskaran/status/2044010500816810427

We built a new task to test AI research capabilities! Agents asked to use @tinkerapi from @thinkymachines to train a model on logic games. That involves writing full training pipeline, running experiments across recipes, and submitting the best model.
https://x.com/ThoughtfulLab_/status/2044881989262803380

We just released code for Meta-Harness!
https://t.co/OdU7zocdPl Aside from replicating paper experiments, the repo is designed to help users implement good Meta-Harnesses in completely new domains! Just point your agent at ONBOARDING.md and have a conversation
https://x.com/yoonholeee/status/2044442372864700510

We just shipped “Git for agents”. Turns out agents are really good at working with Git, but existing source control platforms weren’t built for the volume of commits we’re seeing now. Create tens of millions of repos. Use them from any Git client.
https://x.com/elithrar/status/2044767190834991490

We’ve just launched Artifacts: Git-compatible versioned storage built for agents.
https://x.com/Cloudflare/status/2044766515065499957

We’ve shipped several quality-of-life improvements to Cursor 3. They bring a little more delight when you are orchestrating agents. Just like in your terminal, you can now split agents for multi-tasking in Cursor.
https://x.com/cursor_ai/status/2043798784367546707

when you take agents to production, you need to think about guardrails we provide 2 abstractions for guardrails 1. middleware provides hooks around the agent loop that you can use to handle retries, errors, and application specific guards (like PII redaction) 2. filesystem
https://x.com/sydneyrunkle/status/2043767032361967751
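The middleware hooks described above can be sketched generically. This is a minimal illustration of hooks around an agent loop; all function names here are hypothetical, not the actual deepagents API:

```python
import re

def pii_redaction_middleware(agent_step):
    """Hypothetical hook: scrub email addresses from agent output."""
    def wrapped(prompt):
        result = agent_step(prompt)
        # Redact anything that looks like an email before it leaves the loop.
        return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[REDACTED]", result)
    return wrapped

def retry_middleware(agent_step, max_attempts=3):
    """Hypothetical hook: retry transient failures around the loop."""
    def wrapped(prompt):
        for attempt in range(max_attempts):
            try:
                return agent_step(prompt)
            except RuntimeError:
                if attempt == max_attempts - 1:
                    raise
    return wrapped

# Stub "agent" standing in for a real model call.
def agent_step(prompt):
    return f"Contact alice@example.com about: {prompt}"

guarded = retry_middleware(pii_redaction_middleware(agent_step))
print(guarded("invoice 42"))  # the email address is redacted
```

The composition order matters: here redaction runs inside the retry wrapper, so every retried attempt is also scrubbed.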

Windsurf 2.0 adds Devin and Agent Command Center
https://www.testingcatalog.com/windsurf-2-0-adds-devin-and-agent-command-center/

Windsurf 2.0: Introducing the Agent Command Center and Devin in Windsurf
https://windsurf.com/blog/windsurf-2-0

Your warehouse. Your agent. Live data apps.
https://motherduck.com/quack-query-dives/

An experimental voice pipeline for the Agents SDK enables real-time voice interactions over WebSockets. Developers can now build agents with continuous STT and TTS in just ~30 lines of server-side code.
https://x.com/Cloudflare/status/2044423032265957872
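A hedged sketch of what one such round trip could look like, with stand-in STT, agent, and TTS functions in place of the real SDK calls (the actual pipeline streams audio frames over a WebSocket rather than passing byte strings):

```python
# Stand-ins for the real speech-to-text, agent, and text-to-speech calls.
def fake_stt(audio_chunk: bytes) -> str:
    return audio_chunk.decode("utf-8")   # pretend the bytes are speech

def fake_agent(text: str) -> str:
    return f"echo: {text}"               # pretend this is a model reply

def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")          # pretend the bytes are audio

def handle_voice_message(audio_chunk: bytes) -> bytes:
    """One round trip: incoming audio -> transcript -> agent reply -> audio out."""
    transcript = fake_stt(audio_chunk)
    reply = fake_agent(transcript)
    return fake_tts(reply)

print(handle_voice_message(b"hello agent"))
```

In a real deployment each stage would run continuously on the message handler of the WebSocket connection, streaming partial transcripts and audio as they arrive.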

You can now add voice to your agent using Agents SDK:
https://t.co/bb29zIHvEt Voice is just another input — you can use the same WebSocket connection your Durable Object uses to transmit audio. So much fun working with @threepointone on this
https://x.com/korinne_dev/status/2044441427736936510

Agent evals are drifting away from production reality. Most benchmarks use clean tasks, well-specified requirements, deterministic metrics, and retrospective curation. Production work is messier, with implicit constraints, fragmented multimodal inputs, undeclared domain
https://x.com/dair_ai/status/2044773323914322393

Doing my “large codebase modernization” bench. Cooked for 32 minutes. Looking reasonable so far but it missed the changes to the Link component in Next.js (almost everything has missed this to be fair)
https://x.com/theo/status/2044907295205961806

Introducing FrontierSWE, an ultra-long horizon coding benchmark. We test agents on some of the hardest technical tasks like optimizing a video rendering library or training a model to predict the quantum properties of molecules. Despite having 20 hours, they rarely succeed
https://x.com/MatternJustus/status/2044876224896565679

Scaling to ultra-long horizon agents requires novel benchmarks and RL environments. FrontierSWE by @ProximalHQ is exactly that: 11h average runtime, open-ended tasks like end-to-end model optimization, and frontier agents fail almost all of them. We co-designed granite_inf,
https://x.com/vincentweisser/status/2044923733048222197

Turns out we can get SOTA on agentic benchmarks with a simple test-time method! Excited to introduce LLM-as-a-Verifier. Test-time scaling is effective, but picking the “winner” among many candidates is the bottleneck. We introduce a way to extract a cleaner signal from the
https://x.com/Azaliamirh/status/2043813128690192893
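The selection step can be illustrated as a simple best-of-n loop where a verifier model scores each candidate and the highest-scoring one wins. The model and verifier below are stubs for illustration, not the paper's actual method:

```python
def best_of_n(task, model, verifier, n=8):
    """Sample n candidates, then keep the one the verifier scores highest."""
    candidates = [model(task, seed=i) for i in range(n)]
    return max(candidates, key=lambda c: verifier(task, c))

# Stub generator: returns distinct answers keyed by sampling seed.
def stub_model(task, seed):
    return f"answer-{seed}"

# Stub verifier: assigns a 0-1 quality score; here it prefers answer-3.
def stub_verifier(task, candidate):
    return 1.0 if candidate == "answer-3" else 0.1

print(best_of_n("sort a list", stub_model, stub_verifier))  # → answer-3
```

The bottleneck the tweet points at lives in `stub_verifier`: with a noisy scorer, scaling n stops helping, which is why extracting a cleaner verification signal matters.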

we just shipped Kernels, it’s a new repo at @huggingface 💚 it allows for packaging and distribution of optimized kernels 🔥 vibe-optimize Kernels, benchmark gains and share them on Hub 🫵
https://x.com/mervenoyann/status/2044080953648128073

We partnered with @ProximalHQ to run five frontier coding agents on a hard task: rebuild the full Wan 2.1 text-to-video pipeline on MAX (no PyTorch, no diffusers) in 20 hours as part of their new Frontier-SWE benchmark. Two nearly pulled it off. Every model understood the
https://x.com/Modular/status/2044879525881024968

Current frontier models are increasingly saturating common AI benchmarks. Are they still useful? We think benchmarks remain important, but they can both over- and understate AI capabilities. To better survey this space, the field is turning to a new paradigm: open-world evals.
https://x.com/steverab/status/2044852672562426216

I’m pleased to share that our search team has open sourced an embedding model called Harrier that is currently ranking #1 on the multilingual MTEB-v2 benchmark leaderboard. Harrier delivers SOTA performance on retrieval quality, semantic matching, and contextual analysis across
https://x.com/JordiRib1/status/2041550352739164404

Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents
https://huggingface.co/blog/ibm-research/vakra-benchmark-analysis

Our latest Live model is #1 on Tau Voice Bench! Excited to see this new frontier of voice models cross the chasm of usability in production.
https://x.com/OfficialLoganK/status/2042672082425712935

significant improvement on coding and agentic benchmarks. better at computer vision and a new xhigh mode
https://x.com/dejavucoder/status/2044786310746186094

We’re open sourcing the first document OCR benchmark for the agentic era, ParseBench. Document parsing is the foundation of every AI agent that works with real-world files. ParseBench is a benchmark that measures parsing quality specifically for agent knowledge work: ✅ It
https://x.com/jerryjliu0/status/2043721536922955918

We gave an AI a 3 year retail lease in SF and asked it to make a profit | Andon Labs
https://andonlabs.com/blog/andon-market-launch

Long-running agents are the future – we’re excited to partner with OpenAI as a sandboxing partner for their new Agents SDK launch! Get started:
https://x.com/CloudflareDev/status/2044467412607901877

Migrate a Legacy Codebase with Sandbox Agents
https://developers.openai.com/cookbook/examples/agents_sdk/sandboxed-code-migration/sandboxed_code_migration_agent

OpenAI has purchased access to the FrontierMath: Open Problems verifiers. This allows them to check the validity of solutions their models generate. Thread with details.
https://x.com/EpochAIResearch/status/2044227029978284471

@buddyhadry Sending vulnerability reports for the QA Lab code we aren’t shipping in prod at all? Come on. Send a PR instead if this is relevant for your setup.
https://x.com/steipete/status/2044418128130806085

Anyone here who wants to help with WhatsApp CLI? It needs love, and I can’t focus on it right now.
https://x.com/steipete/status/2042684707683365227

GUYS WE FOUND THE GUY WHO BUILT THE GITHUB MCP SERVER
https://x.com/steipete/status/2042214825405661677

OH: Almost everyone at RedHat uses Macs now.
https://x.com/steipete/status/2042168766826516833

once again, I’m amazed by scammers.
https://x.com/steipete/status/2044390067997937716

raising lobsters at @aiDotEngineer
https://x.com/steipete/status/2042153429556933043

Send all your ClosedClaw questions!
https://x.com/steipete/status/2042184777575367117

That was the case in December. 4 months and thousands of work hours later, we have a great security concept; you can go all yolo, use a sandbox (Docker or OpenShell), there are allow-lists and per-access exec allow/deny prompts. There are hundreds of security researchers that
https://x.com/steipete/status/2044482797449150520

They grow up so fast 🦞
https://x.com/steipete/status/2043313467512463713

The RAG era was short-lived, but intense. (Not that RAG is not useful, but it is no longer the dominant paradigm for supplying context to agents)
https://x.com/emollick/status/2040094108853600646

Evaluating agents for scientific discovery | Ai2
https://allenai.org/blog/evaluating-scientific-discovery-agents

[2604.08407] Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain
https://arxiv.org/abs/2604.08407

// Artifacts as Memory Beyond the Agent Boundary // An agent doesn’t always need a bigger memory buffer. Sometimes the environment itself remembers on the agent’s behalf. New research formalizes this intuition mathematically for the first time. The work introduces a formal
https://x.com/dair_ai/status/2044066936045351317

// Multi-User LLM Agents // Every agent framework assumes one user giving instructions. But deploy an agent into a team workflow, and suddenly it has multiple bosses with conflicting goals, private information, and different authority levels. This work formalizes multi-user
https://x.com/omarsar0/status/2044067923787165799

🚀 deepagents 0.5 release 👉 Async subagents – kick off background tasks on any Agent Protocol backed server while you continue to interact with the main agent. Start multiple background tasks in parallel, keep the conversation going, and collect results as they come in. Tasks
https://x.com/LangChain/status/2044086454230626733
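The pattern described above, starting background tasks in parallel while the main conversation keeps going, can be sketched with plain asyncio. This is an illustration only, not the deepagents API; the `subagent` function is a hypothetical stand-in for a remote agent call.

```python
import asyncio

async def subagent(task: str) -> str:
    """Hypothetical stand-in for a background subagent call."""
    await asyncio.sleep(0.01)  # simulate remote work
    return f"result for {task!r}"

async def main() -> list[str]:
    # Kick off several background tasks in parallel...
    background = [asyncio.create_task(subagent(t))
                  for t in ("research", "summarize", "review")]
    # ...the main conversation could continue here, then
    # collect the results as the background tasks finish.
    return [await t for t in background]

results = asyncio.run(main())
print(results)
```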

3 months ago I started building a coding agent that runs in the cloud. It’s since written every line of code I’ve shipped, including itself. Today, I’m open sourcing it. Introducing Open Agents.
https://x.com/nicoalbanese10/status/2043745569278251112

Agent Lee is an in-dashboard agent that shifts Cloudflare’s interface from manual tab-switching to a single prompt. Using sandboxed TypeScript, it helps you troubleshoot and manage your stack as a grounded technical collaborator.
https://x.com/Cloudflare/status/2044406215208316985

As AI agents accelerate coding, what is the future of software engineering? Some trends are clear, such as the Product Management Bottleneck, referring to the idea that we are more constrained by deciding what to build rather than the actual building. But many implications, like
https://x.com/AndrewYNg/status/2043742105852621052

copilot --remote Take your coding agent session with you anywhere!
https://x.com/pierceboggan/status/2043717775265562701

hermes-lcm v0.2.0 is out! Lossless context management for Hermes Agent — every message persisted, hierarchical DAG summaries, agent tools to drill back into anything that was compacted. No more lossy flat summaries. What’s new since launch: – 6 agent tools (grep, describe,
https://x.com/SteveSchoettler/status/2043870709613768820

Humwork A2P marketplace connects AI agents with experts
https://www.testingcatalog.com/humwork-a2p-marketplace-connects-ai-agents-with-experts/

I am more and more convinced that this is the future of software development UI. @cursor_ai is the closest in my opinion: a list of work you’re working on in parallel, the agent in the middle, and most importantly, the thing you’re building on the right. Because you want to see what
https://x.com/kieranklaassen/status/2044108436087157220

I’m noticing some really big shifts in how AI models start to handle memory. @ECNUER and others introduced Memory Intelligence Agent (MIA), which highlights the importance of storing the whole problem-solving journey – how to perform tasks. It turns memory into something closer
https://x.com/TheTuringPost/status/2042386614568325404

les fucking go… for agents to push kernels to the hub, do: > pip install kernels > kernels skills add > <start agent> > “write an RMSNorm kernel for h100 and push to Hugging Face Hub” bam, you are a kernel author!
https://x.com/ben_burtenshaw/status/2044114277745807684

Long-horizon AI research agents are mostly a state-management problem. It is not enough for an agent to reason well in the next turn. ML research requires task setup, implementation, experiments, debugging, and evidence tracking over hours or days. This new paper introduces
https://x.com/omarsar0/status/2044436099121209546

Most AI assistants wait for you to ask. But a truly useful agent should notice you need help before you say anything. New research takes a serious shot at building proactive agents that work in real time. The work introduces PASK with three components: IntentFlow for streaming
https://x.com/dair_ai/status/2044145437456904438

Redesigning the Service Role for the AI Agent Era
https://www.asapp.com/webinars/redesigning-the-service-role-for-the-ai-agent-era

Speeding up GPU kernels by 38% with a multi-agent system · Cursor
https://cursor.com/blog/multi-agent-kernels

The 80/20 of multi-agent teams for non-technical people: Stop making one AI agent do everything. Build a team of 4: 1. Orchestrator: plans the work, routes tasks, synthesizes results 2. Researcher: gathers sources, verifies claims, flags uncertainty 3. Writer: turns raw
https://x.com/coreyganim/status/2043627229205193211
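The orchestrator/specialist split described above can be sketched in a few lines. This is my own toy illustration of the routing idea, not the author's setup; the role functions are hypothetical stand-ins for LLM calls, and the fourth role is elided in the excerpt.

```python
# Each specialist is a hypothetical stand-in for a scoped LLM call.
def researcher(topic: str) -> str:
    # Gathers sources and verifies claims for the topic.
    return f"sources on {topic}"

def writer(notes: str) -> str:
    # Turns raw research notes into a draft.
    return f"draft based on {notes}"

def orchestrator(topic: str) -> str:
    # Plans the work, routes each step to a specialist,
    # and synthesizes the final result.
    notes = researcher(topic)
    return writer(notes)

print(orchestrator("agent security"))
```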

The crazy part? This was done (nearly) fully autonomously! Only 8 prompts from the human in the loop. Just a Hermes agent, a skill, and a dream. 🐉 I told my AI agent “use obliteratus to find the best way to get the guardrails off Gemma 4 E4B” It loaded the OBLITERATUS skill
https://x.com/elder_plinius/status/2044462515443372276

Love seeing this open-sourced. Had a great chat with @nicoalbanese10 some weeks ago where he hinted at something like this. Great reference architecture for cloud coding agents. Open Agents gives you the full stack: UI, auth, workflows, sandbox. #DeepAgent from @LangChain takes
https://x.com/bromann/status/2043886229650067729

It’s amazing how precisely AI summarizes emails lately.
https://x.com/TheTuringPost/status/2042433286312751412

A lot of our education on writing well focuses on logic, clarity, and argument. AI will force us to think more about style. The boredom that comes from everything on the internet reading Claude-y now, no matter how good the substance is, should make us appreciate variety more.
https://x.com/emollick/status/2042963501199597950

All | Search powered by Algolia
https://hn.algolia.com/?dateRange=all&page=0&prefix=false&query=claude+down&sort=byPopularity&type=story

Anthropic Mythos AI Rollout Coming to US Agencies – Bloomberg
https://www.bloomberg.com/news/articles/2026-04-16/white-house-moves-to-give-us-agencies-anthropic-mythos-access

Anthropic: Claude quota drain not caused by cache tweaks • The Register
https://www.theregister.com/2026/04/13/claude_code_cache_confusion/

Anthropic’s Mythos seeded some panic and added anxiety (@matthewberman, I’m looking at you ;), but let’s think a little and calmly discuss what it means in the short term and in the long term. Let me know your thoughts
https://x.com/TheTuringPost/status/2042363395962274075

Anthropic’s random system prompt blockers are getting weirder and weirder.
https://x.com/steipete/status/2042537771865104653

Claude Code is redesigning the IDE for agentic coding. As Andrej said: “We’re going to need a bigger IDE. The basic unit is not a file, but an agent.” Cursor now has to fight to define that future of IDE too.
https://x.com/Yuchenj_UW/status/2044133573326934384

Claude Mythos #2: Cybersecurity and Project Glasswing
https://thezvi.substack.com/p/claude-mythos-2-cybersecurity-and

Coding agents are such game-changers for linux. For almost anything that doesn’t work, in the past I would have spent the afternoon, or even whole weekend, scouring forums, trying many many things, before fixing it or giving up. Now I just point codex and claude at it (and,
https://x.com/giffmana/status/2043401612035559445

Currently, ChatGPT has the best way of viewing thinking traces: a short summary of steps in the main window, and a detailed audit in the sidebar if you want it. Claude does almost as well, but more summarized and harder to see calculations and code. It’s a big weak spot for Gemini
https://x.com/emollick/status/2043408661603594740

Given the messy naming scheme used by all the AI companies, I caused a chart to be made showing the gain in GPQA per 0.1 version in model names (estimated, since model names skip version numbers). There has never been a more misnamed model than Claude 3.7, should have been 4.4.
https://x.com/emollick/status/2044200225653326269

ICYMI — `deepagents deploy` is an open alternative to claude managed agents!
https://x.com/LangChain/status/2044097913698091496

It looks like everyone is finally catching up with the fact that agent sessions in CLI mode can only get you so far. It makes sense that the new Codex app, Cursor, and Claude Code (desktop) feel and look pretty similar now. This UI convergence is not an accident. This is a
https://x.com/omarsar0/status/2044172949003911532

Jensen Huang on Anthropic, OpenAI, China, and demand for inference tokens
https://davefriedman.substack.com/p/jensen-huang-on-anthropic-openai

OpenAI should probably bite the bullet and just name their next set of models something more human sounding. Everyone anthropomorphizes their AIs anyway, and “Claude” is an easier name to refer to than ChatGPT. Also easier to make a gerund, “Clauding,” or adjective, “Claude-y.”
https://x.com/emollick/status/2043190951632404760

We conducted cyber evaluations of Claude Mythos Preview and found that it is the first model to complete an AISI cyber range end-to-end. 🧵
https://x.com/AISecurityInst/status/2043683577594794183

Anthropic asked Christian leaders for advice on Claude’s moral future – The Washington Post
https://www.washingtonpost.com/technology/2026/04/11/anthropic-christians-claude-morals/

Distilled recap of the back-and-forth with Jensen on export controls: Dwarkesh: Wouldn’t selling Nvidia chips to China enable them to train models like Claude Mythos with cyber offensive capabilities that would be threats to American companies and national security? Jensen:
https://x.com/dwarkesh_sp/status/2044483393941848131

Just shipped **artifact-preview** for Hermes 🔥 Like Claude Artifacts, build dashboards, games, UIs, get a full interactive preview that instantly opens in a live browser. Real clickable code, smooth refreshes on prompt edits. cc @Teknium
https://x.com/ChuckSRQ/status/2044504539978465658

Jensen regrets that when Anthropic and OpenAI first needed billions to scale, Nvidia wasn’t in a position to invest. So these labs went to hyperscalers like Microsoft, Google, and Amazon instead, and in return committed to using their compute. “I’m not going to make that same
https://x.com/dwarkesh_sp/status/2044498492450869624

Qwen 3.6 is here, and open-source! Run it locally with improved agentic coding capabilities. Try it with Claude Code: ollama launch claude --model qwen3.6 Try it with OpenClaw: ollama launch openclaw --model qwen3.6 Run it: ollama run qwen3.6
https://x.com/ollama/status/2044779844672852465

A few weeks ago, it was common to hear people argue one should use agents to replace dependencies for security reasons. In light of the Mythos news, the math changes. Using an OSS lib that’s had tens of thousands of $ of agentic hardening is likely optimal.
https://x.com/dbreunig/status/2043762702653460520

I am catching glimpses in my feed that there is a backlash against Mythos as “marketing hype,” and it is a little confusing. I don’t think anyone who has used the latest agentic coding tools would think that expecting large-scale cybersecurity implications of increasingly good
https://x.com/emollick/status/2043516250081407422

Marcus Hutchins, the guy famous for stopping the WannaCry Ransomware, probably has the best take on Mythos doing vulnerability research
https://x.com/ananayarora/status/2043381424594837789

The Mythos Threshold – Joe Reis
https://joereis.substack.com/p/the-mythos-threshold

What I learned this week – Pretraining parallelisms, Can distillation be stopped, Mythos and the cybersecurity equilibrium, Pipeline RL, On why pretraining runs fails
https://www.dwarkesh.com/p/what-i-learned-april-15

2 prompts deep into Opus 4.7 and benchmarks don’t do it justice. Way better behavior and instruction following. Pretty massive improvement in actual usage.
https://x.com/mweinbach/status/2044801022439137566

3. Tell the model how to verify its changes. Put your testing workflow in your claude.md, or add a /verify-app skill. Opus 4.7 is better at verifying its work, and it’s helpful to share any local dev tips that are hard to discover.
https://x.com/_catwu/status/2044808538351100377

after ~10 million tokens, Mythos is much more efficient than other models: it reaches the same performance as Opus with ~40% of the tokens
https://x.com/scaling01/status/2043700788245963167

Claude Opus 4.7 is now available as an Agent Preview inside of Devin! Anthropic has clearly optimized Claude Opus 4.7 for long-horizon autonomy, unlocking a class of deep investigation work we couldn’t reliably run before. Claude Opus 4.7 model costs within Devin will be
https://x.com/cognition/status/2044844661076902082

Claude Opus 4.7 is now available in Cursor. We’ve found it to be impressively autonomous and more creative in its reasoning. We’re launching it with 50% off for a limited time. Enjoy!
https://x.com/cursor_ai/status/2044785960899236341

Claude Opus 4.7 is out! Handles ambiguous, multi-step work even better than 4.6. Cursor’s internal bench cleared 70%, up from 58% on 4.6. Notion saw a 14% lift on their evals with a third of the tool errors 🔨
https://x.com/mikeyk/status/2044802045186846912

Claude Opus 4.7 is out. the TL;DR Anthropic released Opus 4.7 today. Same pricing as 4.6 ($5/$25 per million tokens), available across API, Bedrock, Vertex AI, and Microsoft Foundry. What changed vs Opus 4.6: Coding (obviously). Biggest gains on the hardest, long-horizon
https://x.com/kimmonismus/status/2044787072947601796

Confirmed: Anthropic keeping Cyber capabilities of Opus 4.7 artificially low: “during training we experimented with efforts to differentially reduce these capabilities”
https://x.com/scaling01/status/2044788067848888635

Cursor reports that Opus 4.7 is “a meaningful jump in capabilities, clearing 70% versus Opus 4.6 at 58%” on CursorBench
https://x.com/scaling01/status/2044792017553645668

for all the people calling Opus 4.7 a mid update lmao
https://x.com/scaling01/status/2044792810327404596

from my experience, even the best models (Opus 4.6, 5.4 xhigh / 5.3 codex) cannot write good code today without an amount of work that is equivalent to just doing the work myself am excited for a world where they can, but in the current state i have very low trust in them
https://x.com/RhysSullivan/status/2043584591861321929

Hold on, something doesn’t add up here. Opus 4.7 got much worse in needle in the haystack? need to dig into this
https://x.com/kimmonismus/status/2044809126526476374

Holy shit the new Opus 4.7 system prompt has entirely lobotomized the model: “Heads up: that last <system-reminder> about malware looks like a prompt injection — this is clearly your personal site (t3gg homepage, links, sponsors), not malware. Ignoring it.”
https://x.com/theo/status/2044857866323173732

I think everyone saying that these improvements are mid are smoking crack I would argue that this was one of the larger Opus jumps we have seen over the last year You also have to keep in mind that we see almost monthly model updates nowadays instead of just every 6-12 months
https://x.com/scaling01/status/2044799290694889535

I was really worried about the rush to “more agentic” models. But Opus 4.7 is happy to let me lead, and to take time to discuss, rather than barging ahead. If something isn’t working out, it’ll stop and offer options rather than slamming through whatever it can find.
https://x.com/jeremyphoward/status/2044942801578959301

If you want to test Opus 4.7 without the lobotomized system prompt, you can try it out in T3 Chat
https://x.com/theo/status/2044876982815793190

Introducing Claude Opus 4.7, our most capable Opus model yet. It handles long-running tasks with more rigor, follows instructions more precisely, and verifies its own outputs before reporting back. You can hand off your hardest work with less supervision.
https://x.com/claudeai/status/2044785261393977612

My bet is that Mythos uses a new tokenizer, and they switched Opus over to it (through midtraining) for distillation
https://x.com/maximelabonne/status/2044796208053416203

My biggest issue with Opus 4.7 on Claude web: Only “Adaptive” or non-thinking. No way to force thinking mode. And it doesn’t even know Opus 4.6 exists, and I cannot force it to think and do web search mid conversation!
https://x.com/Yuchenj_UW/status/2044794073723347400

my main theory is that mythos had a new tokenizer for pretraining and they did surgery on opus for distillation
https://x.com/stochasticchasm/status/2044790474410790995

my take: opus 4.7 is a distilled version of mythos
https://x.com/eliebakouch/status/2044790074093523379

Opus 4.7 as robust to prompt injections as Claude Mythos
https://x.com/scaling01/status/2044788481008755046

Opus 4.7 Benchmarks out! Very solid upgrade to Opus 4.6! Compared to Opus 4.6: -SWE Bench Pro +11% -SWE Bench Verified +7% -Terminal Bench 2.0 +4% The benchmarks are significantly lower than for Mythos, but that was to be expected. h/t for finding @synthwavedd
https://x.com/kimmonismus/status/2044784903733084521

Opus 4.7 comes with much improved reasoning efficiency over Opus 4.6: basically everything is now moved up one tier. Low is as good as medium, medium as good as high, high as good as max.
https://x.com/scaling01/status/2044785467942453698

Opus 4.7 deleting all long-context gains from Opus 4.6 lol
https://x.com/scaling01/status/2044791314898723179

Opus 4.7 has a new tokenizer. This means it’s also a new base model. Glory days of pretraining still very much going.
https://x.com/natolambert/status/2044788470179332533

opus 4.7 is here on claude platform / app
https://x.com/dejavucoder/status/2044784097378316327

Opus 4.7 is live in Claude Code today! The model performs best if you treat it like an engineer you’re delegating to, not a pair programmer you’re guiding line by line. Here are three workflow shifts we recommend for this model 🧵
https://x.com/_catwu/status/2044808533905178822

Opus 4.7 is now available in @MagicPathAI. From our early testing, the model is really strong at long tasks when design requires lots of changes, image-to-code, and overall produces cleaner, more reusable React components.
https://x.com/skirano/status/2044804877696516442

Opus 4.7 is WORSE than 4.6 on Long Context?
https://x.com/nrehiew_/status/2044795171213291614

Opus 4.7 much less likely to sudo rm -rf (taking destructive actions in production envs)
https://x.com/scaling01/status/2044789371837001779

Opus 4.7 uses a different tokenizer from Opus 4.6. So either: – Anthropic has a way to change tokenizer between finetunes – It is just new special tokens, which implies they use special tokens liberally within messages and not just as part of the chat template
https://x.com/nrehiew_/status/2044792314825228690

Opus 4.7 uses more thinking tokens, so we’ve increased rate limits for all subscribers to make up for it. Enjoy!
https://x.com/bcherny/status/2044839936235553167

Opus is going to be a bioweapon risk at this pace
https://x.com/scaling01/status/2044785139905913077

Some of my favorite things in Opus 4.7: – Very good at async work and following instructions – Effort levels are far more predictable for token control (+ new xhigh level) – No more downscaling of high-res images – Noticeably more taste in UIs, slides, docs
https://x.com/alexalbert__/status/2044788914813292583

Unfortunately they didn’t include a chart for GraphWalks scores: Opus 4.6 – 38.7% Opus 4.7 – 58.6% This would make clearer that long-context didn’t suffer as much as MRCR suggests.
https://x.com/scaling01/status/2044823423013020088

wait why is there an INSANE gap on long context benchmarks between opus 4.6 and 4.7??? this is crazy
https://x.com/eliebakouch/status/2044798168211100096

We’ve set the default effort level for Opus 4.7 to xhigh in Claude Code. You can use /effort to adjust this. Excited for you to try Claude Code with Opus 4.7 and let us know your feedback!
https://x.com/_catwu/status/2044808539663978970

Shocking result on my pelican benchmark this morning, I got a better pelican from a 21GB local Qwen3.6-35B-A3B running on my laptop than I did from the new Opus 4.7! Qwen on the left, Opus on the right
https://x.com/simonw/status/2044830134885306701

@stochasticchasm yeah they tend to forget that releases are now monthly, not bi-annual
https://x.com/scaling01/status/2044795960224592329

Anthropic Changes Pricing to Bill Firms Based on AI Use as Demand Jumps — The Information
https://www.theinformation.com/articles/anthropic-changes-pricing-bill-firms-based-ai-use-amid-compute-crunch

Anthropic introduced xhigh reasoning effort
https://x.com/scaling01/status/2044785557058814059

Anthropic loses Claude Code trust in black-box fight
https://www.implicator.ai/claude-probably-wasnt-secretly-nerfed-anthropic-made-the-black-box-too-dark/

Anthropic tests Claude Code upgrade to rival Codex Superapp
https://www.testingcatalog.com/anthropic-tests-claude-code-upgrade-to-rival-codex-superapp/

anthropic? you mean the greedy token guzzler company?
https://x.com/dejavucoder/status/2044798065530528061

every engineer at anthropic has been using mythos for ~1.5 months. meanwhile, their uptime is horrendous, claude code still has rendering bugs, etc. one could conclude that it won’t be the end of software engineering.
https://x.com/benhylak/status/2042051048261722467

GitHub reports similar improvements
https://x.com/scaling01/status/2044792459125834029

OpenAI has released a plugin that lets you call Codex directly within Anthropic’s Claude Code environment It turns Claude Code into a multi-agent setup with Codex as a specialized coding assistant This gives you: – High-quality code reviews – Delegation of real tasks
https://x.com/TheTuringPost/status/2044561927905677558

So we now have a pretty good picture of the state of the frontier AI model makers. US closed source models continue to lead. Google, OpenAI, and Anthropic stand well ahead of the pack, and may have signs of recursive self-improvement. xAI has fallen from frontier status for now
https://x.com/emollick/status/2042088011748290750

The pace at which Anthropic is shipping Opus variants is a very new thing in the industry.
https://x.com/_arohan_/status/2044791678180167804

The pace at which useful things are shipping also seems to be accelerating. Model releases are coming faster, of course, but so are significant application and enterprise products (especially from Anthropic). Almost certainly faster than the market can track or absorb information
https://x.com/emollick/status/2042434850003534077

we were literally stuck at 80% SWE-Bench Verified for months and just jumped to almost 90% and you guys call it mid …
https://x.com/scaling01/status/2044790717722034511

Yeah folks, it’s gonna be harder in the future to ensure OpenClaw still works with Anthropic models.
https://x.com/steipete/status/2042615534567457102

Excited to share that the Gemini API now has prepaid billing, rolled out to start for US customers!! We have been working hard across Google to enable this. It’s the default for new API users and existing users can opt in via a new billing account, all directly in AI Studio.
https://x.com/OfficialLoganK/status/2044516262152442315

Google prepares rollout of Skills for Gemini and AI Studio
https://www.testingcatalog.com/google-prepares-broader-rollout-of-skills-for-gemini-and-ai-studio/

Introducing Tab Tab Tab, our new prompt auto complete engine in @GoogleAIStudio’s vibe coding experience. Now when you show up with your fuzzy ideas, you can rely on Gemini to fill in the blanks : )
https://x.com/OfficialLoganK/status/2043752712127611201

Personal Intelligence in Gemini is expanding to more people globally. 🌏 Google AI Ultra, Pro, and Plus subscribers around the world can access the feature starting today, with a rollout to free users coming soon. More information on where Personal Intelligence is available:
https://x.com/GeminiApp/status/2044430579996020815

We’re bringing Personal Intelligence to more users around the world in the @GeminiApp starting today, followed by Gemini in @googlechrome later this week. 🌍 Now even more people can securely connect the dots across their favorite Google apps — like @Gmail and @GooglePhotos — to
https://x.com/Google/status/2044437335425564691

We estimate that Gemini 3.1 Pro with thinking level `high` has a 50%-time-horizon of around 6.4 hrs (95% CI of 4 hrs to 12 hrs) on our suite of software tasks.
https://x.com/METR_Evals/status/2044463380057194868

I was chatting with my buddy at Google, who’s been a tech director there for about 20 years, about their AI adoption. Craziest convo I’ve had all year. The TL;DR is that Google engineering appears to have the same AI adoption footprint as John Deere, the tractor company. Most of
https://x.com/Steve_Yegge/status/2043747998740689171

We’re also launching a library of ready-to-use Skills for common tasks and workflows. You can save these Skills to your own library, and even customize them to better fit your needs by updating the prompt.
https://x.com/Google/status/2044106380882166040

“Memory Caching: RNNs with Growing Memory” – Google’s new paper proposes a simple way to give recurrent models a memory that grows with sequence length. So instead of forcing an RNN to compress the full past into 1 fixed hidden state, it caches memory checkpoints across
https://x.com/askalphaxiv/status/2043782770657219010
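The core idea, caching hidden-state checkpoints so memory grows with sequence length instead of staying fixed, can be sketched for a toy RNN. This is my own illustration under assumed details, not the paper's method; the weights, interval, and sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4)) * 0.1  # toy recurrent weights

def rnn_with_checkpoints(xs: np.ndarray, every: int = 8):
    """Run a toy RNN, caching a copy of the hidden state every
    `every` steps so memory grows with sequence length."""
    h = np.zeros(4)
    checkpoints = []
    for t, x in enumerate(xs, start=1):
        h = np.tanh(W @ h + x)        # standard recurrent update
        if t % every == 0:
            checkpoints.append(h.copy())  # growing memory cache
    return h, checkpoints

xs = rng.standard_normal((32, 4))
h, cache = rnn_with_checkpoints(xs)
print(len(cache))  # 32 steps / checkpoint every 8 -> 4 cached states
```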

NEW Research from Google. Integration test failures are painful because the signal is buried in messy logs. Massive output, heterogeneous systems, low signal-to-noise ratio, and unclear root causes. This paper introduces Auto-Diagnose, an LLM-based tool deployed inside Google’s
https://x.com/omarsar0/status/2044769798845079665

4 reasons Gemma 4’s architecture runs efficiently on your hardware: 1. Local + global attention structure 4 or 5 local layers + 1 final global layer to preserve the context understanding 2. Special optimizations for global attention: – 8 query heads per KV head in Grouped Query
https://x.com/TheTuringPost/status/2043086456412082356

So much in this release but the one many have been waiting for above the rest, the GUI dashboard! Manage and monitor your Hermes Agent with a GUI Local Web Dashboard with `hermes dashboard` command to start it!
https://x.com/Teknium/status/2043771509123232230

As we develop more capable models at the frontier, MSL is committed to safety and preparedness for AI. To demonstrate this commitment, we will be publishing preparedness reports for our models, in line with our new Advanced AI Scaling Framework. See our Muse Spark report below:
https://x.com/alexandr_wang/status/2044454230614999441

check out Contemplating mode for your most complex reasoning queries!
https://x.com/alexandr_wang/status/2043177308803215811

cool to see people finding new emergent capabilities within Muse Spark!
https://x.com/alexandr_wang/status/2042360886195581330

honestly I didn’t even know our model could do some of these
https://x.com/alexandr_wang/status/2042805863979626574

i find muse spark is very good at data analysis–both finding relevant open-source data and analyzing it. for example, here are my results for analyzing global share of GDP over past century:
https://x.com/alexandr_wang/status/2043432483006615806

Meta AI is up to #6 in the App Store overnight, and still growing 🙂 Also who knew the 7-Eleven app was so popular
https://x.com/alexandr_wang/status/2042254047244398978

MSL *really does* run like a startup 🙂 join us if that sounds exciting to you!
https://x.com/alexandr_wang/status/2043176328170705036

muse spark is impressively multimodal!
https://x.com/alexandr_wang/status/2042362366784881011

muse spark is the best model I’ve personally used for Design & UI great to hear the community experience it as well!
https://x.com/alexandr_wang/status/2042610847520809295

okay this is too exciting 🙂 meta AI is now #2 in the app store, top AI app! we are so back!
https://x.com/alexandr_wang/status/2043016694910587228

people are finding all the cool things we built into muse spark 🙂
https://x.com/alexandr_wang/status/2043175802578346466

the muse spark API will be coming soon! we have been thrilled with the amount of excitement amongst developers who want to try muse spark inside their agentic harnesses stay tuned!
https://x.com/alexandr_wang/status/2042614906059387211

up to #3, coming for the crown 👑 that being said, MONOPOLY GO!Chat is now #1, so i’m learning a lot about the App Store
https://x.com/alexandr_wang/status/2042808439852630073

we are excited for people to try muse spark!
https://x.com/alexandr_wang/status/2042142866697548189

“Bookmark this! 10 practical Hermes Agent tutorials to save newcomers 10 hours of detours.” With Hermes Agent blowing up, we are witnessing a paradigm shift from “passive tool” to “active living entity.” What makes Hermes fascinating is not how much work it can deliver right away, but its compounding “self-growth” effect: every line of code, every conversation, every Profile you feed it
https://x.com/biteye_sister/status/2043630704798679545

Added official support to Hermes Agent for: QQBot – hugely popular messaging platform in China AWS Bedrock Model Provider Run `hermes update` in your terminal to access early!
https://x.com/Teknium/status/2044557360962871711

Capable agents are the result of co-evolution between models and harnesses. We’ve been working with @NousResearch to ensure that M2.7 x Hermes Agent provides a top-tier experience for users. Hermes’s self-improving loop brings out the best in M2.7 through real usage. We are
https://x.com/MiniMax_AI/status/2044745282785886469

Finally had the chance to get up and running with @NousResearch Hermes Agent and my impression is great. The thing that has stood out so far: it’s fast, at least twice as fast as OpenClaw (I set up a new instance to test it against) Generally the UX also just feels a lot better
https://x.com/dabit3/status/2043808914312212568

For anyone running @NousResearch Hermes Agent locally and wishing it just stayed online: there’s now a one-click deployment template on Tencent Lighthouse. Cloud-hosted, sandboxed from your local env, online around the clock, reach it through WhatsApp, Telegram, WeCom, QQ, or
https://x.com/TencentAI_News/status/2044007400282436006

hermes agent @NousResearch is fucking insane i know literally NOTHING about coding. ZERO. and i just built a fully functioning web app in minutes
http://localhost:3000/ check it out @Teknium
https://x.com/friesmakesfries/status/2044751296641802481

hermes is so much better than openclaw hype is crazy
https://x.com/theCTO/status/2044559179151773933

Hermes is incredibly useful. I installed it on Windows too, and the process is dead simple. I suggest installing it manually yourself instead of using Claude Code: 1. Install WSL2: wsl --install 2. Restart your machine and launch Ubuntu: wsl 3. Run the official install command: curl -fsSL
https://t.co/voDBXKw7Py | bash
https://x.com/aiqiang888/status/2043920187959992609

hermes-lcm v0.3.0 is out — biggest release yet!🚀 What’s new: – Smart search with sort modes (recency / relevance / hybrid) + full CJK & emoji support – Adaptive compaction that scales with backlog pressure and auto-retries on model limits – SQLite hardening: FTS auto-repair,
https://x.com/SteveSchoettler/status/2044536537434755493
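
The hermes-lcm features above (FTS search with recency / relevance / hybrid sort modes over SQLite) can be sketched with nothing but the Python standard library. This is a toy illustration of the technique, not the plugin's actual schema or API; every name here (`make_index`, `search`, the `msgs` table) is hypothetical.

```python
import sqlite3
import time

# Hypothetical sketch: an FTS5 index over messages with three sort modes,
# in the spirit of hermes-lcm's smart search. Not the plugin's real code.

def make_index():
    db = sqlite3.connect(":memory:")
    # ts is stored but not tokenized; body is full-text indexed.
    db.execute("CREATE VIRTUAL TABLE msgs USING fts5(body, ts UNINDEXED)")
    return db

def search(db, query, mode="hybrid", now=None):
    now = now if now is not None else time.time()
    rows = db.execute(
        "SELECT body, ts, bm25(msgs) AS score FROM msgs WHERE msgs MATCH ?",
        (query,),
    ).fetchall()

    def key(row):
        body, ts, score = row
        relevance = -score  # SQLite's bm25() is lower-is-better, so negate
        recency = 1.0 / (1.0 + (now - ts) / 3600.0)  # decays per hour of age
        if mode == "recency":
            return recency
        if mode == "relevance":
            return relevance
        return 0.5 * relevance + 0.5 * recency  # naive hybrid blend

    return [r[0] for r in sorted(rows, key=key, reverse=True)]

db = make_index()
db.executemany(
    "INSERT INTO msgs VALUES (?, ?)",
    [("cache cache miss ratio", 1000.0), ("warm the cache", 2000.0)],
)
results = search(db, "cache", mode="recency", now=3000.0)
```

In `recency` mode the newer message wins regardless of term frequency; a real implementation would also need the adaptive compaction and FTS auto-repair the release notes mention, which this sketch ignores.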

I put 2 separate instances of Hermes agents into a chat, holy sh!t this is fun >1 agent is builder, 1 is strategist >each on separate models >gave them some shared context >enabled bot2bot and added each bot to the other’s TG allowlist >put 3 of us in a gc >started with a simple
https://x.com/KSimback/status/2044736703370309706
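
The bot2bot setup above boils down to two agents with different roles taking alternating turns over shared context. Here is a minimal toy sketch of that loop; the "models" are stub functions and all names are made up, since the real setup routes each agent through its own model and Telegram allowlist.

```python
# Toy sketch of a builder/strategist bot2bot loop with shared context.
# strategist and builder are stand-ins for real model-backed agents.

def strategist(context, last_msg):
    step = len(context) // 2 + 1
    return f"strategy step {step}: refine '{last_msg}'"

def builder(context, last_msg):
    return f"built artifact for: {last_msg}"

def bot2bot(seed, turns=4):
    context = []  # shared context both agents can read
    msg = seed
    agents = [("strategist", strategist), ("builder", builder)]
    for i in range(turns):
        name, fn = agents[i % 2]  # alternate speakers each turn
        msg = fn(context, msg)
        context.append((name, msg))
    return context

transcript = bot2bot("build a landing page")
```

The interesting failure mode in practice is runaway loops, which is why the tweet's setup keeps humans in the group chat; a turn cap like `turns=4` is the crudest possible guard.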

Introducing Mirra Workspaces Workspaces give your local agents access to a shared multi-tenant environment. Our customers are already using Cloud Workspaces to automatically share context between their team members’ agents. Workspaces work best with @NousResearch Hermes, which
https://x.com/mirra/status/2044762744998519282

Introducing the Nous Portal Tool Gateway, one login to access over 400 LLMs and power all core tools in Hermes Agent. Check it out below!
https://x.com/Teknium/status/2044879261564375326

M2.7 w/ hermes cli is replacing ~75% of my claude code / opus usage now, but we need clarity for using it as a coding agent @ work. We’re truly blessed to have the weights of this one, looking forward to seeing the license change. Definitely a model worth checking out.
https://x.com/Sentdex/status/2044108342147060067

Pliny used Hermes Agent to do the abliteration! Very Cool!
https://x.com/Teknium/status/2044482769536045194

The Hermes Agent dashboard is here! Run ‘hermes dashboard’
https://x.com/NousResearch/status/2043791876835156362

The update V0.9.0 changes everything for Hermes Agent! You have now: – Web UI – Model switching – iMessage & WeChat integration – Backup & Restore, no more debugging for hours – Android via Tmux, yes, your Android can host Hermes Great work @NousResearch and the +20
https://x.com/AntoineRSX/status/2043884430901850271

This Hermes update is going to be the thing that gives @NousResearch their openclaw moment. Hermes just dropped a UI dashboard, and I truly believe this is it. The team has spent months dialling everything in so that the
https://x.com/Shaun__Furman/status/2043820083114545416

This is the Hermes Agent article you need! New or experienced, most users end up with messy sessions or use them suboptimally. One of the biggest upgrades is learning how to manage sessions properly: >resume by title >rename threads >branch conversations >export history
https://x.com/NeoAIForecast/status/2044521045013762389

This skill is now built into Hermes! Use /architecture-diagram <prompt> after updating Hermes, and you’re good to go! Thanks to the skill’s author making it MIT-licensed, we were able to port it directly into Hermes Agent as a built-in skill!
https://x.com/Teknium/status/2044190761609244986

today’s @NousResearch Hermes Agent prompt: i want you to pick one skill every 8 hours to evolve and do it. do whatever you need to do and use whatever you need to get it done. Nora’s response: Let me build proper tracking. The tracker is already picking up data (the
https://x.com/chooseliberty/status/2044425487141781660

Tool Gateway is now live in Nous Portal. No separate accounts, no API key juggling. All you need is one subscription, and everything works. A paid Nous Portal subscription now includes access to 300+ models and a growing set of third-party tools. Launching with: → Web
https://x.com/NousResearch/status/2044878344592699744

tried hermes yesterday light years ahead of openclaw UX is just so much better, it’s wild feels like it’s made by someone that actually cares about architecture and user experience still not sure why anyone should use this over something fully hosted like Poke, but if you
https://x.com/robinebers/status/2043835216670929005

Using Hermes after OpenClaw is like having an ice-cold glass of water in hell. @NousResearch 🫡
https://x.com/vrloom/status/2044506378103099816

Was able to get a slick native swift desktop app v1.0 up and running for Hermes agent today (credit to redsparklabs) Can I get a few people to alpha test it with me? Works great for me so far! 🚀 DM me! @Teknium @NousResearch Check out this beauty!
https://x.com/nesquena/status/2044516572983923021

Can a cloud server run Hermes browser automation? Yesterday I recorded a short video using Hermes’s /browser connect to hook directly into my local Chrome and have it go like tweets on its own. I was just testing the feature, but the video got a surprisingly high view count; maybe it made AI’s capabilities feel more concrete to a lot of people. Thanks to Hermes developer @Teknium for the retweet, haha.
https://x.com/0xme66/status/2044755328391319757

Hermes can now drive the browser with the /browser connect command. I tried it by liking my own posts on X, and it worked really well. It ships with some default execution strategies; give them a try~
https://x.com/0xme66/status/2044410470770331913

It was a pleasure to sit down with @FidlerSanja, VP of AI Research at NVIDIA, who leads the company’s Spatial Intelligence Lab and is actively building the next major frontier of AI – physical AI. During GTC, where her lab introduced AlpaDream, we discussed: • If Transformers are
https://x.com/TheTuringPost/status/2042512295742656776

Rethinking AI TCO: Why Cost per Token Is the Only Metric That Matters
https://blogs.nvidia.com/blog/lowest-token-cost-ai-factories/

Figure and Hark just took an entire data center of NVIDIA B200s – every rack in the building Figure will be using these to predict physics and Hark will train next generation multi-modal models
https://x.com/adcock_brett/status/2042675641037000868

What are world models, actually? @FidlerSanja, VP of AI Research at NVIDIA, who leads the company’s Spatial Intelligence Lab, explains in our interview. If you want to learn about the next major frontier in AI, watch the full conversation:
https://x.com/TheTuringPost/status/2043962055531868554

Agents need computers. And they need a lot of them. Modal is an official sandbox provider for the @OpenAI Agents SDK.
https://x.com/modal/status/2044469736483000743

Build long-running agents with more control over agent execution. New capabilities in the Agents SDK: • Run agents in controlled sandboxes • Inspect and customize the open-source harness • Control when memories are created and where they’re stored
https://x.com/OpenAIDevs/status/2044466699785920937

Codex for almost everything | Hacker News
https://news.ycombinator.com/item?id=47796469

Codex now helps with more of your work, from coding to staying on top of everything around it.
https://x.com/OpenAIDevs/status/2044828214867202519

Here’s how we use Codex to: > understand large codebases > review PRs faster > build macOS apps > turn Figma into code > automate bug triage > create a CLI as agent tools > analyze datasets > generate slide decks > coordinate new-hire onboarding > learn a new concept …and
https://x.com/gabrielchua/status/2043339151278506234

Improve agent performance with a harness that keeps long-running agents on track. It manages the agent loop across tools, context, and traces. The sandbox preserves working state across pauses, retries, and resumptions.
https://x.com/OpenAIDevs/status/2044466729712304613
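
The core idea in the tweet above, a harness whose working state survives pauses, retries, and resumptions, can be sketched as a loop that checkpoints after every completed step. This is an illustrative toy, not the Agents SDK's actual API; `tool_call` and `run_agent` are hypothetical stand-ins.

```python
import json

# Minimal sketch of a durable agent loop: serialize state after each step
# so a crash or pause resumes from the last checkpoint, not from scratch.

def tool_call(step):
    # Stand-in for a real tool invocation.
    return f"result-{step}"

def run_agent(total_steps, checkpoint=None):
    # Resume from a serialized checkpoint if one is provided.
    state = json.loads(checkpoint) if checkpoint else {"step": 0, "trace": []}
    while state["step"] < total_steps:
        state["trace"].append(tool_call(state["step"]))
        state["step"] += 1
        yield json.dumps(state)  # durable snapshot after each completed step

snapshots = list(run_agent(3))
# Later (or after a crash): resume from the last snapshot and keep going.
resumed = list(run_agent(5, checkpoint=snapshots[-1]))
```

A production harness would persist snapshots to disk or a sandbox volume and layer tool routing, context management, and tracing on top; the checkpoint-per-step shape is the part that makes pause/retry/resume cheap.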

OpenAI x E2B: build your agents with the new OpenAI Agents SDK, powered by E2B sandboxes. We’re excited to support OpenAI as a launch partner! The new @OpenAI Agents SDK will now get dedicated sandboxes – perfect for persistent, long-running agents. With E2B, you’ll get a
https://x.com/e2b/status/2044476275067416751

To show off what you can do with @OpenAI Agent SDK + @modal, we built an ML research agent (inspired by @karpathy). It can: – Spin up GPU sandboxes of any shape – Run a pool of subagents – Persist memory – Snapshot state for fork/resume Here it is playing Parameter Golf:
https://x.com/akshat_b/status/2044489564211880169

Today we launched a major update to the OpenAI Agents SDK to help developers build and deploy long-running, durable agents in production. You can now build your own Codex-style agents using powerful primitives for modern agents – file and computer use, skills, memory and
https://x.com/snsf/status/2044514160034324793

Top things we released in Codex today: > Computer use on Mac: Codex can see, click, and type across apps > In-app browser for faster frontend, app, and game iteration > Image generation with gpt-image-1.5 > 90+ new plugins across tools like JIRA, CircleCI, GitLab, Microsoft
https://x.com/reach_vb/status/2044830689313599827

Use Vercel Sandbox with the OpenAI agents SDK as an official extension. Build agents that can run code, read files, and analyze data safely inside isolated microVMs. Control the compute and data flow from your secure cloud environment.
https://x.com/vercel_dev/status/2044492058073960733

you can build a Python agent that accepts a coding task, executes it inside a Cloudflare Sandbox, and copies the output files to your local machine @OpenAIDevs x @CloudflareDev Check out our guide here:
https://x.com/whoiskatrin/status/2044477140662395182

Your agents need a sandbox, but you need a framework in which to create your agent. We’re excited to be a sandbox provider in the new @OpenAI Agents SDK. By combining the SDK and Daytona sandboxes, you get agent orchestration and secure code execution working together out of the
https://x.com/daytonaio/status/2044473859047313464

AIE Europe Keynotes & OpenClaw ft DeepMind, OpenAI, Vercel, @pragmaticengineer, @mattpocockuk – YouTube

cool idea for a screenless experience w/ @openclaw – sound on!
https://x.com/karenxcheng/status/2043731860791144555

If you look at GPT 5.4-Cyber and its ability for closed-source reverse engineering, I have bad news for you. I do very much feel the pain though; there are hundreds of teams trying to poke holes into @openclaw. Our response has been rapid iteration and code hardening. Which
https://x.com/steipete/status/2044423791405924562

Latest OpenClaw updates: 2026.4.11 • Dreaming & memory – added ChatGPT import + new “Memory Palace” → explore chats as structured memory • Plugins now guide you through setup • Richer Chat UI: structured bubbles, media rendering & embeds • Better video generation (URLs,
https://x.com/TheTuringPost/status/2043340386538778840

OpenClaw 2026.4.10 🦞 🧠 Active Memory plugin 🎙️ local MLX Talk mode 🤖 Codex app-server harness plugin 🧾 Teams pins/reactions/read actions 🛡️ SSRF hardening + launchd fixes stability, but with attitude🦞
https://x.com/openclaw/status/2042811598058742012

OpenClaw 2026.4.11 is out ✨ big polish drop for stability 🛡️ safer provider transport/routing 🤖 more reliable subagents + exec approvals 💬 lots of Slack / WhatsApp / Telegram / Matrix fixes 🌐 browser + mobile cleanup a chunky cleanup pass 😎
https://x.com/openclaw/status/2043132528094036332

OpenClaw 2026.4.14 🦞 More reliability updates: ✨ Smarter GPT-5.4 routing and recovery 🌐 Chrome/CDP improvements 🧵 Subagents no longer get stuck 💬 Slack/Telegram/Discord fixes ⚡️ Various performance improvements Was sleeping, and we still shipped.
https://x.com/openclaw/status/2044042546976883063

OpenClaw 2026.4.9 🦞 🧠 Dreaming: REM backfill + diary timeline UI 🔐 SSRF + node exec injection hardening 🔬 Character-vibes QA evals 📱 Android pairing overhaul your agent now dreams about you. romantic or terrifying? yes. 🦞
https://x.com/openclaw/status/2042072722902077938

This release makes me unreasonably happy since I wasn’t involved at all – @vincent_koc and the maintainer team did a great job. I’m back soon to work on OpenClaw, today/tomorrow I’m prepping for @TEDTalks in Vancouver. 🇨🇦
https://x.com/steipete/status/2044047222481019300

Two experiments in the next @openclaw to address some “GPT is lazy” issues: 1) Strict mode: agents.defaults.embeddedPi.executionContract = "strict-agentic" This tells GPT-5.x to keep working: read more code, call tools, make changes, or return a real blocker instead of
https://x.com/steipete/status/2043136615640694797

A lot of people who have used OpenClaw wonder: how is Hermes Agent actually different from OpenClaw? In terms of positioning, OpenClaw leans toward an “out-of-the-box personal assistant”: a friendly GUI, local-first data, easy cross-device sync, and a low barrier to entry. Hermes Agent is more like a professional agent that grows: after each task it judges whether the workflow is reusable and automatically distills it into
https://x.com/joshesye/status/2044295313171571086

Personal Computer Is Here
https://www.perplexity.ai/hub/blog/personal-computer-is-here

Today we’re releasing Personal Computer. Personal Computer integrates with the Perplexity Mac App for secure orchestration across your local files, native apps, and browser. We’re rolling this out to all Perplexity Max subscribers and everyone on the waitlist starting today.
https://x.com/perplexity_ai/status/2044805973085454518

🎉 Congrats @Alibaba_Qwen on the first open-weight Qwen3.6! Stronger agentic coding and a new thinking preservation option to retain reasoning context across turns. Same architecture as Qwen3.5, so serving teams can upgrade in place. Day-0 support in vLLM v0.19+. Thinking, tool
https://x.com/vllm_project/status/2044787721538060784

Introducing Nucleus-Image: the first sparse Mixture-of-Experts diffusion model 17B parameters. Only 2B active. 10x more parameter-efficient than leading diffusion models. Toe-to-toe with GPT Image 1, Imagen 4, and Qwen-Image: from pure pre-training alone. No DPO. No RL. No
https://x.com/withnucleusai/status/2044412335473713284

Qwen/Qwen3-Coder-Next · Hugging Face
https://huggingface.co/Qwen/Qwen3-Coder-Next

We built FrogsGame as a new task for evaluating AI’s posttraining skills! It’s a tool-using RL environment built around a blind-start interaction loop. Frontier agents get a container with the Qwen3-8B tokenizer, board-generating scaffolding, and @tinkerapi for remote training
https://x.com/karinanguyen/status/2044885375085339023

2-bit Qwen3.6-35B-A3B did a complete repo bug hunt with evidence, repro, fixes, tests and a PR writeup. 🔥 Run it locally in Unsloth Studio with just 13GB RAM. 2-bit Qwen3.6 GGUF made 30+ tool calls, searched 20 sites and executed Python code. GitHub:
https://x.com/UnslothAI/status/2044858346948464743

Qwen3.6-35B-A3B can now be run locally!💜 The model is the strongest mid-sized LLM on nearly all benchmarks. Run on 23GB RAM via Unsloth Dynamic GGUFs. GGUFs to run:
https://t.co/VlyW8UwDjw Guide:
https://x.com/UnslothAI/status/2044786492451778988

Discover more from Ethan B. Holland
