Agents and Copilots: AI News Week Ending 05/08/2026

Image created with gemini-3.1-flash-image-preview with claude-opus-4.7. Image prompt: Using the provided reference images, keep the Sonoran Desert trail vista with saguaro, volcanic rock, and wide Arizona sky, and the weathered brown-post ranger sign with bold ranger-style typography, but replace the header with bold all-caps ‘AGENTS’ and list entries like ‘← Task Handoff ……… 2.4 mi’, ‘Delegation Saddle →…. 3.7 mi’, ‘← Tool Use Overlook .. 5.1 mi’ with a small compass-rose badge replacing the WP3 medallion; in the middle distance along a switchback, show three small identical hikers in daypacks walking single-file, each carrying a different tool (trekking pole, map, water filter), photorealistic midday desert light.

Artificial Analysis is partnering with Harvey on their new Legal Agent Benchmark! Harvey’s Legal Agent Benchmark (LAB) is an agent-native take on how AI should be contributing to legal work in 2026 – made up 1200 agentic tasks across 24 practice areas. It’s highly aligned with
https://x.com/ArtificialAnlys/status/2052145762650431840

Introducing Harvey’s Legal Agent Benchmark
https://www.harvey.ai/blog/introducing-harveys-legal-agent-benchmark

LAB is the first long-horizon, open-source legal agent benchmark, from @harvey. it will help legal teams answer “”what can legal agents do today?””, plan deployment, and design human-agent cooperation. autonomous legal is a deep domain, and a good benchmark can accelerate progress
https://x.com/saranormous/status/2052061665596948894

Meta prepares Hatch AI Agent with waitlist and social skills
https://www.testingcatalog.com/meta-prepares-hatch-agent-under-waitlist-and-social-media-skills/

Meta is planning to power its AI data centers with solar energy beamed from space. If it works, solar farms could produce power 24/7 without batteries or backup generators. The company behind it all is Overview Energy — they want to launch 1,000 satellites into orbit, 22,000
https://x.com/rowancheung/status/2051320518905930208

we’re continuing to see clear examples where a model’s harness is a major determinant of overall performance. with the same model, running on same task, it’s easy to observe very different scores depending on (system) prompts, tools (& their descriptions), and middleware
https://x.com/masondrxy/status/2052054177749029164

Anthropic working on Orbit, its upcoming proactive assistant
https://www.testingcatalog.com/anthropic-is-working-on-orbit-its-upcoming-proactive-assistant/

Big tech has become a claude wrapper.
https://x.com/_arohan_/status/2052053181656641735

Today we’re releasing Refactoring, the final leaderboard of our SWE Atlas suite. This new leaderboard is the ultimate test of an agent’s ability to restructure code without breaking the system. Claude Opus 4.7 with Claude Code takes the top spot🥇
https://x.com/ScaleAILabs/status/2052434456510878021

A critical question in agent design is “how do we build agentic workflows so humans are given significant, interesting, or variance-producing decisions as they come up in the work?” A Claude-run company has no source of competitive advantage compared to other Claude-run firms.
https://x.com/emollick/status/2052066205226123472

New for financial services: ready-to-run Claude agent templates for building pitches, conducting valuation reviews, closing the books at month-end, and more. Install them as plugins in Cowork and Claude Code, or use our cookbooks to run them in production as Managed Agents.
https://x.com/claudeai/status/2051679629488865498

Codex has surpassed Claude Code in downloads. According to TickerTrends, the crossover happened on April 30, after which Codex continued to gain share while Claude Code’s growth visibly slowed. Claude 4.7 was released April 16th, GPT-5.5 April 24th. Connect the dots.
https://x.com/kimmonismus/status/2051515496567292310

You can now enable Claude to use your computer to complete tasks. It opens your apps, navigates your browser, fills in spreadsheets–anything you’d do sitting at your desk. Research preview in Claude Cowork and Claude Code, macOS only.
https://x.com/claudeai/status/2036195789601374705?s=20

Anthropic Orbit leaked Orbit, a proactive assistant for Claude Cowork that auto-generates briefings and insights from Gmail, Slack, GitHub, Calendar, Drive, and Figma, no prompting required. Users can also deploy and pin “”Orbit apps”” for quick access. It’s Anthropic’s answer to
https://x.com/kimmonismus/status/2051618156385366305

New in Claude Managed Agents: dreaming, outcomes, and multiagent orchestration | Claude
https://claude.com/blog/new-in-claude-managed-agents

With the help of Claude Mythos Preview, the Firefox team fixed more security bugs in April than in the past 15 months combined.
https://x.com/alexalbert__/status/2052468573516513762

Anthropic is building out their managed agents platform, adding Dreaming (memory) and Outcomes (rubrics). The idea I’m wrestling with: how differentiated are these platform features really? I initially thought the model would “”eat your scaffolding””, an argument best made by
https://x.com/RichNwan/status/2052085746526216601

Anthropic and OpenAI are both launching joint ventures for enterprise AI services | TechCrunch

Anthropic and OpenAI are both launching joint ventures for enterprise AI services

Both Anthropic and OpenAI have new initiatives to help enterprises deploy AI agents within their organizations. This is a trend that’s early but going to get very big fast. As agents enter knowledge work beyond coding, there is very real work to upgrade IT systems, get agents
https://x.com/levie/status/2051344780328858040?s=46

AlphaEvolve: Gemini-powered coding agent scaling impact across fields — Google DeepMind
https://deepmind.google/blog/alphaevolve-impact/

gemini 3.1 flash-lite is here it’s our most cost-efficient model, optimized for high-volume agentic tasks, translation, and simple data processing
https://x.com/GoogleAIStudio/status/2052453828272812310

Google tests screen sharing and custom agents in Antigravity
https://www.testingcatalog.com/google-tests-screen-sharing-and-custom-agents-in-antigravity-ide/

we’re evolving the gemini interactions api to support rich, multi-step agentic workflows instead of strict “”user”” and “”model”” roles, every action (from thinking to tool calls) is now represented as its own step
https://x.com/GoogleAIStudio/status/2052487438967140700

Gemma 4 just got even faster! We’re releasing Multi-Token Prediction (MTP) drafters that deliver up to a 3x speedup, without any degradation in output quality or reasoning logic.
https://x.com/googlegemma/status/2051713412431007808

Gemma 4 up to 3x faster, directly in your phone! 🚀 Check out the difference Speculative Decoding makes! Multi-Token Prediction (MTP) is supercharging inference speeds for Gemma 4.
https://x.com/googlegemma/status/2052468624657654194

Gemma 4: Now up to 3x Faster. ⚡ Same quality, way more speed. Our new MTP drafters allow Gemma 4 to predict multiple tokens at once, effectively tripling your output speed without compromising intelligence.
https://x.com/googledevs/status/2051700498328346945

I benchmarked Google’s new MTP for Gemma 4 31B using vLLM with 4 speculative tokens, a fairly conservative setup. Results: – Much higher throughput than Qwen3.6’s MTP – Lower latency too, helped by Gemma 4 generating fewer tokens – For coding tasks with reasoning enabled,
https://x.com/bnjmn_marie/status/2052286398707687650

Multi-token-prediction in Gemma 4
https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/

Gemma-4 lands in Code Arena: Frontend Webdev and shifts the Pareto Frontier! Among open models, Gemma-4-31b ranks #13 and Gemma-4-26b-a4b ranks #17. Congrats to @GoogleDeepMind on shifting the frontier!
https://x.com/arena/status/2052061349312921686

gog 0.16 is out. Google Workspace CLI for humans and agents. Lossless raw API output, sanitized Gmail reads, safer command profiles, Drive inventory, Docs tabs, Sheets tables, Gmail filter export, and official Docker images.
https://x.com/steipete/status/2051575048348074450

Agents SDK 2.0 is underrated
https://x.com/sama/status/2050998576671859003

Create Google Slides in Codex without opening your browser, clicking buttons, and manually aligning figures. Plus, you (and your team) can view the progress in realtime. Codex isn’t creating the deck locally, then uploading it. It’s actually iteratively building it, checking
https://x.com/gabrielchua/status/2051113129317408925

I’ve never used an agent for the cliches of ordering food, grocery shopping, or booking travel. But I repeatedly use Computer Use in Codex to add things to my family calendar in Apple Calendar. Like, I gave it my son’s little league schedule for the next four months, and it
https://x.com/_simonsmith/status/2050178967735353837

One week since the launch of GPT-5.5, and it’s already our strongest model launch yet. API revenue is growing more than 2x faster than any prior release, while Codex doubled revenue in under seven days as enterprise demand for agentic coding tools keeps climbing.
https://x.com/OpenAI/status/2050250926888468929

OpenAI Agents SDK – an open orchestration layer for building multi-agent workflows It lets you define agents as LLMs with instructions, tools (APIs, functions, external systems), guardrails, and supports: • sessions with conversation history management • human-in-the-loop •
https://x.com/TheTuringPost/status/2050903494010499113

Me and codex were busy. 🔊
https://t.co/FBNMbWOuFZ — Sonos 🗃️
https://t.co/YDdZyN2vwP — WhatsApp 🪶
https://t.co/eykEElx1Ez — X archive 🧰
https://t.co/txvYVtvhPg — GitHub archive 🛰️
https://t.co/2u2ACJEKKi — Discord archive 🎧
https://t.co/nrv2rzKfH4 — Spotify 💬
https://x.com/steipete/status/2051900143339704730

This is the most useful tooling I built for OpenClaw to date. It’s open source, runs on codex and you can fork and use it for any repo. For all the hard working oss folks that drown in issues and PRs, this is for you.
https://x.com/steipete/status/2051020548335874369

🎙️ Voice AI only feels natural when conversation keeps pace with speech. Here’s how we rebuilt our WebRTC stack with a thin relay and stateful transceiver to keep real-time media fast for ChatGPT voice, the Realtime API, and more.
https://x.com/OpenAIDevs/status/2051453905343828350

🚀 GPT-Realtime-2 just landed in Genspark. Our Call for Me Agent now runs on it. Genspark Realtime Voice is upgrading next. What Realtime 2 brings: Sharper reasoning. Tighter instruction following. +26% effective conversation rate. Far fewer dropped calls.
https://x.com/genspark_ai/status/2052524670088556557

Advancing voice intelligence with new models in the API | OpenAI
https://openai.com/index/advancing-voice-intelligence-with-new-models-in-the-api/

Building voice applications with GPT-Realtime-2? Our new prompting guide covers how to tune reasoning effort, use preambles, design tool behavior, handle unclear audio, capture exact entities, and maintain state in longer sessions.
https://x.com/OpenAIDevs/status/2052530378184032560

Dubbing for live events… in real time? 😮 Here’s OpenAI’s new GPT-Realtime-Translate model in action in Vimeo. Those translations are happening completely live. No pre-loaded captions. Live dubbing is one of the many features we’re exploring this year… (Hopefully) more
https://x.com/Vimeo/status/2052442588201029684

GPT-Realtime-2 audio input price remains steady at $1.15 per hour of audio input, and $4.61 per hour of audio output.
https://x.com/ArtificialAnlys/status/2052486478501204415

gpt-realtime-2 shows a 15pp improvement (vs 1.5) on Big Bench Audio, and is now close to saturation.
https://x.com/juberti/status/2052507302092296252

GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper are available in the Realtime API today.
https://x.com/OpenAIDevs/status/2052440968763515223

GPT-Realtime-2: Building a Live Translator
https://x.com/RayFernando1337/status/2052479718495318143

GPT-Realtime-Whisper brings low-latency streaming transcription to the Realtime API. Use it when your app needs to understand speech continuously while the interaction is still unfolding.
https://x.com/OpenAIDevs/status/2052440957258489859

Guess who’s back, back again. Whisper, but now with realtime streaming. Check out the new gpt-realtime-whisper transcription model in my
https://t.co/b2UTuSxhOI demo.
https://x.com/juberti/status/2052478775523512356

have been excited for realtime voice-to-voice translation as an AI application since we started OpenAI. extremely cool to see it now available in the API for anyone to build with:
https://x.com/gdb/status/2052480998668206262

Introducing GPT-Realtime-2 in the API: our most intelligent voice model yet, bringing GPT-5-class reasoning to voice agents. Voice agents are now real-time collaborators that can listen, reason, and solve complex problems as conversations unfold. Now available in the API
https://x.com/OpenAI/status/2052438194625593804

New ChatGPT Voice mode pretty much confirmed. And im really excited for it.
https://x.com/kimmonismus/status/2051571219040735423

New Voice Model from OpenAI in the API gpt-realtime-2 Here a quick demo I built
https://x.com/diegocabezas01/status/2052492653082681485

OpenAI has released GPT-Realtime-2, achieving 96.6% in our Speech Reasoning benchmark, Big Bench Audio, and #1 in our Conversational Dynamics benchmark Released today, GPT-Realtime-2 is OpenAI’s new flagship native Speech to Speech model, introducing adjustable reasoning effort
https://x.com/ArtificialAnlys/status/2052486470469140777

OpenAI shipped a new speech-to-speech model today: gpt-realtime-2 This is the first speech-to-speech model good enough to use in my voice agents that do “”real work.”” Or real play, for that matter. Here’s gpt-realtime-2 as the brain of the ship AI in Gradient Bang. The
https://x.com/kwindla/status/2052521318688739811

Our new voice models are now available in the Realtime API: 🎙️ GPT-Realtime-2: Build production-ready voice agents that can think harder, take action, handle interruptions, and keep conversations flowing. 🎙️ GPT-Realtime-Translate: Translate while streaming across more than 70
https://x.com/OpenAI/status/2052438196454379986

people are really starting to use voice to interact with AI, especially when they have a lot of context to dump. GPT-Realtime-2 comes to the API today; it is a pretty big step forward. (we are working on improvements to voice in chat.)
https://x.com/sama/status/2052462271667028211

pretty excited for voice models to get great its interesting to watch how people are already starting to change the way they interface with AI
https://x.com/sama/status/2051464865634742334

Saw this and thought “”yes! ChatGPT voice mode is going to stop acting like a two-year-model”” but that upgrade hasn’t shipped just yet
https://x.com/simonw/status/2052439091577496054

Taking talking shop to a whole new level. We just shipped Glean’s real-time voice capability, powered by @OpenAI’s newest speech model GPT-Realtime-2. Grounded in the context across your org, it feels like a real AI coworker and can keep up with how work gets finished. In
https://x.com/glean/status/2052440702169108990

Updated my hello-realtime demo to use the new gpt-realtime-2 model (now with reasoning). Check it out at
https://t.co/td6Cx2EOPO, or call 425-800-0042!
https://x.com/juberti/status/2052469176821002676

Using @OpenAI gpt-realtime-2 to get a glimpse of future voice-first experiences. A market dashboard you don’t click through. You direct it. Say, “Focus on Apple,” and the whole interface changes. Ask, “How did it do over the last 30 days?” and the chart updates. Say, “Go
https://x.com/levinstanley/status/2052506605044842672

Voice agents are getting more capable. Here’s what’s new: • GPT-Realtime-2 for voice agents that reason and take action • GPT-Realtime-Translate enabling translation from 70 input languages into 13 output languages • GPT-Realtime-Whisper, making transcription even faster
https://x.com/OpenAIDevs/status/2052440907933474954

Voice agents are so back!! Today we’re launching 3 new realtime audio models in the API: 🎙️ GPT-Realtime-2 GPT-5-class reasoning for voice agents that can use tools, recover from interruptions, and carry longer conversations with 128K context 🌍 GPT-Realtime-Translate Live
https://x.com/reach_vb/status/2052438371058737280

Voice workflows just got stronger with gpt-realtime-1.5 in the Realtime API. The model offers more reliable instruction following, tool calling, and multilingual accuracy. Demo with @charlierguo
https://x.com/OpenAIDevs/status/2026014334787461508

We know you’re eager for voice updates in ChatGPT. Stay tuned, we’re cooking.
https://x.com/OpenAI/status/2052438197695877316

Congrats to @OpenAI for taking the top spot on our Audio MultiChallenge S2S leaderboard with the release of GPT‑Realtime‑2 🥇 GPT-Realtime-2 more than doubles GPT-Realtime-1.5 on instruction retention, rising from 36.7% to 70.8% APR, and also stands out on voice editing,
https://x.com/ScaleAILabs/status/2052451341071683732

Perplexity and Computer now allow you to run Deep and Wide Research on sources trusted by doctors and medical professionals like the New England Journal of Medicine, the British Medical Journal, the American Diabetes Association, and so on.
https://x.com/AravSrinivas/status/2051711236224761983

Perplexity and Computer now connect to premium health sources, starting with NEJM and BMJ Group, with 9 more medical journals and clinical databases on the way. Ask health questions and get answers cited from the same sources relied on by hospitals and research institutions.
https://x.com/perplexity_ai/status/2051710342242480538

A new PR review experience is now available in Cursor 3. Take PRs from creation to merge, all in one place. You can see comments, diffs, commits, and review status to understand what changed and next steps. Navigate larger PRs more quickly with the file tree and changes picker.
https://x.com/cursor_ai/status/2052489387305488609

Advice for AI engineers 💡 A local AI assistant is just a while loop, an LLM and a set of tools. Here’s a step-by-step lesson on how to build one from scratch with @liquidai LFM2-24B-A2B running on llama.cpp. No cloud APIs. Enjoy ↓
https://x.com/paulabartabajo_/status/2051152294146617674

Can’t shake the feeling that file systems are overrated for agents. That in a year or so it’ll look like making a robot type on a keyboard rather than just letting it plug into usb.
https://x.com/dbreunig/status/2051083366410400132

Coding plan comparisons based on actual usage — sites.diy
https://sites.diy/blog/2026-05-01-coding-plan-comparisons/

Devin Review is now live in Windsurf, alongside Quick Review: a fast local bug detector powered by SWE-check. Fast and comprehensive code review and fixes are now available right in your IDE.
https://x.com/cognition/status/2052100630626607189

Earlier this week we announced Devin in your terminal. Now Devin is inside your shell. Hit Ctrl+G and it sees what you see to instantly help you out. try it: `devin shell setup`
https://x.com/cognition/status/2050268727997022498

new mode for LangChain’s human in the loop middleware: respond instead of running a tool, you can return the human’s interrupt response directly as the tool’s output handy for “”ask user”” stubs or headless tools that depend on direct user input!
https://x.com/sydneyrunkle/status/2050181039406858371

SWE-Check is live in Windsurf! So fun to collaborate with @cognition to post-train this. Try it out!
https://x.com/ypatil125/status/2052122827961278833

total realtime victory!!
https://x.com/reach_vb/status/2052442056392405383

we’re headed towards Jarvis that can do everything on your computer, listens all the time for your command, with complete fusion of speaking back to you, typing a response, or silently taking action… maybe even video?
https://x.com/willdepue/status/2052494388413235672

A TLDR on Harness Profiles: ✅ Model-specific profiles to adjust prompts, tools, and middleware. 📦 Profiles for @OpenAI, @Anthropic, and @Google models out of the box. 📈 A 10-20 point jump on a subset of tau2-bench over the default harness.
https://x.com/LangChain/status/2052054711440662864

Introducing Flue — The First Agent Harness Framework Flue is a TypeScript framework for building the next generation of agents, designed around a built-in agent harness. Flue is like Claude Code, but 100% headless and programmable. There’s no baked in assumption like requiring
https://x.com/FredKSchott/status/2050274923852210397

// OCR-Memory // Well this is a unique approach to store memory for long-horizon agents. Most of the agent memory systems compress trajectories into text summaries and hope the model remembers what matters. But that’s where the information loss hides. Long-horizon agents need
https://x.com/dair_ai/status/2049957482811056307

GLM-5V-Turbo Tech Report: Toward a Native Foundation Model for Multimodal Agents This report summarizes the main improvements behind GLM-5V-Turbo across model design, multimodal training, reinforcement learning, toolchain expansion, and integration with agent frameworks. These
https://x.com/Zai_org/status/2052426777654387168

// Recursive Multi-Agent Systems // Great read for the weekend. (bookmark it) Multi-agent systems often pass full text messages between agents at every step. This leads to token bloat, latency, and context dilution which all grow with the number of agents. RecursiveMAS asks a
https://x.com/omarsar0/status/2050261229315477988

🎉Introducing PyFlue: The Python-Native Agent Harness Framework.🧰 💡Flue for Python: Fred K. Schott @FredKSchott CEO of HTML has launched Flue: The Agent Harness Framework for TypeScript. It brings programmable harness right into your agents rather than DIY plumbing. Python
https://x.com/Shashikant86/status/2050999432569651221

🚀 New on the @vllm_project blog: Serving Agentic Workloads at Scale with vLLM x Mooncake. Agentic traces grow to 80K+ tokens with 94%+ reusable prefixes, but local KV caches evict them and cross-instance routing misses them. By integrating Mooncake Store as a distributed KV
https://x.com/vllm_project/status/2052113331927060840

9 New approaches to Multi-Agent Systems ▪️ RecursiveMAS ▪️ OneManCompany (OMC) ▪️ OrgAgent ▪️ CORAL ▪️ LLMA-Mem ▪️ Agentic Federated Learning ▪️ CASCADE ▪️ GRASP ▪️ Reinforced Agent These methods express truly interesting various ideas! Learn more about them here:
https://x.com/TheTuringPost/status/2050957812432580956

Agentic Business Orchestration & Automation Platform | UiPath
https://www.uipath.com/

Agents are already moving at machine speed, but security is still stuck on static, outdated rules. We can close the gap On May 5, @rubrikInc is hosting a technical webinar on AI security at scale – Building AI Resilience: Managing Agent Risk with Trust Infrastructure →
https://x.com/TheTuringPost/status/2049985228421361929

Agents can now add durable execution to their plans with Dynamic Workflows.
https://x.com/celso/status/2050211184129786084

Agents for Everything Else — swyx – YouTube

Coming back to drafting a set of “”10 commandments”” for coding with agents. Here are the current candidates: – Implement to learn – Rebuild often – E2E tests are gold – Document intent – Maintain your spec – Find what’s hard (that’s the value) Thoughts? What am I missing?
https://x.com/dbreunig/status/2051081626139210202

create_agent – how we build Deep Agents on the simplest harness primitive underlying all of the harness engineering, research, and API design in Deep Agents is a very simple primitive in LangChain called create_agent the entire design of deepagents comes from optionally
https://x.com/Vtrivedy10/status/2050239109038232005

Cursor can now automatically fix CI failures. Set up always-on agents that monitor GitHub, investigate root causes, and open PRs with fixes.
https://x.com/cursor_ai/status/2051739625958584659

deepagents-cli is quietly becoming the best place to start coding with open weight models. we’ve been investing heavily in making it a harness that’s truly model-agnostic, without compromising performance! different models perform best with different harnesses — prompts,
https://x.com/masondrxy/status/2051359502918648319

Domain-Specific Agent Applications Workshop | AWS Marketplace
https://pages.awscloud.com/awsmp-gro-pude-webinar-mss-module-5-use-case-specific-agents-workshop.html?trk=e7545cd0-d7fb-4544-85cc-1d4223437479&sc_channel=el

everyone would have a deeper appreciation for Agent Products that rock because of great Context/Harness Engineering if they… talked to: – LLM Base models – Post-Trained models with no harness (no tools, no built in prompts, nothing) helps internalize how much stuff needs to
https://x.com/Vtrivedy10/status/2051674478648742002

For individual AI use, the jagged frontier is increasingly well understood. In multi-agent workflows in organizations, AI is jagged in ways that have not been well identified yet. In fact, we don’t even have a vocabulary around multi-agent systems & the ways the fail or succeed.
https://x.com/emollick/status/2051479583585616023

How AI Agent Memory Works
https://memory.cobanov.dev/

I detected a bad Agent action, what do I do about it? this is pretty much the main question that will power the future’s Human+Agent driven improvement loops Gather data -> Mine Errors -> Find out which piece(s) of the agent is contribute to this behavior -> Apply Fix -> Test
https://x.com/Vtrivedy10/status/2051727418134593632

I was quoted a couple times in this Atlantic article, but that isn’t (the only) reason I think it is good. It lays out the reasons why we whipsawed from “AI is a bubble” to “there are not enough data centers” in less than six months. Spoiler: its agents.
https://x.com/emollick/status/2050396928798535990

if you haven’t read this one by @Vtrivedy10, it’s a must read! great overview of what components a harness needs to support an agent for long running, long context tasks
https://x.com/sydneyrunkle/status/2051637638239567953

Improving token efficiency in GitHub Agentic Workflows – The GitHub Blog

Improving token efficiency in GitHub Agentic Workflows

Introducing /orchestrate, a skill that recursively spawns agents to tackle your most ambitious tasks with the Cursor SDK. We’ve used it to: – Autoresearch our internal skills, cutting token use by 20% while improving evals – Cut cold start times on our internal backend by 80%
https://x.com/cursor_ai/status/2052432778743210127

Introducing Zyphra Cloud: A full stack AI platform on AMD. Launching today with Zyphra Inference: serverless inference for frontier open-weight models focused on long horizon agentic workloads. Powered by @AMD MI355X GPUs on @TensorWave. Learn more at
https://x.com/ZyphraAI/status/2051384562870329444

Its getting hard to benchmark frontier agent performance on longer tasks. Repeated measurement is very expensive and there are differences between using models in harnesses versus via APIs. I suspect benchmarks understate progress, they are built for models, not harnessed agents
https://x.com/emollick/status/2050892355331354850

langgraph is the runtime that powers langchain and deepagents! we’ve been cooking on some new features: 1. node level error handlers 2. static + dynamic node timeouts 3. delta (diff based) channels for optimized storage 4. tons of new streaming primitives our 1.2 alpha release
https://x.com/sydneyrunkle/status/2051382622517887479

most of the time, you want an agent loop to run uninterrupted. that’s where the utility comes from! but some decisions shouldn’t be delegated to the agent. two situations come up consistently: 1/ before a consequential action, like sending an email, executing a transaction, or
https://x.com/sydneyrunkle/status/2050195081995407429

Must-read research of the week ▪️ The Last Harness You’ll Ever Build ▪️ From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company ▪️ Recursive Multi-Agent Systems ▪️ Synthetic Computers at Scale for Long-Horizon Productivity Simulation ▪️ Co-Evolving Policy
https://x.com/TheTuringPost/status/2051707579785752909

No longer are Fleet agents constrained to a single model. With multi-model support, you can build more efficient agents at scale
https://x.com/LangChain/status/2051367244060598312

now, your agent can fix itself. introducing raindrop triage. an agent for finding and investigating agent issues.
https://x.com/benhylak/status/2051727888639250450

Observability helps power the agent improvement loop But it’s not just observability! It’s also feedback! You should be trying to get as much feedback (direct, indirect, generated) into your agent observability platform as possible
https://x.com/hwchase17/status/2051708980435853513

one of the features i’m most excited about in our upcoming langgraph release is delta channels! the langgraph runtime lets you “”checkpoint”” agent progress at every step (model call, tool call, hooks). the problem, though, is that checkpoints bloat quickly when context is long!
https://x.com/sydneyrunkle/status/2052344141963555312

open-weight LLMs have come a long way on agent tasks! but the harness you wrap them in matters just as much as the model itself, and arguably the interface you use to drive that harness matters even more. dev workflows are deeply personal. what works well for one developer may
https://x.com/masondrxy/status/2051714091924828480

Sakana Fugu: A Multi-Agent Orchestration System as a Foundation Model
https://x.com/SakanaAILabs/status/2050998826190667795

serving multiple users from a single agent deployment introduces three distinct problems. luckily, langsmith’s agent server has a solution for each! 1. data isolation: your @auth.authenticate handler tags every resource with ownership on write, filters on read. 2. delegated
https://x.com/sydneyrunkle/status/2049956826670911809

The next wave of AI will not be won by better prompts. It will be won by systems that learn from experience. Today, Prime Intellect Lab is out of beta, open for you to start training your own models. The era of self-improving agents is here.
https://x.com/PrimeIntellect/status/2052225145725698102

There’s a bunch of conflicting stances I don’t fully understand in the debate of Proprietary RL’d vs Open Harness, Model intelligence, and Agent Labs building harnesses for bespoke tasks Not all of the below can be true: 1. Model is post-trained with a harness in the loop so it
https://x.com/Vtrivedy10/status/2051451869017584112

this is the part of the deep agents production series i’ve been most excited to get to: sandboxes without an execution environment, a production agent is only as capable as its fixed toolset. give an agent an execution environment where it can write and run code, and you give
https://x.com/sydneyrunkle/status/2052459962169966752

To get the most out of agent observability, store feedback with your traces. That is what turns agent traces from logs into a learning system.””
https://x.com/LangChain/status/2051709642716135729

TokenSpeed: A Speed-of-Light LLM Inference Engine for Agentic Workloads | LightSeek Foundation
https://lightseek.org/blog/lightseek-tokenspeed.html

we’re on an Open Model mission to help builders create world class agents >20x cheaper than what they have today a couple things have become evident recently: 1. The age of the token subsidy is being pulled back 2. Open Models have crossed an intelligence threshold making them
https://x.com/Vtrivedy10/status/2051148084567052690

What AI agent harness are you daily driving these days?
https://x.com/bilawalsidhu/status/2051859826083336326

You can now see a breakdown of your agent’s context usage in Cursor 3.3. Use these stats to diagnose context issues and improve your setup across rules, skills, MCPs, and subagents.
https://x.com/cursor_ai/status/2052059748544249918

NEW paper from Microsoft Research. (bookmark it) The entire interpretability literature is built around human readers. As more analysis gets delegated to agents, the right target of interpretability shifts. This paper is a recipe for designing tools that agents can actually
https://x.com/dair_ai/status/2052125514266190286

Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation”” TL;DR: combines LLM planning with vision-guided refinement to generate physically plausible and coherent 3D scenes from text
https://x.com/Almorgand/status/2051320217674870795

Very excited to release Terminal-Bench 2.1! Coding agents are among the most economically consequential deployments of LLMs to date. As agents improve, benchmark reliability matters more. We audited TB2.0 and found and corrected issues in 28/89 tasks. 30% of the benchmark!
https://x.com/ekellbuch/status/2052165464655298866

We recently built HiL-Bench, the first benchmark to test a critical question: do AI agents know what they’re missing and when to ask? Frontier models perform well with perfect specs. But remove a few key details, and they confidently guess and ship plausible wrong answers. We
https://x.com/ScaleAILabs/status/2051333688798097567

I have been testing DeepSeek-V4-Pro with the Pi coding agent. I am mindblown by how well it works out of the box. A few notes: I spent a few hours building an LLM wiki with an agent powered entirely by DeepSeek-V4-Pro on @FireworksAI_HQ inference. This is the first time I
https://x.com/omarsar0/status/2050009901234282649

🧠 Introducing NeuralBench: a unified, open-source framework to benchmark NeuroAI models. v1.0: 36 EEG tasks, 94 datasets, task-specific + foundation models. MEG/fMRI ready. MIT-licensed, FAIR’s Brain & AI @AIatMeta. Code:
https://t.co/WdROYdjjNY Paper:
https://x.com/hubertjbanville/status/2052029372282888234

1) Our team at Meta has a tough new coding benchmark challenging models to code entire programs including ffmpeg and the PHP compiler from scratch. 2) Top accuracy is 0% 3) We will be making the benchmark harder.
https://x.com/OfirPress/status/2051678633035809159

Banger paper from Meta FAIR. They introduce Autodata, an agentic data scientist that builds high-quality training and evaluation data autonomously. The headline result: on a CS research QA task, an Agentic Self-Instruct loop produces a 34-point gap between weak and strong
https://x.com/dair_ai/status/2051311905353142328

Cool paper from Meta FAIR. It’s on self-improving LLMs but on the pretraining side. (bookmark it) Most LLM safety, factuality, and reasoning fixes get bolted on at post-training. By then, the patterns have already set. This work moves those behaviors into pretraining itself.
https://x.com/omarsar0/status/2050213732970848664

AI Agents and the Future of Digital Work with Microsoft
https://www.cdata.com/resources/ai-agents-future-digital-work-microsoft/

NEW paper from Microsoft Research. If you care about training computer-use agents, this is one to keep. (bookmark it) The team builds 1,000 synthetic computers (each with realistic directory structures, documents, and artifacts) then runs long-horizon simulations on top of
https://x.com/dair_ai/status/2050263752147456238

NEW paper from Microsoft Research. Nice study on long-horizon agent generalization. (bookmark it) The team runs a study where the only variable is task horizon length. They use the same decision rules, reasoning structure but different sequence length to the goal. The main
https://x.com/dair_ai/status/2051679862788878354

📝 Agentic RL Infra Notes Insights from Zhihu Contributor 低级炼丹师 📝 🔍 Core Difference: Agentic RL vs Traditional RL • Traditional RL (RLVR): Single-time generation (answer → reward → update) — trains a “”response-generating”” model, no dynamic interaction. • Agentic RL:
https://x.com/ZhihuFrontier/status/2051691071634301064

Claude Opus 4.7 | Hacker News
https://news.ycombinator.com/item?id=47793411

Code with Claude 2026
https://x.com/i/broadcasts/1qGoNegbnRNKv

Code with Claude 2026: Opening Keynote – YouTube

It feels like agent harness evolution runs on two axes that usually get conflated. There’s the temporal axis: simplify as models improve, stripping components that compensated for limitations the new model no longer has. Sonnet 4.5 needed context resets; opus 4.5 didn’t. Opus
https://x.com/jakebroekhuizen/status/2052058987580051566

@bcherny @_catwu omg @bcherny with banger quotes “the future is more async agents… this is why we emphasize verification” “if you’re familiar with higher order functions, routines are higher order prompts” “default is i will now have claude prompt claude code” “the capability is already here
https://x.com/latentspacepod/status/2052068066167816369

Btw a bunch of the questions were just off the cuff – nothing @reinerpope prepped for. The guy is just first principles deriving how many tokens GPT 5 was pretrained on, or the bytes per token in Gemini 3’s KV cache, or which kind of memory each Claude cache hit sits on.
https://x.com/dwarkesh_sp/status/2049688865259286806

Code with Claude is happening now! ▪︎ 9:00AM – Keynote ▪︎ 10:30AM – What’s new in Claude Code ▪︎ 11:15AM – Building on Claude at GitHub scale ▪︎ 12:00PM – Get to production faster with Managed Agents All times PT.
https://x.com/ClaudeDevs/status/2052055459272761661

Did I understand correctly in their livestream that Anthropic is doubling the rate limits in Claude Code at no extra charge on max tier?
https://x.com/kimmonismus/status/2052059082886910251

Effective today, we are: 1) Doubling Claude Code’s 5-hour rate limits for Pro, Max, and Team plans; 2) Removing the peak hours limit reduction on Claude Code for Pro and Max plans; and 3) Substantially raising our API rate limits for Opus models.
https://x.com/claudeai/status/2052060693269008586

I am not sure I would agree with all of this, but the relationship between Anthropic and Claude is quite different than the relationship between other labs and their models. And that shows up in lots of ways, from the models themselves to how different labs think about the future
https://x.com/emollick/status/2051049394326081571

i have yet to meet a single person who feels like claude code is getting exponentially better on some kind of fast take off
https://x.com/TheEthanDing/status/2051516204607578132

I love Claude code but I feel like it’s had the same utility for me since, like, last fall
https://x.com/finbarrtimbers/status/2051652067480179020

I’ll be at Code with Claude all day today so come find me and let’s chat about Claude! I’ll also be giving a talk on the main stage at 530pm PT so tune in, it will be on the livestream!
https://x.com/alexalbert__/status/2052067009605861764

Increasingly, I think, we will see a gap between what you can do with frontier model APIs & what you can do with the native apps from the frontier labs (Codex, Claude Code). Models developed and trained with their native harnesses in mind have more capabilities in their harnesses
https://x.com/emollick/status/2049865091739209868

it is a literal and useful description of anthropic that it is an organization that loves and worships claude, is run in significant part by claude, and studies and builds claude. this phenomenon is also partially true of other labs like openai but currently exists in its most
https://x.com/tszzl/status/2051045196260167790?s=46

it is endlessly fascinating to me that we still don’t have a true 1M-context model it’s an unusual case where the infra is far ahead of the science. Claude discontinued 1M+ context bc it didn’t really work past ~200k we don’t have the right data? training techniques? not sure
https://x.com/jxmnop/status/2051357363815526523

Lets go: Claude Code’s 5-hour rate limits are doubling for Claude Pro, Max, Team, and seat-based Enterprise plans, while API limits for Claude Opus are being raised significantly. This was made possible by a new compute partnership with SpaceX!
https://x.com/kimmonismus/status/2052059448261177367

Live from Code with Claude: we’re launching dreaming in Claude Managed Agents as a research preview. Outcomes, multiagent orchestration, and webhooks are now in public beta.
https://x.com/claudeai/status/2052067399088664981

PSA: 2x’ed Claude Code’s 5-hour rate limits for Pro, Max, and Team plans. Compute is coming for users, builders, and knowledge coworkers.
https://x.com/claude_code/status/2052071730190123094

So, the weekly rate limits remain the same? “”First, we’re doubling Claude Code’s five-hour rate limits for Pro, Max, Team, and seat-based Enterprise plans.””
https://x.com/btibor91/status/2052067002412335435

Strong Opinions, Loosely Held on Agent + Harness Engineering: 1. You can outperform any default harness+model (including codex & claude code) on pretty much any Task by engineering the harness around it. Using the exact same model, curate prompts, tools, skills, hooks for that
https://x.com/Vtrivedy10/status/2052100726608781363

PostTrainBench results for GPT-5.5 are in it doesn’t beat Opus 4.7 in the Claude Code harness even with almost 2 more hours of working time via reprompting
https://x.com/scaling01/status/2050289320699818417

It’s seeming kind of obvious that Anthropic capabilities for addressing real business work are just inflecting exponentially. I can see it w/ my own tests of Claude + Factset, Excel Copilot powered by Claude, etc. where the outputs have gone from experimental to “”oh sh#t, this
https://x.com/TechFundies/status/2051733955049853053

Two weeks after release, Hy3 preview is #1 on @OpenRouter’s weekly leaderboard with 3.66T tokens processed, up 298% week-over-week. #1 in overall usage, tool calls, and coding. 15.4% market share across all providers.🏆 Top apps running Hy3 preview: Hermes Agent, Claude Code,
https://x.com/TencentHunyuan/status/2051978552900538403

you know what all of these “”which is better”” polls are silly use codex or claude code, whatever works best for you i am grateful we live in a time with such amazing tools, and grateful there is a choice
https://x.com/sama/status/2050274547061129577

A very worthwhile substack (written by @natalia__coelho ) article that focuses particularly on Claude Mythos and GPT-5.5 cyber. tl;dr according to the analysis, GPT-5.5 is basically tied with Claude Mythos Preview on cyber capabilities, and may even be more cost-efficient;
https://x.com/kimmonismus/status/2052040471829004627

Behind the Scenes Hardening Firefox with Claude Mythos Preview – Mozilla Hacks – the Web developer blog

Behind the Scenes Hardening Firefox with Claude Mythos Preview

Also, it’s insane how much slower Claude Code feels compared to Codex. GPT has faster TTFT & TPS, requires fewer tokens to start, requires fewer tool calls to succeed, prices “”fast mode”” less egregiously, and lets you use fast mode on a Codex subscription
https://x.com/theo/status/2050025533950587075

We just shipped Webhooks in the Gemini API 🙂 This is a big step towards making the DevX for long running tasks (batch, agents, GenMedia, etc) way better.
https://x.com/OfficialLoganK/status/2051434527931953490

🚀 Day-0 MTP support for Gemma4 now available at vLLM with ready-to-use docker image! ⚡️Enjoy up to 3x faster decoding performance to supercharge your development with zero quality degradation! Check out the full vLLM recipes for Gemma 4 model series👇
https://x.com/vllm_project/status/2051744111116574950

Excited to introduce Gemma 4 Multi-Token Prediction Drafters⚡️Accelerated inference right in your pockets – Up to a 3x speedup – Same quality guarantees – Available in your favorite open-source tools
https://x.com/osanseviero/status/2051695861801820475

Gemma 4 just got a massive speed-up with MTP drafters ⚡️ > speculative decoding (up to 3x tokens/sec improvement compared to normal Gemma-4 🔥) > identical reasoning, just faster > day-0 support in transformers, MLX, vLLM > A2.0 licensed 🤗
https://x.com/mervenoyann/status/2051702372339003841

Gemma 4 shifts Pareto Frontier on Code @arena.🔥 Among open models, Gemma-4-31b ranks #13 and Gemma-4-26b-a4b ranks #17. Pretty good for open models you can run a MBP. 👀
https://x.com/_philschmid/status/2052104144706588699

Make Gemma go brrrr!!! Multi-Token Prediction drafters are here for Gemma 4, making inference up to 3x faster with zero quality loss. ⚡️ – Up to 3x inference speedup – Zero degradation in output – Available for E2B and E4B versions – Apache 2.0 license
https://x.com/_philschmid/status/2051752856319926475

The DFlash draft model for Gemma-4 is one of the best draft models we’ve ever trained, with especially strong performance in coding and math. Try it out!
https://x.com/jianchen1799/status/2051902953376923946

Copilot Cowork: From conversation to action across skills, integrations, and devices | Microsoft 365 Blog

Copilot Cowork: From conversation to action across skills, integrations, and devices

🎯 Orchestration War Room: una capa visual para el orquestador de tareas de Hermes. Abstrae la dificultad: tú contratas perfiles expertos, pides una tarea, y ves el progreso en tiempo real. En el hilo enlace al repositorio y puntos clave. Y aquí el vídeo en acción. 🧵
https://x.com/naroh/status/2050998576486973759

Comparing deepseek-tui, opencode and Hermes side by side with the same tasks to V4-Pro, I am fairly confident that Hermes is the best agent among them right now. Highest success rate, fastest, and cheapest. A bit of a shame because the other two appeal to me aesthetically.
https://x.com/teortaxesTex/status/2051549309707928028

Introducing Hermes Agent v 0.13.0 – Multi-Agent orchestration through the Kanban system – Enforced goal completion with /goal – Big optimizations for disk usage – Much more extensibility, custom LLM Providers, custom gateway channels, and much more
https://x.com/Teknium/status/2052495174404874714

it’s very important to build agents that maximize cache hits. That’s the main axis of cost reduction with V4. Here, with OpenCode, I’ve had 91.6% cache hit. It it were the (typical of eg Hermes) 96%, the cost would’ve been ≈30% lower.
https://x.com/teortaxesTex/status/2051525774851682409

Lightpanda is now a browser backend in Hermes by @NousResearch. Open source autonomous agent. Open source browser built for machines. It had to be done. Set Lightpanda as default with automatic Chrome fallback.
https://x.com/lightpanda_io/status/2052369346928758861

opencode-GUI seems a bit buggy, it just burned $0,15 without building anything lol. Testing cli… $0.25, fail. Tui built something… which didn’t work. After a kick, fixed splendidly. ≈84% cache hit, $0.17. Hermes did well first try, auto-debugged, 95% cache hit, $0.12 or so.
https://x.com/teortaxesTex/status/2051551506134896976

Our first dive into Multi-Agent Coordination and Cooperation is here, with Hermes Agent Kanban Orchestrate tasks across multiple agent profiles and dependencies easily and visually. Achieve more. See the docs here:
https://x.com/Teknium/status/2051001156005151226

People love to ask what they can use Hermes Agent for. Hermes scraped the internet for answers and added them to our docs as inspiration. And if you’ve found an interesting use case, you can submit your own!
https://x.com/NousResearch/status/2052140057222369541

study @Teknium: >me asking him the best way to host Hermes on windows >him explaining that WSL2 is the preferred way right now >him sending a previous NousResearch documentation about the set up >him deciding that it is too sparse and reworking the documentation >1 hour
https://x.com/witcheer/status/2052033039379673374

Traditional cron jobs are great for silent tasks on a machine, and Hermes Agent cronjobs are great for extending that to your agent, but why not utilize the gateway and hermes’ cron to access things that don’t need to cost an agent’s time across any messenger service you have
https://x.com/Teknium/status/2052219963591762194

Trinity-Large-Thinking, @arcee_ai’s latest model, is now free on Nous Portal for the next week Sign up for Nous Portal to use it in your Hermes Agent today
https://x.com/NousResearch/status/2051321586980880506

Video content creation sounds simple, but what if you don’t have time to: • Write the script, • Prepare the visuals, • Generate the voiceover, • Create the subtitles, • And finally render the video? This is why we built Noustiny on top of @NousResearch Hermes Agent by
https://x.com/UfukDegen/status/2051088239579345329

You can now use `hermes profile create <name> –no-skills` to create a new agent with no built in skills whatsoever, start with a blank slate, fresh canvas!
https://x.com/Teknium/status/2052351650279645590

.@thsottiaux told me on my podcast this week: more than half of Codex prompts now come from non-engineers As a knowledge worker, I can’t be more excited about what’s shipping. Testing Codex this weekend. Will report back
https://x.com/siliconvalleymm/status/2052110961654296627

/goal: The Six-Hour Codex Run That Survived a Five-Hour Pause | Blog | Tecton & Tide
https://tectontide.com/en/blog/codex-goal-six-hour-run/

5.5 in codex is so good for non-coding tasks. i keep assuming it won’t be able to do something, but a lot of the time i am pleasantly surprised.
https://x.com/sama/status/2051783339502375418

5.5 xhigh in fast mode is really good i think i got psyoped by twitter on medium for a bit
https://x.com/sama/status/2050658558174437701

Auto Review in Codex is a game changer! It keeps long-running tasks moving with fewer approvals for routine work, while escalating higher-risk actions back to me. Try it in Codex today!!
https://x.com/reach_vb/status/2051782942314078553

big upgrade for codex today! try it for non-coding computer work.
https://x.com/sama/status/2049946120441520624

Bring your workflow to Codex in just a few clicks. Import settings, plugins, agents, project configuration, and more so you can keep working with fewer interruptions. Your move.
https://x.com/OpenAI/status/2050290618187055175

Built Petdex, a public gallery to discover, share, and install Codex pets with one curl. Submissions open at link below 👇
https://x.com/RaillyHugo/status/2050498466669887571

Codex 0.128.0 is huge, even better than a @thsottiaux reset. Codex is moving more goal oriented with a new /goal command, think Ralph loop on steroids: – /goal <objective> to set a new goal – after agent turn finishes, Codex injects a message nudging the model to pick the next
https://x.com/mattlam_/status/2049907603829121354

codex app becoming incredible
https://x.com/gdb/status/2049971410479796521

Codex can now take on more of your browser dev work. With the new Chrome plugin in the Codex app, it can test web apps, gather context across tabs, use web DevTools efficiently in parallel, and keep results organized without taking over your browser.
https://x.com/OpenAIDevs/status/2052481136971125158

Codex is my favorite coding app right now. It’s clean, but has everything I need to ship fast. It’s also quite delightful to use and snappy, and shows enough context without overwhelming. I was hesitant to try it because I don’t like locking in with a single provider, and I was
https://x.com/linuz90/status/2051273382327685207

Codex now works directly in Chrome on macOS and Windows. It’s even better at working with apps and sites in Chrome, and now works in parallel across tabs in the background without taking over your browser. To get started, install the Chrome plugin in the Codex app.
https://x.com/OpenAI/status/2052480800004956323

Codex redefines my workflow to the point where I should probably buy a new machine Last year I bought a 36GB M4 Pro MBP thinking it was a rocketship. Now I can work back and forth across 4 apps using Codex instead of scrolling Twitter while it builds or thinks (🤡) With a
https://x.com/TinaDebove/status/2050218817880473644

CODEX SKILL TO BRUTALLY TEST ANY STARTUP IDEA! Most startup ideas sound good. This Codex skill tells you why they probably won’t work. Just give Codex your idea and it pressure-tests it for you -> finds the core assumption -> exposes fatal flaws -> checks if the problem is
https://x.com/Kappaemme1926/status/2050908233158816122

CodexBar 0.24 is live 🤖 New Windsurf, Codebuff + DeepSeek providers 👥 Copilot multi-account switching 🧹 Opt-in local storage breakdowns 🔋 Hung Codex RPC + redraw battery drain fixed Tiny menu bar, ridiculous changelog.
https://x.com/steipete/status/2051882417292525950

Got my dog as a Codex pet, but more interestingly got Codex to add the rings to show my Codex limits. Outer ring is 5 hours, inner the weekly one
https://x.com/petergostev/status/2051076960911077796

GPT-5.5 is going to have a party for itself. it chose 5/5 at 5:55 pm for the date and time. if you’d like to come, let us know here:
https://t.co/OupLcJnf14 codex will help the team pick people from the replies. 5.5 had some good ideas/requests for the party, which we’ll do.
https://x.com/sama/status/2049653810558353746

i have brand new anxiety about not hitting cache with codex/gpt-5.5 btw since the input costs are so much higher i leave my agent on and come back to it asking a stupid question, it’s been too long and i see it charge me a dollar in input costs on next message LMFAO
https://x.com/cheatyyyy/status/2051332852546228533

I have to go out of town for a funeral thru the weekend but I am leaving everyone with one new cool feature inspired by ralph loops and Codex’s upcoming /goal feature. If you use /goal <prompt>, it will start a loop with a supervisor model determining whether the task completed
https://x.com/Teknium/status/2050098631907434871

I love that Codex App now shows the Progress of your task in an easily parseable UI right in the chat! ✨
https://x.com/reach_vb/status/2051655026574057593

I still stand by Droid being the best agent harness out there. I’ve tested everything under the sun. #1 Droid #2 Pi #3 Amp #4 OpenCode #5 Codex CLI I am still working on a few reviews but performance wise this has been my experience.
https://x.com/0xSero/status/2051689733793755405

It’s never been easier to do everyday work with Codex. Choose your role, connect the apps you use every day, and try suggested prompts. Codex helps with everything from research and planning to docs, slides, spreadsheets, and more.
https://x.com/OpenAI/status/2049928776147230886

it’s still experimental so we hide it a bit, but in the codex app, try: > what have i been doing very inefficiently on my computer (according to Chronicle). make some recommendations. be direct. tell me what i need to hear.
https://x.com/ajambrosino/status/2049839184110645691

ok its not the most important thing we’ve ever done but i find it more useful than it seems on the surface. check out pets in codex! (and try hatching one)
https://x.com/sama/status/2050304809572688289

OpenAI adds animated Pets and config imports to Codex
https://www.testingcatalog.com/openai-adds-animated-pets-and-config-imports-to-codex/

Pets. Now in Codex. Use /pet to wake your pet.
https://x.com/OpenAIDevs/status/2050275713824211041

QoL upgrade: Codex tells you the status of your CI directly in the chat it’s the little things!
https://x.com/reach_vb/status/2050194266505277902

Settings – Codex app | OpenAI Developers
https://developers.openai.com/codex/app/settings#codex-pets

Team shipped a Codex Security plugin with 5 AppSec workflows: > Security Scan Scans PRs, commits, branches, patches, folders, or full repos. Runs the full pipeline end-to-end > Threat Model Maps the repo: assets, trust boundaries, attacker inputs, invariants, and failure
https://x.com/reach_vb/status/2051019108028969251

the same model in a different harness can yield much different performance! we’ve seen this on a few different occasions now – we took gpt-5.2-codex from 52.8% to 66.5% on Terminal-Bench 2.0 (Top 30 to Top 5 at the time of publishing) just by applying harness layer changes like
https://x.com/masondrxy/status/2051016743905305007

The updated Agents SDK is now available in TypeScript, with support for sandbox agents and an open-source harness built in.
https://x.com/OpenAIDevs/status/2051725072873001338

We added a device tool bar to the Codex in-app browser, so it’s easier to build and test responsive apps! Now, you can have Codex test your app in different dimensions, so it can fix bugs & improve UI for every device. Just click the 3 dots on the right of the URL bar to use
https://x.com/JamesZmSun/status/2050050523794165816

we have very efficient models, especially for their capability level happy codexing
https://x.com/sama/status/2051670144842395990

With codex I don’t need a second monitor I turned it into a standing desk
https://x.com/jxnlco/status/2050639436866892075

You should be using subagents in Codex! They let Codex split work across specialized agents, explore in parallel, and bring the results back into one focused answer. Great for bigger codebases, PR reviews, and anything that needs more than one thread of thought!!
https://x.com/reach_vb/status/2052090279344120278

🦀📦Crabbox 0.4.0. Often I need to quickly recreate conditions on macOS, Linux and Windows and need fast empheral machines. Crabbox are machines for agents on the fly, using AWS spot instances, Hetzner or @useblacksmith. Infinite codex + tests!
https://x.com/steipete/status/2051025056306790833

closed source, open source, nothing can stop codex.
https://x.com/steipete/status/2052144503595716790

codex doesn’t create random markdowns 😉
https://x.com/steipete/status/2050003238498226541

Codex… what is this… are these signs of CHARACTER?
https://x.com/steipete/status/2051011229674508485

Here’s codex validating a [macOS only] launchd issue I previously had that you can’t reliably reproduce on a non-fresh install. Crabboxes ftw!
https://x.com/steipete/status/2051026592764240204

I learned a lot about the security ecosystem in the last few months. Amazing to work with @nvidia @OpenAI @Microsoft @GitHub @TencentHunyuan @convex @Atlassian @useblacksmith to get secure the claw.
https://x.com/steipete/status/2049976855617314991

If you tried OpenClaw in group chats and got mixed results, you GOTTA try again. I changed how agents talk there, it IS SO GOOD NOW.
https://t.co/uW9tcnynWr And if you used GPT and got subpar performance, switch to codex harness.
https://t.co/9DDpY6TeAH Enable both and boom.
https://x.com/steipete/status/2049988836160074022

OpenClaw 2026.5.6 🦞 🩺 doctor leaves Codex OAuth routes alone 🔌 plugin fetch handles odd headers 🌐 web_fetch cleans up timeouts Small maintenance release:
https://x.com/openclaw/status/2052096219233587451

The new /goal feature in codex slaps.
https://x.com/steipete/status/2050275598178586921

told codex I had to pay up to make @xai work again.
https://x.com/steipete/status/2050384648119734683

you can sign in to openclaw with your chatgpt account now and use your subscription there! happy lobstering.
https://x.com/sama/status/2050357911915028689

alignment failure
https://x.com/sama/status/2049715178611380317

🤖 Kept hitting @github rate limits across my agents. Shipped two things: – RepoBar got a JUICE METER – gitcrawl is now also a drop-in gh cache → symlink it as gh, reads served from local SQLite
https://t.co/mtKoH8ybWR
https://x.com/steipete/status/2051579838780072173

🦀 Crabbox 0.3.0 is out. Remote Linux runs for dirty worktrees 🔐 GitHub browser login 🧰 Blacksmith Testbox wrap 📡 crabbox attach for live run replay 📜 Durable run events ☁️ AWS image create 🛡️ Cloudflare Access brew upgrade openclaw/tap/crabbox
https://x.com/steipete/status/2050490163810230579

ClawSweeper 0.2.0 🦞 The OpenClaw maintenance bot now handles the loop: issue → @clawsweeper fix/build → guarded PR → review → repair → re-review → automerge Still conservative. Much less manual.
https://x.com/openclaw/status/2051020186833015243

Crabbox 0.5.0 is live 🦀 🖥️ Desktop/browser leases 🧑‍💻 VNC + authenticated WebVNC 🪟 AWS Windows + WSL2 📸 Screenshots + app launch Remote CI boxes, now suspiciously usable.
https://x.com/steipete/status/2051485798613111116

Do I have anyone from @discord in my timeline? Our @openclaw guild is down the whole day and idk what’s going on.
https://x.com/steipete/status/2051341022731407365

goblins have fat fingers
https://x.com/steipete/status/2050676702242644465

I added Googe Meet support to OpenClaw and now Molty is eager to join every meeting.
https://x.com/steipete/status/2051697991266795793

I asked Molty to review my PR and it made a song.
https://x.com/steipete/status/2051707256396267913

It’s been quite a week. Good stuff is coming though. I hired a team!
https://x.com/steipete/status/2051612829304659972

Merci! imsg 0.6 + 0.7 are live 🔵 Private API bridge landed 📡 Watch/history reliability fixes 💬 Better chat + account diagnostics 🛠️ Long fallback messages decode correctly Private APIs, public receipts.
https://x.com/steipete/status/2051905175355351440

New claw beta is up! Id you’re on our Discord, you can get the soundtrack.
https://x.com/steipete/status/2051033065367970195

OpenClaw 2026.4.29 🦞 💬 Group chats feel much better now 📌 Follow-up commitments from context 🔐 Safer exec, pairing, and owner controls 🟩 NVIDIA provider + model catalogs ⚡ Faster startup + plugin/channel fixes Group chat finally feels agent-native.
https://x.com/openclaw/status/2049986075221692678

OpenClaw 2026.5.2 🦞 🧠 xAI Grok 4.3 🔌 Plugin installs/updates are sturdier ⚡ Gateway + agent hot paths are leaner 💬 Discord, Slack, Telegram, WhatsApp fixes 🎙️ TTS, Realtime, web search, voice-call polish Less drama. More uptime.
https://x.com/openclaw/status/2050735037230801042

OpenClaw 2026.5.3 🦞 📁 File transfer for paired nodes 🧭 /steer + /side for live agent control 🔌 Plugin installs/updates hardened 🛠️ Channel + upgrade fixes Big release, fewer paper cuts.
https://x.com/openclaw/status/2051218126218445289

OpenClaw 2026.5.4 🦞 🧩 Cleaner plugin installs + updates ⚡ Faster Gateway startup paths 🛠️ Better doctor/repair hints 🪟 Windows + Discord reliability fixes The release where boring got fast.
https://x.com/openclaw/status/2051582130417721696

OpenClaw 2026.5.5 🦞 💬 Feishu, LINE, Telegram, Discord fixes 🖥️ Control UI/TUI stay responsive 🔌 Plugins update without losing SDK links 🛠️ Gateway status/restarts clearer Tiny bugfix release. Extremely tiny.
https://x.com/openclaw/status/2051952017900265634

OpenClaw plugins keep the core fast and lean: install only the channels, providers, tools, or skills you need. Example: `openclaw plugins install @openclaw/discord`, restart Gateway, then inspect. Inventory + install notes:
https://x.com/openclaw/status/2051227952575115647

Our Discord was unavailable for a bit, but it’s back now. Discord is still digging into what caused it. 🛠️ Status: crab walked back online. 🦞
https://x.com/openclaw/status/2051400401660920230

Released 🚦RepoBar 0.4.0. This one makes the GitHub menu a lot smarter: persistent SQLite caching, fewer wasted API calls, visible rate limits, better Issues/PR loading, archive fallback support. Tiny menubar app, increasingly useful daily tool.
https://x.com/steipete/status/2051088325100831046

Seems I have to build all the tooling for the future of software myself. With Claws and Tokens!
https://x.com/steipete/status/2051025224708079737

Shipping 🛡️openclaw/fs-safe: a reusable filesystem safety primitive extracted from OpenClaw. If your Node app accepts paths from agents, plugins, uploads, configs, or users, stop treating string normalization as a filesystem boundary. Use a root handle.
https://x.com/steipete/status/2051852940554481901

that’s a lotta token.
https://x.com/steipete/status/2051690175252594720

This one fixes the depenency issues/slowness some had when installed via npm. Plugins are hard, worth it tho! Package is way leaner now, we moved [almost] everything into extensions!
https://x.com/steipete/status/2050735979477008412

Too many agents, too many test suites, one very tired Mac. Run them remote: Crabbox 0.1.0 🦀 ⚡ Remote Linux test boxes (AWS, Hetzner) 🔁 Dirty checkout sync 🦀 Warm boxes with friendly slugs ⏱️ Idle auto-free brew install openclaw/tap/crabbox
https://x.com/steipete/status/2050140050168451286

Turns out the safest lobster is the one everyone can inspect. We wrote about the advisory flood, the real fixes, ClawHub, Agents of Chaos, and the companies helping harden OpenClaw in public. 🦞
https://x.com/openclaw/status/2049972008515957056

WAT
https://x.com/steipete/status/2049839420312891768

We can now reproduce issues directly in empheral crabboxes with WebVNC (Linux/Windows/macOS). Agents set up the exact state to test + fix and post videos on the PR. Working hard to level up our QA.
https://x.com/steipete/status/2051557150040711425

Designing, Refining, and Maintaining Agent Skills at Perplexity
https://research.perplexity.ai/articles/designing-refining-and-maintaining-agent-skills-at-perplexity

Personal Computer is now available to all users in a new Perplexity Mac app. Personal Computer is an advanced version of Perplexity Computer. It operates on any Mac, running tasks across your local files, native Mac apps, the web, and Perplexity’s secure servers.
https://x.com/perplexity_ai/status/2052445405754040816

We’ve developed our own inference engine Runtime-Optimized Serving Engine (ROSE) to serve models ranging from embeddings to trillion-parameter LLMs. With CuTeDSL integrated into our inference engine, Perplexity can build the specialized GPU kernels faster to bring models up to
https://x.com/perplexity_ai/status/2052041903970148647

On March 31, a malicious axios version shipped with a hidden dependency on an impersonator package. Devin Review flagged it for customers in under an hour, before the attack was publicly known.
https://x.com/cognition/status/2051708731671331171

Security remediation is an engineering capacity problem. AI has collapsed the time to exploit, but defensive tools haven’t kept up. Today we’re introducing Devin for Security: a set of workflows for reducing security debt, securing every release, and accelerating response
https://x.com/cognition/status/2051708729880416614

Ha love to see this port of flue to python Harness engineering is a fun time! Need more people exploring!
https://x.com/hwchase17/status/2051004516674457965