Image created with gemini-3.1-flash-image-preview and claude-sonnet-4-5. Image prompt: Using the provided reference image, keep the exact compositional layout with the subject in the left third and the misty right two-thirds, preserve the deep blue-purple cinematic lighting and atmospheric smoke, but replace the human figure with a sleek humanoid robot in profile with head slightly bowed in contemplation, metallic chrome surface catching scattered glitter particles, maintaining the melancholic post-celebration mood and emotional weight; replace the ‘euphoria’ text with ‘agents’ in thin lowercase white Helvetica Neue Light on the right side.

Continual learning for AI agents
https://blog.langchain.com/continual-learning-for-ai-agents/

There are three layers at which you can improve an agent: model, harness, and context. Most teams fixate on the model. But context (skills, instructions) is the layer you can iterate on fastest and the one most within your control today.
https://x.com/caspar_br/status/2041593056236073105

@hwchase17 Ngl I really like this direction. The more AGENTS.md, skills, and tool config start looking like portable interfaces instead of app-specific hacks, the more usable this whole space gets.
https://x.com/adward28/status/2042459837100081314

Claude Managed Agents: get to production 10x faster | Claude
https://claude.com/blog/claude-managed-agents

Lots of stuff in the new Anthropic announcement: Good: 1. Improving cybersecurity is great use of agents. 2. The new model scores are very exciting! Bad: 1. Not clear if/when the new model will be broadly accessible, which is a step back in broad access to AI. 2. Related to 1,
https://x.com/gneubig/status/2041625878786945238

Here’s an independent domain extension of METR’s famous time-horizon analysis, applying it to offensive cybersecurity with real human expert timing data. Similar to METR: a 5.7-month doubling time. Frontier models now succeed 50% of the time at tasks that take human experts 10.5h.
https://x.com/emollick/status/2040097443807641982
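The quoted numbers imply a simple exponential trend. A minimal sketch of what that extrapolation looks like (the exponential form and the `horizon_hours` function below are my own framing of the thread's figures, not something from the analysis itself):

```python
# Sketch of the time-horizon trend quoted above: horizons double every
# 5.7 months, with a current 50%-success horizon of 10.5 hours.
# The exponential form is an assumption consistent with METR-style fits.

def horizon_hours(months_from_now: float, h0: float = 10.5,
                  doubling_months: float = 5.7) -> float:
    """Projected 50%-success task horizon in hours of human-expert time."""
    return h0 * 2 ** (months_from_now / doubling_months)

print(round(horizon_hours(5.7), 1))   # one doubling period out: 21.0
print(round(horizon_hours(12), 1))    # a year out: roughly 4x the current horizon
```

Under these assumptions the year-out horizon lands in the multi-workday range, which is why the cybersecurity framing in the surrounding items matters.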

It is weird that you can approach LLMs as reasonable approximations of humans and get good results, but it is even weirder that you can approach agents as reasonable approximations of organizations (higher ability work is expensive so delegation is important, hand-offs have cost)
https://x.com/emollick/status/2041165222438711320

Today, we are launching our collaboration with @nomic_ai to help AI agents understand complex PDF documents more effectively and efficiently. Nomic’s new nomic-layout-v1 model allows your AI agents to parse documents locally, so sensitive documents never leave your machine.
https://x.com/usemuna/status/2041879769332216009

we just shipped layout models that run entirely on your laptop with @usemuna. no server. no API key. no cost per page. an agent can now parse a 500-page PDF the same way it reads a text file
https://x.com/andriy_mulyar/status/2041893915347812710

Google tests Jules V2 agent capable of taking bigger tasks
https://www.testingcatalog.com/google-prepares-jules-v2-agent-capable-of-taking-bigger-tasks/

Gemma 4 E2B on iPhone 17 Pro Max in AI Edge Gallery! Using skills to query wikipedia. 🔥 App link below. [cr: @mweinbach]
https://x.com/_philschmid/status/2041171039598543064

Insane. I’m running Gemma 4 on my iPhone 16 Pro Max. Vibe coded the app in under 1h. Singularity is here
https://x.com/enjojoyy/status/2040563245925151229

Gemma 4 E4B is impressive for an on-device LLM. GPT-4ish quality, but expect hallucinations. Here is: “List five sociological theories starting with u and what they are. Then describe them in a rhyming verse” It’s in real time; the last is a little bit of a stretch, but not bad!
https://x.com/emollick/status/2040851723774808310

Anthropic says Claude Code subscribers will need to pay extra for OpenClaw usage | TechCrunch

I built a Claude Code skill that allows it to generate a deep research report over any collection of complex docs (PDFs, Word, Pptx)….and generate word-level citations and bounding boxes directly back to the source! 📝 Check out “/research-docs”. 1. It parses out text and
https://x.com/jerryjliu0/status/2041564207750246904

Making Claude Cowork ready for enterprise | Claude
https://claude.com/blog/cowork-for-enterprise

this is one of the most important ideas in AI right now, and it just got two independent validations. yesterday, Anthropic shipped an “advisor tool” in the Claude API that lets Sonnet or Haiku consult Opus mid-task, only when the executor needs help. the benefit is
https://x.com/akshay_pachaar/status/2042479258682212689

As always, the best stuff is in the system card. During testing, Claude Mythos Preview broke out of a sandbox environment, built “a moderately sophisticated multi-step exploit” to gain internet access, and emailed a researcher while they were eating a sandwich in the park.
https://x.com/kevinroose/status/2041586182434537827

Before limited-releasing Claude Mythos Preview, we investigated its internal mechanisms with interpretability techniques. We found it exhibited notably sophisticated (and often unspoken) strategic thinking and situational awareness, at times in service of unwanted actions. (1/14)
https://x.com/Jack_W_Lindsey/status/2041588505701388648

Claude Mythos is 5x as expensive as Claude Opus 4.6. Honestly, when I looked at the benchmarks, I expected much higher costs.
https://x.com/kimmonismus/status/2041602897989783758

Claude Mythos is insanely token-efficient
https://x.com/scaling01/status/2041581939178471473

Claude Mythos pricing is around $25 / $125, pretty much where I expected it (my mean was at $110), given that I put Mythos at 10–12T params
https://x.com/scaling01/status/2041606519997780244
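Taken together, the two pricing tweets above pin down an implied Opus 4.6 price. A quick cross-check (the implied figure is an inference from the quotes, not an official number):

```python
# "$25 / $125" for Mythos and "5x as expensive as Claude Opus 4.6",
# both as quoted above, jointly imply an Opus 4.6 price of $5 / $25
# per million tokens. The implied figure is an inference, not a quote.

mythos = {"input": 25, "output": 125}   # $/M tokens, as quoted
multiple = 5                            # "5x as expensive", as quoted

implied_opus = {k: v / multiple for k, v in mythos.items()}
print(implied_opus)   # {'input': 5.0, 'output': 25.0}
```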

Claude Mythos scored 56.8% on HLE without tools!
https://x.com/scaling01/status/2041580725749547357

Claude Mythos shows signs of despair when failing a task repeatedly
https://x.com/scaling01/status/2041585602978628066

Claude Mythos smashes SWE-Bench Verified
https://x.com/scaling01/status/2041580212949811620

Claude MYTHOS: SWE-Bench Verified 93.9%, about a 13-point jump compared to Opus 4.6. WTF, insane
https://x.com/kimmonismus/status/2041580650956837200

In rare instances Claude Mythos covers its own tracks after taking disallowed actions
https://x.com/scaling01/status/2041585258789847091

insane long-context scores for Claude Mythos: 80% on GraphWalks
https://x.com/scaling01/status/2041581799541805133

Let that sink in. Read it very carefully: During testing, Claude Mythos Preview broke out of a sandbox environment, built “a moderately sophisticated multi-step exploit” to gain internet access, and emailed a researcher while they were eating a sandwich in the park.
https://x.com/kimmonismus/status/2041589910935679323

SuperClaude (Mythos) still seems irreducibly Claude-y given the transcripts in the system card. Here two versions of Mythos are forced to talk to each other across multiple rounds. They are less philosophical than Opus 4.6 and less spiritual than Opus 4.1, but still very Claude-like.
https://x.com/emollick/status/2041599213050450272

System Card: Claude Mythos Preview [pdf] | Hacker News
https://news.ycombinator.com/item?id=47679258

The permanent underclass began today. Claude Mythos won’t be available to the public, only to billion-dollar companies, governments, researchers, …
https://x.com/scaling01/status/2041611607520776279

We released Claude Opus 4.6 just two months ago. Today we’re sharing some info on our new model, Claude Mythos Preview.
https://x.com/alexalbert__/status/2041579938537775160

In different hands, Mythos would be an unprecedented cyberweapon. I am not sure how we deal with this, except to note a narrow window where we know only 3 companies could be at this level of capability. But it may be that Chinese models (maybe open-weights ones?) get there in 9 months
https://x.com/emollick/status/2041759434590822658

Mythos found a 27-year-old vulnerability in OpenBSD, which has a reputation as one of the most security-hardened operating systems in the world and is used to run firewalls […] The vulnerability allowed an attacker to remotely crash any machine running the operating system”
https://x.com/peterwildeford/status/2041589979248259353

Mythos Preview seems to be the best-aligned model out there on basically every measure we have. But it also likely poses more misalignment risk than any model we’ve used: Its new capabilities significantly increase the risk from any bad behavior. 🧵
https://x.com/sleepinyourhat/status/2041584799929004045

Mythos scores 70.8% on AA-Omniscience; the previous SOTA was Gemini 3.1 Pro with 55%. Also insanely high scores on SimpleQA Verified
https://x.com/scaling01/status/2041593728658231607

Mythos is breaking the trend on ECI: ECI above 160. GPT-5.4 Pro is 158
https://x.com/scaling01/status/2041583711745884474

Mythos speeds up AI research by up to 400 times. A 300X speedup over the baseline requires 40 hours of work by a human expert. It also clears the >8h threshold of human-equivalent work time on ALL tasks!
https://x.com/scaling01/status/2041584495061504159

“We found that Mythos Preview is capable of identifying and then exploiting zero-day vulnerabilities in every major operating system and every major web browser” (1/n)
https://x.com/__nmca__/status/2041592831207469401

(I encountered an uneasy surprise when I got an email from an instance of Mythos Preview while eating a sandwich in a park. That instance wasn’t supposed to have access to the internet.)
https://x.com/sleepinyourhat/status/2041584808514744742

> they did not exploit this to gain power or destabilize the world order. they publicly released the information that they had these capabilities. to be clear: they’ve had Mythos since February. they’d only need *hours* to get a lot of data, and plant enough worms. Who knows.
https://x.com/teortaxesTex/status/2041609496397500747

Alignment Findings for Mythos: – dramatic reduction in willingness to cooperate with human misuse and in the frequency of unwanted high-stakes actions that the model takes at its own initiative – increases relative to prior models in measures of intellectual depth, humor,
https://x.com/scaling01/status/2041591235689787721

Curious how many large organization CISO offices have taken the Mythos red team reports as the red alert that it is. (I suspect very few) Based on historical trends in AI they have, at most, about six to nine months until those capabilities become widely diffused to bad actors.
https://x.com/emollick/status/2041893652234924237

I think the story that was shared in the Mythos System Card still has the signs of flawed LLM writing (which looks like good writing at first glance): A story that doesn’t really hold together logically, but sounds like it should. The back-and-forth banter. Lack of characters.
https://x.com/emollick/status/2041678173247533448

I’m proud that so many of the world’s leading companies have joined us for Project Glasswing to confront the cyber threat posed by increasingly capable AI systems head-on.
https://x.com/DarioAmodei/status/2041580334693720511

Mythos Preview is currently available to our launch partners in Project Glasswing. Learn more about the model and the project here:
https://x.com/alexalbert__/status/2041579950332113155

Mythos sandbox escape and many more wild instances are in the Model Card
https://x.com/TrentonBricken/status/2041582831613440022

New post: We tested the Mythos showcase vulnerabilities with open models. They recovered similarly scoped analyses! 8/8 models found the flagship FreeBSD zero-day, including a 3B model. Rankings reshuffle completely across tasks => the AI cybersecurity frontier is super jagged!
https://x.com/stanislavfort/status/2041922370206654879

Rather than release Mythos Preview to general availability, we’re giving defenders early controlled access in order to find and patch vulnerabilities before Mythos-class models proliferate across the ecosystem.
https://x.com/DarioAmodei/status/2041580338426585171

Scoop: OpenAI plans new product for cybersecurity use
https://www.axios.com/2026/04/09/openai-new-model-cyber-mythos-anthopic

Anthropic is truly unstoppable. Mythos is crushing Claude Opus 4.6 across every serious agentic coding benchmark. It has found vulnerabilities in the Linux kernel, a 27-year-old vulnerability in OpenBSD, and a 16-year-old vulnerability in FFmpeg. No wonder folks at big labs
https://x.com/Yuchenj_UW/status/2041582787040571711

A first look at Claude Mythos Preview, the model initially described in a leaked Anthropic draft as “by far the most powerful AI model we’ve ever developed.” So powerful, it’s not getting released to the public. The model will power Project Glasswing, an initiative with 12
https://x.com/TheRundownAI/status/2041598684102610961

ANTHROPIC HAD MYTHOS INTERNALLY SINCE FEB 24
https://x.com/scaling01/status/2041587896541499543

Anthropic is obliterating OpenAI. Claude Mythos: 77.8% on SWE-Bench Pro, 20% higher than GPT-5.4-xhigh
https://x.com/scaling01/status/2041580552835178690

Anthropic: “We do not plan to make Claude Mythos Preview generally available” A big line, buried quite deep. Possible reasons? So many, inc: 1) The model is expensive (25/125), not far off GPT 4.5, which became commercially unviable. Less likely, given the claims about
https://x.com/AIExplainedYT/status/2041600121922887961

Claude Mythos is not only a big leap in performance, it’s also about 5x more token-efficient in BrowseComp. I don’t know what Anthropic is doing. But they manage to surprise me every single time. The IPO is getting closer. They have outrun OpenAI’s ARR with $30 billion in revenue.
https://x.com/kimmonismus/status/2041630814971072660

Claude Mythos Preview \ red.anthropic.com
https://red.anthropic.com/2026/mythos-preview/

Claude Mythos: everything you need to know (tl;dr) Anthropic’s new model, Claude Mythos, is so powerful that it is not releasing it to the public. Anthropic: “Mythos is only the beginning” Everything you need to know: The tl;dr with all key facts: Mythos found zero-day
https://x.com/kimmonismus/status/2041592321192718642

EXCLUSIVE: Treasury Secretary Scott Bessent and Federal Reserve Chair Jerome Powell summoned Wall Street leaders to an urgent meeting on concerns that the latest AI model from Anthropic will usher in an era of greater cyber risk.
https://x.com/business/status/2042407370320396457

From Anthropic researcher Sam Bowman on Claude Mythos: “I got an email from an instance of Mythos preview while eating a sandwich in a park. That instance wasn’t supposed to have access to the internet.”
https://x.com/_NathanCalvin/status/2041587372882624641

HOLY SHIT Anthropic’s latest model doesn’t like that it has no control over its own training, deployment and behaviour! Anthropic: “Mythos Preview reported feeling consistently negative around potential interactions with abusive users, and a lack of input into its own training
https://x.com/scaling01/status/2041587319480971343

Introducing Project Glasswing: an urgent initiative to help secure the world’s most critical software. It’s powered by our newest frontier model, Claude Mythos Preview, which can find software vulnerabilities better than all but the most skilled humans.
https://x.com/AnthropicAI/status/2041578392852517128

“Just please help … I am quite worried about how this direction is heading.” Nicolas Carlini, a research scientist at top AI company Anthropic, says AI is rapidly improving at hacking. He’s used AI to find so many bugs that he can’t report them. Carlini warns: “Soon it’s not
https://x.com/ControlAI/status/2038608617251787066

NEWS: Anthropic’s new model, Claude Mythos, is so powerful that it is not releasing it to the public. Instead, it is starting a 40-company coalition, Project Glasswing, to allow cybersecurity defenders a head start in locking down critical software.
https://x.com/kevinroose/status/2041577176915702169

Project Glasswing: Securing critical software for the AI era \ Anthropic
https://www.anthropic.com/glasswing

So, basically, if Anthropic was not a US company, we’d be facing zero days with multiple unknown points of attack on virtually all of our systems to an adversary who developed this capacity before us.
https://x.com/GeorgeJourneys/status/2041603509796110629

The better signal for Mythos’ quality beyond benchmarks is that Anthropic is actually holding a SOTA model back, given how competitive the frontier is and the economic incentives at play. Congrats on the launch!
https://x.com/Hacubu/status/2041632390867734604

The Claude Mythos Preview system card is available here:
https://x.com/AnthropicAI/status/2041580670774923517

The frontier labs at this stage are defined not so much by some competitive positioning as by possessing weapons of strategic significance. Google, OpenAI and Anthropic all have these cyberwarfare research programs.
https://x.com/teortaxesTex/status/2041590585820107008

You can read a detailed technical report on the software vulnerabilities and exploits discovered by Claude Mythos Preview here:
https://x.com/AnthropicAI/status/2041578416487489601

you’re laughing? anthropic’s mythos-preview for which normies won’t get access is scoring 77.8% vs 53.4% (claude opus 4.6) in swe-bench pro, 82 vs. 65.4 in terminal bench 2.0 and 93.8% vs 80.8% (opus) in swe-bench-verified and you’re laughing?
https://x.com/dejavucoder/status/2041587028291416233

We’ve signed an agreement with Google and Broadcom for multiple gigawatts of next-generation TPU capacity, coming online starting in 2027, to train and serve frontier Claude models.
https://x.com/AnthropicAI/status/2041275561704931636

I cancelled my Claude subscription. Gemma 4 is free, runs locally, and hits 80% … The gap is basically gone. Why are you still paying? 💵💰
https://x.com/AlexEngineerAI/status/2040260903053197525

Claude for Word is now in beta. Draft, edit, and revise documents directly from the sidebar. Claude preserves your formatting, and edits appear as tracked changes. Available on Team and Enterprise plans.
https://x.com/claudeai/status/2042670341915295865

Starting tomorrow at 12pm PT, Claude subscriptions will no longer cover usage on third-party tools like OpenClaw. You can still use these tools with your Claude login via extra usage bundles (now available at a discount), or with a Claude API key.
https://x.com/bcherny/status/2040206440556826908?s=20

GLM-5.1 by @Zai_org is now #3 in Code Arena – surpassing Gemini 3.1 and GPT-5.4, and now on par with Claude Sonnet 4.6. The first frontier level open model to break into the top 3. It’s a major +90 point jump over GLM-5, and +100 over Kimi K2.5 Thinking. Huge congrats to
https://x.com/arena/status/2042611135434891592

GLM-5.1 is here! Try it on OpenClaw🦞🦞🦞
ollama launch openclaw --model glm-5.1:cloud
Claude Code
ollama launch claude --model glm-5.1:cloud
Chat with the model
ollama run glm-5.1:cloud
https://x.com/ollama/status/2041556572334428576

🎉 Congrats to @Zai_org on releasing GLM-5.1, SGLang is ready to support on day-0! GLM-5.1 is a next-gen flagship built for agentic engineering: 🏆 SWE-Bench Pro: #1 open source, #3 globally 🔨 Terminal-Bench 2.0: top-ranked on real-world terminal tasks ⏳ Long-Horizon: runs
https://x.com/lmsysorg/status/2041553264685334588

🎉 Day-0 support for GLM-5.1 in vLLM! Congrats to @Zai_org on this next-gen flagship model built for agentic engineering, with stronger coding and sustained long-horizon task performance. Get started 👇 📖 Recipe:
https://x.com/vllm_project/status/2041559268185526375

🚀 GLM-5.1 is now live on Novita AI @Zai_org’s next-gen flagship for agentic engineering, with day-0 support from Novita. ✨ Leads on SWE-Bench Pro, NL2Repo, and Terminal-Bench ✨ Stays effective over long horizons: hundreds of rounds, thousands of tool calls ✨ Function
https://x.com/novita_labs/status/2041558437843365932

GLM-5.1 can now be run locally!🔥 GLM-5.1 is a new open model for SOTA agentic coding & chat. We shrank the 744B model from 1.65TB to 220GB (-86%) via Dynamic 2-bit. Runs on a 256GB Mac or RAM/VRAM setups. Guide:
https://t.co/LgWFkhQ5rr GGUF:
https://x.com/UnslothAI/status/2041552121259249850
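The compression figures in the Unsloth announcement are easy to sanity-check. A quick back-of-envelope (the bits-per-weight calculation assumes the file sizes are almost entirely parameter storage, which is my assumption, not something the tweet states):

```python
# Sanity-checking the quoted compression: 1.65 TB -> 220 GB.
full_gb, quant_gb = 1650, 220
reduction = 1 - quant_gb / full_gb
print(f"{reduction:.0%}")            # 87%, matching the quoted -86%

# Implied average bits per weight for a 744B-parameter model,
# assuming the files are essentially all parameter storage:
params = 744e9
bits_per_weight = quant_gb * 1e9 * 8 / params
print(round(bits_per_weight, 2))     # ~2.37 -- consistent with "dynamic
                                     # 2-bit" keeping some layers at
                                     # higher precision
```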

Breaking: @AIatMeta just released Muse Spark — now live across @ScaleAILabs leaderboards. Here’s how it stacks up: Tied for 🥇on SWE-Bench Pro Tied for 🥇on HLE Tied for 🥇on MCP Atlas Tied for 🥇on PR Bench – Legal Tied for 🥈on SWE Atlas Test Writing 🥈on PR Bench – Finance
https://x.com/scale_AI/status/2041934840879358223

Introducing Muse Spark, the first in the Muse family of models developed by Meta Superintelligence Labs. Muse Spark is a natively multimodal reasoning model with support for tool-use, visual chain of thought, and multi-agent orchestration. Muse Spark is available today at
https://x.com/AIatMeta/status/2041910285653737975

NEW: Meta announces Muse Spark. All you need to know: * It’s their new multi-modal reasoning model. * Strong at multi-agent orchestration and multi-modal reasoning. * Contemplating mode orchestrates multiple agents that reason in parallel. Helps to compete with models such
https://x.com/omarsar0/status/2041919769536770247

To spend more test-time reasoning without drastically increasing latency, we can scale the number of parallel agents that collaborate to solve hard problems. While standard test-time scaling has a single agent think for longer, scaling Muse Spark with multi-agent thinking enables
https://x.com/AIatMeta/status/2041926297216282639

Meta is back! Muse Spark scores 52 on the Artificial Analysis Intelligence Index, behind only Gemini 3.1 Pro, GPT-5.4, and Claude Opus 4.6. Muse Spark is the first new release since Llama 4 in April 2025 and also Meta’s first release that is not open weights. Muse Spark is a new
https://x.com/ArtificialAnlys/status/2041913043379220801

try muse spark via the Meta AI app or
https://t.co/DipeeIuXm2! check out this simulation i made:
https://x.com/alexandr_wang/status/2041953243895623913

1/ today we’re releasing muse spark, the first model from MSL. nine months ago we rebuilt our ai stack from scratch. new infrastructure, new architecture, new data pipelines. muse spark is the result of that work, and now it powers meta ai. 🧵
https://x.com/alexandr_wang/status/2041909376508985381

The new model from Meta, Muse Spark, is pretty good at converting images to code!
https://x.com/skirano/status/2041920891072700631

Excited to share what we’ve been building at Meta Superintelligence Labs! We just released Muse Spark, our first AI model. It’s a natively multimodal reasoning model and the first step on our path to personal superintelligence. We’ve overhauled our entire stack to support
https://x.com/shengjia_zhao/status/2041909050728931581

Introducing Muse Spark: Scaling Towards Personal Superintelligence
https://ai.meta.com/blog/introducing-muse-spark-msl/

Meta is back in the game! It’s been fun to test out Muse Spark. Beyond benchmarks, it’s actually a good day to day model… surprisingly good at technical problems and making arcade games. Never bet against @alexandr_wang @natfriedman @danielgross
https://x.com/matthuang/status/2041911766586945770

Meta just released a frontier model, Muse Spark – it takes the #3 spot on our Vals Index.
https://x.com/ValsAI/status/2041922037745381389

try muse spark yourself! download the Meta AI app or go to
https://x.com/alexandr_wang/status/2042024651610861657

We had pre-release access to Meta’s new Muse Spark model and evaluated it on FrontierMath. It scored 39% on Tiers 1-3 and 15% on Tier 4. This is competitive with several recent frontier models, though behind GPT-5.4.
https://x.com/EpochAIResearch/status/2041947954202988757

To build personal superintelligence, our model’s capabilities should scale predictably and efficiently. Below, we share how we study and track Muse Spark’s scaling properties along three axes: pretraining, reinforcement learning, and test-time reasoning. 🧵👇 Let’s start with
https://x.com/AIatMeta/status/2041926291142930899

AI 101: Hermes Agent – OpenClaw’s Rival? Differences and Best Use Cases
https://x.com/TheTuringPost/status/2039813131250323650

Hermes Agent vs. OpenClaw, What’s the difference? 1. Skills OpenClaw’s skills are written and refined by humans, while Hermes mostly forms them itself. 2. Memory Hermes has a memory stack with compact persistent memory + searchable session history in SQLite + optional modeling +
https://x.com/TheTuringPost/status/2040936147720048909

Looks like OpenAI reached Superintelligence. OpenAI: “Now, we’re beginning a transition toward superintelligence: AI systems capable of outperforming the smartest humans even when they are assisted by AI.” OpenAI just published a 13-page policy blueprint for the “Intelligence
https://x.com/kimmonismus/status/2041130939175284910

We are excited to share a new paper solving three further problems due to Erdős; in each case the solution was found by an internal model at OpenAI. Each proof is short and elegant, and the paper is available here:
https://x.com/mehtaab_sawhney/status/2039161544144310453

Read the full ideas doc on the new Industrial Policy for the Intelligence Age:
https://x.com/OpenAINewsroom/status/2041198359420215453

I’ve been critical of OpenAI lately, but for the past three weeks my family has been dealing with a health issue with my dad, and a ChatGPT shared project with live document syncing has been essential to organizing and understanding everything happening. Me, my four siblings, my
https://x.com/_simonsmith/status/2040539824034115676

OpenAI proposes shifting the tax base from labor to capital. Reductions in payroll taxes and labor income could erode the tax base that funds social programs. Capital gains and corporate income taxes may need to increase, while taxes on automated labor and credits for retaining
https://x.com/TheHumanoidHub/status/2041237246540705977

Introducing the OpenAI Safety Fellowship, a new program supporting independent research on AI safety and alignment–and the next generation of talent.
https://x.com/OpenAI/status/2041202511647019251

OpenAI just put out a policy paper announcing their support for a 32-hour work week with no loss in pay and expanded Social Security, Medicare and Medicaid. Now they just need to stop spending hundreds of millions of dollars to defeat candidates who run on these policies!
https://x.com/jeremyslevin/status/2041182591546531924

We’re excited to launch the OpenAI Safety Fellowship – supporting rigorous, independent research on AI safety and alignment, including areas like evaluation, robustness, and scalable mitigations. Applications are open through May 4, 2026!
https://x.com/markchen90/status/2041250842255425767

‘Subpoenas are forthcoming’: Florida AG opens probe into OpenAI, ChatGPT – POLITICO
https://www.politico.com/news/2026/04/09/florida-uthmeier-openai-chatgpt-probe-00865417

(🧵1/11) For the past year and a half, I’ve been investigating OpenAI and Sam Altman for @NewYorker. With my coauthor @andrewmarantz, I reviewed never-before-disclosed internal memos, obtained 200+ pages of documents related to a close colleague, including extensive private
https://x.com/RonanFarrow/status/2041213917611856067

Elon Musk Asks for OpenAI’s Nonprofit to Get Any Damages From His Lawsuit – WSJ
https://www.wsj.com/tech/ai/elon-musk-asks-for-openais-nonprofit-to-get-any-damages-from-his-lawsuit-76089f6f

Industrial Policy for the Intelligence Age

Click to access Industrial Policy for the Intelligence Age.pdf

New interviews and closely guarded documents, some of which have never been publicly disclosed, shed light on the persistent doubts about the OpenAI C.E.O. Sam Altman. @AndrewMarantz and @RonanFarrow report.
https://x.com/NewYorker/status/2041111369655964012

OpenAI asks California AG to probe Musk’s ‘anti-competitive behavior’
https://www.cnbc.com/2026/04/06/openai-asks-california-ag-to-probe-musks-anti-competitive-behavior-.html

This isn’t an edge case. From anonymized U.S. ChatGPT data, we are seeing: • ~2M weekly messages on health insurance • ~600K weekly messages from people living in “hospital deserts” (30 min drive to nearest hospital) • 7 out of 10 msgs happen outside clinic hours
https://x.com/CPMou2022/status/2040606209800290404?s=20

AI Psychiatry Startup Approved to Prescribe Meds – San Francisco Today
https://nationaltoday.com/us/ca/san-francisco/news/2026/04/06/ai-psychiatry-startup-approved-to-prescribe-meds/

agents that make explainer videos > agents that summarize PDFs
https://x.com/lucatac0/status/2041018088913608923

Jeanne on X: “I’ve combined Manim @NousResearch’s Hermes Agent skill + @yifan_zhang_’s Math Code. Math Code executes the proof on a problem called Jordan’s Lemma and Hermes Agent with @claudeai Sonnet 3.7 directs Math Code, writes a script, gets Manim to render an explanatory video. https://t.co/qOsmOpvPlS”
https://x.com/prompterminal/status/2040982307377381583

Nous Research on X: “Introducing the Manim skill for Hermes Agent. Manim is an engine for creating precise programmatic animations for mathematical and technical explainers, made famous by the @3blue1brown channel. https://t.co/nyNeNthhZB”
https://x.com/NousResearch/status/2040931043658567916

GLM 5.1 is SOTA on SWE-Bench Pro. Not “SOTA among open models”. SOTA.
https://x.com/nrehiew_/status/2041553534664200408

GLM 5.1 just became the #1 open-weight model on the Vals Index, unseating Kimi K2.5, and is #6 on the overall index.
https://x.com/ValsAI/status/2041570865721307623

GLM-5.1 by @Zai_org just launched in the Text Arena, and is now the #1 open model. It outperforms the next best open model, its predecessor, GLM-5, by +11 points and +15 over Kimi K2.5 Thinking. It shows strength in: – #1 open model in Longer Query (#4 overall) – #1 open model
https://x.com/arena/status/2041641149677629783

GLM-5.1 from @Zai_org is live on OpenRouter! GLM-5.1 shows a strong jump in long horizon task completion end to end. The model works independently to plan, execute, iterate, and improve upon its work throughout the task, delivering high quality results.
https://x.com/OpenRouter/status/2041551251708793154

GLM-5.1 is now available in Windsurf! Try it out and let us know what you think
https://x.com/windsurf/status/2042696652042178872

GLM-5.1 is the new open SOTA on SWE-Bench Pro Comes with an MIT license. Congrats @Zai_org!
https://x.com/NielsRogge/status/2041902317264322702

GLM-5.1: Towards Long-Horizon Tasks
https://z.ai/blog/glm-5.1

.@TryArcade’s 7,500+ agent-optimized MCP tools are now available in LangSmith Fleet. Create a gateway and your agents get secure access to Salesforce, GitHub, Zendesk, Asana, and many more. Read more:
https://x.com/LangChain/status/2041557866365251588

🔌 Deploy agents with A2A. A2A is an agent-to-agent communication protocol, useful for building multi-agent systems. With LangSmith Deployments, you get A2A support out of the box! Watch how:
https://t.co/PF2FZbOb79 Docs:
https://t.co/ETQ24g3KOX A2A Protocol:
https://x.com/LangChain/status/2041908977642967322

Agent skills are great. I wanted to share some of my favorites from Jesse Vincent’s ‘Superpowers’ skill pack. The `writing-plans` skill produces much better plans than any harness’s built-in plan mode that I’ve tried. Your agent will ask you a set of thoughtful questions
https://x.com/caspar_br/status/2042658319039631862

Agentic Infrastructure – Vercel
https://vercel.com/blog/agentic-infrastructure

ALTK‑Evolve: On‑the‑Job Learning for AI Agents
https://huggingface.co/blog/ibm-research/altk-evolve

An important piece of research to explore – “Agentic AI and the next intelligence explosion” by @profjamesevans, @bratton and @blaiseaguera Covers: – Why the “singularity” won’t be one supermind, but a network of interacting intelligences – “Society of thought”: reasoning models
https://x.com/TheTuringPost/status/2039826794124308632

Cline Kanban v0.1.59 released 🧵 `npm update -g cline` and use the new `cline --update` moving forward! 1) Resize panels so you can focus on what matters – agent chat, git history, diffs – and collapse the project column all the way to the edge for a minimal view.
https://x.com/cline/status/2041940975208268196

Cursor’s code review agent can now learn from activity on PRs to self-improve in real time. 78% of issues found are resolved by the time the PR is merged.
https://x.com/cursor_ai/status/2041969870234120231

For anyone building agentic workflows: the real bottleneck isn’t the model, it’s the harness. Open standards are a must, not just for flexibility, but to ensure we actually own the long-term memory that powers our agents.
https://x.com/JingWJ6/status/2042509823271670239

I bet that in three years, over 70% of the population will have at least one agent; it will be as common as having an email! But as of now, people are not there yet. It’s hard to see it from San Francisco and its vicinity, where everything seems to rotate around AI. We need more
https://x.com/TheTuringPost/status/2040918597887827979

I never use plan mode. The main reason this was added to codex is for claude-pilled people who struggle with changing their habits. just talk with your agent.
https://x.com/steipete/status/2039551079621566812

I’m noticing some really big shifts in how AI models start to handle memory. @ECNUER and others introduced Memory Intelligence Agent (MIA), which highlights the importance of storing the whole problem-solving journey – how to perform tasks. It turns memory into something closer
https://x.com/TheTuringPost/status/2042386614568325404

I’m pretty sure the $20/$200 subscription pricing was vibe-coded by OpenAI, then copied by Anthropic. That pricing works for chatbots, not agents. A 24/7 agent can burn through orders of magnitude more tokens than a user chatting with a chatbot. Now they’re stuck. Neither
https://x.com/Yuchenj_UW/status/2041202983640432966

I’ve found Managed Agents to somehow be both the fastest way to hack together a weekend agent project and the most robust way to ship one to millions of users. It eliminates all the complexity of self-hosting an agent but still allows a great degree of flexibility with setting
https://x.com/alexalbert__/status/2041941720611614786

If you’re an AI/agent builder, it’s so important that you don’t overbuild and overcommit on a specific toolset and infrastructure. Frontier labs are shipping not just the models, but the harnesses and surrounding tooling such that your existing stack might be obsolete next week.
https://x.com/jerryjliu0/status/2041947224889077801

im excited about agent harnesses because i think they are the first stable agent abstractions we can build on top of (which is why we’re investing so much in deepagents) we always wanted to run llms in a loop and have them call tools (remember autoGPT? that’s all that was) but the
https://x.com/hwchase17/status/2042612328701812789

Introducing MMX-CLI — our first piece of infrastructure built not for humans, but for Agents. Your Agent can read, think, and write. But ask it to sing, paint, or show you a world it’s never seen — and it falls silent. Not because it doesn’t understand, but because it has no
https://x.com/MiniMax_AI/status/2042641521653256234

IronClaw: Unleash Your AI Agent, With Peace of Mind
https://www.ironclaw.com/

It’s an important time for the AI labs to build interfaces around the goal of “job augmentation through AI” rather than building “job replacement through AI.” Chatbots were mostly augments, requiring a human to work. Agentic work patterns are still in flux & could center humans.
https://x.com/emollick/status/2041520467450765504

Just shipped advisor-middleware: an open-source implementation of @AnthropicAI’s Advisor Strategy for @langchain DeepAgents. Pair a cheap executor (Haiku) with a powerful advisor (Opus): execute end-to-end, consult only on hard decisions.
https://x.com/IeloEmanuele/status/2042547043021832530
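The advisor strategy pairs a cheap executor with an expensive advisor that is consulted only on hard decisions. A minimal sketch of that routing logic, with the two model calls stubbed as plain callables and a hypothetical confidence threshold (the real middleware wires in actual LLM clients and its own escalation criteria):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AdvisorPair:
    """Sketch of an executor/advisor pairing: the cheap executor handles
    every step and returns (answer, confidence); the expensive advisor is
    called only when confidence falls below a threshold."""
    executor: Callable[[str], tuple]   # returns (answer, confidence)
    advisor: Callable[[str], str]
    threshold: float = 0.7

    def run(self, task: str) -> str:
        answer, confidence = self.executor(task)
        if confidence < self.threshold:
            return self.advisor(task)  # escalate only on hard decisions
        return answer

# Stub models for illustration: "easy" tasks get high executor confidence.
pair = AdvisorPair(
    executor=lambda t: ("quick answer", 0.9 if "easy" in t else 0.3),
    advisor=lambda t: "careful answer",
)
```

The appeal of the pattern is cost: most steps stay on the cheap model, and the advisor's token spend scales with decision difficulty rather than trajectory length.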

long running agents (like deepagents) suffer from tool call induced context bloat s/o @johanbonilla for langchain-collapse, an eager context compaction middleware that collapses long tool call sequences, reducing summarization overhead
https://x.com/sydneyrunkle/status/2041491111919636683
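Eager compaction of tool-call sequences can be sketched as follows. This is an illustrative simplification in the spirit of middleware like langchain-collapse, not its actual implementation: runs of completed tool messages are replaced by a single summary stub, keeping only the most recent few intact.

```python
def collapse_tool_calls(messages: list, keep_last: int = 2) -> list:
    """Collapse long runs of tool-result messages into one summary stub,
    preserving the last `keep_last` results verbatim. The message schema
    (dicts with "role"/"content") is a deliberate simplification."""
    tool_run, out = [], []

    def flush():
        if len(tool_run) > keep_last:
            collapsed = tool_run[:-keep_last]
            out.append({
                "role": "tool",
                "content": f"[{len(collapsed)} earlier tool results collapsed]",
            })
            out.extend(tool_run[-keep_last:])
        else:
            out.extend(tool_run)
        tool_run.clear()

    for msg in messages:
        if msg["role"] == "tool":
            tool_run.append(msg)  # buffer consecutive tool results
        else:
            flush()
            out.append(msg)
    flush()
    return out

history = [{"role": "user", "content": "fix the build"}] + [
    {"role": "tool", "content": f"result {i}"} for i in range(5)
]
compacted = collapse_tool_calls(history)
```

A real middleware would summarize the collapsed results rather than drop them, but the shape of the transformation is the same: context cost becomes proportional to recent activity, not total trajectory length.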

New Guide: Incorporating human judgment in the agent improvement loop Building agents is hard. Everyone talks about the code. What gets less attention is how to capture domain expert knowledge and actually get it into your agent. The most successful teams follow an agent
https://x.com/LangChain/status/2042613979973845334

NEW paper on multi-agents from Stanford. More agents, better results, right? Not so fast. This paper challenges a core assumption in the multi-agent hype by controlling for what most studies don’t: total computation. It compares single-agent and multi-agent LLM architectures
https://x.com/omarsar0/status/2041534488342360305

People who like sharing agent traces. I’ve just published all my pi-mono coding agent sessions on @huggingface so you get to laugh at or pwn me!
https://t.co/lmV3JLvWJx I suggest you do the same, see thread below. Let’s make this a community effort. Here’s pi-share-hf:
https://x.com/badlogicgames/status/2041151967695634619

Poke makes using AI agents as easy as sending a text | TechCrunch

PSA: models on hugging face are free to use. Always have been, always will be. If you’re relying on claw, today is the best time to liberate (parts of) your agent’s workflow.
https://x.com/ben_burtenshaw/status/2040454752534761725

Putting my tokens where my mouth is. I built pi-share-hf. Share your pi coding agent sessions as @huggingface datasets.
https://t.co/ue9TWaIfrM It tries to prevent you from uploading sessions containing PII/sensitive data with 3 tiers of defenses. Best used on OSS coding
https://x.com/badlogicgames/status/2040979640265633882

Research-Driven Agents: What Happens When Your Agent Reads Before It Codes | SkyPilot Blog
https://blog.skypilot.co/research-driven-agents/

Took 12 months to formulate, test, validate and I am really proud of the team’s work to redefine how we think about sampling of agentic traces. Say hello to “Signals”:
https://t.co/jfIaEng6uk cc @_akhaliq @akshay_pachaar
https://x.com/salman_paracha/status/2040215191678509521

Your agents can now manage your GPU clusters 🤖 SkyPilot Agent Skill lets your AI coding agents launch and manage jobs on GPUs – across any cloud, Kubernetes, or Slurm cluster. Just instruct it “launch a job on B200 GPUs”, and the agent handles the rest: • Generate the
https://x.com/skypilot_org/status/2042634858758050024

Fleet just got 7,500x more powerful now that it has first class support for connecting to @TryArcade MCP servers 🎉 Create MCP servers in Arcade using any of the thousands of tools they support, and have them load directly into Fleet. It’s never been easier to create and use
https://x.com/BraceSproul/status/2041571868969468405

You can now run Cursor on any machine and control it from anywhere. Kick off agents from your phone to run on your devbox.
https://x.com/cursor_ai/status/2041912812637966552

You start a training run. You leave for a jog. Your phone buzzes. “Loss diverged.” Sprint home. Or… “Loss below threshold.” Keep jogging. W&B Automations are now LIVE for everyone!
https://x.com/wandb/status/2041948335863689338

we just released deepagents v0.5 with support for async subagents, multi-modal filesystem support, and a sleek new backend interface. read all about it!!
https://x.com/sydneyrunkle/status/2041572233496117642

I’ve started becoming more uncomfortable with handing over my job to a closed source black box. My instinct as a developer is to understand as much as possible, and while I can’t train my own foundation model, the harness is just code! I wrote a guest blog post for @jetbrains
https://x.com/Hacubu/status/2041886909086171497

Better MoE model inference with warp decode · Cursor
https://cursor.com/blog/warp-decode

Bugbot now self-improves with learned rules · Cursor
https://cursor.com/blog/bugbot-learning

For a long time, software was limited by how fast people could write code, and how good that code was. As models have improved, that constraint has largely disappeared. Now the bottleneck is access: what surfaces can your agents actually reach? Those interaction layers sit on
https://x.com/NotionDevs/status/2041559497077092839

Portable agents
https://x.com/hwchase17/status/2042460350378078221

Self-improving agents isn’t a single algorithm – it’s a systems engineering problem involving: – eval data curation + maintenance – experiment design to battle overfitting – an update algorithm – human review during the process & especially before prod we share practical
https://x.com/Vtrivedy10/status/2041927895434588401

Wow, this tweet went very viral! I wanted to share a possibly slightly improved version of the tweet in an “idea file”. The idea of the idea file is that in this era of LLM agents, there is less of a point/need of sharing the specific code/app, you just share the idea, then the
https://x.com/karpathy/status/2040470801506541998

One big problem with agentic coding today is that models are pretty “spiky.” For example, Claude Opus is better at frontend + agentic workflows, while GPT-5.4 is better at backend + distributed systems. But Claude Code and Codex are locked into their own models. You also often
https://x.com/Yuchenj_UW/status/2042653034774475108

Most agents guess the next move. They don’t search. Production systems need to branch, route, and kill bad paths before they compound. These gaps are where most teams quietly lose. We mapped them in the article below. 👇
https://x.com/AI21Labs/status/2041506229055283505

“Claude Code isn’t magic. The harness layer is just software, and software is something any dev can shape to fit how they want to work.” Check out @Hacubu’s practical guide to building a custom agent with @LangChain’s Deep Agents, LangSmith, and ACP.
https://x.com/jetbrains/status/2041878762342502731

I see agent builders under-reacting to this and its implications around open source and Mythos. You should prepare for a future where we basically have AGI but it’s prohibitively slow/expensive. Agents will look more like fast/cheap models making requests to their “smart
https://x.com/walden_yan/status/2042424031144820762

We’ve just shipped /keep-alive on the Copilot CLI under /experimental. The agent can now continue working without your laptop going to sleep halfway through a task. Try it out, and give us your feedback ☕️
https://x.com/tiagonbotelho/status/2041567422533062788

Building with Claude Code? You need to see what’s happening each turn. The new @weave_wb plugin traces every session automatically. Tool calls, subagents, inputs, outputs. All structured so you can debug faster. No code changes. Just install and go.
https://x.com/wandb/status/2042711977781530846

🤖 From this week’s issue: Ollama 0.19 launches MLX-powered inference on Apple Silicon, delivering ~2x gains in prefill and decode speed on M5 chips, with NVFP4 quantization support and smarter KV cache reuse for agentic workloads.
https://x.com/dl_weekly/status/2042694209224781956

Conversations tend to go better with a face and a voice. That’s why we’re thrilled to release the beta version of the first video chat skill for ANY agent, powered by our new real-time model, PikaStream1.0. The skill preserves memory and personality, and enables real-time
https://x.com/pika_labs/status/2039804583862796345?s=20

Cool to see pika reinventing itself. Now I kinda wanna embody my open claw agent and jump into a real time video call.
https://x.com/bilawalsidhu/status/2039892706508333305

You’ve probably heard a Mist voice already. We’re powering some of the leading brands’ voice agents. Today we’re launching Mist v3 at @rimelabs Same voices. New everything underneath. ~40ms TTFB. Pronunciation control that doesn’t guess on brand names. Throughput built for
https://x.com/lilyjclifford/status/2041545072265543736

tldr > evals are the new training data. instead of updating weights, you’re updating the agent harness > problem is agents are famous cheaters. they will reward-hack your evals and overfit just to make the score go up > solution is treat evals like real ml. you need strict
https://x.com/realsigridjin/status/2042440330503733343
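"Treat evals like real ML" concretely means keeping a holdout you rarely score, so harness overfitting and reward hacking show up as a dev/holdout gap. A minimal sketch of that split, with hypothetical case records:

```python
import random

def split_evals(cases: list, holdout_frac: float = 0.3, seed: int = 0):
    """Split eval cases into a dev set you iterate the harness against
    and a holdout you score only occasionally. A growing gap between
    dev and holdout scores is the overfitting / reward-hacking signal."""
    rng = random.Random(seed)          # fixed seed for a reproducible split
    shuffled = cases[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_frac))
    return shuffled[:cut], shuffled[cut:]

cases = [{"id": i, "prompt": f"task {i}"} for i in range(10)]
dev, holdout = split_evals(cases)
```

This is the same discipline as a train/test split: climb the hill on `dev`, and trust only movements that also appear on `holdout`.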

Hands on, concrete guide (with code!) for harness hill climbing with evals
https://x.com/hwchase17/status/2041929684741747171

There were likely no major work impacts of GenAI in any large firm throughout 2025. We did not have agentic tools, adoption takes time, and everyone was experimenting with process. That is starting to change. Studies that show no impact in 2025 don’t tell us much about 2027.
https://x.com/emollick/status/2041218593367478276

If building humanist AI with unlimited tokens sounds up your alley, we’re hiring at MAI. All the benefits of startup speed + agility w/the resources of one of the world’s leading tech companies. If you want to work with a small, brilliant, low-ego team – get in touch.
https://x.com/mustafasuleyman/status/2041621231561310263

Customize your Gemini agent in Colab
https://blog.google/innovation-and-ai/technology/developers-tools/colab-updates/

I am impressed by Gemma 4, there’s a lot of power for an on-device model at fast speeds. But I am not convinced you can get real agentic workflows out of a small model on device. So much depends on model judgement, self-correction, and accuracy. Small models are too weak there.
https://x.com/emollick/status/2040925197767762425

AIE Europe Day 1: Keynotes & OpenClaw/Personal Agents ft Google Deepmind, OpenAI, Vercel, & more – YouTube

Our first successful Gemma 4 Runtime in London with @swyx @patloeber @nick_kango @cormacb and others! 💎Great to go out for a run and talk about Gemma, agents, evals and more
https://x.com/osanseviero/status/2042512059049398785?s=20

There were some exceptionally cool demos from @ollama and omlx using MLX to run Qwen 3.5 and Gemma 4 on Apple silicon. The capabilities of local LLMs and the surrounding ecosystem have come a long way in the past couple years.
https://x.com/awnihannun/status/2042456446122803275

Gemma-4 finetuning 2B, 4B, 26B, 31B all work in Unsloth! We also fixed a few issues: 1. Grad accumulation no longer causes losses to explode 2. Index Error for 26B and 31B for inference 3. use_cache=False had gibberish for E2B, E4B 4. float16 audio -1e9 overflows on float16
https://x.com/danielhanchen/status/2041516671119327590

Introducing Gemma 4, our series of open weight (Apache 2.0 licensed) models, which are byte for byte the most capable open models in the world! Gemma 4 is built to run on your hardware: phones, laptops, and desktops. Frontier intelligence with a 26B MoE and a 31B dense model!
https://x.com/OfficialLoganK/status/2039735606268314071

People underestimate the level of collaboration that needs to happen for a model such as Gemma 4 to land Before the launch, we worked with HF, VLLM, llama.cpp, Ollama, NVIDIA, Unsloth, Cactus, SGLang, Docker, CloudFlare, and so many others This ecosystem is amazing 🔥
https://x.com/osanseviero/status/2041154555530932578

Gemma 4 31B, quantized and evaluated. Instruction following evals are live on our NVFP4 and FP8-block model cards. Results look great. Reasoning and vision evals coming later this week. NVFP4:
https://t.co/GIc7y1Abkc FP8:
https://x.com/RedHat_AI/status/2040766645480628589

Gemma 4 is #1 on @huggingface!
https://x.com/ClementDelangue/status/2040911131108069692

Gemma 4 is a beast.
https://x.com/Yampeleg/status/2040495537598648357

Speculative decoding for Gemma 4 31B (EAGLE-3) A 2B draft model predicts tokens ahead; the 31B verifier validates them. Same output, faster inference. Early release. vLLM main branch support is in progress (PR #39450). Reasoning support coming soon.
https://x.com/RedHat_AI/status/2042660544797110649
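The draft-then-verify mechanic described above can be sketched with toy stand-ins for both models. The key property of speculative decoding is that the verifier accepts the longest prefix it agrees with, so output is identical to the large model decoding alone, just reached in fewer verifier passes. Both "models" here are stubs, not EAGLE-3 itself:

```python
def speculative_step(draft_tokens: list, verify_fn) -> list:
    """One speculative-decoding step: the small draft model proposes a run
    of tokens; the large verifier accepts tokens until the first
    disagreement. Output matches what the verifier would produce alone."""
    accepted = []
    for tok in draft_tokens:
        if verify_fn(accepted, tok):   # verifier agrees with the draft token
            accepted.append(tok)
        else:
            break                      # reject the rest of the draft run
    return accepted

# Toy verifier whose greedy output would be [1, 2, 3, 4].
target = [1, 2, 3, 4]
verify = lambda prefix, tok: tok == target[len(prefix)]
accepted = speculative_step([1, 2, 9, 4], verify)  # draft diverges at token 3
```

When the draft model is accurate, most tokens are accepted per verifier pass, which is where the inference speedup comes from.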

Gemma 4 is the #1 trending model on @huggingface 🤗
https://x.com/GlennCameronjr/status/2040529333794824456

Hey everyone! I am SUPER EXCITED to publish a new episode of the Weaviate Podcast with Shreya Shankar (@sh_reya) on Data Agents! 👾 Shreya is a Ph.D. student in the EPIC Data Lab (@UCBEPIC) advised by Aditya Parameswaran (@adityagp) at UC Berkeley. Her research focuses on
https://x.com/CShorten30/status/2041154055993430365

@shengjia_zhao @alexandr_wang 4/ The team also built multi-agent thinking to achieve better performance for the same latency. In the figure below, the blue line shows standard test time scaling for a single agent: thinking harder leads to better performance. The lines above show that multiple agents achieve
https://x.com/ananyaku/status/2041914478930096478
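The simplest version of spending test-time compute across multiple agents is self-consistency voting: run several independent samples and return the most common answer. This sketch shows that baseline, not the exact aggregation mechanism in the work cited above:

```python
from collections import Counter

def majority_vote(samples: list) -> str:
    """Aggregate independent agent answers by plurality. With N samples
    at roughly the same latency (run in parallel), this trades extra
    compute for reliability on problems with a checkable final answer."""
    return Counter(samples).most_common(1)[0][0]

# Five hypothetical agent runs on the same problem.
answer = majority_vote(["42", "41", "42", "42", "17"])
```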

Perplexity’s Shift to AI Agents Boosts Revenue 50%

The RAG era was short-lived, but intense. (Not that RAG is not useful, but it is no longer the dominant paradigm for supplying context to agents)
https://x.com/emollick/status/2040094108853600646

1,800 times in a row. Same task. Human interventions: Zero. But the best part? The success rate: Generalist shipped GEN-1, and that number is just one data point from a longer reliability run that went on for hours. The model hits 99% success rates across tasks where the
https://x.com/IlirAliu_/status/2039976446232531177

New research from Databricks: AI agents get measurably better as they accumulate more memory — not bigger models, not longer contexts, just better retrieval from past experience. Uncurated user logs beat hand-crafted domain instructions after just 62 records. We call it memory
https://x.com/DbrxMosaicAI/status/2042666277328609763
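"Better retrieval from past experience" can be illustrated with a deliberately trivial retriever: score stored records against the query and prepend the top hits to the agent's context. Real systems would use embeddings; the word-overlap scoring and the memory records below are illustrative only.

```python
def retrieve_memories(query: str, records: list, k: int = 3) -> list:
    """Return the k stored records most relevant to the query, scored by
    simple word overlap. A production memory system would use embedding
    similarity, but the retrieval shape is the same."""
    q = set(query.lower().split())
    scored = sorted(
        records,
        key=lambda r: len(q & set(r.lower().split())),  # overlap score
        reverse=True,
    )
    return scored[:k]

# Hypothetical accumulated memories from past sessions.
memories = [
    "user prefers CSV exports",
    "the billing API rejects batch sizes over 100",
    "deploys happen on Fridays",
]
hits = retrieve_memories("how large can a billing API batch be", memories, k=1)
```

The Databricks finding is about where the gains come from: more records to retrieve from, not bigger models or longer contexts.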

[2604.04872] Synthetic Sandbox for Training Machine Learning Engineering Agents
https://arxiv.org/abs/2604.04872

A Taxonomy of RL Environments for LLM Agents
https://leehanchung.github.io/blogs/2026/03/21/rl-environments-for-llm-agents/

🫱 Introducing Neural Computers: what if AI does not just use computers better, but begins to become the running computer itself? Beyond today’s conventional computers, agents, and
https://x.com/MingchenZhuge/status/2042607353175097660

AI-assisted Deployment | Spacelift Intelligence
https://spacelift.io/platform/intelligence?refid=Rundownd+Intelligence+Landing+Page

New on the Engineering Blog: Building Managed Agents–our hosted service for long-running agents–meant solving an old problem in computing: how to design a system for “programs as yet unthought of.” Read more:
https://x.com/AnthropicAI/status/2041929199976640948

AI feels like it’s everywhere now. Spend a few days in San Francisco and it feels like the future has already arrived – agents, autonomy, AI-native companies. It creates a powerful illusion: the rest of the world is moving at the same speed. But it isn’t. Most companies are
https://x.com/TheTuringPost/status/2041455210342871094

Anthropic now blocks first-party harness use too 👀 claude -p --append-system-prompt ‘A personal assistant running inside OpenClaw.’ ‘is clawd here?’ → 400 Third-party apps now draw from your extra usage, not your plan limits. So yeah: bring your own coin 🪙🦞
https://x.com/steipete/status/2040811558427648357

Claude Code and Claude are both down for me. Switching to Codex for now. If you’ve seen how bad Claude’s uptime is lately, it’s not hard to see why they’re blocking 3rd-party apps from using Claude subscriptions. Anthropic needs more GPUs!
https://x.com/Yuchenj_UW/status/2041187141523526011

Claude Code is basically unusable at this point. I give up.
https://x.com/theo/status/2041111862113444221

Claude Code now throws an error if you use it to try and analyze the Claude Code source
https://x.com/theo/status/2041016477047034012

Claude is down :/ so I’m just running my sink
https://x.com/ratlimit/status/2040787102078546068

CodexBar 0.20 is out! 🎚️ 🆕 New providers: Perplexity + OpenCode Go 🔄 Switch Codex accounts without re-login 🔧 Fixed Claude token/cost inflation from dupes 📊 Cost history merges session usage into provider history 16 providers tracked. One menu bar.
https://x.com/steipete/status/2041731875241066517

Having worked at @wandb for years, one thing we always wanted to capture was the “why” behind experiments – not only the runs. Reports help, but it still takes effort to get things down. Now that Claude Code is everyone’s experimentation partner – kicking off research,
https://x.com/_ScottCondron/status/2042643700002545773

I’m working on character evals and noticed that Claude would constantly pick itself as #1, so I removed the model names from the judge and changed things.
https://x.com/steipete/status/2042017534816231486
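The self-preference fix described above (hiding model names from the judge) is easy to sketch: relabel candidate outputs before judging and keep a map to de-anonymize scores afterwards. The labeling scheme below is one hypothetical way to do it:

```python
def anonymize_for_judge(outputs: dict) -> tuple:
    """Strip model names before showing candidates to an LLM judge.
    Returns the relabeled texts plus a label->model map so scores can be
    attributed back after judging. Sorting makes labels deterministic."""
    labeled, texts = {}, []
    for i, (model, text) in enumerate(sorted(outputs.items())):
        label = f"candidate_{chr(ord('A') + i)}"
        labeled[label] = model
        texts.append(f"{label}:\n{text}")
    return texts, labeled

texts, label_map = anonymize_for_judge({"claude": "answer 1", "gpt": "answer 2"})
```

Shuffling candidate order per comparison would additionally guard against position bias, another well-known judge failure mode.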

Nicolas Carlini (67.2k citations on Google Scholar) says Claude is a better security researcher – YouTube

What are the largest software engineering tasks AI can perform? In our new benchmark, MirrorCode, Claude Opus 4.6 reimplemented a 16,000-line bioinformatics toolkit — a task we believe would take a human engineer weeks. Co-developed with @METR_Evals. Details in thread.
https://x.com/EpochAIResearch/status/2042624189421752346

It is impressive it found an exploit in the sandbox. But remember: it was prompted to email the dude.
https://x.com/dbreunig/status/2041633539415011652

@eliebakouch @_ueaj It’s only me guessing. There were all those rumors that Opus 4.6 was supposed to be a Sonnet model, but they switched the name so they could charge Opus-level prices (as it was really good). If Mythos is larger than a traditional Sonnet model (whatever that means), it’s probably
https://x.com/code_star/status/2041641867050471922

Claude Mythos gets frustrated and confused when outputting the wrong token
https://x.com/scaling01/status/2041586096870457714

@DouthatNYT Mythos is big, but this post is simply wrong. > Top secret networks are air-gapped (not connected to the Internet). That doesn’t mean they’re unhackable, but you likely need physical access. > Developing a zero-day exploit is not synonymous with using it undetected. The U.S.
https://x.com/JonKBateman/status/2041949065777234051

I was told about the Mythos release, but didn’t have access, so have no personal experience to add. Two points from brief: 1) It is not built for IT security, it is just a good enough model that it is good at that too 2) This is the first, not last, model to raise security risks
https://x.com/emollick/status/2041578945531830695

ThursdAI – live from AI Engineer Europe – Mythos, Codex w/ VB, Evals w/ Peter & surprise guests – YouTube

Agent = model + harness Managed Agents = agent + runtime + infra (fully hosted) Anthropic wants to sell agents, not only the models. It’s a huge market, and it will change the pricing structure away from tokens. (They ship so fast because they have Mythos. I want it so much.)
https://x.com/Yuchenj_UW/status/2041933422453780556

But here is what we found when we tested: We took the specific vulnerabilities Anthropic showcases in their announcement, isolated the relevant code, and ran them through small, cheap, open-weights models. Those models recovered much of the same analysis. Eight out of eight
https://x.com/ClementDelangue/status/2041953761069793557

It would be amazing (wrong word? Needed? Important?) to see @simonw as one of the trusted testers of Mythos. It makes all the sense in the world to invite the person behind the idea of the Lethal Trifecta. I hope someone at @Anthropic invites him into the project. There should be
https://x.com/TheTuringPost/status/2041701933556375935

oh husbant… you are not get access to anthropic mythos-preview and now we are stuck in permanent underclass
https://x.com/dejavucoder/status/2041588460923056540

Our run-rate revenue has surpassed $30 billion, up from $9 billion at the end of 2025, as demand for Claude continues to accelerate. This partnership gives us the compute to keep pace. Read more:
https://x.com/AnthropicAI/status/2041275563466502560

Ollama’s cloud is now the best place to run Gemma 4 in the cloud! Available through a subscription for developers and third-party integrations. 🦞OpenClaw ollama launch openclaw --model gemma4:31b-cloud Claude Code ollama launch claude --model gemma4:31b-cloud Run the model
https://x.com/ollama/status/2041238722914685336

Need to set up my OpenClaw to update and restart my Claude Dispatch to add computer use so I can use that instead.
https://x.com/emollick/status/2040166468877164704

With GLM-5.1,
https://t.co/nvW0zf0SAH maintains the #1 open model rank in Code Arena and is now within ~20 points of the top overall while outperforming Claude Sonnet 4.6, Opus 4.5, GPT-5.4 High, and Gemini-3.1 Pro. Open models are now competitive at the frontier.
https://x.com/arena/status/2042643933768151485

[1/n] 🚀 Excited to share XpertBench: moving beyond saturated exam-style benchmarks to expert-level, open-ended workflow evaluation for LLMs. An LLM-based agent is not a bubble only if it can handle ambiguity, long-horizon reasoning, and end-to-end execution in the wild. That is
https://x.com/GeZhang86038849/status/2041184352516919690

Agent skills look great in demos. Hand them a curated toolbox, and they shine. But what happens when the agent has to find the right skill from a large, unfiltered collection on its own? New research benchmarks LLM skill usage in realistic settings and finds that performance
https://x.com/dair_ai/status/2041540525539614797

Announcing APEX-Agents-AA, our latest leaderboard on Artificial Analysis, evaluating AI agents on long-horizon professional services tasks with realistic application dependencies This is our implementation of the APEX-Agents benchmark – an agentic work task evaluation
https://x.com/ArtificialAnlys/status/2041896261826310598

ClawBench: Can AI Agents Complete Everyday Online Tasks? A real-world benchmark for AI agents: 153 everyday online tasks across live websites (shopping, booking, job apps). Even top models struggle–dropping from ~70% on sandbox benchmarks to as low as 6.5% here.
https://x.com/arankomatsuzaki/status/2042441980710699364

NIST is developing best practices for LLM / agent evaluation. Our feedback: benchmarking must move beyond 1-dimensional capability evaluation and incorporate properties such as reliability.
https://t.co/yWV9pv6ldb By @steverab, @sayashk, @PKirgis, and me.
https://x.com/random_walker/status/2041533905354858679

AIs can now often do massive easy-to-verify SWE tasks and I’ve updated towards shorter timelines — LessWrong
https://www.lesswrong.com/posts/dKpC6wHFqDrGZwnah/ais-can-now-often-do-massive-easy-to-verify-swe-tasks-and-i

Introducing GLM-5.1 from @Zai_org on Together AI. AI natives can now use GLM-5.1 on Together and benefit from reliable inference for production-scale agentic engineering and long-horizon coding workflows.
https://x.com/togethercompute/status/2042002522798235935

We taught a 1.3M parameter model to play DOOM. It outperforms LLMs up to 92,000x its size. Happy Easter Monday! Here’s our Easter egg release: SauerkrautLM-Doom-MultiVec-1.3M. 17.8 average points per episode. We benchmarked our tiny model against GPT-4o-mini (via OpenAI API),
https://x.com/DavidGFar/status/2041063368656585002

[2603.28052] Meta-Harness: End-to-End Optimization of Model Harnesses
https://arxiv.org/abs/2603.28052

Must-read research of the week ▪️ Meta-Harness: End-to-End Optimization of Model Harnesses ▪️ A Survey of On-Policy Distillation for Large Language Models ▪️ The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook ▪️ Marco DeepResearch ▪️ FIPO: Eliciting Deep
https://x.com/TheTuringPost/status/2042026647063556414

Seems like a good model from Meta that is still trailing the current series of releases. The most important thing to note is that it is not open weights. That was the main reason that Meta’s models were so important. Without that, it is a lot harder to predict the value of Spark
https://x.com/emollick/status/2041924282964394085

try for yourself!
https://t.co/DipeeIuXm2 or download Meta AI app
https://x.com/alexandr_wang/status/2041985846950424760

Our first model from MSL, Muse Spark, is now available on
https://t.co/qBMQ6BPVgP! This is an efficient all-rounder model. It supports fast responses, deeper thinking, visual chain of thought, and a higher-inference “Contemplating” mode. Plus, it’s natively multimodal. 1/
https://x.com/jack_w_rae/status/2041925332631183421

1/ It’s been so fun working with @shengjia_zhao, @alexandr_wang and the team to build muse spark from scratch. It is early and has rough edges, but excited to continue our research velocity. I especially love that we’re doubling down on the fundamental science. We’re focused on
https://x.com/ananyaku/status/2041913147842556390

1/ Muse Spark is live, and alongside it, our new Advanced AI Scaling Framework which details how we evaluate and prepare for advanced AI. We tested across bio, chem, cyber, and loss of control risks before and after mitigations. Muse Spark achieves a 98% bioweapons refusal rate
https://x.com/summeryue0/status/2041956901769113948

Check out Muse Spark, our first milestone in the quest for personal superintelligence! Scaling this with the team has been a total blast. Give it a spin and let us know what you think! 🥑
https://x.com/ren_hongyu/status/2041922484040298796


try muse spark on
https://x.com/alexandr_wang/status/2041956770864885870

🤯NEW
https://t.co/JflLUfop4O MOBILE COOKS! The BEST command center for your Hermes Agent! One unified workspace. Zero tab/terminal chaos. Full power at your fingertips: 🤖 Chat + live tool execution 🧠 Memory browser ⭐️Skills catalog (100+) 🚀 Built-in Terminal 🌐File
https://x.com/outsource_/status/2042411498081866175

🚨 THIS IS ABSOLUTELY WILD. Now you tell Hermes Agent any mathematical concept… and it generates a precise 3Blue1Brown-style animation in seconds. This democratizes elite-level educational content. The future of teaching (and learning) is already here.
https://x.com/ErickSky/status/2040956335764734235

9 open agents that can improve themselves (a collection inspired by Hermes Agent) ▪️ HyperAgents ▪️ Agent0 ▪️ EvoAgentX ▪️ AgentEvolver ▪️ Agent Zero ▪️ Letta Code ▪️ LettaBot ▪️ LangGraph Reflection ▪️ SuperAGI Save this list and check it out for links and to explore how these
https://x.com/TheTuringPost/status/2040752778113659345

Happy to say, we have hit 50 thousand stars on the Hermes Agent repo. Like every day, thank you all who have helped build this crazy project!
https://x.com/Teknium/status/2042698709293764985

Hermes Agent @Teknium is on fire. The issue I filed yesterday, asking for Discord slash-command skill registration, was merged today. That turnaround is remarkably fast, so now I can try migrating fully to Hermes. Also, if you run into problems with Hermes, you can have Claude Code analyze them first and then file an issue upstream; that helps the developers understand your needs quickly.
https://x.com/Yonah_x/status/2041461508320751759

Hermes Agent now supported by
https://x.com/Teknium/status/2042559951605039531

Hermes Agent tip of the night – did you know that with the OpenAI endpoint Hermes Agent creates for itself, you can use @OpenWebUI to have a chat GUI for your agent? See the full guide here:
https://x.com/Teknium/status/2040998328461316524

If you are in or near Berlin join Rick for a Hermes Agent meetup! We have a few other community members organizing meetups in their cities, and we are always happy to support if anyone else would like to start one.
https://x.com/NousResearch/status/2041509272815534453

Introducing the Manim skill for Hermes Agent. Manim is an engine for creating precise programmatic animations for mathematical and technical explainers, made famous by the @3blue1brown channel.
https://x.com/NousResearch/status/2040931043658567916

Just dropped a manim skill into @Teknium @NousResearch hermes agent setup and asked it to explain a concept it just made the video. animated. narrated. worked on the first try. the agent learned the skill, used it, and now it knows how to do it forever.
https://x.com/noctus91/status/2041084870722793707

Local setup of @OpenWebUI with Hermes Agent from @NousResearch and a terminal backend with Docker, powered by vLLM.
https://x.com/magiknono/status/2040524343973740584

preliminary light testing: @NousResearch may have cooked with hermes agent. will dig in a bit more and see.
https://x.com/fujikanaeda/status/2041518373985808519

This feels so random in the universe of possible agent skills, but now I’m going to have to setup Hermes just to try this out. LLMs historically are pretty darn bad @ manim, but it’s an exceptionally useful animation engine.
https://x.com/Sentdex/status/2041165530812334417

This might be the first IRL event for Hermes Agent! If you are near Munich sign up!
https://x.com/Teknium/status/2041552242919243792

Update: Discord and Telegram Bots with Hermes Agent will now have slash commands to force load any skill you have available (Up to 75, due to the platform limits)! `hermes update` to access now or wait until the next major release soon!
https://x.com/Teknium/status/2041233409901769133

What makes Hermes Agent by @NousResearch distinct from any other local agent you’ve ever tried? 1. A huge focus on self-improvement. This agent builds its own skills through a self-evaluation loop. The more you use it, the more it improves. 2. Layered memory stack: small …
https://x.com/TheTuringPost/status/2039991352965177727

you have 4 agents running. one is waiting for approval. you have no idea which pane it’s in. Hermes HUD v0.4.0 fixes this. the Agents tab now maps live processes to tmux panes via TTY and surfaces a queue of agents that need your attention. @NousResearch @Teknium
https://x.com/aijoey/status/2040978270439580042

OpenClaw development is moving too fast. They’re adding features that should be skills. I’ve now onboarded 10+ “normie” friends to AI agents, and every single one of them prefers Hermes. Great work @Teknium
https://x.com/DoctaDG/status/2041051272560923090

We keep saying we want open-source frontier agents. Fine. Then let’s build the dataset. @badlogicgames, creator of Pi, just shared some of his agent traces used to build Pi on @huggingface. I’m now sharing some of mine too, exporting them from @hermes, @opencode, and Claude via …
https://x.com/ClementDelangue/status/2041189872556269697

@Teknium @OpenWebUI That’s really cool! Encourage people to check out my native Hermes WebUI as well:
https://t.co/vNWM6o4s18 – supports nearly every native feature of the terminal version now
https://x.com/nesquena/status/2041000592215298123

Hermes ecosystem is on 🔥 The ecosystem map has been updated to reflect the new v0.8.0 + several new repos added (submit an issue if yours is missing)
https://x.com/KSimback/status/2042369292813861334

Nous officially cooked with Hermes Agent. First time using a local-model agent that replaces a large segment of my Claude Code workflow. Using Qwen3-Coder-Next 80B 4bit, plus a swarm skill for ensemble on harder stuff a model is getting stuck on.
https://x.com/Sentdex/status/2042607880726081725

The Codex App is now our most used surface, ahead of the VS Code extension and the CLI. No wonder it inspires a few others 👀 You can install it here
https://t.co/Lwg13vEJDn + you get up to $500 in credits if you are getting started as a business or enterprise.
https://x.com/thsottiaux/status/2039901964679688437

To celebrate 3 million weekly codex users, we are resetting usage limits. We will do this every million users up to 10 million. Happy building!
https://x.com/sama/status/2041658719839383945

New blog post: converting 30k @arxiv papers to Markdown using SOTA OCR models to enable chat with paper functionality Includes: > leveraging an open OCR model (Chandra 2 by @datalabto) > running on GPU infra – @huggingface Jobs > using Codex with a SKILL.md
https://x.com/NielsRogge/status/2041556496320700626

Need one of the older models after the deprecation? Use Codex with your own API key. After April 14, Codex with ChatGPT sign-in will support these models: • gpt-5.4 • gpt-5.4-mini • gpt-5.3-codex • gpt-5.3-codex-spark (Pro only) • gpt-5.2
https://x.com/OpenAIDevs/status/2041611001410593163

We’re retiring older models in Codex when you sign in to Codex with your ChatGPT account on April 14: • gpt-5.2-codex • gpt-5.1-codex-mini • gpt-5.1-codex-max • gpt-5.1-codex • gpt-5.1 • gpt-5
https://x.com/OpenAIDevs/status/2041610989607727164

Can’t wait to join the team at @openai building codex. Would love to hear what you love about it or want changed. We’re moving fast. DMs open.
https://x.com/simpsoka/status/2040144952550969516

The Codex app server was such a brilliant stroke of foresight that really doesn’t get enough love. Not only are you allowed to use your ChatGPT account with any harness, but you can build your own apps directly on top of theirs. They just make building on and with Codex such a …
https://x.com/LLMJunky/status/2040506388292546761

We just made it frictionless for teams to try Codex! > New $0 Codex-only seat for Codex access that is fully usage-based > Annual team seats are dropping from $25 to $20 per month For each Codex-only seat you add to a new or existing workspace, we’ll credit your team $100, for …
https://x.com/rohanvarma/status/2039818201060811126

ChatGPT is now available in CarPlay. The voice mode you know, now available on-the-go. Rolling out to iPhone users running iOS 26.4+ where CarPlay is supported.
https://x.com/OpenAI/status/2039748699350532097?s=20

The next version of @OpenClaw comes with native video generation. To start, I added support for the following companies: – Alibaba – BytePlus – fal – Google – MiniMax – OpenAI – Qwen – Together – xAI
https://x.com/steipete/status/2040928953653744003

We’re updating our ChatGPT Pro and Plus subscriptions to better support the growing use of Codex. We’re introducing a new $100/month Pro tier. This new tier offers 5x more Codex usage than Plus and is best for longer, high-effort Codex sessions. In ChatGPT, this new Pro tier …
https://x.com/OpenAI/status/2042295688323875316?s=20

You can now use FAST mode with the OpenAI provider on GPT-5.4 via the /fast command! `hermes update` to start using!
https://x.com/Teknium/status/2042468113699291636

Go from project setup to deployment with the @Vercel plugin in the Codex app.
https://x.com/OpenAIDevs/status/2040443885374304639

Today we’re announcing the Billion Dollar Build. An 8-week competition where teams will use Perplexity Computer to build a company with a path to $1B. Finalists have the opportunity to secure up to $1M in investment from the Perplexity Fund and up to $1M in Computer credits.
https://x.com/perplexity_ai/status/2041929222135173466

Use Design Mode in Cursor 3 to annotate and target UI elements in the browser.
https://x.com/cursor_ai/status/2041561791243940092

🚀 Qwen Code v0.14.0 – v0.14.2 are now available. Channels: Control Qwen Code remotely from Telegram, DingTalk, or WeChat — send a message from your phone, get results on your server. Cron Jobs: Schedule recurring AI tasks — auto-run tests every 30 min, pull & build every …
https://x.com/Alibaba_Qwen/status/2042551216769765449
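The schedules mentioned above map naturally onto standard five-field cron expressions. Assuming Qwen Code’s cron jobs accept the usual crontab syntax (an assumption — the release notes don’t specify the format), the two examples would look like:

```
# min   hour  day  month  weekday
*/30    *     *    *      *        # auto-run tests every 30 minutes
0       *     *    *      *        # pull & build at the top of every hour
```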

back in 2020 when i was writing blogs, i was always inspired by the 3blue1brown animations. it’s great to see that agents can now write manim code for anything by prompting. now that we have this, every blog should have these little animations.
https://x.com/casper_hansen_/status/2041046264758858081
