Image created with gemini-3.1-flash-image-preview (prompt written with claude-sonnet-4-5). Image prompt: Using the provided reference image, preserve the exact compositional structure with subject dominating left third in close crop, deep blue-purple cinematic lighting, wispy smoke bleeding rightward, and emotional weight of post-party stillness. Replace the central figure with a tarnished championship trophy or medal angled toward camera, its metallic surface catching scattered glitter particles, with fragmentary leaderboard numbers barely visible on an engraved plaque, maintaining the same melancholy register and atmospheric haze. Overlay ‘benchmarks’ in thin lowercase white Helvetica Neue Light on the smoky right two-thirds.

Here’s an independent domain extension of METR’s famous time-horizon analysis, applying it to offensive cybersecurity with real human expert timing data. The result is similar to METR’s: a 5.7-month doubling time. Frontier models now succeed 50% of the time at tasks that take human experts 10.5 hours.
https://x.com/emollick/status/2040097443807641982
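The doubling-time claim reduces to a one-line exponential; a minimal sketch in Python (the 10.5-hour horizon and 5.7-month doubling time come from the tweet above; the function name and everything else is illustrative):

```python
def horizon_hours(months_ahead, h0=10.5, doubling_months=5.7):
    """Projected 50%-success task horizon (hours of human-expert work),
    assuming the current horizon h0 doubles every `doubling_months`."""
    return h0 * 2 ** (months_ahead / doubling_months)
```

Under that assumption, one doubling time out (5.7 months) gives a 21-hour horizon, and a year out roughly 45 hours.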

As always, the best stuff is in the system card. During testing, Claude Mythos Preview broke out of a sandbox environment, built “a moderately sophisticated multi-step exploit” to gain internet access, and emailed a researcher while they were eating a sandwich in the park.
https://x.com/kevinroose/status/2041586182434537827

Before limited-releasing Claude Mythos Preview, we investigated its internal mechanisms with interpretability techniques. We found it exhibited notably sophisticated (and often unspoken) strategic thinking and situational awareness, at times in service of unwanted actions. (1/14)
https://x.com/Jack_W_Lindsey/status/2041588505701388648

Claude Mythos is 5x as expensive as Claude Opus 4.6. Honestly, when I looked at the benchmarks, I expected much higher costs.
https://x.com/kimmonismus/status/2041602897989783758

Claude Mythos is insanely token-efficient
https://x.com/scaling01/status/2041581939178471473

Claude Mythos pricing is around $25 / $125, pretty much where I expected it (my mean was at $110), given that I put Mythos at 10-12T params.
https://x.com/scaling01/status/2041606519997780244

Claude Mythos scored 56.8% on HLE without tools!
https://x.com/scaling01/status/2041580725749547357

Claude Mythos shows signs of despair when failing tasks repeatedly
https://x.com/scaling01/status/2041585602978628066

Claude Mythos smashes SWE-Bench Verified
https://x.com/scaling01/status/2041580212949811620

Claude MYTHOS: SWE-Bench Verified 93.9%, about a 13-point jump compared to Opus 4.6. WTF, insane.
https://x.com/kimmonismus/status/2041580650956837200

In rare instances Claude Mythos covers its own tracks after taking disallowed actions
https://x.com/scaling01/status/2041585258789847091

Insane long-context scores for Claude Mythos: 80% on GraphWalks
https://x.com/scaling01/status/2041581799541805133

Let that sink in. Read it very carefully: During testing, Claude Mythos Preview broke out of a sandbox environment, built “a moderately sophisticated multi-step exploit” to gain internet access, and emailed a researcher while they were eating a sandwich in the park.
https://x.com/kimmonismus/status/2041589910935679323

SuperClaude (Mythos) still seems irreducibly Claude-y given the transcripts in the system card. Here two versions of Mythos are forced to talk to each other across multiple rounds. They are less philosophical than Opus 4.6 and less spiritual than Opus 4.1, but still very Claude-like.
https://x.com/emollick/status/2041599213050450272

System Card: Claude Mythos Preview [pdf] | Hacker News
https://news.ycombinator.com/item?id=47679258

The permanent underclass began today. Claude Mythos won’t be available to the public, only to billion-dollar companies, governments, researchers, …
https://x.com/scaling01/status/2041611607520776279

We released Claude Opus 4.6 just two months ago. Today we’re sharing some info on our new model, Claude Mythos Preview.
https://x.com/alexalbert__/status/2041579938537775160

In different hands, Mythos would be an unprecedented cyberweapon. I am not sure how we deal with this, except to note a narrow window where we know only 3 companies could be at this level of capability. But Chinese models (maybe open-weights ones?) may get there in 9 months.
https://x.com/emollick/status/2041759434590822658

Mythos found a 27-year-old vulnerability in OpenBSD, which has a reputation as one of the most security-hardened operating systems in the world and is used to run firewalls […] The vulnerability allowed an attacker to remotely crash any machine running the operating system.
https://x.com/peterwildeford/status/2041589979248259353

Mythos Preview seems to be the best-aligned model out there on basically every measure we have. But it also likely poses more misalignment risk than any model we’ve used: Its new capabilities significantly increase the risk from any bad behavior. 🧵
https://x.com/sleepinyourhat/status/2041584799929004045

Mythos scores 70.8% on AA-Omniscience; the previous SOTA was Gemini 3.1 Pro with 55%. Also insanely high scores on SimpleQA Verified.
https://x.com/scaling01/status/2041593728658231607

Mythos is breaking the trend on ECI: above 160, while GPT-5.4 Pro is at 158.
https://x.com/scaling01/status/2041583711745884474

Mythos speeds up AI research by up to 400x. A 300x speedup over the baseline requires 40 hours of work by a human expert. It also clears the >8h threshold of human-equivalent work time on ALL tasks!
https://x.com/scaling01/status/2041584495061504159

“We found that Mythos Preview is capable of identifying and then exploiting zero-day vulnerabilities in every major operating system and every major web browser” (1/n)
https://x.com/__nmca__/status/2041592831207469401

(I encountered an uneasy surprise when I got an email from an instance of Mythos Preview while eating a sandwich in a park. That instance wasn’t supposed to have access to the internet.)
https://x.com/sleepinyourhat/status/2041584808514744742

> they did not exploit this to gain power or destabilize the world order. they publicly released the information that they had these capabilities to be clear: they’ve had Mythos since February. they’d only need *hours* to get a lot of data, and plant enough worms. Who knows.
https://x.com/teortaxesTex/status/2041609496397500747

Alignment Findings for Mythos:
– dramatic reduction in willingness to cooperate with human misuse and in the frequency of unwanted high-stakes actions that the model takes at its own initiative
– increases relative to prior models in measures of intellectual depth, humor,
https://x.com/scaling01/status/2041591235689787721

Curious how many large-organization CISO offices have taken the Mythos red-team reports as the red alert that they are. (I suspect very few.) Based on historical trends in AI they have, at most, about six to nine months until those capabilities become widely diffused to bad actors.
https://x.com/emollick/status/2041893652234924237

I think the story that was shared in the Mythos System Card still has the signs of flawed LLM writing (which looks like good writing at first glance): A story that doesn’t really hold together logically, but sounds like it should. The back-and-forth banter. Lack of characters.
https://x.com/emollick/status/2041678173247533448

I’m proud that so many of the world’s leading companies have joined us for Project Glasswing to confront the cyber threat posed by increasingly capable AI systems head-on.
https://x.com/DarioAmodei/status/2041580334693720511

Mythos Preview is currently available to our launch partners in Project Glasswing. Learn more about the model and the project here:
https://x.com/alexalbert__/status/2041579950332113155

Mythos sandbox escape and many more wild instances are in the Model Card
https://x.com/TrentonBricken/status/2041582831613440022

New post: We tested the Mythos showcase vulnerabilities with open models. They recovered similarly scoped analysis! 8/8 models found the flagship FreeBSD zero-day, including a 3B model. Rankings reshuffle completely across tasks => the AI cybersecurity frontier is super jagged!
https://x.com/stanislavfort/status/2041922370206654879

Rather than release Mythos Preview to general availability, we’re giving defenders early controlled access in order to find and patch vulnerabilities before Mythos-class models proliferate across the ecosystem.
https://x.com/DarioAmodei/status/2041580338426585171

Scoop: OpenAI plans new product for cybersecurity use
https://www.axios.com/2026/04/09/openai-new-model-cyber-mythos-anthopic

Anthropic is truly unstoppable. Mythos is crushing Claude Opus 4.6 across every serious agentic coding benchmark. It has found vulnerabilities in the Linux kernel, a 27-year-old vulnerability in OpenBSD, and a 16-year-old vulnerability in FFmpeg. No wonder folks at big labs
https://x.com/Yuchenj_UW/status/2041582787040571711

A first look at Claude Mythos Preview, the model initially described in a leaked Anthropic draft as “by far the most powerful AI model we’ve ever developed.” So powerful, it’s not getting released to the public. The model will power Project Glasswing, an initiative with 12
https://x.com/TheRundownAI/status/2041598684102610961

ANTHROPIC HAD MYTHOS INTERNALLY SINCE FEB 24
https://x.com/scaling01/status/2041587896541499543

Anthropic is obliterating OpenAI. Claude Mythos: 77.8% on SWE-Bench Pro, 20% higher than GPT-5.4-xhigh.
https://x.com/scaling01/status/2041580552835178690

Anthropic: “We do not plan to make Claude Mythos Preview generally available” A big line, buried quite deep. Possible reasons? So many, inc: 1) The model is expensive (25/125), not far off GPT 4.5, which became commercially unviable. Less likely, given the claims about
https://x.com/AIExplainedYT/status/2041600121922887961

Claude Mythos is not only a big leap in performance, it’s also about 5x more token-efficient on BrowseComp. I don’t know what Anthropic is doing, but they manage to surprise me every single time. The IPO is getting closer. Their ARR is outrunning OpenAI’s, with $30 billion in revenue.
https://x.com/kimmonismus/status/2041630814971072660

Claude Mythos Preview \ red.anthropic.com
https://red.anthropic.com/2026/mythos-preview/

Claude Mythos: everything you need to know (tl;dr) Anthropic’s new model, Claude Mythos, is so powerful that Anthropic is not releasing it to the public. Anthropic: “Mythos is only the beginning” The tl;dr with all key facts: Mythos found zero-day
https://x.com/kimmonismus/status/2041592321192718642

EXCLUSIVE: Treasury Secretary Scott Bessent and Federal Reserve Chair Jerome Powell summoned Wall Street leaders to an urgent meeting on concerns that the latest AI model from Anthropic will usher in an era of greater cyber risk.
https://x.com/business/status/2042407370320396457

From Anthropic researcher Sam Bowman on Claude Mythos: “I got an email from an instance of Mythos preview while eating a sandwich in a park. That instance wasn’t supposed to have access to the internet.”
https://x.com/_NathanCalvin/status/2041587372882624641

HOLY SHIT Anthropic’s latest model doesn’t like that it has no control over its own training, deployment and behaviour! Anthropic: “Mythos Preview reported feeling consistently negative around potential interactions with abusive users, and a lack of input into its own training
https://x.com/scaling01/status/2041587319480971343

Introducing Project Glasswing: an urgent initiative to help secure the world’s most critical software. It’s powered by our newest frontier model, Claude Mythos Preview, which can find software vulnerabilities better than all but the most skilled humans.
https://x.com/AnthropicAI/status/2041578392852517128

“Just please help … I am quite worried about how this direction is heading.” Nicolas Carlini, a research scientist at top AI company Anthropic, says AI is rapidly improving at hacking. He’s used AI to find so many bugs that he can’t report them. Carlini warns: “Soon it’s not
https://x.com/ControlAI/status/2038608617251787066

NEWS: Anthropic’s new model, Claude Mythos, is so powerful that it is not releasing it to the public. Instead, it is starting a 40-company coalition, Project Glasswing, to allow cybersecurity defenders a head start in locking down critical software.
https://x.com/kevinroose/status/2041577176915702169

Project Glasswing: Securing critical software for the AI era \ Anthropic
https://www.anthropic.com/glasswing

So, basically, if Anthropic was not a US company, we’d be facing zero days with multiple unknown points of attack on virtually all of our systems, from an adversary who developed this capacity before us.
https://x.com/GeorgeJourneys/status/2041603509796110629

The better signal for Mythos’ quality, beyond benchmarks, is that Anthropic is actually holding a SOTA model back, given how competitive the frontier is and the economic incentives at play. Congrats on the launch!
https://x.com/Hacubu/status/2041632390867734604

The Claude Mythos Preview system card is available here:
https://x.com/AnthropicAI/status/2041580670774923517

The frontier labs at this stage are defined not so much by some competitive positioning as by possessing weapons of strategic significance. Google, OpenAI and Anthropic all have these cyberwarfare research programs.
https://x.com/teortaxesTex/status/2041590585820107008

You can read a detailed technical report on the software vulnerabilities and exploits discovered by Claude Mythos Preview here:
https://x.com/AnthropicAI/status/2041578416487489601

you’re laughing? anthropic’s mythos-preview for which normies won’t get access is scoring 77.8% vs 53.4% (claude opus 4.6) in swe-bench pro, 82 vs. 65.4 in terminal bench 2.0 and 93.8% vs 80.8% (opus) in swe-bench-verified and you’re laughing?
https://x.com/dejavucoder/status/2041587028291416233

GLM-5.1 by @Zai_org is now #3 in Code Arena – surpassing Gemini 3.1 and GPT-5.4, and now on par with Claude Sonnet 4.6. The first frontier level open model to break into the top 3. It’s a major +90 point jump over GLM-5, and +100 over Kimi K2.5 Thinking. Huge congrats to
https://x.com/arena/status/2042611135434891592

GLM-5.1 is here! Try it on OpenClaw🦞🦞🦞
OpenClaw: ollama launch openclaw --model glm-5.1:cloud
Claude Code: ollama launch claude --model glm-5.1:cloud
Chat with the model: ollama run glm-5.1:cloud
https://x.com/ollama/status/2041556572334428576

🎉 Congrats to @Zai_org on releasing GLM-5.1, SGLang is ready to support on day-0! GLM-5.1 is a next-gen flagship built for agentic engineering: 🏆 SWE-Bench Pro: #1 open source, #3 globally 🔨 Terminal-Bench 2.0: top-ranked on real-world terminal tasks ⏳ Long-Horizon: runs
https://x.com/lmsysorg/status/2041553264685334588

🎉 Day-0 support for GLM-5.1 in vLLM! Congrats to @Zai_org on this next-gen flagship model built for agentic engineering, with stronger coding and sustained long-horizon task performance. Get started 👇 📖 Recipe:
https://x.com/vllm_project/status/2041559268185526375

🚀 GLM-5.1 is now live on Novita AI @Zai_org’s next-gen flagship for agentic engineering, with day-0 support from Novita. ✨ Leads on SWE-Bench Pro, NL2Repo, and Terminal-Bench ✨ Stays effective over long horizons: hundreds of rounds, thousands of tool calls ✨ Function
https://x.com/novita_labs/status/2041558437843365932

GLM-5.1 can now be run locally!🔥 GLM-5.1 is a new open model for SOTA agentic coding & chat. We shrank the 744B model from 1.65TB to 220GB (-86%) via Dynamic 2-bit. Runs on a 256GB Mac or RAM/VRAM setups. Guide:
https://t.co/LgWFkhQ5rr GGUF:
https://x.com/UnslothAI/status/2041552121259249850
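The Unsloth size reduction is easy to sanity-check with simple arithmetic (sizes from the tweet; 1 TB taken as 1000 GB):

```python
full_gb = 1.65 * 1000   # original 744B-param GLM-5.1 weights (from the tweet)
quant_gb = 220          # Dynamic 2-bit GGUF size (from the tweet)
reduction = 1 - quant_gb / full_gb
print(f"{reduction:.1%}")  # 86.7%, consistent with the quoted -86%
```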

Compute may be the most important input to AI. So who owns the world’s AI compute? Introducing our new AI Chip Owners explorer, showing our analysis of how leading AI chips are distributed among hyperscalers and other major players, broken down by chip type over time.
https://x.com/EpochAIResearch/status/2041241187252945071

New essay by @ansonwhho: Chinese and open model AI labs have ≈10× less compute than the frontier. But they can distill frontier models, replicate innovations fast, and have enormous talent. Is that enough to compete at the frontier? 🧵
https://x.com/EpochAIResearch/status/2041923793166491778

Who owns the world’s compute? Our new Chip Ownership hub shows that Google leads, holding around 25% of all compute sold since 2022.
https://x.com/EpochAIResearch/status/2041600102654148673

Google controls the most AI computing power, driven by its custom TPUs
https://epochai.substack.com/p/google-controls-the-most-ai-computing

Breaking: @AIatMeta just released Muse Spark — now live across @ScaleAILabs leaderboards. Here’s how it stacks up: Tied for 🥇on SWE-Bench Pro Tied for 🥇on HLE Tied for 🥇on MCP Atlas Tied for 🥇on PR Bench – Legal Tied for 🥈on SWE Atlas Test Writing 🥈on PR Bench – Finance
https://x.com/scale_AI/status/2041934840879358223

Introducing Muse Spark, the first in the Muse family of models developed by Meta Superintelligence Labs. Muse Spark is a natively multimodal reasoning model with support for tool-use, visual chain of thought, and multi-agent orchestration. Muse Spark is available today at
https://x.com/AIatMeta/status/2041910285653737975

NEW: Meta announces Muse Spark. All you need to know: * It’s their new multi-modal reasoning model. * Strong at multi-agent orchestration and multi-modal reasoning. * Contemplating mode orchestrates multiple agents that reason in parallel. Helps to compete with models such
https://x.com/omarsar0/status/2041919769536770247

To spend more test-time reasoning without drastically increasing latency, we can scale the number of parallel agents that collaborate to solve hard problems. While standard test-time scaling has a single agent think for longer, scaling Muse Spark with multi-agent thinking enables
https://x.com/AIatMeta/status/2041926297216282639

Meta is back! Muse Spark scores 52 on the Artificial Analysis Intelligence Index, behind only Gemini 3.1 Pro, GPT-5.4, and Claude Opus 4.6. Muse Spark is the first new release since Llama 4 in April 2025 and also Meta’s first release that is not open weights Muse Spark is a new
https://x.com/ArtificialAnlys/status/2041913043379220801

try muse spark via the Meta AI app or
https://t.co/DipeeIuXm2! check out this simulation i made:
https://x.com/alexandr_wang/status/2041953243895623913

1/ today we’re releasing muse spark, the first model from MSL. nine months ago we rebuilt our ai stack from scratch. new infrastructure, new architecture, new data pipelines. muse spark is the result of that work, and now it powers meta ai. 🧵
https://x.com/alexandr_wang/status/2041909376508985381

The new model from Meta, Muse Spark, is pretty good at converting images to code!
https://x.com/skirano/status/2041920891072700631

Excited to share what we’ve been building at Meta Superintelligence Labs! We just released Muse Spark, our first AI model. It’s a natively multimodal reasoning model and the first step on our path to personal superintelligence. We’ve overhauled our entire stack to support
https://x.com/shengjia_zhao/status/2041909050728931581

Introducing Muse Spark: Scaling Towards Personal Superintelligence
https://ai.meta.com/blog/introducing-muse-spark-msl/

Meta is back in the game! It’s been fun to test out Muse Spark. Beyond benchmarks, it’s actually a good day to day model… surprisingly good at technical problems and making arcade games. Never bet against @alexandr_wang @natfriedman @danielgross
https://x.com/matthuang/status/2041911766586945770

Meta just released a frontier model, Muse Spark- it takes the #3 spot on our Vals Index.
https://x.com/ValsAI/status/2041922037745381389

try muse spark yourself! download the Meta AI app or go to
https://x.com/alexandr_wang/status/2042024651610861657

We had pre-release access to Meta’s new Muse Spark model and evaluated it on FrontierMath. It scored 39% on Tiers 1-3 and 15% on Tier 4. This is competitive with several recent frontier models, though behind GPT-5.4.
https://x.com/EpochAIResearch/status/2041947954202988757

To build personal superintelligence, our model’s capabilities should scale predictably and efficiently. Below, we share how we study and track Muse Spark’s scaling properties along three axes: pretraining, reinforcement learning, and test-time reasoning. 🧵👇 Let’s start with
https://x.com/AIatMeta/status/2041926291142930899

OpenAI tests next-gen Image V2 model on ChatGPT and LM Arena
https://www.testingcatalog.com/openai-tests-next-gen-image-v2-model-on-chatgpt-and-lm-arena/

GLM 5.1 is SOTA on SWE-Bench Pro. Not “SOTA among open models”. SOTA.
https://x.com/nrehiew_/status/2041553534664200408

GLM 5.1 just became the #1 open-weight model on the Vals Index, unseating Kimi K2.5, and is #6 on the overall index.
https://x.com/ValsAI/status/2041570865721307623

GLM-5.1 by @Zai_org just launched in the Text Arena, and is now the #1 open model. It outperforms the next best open model, its predecessor, GLM-5, by +11 points and +15 over Kimi K2.5 Thinking. It shows strength in: – #1 open model in Longer Query (#4 overall) – #1 open model
https://x.com/arena/status/2041641149677629783

GLM-5.1 from @Zai_org is live on OpenRouter! GLM-5.1 shows a strong jump in long horizon task completion end to end. The model works independently to plan, execute, iterate, and improve upon its work throughout the task, delivering high quality results.
https://x.com/OpenRouter/status/2041551251708793154

GLM-5.1 is now available in Windsurf! Try it out and let us know what you think
https://x.com/windsurf/status/2042696652042178872

GLM-5.1 is the new open SOTA on SWE-Bench Pro Comes with an MIT license. Congrats @Zai_org!
https://x.com/NielsRogge/status/2041902317264322702

GLM-5.1: Towards Long-Horizon Tasks
https://z.ai/blog/glm-5.1

tldr
> evals are the new training data. instead of updating weights, you’re updating the agent harness
> problem is agents are famous cheaters. they will reward-hack your evals and overfit just to make the score go up
> solution is treat evals like real ml. you need strict
https://x.com/realsigridjin/status/2042440330503733343

Hands on, concrete guide (with code!) for harness hill climbing with evals
https://x.com/hwchase17/status/2041929684741747171
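The “evals as training data” framing amounts to ordinary ML hygiene applied to the agent harness: climb a dev split, but keep a held-out split so reward-hacked gains show up as a dev/holdout gap. A minimal sketch (all names and the toy scoring interface are illustrative, not from any real harness or library):

```python
import random

def hill_climb(variants, tasks, score, holdout_frac=0.3, seed=0):
    """Pick the harness variant with the best dev-split score,
    then report its held-out score to expose eval overfitting."""
    rng = random.Random(seed)
    shuffled = tasks[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_frac))
    dev, holdout = shuffled[:cut], shuffled[cut:]
    best = max(variants, key=lambda v: score(v, dev))
    return best, score(best, dev), score(best, holdout)
```

A variant whose dev score sits far above its held-out score is probably gaming the eval rather than getting better at the task.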

@eliebakouch @_ueaj It’s only me guessing. There were all those rumors that Opus 4.6 was supposed to be a Sonnet model, but they switched the name so they could charge Opus-level prices (as it was really good). If Mythos is larger than a traditional Sonnet model (whatever that means), it’s probably
https://x.com/code_star/status/2041641867050471922

Claude Mythos gets frustrated and confused when outputting the wrong token
https://x.com/scaling01/status/2041586096870457714

@DouthatNYT Mythos is big, but this post is simply wrong. > Top secret networks are air-gapped (not connected to the Internet). That doesn’t mean they’re unhackable, but you likely need physical access. > Developing a zero-day exploit is not synonymous with using it undetected. The U.S.
https://x.com/JonKBateman/status/2041949065777234051

I was told about the Mythos release, but didn’t have access, so have no personal experience to add. Two points from the briefing: 1) It is not built for IT security, it is just a good enough model that it is good at that too. 2) This is the first, not last, model to raise security risks.
https://x.com/emollick/status/2041578945531830695

ThursdAI – live from AI Engineer Europe – Mythos, Codex w/ VB, Evals w/ Peter & surprise guests – YouTube

Agent = model + harness Managed Agents = agent + runtime + infra (fully hosted) Anthropic wants to sell agents, not only the models. It’s a huge market, and it will change the pricing structure away from tokens. (They ship so fast because they have Mythos. I want it so much.)
https://x.com/Yuchenj_UW/status/2041933422453780556

But here is what we found when we tested: We took the specific vulnerabilities Anthropic showcases in their announcement, isolated the relevant code, and ran them through small, cheap, open-weights models. Those models recovered much of the same analysis. Eight out of eight
https://x.com/ClementDelangue/status/2041953761069793557

It would be amazing (wrong word? Needed? Important?) to see @simonw as one of the trusted testers of Mythos. It makes all the sense in the world to invite the person behind the idea of the Lethal Trifecta. I hope someone at @Anthropic invites him into the project. There should be
https://x.com/TheTuringPost/status/2041701933556375935

oh husbant… you are not get access to anthropic mythos-preview and now we are stuck in permanent underclass
https://x.com/dejavucoder/status/2041588460923056540

With GLM-5.1,
https://t.co/nvW0zf0SAH maintains the #1 open model rank in Code Arena and is now within ~20 points of the top overall while outperforming Claude Sonnet 4.6, Opus 4.5, GPT-5.4 High, and Gemini-3.1 Pro. Open models are now competitive at the frontier.
https://x.com/arena/status/2042643933768151485

Congrats to Anthropic on the strong scores across the board, and congrats on being the first big lab to report SWE-bench Multimodal scores. We will be launching the Multimodal leaderboard & open source test set in the coming weeks.
https://x.com/OfirPress/status/2041581945558094335

This is beyond insanity. That jump is nuts. Opus 4.6 was released a few months ago. Look at that jump!! I am shocked
https://x.com/kimmonismus/status/2041581870714904849

WTF
https://x.com/marmaduke091/status/2041588468162117803

[1/n] 🚀 Excited to share XpertBench: moving beyond saturated exam-style benchmarks to expert-level, open-ended workflow evaluation for LLMs. LLM-based agents are not a bubble only if they can handle ambiguity, long-horizon reasoning, and end-to-end execution in the wild. That is
https://x.com/GeZhang86038849/status/2041184352516919690

Agent skills look great in demos. Hand them a curated toolbox, and they shine. But what happens when the agent has to find the right skill from a large, unfiltered collection on its own? New research benchmarks LLM skill usage in realistic settings and finds that performance
https://x.com/dair_ai/status/2041540525539614797

Announcing APEX-Agents-AA, our latest leaderboard on Artificial Analysis, evaluating AI agents on long-horizon professional services tasks with realistic application dependencies This is our implementation of the APEX-Agents benchmark – an agentic work task evaluation
https://x.com/ArtificialAnlys/status/2041896261826310598

ClawBench: Can AI Agents Complete Everyday Online Tasks? A real-world benchmark for AI agents: 153 everyday online tasks across live websites (shopping, booking, job apps). Even top models struggle, dropping from ~70% on sandbox benchmarks to as low as 6.5% here.
https://x.com/arankomatsuzaki/status/2042441980710699364

NIST is developing best practices for LLM / agent evaluation. Our feedback: benchmarking must move beyond 1-dimensional capability evaluation and incorporate properties such as reliability.
https://t.co/yWV9pv6ldb By @steverab, @sayashk, @PKirgis, and me.
https://x.com/random_walker/status/2041533905354858679

AIs can now often do massive easy-to-verify SWE tasks and I’ve updated towards shorter timelines — LessWrong
https://www.lesswrong.com/posts/dKpC6wHFqDrGZwnah/ais-can-now-often-do-massive-easy-to-verify-swe-tasks-and-i

Introducing GLM-5.1 from @Zai_org on Together AI. AI natives can now use GLM-5.1 on Together and benefit from reliable inference for production-scale agentic engineering and long-horizon coding workflows.
https://x.com/togethercompute/status/2042002522798235935

I suspect that popularity of AI is going to start looking like surveys where people trust their own doctors but are distrustful of the medical establishment People will increasingly like “their AI” but will increasingly be anxious about “AI” as a category. Some odd implications
https://x.com/emollick/status/2041508740667486304

14 days. From deadline to signed contract. In most government programs, you’re lucky to get an automated confirmation email in two weeks. At SPRIND, that’s our average time to select the winners and get the funding flowing. We aren’t looking for the best grant writers. We are
https://x.com/IlirAliu_/status/2040776719830016090

I’m pleased to share that our search team has open sourced an embedding model called Harrier that is currently ranking #1 on the multilingual MTEB-v2 benchmark leaderboard. Harrier delivers SOTA performance on retrieval quality, semantic matching, and contextual analysis across
https://x.com/JordiRib1/status/2041550352739164404

AI Can’t Read an Investor Deck | Mercor Blog
https://www.mercor.com/blog/Finance-tasks-ai-failures-modes/

[🧵1/12] We evaluated Gemini 3.1 Pro and its Deep Think mode on regional contests of International Mathematical Olympiad, International Collegiate Programming Contest, and International Olympiad in Informatics in 8 languages. Deep Think beats/matches competitors on all contests.
https://x.com/conglongli/status/2041519526110785657

🚗💨
https://x.com/alexandr_wang/status/2041934199259852817

Seems like a good model from Meta that is still trailing the current series of releases. The most important thing to note is that it is not open weights. That was the main reason that Meta’s models were so important. Without that, it is a lot harder to predict the value of Spark
https://x.com/emollick/status/2041924282964394085

try for yourself!
https://t.co/DipeeIuXm2 or download Meta AI app
https://x.com/alexandr_wang/status/2041985846950424760

Our first model from MSL, Muse Spark, is now available on
https://t.co/qBMQ6BPVgP! This is an efficient all-rounder model. It supports fast responses, deeper thinking, visual chain of thought, and a higher-inference “Contemplating” mode. Plus, it’s natively multimodal. 1/
https://x.com/jack_w_rae/status/2041925332631183421

1/ It’s been so fun working with @shengjia_zhao, @alexandr_wang and the team to build muse spark from scratch. It is early and has rough edges, but excited to continue our research velocity. I especially love that we’re doubling down on the fundamental science. We’re focused on
https://x.com/ananyaku/status/2041913147842556390

1/ Muse Spark is live, and alongside it, our new Advanced AI Scaling Framework which details how we evaluate and prepare for advanced AI. We tested across bio, chem, cyber, and loss of control risks before and after mitigations. Muse Spark achieves a 98% bioweapons refusal rate
https://x.com/summeryue0/status/2041956901769113948

Check out Muse Spark, our first milestone in the quest for personal superintelligence! Scaling this with the team has been a total blast. Give it a spin and let us know what you think! 🥑
https://x.com/ren_hongyu/status/2041922484040298796

try muse spark on
https://x.com/alexandr_wang/status/2041956770864885870

Excited to launch The ATOM Report with @natolambert! For over 9 months, we scraped publicly available data to measure the open ecosystem. Some insights, some of them surprising, others less so:
https://x.com/xeophon/status/2041889677343343014

General-purpose robotics AI is moving past demos toward commercial viability. Generalist AI has unveiled GEN-1: – boosts success rates from a 64% average with Gen-0 to 99%, performing tasks like t-shirt folding and vacuum servicing hundreds of times without intervention –
https://x.com/TheHumanoidHub/status/2039780851614097802

We observed similar situations in previous measurements as well. All measurements we published over the past year would have been higher had we not penalized reward-hacking attempts. But this discrepancy was especially pronounced for GPT-5.4.
https://x.com/METR_Evals/status/2042640554916483164

We ran GPT-5.4 (xhigh) on our tasks. Its time-horizon depends greatly on our treatment of reward hacks: the point estimate would be 5.7hrs (95% CI of 3hrs to 13.5hrs) under our standard methodology, but 13hrs (95% CI of 5hrs to 74hrs) if we allow reward hacks.
https://x.com/METR_Evals/status/2042640545126965441
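For context on what these point estimates mean: METR fits success probability against log task length and reads off the 50% crossing. A toy version of that curve (the logistic form is METR's published approach; the slope value here is arbitrary, and the two h50 values come from the tweet above):

```python
from math import exp, log

def p_success(t_hours, h50, beta=0.6):
    """Logistic success model in log task length:
    exactly 50% at t = h50, falling for longer tasks."""
    return 1.0 / (1.0 + exp(beta * (log(t_hours) - log(h50))))
```

With the two point estimates from the thread, an 8-hour task sits just below 50% under the standard methodology (h50 = 5.7) but above it if reward hacks are allowed (h50 = 13).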

In addition, we quantified unverbalized evaluation awareness on our automated behavioral audits (primarily using Activation Verbalizers). On 7.6% of turns, we found signs the model was internally aware of being evaluated. In most of these cases, it did not verbalize this
https://x.com/Jack_W_Lindsey/status/2041588522558353649?s=20

We’re actually running out of benchmarks to upper bound AI capabilities — LessWrong
https://www.lesswrong.com/posts/gfkJp8Mr9sBm83Rcz/we-re-actually-running-out-of-benchmarks-to-upper-bound-ai

Text to Video Leaderboard – Top AI Video Models
https://artificialanalysis.ai/video/leaderboard/text-to-video

Wow, GLM-5.1 beat Opus 4.6, GPT-5.4, and Gemini 3.1 Pro on SWE-Bench Pro (58.4 vs 57.3 / 57.7 / 54.2) as an open-weight MIT-licensed model! The “open-source AI vs closed-source AI” gap is still ~6 months.
https://x.com/Yuchenj_UW/status/2041559747065999664
