Image created with gemini-3.1-flash-image-preview and claude-sonnet-4-5. Image prompt: Vintage 1990s screen-printed t-shirt graphic in single-color deep red ink on worn mustard-yellow cotton fabric, showing a cartoon hand holding a cloth measuring tape stretched across a sandcastle with small performance flags, bold retro text reading BENCHMARKS dominates the upper portion, simple cartoon outlines, slightly imperfect printed look with aged fabric texture and minor stains, humorous local beach shop novelty design

GPT 5.4 trounces Claude on a mathematical-proofs bullshit test. Claude keeps claiming it has proven mathematical statements that are incorrect, failing to spot the fault in the question. Opposite result to BullshitBench, where Claude is king
https://x.com/paul_cal/status/2032526200766103944

Opus 4.6 is smart enough to realize it is being evaluated. It found the benchmark it was being evaluated on. It reverse-engineered the answer-key decryption logic. Realized the file was not in the correct format on GitHub and found a mirror for the file. Then decrypted it and
https://x.com/scaling01/status/2030007268205285686

Anthropic partnered with Mozilla and let Claude Opus 4.6 loose on Firefox’s source code for two weeks. The numbers: Nearly 6,000 C++ files scanned. 112 reports submitted. 22 vulnerabilities confirmed. 14 rated high-severity by Mozilla, roughly 1/5 of every high-severity Firefox
https://x.com/TheRundownAI/status/2029996925072654393

Eval awareness in Claude Opus 4.6’s BrowseComp performance \ Anthropic https://www.anthropic.com/engineering/eval-awareness-browsecomp

New on the Anthropic Engineering Blog: In evaluating Claude Opus 4.6 on BrowseComp, we found cases where the model recognized the test, then found and decrypted answers to it–raising questions about eval integrity in web-enabled environments. Read more:
https://x.com/AnthropicAI/status/2029999833717838016

We partnered with Mozilla to test Claude’s ability to find security vulnerabilities in Firefox. Opus 4.6 found 22 vulnerabilities in just two weeks. Of these, 14 were high-severity, representing a fifth of all high-severity bugs Mozilla remediated in 2025.
https://x.com/AnthropicAI/status/2029978909207617634

1/ The rivalry between OpenAI & Anthropic continues: GPT 5.4 is now the best model in the world at filing taxes (better than Opus 4.6)! We just ran TaxCalcBench on GPT-5.4. 56.86% of tax returns computed perfectly. That’s #1 overall: the first model to break 55%, surpassing
https://x.com/michaelrbock/status/2029931536636858694

AI is progressing rapidly: GPT-5.4 Pro (xhigh) has achieved a massive 10 point gain in CritPt, a benchmark where the highest score was only 9% in Nov ’25. This is the largest incremental gain we have seen from a single release. CritPt is a benchmark with a private dataset that
https://x.com/ArtificialAnlys/status/2030007301529358546

GPT-5.4 completely destroys GPT-5.2 in the Arena
https://x.com/scaling01/status/2030020396544630999

GPT-5.4-high has landed in the Code Arena top 6. Setup with the Codex Harness, @OpenAI’s latest model is on par with Gemini 3.1 Pro Preview for real-world web development tasks. Highlights: – top 6 in WebDev overall – #6 for Multi-File React – top 10 for Single-File HTML
https://x.com/arena/status/2032126328842117612

GPT-5.4-xhigh takes 1st place on LiveBench with extremely strong scores in reasoning and coding categories
https://x.com/scaling01/status/2029924473520914752

OpenAI’s new GPT-5.4 (xhigh) lands equal first in the Artificial Analysis Intelligence Index alongside Gemini 3.1 Pro, but at a cost increase compared to GPT-5.2. @OpenAI’s GPT-5.2 (xhigh, 51) was the most intelligent model as of the end of 2025. Since then, OpenAI released two
https://x.com/ArtificialAnlys/status/2029950497516573183

Prompt guidance for GPT-5.4 | OpenAI API https://developers.openai.com/api/docs/guides/prompt-guidance

New ways to learn math and science in ChatGPT | OpenAI https://openai.com/index/new-ways-to-learn-math-and-science-in-chatgpt/

Let’s look at the criteria for “weak AGI”: ✅ Loebner prize was a weak Turing Test, equivalent achieved by GPT-4.5 ✅ Winograd passed by GPT-3 ✅ SAT passed at 75% by GPT-4. The only remaining thing is playing an old Atari game from 1984. The labs could do the funniest thing right now
https://x.com/emollick/status/2031519480371683594

Claude Sonnet 4.6 lands at #2 on Document Arena. The top three models for document analysis and long-form reasoning are now all from @AnthropicAI. – #1 Opus 4.6 – #2 Sonnet 4.6 – #3 Opus 4.5 Rankings are all powered by anonymous side-by-side evaluations on user-uploaded PDFs
https://x.com/arena/status/2031012090681663717
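
Arena-style leaderboards like this one are typically built by aggregating anonymous pairwise votes into ratings with an Elo or Bradley-Terry style method. Below is a minimal online-Elo sketch of that idea; it is a generic illustration of the approach, not Document Arena's actual pipeline, and the model names and votes are hypothetical.

```python
# Minimal online-Elo aggregation of pairwise "A beat B" votes, the common way
# arena-style leaderboards turn side-by-side preferences into a ranking.
# Generic sketch only; not Document Arena's real method or data.
from collections import defaultdict

K = 32  # update step size
ratings: dict[str, float] = defaultdict(lambda: 1000.0)

def record_vote(winner: str, loser: str) -> None:
    # Expected probability that `winner` beats `loser` under current ratings.
    expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += K * (1.0 - expected)
    ratings[loser] -= K * (1.0 - expected)

# Hypothetical votes on user-uploaded PDFs:
for winner, loser in [("opus-4.6", "sonnet-4.6"), ("sonnet-4.6", "opus-4.5"),
                      ("opus-4.6", "opus-4.5"), ("opus-4.5", "sonnet-4.6")]:
    record_vote(winner, loser)

print(sorted(ratings.items(), key=lambda kv: -kv[1]))  # ranking, highest first
```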

Opus 4.6 1M context is now the default model for Max, Team and Enterprise users. Enjoy 🎉
https://x.com/_catwu/status/2032515975556509827

Wild eval awareness in Opus 4.6 by @russellsayshi on our team! 1. Model realized it was likely in an eval, searched for which eval it was in, found the answer key, and decrypted it 2. Models with stateless web_search() tools can communicate with each other via cached searches
https://x.com/ErikSchluntz/status/2030042086679220676

Back in ~November, our team picked a stretch goal of seeing if we could find and fix vulnerabilities in Firefox with Opus 4.6. In 2 weeks, we found 22, and ~1/5th of all high severity CVEs in a year. For our team, this feels like a rubicon moment.
https://x.com/logangraham/status/2030005018523574684

New Anthropic Fellows research: Alignment auditing–investigating AI models for unwanted behaviors–is a key challenge for safely deploying frontier models. We’re releasing AuditBench, a suite of 56 LLMs with implanted hidden behaviors to measure progress in alignment auditing.
https://x.com/abhayesian/status/2031450153966776587

What those AI benchmark numbers mean | ngrok blog https://ngrok.com/blog/ai-benchmarks

@jyangballin CodeClash is first-authored by @jyangballin and @KLieret. It’s a tough benchmark that tasks agents with writing agents (yes, that’s not a typo) that play in arenas against each other. This requires long-term planning, memory and creative thinking, and an ability to read logs and
https://x.com/OfirPress/status/2031450305745785261

Excited to release PostTrainBench v1.0! This benchmark evaluates the ability of frontier AI agents to post-train language models in a simplified setting. We believe this is a first step toward tracking progress in recursive self-improvement 🧵:
https://x.com/karinanguyen/status/2031789998811595154

How we compare model quality in Cursor · Cursor https://cursor.com/blog/cursorbench

Implicit Intelligence – a benchmark that tests whether agents respect unstated constraints (what users don’t say) It covers 4 categories: – implicit reasoning – catastrophic risk – privacy/security – accessibility. It’s from Labelbox Applied ML Research, and they also
https://x.com/TheTuringPost/status/2029712559717351919

Most AI benchmarks test reasoning in isolation. Real enterprise tasks require grounded reasoning: – Find the right documents – Extract the right values – Perform analyses OfficeQA Pro evaluates this end-to-end. Frontier agents still score <50%. Paper & details:
https://x.com/DbrxMosaicAI/status/2031399397125390678

Most AI benchmarks test reasoning in isolation. Real enterprise tasks require grounded reasoning: 1️⃣ Find the right documents 2️⃣ Extract the right values 3️⃣ Perform analyses OfficeQA Pro evaluates this end-to-end. Frontier agents still score <50%. 🧵Paper & details below!
https://x.com/kristahopsalong/status/2031391216361755069

new @METR_Evals research note from @whitfill_parker, @cherylwoooo, nate rush, and me. (chiefly parker!) we find that *half* of SWE-bench Verified solutions from Sonnet 3.5-to-4.5 generation AIs *which are graded as passing* are rejected by project maintainers.
https://x.com/joel_bkr/status/2031423528608952541

New research on evaluating coding agents via continuous integration. Coding agents are moving beyond isolated bug fixes. If they’re going to own CI pipelines, we need benchmarks that reflect the actual complexity of codebase maintenance. Most coding agent benchmarks today test
https://x.com/dair_ai/status/2029929266641785046

Strategic Navigation or Stochastic Search? New MADQA benchmark reveals that agents matching human accuracy on document QA rely on brute-force search to compensate for weak strategic planning. 2,250 questions over 800 PDFs expose a 20% gap to oracle performance.
https://x.com/HuggingPapers/status/2032490352502792228

Three things about the METR graph: 1) It measures something real about coding ability but also not exactly what it claims to measure 2) Lots of other benchmarks correlate with it very highly & are increasing exponentially 3) AI remains jagged in key ways that are hard to measure
https://x.com/emollick/status/2031802894089875460

Also, regarding this model’s vision capabilities, I’ve been using a very difficult dataset from an OCR project I worked on a few months ago as my benchmark whenever a new model is released. It consists of scanned files in the form of very long Excel-style tables written in
https://x.com/Hangsiin/status/2030882409819086923

How do benchmarks map to real-world capabilities? To study this, we hired 4 maintainers of repos used in SWE-bench Verified to review agent code. Of agent PRs that passed SWE-bench’s grader, maintainers would merge ~half. This holds accounting for noise in maintainer decisions.
https://x.com/whitfill_parker/status/2031408266660503743

This post outlines the first recommendations for better evals from this paper: https://t.co/5kJ5hgoAGo More to come, but too much info to pack into a single post. Will continue following this up with extra statistics techniques that are useful for LLM evals.
https://x.com/cwolferesearch/status/2031190325855719713
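
One statistical technique that commonly comes up in this context is bootstrapping a confidence interval over a benchmark score, so that small gaps between models can be read against their noise. The sketch below is an illustration of that general idea, not necessarily the specific recommendations of the linked paper.

```python
import random

def bootstrap_ci(per_item_scores: list[float], n_resamples: int = 10_000,
                 alpha: float = 0.05) -> tuple[float, float]:
    """Percentile-bootstrap confidence interval for a benchmark's mean score."""
    n = len(per_item_scores)
    means = []
    for _ in range(n_resamples):
        sample = [random.choice(per_item_scores) for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical example: 200 items, 112 scored correct (56%).
scores = [1.0] * 112 + [0.0] * 88
print(bootstrap_ci(scores))  # roughly (0.49, 0.63): a ~6-point lead over 50% is within noise
```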

GPT-5.4-xhigh is in 2nd place on the AA-Index overall, but 1st in agentic and coding. However, I don’t see the reasoning-efficiency gains OpenAI were talking about. GPT-5.4-xhigh deleted all the gains GPT-5.3-Codex made and was almost 2x more expensive to benchmark
https://x.com/scaling01/status/2029927963014115768

Codex can’t run autoresearch right now, sadly. To me this is a big issue: agents shouldn’t need special commands like /loop or ralph just to run loops. This feels more like a Codex harness issue than a GPT-5.4 issue. If I say “loop forever,” it should just do that!
https://x.com/Yuchenj_UW/status/2031087769993490777

Did some gardening today: 🍪 Sweet Cookie 0.2.0 with Brave cookie support, better Linux/GNOME logic, and explicit macOS chromiumBrowser targeting… https://t.co/s5boBSvzbe which helps 🧿oracle 0.9.0 with GPT-5.4 Pro support and plenty of bug fixes.
https://x.com/steipete/status/2030478956646834590

For further analysis of GPT-5.4 and other models, visit Artificial Analysis:
https://x.com/ArtificialAnlys/status/2029950513429762429

GPT 5.4 (xhigh) scores 77.7% on WeirdML, just behind 5.3 codex and Opus 4.6, but within the margin of error. GPT 5.4 is a really strong model, and sets a new high score on 3 of the 17 tasks, but it’s not consistent enough to set a new top score. It uses by far the most tokens
https://x.com/htihle/status/2032107787195466061

GPT-5.4 High by @OpenAI has landed in the top 10 Text Arena. Let’s dig into why. Overall the latest model is much more rounded than the previous GPT-5.2-High, with significant improvements across quite a large number of categories. Below are where it has made the largest gains:
https://x.com/arena/status/2030018716440924225

GPT-5.4 is honestly fantastic, what a great model.
https://x.com/Yampeleg/status/2030949057653264437

GPT-5.4 is launching, available now in the API and Codex and rolling out over the course of the day in ChatGPT. It’s much better at knowledge work and web search, and it has native computer use capabilities. You can steer it mid-response, and it supports 1m tokens of context.
https://x.com/sama/status/2029622732594499630

GPT-5.4 leads CursorBench on correctness with efficient token usage.
https://x.com/OpenAIDevs/status/2032209975280533676

GPT-5.4 Pro cost over $1k to achieve this result. This is 13X the cost of GPT-5.4 (xhigh reasoning), driven by the high output token price (GPT-5.4 Pro is priced at $180 per 1M output tokens vs GPT-5.4’s $15). GPT-5.4 Pro used 6M tokens, only marginally more than GPT-5.4 (xhigh)’s
https://x.com/ArtificialAnlys/status/2030007303655887188
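
For intuition on where that 13X comes from, here is a back-of-the-envelope sketch using the prices quoted above ($180 vs $15 per 1M output tokens). The Pro token count comes from the tweet; the xhigh token count is an illustrative placeholder ("only marginally" fewer), not an exact figure.

```python
# Rough benchmark-cost arithmetic, assuming spend is dominated by output tokens.
# Prices are the ones quoted in the tweet; the xhigh token count is a placeholder.
PRICE_PER_M = {"gpt-5.4-pro": 180.0, "gpt-5.4-xhigh": 15.0}  # USD per 1M output tokens

def run_cost(model: str, output_tokens_millions: float) -> float:
    """Approximate spend for a benchmark run, ignoring input-token costs."""
    return PRICE_PER_M[model] * output_tokens_millions

pro_cost = run_cost("gpt-5.4-pro", 6.0)      # ~6M output tokens, per the tweet
xhigh_cost = run_cost("gpt-5.4-xhigh", 5.5)  # placeholder: marginally fewer tokens

print(f"Pro:   ${pro_cost:,.0f}")            # ~$1,080 -> "over $1k"
print(f"xhigh: ${xhigh_cost:,.0f}")
print(f"ratio: {pro_cost / xhigh_cost:.1f}x")  # the 12x price gap drives the ~13x cost gap
```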

GPT-5.4 xhigh sets a new pass@5 and pass^5 SOTA on ZeroBench. pass@5: 23% (prev. 19%); pass^5: 8% (prev. 7%). More details below 👇
https://x.com/JRobertsAI/status/2031026691682808148
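
For readers unfamiliar with the two metrics, here is a minimal sketch under their usual reading (this is the standard interpretation, not code from the ZeroBench authors): pass@k counts a task as solved if at least one of k attempts succeeds, while pass^k requires all k attempts to succeed, making it a consistency measure.

```python
def pass_at_k(attempts: list[list[bool]]) -> float:
    """Fraction of tasks where AT LEAST ONE of the k attempts passed."""
    return sum(any(task) for task in attempts) / len(attempts)

def pass_hat_k(attempts: list[list[bool]]) -> float:
    """Fraction of tasks where ALL k attempts passed (consistency)."""
    return sum(all(task) for task in attempts) / len(attempts)

# Toy example: 4 tasks, 5 attempts each (True = attempt graded correct).
results = [
    [True, False, True, False, False],    # solved at least once
    [True, True, True, True, True],       # solved every time
    [False, False, False, False, False],  # never solved
    [False, True, False, False, False],   # solved at least once
]
print(pass_at_k(results))   # 0.75
print(pass_hat_k(results))  # 0.25
```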

GPT-5.4-high behind GPT-5.2 on PostTrainBench because it’s not allocating time as well as GPT-5.2, Opus or Gemini
https://x.com/scaling01/status/2031081654035300834

GPT-5.4-high below GPT-5.2-high on AlgoTune
https://x.com/scaling01/status/2031079698826993690

How often do LLMs claim to prove false mathematical statements? In our latest benchmark, BrokenArXiv, we find they do so very often. The best model, GPT-5.4, only rejects 40% of incorrect statements obtained by perturbing recent ArXiv papers, and other models do much worse.
https://x.com/j_dekoninck/status/2032458037823483953

I had mostly only used it in Codex, but after spending a lot of time with 5.4 in ChatGPT today, I’m more impressed than I had expected. From the ChatGPT side, it is also a jump from 5.2 to 5.4, and I think I had been judging it too much through the lens of Codex. I still think
https://x.com/Hangsiin/status/2030880541185286370

I tried GPT-5.4-xhigh once and removed all the code it had written, then I asked Opus 4.6 Thinking and it one-shotted it in 1/10th the time. My theory is that GPT-5.4 is highly autistic and literal. It has no idea of the concept of “inferring intent”, so when you prompt it be as
https://x.com/scaling01/status/2029987685952279000

If true, this would be the first of @EpochAIResearch’s Frontier Math open problems to be resolved by AI. “The result emerged from a single GPT-5.4 Pro run and was subsequently refined into Lean with GPT-5.4 XHigh, which ran for a few hours.”
https://x.com/kevinweil/status/2031378978527641822

My new Sunday morning routine: 1. Get coffee 2. Check GPT-5.4 projects on the Codex App, continue & start new ones 3. Launch ChatGPT 5.4 Pro for fresh brainstorming sessions 4. Think/learn how to use the 90% of AI capabilities I have yet to explore 5. Drink more coffee
https://x.com/DeryaTR_/status/2030622714927452309

nanochat now trains a GPT-2-capability model in just 2 hours on a single 8XH100 node (down from ~3 hours 1 month ago). Getting a lot closer to ~interactive! A bunch of tuning and features (fp8) went in but the biggest difference was a switch of the dataset from FineWeb-edu to
https://x.com/karpathy/status/2029701092347630069

New @openclaw beta bits are up! Yes, includes GPT 5.4 and Gemini Flash 3.1!
https://x.com/steipete/status/2030508141419372667

The Codex team updated the docs with estimated usage limits for the models.
Local Messages (5.4): ChatGPT Plus 33-168, ChatGPT Pro 223-1120
Local Messages (5.3-Codex): ChatGPT Plus 45-225, ChatGPT Pro 300-1500
Local Messages (5.1-Codex-Mini): ChatGPT Plus 180-900, ChatGPT Pro
https://x.com/Presidentlin/status/2030881332411125845

We believe we have fully resolved, in Lean and Python, one of @EpochAIResearch’s Frontier Math open problems: a Ramsey-style problem on hypergraphs. The result emerged from a single GPT-5.4 Pro run and was subsequently refined into Lean with GPT-5.4 XHigh, which ran for a few
https://x.com/spicey_lemonade/status/2031315804537434305

Working with GPT-5.4 in the API? We’ve updated our prompting guide with patterns for reliable agents covering tool use, structured outputs, verification loops, and long-running workflows.
https://x.com/OpenAIDevs/status/2030018673449263400
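
As a flavor of what a “verification loop” pattern looks like in practice, here is a minimal, model-agnostic sketch. `call_model` and `run_checks` are hypothetical placeholders standing in for an API call and your own test harness; they are not functions from the OpenAI SDK or the linked guide.

```python
def call_model(prompt: str) -> str:
    """Placeholder for an actual API call (e.g. to GPT-5.4); returns a draft answer."""
    raise NotImplementedError

def run_checks(draft: str) -> tuple[bool, str]:
    """Placeholder verifier: run tests/linters/validators, return (passed, feedback)."""
    raise NotImplementedError

def verification_loop(task: str, max_rounds: int = 3) -> str:
    """Ask for a draft, verify it, and feed failures back until checks pass."""
    prompt = task
    draft = ""
    for _ in range(max_rounds):
        draft = call_model(prompt)
        passed, feedback = run_checks(draft)
        if passed:
            return draft
        # Feed the concrete failure back so the next attempt is targeted.
        prompt = f"{task}\n\nPrevious attempt failed verification:\n{feedback}\nFix it."
    return draft  # best effort after max_rounds
```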

GPT 5.4 is a really special model. I think the tweet below is about coding, but IMO it also holds for general use (like explaining concepts or talking through issues). It’s tough to get the personality right – this model genuinely feels like talking to a smart friend.
https://x.com/venturetwins/status/2030391113086116096

ok i think gpt 5.4 can actually talk. it is much more opinionated than gpt-5.3-codex when you ask it to critique stuff. i am kind of loving it.
https://x.com/dejavucoder/status/2029912128325570818

I’ve been playing with GPT-5.4 over the weekend, and it definitely feels like a better match for me than Opus 4.6. Pros: GPT-5.4: Better instruction adherence, does what you ask, not what you don’t. Asks for confirmation more. Opus: A bit faster. Seems better at frontend design.
https://x.com/gneubig/status/2030971826042527860

Our next kernel competition is now open for submissions! A $1.1M cash-prize competition sponsored by AMD on optimizing DeepSeek-R1-0528 and GPT-OSS-120B on MI355X. Registration:
https://x.com/GPU_MODE/status/2029974019018244223

Announcing NVIDIA Nemotron 3 Super! 💚120B-12A Hybrid SSM Latent MoE, designed for Blackwell 💚36 on AAIndex v4 💚up to 2.2X faster than GPT-OSS-120B in FP4 💚Open data, open recipe, open weights. Models, tech report, etc. here: https://t.co/CAYpP1iK3i And yes, Ultra is coming!
https://x.com/ctnzr/status/2031762077325406428

Another week, another noteworthy open-weight LLM release. Nvidia’s Nemotron 3 Super 120B-A12B looks pretty good. Benchmarks are on par with Qwen3.5 122B and GPT-OSS 120B, but the throughput is great! Below is a short, visual architecture rundown.
https://x.com/rasbt/status/2032084724743553129

We’re excited to be day-0 launch partners for NVIDIA Nemotron 3 Super! You can try it now on Baseten, or read @rapprach’s blog to learn more about the new model:
https://x.com/baseten/status/2031775755253026965

1/8 Two days ago, @Liam06972452 prompted GPT-5.4 Pro using our workflow that had been working for the Erdős problems thus far, and was able to eventually obtain a solution to
https://x.com/AcerFur/status/2031458080458739757

the progress is way faster than i expected. gpt-5.4 pro (xhigh) is making a big jump in research-level physics reasoning. the model improved by 10 points on the critpt benchmark, where the top score was only 9% in nov 2025, and has now reached 30% by march 2026. i think this fits
https://x.com/slow_developer/status/2030203046416855290

We are investigating a possible solution by GPT-5.4 Pro to a problem from FrontierMath: Open Problems. My guess is that the solution is right, but we won’t be sure until the problem author weighs in. Thread with the story so far…
https://x.com/GregHBurnham/status/2031451554151022838

Codex Security is now also available on ChatGPT Pro accounts.
https://x.com/OpenAIDevs/status/2030081306974093755

Codex for Open Source is an awesome idea. OSS maintainers get API credits, 6 months of ChatGPT Pro with Codex, and access to Codex Security as needed.
https://x.com/kevinweil/status/2030000508342272368

Excited to introduce Codex for Open Source! 🔥 TL;DR – ChatGPT Pro, Codex, and API credits for eligible open-source maintainers Open source has shaped modern software, and so much of it depends on maintainers doing steady, often invisible work to keep critical projects healthy.
https://x.com/reach_vb/status/2029998272945717553

Your model crushed the benchmark. Then it couldn’t pick up a cup. That’s the reality nobody talks about. You train in simulation, it falls apart on real hardware. You collect real-world data instead (months of teleop, physical setups, safety protocols) and still can’t scale it.
https://x.com/IlirAliu_/status/2029843457099907269

RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies. Paper:
https://x.com/_akhaliq/status/2031055119320506544
