Image created with gemini-3.1-flash-image-preview and claude-sonnet-4-5. Image prompt: A white rectangular scorecard with perfect 100% scores and gold stars falls joyfully through a bright blue sky with arms spread wide and a happy smiley face, the word BENCHMARKS in bold black typography across its top, ground visible far below, high altitude aerial photography, crisp daylight, clean composition, wide angle shot.

GPT-5.4 set a new record on FrontierMath, our benchmark of extremely challenging math problems! We had pre-release access to evaluate the model. On Tiers 1-3, GPT-5.4 Pro scored 50%. On Tier 4 it scored 38%. See thread for commentary and additional experiments. https://x.com/EpochAIResearch/status/2029626255776395425

📊 How to evaluate skills❓️ Lots of companies are building skills for coding agents. But how do you know if your skill is actually working? It’s tempting to go by vibes, but performance varies a lot across tasks — and coding agents have a huge action space, which makes that… https://x.com/LangChain/status/2029618086374944771

Agent reliability being a cross-functional problem is the most underrated ops shift right now. You can’t engineer your way out of bad eval criteria — PMs and domain experts have to own their part. https://x.com/saen_dev/status/2028411962712088767

Agent skills are powerful but they are often AI-generated and not tested. Here is a practical guide to evaluating agent skills with code, prompts, and real results. 📋 Define success criteria (outcome, style, and efficiency). 🧪 Create 10-12 prompts with deterministic checks. 🤖 … https://x.com/_philschmid/status/2029570052530360719
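The deterministic-check recipe in that thread can be sketched in a few lines. This is an illustrative sketch only, not code from the guide: the check functions, the stand-in agent, and the example task are all hypothetical.

```python
# Sketch: grade an agent's outputs with deterministic checks instead of vibes.
# All names here are illustrative placeholders, not the guide's actual code.

def check_outcome(output: str) -> bool:
    # Outcome check: did the agent produce the required artifact?
    return "def " in output  # e.g. the task asked for a Python function

def check_style(output: str) -> bool:
    # Style check: does the output follow the requested conventions?
    return '"""' in output  # e.g. a docstring was required

def run_eval(prompts, agent, checks):
    # Run every prompt through the agent and record per-check pass/fail.
    results = []
    for prompt in prompts:
        output = agent(prompt)
        results.append({name: fn(output) for name, fn in checks.items()})
    passed = sum(all(r.values()) for r in results)
    return passed / len(results)

# A trivial stand-in "agent" for demonstration purposes.
fake_agent = lambda p: 'def add(a, b):\n    """Add two numbers."""\n    return a + b'

score = run_eval(
    ["Write an add function with a docstring"],
    fake_agent,
    {"outcome": check_outcome, "style": check_style},
)
```

The point of making each check deterministic is that a regression in the skill shows up as a hard pass-rate drop rather than a subjective impression.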

Agents, for real work. The latest @code release gives you better agent orchestration, extensibility, and continuity. Here’s what’s new: 🪝 Hooks support 🎯 Message steering and queueing 🌐 Agentic integrated browser 🧠 Shared memory And more… https://x.com/code/status/2029279963778515372

AI agents are tackling more and more “human work.” But are they benchmarked on the work people actually do? tl;dr: Not really. Most benchmarks focus on math & coding, while most human labor and capital lie elsewhere. 📒 We built a database linking agent benchmarks & real-world… https://x.com/ZhiruoW/status/2028847081507488011

Can AI agents agree? Communication is one of the biggest challenges in multi-agent systems. New research tests LLM-based agents on Byzantine consensus games, scenarios where agents must agree on a value even when some participants behave adversarially. The main finding: valid… https://x.com/omarsar0/status/2028823724196343923
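As a toy illustration of the setting (not the paper's experimental setup), the simplest fault-tolerant rule is majority voting over broadcast values: with a large enough honest majority, adversarial reports cannot flip the outcome. All names below are hypothetical.

```python
import random

# Toy sketch: honest agents broadcast their value; Byzantine agents
# report arbitrary values. Everyone adopts the majority.
def consensus_round(honest_values, n_byzantine, seed=0):
    rng = random.Random(seed)
    reports = list(honest_values) + [rng.choice([0, 1]) for _ in range(n_byzantine)]
    ones = sum(reports)
    return 1 if ones * 2 > len(reports) else 0

# Seven honest agents agreeing on 1 outvote two adversaries no matter
# what the adversaries report.
decision = consensus_round([1] * 7, n_byzantine=2)
```

The classical Byzantine fault-tolerance bound (agreement requires more than two-thirds honest participants in the general asynchronous setting) is exactly the kind of constraint these agent games probe.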

Clerk Skills for AI Agents https://clerk.com/changelog/2026-01-29-clerk-skills?dub_id=AlTGRISXA0vckDDY

Introducing SWE-Atlas. We built SWE-Atlas as the next evolution of SWE-Bench Pro, expanding agent evaluation beyond change accuracy to better reflect the real, interactive workflows that define software development. Results for Codebase QnA, the first eval under SWE-Atlas that… https://x.com/scale_AI/status/2029244660905095359

Last week, we did an internal deep dive into enterprise environments/benchmarks like τ²-Bench and CoreCraft. This type of high-fidelity RL env is becoming increasingly popular as frontier labs push their models into more and more agentic capabilities. https://x.com/Shahules786/status/2029603934944235943

Long-running agents accumulate context while model memory stays fixed. This leads to a tradeoff: either discard older information or compress it. New work by @charles0neill explores repeated KV-cache compression for persistent agents using Attention Matching. Our research shows… https://x.com/basetenco/status/2029654320971665651
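As a generic illustration of the discard-or-compress tradeoff (this is not Baseten's Attention Matching method, just a common baseline eviction policy), one simple approach keeps only the cache entries that have received the most accumulated attention mass:

```python
import numpy as np

def compress_kv(keys, values, attn_mass, keep):
    """Keep the `keep` cache entries with the largest accumulated
    attention mass and discard the rest. A generic eviction sketch."""
    idx = np.argsort(attn_mass)[-keep:]  # indices of the top-`keep` entries
    idx.sort()                           # preserve positional order
    return keys[idx], values[idx]

# Hypothetical example: a 100-entry cache with 8-dim keys/values,
# compressed down to 32 entries.
rng = np.random.default_rng(0)
K = rng.normal(size=(100, 8))
V = rng.normal(size=(100, 8))
mass = rng.random(100)
K2, V2 = compress_kv(K, V, mass, keep=32)
```

Methods like the one in the tweet aim to do better than this kind of hard eviction, since repeatedly dropping low-attention entries compounds information loss over a long-running session.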

Today, we’re sharing 🌁 Knowledge Agents from Reinforcement Learning (KARL) 🌁 We trained an agent that excels on challenging grounded reasoning tasks. KARL matches Sonnet 4.5 quality at a fraction of the cost, and with test-time scaling reaches Opus 4.6 levels. This was a fun… https://x.com/mrdrozdov/status/2029580506698850692

ByteDance just published something I’ve been waiting for someone to build: CUDA Agent! It trained a model that writes fast CUDA kernels. Not just correct ones — actually optimized ones. It beats torch.compile by 2× on simple/medium kernels, ~92% on complex ones, and even… https://x.com/BoWang87/status/2028599174992949508

Beyond the flashiness, what’s exciting about this is that products you create with Perplexity Computer don’t require you to manage your own API keys, unlike other agent frameworks. Everything will be run on a secure sandbox that we orchestrate end to end. The stateful abstracted… https://x.com/AravSrinivas/status/2028903680616087946

BullshitBench v2 is out! It is one of the few benchmarks where models are generally not getting better (except Claude) and where reasoning isn’t helping. What’s new: 100 new questions, by domain (coding (40 Q’s), medical (15), legal (15), finance (15), physics (15)), 70+ model… https://x.com/petergostev/status/2028492834693677377

We added Claude-Opus-4.6 to MathArena! It is a strong model, only second to Gemini-3.1-Pro on most benchmarks. One exception: it scores quite poorly in visual mathematics. Also, it is expensive: we spent around USD 8,000 to add the model, 10x any other model we ever evaluated. https://x.com/j_dekoninck/status/2029160582687985727

The Document Arena is now live with leaderboard scores! See which frontier AI models rank highest in document reasoning, all powered by side-by-side evaluations on user-uploaded PDFs from real work use cases. – #1 is Claude Opus 4.6 scoring 1525, +51 pts in the lead – While… https://x.com/arena/status/2028915403704156581

I looked into how Claude Code and Codex compare to the default scaffolds METR uses for time horizon measurements. It looks like they don’t significantly outperform our default scaffolds on any models we’ve tried them on so far. https://x.com/nikolaj2030/status/2022398669337825737

Building community trust through open science is core to Arena. That’s why the Arena leaderboard runs on Arena-Rank, our open-source Python package for transparent ranking. With it, anyone can construct statistically grounded, reproducible leaderboards using pairwise comparison… https://x.com/arena/status/2027528061508587728
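Ranking models from pairwise comparisons of the kind Arena-Rank consumes is classically done with a Bradley-Terry model. The sketch below is a generic minorization-maximization fit, not Arena-Rank's actual API; the model names and data are hypothetical.

```python
from collections import defaultdict

def bradley_terry(comparisons, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs via the
    classic MM update. Returns strengths normalized to sum to 1."""
    wins = defaultdict(float)          # total wins per model
    pair_counts = defaultdict(float)   # games played per unordered pair
    models = set()
    for w, l in comparisons:
        wins[w] += 1
        pair_counts[frozenset((w, l))] += 1
        models.update((w, l))
    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new_p = {}
        for m in models:
            # MM denominator: sum over opponents of n_games / (p_m + p_opp)
            denom = 0.0
            for pair, n in pair_counts.items():
                if m in pair:
                    other = next(x for x in pair if x != m)
                    denom += n / (p[m] + p[other])
            new_p[m] = wins[m] / denom if denom else p[m]
        total = sum(new_p.values())
        p = {m: v / total for m, v in new_p.items()}
    return p

# Hypothetical data: model A beats model B in 3 of 4 head-to-head votes.
ratings = bradley_terry([("A", "B")] * 3 + [("B", "A")])
```

The appeal of fitting an explicit model rather than counting raw win rates is that it transfers strength transitively across opponents, which is what makes a leaderboard from sparse pairwise data statistically grounded.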

He’s back with an improved “BullshitBench V2.” Anthropic models are still dominating everything. https://x.com/scaling01/status/2028494129710133725

Honestly, there should be a standard that any model release with benchmark scores should also release the prompts/trajectory. It’s easier for people to build on top of these models since we won’t have to keep worrying if the eval harness is the problem or not. https://x.com/nrehiew_/status/2029558608393109769

I wish more research teams did this. I remember some time ago we couldn’t repro the Llama 3 scores on MATH because the 1B model was terrible at producing \boxed{} with a vanilla CoT prompt. It turned out you need a detailed system prompt that was not present in any tech report… https://x.com/_lewtun/status/2029571193624306016

Must-read AI research of the week: ▪️ Doc-to-LoRA ▪️ Does Your Reasoning Model Implicitly Know When to Stop Thinking? ▪️ ARLArena ▪️ Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization ▪️ On Data Engineering for Scaling LLM Terminal Capabilities ▪️ … https://x.com/TheTuringPost/status/2028777919057949106

We’re close to saturating WeirdML v2; wouldn’t surprise me to see 5.4 going beyond 86%. https://x.com/teortaxesTex/status/2028444160517144683

What a great illustration of the central problem of AI benchmarking for real work. All of the effort is going into benchmarking for coding, but that is a small part of the actual jobs people do, which leaves the true trajectory of AI progress less clear. https://x.com/emollick/status/2028870529906622677

Top 10 Open Models: February 2026 in Code Arena. In the Code Arena, currently 46 different agentic coding models are on the leaderboard, and only 18 are open source, produced by 7 different labs. Here’s how the labs stack up this month: – GLM-5 scoring 1451, ranking… https://x.com/arena/status/2027540296276607105

You can complain about Europe. Or you can apply for €125M to build the next frontier AI lab. They’ll come from 10 teams funded with €125M. Non-dilutive. 24 months. Zero equity taken. On March 19, 2026, Next Frontier AI comes to Paris. SPRIND is launching a €125M Challenge… https://x.com/IlirAliu_/status/2027097090619220083

GPT-5.4 scores 83% on GDPval. https://x.com/scaling01/status/2029618924375965992

GPT 5.3 Codex (xhigh) scores 79.3% and takes the lead on WeirdML, just ahead of Opus 4.6 (77.9%) at less than half the price. It is very solid across the board, but I still feel the peak performance of Gemini 3.1 is stronger. https://x.com/htihle/status/2028441018865955244

BullshitBench v2, created by Peter Gostev, is a benchmark that does something refreshingly different: it tests whether AI models can detect and reject nonsensical prompts instead of confidently rolling with them. Only Anthropic’s Claude models and Alibaba’s Qwen 3.5 score… https://x.com/kimmonismus/status/2029230388028358726

Top 10 Open Models: February 2026 in Text Arena. The top 3 labs have not changed since January, but the scores have gotten tighter between them: – @Zai_org’s GLM-5, scoring 1455 – @Alibaba_Qwen’s Qwen-3.5 397B A17B, scoring 1454 – @Kimi_Moonshot’s Kimi-K2.5 Thinking, 1452 The… https://x.com/arena/status/2027511779417592173
