Image created with gemini-3.1-flash-image-preview, prompted via claude-sonnet-4-5. Image prompt: Using the provided reference image, preserve the square faceted perfume bottle with warm amber liquid, crystal stopper, pure white background, soft shadow, and glass refractions exactly as shown. Replace the label text with ‘Benchmarks’ in the same clean black serif font style. Add a delicate sterling silver chain draped naturally around the bottle neck with a small dainty micrometer caliper pendant in a high-fashion jewelry aesthetic: refined, miniature, precise like a Tiffany charm.

🚀 Imagine running Claude 4.6 Opus-level reasoning… but entirely on your own GPU with just 16GB VRAM. This 27B Qwen3.5 variant, distilled on Claude 4.6 Opus reasoning traces, delivers frontier coding power locally. It’s beating Claude Sonnet 4.5 on SWE-bench in 4-bit …
https://x.com/outsource_/status/2038999111039357302

This model has been #1 trending for 3 weeks now. It’s Qwen3.5-27B fine-tuned on distilled data from Claude-4.6-Opus (reasoning). Trained via Unsloth. Runs locally on 16GB in 4-bit or 32GB in 8-bit. Model:
https://x.com/UnslothAI/status/2038625148354679270
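
The memory claims check out on a back-of-envelope basis: quantized weights take roughly bits/8 bytes per parameter. A quick sketch (the overhead allowances are my own rough assumptions, not from the posts):

```python
# Rough VRAM estimate for quantized model weights.
# Assumption: ~bits/8 bytes per parameter, plus 1-3 GB of headroom
# for KV cache and activations at modest context lengths.

def weight_vram_gb(params_billions: float, bits: int) -> float:
    """Approximate VRAM consumed by model weights alone, in GB."""
    return params_billions * (bits / 8)

for bits in (4, 8, 16):
    gb = weight_vram_gb(27, bits)
    print(f"27B @ {bits}-bit: ~{gb:.1f} GB weights (+ cache/activations)")
# 4-bit: ~13.5 GB -> fits a 16 GB card; 8-bit: ~27 GB -> needs the 32 GB tier.
```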

Very bullish on open source and local models. Imagine running a near-Opus-level model locally on that $600, 16GB Mac Mini you bought last month. This 27B Qwen3.5 distill was trained on Claude 4.6 Opus reasoning traces and is putting up real numbers: beats Claude Sonnet 4.5 on …
https://x.com/TheCraigHewitt/status/2039303217620627604

METR time horizons are doubling every ~107 days. Opus 4.6 reached 11.98 hours in February; today we should be at around ~15.2h, and by end of year ~87.4h. 90% CIs today, April 3rd 2026: [11.64h, 21.88h]; EOY: [53.13h, 164.19h].
https://x.com/scaling01/status/2040047917306876325
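
The point estimates fall out of straight exponential extrapolation from the February measurement; a minimal sketch (the day offsets are my approximations, since the exact measurement date isn’t given):

```python
DOUBLING_DAYS = 107   # claimed METR doubling time
H0 = 11.98            # Opus 4.6 horizon in hours, measured February 2026

def horizon(days_elapsed: float) -> float:
    """Horizon after `days_elapsed` days, doubling every DOUBLING_DAYS."""
    return H0 * 2 ** (days_elapsed / DOUBLING_DAYS)

# Assumed offsets from the February measurement: ~37 days to April 3,
# ~307 days to end of year.
print(f"today: ~{horizon(37):.1f}h")   # ~15.2h
print(f"EOY:   ~{horizon(307):.1f}h")  # ~87.5h with these offsets (post: ~87.4h)
```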

We’ve added Pareto frontier charts to the leaderboard. Now available across: Text, Vision, Search, Document, and Code Arena. The Pareto frontier curve demonstrates which models are most efficient at their level of performance (by Arena score) vs. a blended price per 1M tokens.
https://x.com/arena/status/2039377186432618885
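
For intuition: a model sits on this frontier if no other model is both cheaper and higher scoring. A minimal sketch of that dominance check, with invented (price, score) points rather than real Arena data:

```python
def pareto_frontier(models: dict[str, tuple[float, float]]) -> list[str]:
    """Return model names not dominated on (price per 1M tokens, Arena score).

    A model is dominated if another model has price <= its price and
    score >= its score, with at least one strict inequality.
    """
    frontier = []
    for name, (price, score) in models.items():
        dominated = any(
            (p <= price and s >= score) and (p < price or s > score)
            for other, (p, s) in models.items() if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier, key=lambda n: models[n][0])  # order by price

# Hypothetical entries, not real Arena data:
example = {"A": (0.5, 1250), "B": (2.0, 1310), "C": (1.0, 1240), "D": (8.0, 1330)}
print(pareto_frontier(example))  # ['A', 'B', 'D'] -- C is dominated by A
```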

Collinear presents YC-Bench. This benchmark evaluates agent capability to run a simulated startup over a one-year horizon spanning hundreds of turns.
https://x.com/arankomatsuzaki/status/2039541189968626047

Evals rhyme with training data: the same rigor and care we put into data quality/curation for training should go into eval design. Training data updates the weights of our models; each example contributes a weight push in some direction to correctly classify that datapoint. Evals …
https://x.com/Vtrivedy10/status/2039029715533455860
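
The “weight push” picture is literal: each training example contributes one gradient step. A one-step sketch for a tiny logistic classifier (all numbers illustrative):

```python
import numpy as np

# One SGD step: the example (x, y) pushes the weights toward
# classifying that datapoint correctly.
w = np.array([0.2, -0.1])
x, y, lr = np.array([1.0, 2.0]), 1.0, 0.1

p = 1 / (1 + np.exp(-w @ x))  # predicted probability of class 1
grad = (p - y) * x            # gradient of log-loss for this one example
w -= lr * grad                # the "weight push" from this datapoint
print(w)                      # weights nudged toward classifying (x, y) correctly
```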

I just published a blog that covers 30+ popular LLM evals / benchmarks and how they are created. Here are the common themes for success… For full details, find the blog post here:
https://t.co/sWSNkbCEhm (1) Domain Taxonomy. Most popular LLM benchmarks categorize their data …
https://x.com/cwolferesearch/status/2039009111711367557

I really like the strategy used by CursorBench to evaluate Composer 2. Many good design decisions: – Benchmark items are sourced from real coding sessions (from the Cursor team, so no issues with opt-in), which makes the evals realistic and less prone to contamination. – The …
https://x.com/cwolferesearch/status/2037726856699420987

Introducing AA-AgentPerf – the hardware benchmark for the agent era. Key details: ➤ Real agent workloads, not synthetic queries: we’ve captured real coding agent trajectories where our agents used up to 200 turns and worked with sequence lengths >100K tokens ➤ Production …
https://x.com/ArtificialAnlys/status/2037562417836929315

Introducing Contra Labs. The first frontier data and evaluation lab for Creative AI.
https://x.com/contraben/status/2039021014244262000?s=20

New conceptual guide: 🔄 The agent improvement loop starts with a trace. Tracing is the foundational primitive for improving agents. A trace gives you the full behavioral record of what an agent actually did. From there, teams can enrich traces with evals and human feedback …
https://x.com/LangChain/status/2039028327030079565
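
A trace in this sense is just a structured record of every step the agent took, which later gets enriched with scores and labels. A framework-agnostic sketch of such a record (field names are my own, not LangChain’s schema):

```python
from dataclasses import dataclass, field, asdict
import json, time, uuid

@dataclass
class TraceStep:
    """One step in an agent run: a model call or tool call plus its result."""
    kind: str        # "llm" or "tool"
    name: str        # model or tool name
    inputs: dict
    output: str
    latency_s: float

@dataclass
class AgentTrace:
    run_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    started_at: float = field(default_factory=time.time)
    steps: list[TraceStep] = field(default_factory=list)
    feedback: dict = field(default_factory=dict)  # evals / human labels, added later

    def log(self, **kwargs) -> None:
        self.steps.append(TraceStep(**kwargs))

trace = AgentTrace()
trace.log(kind="tool", name="search", inputs={"q": "refund policy"},
          output="3 documents found", latency_s=0.42)
trace.feedback["correctness"] = 1.0  # enrichment: an eval score or human label
print(json.dumps(asdict(trace), indent=2))
```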

Reasoning over Mathematical Objects. Our 70-page(!) paper is out on arXiv, as covered by several of our recent blog posts. We study how to improve reasoning on hard tasks (e.g., math expressions) via: • better training data (& new evals) • better reward models (on-policy …
https://x.com/jaseweston/status/2040062089725645039

Tau Bench got an update! Tau Bench is one of the most adopted agentic benchmarks. They’ve now added “Banking”, a fintech-inspired customer support domain built around a realistic knowledge base of 698 documents across 21 product categories. Tasks require agents to search this …
https://x.com/_philschmid/status/2038655544613826985
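
Tasks like these hinge on the agent’s search tool over the knowledge base. A toy sketch of that retrieval step (the scorer and documents are invented for illustration; Tau Bench’s actual tooling will differ):

```python
def search_kb(query: str, docs: dict[str, str], top_k: int = 3) -> list[str]:
    """Rank documents by simple term overlap with the query (toy scorer)."""
    q_terms = set(query.lower().split())
    ranked = sorted(
        docs,
        key=lambda doc_id: len(q_terms & set(docs[doc_id].lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

# Invented stand-ins for the 698-document knowledge base:
kb = {
    "cards/dispute": "how to dispute a charge on a credit card",
    "loans/rates": "current personal loan interest rates and terms",
    "accounts/fees": "monthly account fees and how to waive them",
}
print(search_kb("dispute a credit card charge", kb))  # cards/dispute ranks first
```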

The Agent Evaluation Readiness Checklist. Starting to think through how to test your agents? We put together a step-by-step checklist for building, running, and shipping agent evals. 🧪 We walk through: → How to read traces in LangSmith and analyze errors, before building evals …
https://x.com/LangChain/status/2037590936234959355

We’re leaning incredibly hard into Open Models + Open Harnesses. Evals show that current open models get near-frontier (or better) intelligence on many tasks, they’re way cheaper, and usually faster. Real-world tasks need to take perf, cost, and latency into account; many tasks don’t …
https://x.com/Vtrivedy10/status/2039805753905840159

We’re leaning into the future of Agent Improvement with Traces, Evals, & Infra. The future will be deeply grounded in data so that we can win against slop. That means we’ll need to: – point smart agentic compute towards traces to surface and monitor errors – use human & agent …
https://x.com/Vtrivedy10/status/2039035899938267334

Weekend over. Here’s what I built:
https://t.co/me1qexYWgw A simple agent-native CLI to parse, sanitise, and commit agent traces to public or private Hugging Face datasets for analytics, evals, and training. What I focused on: – a schema that is actually useful for downstream …
https://x.com/jayfarei/status/2038385591818023278
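
The essence of such a tool is a fixed row schema plus a push to the Hub via the `datasets` library. A minimal sketch (the schema fields and repo name are illustrative placeholders, not the author’s actual design):

```python
from datasets import Dataset

# Illustrative schema: one row per agent step, sanitised of secrets/PII upstream.
rows = [
    {"run_id": "r1", "step": 0, "role": "tool", "name": "search",
     "input": "refund policy", "output": "3 docs found", "latency_s": 0.42},
    {"run_id": "r1", "step": 1, "role": "llm", "name": "model",
     "input": "summarise docs", "output": "summary text", "latency_s": 1.30},
]

ds = Dataset.from_list(rows)
# Requires `huggingface-cli login`; the repo name is a placeholder.
ds.push_to_hub("your-username/agent-traces", private=True)
```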

Why it’s getting harder to measure AI performance
https://www.understandingai.org/p/why-its-getting-harder-to-measure

World Reasoning Arena: a comprehensive benchmark for evaluating world models. It exposes a substantial gap between current models and human-level hypothetical reasoning.
https://x.com/arankomatsuzaki/status/2038443186255991169

GLM-5V-Turbo is now live in Vision Arena. Test its ability to reason over visual inputs using your real-world prompts. Don’t forget to vote so we can see how it stacks up.
https://x.com/arena/status/2039400189178556814

Are open source models catching up to proprietary models? We’ve looked back at 3 years of Arena’s data to show how the race has evolved. For comparison, we’ve taken the top 20% of the models and uncovered the following: – Before mid-2024: the gap was between 100-150 points – In …
https://x.com/arena/status/2037584085997216100
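
The underlying computation is simple: per time period, compare the mean score of the top 20% of open models to the top 20% of proprietary ones. A sketch with made-up numbers standing in for Arena’s data:

```python
import pandas as pd

# Hypothetical leaderboard rows: (open-weights?, year, Arena score).
df = pd.DataFrame({
    "open":  [True, False, True, False, True, False],
    "year":  [2023, 2023, 2024, 2024, 2025, 2025],
    "score": [1120, 1260, 1230, 1310, 1330, 1345],
})

def top_quintile_mean(scores: pd.Series) -> float:
    """Mean of the top 20% of scores (always at least one model)."""
    k = max(1, int(len(scores) * 0.2))
    return scores.nlargest(k).mean()

for year, g in df.groupby("year"):
    gap = (top_quintile_mean(g.loc[~g["open"], "score"])
           - top_quintile_mean(g.loc[g["open"], "score"]))
    print(f"{year}: proprietary lead ~{gap:.0f} points")
# 2023: ~140, 2024: ~80, 2025: ~15: the gap narrowing over time.
```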

Today we drop Trinity-Large-Thinking. SOTA on Tau2-Airline, frontier-class on Tau2-Telecom, and the #2 model on PinchBench, right behind Opus. On BCFLv4, we’re in the mix with the best. 26 people with under $50M raised and a ruthless pursuit of greatness. What this team just …
https://x.com/MarkMcQuade/status/2039375842560872834

One way to see the advancement of AI is to see how much further you can get with new models on the same hardware. Here is “an otter using a laptop on an airplane” generated on my home computer using the open-weights Wan 2.1, first try. We have come pretty far in 18 months.
https://x.com/emollick/status/2037616578787713194

Today we announce a new evaluation framework to improve AI benchmark reproducibility. By optimizing the ratio of the number of items to human raters per item, we can better capture the nuance of human disagreement in subjective tasks. Learn more:
https://x.com/GoogleResearch/status/2039014600927043926
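
The trade-off being optimized: with a fixed budget of total human ratings, more raters per item gives cleaner per-item labels but covers fewer items. A toy sketch under a simple noise model (binary labels, independent raters each correct with probability 0.7; all assumptions are mine, not Google’s framework):

```python
from math import comb

def majority_correct(r: int, p: float = 0.7) -> float:
    """P(majority of r independent raters matches the true label),
    for binary labels and odd r, each rater correct with probability p."""
    return sum(comb(r, k) * p**k * (1 - p)**(r - k)
               for k in range(r // 2 + 1, r + 1))

BUDGET = 3000  # total human ratings available
for r in (1, 3, 5, 9, 15):
    n_items = BUDGET // r
    print(f"r={r:2d} raters/item: label accuracy {majority_correct(r):.3f}, "
          f"items covered {n_items}")
# More raters per item -> cleaner labels but fewer items; the framework's
# point is to tune this ratio rather than defaulting to one rater per item.
```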

Gemma 4 31B shifts the Pareto frontier, scoring +30 Arena points above similarly priced models like DeepSeek 3.2. Its position on the Pareto frontier is based on early pricing indicators from third parties.
https://x.com/arena/status/2040128319719670101

Impressive, very nice. Now let’s compare a 31B dense to a 31B-active, 670B-total model instead. Flop for flop.
https://x.com/stochasticchasm/status/2039912148676264334
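
The “flop for flop” point: per-token inference compute scales with *active* parameters, so a 31B dense model and a 31B-active MoE cost roughly the same compute per token even though the MoE stores 670B weights. A rough sketch using the standard ~2·N_active FLOPs-per-token approximation:

```python
def flops_per_token(active_params_billions: float) -> float:
    """Rough forward-pass FLOPs per token: ~2 FLOPs per active parameter."""
    return 2 * active_params_billions * 1e9

for name, active, total in [("dense 31B", 31, 31),
                            ("MoE 31B-active / 670B-total", 31, 670)]:
    print(f"{name}: ~{flops_per_token(active) / 1e9:.0f} GFLOPs/token, "
          f"{total}B params to store")
# Same per-token compute, very different memory footprint; hence the ask
# to compare "flop for flop" rather than by total parameter count.
```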

GEditBench v2: A Human-Aligned Benchmark for General Image Editing. Paper:
https://x.com/_akhaliq/status/2039007111741366620

Trinity-Large-Thinking achieves state-of-the-art results on Tau2 Airline and is at frontier level on Tau2 Telecom. It’s also the #2 model on PinchBench, just behind Opus 4.6, and we’re among the giants on BCFLv4.
https://x.com/latkins/status/2039370549743243353

A sign that human creativity is a bottleneck: this year everyone can generate almost any image or video they can think of for nearly free, and the April Fools posts are basically just as bad as any other year.
https://x.com/emollick/status/2039379053480914959
