Ethan B. Holland

Over 54,900 manually organized AI links and counting

Benchmarks: AI News Week Ending 03/27/2026

March 27, 2026

Image created with gemini-3.1-flash-image-preview with claude-sonnet-4-5. Image prompt: Using the provided reference image, preserve the exact wooden crate construction with horizontal weathered slats, iron hardware, three-panel face layout, and hand-painted black stencil lettering style, but replace the address text with ‘BENCHMARKS’ in the same confident brushstroke style. Place the crate on a muddy early-spring path with new grass, lean a vintage wooden surveyor’s measuring rod against one corner showing black-and-white increments, and rest a partially-opened brass carpenter’s folding ruler on the crate’s top edge. Soft raking light, shallow depth of field, photorealistic 1950s material world, documentary stillness.

Announcing ARC-AGI-3. The only unsaturated agentic intelligence benchmark in the world. Humans score 100%, AI <1%. This human-AI gap demonstrates we do not yet have AGI. Most benchmarks test what models already know; ARC-AGI-3 tests how they learn.
https://x.com/arcprize/status/2036860080541589529

ARC-AGI-3 benchmark: 100% solvable by humans, 1% solvable by AI. Everybody keep building benchmarks that agents utterly fail at! Proud this was a Laude Slingshot; will fund other benchmarks that reset SotA to 1%:
https://x.com/andykonwinski/status/2036870772745261202

ARC-AGI-3 is out now! We’ve designed the benchmark to evaluate agentic intelligence via interactive reasoning environments. Beating ARC-AGI-3 will be achieved when an AI system matches or exceeds human-level action efficiency on all environments, upon seeing them for the first
https://x.com/fchollet/status/2036861192619384989

ARC-AGI-3 the agentic benchmark where humans can’t beat the “human baseline” and typical agentic harnesses and tools aren’t allowed > 100% just means that all levels are solvable > the 1% number uses completely different and extremely skewed scoring based on the 2nd best
https://x.com/scaling01/status/2036890367803429230

ARC-AGI 3 is here, and all existing AI models are below 1% on the benchmark. It’s gonna take a while until this one is saturated. How it measures intelligence: – 100% human-solvable environments – Skill-acquisition efficiency over time – Long-horizon planning with sparse
https://x.com/mark_k/status/2036882659406762031

ARC-AGI-3 https://arcprize.org/arc-agi/3

ARC-AGI-3 took me a few tries, but it is definitely human winnable. I am curious how much of the initially very low performance of frontier models is harness, vision, and tools, versus how much are limitations of LLMs. I guess we will find out!
https://x.com/emollick/status/2036865990282092940

General game playing is more difficult than “AGI” (Just to be clear: I really like ARC-AGI-3 and think it’s a great contribution, but the proliferation of AGI benchmarks is IMO proof of how pointless the concept of AGI is)
https://x.com/togelius/status/2036989880887050333

Keep in mind: ARC-AGI is *not* a final exam that you pass to claim AGI. Including ARC-AGI-3. The benchmarks target the residual gap between what’s hard for AI and what’s easy for humans. It’s meant to be a tool to measure AGI progress and to drive researchers towards the most
https://x.com/fchollet/status/2036879665655406944

One killer feature of ARC-AGI-3 is hosted replays for analysis. We published replays for all verified scores (seen below). And individual researchers can use the same tools to improve their models.
https://x.com/mikeknoop/status/2036904122549751907

The scoring of ARC-AGI-3 doesn’t tell you how many levels the models completed, but how efficiently they completed them compared to humans. It actually uses squared efficiency, meaning that if a human took 10 steps to solve a level and the model took 100 steps, then the model gets a score of 1%
https://x.com/scaling01/status/2036864865307177430
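
The arithmetic in that post can be checked with a minimal sketch. This is not ARC Prize's official scoring code: the function name `squared_efficiency_score` is hypothetical, and capping efficiency at 1.0 when the model uses fewer steps than the human is an assumption. The only part taken from the post is the squared ratio of human steps to model steps.

```python
def squared_efficiency_score(human_steps: int, model_steps: int) -> float:
    """Per-level score in [0, 1] under the squared-efficiency rule described in the post.

    Assumption (not from the source): efficiency is capped at 1.0, so a model
    that needs fewer steps than the human baseline still scores at most 100%.
    """
    if human_steps <= 0 or model_steps <= 0:
        raise ValueError("step counts must be positive")
    # Efficiency is the ratio of human steps to model steps, capped at 1.0.
    efficiency = min(1.0, human_steps / model_steps)
    # Squaring penalizes inefficiency sharply: 10x the steps gives 1/100 of the score.
    return efficiency ** 2


# The example from the post: a human needs 10 steps, the model needs 100.
print(f"{squared_efficiency_score(10, 100):.0%}")  # prints "1%"
```

Under this reading, a model that solves a level can still score close to zero on it, which is why a sub-1% leaderboard number does not by itself say how many levels were completed.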

OpenAI released GPT-5.4 mini and nano, cheaper variants of GPT-5.4 with the same reasoning modes. GPT-5.4 nano is the standout, scoring ahead of both Claude Haiku 4.5 and Gemini 3.1 Flash-Lite Preview with lower per-token pricing. @OpenAI released GPT-5.4 mini (xhigh, 48) and
https://x.com/ArtificialAnlys/status/2037043552405119395

A New Framework for Evaluating Voice Agents (EVA) https://huggingface.co/blog/ServiceNow-AI/eva

@stalkermustang It is trivial to solve all public ARC-AGI-3 tasks if you have a human looking at them and designing a system to beat them (we have released a harness that uses human replay to score 100%). But our leaderboard is not about measuring how well human intelligence does on ARC-AGI-3,
https://x.com/fchollet/status/2036870715392352751

> be me > build “AGI” benchmark > actually version 3 already > we don’t talk about 1 and 2 > (they saturated in a year) > invent new scoring method > if human scores above AI, use squared efficiency > example: human took 10 steps to solve level > AI took 100 steps to solve a
https://x.com/scaling01/status/2036866103884775654

An alien species with zero knowledge of human language could ace ARC-AGI-3 on day 1, and I think that’s beautiful. At a time when AI is dominated by language models, it’s refreshing to have a frontier benchmark (the only one that I’m aware of) that requires zero language
https://x.com/bradenjhancock/status/2036879154772402636

ARC-AGI-3 scores for GPT-5.4, Gemini 3.1 Pro and Opus 4.6: Gemini 3.1 Pro 0.37%, GPT-5.4 0.26%, Opus 4.6 0.25%, Grok 4.2 0%
https://x.com/scaling01/status/2036853669065306534

Maybe we should retroactively all just agree with @tylercowen that o3 was AGI so we can stop arguing about it. (Also, doing so will drive home the lesson that AGI alone is not enough for transformation)
https://x.com/emollick/status/2036480810677662006

The G in AGI stands for “general”. General intelligence does not mean that you have been specifically trained for a large range of tasks. It means you can approach any NEW task and figure it out, just like humans do. If regular people can do it on their own (no guidance, no
https://x.com/fchollet/status/2036866189587271797

We Tested MiniMax M2.7 Against Claude Opus 4.6 – by Darko https://blog.kilo.ai/p/we-tested-minimax-m27-against-claude

AI has solved one of the problems in FrontierMath: Open Problems, our benchmark of real research problems that mathematicians have tried and failed to solve. See thread for more.
https://x.com/EpochAIResearch/status/2036114281985724906

What do frontier AI companies’ job postings reveal about their plans? https://epochai.substack.com/p/what-do-frontier-ai-companies-job

“Goetterdaemmerung’s corpus hemorrhaged through cryptographic hash, eschaton pooling in existential void beneath fluorescent hum. photons whispering prayers” is a garbage sentence that GPT-5 loves. You shouldn’t be using LLMs as a judge of good writing. They are easily fooled.
https://x.com/emollick/status/2035817176758673492

In 2023, WebArena took 7 grad students more than 6 months to build just 5 environments with 812 variable browser-use tasks. Now, it takes under 10 hours and less than $100 per environment, with easy support for parallel generation. Excited to introduce WebArena-Infinity: a
https://x.com/shuyanzh36/status/2036098118023049630