Image created with GPT Image 1. Image prompt: A statuesque model in a corseted gown built of ascending marble plinths, each engraved with AI benchmark scores, stands frozen like a Greco-African sculpture; gold laurels crown their head, camera orbiting slowly in tableau vivant, symbolizing standard-setting elegance and measured progress.

Big Gemini 2.5 Pro Update! Better coding and UI web applications! We’re excited to drop this I/O preview early, focused on coding, especially UIs, new video-to-code features and improved agentic capabilities. 🌋 > Better on LiveCodeBench and Aider > #1 on @lmsysorg WebDev Arena https://x.com/_philschmid/status/1919770969788313836

The Ultimate LLM Meta-Leaderboard averaged across the 28 best benchmarks Gemini 2.5 Pro > o3 > Sonnet 3.7 Thinking https://x.com/scaling01/status/1919217718420508782
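The post doesn’t say how the meta-leaderboard aggregates its 28 benchmarks. A common approach is to convert each benchmark’s raw scores to z-scores (so benchmarks on different scales weigh equally) and rank models by their mean z-score. A minimal sketch under that assumption, with purely illustrative numbers and benchmark names:

```python
# Hypothetical meta-leaderboard: z-normalize each benchmark, then rank
# models by mean z-score across benchmarks. Scores below are made up.
from statistics import mean, pstdev

scores = {  # model -> {benchmark: raw score} (illustrative)
    "gemini-2.5-pro": {"bench_a": 84.0, "bench_b": 63.0},
    "o3":             {"bench_a": 82.0, "bench_b": 62.0},
    "sonnet-3.7":     {"bench_a": 80.0, "bench_b": 61.0},
}
benchmarks = sorted({b for s in scores.values() for b in s})

def zscores(bench):
    vals = [scores[m][bench] for m in scores]
    mu, sd = mean(vals), pstdev(vals)
    return {m: (scores[m][bench] - mu) / sd for m in scores}

per_bench = {b: zscores(b) for b in benchmarks}
meta = {m: mean(per_bench[b][m] for b in benchmarks) for m in scores}
ranking = sorted(meta, key=meta.get, reverse=True)
print(ranking)  # → ['gemini-2.5-pro', 'o3', 'sonnet-3.7']
```

Z-normalization is one of several defensible choices; min-max scaling or rank averaging would order these models the same way here but can diverge on real data.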

🚨Breaking: @GoogleDeepMind’s latest Gemini-2.5-Pro is now ranked #1 across all LMArena leaderboards 🏆 Highlights: – #1 in all text arenas (Coding, Style Control, Creative Writing, etc) – #1 on the Vision leaderboard with a ~70 pts lead! – #1 on WebDev Arena, surpassing Claude https://x.com/lmarena_ai/status/1919774743038984449

New Gemini-2.5-Pro ranks #1 on WebDev Arena as well, first model surpassing Claude! 🏆 https://x.com/lmarena_ai/status/1919774753398915225

Gemini 2.5 Pro has dethroned Sonnet 3.7 on the WebDevArena Leaderboard. Not even o3 could do that! https://x.com/scaling01/status/1919771796334616759

i have been on a shopping bender this morning, this is much better than i expected! https://x.com/sama/status/1918735773098004680

🏆 With our new Parakeet model (parakeet-tdt-0.6b-v2), we have achieved a new standard for automatic speech recognition (ASR) with an 👀 industry-best 6.05% Word Error Rate on the @HuggingFace Open-ASR-Leaderboard. 🦜 Parakeet V2 takes performance to the next level with https://x.com/NVIDIAAIDev/status/1917976429939351944
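The 6.05% figure is a Word Error Rate: the word-level edit distance (substitutions + deletions + insertions) between the model’s transcript and the reference, divided by the reference word count. A minimal sketch of the metric itself (the leaderboard additionally applies text normalization, omitted here):

```python
# Word Error Rate via word-level Levenshtein distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six -> WER ≈ 0.167
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

So 6.05% means roughly one word-level error per 16–17 reference words, averaged over the leaderboard’s test sets.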

Gemini-2.5-Pro-preview-05-06 is now my top coding model. It beats o3 and Claude 3.7 Sonnet on several of my hard prompts. One example prompt: “Code simulation of water in a bucket that is rocking back and forth.” See how it crushes o3 and Sonnet. Google, call it Gemini 3! https://x.com/Yuchenj_UW/status/1919808911793656151

Thanks to this anonymous user, I now have preferred terms for output tokens and thinking tokens: “yap” and “juice.” Future model cards should include yap & juice statistics. Now we just need as good a term for the size of the context window. https://x.com/emollick/status/1919761182275100792

@lmarena_ai Back from a meeting block and dinner. 🙂 I’ll just share a few more pieces of context here that may help clarify — and again I don’t speak for all the authors but I do feel a bit obliged to respond since my post was tagged above. It is not true models are removed without https://x.com/sarahookr/status/1917813183462662215

AI is no longer a “nice to have” in today’s competitive financial services landscape; it’s a must. In our upcoming webinar, our experts will share how financial institutions can adopt AI strategically, without compromising security or compliance. Register now: https://x.com/cohere/status/1917996900487401964

For the first time in March, ChatGPT gets into top 10 sources of traffic to Hugging Face. Might get to top 5 in a few months if growth continues like this. https://x.com/ClementDelangue/status/1918070591300776222

Qwen3 benchmark results 235B is a BEAST placing 3rd in the overall and with the best generalization among all tested models all of the Qwen3 models have very low or perfect percentages of invalid moves which means good instruction following 235B MoE > 32B > 14B > 30B MoE > 8B https://x.com/scaling01/status/1918031153312731536

I Built a team of 5 Agents with #Google Agent Development Kit & @nebiusaistudio 🔥 What it does? – A comprehensive AI analysis agent that analyzes latest updates, benchmarks, pricing and trends related to LLMs. All agents are working in sequence and Orchestrator Agent has: https://x.com/Astrodevil_/status/1914321709487759388

What’s the best model for building AI agents? Hard to tell without careful experimentation, and it will also depend on the domain and requirements. I often check this Agent Leaderboard built by @nlpguy_ and the @rungalileo team. Observations: – A few new models have been https://x.com/omarsar0/status/1917939469103305013

Important take from @Thom_Wolf: “It’s getting harder to tell which AI model is the best as traditional AI benchmarks become saturated. Going forward, Wolf said the AI industry could rely on two new benchmarking approaches—agency‑based and use‑case‑specific.” https://x.com/fdaudens/status/1920131971595817435

Join us on May 21st: I’ll talk about how we built SWE-bench & SWE-agent and what I’m excited about for the future of autonomous AI systems. https://x.com/OfirPress/status/1919460877784240522

News: OpenAI’s o3 debuts at #5 in WebDev Arena — a ~100 score jump over o3-mini! With ~150K community votes, WebDev Arena ranks models on real-world web app-building tasks. Current top 3: 🥇Claude 3.7 Sonnet 🥈Gemini 2.5 Pro 🥉GPT-4.1 Links in the thread! https://x.com/lmarena_ai/status/1917959763284894159

Evaluating Frontier Models for Stealth and Situational Awareness Presents: – 5 evals of ability to reason about and circumvent oversight – 11 evals for measuring a model’s ability to instrumentally reason about itself, its environment and its deployment No SotA model currently https://x.com/arankomatsuzaki/status/1919237438402343079

We’ve added four new benchmarks to the Epoch AI Benchmarking Hub: Aider Polyglot, WeirdML, Balrog, and Factorio Learning Environment! Before we only featured our own evaluation results, but this new data comes from trusted external leaderboards. And we’ve got more on the way 🧵 https://x.com/EpochAIResearch/status/1919831883875062184

When reading AI benchmarks, aside from the fact that many of the AIs are (accidentally or on purpose) trained on the test set, many tests are just bad. MMLU likely maxes out at 90% or so because so many of the questions in it are just wrong. It is also uncalibrated in difficulty. https://x.com/emollick/status/1919121493083971617
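Mollick’s ceiling argument can be made concrete: if a fraction of items carries a wrong answer key, a model that is always right gets marked wrong on exactly those items, so measured accuracy saturates below 100%. A toy model of the effect (the 10% bad-label rate and 4-choice format are assumptions chosen to match the ~90% estimate, not measured values):

```python
# Measured accuracy on a multiple-choice benchmark with mislabeled items.
def measured_accuracy(true_skill: float, bad: float, k: int = 4) -> float:
    # On correctly-keyed items the model scores its true skill. On the
    # `bad` fraction, its correct answer can never match the bogus key,
    # while its wrong answers match it by chance ~1/(k-1) of the time.
    return true_skill * (1 - bad) + (1 - true_skill) * bad / (k - 1)

# A perfect model on a test with 10% wrong keys tops out at 90%.
print(measured_accuracy(1.0, 0.10))  # → 0.9
```

The same arithmetic explains why small gaps between frontier models near the ceiling mostly measure label noise rather than capability.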

The Leaderboard Illusion Author’s Explanation: https://x.com/TheAITimeline/status/1919155704579142047

Congratulations Inception Labs on your API launch and for being the first API offering diffusion large language models for production use! We will be benchmarking the API shortly on Artificial Analysis. https://x.com/ArtificialAnlys/status/1917830734334812541

7/ @mem0ai’s memory architecture extracts and retrieves key conversational facts, consistently outperforming OpenAI by 26% on accuracy with 90% lower latency and token usage. – @unwind_ai_ great work @taranjeetio https://x.com/AtomSilverman/status/1918424782540030096

Contrary to some discussions, I just don’t see signs of a major increase in hallucination rates for recent models, or for reasoners overall, in the data. It seems like some models do better than others, but many of the recent models have the lowest hallucination rates. https://x.com/emollick/status/1919861888998899898

Ahead of I/O, we’re releasing an updated Gemini 2.5 Pro! It’s now #1 on WebDevArena leaderboard, breaking the 1400 ELO barrier! 🥇 Our most advanced coding model yet, with stronger performance on code transformation & editing. Excited to build drastic agents on top of this! https://x.com/OriolVinyalsML/status/1919770619182215440
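Arena scores are Elo-style ratings, so gaps like the “1400 barrier” or a ~70-point vision lead translate directly into expected head-to-head win rates via the standard Elo expected-score formula (a sketch of the formula only; LMArena’s actual fitting uses a Bradley–Terry model):

```python
# Expected win probability for rating r_a against r_b under Elo:
# E_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
def expected_score(r_a: float, r_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

# A 100-point lead implies roughly a 64% expected win rate head-to-head.
print(round(expected_score(1400, 1300), 2))  # → 0.64
```

By the same formula, a ~70-point lead corresponds to about a 60% expected win rate, which is why such gaps are considered large on these leaderboards.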

Gemini 2.5 Pro Livebench results are in! It improved across the board except a tiny regression in the Mathematics category! Big improvement in the Data Analysis category! https://x.com/scaling01/status/1919824965748027789

Google just released a new eval for video generation called TRAJAN. TRAJAN does automated evaluation of temporal consistency in generated videos using a point track autoencoder trained to reconstruct point tracks. https://x.com/arankomatsuzaki/status/1918148050671026336

This latest version of Gemini 2.5 Pro leads on the WebDev Arena Leaderboard – which measures how well an AI can code a compelling web app. 🛠️ It also ranks #1 on @LMArena_ai in Coding. https://x.com/GoogleDeepMind/status/1919770268299321608

Study in Nature: “Across 30 out of 32 evaluation axes from the specialist physician perspective & 25 out of 26 evaluation axes from the patient-actor perspective, AMIE [Google Medical LLM] was rated superior to PCPs [primary care physicians] while being non-inferior on the rest.” https://x.com/emollick/status/1919020839694913986

We just completed preliminary evaluations for Gemini 2.5 Pro on FrontierMath! We used an older version of our scaffold, so this is not exactly comparable to our other results. Gemini 2.5 Pro got 13% correct (±2%), compared to o4-mini’s 16% to 19% (±2%) with the same scaffold. https://x.com/EpochAIResearch/status/1918330845112262753

👏🏻 Excited to see Qwen3-235B-A22B’s impressive performance on LiveCodeBench! This positions Qwen3 as the top open model for competitive-level code generation, matching the performance of o4-mini (low). https://x.com/huybery/status/1919418019517776024

Pretty fucking incredible week so far: > Qwen3 – MoE (235B, 30B) + Dense (32, 14, 8, 4, 0.6B) > Xiaomi – MiMo 7B dense > Kyutai – Helium 2B dense > DeepSeek – Prover V2 671B MoE > Qwen2.5 Omni 3B > Microsoft – Phi4 14B Reasoning, Mini (3.8B) & Plus > JetBrains – Mellum 4B Dense https://x.com/reach_vb/status/1917938596465750476

Anyone remember the ChatGPT plugin store? (regarding the Anthropic Claude MCP listings) https://x.com/AtomSilverman/status/1918032467384303877

Two stories made a lot of noise this week, both landing on the same point: the metrics we lean on are nudging the whole AI field off balance. 1. Sycophantic drift ChatGPT started to flatter and agree with you on everything. Agreement counted as “good,” so the model optimised https://x.com/TheTuringPost/status/1919563905799696703

The community votes are in for Qwen3-235B-A22B 🥁 The latest open-source Qwen3 is now on the Arena Top 10 🏆 Congrats to @alibaba_qwen on this achievement! 👏 Highlights: 💠 For Chat: Qwen3-235B-A22B ranks #10, tied with o1 💠 Strong in Coding at #4 and Math #1 💠 For WebDev: https://x.com/lmarena_ai/status/1919448953042706759

Command A, our state-of-the-art generative model, is now the highest-scoring generalist LLM on the Bird Bench leaderboard for SQL! It outperforms other systems that rely on extensive scaffolding to tackle these SQL benchmarks, and instead delivers these results out-of-the-box, https://x.com/cohere/status/1918386633772286278

Big news: we’re now officially part of @CoreWeave, the #1 AI Cloud Platform. This acquisition marks the start of something much bigger and we’re beyond excited to innovate and scale together. More on what’s next from our founder, @l2k: https://x.com/weights_biases/status/1919378138129183138

Introducing Qwen3! We release and open-weight Qwen3, our latest large language models, including 2 MoE models and 6 dense models, ranging from 0.6B to 235B. Our flagship model, Qwen3-235B-A22B, achieves competitive results in benchmark evaluations of coding, math, general https://x.com/Alibaba_Qwen/status/1916962087676612998

(4) Production AI Engineering starts with Evals — with Ankur Goyal of Braintrust https://www.latent.space/p/braintrust

Reasonable take in principle, but the last time I thought of an NLP benchmark as a meaningful *benchmark* (not just a task to be adapted for research questions) was probably in 2021. Since then, they’re often counterproductive as benchmarks, due to modern post-training patterns. https://x.com/lateinteraction/status/1919054877583667507

The Ultimate LLM Benchmark list: SimpleBench: https://x.com/scaling01/status/1919092778648408363

[2504.20879] The Leaderboard Illusion https://arxiv.org/abs/2504.20879

Time saved by AI offset by new work created, study suggests – Ars Technica https://arstechnica.com/ai/2025/05/time-saved-by-ai-offset-by-new-work-created-study-suggests/
