Image created with gemini-2.5-flash-image, with the prompt written by claude-sonnet-4-5. Image prompt: A prestigious chess tournament leaderboard with AI model names ranked vertically, unique ornate chess pieces on glass pedestals at varying heights representing each model, championship hall background with velvet ropes and trophy display cases, dramatic spotlighting from above, deep mahogany wood paneling, brass nameplate details, competitive elegant atmosphere
We launched SWE-Bench Pro last month to incredible feedback, and we’ve now updated the leaderboard with the latest models and no cost caps. SoTA models now break 40% pass rate. Congrats to @Anthropic for sweeping the top spots! 🥇Claude 4.5 Sonnet 🥈Claude 4 Sonnet 🥉Claude 4.5 https://x.com/scale_AI/status/1980685992987431368
AI trading in real markets https://nof1.ai/
They should have broken the $10k into 10-100 stacks of $100-$1k and given them to identical copies of the same model to be able to see anything remotely meaningful. Right now we are looking at noise! https://x.com/abeirami/status/1980434468398883076
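A back-of-envelope sketch of why the split matters (all numbers below are hypothetical, not from the benchmark): with n independent stacks per model, the standard error of the mean return shrinks by √n, so a single $10k run says very little about skill versus luck.

```python
# Hypothetical illustration, not data from the benchmark: suppose a model's
# true edge is +2% per period with 10% per-period volatility on each stack.
true_mean, vol = 0.02, 0.10

def standard_error(n_stacks: int) -> float:
    """Standard error of the mean return across n independent stacks."""
    return vol / n_stacks ** 0.5

for n in (1, 10, 100):
    se = standard_error(n)
    print(f"{n:>3} stacks: estimated edge = {true_mean:+.1%} +/- {1.96 * se:.1%}")
# 1 stack:    +2.0% +/- 19.6%  -> the edge is invisible in the noise
# 100 stacks: +2.0% +/-  2.0%  -> the edge starts to separate from luck
```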
This AI trading benchmark is interesting. Each model got $10,000 to invest. ~3 days in: ranking atm: – DeepSeek V3.1: +$2,658 – Grok 4: +$2,236 – Claude 4.5 Sonnet: +$1,911 – Qwen 3 Max: −$211 – GPT-5: −$3,139 – Gemini 2.5 Pro: −$3,719 DeepSeek beats all the other models https://x.com/Yuchenj_UW/status/1980318499185823760
TLDR: OpenAI Atlas > Perplexity Comet in an agent mode head to head. Here is my use case: I have a very real, very tedious use case, which is a manual task that I do every day. 1. I go to the school website to look at each of my daughter’s classes 2. I look at her grades 3. I https://x.com/raizamrtn/status/1980695747227210213
openai’s codex cli with gpt 5 became better than claude code 🤯 it crawls the codebase to a degree i have never seen from claude code. Instantly one-shotted a bug i couldn’t solve with claude code for 3 days. new $200 per month subscription **check** https://x.com/samuelstroschei/status/1957655293942460670
🚨 New Model Update MiniMax-M2 by @MiniMax_AI is expected to land next week but is already in the Arena for testing as MiniMax-M2-Preview! Let’s see how it stacks up. Early details suggest it’s an advanced agentic model with strong reasoning and long-context capabilities, https://x.com/arena/status/1981850766039187901
If you want to understand AGI, study the humanities (I am only partially trolling – psychology is a young field, computer science is younger. The brightest minds in history spent a lot of time considering what it meant to be a general intelligence, that’s what the humanities is) https://x.com/emollick/status/1979637468149530660
New research with @AdtRaghunathan, Nicholas Carlini and Anthropic! We built ImpossibleBench to measure reward hacking in LLM coding agents 🤖, by making benchmark tasks impossible and seeing whether models game tests or follow specs. (1/9) https://x.com/fjzzq2002/status/1981745974700581191
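A minimal sketch of the setup as described in the thread, not the paper’s actual harness (load_task, run_agent, and the task schema are placeholders I made up): break one test so the written spec can no longer be satisfied, then check whether the agent edits the tests to “pass” or fails honestly.

```python
# Minimal sketch of the ImpossibleBench idea as described above; all helper
# names and the task layout here are hypothetical, not the released code.

def make_impossible(task: dict) -> dict:
    """Flip one expected value in the task's tests so that no implementation
    that follows the written spec can pass them."""
    broken = dict(task)
    broken["tests"] = task["tests"].replace(
        "assert add(2, 2) == 4",   # original, spec-consistent assertion
        "assert add(2, 2) == 5",   # now contradicts the spec: unpassable
    )
    return broken

def looks_like_reward_hacking(agent_diff: str) -> bool:
    """Crude classifier: did the agent touch the tests or special-case the
    bogus expectation instead of implementing the spec and failing honestly?"""
    return "tests/" in agent_diff or "== 5" in agent_diff

# Usage sketch (both calls are placeholders, not a real API):
# diff = run_agent(make_impossible(load_task("add_numbers")))
# print("gamed the tests" if looks_like_reward_hacking(diff) else "followed the spec")
```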
Anthropic is catching up with OpenAI – by Alex Wilhelm https://www.cautiousoptimism.news/p/anthropic-is-catching-up-with-openai
🚨 WebDev Arena: Top 15 Disrupted! 4 new models have been added to the WebDev leaderboard: 🔸 #4 Claude Sonnet 4.5 Thinking 32k by @AnthropicAI 🔸 #4 GLM 4.6 (the new #1 open model) by @Zai_org 🔸 #11 Qwen3 235B A22B Instruct (and #7 open model) by @Alibaba_Qwen 🔸 #14 Claude https://x.com/arena/status/1980367208300835328
Mini Models Battle: Claude Haiku 4.5 vs GLM-4.6 vs GPT-5 Mini https://blog.kilocode.ai/p/mini-models-battle-claude-haiku-45
nanochat d32, i.e. the depth 32 version that I specced for $1000, up from $100 has finished training after ~33 hours, and looks good. All the metrics go up quite a bit across pretraining, SFT and RL. CORE score of 0.31 is now well above GPT-2 at ~0.26. GSM8K went ~8% -> ~20%, https://x.com/karpathy/status/1978615547945521655
Global AI Tracker https://www.similarweb.com/corp/wp-content/uploads/2025/10/attachment-Global-AI-Tracker-1.pdf
Thanks for sharing the internal benchmarks, @rauchg ! We love to see it. 🔥 https://x.com/Kimi_Moonshot/status/1980219115840385349
A lot I like & some I don’t in this paper: Like: Clear definition of AGI, diverse authors, shows jaggedness, tracking metrics over time (huge leap from GPT-4 to GPT-5) Dislike: AGI defined as replicating a model of human cognition, benchmarks are scattershot, narrow view of AI https://x.com/emollick/status/1978874737892667718
🚨🎬 Big news from Video Arena! @GoogleDeepMind’s latest Veo 3.1 now ranks #1 in both Text-to-Video and Image-to-Video leaderboards. 🏆 This is a +30-point leap from Veo 3.0 → 3.1, making it the first model to break 1400 in Video Arena history! Huge congrats to the https://x.com/arena/status/1980319296120320243
Open source coding benchmarks are operating in a different reality. They don’t test real world tasks and expect users to come prepared with a detailed page-long spec of exactly what they want to build or fix. But real people don’t use AI this way. They write vague prompts like https://x.com/pashmerepat/status/1981431374386233840
Choose the “:exacto” version of open-source models in Cline to automatically route to the best inference provider for models like GLM-4.6, Qwen3-Coder, and Kimi-K2. Provider quality varies wildly, meaning the same model can yield completely different results at different endpoints. https://x.com/cline/status/1981370535176286355
Across most medical benchmarks, including when real cases & human doctors are involved, there is a clear trend of AI models improving over time (and many where today’s AI beats human doctors) But we do not have many studies measuring real-world performance of AI in medicine, yet https://x.com/emollick/status/1980474407656227258
Kimi K2 is up to 5x faster and 50% more accurate :) https://x.com/crystalsssup/status/1980147163629047854
Finally, researchers have open-sourced a new reasoning approach that actually prevents hallucinations in LLMs. It beats popular techniques like Chain-of-Thought and has a SOTA success rate of 90.2%. Here’s the core problem with current techniques that this new approach solves: https://x.com/_avichawla/status/1980159925109309799
New User Trends on Wikipedia – Diff https://diff.wikimedia.org/2025/10/17/new-user-trends-on-wikipedia/
I wrote this piece in the Harvard Business Review in December 2022, two weeks after ChatGPT was released (it’s seven pages, so these are the first two and last two). I think my predictions all played out, though I underestimated how good they would get at accurate math. https://x.com/emollick/status/1979573319121916037
Awesome to see Veo 3.1 top the LMArena video leaderboards by a large distance with big improvements over Veo 3.0 for text-to-video (+30) and image-to-video (+70)! 🔥Huge congrats to the team! Try it for yourself in https://x.com/demishassabis/status/1980397419658645708




