Image created with gemini-2.5-flash-image, with the prompt written by claude-sonnet-4-5. Image prompt: A classic 1961 Ferrari 250 GT California Spyder in Rosso Corsa red racing through a pristine automotive test track with white timing gates, yellow measurement markers, and checkered flags at golden hour, cinematic automotive photography with warm highlights, subtle timing laser beams crossing the track surface, polished chrome gleaming, soft shadows, premium studio-grade lighting, landscape orientation, minimal clean composition with space for text overlay.

Cranston AI (@cranston_ai) does your company’s bookkeeping & taxes with AI. Their agents pull in context from across the business and, after human review, file a full corporate tax return with the IRS. https://x.com/ycombinator/status/1975591950255358411

Big progress on this important benchmark (but still weird artifacts). https://x.com/emollick/status/1976702663330038205

An interesting forecasting benchmark: at current trends, we’re one year away from models matching the performance of superforecasters (and GPT-4.5 is SOTA on this benchmark!). https://x.com/gdb/status/1976139319787364408

Over 1.3 quadrillion tokens a month across Google, so much progress : ) so much more to go! https://x.com/OfficialLoganK/status/1976359039581012127

GPT-5 and Gemini 2.5 Pro just achieved gold medal performance in the International Olympiad on Astronomy and Astrophysics (IOAA). AI is now world class at cutting-edge physics. https://x.com/deedydas/status/1977029236390285608

I don’t think people have updated enough on the capability gain in LLMs, which (despite being bad at math a year ago) now dominate hard STEM contests: The International Math Olympiad, the International Olympiad on Astronomy & Astrophysics, International Informatics Olympiad… https://x.com/emollick/status/1977460160197956089

🚨 🎬 Video Arena Disrupted! @Openai’s Sora 2 and Sora 2 Pro have landed on the Text-to-Video leaderboard. 🏆 Sora 2 Pro is the first to tie rank with Veo 3 variants for #1. 🥉 Sora 2 comes in at #3, pushing the non-audio variants of Veo 3 into 5th! Video models with audio https://x.com/arena/status/1978149396996051007

OpenAI’s Head of Sora @billpeeb says a stunning 70% of Sora’s nearly 2 million weekly active users are creating content. https://x.com/tbpn/status/1976759087456305191

This matches what the GDPval paper found. Experts should try using AI a couple of times on any task, and then resort to doing it themselves (with appropriate minor AI assistance) if they can’t get AI to work for them. You still save time overall, even when AI fails on some cases. https://x.com/emollick/status/1977874249214779558

Surfer 2 is here. 🏄🏄 Our new Cross-Platform Computer-Use Agent exceeds state-of-the-art on the 4 main benchmarks: WebVoyager, AndroidWorld, WebArena and OSWorld. 🖥️🌐📱 Find out more at https://x.com/hcompany_ai/status/1978935436111229098

Google’s Gemini 2.5 Native Audio Thinking is the new leading Speech to Speech model per our Artificial Analysis Big Bench Audio benchmark. The new model achieves a score of 92% on Big Bench Audio, the highest result recorded by Artificial Analysis to date. This not only places it https://x.com/ArtificialAnlys/status/1977720537519636756

Princeton’s HAL leaderboard has AssistantBench, and o3 beats GPT-5 med on it to obtain 38.8% accuracy! Models have definitely been getting much better at answering personal-assistant questions, but there’s still more room to grow. https://x.com/OfirPress/status/1978925179876020247

Defining and evaluating political bias in LLMs | OpenAI https://openai.com/index/defining-and-evaluating-political-bias-in-llms/

The 2025 BEHAVIOR Challenge, launched by @drfeifei’s team at Stanford, is a global competition to train robots in household tasks, testing reasoning, navigation, and manipulation in simulated homes. ⦿ 50 tasks, 1,000 activities (e.g., cooking, cleaning) ⦿ 10,000 expert demos https://x.com/TheHumanoidHub/status/1976355634510737626

We have a new state-of-the-art result on TheAgentCompany from Shanghai AI lab: MUSE + Gemini 2.5, solving 41.1% of the real-world inspired tasks. The new method is based on “learning on the job”, a memory-based method. https://x.com/gneubig/status/1978564697499574761

Readers responded with both surprise and agreement last week when I wrote that the single biggest predictor of how rapidly a team makes progress building an AI agent lay in their ability to drive a disciplined process for evals (measuring the system’s performance) and error… https://x.com/AndrewYNg/status/1978867684537438628

Ran Haiku 4.5 against my NYT Connections Eval with DSPy project and the results are in! – Baseline score of 64% – Optimized score of 71% – Complete in only 25 minutes – Total cost $11 This means Haiku 4.5 is the fastest model I’ve tested so far (ignoring Haiku 3.5 which did https://x.com/pdrmnvd/status/1978570006863790299

Salesforce AI Research introduces MCP-Universe: the first benchmark to truly test LLM agents in real-world scenarios with live Model Context Protocol servers. https://x.com/HuggingPapers/status/1959347736429674567

We tested Search Mode from Weaviate’s Query Agent on five popular Information Retrieval benchmarks — BEIR, LoTTE, EnronQA, WixQA, and BRIGHT! 📊 Of these benchmarks, we found the largest relative improvement from Search Mode over Hybrid Search on BRIGHT! ⚖️🚀 BRIGHT from https://x.com/CShorten30/status/1978107101936230745

GPQA Diamond and 𝜏²-Bench Telecom (an agentic benchmark requiring models to act in a customer service role) both show outsized performance for GPT-5 and o3 compared to GPT-4.1, but while the reasoning models cost over 10x as much to run on GPQA, in 𝜏²’s customer service environment they cost https://x.com/ArtificialAnlys/status/1978561356401111051

📣New paper: Rigorous AI agent evaluation is much harder than it seems. For the last year, we have been working on infrastructure for fair agent evaluations on challenging benchmarks. Today, we release a paper that condenses our insights from 20,000+ agent rollouts on 9 https://x.com/sayashk/status/1978565190057869344

Reasoning models are expensive to run with traditional benchmarks, but often get cheaper in agentic workflows as they get to answers in fewer turns. Through 2025 we’ve seen test-time compute drive up the cost of frontier intelligence, but with agentic workflows there’s a key https://x.com/ArtificialAnlys/status/1978561353792344302

Saw that DGX Spark vs Mac Mini M4 Pro benchmark plot making the rounds (looks like it came from @lmsysorg). Thought I’d share a few notes as someone who actually uses a Mac Mini M4 Pro and has been tempted by the DGX Spark. First of all, I really like the Mac Mini. It’s https://x.com/rasbt/status/1978608882156269755

7. Small models can punch above their weight. Their best 4B-parameter model, trained with this recipe (real data + diverse RL + GRPO-TCR), beats 14B–32B models on tough benchmarks like AIME25 and GPQA-Diamond. Smart data and tuning trump raw size. Really good paper for AI devs. https://x.com/omarsar0/status/1978112412743258361

The return of the physicists: “CMT-Benchmark: A benchmark for condensed matter theory built by expert researchers.” https://x.com/SuryaGanguli/status/1977740051108036817

Concern and excitement about AI around the world | Pew Research Center https://www.pewresearch.org/global/2025/10/15/concern-and-excitement-about-ai/

Introducing MAI-Image-1, debuting in the top 10 on LMArena | Microsoft AI https://microsoft.ai/news/introducing-mai-image-1-debuting-in-the-top-10-on-lmarena/

Tiny Recursion Model (TRM) results on ARC-AGI – ARC-AGI-1: 40%, $1.76/task – ARC-AGI-2: 6.2%, $2.10/task Thank you to @jm_alexia for contributing TRM, a well-written, open-source, and thorough piece of research, to the community, based on the HRM from @makingAGI https://x.com/arcprize/status/1978872651180577060

Meet our third @MicrosoftAI model: MAI-Image-1 #9 on LMArena, striking an impressive balance of generation speed and quality Excited to keep refining + climbing the leaderboard from here! We’re just getting started. https://x.com/mustafasuleyman/status/1977827977338716626

One of the most fun parts of OpenAI is watching people here level up so fast and do such excellent work. We are operating at a high level across many different disciplines and many of the people doing it have never done it before, and joined us at the beginning of their career. https://x.com/sama/status/1976799538523292027

AI is apparently already accelerating science. Measuring academic publications of authors: “we find that productivity among GenAI users rose by 15 percent in 2023 relative to non-users and further increased to 36 percent in 2024” and the quality of publications also went up. https://x.com/emollick/status/1977073589406122443

Imagine your LLM inference automatically getting faster in production (by up to 400%!) 🆕Enter: ATLAS, a not-so-traditional speculator that adapts to your workload as it evolves. The more you use it, the better it performs. https://x.com/togethercompute/status/1978210662095475097

BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution “we introduce BigCodeArena, an open human evaluation platform for code generation backed by a comprehensive and on-the-fly execution environment. Built on top of Chatbot Arena, BigCodeArena…” https://x.com/iScienceLuvr/status/1977694597603291492

GLM-4.6 is now live on BigCodeArena. Shout-out to @qinkai1028 and the whole @Zai_org team for this great model! https://x.com/terryyuezhuo/status/1978554496058851650

⬆️ LLMs’ forecasting abilities are steadily improving. GPT-4 (released March 2023) achieved a difficulty-adjusted Brier score of 0.131. Nearly two years later, GPT-4.5 (released Feb 2025) scored 0.101—a substantial improvement. A linear extrapolation of state-of-the-art LLM https://x.com/Research_FRI/status/1975909516777537614
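The trend arithmetic in that item is easy to reproduce. Below is a minimal sketch of the linear extrapolation; the release dates are approximate, and the superforecaster target of 0.085 is an assumed placeholder (the source tweet is truncated before giving one), so the crossover date is illustrative only.

```python
from datetime import date

def extrapolate_crossover(points, target):
    """Fit a line through two (date, score) points and return the date
    at which the extrapolated score reaches `target` (lower is better)."""
    (d0, s0), (d1, s1) = points
    t0, t1 = d0.toordinal(), d1.toordinal()
    slope = (s1 - s0) / (t1 - t0)          # Brier change per day (negative = improving)
    days_needed = (target - s1) / slope    # days past the newer data point
    return date.fromordinal(round(t1 + days_needed))

# Difficulty-adjusted Brier scores from the tweet; dates are approximate releases.
points = [(date(2023, 3, 14), 0.131),   # GPT-4
          (date(2025, 2, 27), 0.101)]   # GPT-4.5
target = 0.085                          # ASSUMED superforecaster level, for illustration

crossover = extrapolate_crossover(points, target)
print(crossover)  # -> 2026-03-16, i.e. roughly a year out under these assumptions
```

Under that assumed target, the linear trend crosses superforecaster level about a year after GPT-4.5, which is consistent with the "one year away" claim in the earlier forecasting-benchmark item.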

The Turing Test for video … 😅 https://x.com/demishassabis/status/1978644313824534954

Discover more from Ethan B. Holland
