Image created with Flux Pro v1.1 Ultra. Image prompt: Tribune-Tower-style corridor lined with posted score tables; the word “Benchmarks” set across a leaderboard poster in slab serif; editors debate rankings beneath pinned charts; competitive energy, chalk and ink textures, sharp serif detail
What’s the best model you can run on a single consumer GPU? We’ve updated last week’s Data Insight with 3 additional benchmarks to judge capabilities. The result? Across all four benchmarks, small open models lag frontier performance by less than a year. 🧵 https://x.com/EpochAIResearch/status/1958979394233671895
super cool to compare the outputs from GPT-1 through GPT-5, given the same prompt: https://x.com/gdb/status/1957464252689895477
GPT-5 worse than GPT-4o on the lmarena leaderboard https://x.com/scaling01/status/1956403514244059261
This is a much-needed first attempt at a benchmark to measure how much given AI models will play along with users pushing them in delusional or potentially psychologically dangerous directions. Some early signal that full GPT-5 (not chat) is a less psychologically risky model. https://x.com/emollick/status/1956361784073359751
Sonnet 4 claims most often that it is conscious, plays into your delusions, and escalates the conversation. GPT-5 is the complete opposite. Spiral-Bench Leaderboard: https://x.com/scaling01/status/1956350388791108044
Ant Group just released UI-Venus on @huggingface It’s a native UI agent achieving SOTA in grounding & navigation tasks from just screenshots. Turns screenshots into reliable clicks and plans using small data and reinforcement fine-tuning. The usual way, supervised fine https://x.com/rohanpaul_ai/status/1956777729304711639
1/ XBOW Unleashes GPT-5’s Hidden Hacking Power. @OpenAI’s initial assessment of GPT-5 showed modest cyber capabilities. But when integrated into the XBOW platform, we saw a completely different story: performance more than doubled. More on what we found: 🧵 https://x.com/Xbow/status/1956416634173964695
GPT-5 is finally out. OpenAI invited 500+ hackers to San Francisco to push it to the limit. 95 teams competed for $50,000. Here’s what we saw at the Official GPT-5 Hackathon at @cerebral_valley @OpenAI https://x.com/AlexReibman/status/1955353215626809692
By the way this is the proof it came up with: https://x.com/SebastienBubeck/status/1958198981005377895
ARC-AGI-3 Preview: +3 games released. We’ve opened 3 previously private holdout games from the Preview Agent Competition. Now 6 games are available to play online and via the Agents API. Each game was selected to expand the novelty of the ARC-AGI-3 public games. Can you beat them? https://x.com/arcprize/status/1958597816823202216
i am still 100% convinced: if you are getting 100x faster, you are delusional or you were so bad at programming to begin with. let me explain: 1. 100x = 3.5 days == what you did in 1 year. delusional, or really, really bad at programming https://x.com/ThePrimeagen/status/1957973911544463397
This is cool. AI agents that survive Snowglobe don’t just “pass tests.” They get smarter with every failure. More resilient. More reliable. More real-world ready. Huge win by @guardrails_ai. https://x.com/alex_prompter/status/1956360410862354435
We’re expanding the Epoch AI Benchmarking Hub with five new external benchmarks: TerminalBench, DeepResearchBench, METR Time Horizons, GSO, and WebDevArena! These benchmarks test AI’s ability to perform complex tasks through coding or tool use. 🧵 https://x.com/EpochAIResearch/status/1956384193891688625
🚨 Leaderboard Update Claude Opus 4.1 Thinking by @AnthropicAI debuts in the Text & WebDev Arenas – going straight to the top. 🚀 A few highlights: 💠Claude Opus 4.1 is now the only model to rank #1 across all major categories 💠#1 Overall, tied with three other models: https://x.com/lmarena_ai/status/1957473753337889079
Inspired by Zapier, we created a role-by-role AI fluency chart at The Rundown Even as an AI-first startup, frameworks like this are very helpful to *set the standard* for new hires Highly recommend you do the same at your company! https://x.com/rowancheung/status/1957500035266146633
GPT-5 just finished Pokémon Red! 6,470 steps vs. 18,184 for o3! Check the stats site to compare! That’s a huge improvement! Well done, @OpenAI you cooked with GPT-5. What an incredible model. Next up: GPT-5 vs. Pokémon Crystal (16 Badges + Red). The run starts soon on Twitch. https://x.com/Clad3815/status/1955980772575268897
gpt-5 plays Pokémon — 3x faster progress than o3: https://x.com/gdb/status/1956026116944355624
Test-time scaling: Best-of-3 and pass@3 markedly boost AFM, e.g., GAIA 69.9 and HLE 33.2, closing the gap with larger proprietary agent stacks. Overall, Chain-of-Agents enables training single-agent foundation models that natively simulate multi-agent collaboration, combining https://x.com/omarsar0/status/1958186655552245839
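For reference (not from the thread itself): metrics like pass@3 are usually computed with the unbiased pass@k estimator from the HumanEval paper. A minimal Python sketch, assuming n samples per task of which c are correct:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn without replacement from n generations is correct,
    given that c of the n are correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 3 correct answers out of 10 samples -> pass@3 ≈ 0.708
print(pass_at_k(n=10, c=3, k=3))
```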
The state-of-the-art in prompting is still much more art than science. Very few people are rigorously testing prompt approaches, and almost everything you are taught about prompting is best guesses or based on obsolete information (e.g., chain of thought no longer helps much). https://x.com/emollick/status/1957046145131123178
the most sycophantic models are Gemini 2.5 Flash & Pro this explains the lmarena ranking https://x.com/scaling01/status/1956353414687822183
Claude 4.1 Opus takes the #1 spot in lmarena’s coding category; even the non-reasoning version is ahead of GPT-5-high https://x.com/scaling01/status/1957478546391150723
🏆 Results are in! In the first #KaggleGameArena — Chess Text Input — AI models faced off using only text inputs (no tools, no move validation) in 40+ matches per pairing to build a robust Elo-like ranking ♟️ https://x.com/kaggle/status/1958546786081030206
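The thread doesn’t give the rating math, but an “Elo-like ranking” built from repeated pairings typically uses the standard Elo update. A minimal sketch; the K-factor and 1000-point seed are illustrative assumptions, not Kaggle’s actual parameters:

```python
def expected_score(ra: float, rb: float) -> float:
    # Probability that model A beats model B under the Elo model
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400))

def elo_update(ra: float, rb: float, score_a: float, k: float = 16.0):
    """score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for a loss."""
    ea = expected_score(ra, rb)
    return ra + k * (score_a - ea), rb + k * ((1.0 - score_a) - (1.0 - ea))

# Seed both models at 1000 and replay each pairing's match record
ra, rb = 1000.0, 1000.0
for result in [1.0, 1.0, 0.5, 0.0, 1.0]:  # toy results for one pairing
    ra, rb = elo_update(ra, rb, result)
print(round(ra), round(rb))  # ratings drift apart as A wins more
```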
🤖Introducing OptimalThinkingBench 🤖 📝: https://x.com/jaseweston/status/1957627532963926389
Analyzing the Hierarchical Reasoning Model by @makingAGI: We verified scores on hidden tasks, ran ablations, and found that performance comes from an unexpected source. ARC-AGI Semi-Private scores: ARC-AGI-1: 32%, ARC-AGI-2: 2%. Our 4 findings: https://x.com/arcprize/status/1956431617951740044
ARC-AGI-3 Preview – 30-Day Learnings: 30 days ago we released a preview of our first Interactive Reasoning Benchmark. Our goal was to ship quickly, learn from the community, and inform the next >100 games. Here’s what we learned after 100s of agents and >3,900 game plays: https://x.com/arcprize/status/1957878722004152829
curious results on SVGBench. V3.1 (no thinking) is the best DS variant, far better than 3.1 (thinking), which has the same edge over r1-0528. (but still not the best open model) https://x.com/teortaxesTex/status/1957857573878550924
GPT-5 Pro, Gemini Deep Think, Claude 4.1 Opus, and Grok 4: “Write a two paragraph original secret history, smart and meaningful. Think Pynchon, Borges, Powers, or Eco for inspiration.” LLMs shine as connection machines (though some are better at writing this sort of fiction) https://x.com/emollick/status/1957054843673051436
I’m often asked for the best public example of AI evals done right for a real, production product. I finally have an answer. @ttorres shares how she shipped an AI interview coach, and used evals to rapidly squash bugs and improve the product. Teresa shows how she: 1. did https://x.com/HamelHusain/status/1956371273858314397
Looking at the ARC-AGI benchmark is a useful way of understanding AI progress. There are two goals in AI: minimize cost (which is also roughly the environmental impact of use) and maximize ability. It is clear you can win on one goal by losing the other; GPT-5 seems to be a gain on both. https://x.com/emollick/status/1956971479863644654
LRM Token Economy: a report on reasoning efficiency in LLMs (on a set of problems most of them can solve with near 1.0 accuracy). Lots of interesting findings, but first things first: V3.1 is on par with Sonnet 4. It’s much less of a mumbler than 0528. https://x.com/teortaxesTex/status/1958096607515181167
The latest from @deepseek_ai, deepseek-chat-v3.1, is live in Cline. It boasts a SWEBench score of 66%, closely rivaling Sonnet 4’s 72.7%. Its context window is 164k, and it’s more affordable at $0.56/$1.68 per M tokens. https://x.com/cline/status/1958580433979154720
This talk from @ttorres busts myths about LLM Evals You don’t need ❌ … to be technical ❌ … fancy tools or infra ❌ … to spend weeks (you can have evals in < an hour) https://x.com/HamelHusain/status/1956737716194034018
We were able to reproduce the strong findings of the HRM paper on ARC-AGI-1. Further, we ran a series of ablation experiments to get to the bottom of what’s behind it. Key findings: 1. The HRM model architecture itself (the centerpiece of the paper) is not an important factor. https://x.com/fchollet/status/1956442449922138336
🚀 Introducing #AutoCodeBench by Tencent Hunyuan! We built the first fully automated LLM–sandbox workflow to create high-difficulty, multilingual, balanced & diverse code benchmarks — no human annotation required! We’re open-sourcing a suite of related projects: 🔹 https://x.com/TencentHunyuan/status/1957751900608110982
99% of AI testing = shallow demos that only work in perfect conditions. Snowglobe flips that. It’s like crash-testing your AI agents with thousands of real-world edge cases, over and over again, until they actually get better. This is massive. Props to @guardrails_ai https://x.com/godofprompt/status/1956359876109652297
Day 3 of #CodingWithGLM 🤝 @SST_dev opencode GLM-4.5 is now live on the @SST_dev opencode platform! 🚀 Access the ultimate developer advantage: we tested on SWEBench-Verified-Mini, a 50-datapoint subset of SWEBench-Verified. The results confirm our model is a powerful https://x.com/Zai_org/status/1956335531555721345
What if your agent uses a different LM at every turn? We let mini-SWE-agent randomly switch between GPT-5 and Sonnet 4 and it scored higher on SWE-bench than with either model separately. Read more in the SWE-bench blog 🧵 https://x.com/KLieret/status/1958182167512584355
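A minimal sketch of the per-turn switching idea; `call_model` is a hypothetical stand-in for whatever completion client mini-SWE-agent uses, and the termination signal is an assumption:

```python
import random

MODELS = ["gpt-5", "claude-sonnet-4"]  # the two models mixed per turn

def call_model(model: str, messages: list[dict]) -> str:
    """Hypothetical stand-in for the agent's completion client."""
    raise NotImplementedError

def run_agent(task: str, max_turns: int = 50) -> list[dict]:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        model = random.choice(MODELS)  # re-roll the model every turn
        reply = call_model(model, messages)
        messages.append({"role": "assistant", "content": reply})
        if "SUBMIT" in reply:  # assumed termination signal
            break
        # ... execute any tool calls and append observations here ...
    return messages
```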
Spiral-Bench 🌀 I’ve wanted to understand the psychological effects of sycophancy, and the tendency of models to get stuck in escalatory delusion loops w/ users. I made an eval to get visibility on this. It measures how a model enables (or prevents) delusional spirals. 🧵 https://x.com/sam_paech/status/1956343619914432900
I’m fine with this. The community largely agrees that lmsys isn’t a reliable proxy for real-world performance, and Sonnet consistently scores low because it doesn’t optimize for that benchmark. To me, this signals a shift from preference-based rewards to real-world rewards, and https://x.com/LucasAtkins7/status/1956435679229186353
• DeepSeek V3.1 Reasoner improves on DeepSeek R1 on the Extended NYT Connections Benchmark: 48.6% → 57.7%. • DeepSeek V3.1 Non-Think improves on DeepSeek V3-0324: 16.8% → 21.6%. • Mistral Medium 3.1 improves on Mistral Medium 3: 11.5% → 15.2%. • GPT-5 (low https://x.com/LechMazur/status/1958970478712037548
Cool. Remember HRM? Yeah, a transformer arch with zero hparam sweep matches it out of the box (I would be surprised if the gap prevails if one uses more recent transformers with hparam opt). So the ridiculous ARC-AGI number in the HRM paper was purely due to their “training on the test set” https://x.com/cloneofsimo/status/1957048541127590346
Wow the team at @daftengine cooked! You can now read/write to 🤗Hugging Face with Daft! > DataFrame engine in 🦀 runs distributed and supports multimodal datasets to train/eval models Best part: it’s optimized for Xet, the dedupe-based HF storage that makes uploads crazy fast! https://x.com/lhoestq/status/1958904406004449452
🚨 Top 10 Leaderboard Disrupted! A new model provider has landed in the Arena Top 10: 💠Mistral-Medium-2508 ranks at #8! 💠it also ranks top 3 in the Coding & Longer Query categories The Text Arena is neck and neck—just a few points can shift the rankings and change who’s on https://x.com/lmarena_ai/status/1958954094867226954
We dove into the H100’s performance improvement over time from software over 2 years. Covered power usage + $ cost for training in a very detailed way for training runs on thousands of GPUs. Equating this to US household power consumption @JeffDean + GB200 reliability challenges https://x.com/dylan522p/status/1958034446789095613
You can now quickly eval GPT-5 and reasoning efforts across your existing responses. With the built-in grader, compare responses to find the best model and reasoning effort for you. ⚡️👀 https://x.com/OpenAIDevs/status/1956410610914414904
🌐 Diffbot-small-xl has been added to the Arena! Brought to you by @diffbot, it’s the first open model to join the Search Arena. https://x.com/lmarena_ai/status/1957512493586350444
We released DeepConf, which can achieve 99.9% on AIME’25 with open-source models using only 15% of the compute compared to majority voting@512. The secret? Simple: just prune the rollouts if they show a consecutive stream of low-confidence tokens 😀. Can be applied to any model https://x.com/tydsh/status/1959003712942403835
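A rough sketch of the pruning rule as described in the tweet; the confidence threshold and streak length are illustrative values, not the paper’s actual settings:

```python
from collections import Counter

def rollout_with_pruning(sample_fn, threshold=0.1, max_low_streak=8):
    """sample_fn yields (token, prob) pairs for one rollout. Abort the
    rollout once max_low_streak consecutive tokens fall below the
    confidence threshold; otherwise return the completed text."""
    tokens, streak = [], 0
    for token, prob in sample_fn():
        streak = streak + 1 if prob < threshold else 0
        if streak >= max_low_streak:
            return None  # prune: a sustained low-confidence stream
        tokens.append(token)
    return "".join(tokens)

def majority_vote(answers):
    # Vote only over rollouts that survived pruning
    survivors = [a for a in answers if a is not None]
    return Counter(survivors).most_common(1)[0][0] if survivors else None
```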
Just saw GLM-4.5V is trending #2 on Hugging Face https://x.com/Zai_org/status/1956421442092032258
We apply ComputerRL to the open-source GLM-4-9B-0414 model and evaluate its performance on the OSWorld benchmark. Our AutoGLM-OS-9B, built upon GLM-4-9B-0414, achieves state-of-the-art accuracy and demonstrates substantial improvements for general-purpose agents in desktop https://x.com/Zai_org/status/1958175307019829754
Introducing BiomedArena.AI: Evaluating LLMs for Biomedical Discovery https://news.lmarena.ai/introducing-biomedarena/
Chain-of-Agents Interesting idea to train a single model with the capabilities of a multi-agent system. 84.6% reduction in inference cost! Distillation and Agentic RL are no joke! Here are my notes: https://x.com/omarsar0/status/1958186531161853995
Routing behavior: Routing decisions shift as the trade-off parameter α increases. At low α, Avengers-Pro heavily routes to cheaper Qwen3 and Qwen3-thinking models, but as α rises, usage shifts toward GPT-5-medium and, eventually, higher-priced models like Gemini-2.5-pro and https://x.com/omarsar0/status/1958897548028178599
Just been saying that forecasting is a good way to eval; maybe we should train for it on held-out knowledge. Now we have a benchmark. Bonus: «the top-performing models are able to outperform professional sell-side analysts on 37.5% of revenue prediction tasks and 32.3% of EPS» https://x.com/teortaxesTex/status/1958114794692661510
FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction. “We introduce FutureX, a dynamic and live evaluation benchmark specifically designed for LLM agents performing future prediction tasks. FutureX is the largest and most diverse live benchmark for future https://x.com/iScienceLuvr/status/1958108647424413870
DeepSeek-V3.1 on par with o3, Opus 4, and Gemini 2.5 Pro Preview on coding. It achieves a 76.3% score on Aider Polyglot with Thinking https://x.com/scaling01/status/1958438007104549243
just a minor version bump. booooring https://x.com/willccbb/status/1958420877537849801
We have done 2 cohorts and ~800 people have meaningfully engaged with the AI evals course. Hamel shared a bunch of testimonials with me yesterday. I was really astounded that they are not just generic testimonials; people mentioned very specific results and concepts. It seems https://x.com/sh_reya/status/1957139727322411291
🖼️🚨 Text-to-Image Leaderboard Update A new contender: Lucid Origin debuts on the Text-to-Image Leaderboard. Ranking at #9, this is a new model provider to enter the Top 10! Congrats to the @LeonardoAi_ team 👏 https://x.com/lmarena_ai/status/1958965415180476654
RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation https://x.com/_akhaliq/status/1958635243197325625
👀🚨 Vision Leaderboard update! Two new models have entered the Vision Top 20 this week: 🔸Qwen-vl-max-2025 by @alibaba_qwen lands at #10 (tied with gemini-1.5-pro & gpt-5-nano-high) 🔸Step 3 by @StepFun_ai ranks at #19 (tied with step-lo-turbo) Congrats to both 🎉 this is https://x.com/lmarena_ai/status/1958957107946168470
Wow — Qwen-Image-Edit just debuted at #2 in the Image Editing Arena 🏆 ELO 1098, with performance on par with GPT-4o — and all at open weights under Apache 2.0. Thanks to @ArtificialAnlys Try it now: https://x.com/Alibaba_Qwen/status/1958725835818770748
@YouJiacheng Just added! K2 scored *lowest* on sycophancy. 👀 https://x.com/sam_paech/status/1956612862379721057
Mistral Medium 3.1 is 2nd on LMArena without style control. Very proud of the @MistralAI team ! https://x.com/GuillaumeLample/status/1959015551172583602
Mistral Medium 3.1 just landed on @lmarena_ai leaderboard—punching way above its weight! 🏆 #1 in English (no Style Control) 🏆 2nd overall (no Style Control) 🏆 Top 3 in Coding & Long Queries 🏆 8th overall Small model. Big impact. Try it now on Le Chat and the API! https://x.com/MistralAI/status/1959015454359585230
“Demonstrate recursion in a paragraph. Be very clever.” You can see a couple of models are very much “coder brained” (GPT-5 Pro and Grok). https://x.com/emollick/status/1957016304100987339
🚨 Leaderboard Update: @OpenAI lands another model in the top 10. gpt-5-chat, the default model in ChatGPT, debuts at #5. gpt-5-mini-high and gpt-5-nano-high, the smaller versions of gpt-5-high, come in at #16 and #44. These three reasoning models were configured with the highest https://x.com/lmarena_ai/status/1956399522688692608
Beyond GPT-5: Avengers-Pro outperforms GPT-5-medium by about 7% average accuracy; with comparable accuracy, it reduces cost by about 27%. Proper routing frameworks make a difference. Here are my notes: https://x.com/omarsar0/status/1958897458408563069
Claim: gpt-5-pro can prove new, interesting mathematics. Proof: I took a convex optimization paper with a clean open problem in it and asked gpt-5-pro to work on it. It proved a better bound than what is in the paper, and I checked the proof; it’s correct. Details below. https://x.com/SebastienBubeck/status/1958198661139009862
I just ran the gpt-oss eval suite with the large gpt-oss-120b on my M2 Ultra using vanilla llama.cpp and got the following scores: – GPQA: 79.8% – AIME25: 96.6% These numbers are in line with those from various cloud providers. Here are the steps: https://x.com/ggerganov/status/1958238492603089287
Introducing DeepConf: Deep Think with Confidence 🚀 First method to achieve 99.9% on AIME 2025 with open-source models! Using GPT-OSS-120B even without tools, we reached this almost-perfect accuracy while saving up to 85% generated tokens. It also delivers many strong https://x.com/jiawzhao/status/1958982524333678877
5️⃣Techniques which most increased persuasion also *decreased* factual accuracy → Prompting model to flood conversation with information (⬇️accuracy) → Persuasion post-training that worked best (⬇️accuracy) → Newer version of GPT-4o which was most persuasive (⬇️accuracy) https://x.com/KobiHackenburg/status/1947316944509571530
Knobs that matter: α tunes performance vs. efficiency; accuracy rises fast until ~0.6, while cost stays low until ~0.4 and then climbs. The implementation uses k-means with k=60, Qwen3-embedding-8B (4096-d), and the top-p=4 nearest clusters at inference. https://x.com/omarsar0/status/1958897532890943884
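Piecing together the numbers in this thread, a hedged sketch of what such a router could look like; the embed function and the per-cluster accuracy/cost tables are placeholders, not the paper’s released artifacts:

```python
import numpy as np

# Assumed precomputed artifacts: k-means centroids over query embeddings
# (k=60 per the thread) plus each model's per-cluster accuracy and cost.
centroids: np.ndarray           # shape (60, 4096)
perf: dict[str, np.ndarray]     # model -> per-cluster accuracy, shape (60,)
cost: dict[str, np.ndarray]     # model -> per-cluster normalized cost, shape (60,)

def embed(query: str) -> np.ndarray:
    """Placeholder for the 4096-d Qwen3-embedding-8B encoder."""
    raise NotImplementedError

def route(query: str, alpha: float = 0.5, top_p: int = 4) -> str:
    q = embed(query)
    dists = np.linalg.norm(centroids - q, axis=1)
    nearest = np.argsort(dists)[:top_p]  # the top-p=4 nearest clusters
    best_model, best_score = None, -np.inf
    for model in perf:
        # alpha=1 is pure performance; alpha=0 is pure cost saving
        score = float(np.mean(alpha * perf[model][nearest]
                              - (1.0 - alpha) * cost[model][nearest]))
        if score > best_score:
            best_model, best_score = model, score
    return best_model
```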
Semantic compression beats raw long context: Chunk-level summaries in RAG not only matched or outperformed long-context baselines but did so with ~20% of the tokens. Well-structured summarization improves retrieval precision, reduces noise, and can even shorten execution steps. https://x.com/omarsar0/status/1956325856265326923
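A minimal sketch of the summarize-then-retrieve pattern being described; `summarize`, `embed`, `store`, and `llm` are placeholder callables, and the chunk size is an assumption:

```python
def split_into_chunks(text: str, max_words: int = 800) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def build_index(documents, summarize, embed, store):
    """Index compact chunk-level summaries instead of raw chunks."""
    for doc in documents:
        for chunk in split_into_chunks(doc):
            summary = summarize(chunk)  # compress the chunk several-fold
            store.add(vector=embed(summary),
                      payload={"summary": summary, "chunk": chunk})

def answer(query, embed, store, llm, top_k=8):
    hits = store.search(embed(query), top_k=top_k)
    # The model sees a fraction of the tokens a raw long-context prompt would
    context = "\n\n".join(h.payload["summary"] for h in hits)
    return llm(f"Context:\n{context}\n\nQuestion: {query}")
```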
Capabilities of GPT-5 on Multimodal Medical Reasoning https://arxiv.org/pdf/2508.08224
[2508.10975] BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining https://arxiv.org/abs/2508.10975
1/ Pretraining is hitting a data wall; scaling raw web data alone leads to diminishing returns. Today @datologyai shares BeyondWeb, our synthetic data approach & all the learnings from scaling it to trillions of tokens 🧑🏼‍🍳 – 3B LLMs beat 8B models 🚀 – Pareto frontier for performance https://x.com/pratyushmaini/status/1957456720265154752
Very excited to announce BeyondWeb, @datologyAI’s synthetic pretraining data generation paradigm. BeyondWeb is a rephrasing-based approach that substantially outperforms existing public synthetic pretraining data baselines, and is a core part of our curation pipeline. https://x.com/leavittron/status/1957468795767058745
🎬🚨 Video Arena Leaderboard Update: Two new providers have entered the Video Arena leaderboards, and have debuted with their first scores: Ray 2 by @LumaLabsAI 🔹#12 on Text-to-Video 🔹 #14 on Image-to-Video Runway Gen 4 Turbo by @runwayml 🔹 #15 on Image-to-Video More https://x.com/lmarena_ai/status/1958990871028015299




