Benchmarks: AI News Week Ending 09/05/2025

Benchmarks: AI News Week Ending 09/05/2025

September 5, 2025

Image created with Flux Pro v1.1 Ultra. Image prompt: Benchmarks, tidy bar chart where each bar is a vertical stack of small bananas at different heights, axis implied, photorealistic, editorial, minimal, high detail, 3:2 landscape

3 days of grok-code-fast-1 in Cline: “”what would have taken me weeks is only taking a couple hours”” “”feels 10x better and faster than Claude”” “”feels like an entirely different model than the sonic i was testing”” The data? >level with Sonnet-4 in diff edits, and improving https://x.com/cline/status/1961488289803939915

Grok Code Fast 1 is versatile across the full stack and is particularly strong at TypeScript, Python, Java, Rust, C++, and Go. Using Grok Code Fast 1, @DannyLimanseta built the following game in a day. https://x.com/xai/status/1961129796349423944

Grok Code Fast from @xai scored 90% on Roo Code evals — top-tier performance at half the cost of its peers. ⚡️ Free to try in Roo Code Cloud until Sept 10. See why speed + savings make @grok a strong new addition: https://x.com/roo_code/status/1962571908224110673

Grok Code just hit #1 on the OpenRouter leaderboard, beating Claude Sonnet https://x.com/elonmusk/status/1961677739762790630

Grok Code lead increased to 60% higher usage than Claude Sonnet https://x.com/elonmusk/status/1962265197462110473

grok-code-fast-1 has good vibes. prob makes the best tradeoff on the speed / intelligence curve right now. gpt-5 is too spiky, sometimes it’s surprisingly good sometimes it overthinks something way too much. you end up spending too much time waiting for some pedantic output.”” / X https://x.com/dzhng/status/1961905091960791194

Humbling to see Grok-Code-Fast-1 smash daily token records. The community response has been so incredible that we’re extending our free promo until September 10th. 🧵 Here’s how to get set up in your favorite code editors:”” / X https://x.com/veggie_eric/status/1961877264599306573

I tried out @cline + @xai grok-code-fast-1 to assist me with my effort to port a large project (tinygrad) from python to c. So far, I’d been using combination of Claude Code + Claude 4.1 Sonnet/Opus and @roo_code + GPT5 medium for this, with success (though with a lot of hand”” / X https://x.com/QuixiAI/status/1962600301309108304

interesting trend from the @xAI team that we haven’t seen from other frontier model labs this is the second round of free access to @grok models they’ve provisioned to Cline users in exchange for rich @cline usage data why is cline data so valuable? it’s a heavyweight workout https://x.com/nickbaumann_/status/1961539461860487664

Some great quotes about Grok Code Fast 1 from our friends at Cline and opencode 🩵 https://x.com/veggie_eric/status/1961474457295622515

The improvement from `sonic` to `grok-code-fast-1` has been notable according to Cline users”” / X https://x.com/cline/status/1962628786366881795

We can now say pretty definitively that AI progress is well ahead of expectations from a few years ago. In 2022, the Forecasting Research Institute had super forecasters & experts to predict AI progress. They gave a 2.3% & 8.6% probability of an AI Math Olympiad gold by 2025… https://x.com/emollick/status/1962859757674344823

Our new native image generation and editing is state-of-the-art, and ranked #1 in the world. And we’re rolling it out for free to everyone today. You’ve got the tools. Now go bananas. Ideas & inspiration in the 🧵below. https://x.com/GeminiApp/status/1960342037536108930

“To test models’ performance on Claude Code, we ran GLM-4.5 against Claude Sonnet 4 and other open-source models on 52 practical programming tasks. While GLM-4.5 demonstrated strong performance against top open-source models, it secured a 40.4% win rate against Claude Sonnet 4. https://x.com/Zai_org/status/1962522761630482700

🚀 Introducing slime v0.1.0 — An open-source RL infra powering models like GLM-4.5, built by THUDM & Zhipu AI. @Zai_org RL infra 朱小霖 shared a deep dive on Zhihu into how they redefined high-performance RL infra👇 🛠️ What’s new in v0.1.0? • High-performance inference for https://x.com/ZhihuFrontier/status/1962751555591086226

Announcing GLM Coding Plan for Claude Code! After seeing the amazing adoption of GLM-4.5 over the past month, we’re making it more accessible. Get started: https://x.com/Zai_org/status/1962522757536887205

Have been tinkering with GLM 4.5 for about an hour. It is about 3x faster than Claude Code + Opus 4.1 and 5x faster than GPT-5-high, but still feels just as good as closed-source models. I am definitely more productive than with other models due to GLM-4.5’s speed.”” / X https://x.com/Tim_Dettmers/status/1962603940291260533

The funny thing about the prediction that AI would be writing 90% of all code by now is that the prediction’s failure distracts from the fact that AI adoption in code writing is actually extremely high, it was over 30% in December, 2024 according to one measure, with large impact https://x.com/emollick/status/1963262680271094229

🚨 We’ve just published a recipe to train a frontier-level deep research agent using RL. With just 30 hours on an H200, any developer can now beat Sonnet-4 on DeepResearch Bench using open-source tools. (Thread 🧵) https://x.com/corbtt/status/1962954306078048297

A 14B model just beat a 671B model on math reasoning. Here’s how Microsoft’s rStar2-Agent achieves frontier math performance in 1 week of RL training
https://x.com/FrankYouChill/status/1962180218053144655

The fact that junior hiring in AI intensive fields has slowed down somewhat in the US seems pretty solid. The evidence linking it to AI is not yet established, we have seen a couple solid attempts that suggest a connection, but it is really hard to tell for sure, given the data.”” / X https://x.com/emollick/status/1962549832364486957

Has LLM progress slowed? Initial reactions to GPT-5 were mixed: to many, it did not seem as dramatic an advance as GPT-4. Benchmarks may help clarify the picture: GPT-5 is both an incremental release following many other OpenAI advances, and a major leap from GPT-4. https://x.com/EpochAIResearch/status/1961524635398529209

Evaluate Your AI Agents Like a Pro! 🔥 Agno’s Simple Agent Evals are unit tests for your Agents. You can use them to measure the accuracy, performance, and reliability. The best part is that they are easy to use and powerful. 100% Open-Source Link to code examples in the https://x.com/tinztwins/status/1962197412077842846

Everyone claims SOTA for Computer Use Agents (CUAs), but there’s no way to ensure reproducible results. We’re publicly releasing our OSWorld Verified leaderboard, starting with CUA models from OpenAI and Anthropic. We will include more evals and models soon. https://x.com/hud_evals/status/1963321238056796573

Improving the reliability of multi-agent systems is extremely hard. And it’s a must when deploying AI agents. Galileo offers one of the most comprehensive agent eval solutions I’ve seen. Here is how they help devs and huge companies deploy reliable AI agents: https://x.com/omarsar0/status/1962880974104014948

Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure (update) pretty interesting method and ranking. 2.5 Flash > 2.5 Pro suggests it’s not so much about model capacity, but still. V3.1-NS is far above R1-0528. OpenAI crushingly dominant. https://x.com/teortaxesTex/status/1961298849047117832

Which LM is better at agentic coding? We have a bunch of useful academic benchmarks like SWE-Bench, but we don’t have a good comparison of agentic coding LMs *in the wild*. To solve this, we released PR Arena: https://x.com/gneubig/status/1963267468853477809

Who is inducing failure in LLM Agentic Systems? This is a cool idea to diagnose errors in multi-agent interactions. AgenTracer-8B outperforms giant proprietary LLMs like Gemini-2.5-Pro and Claude-4-Sonnet by up to 18.18%. https://x.com/omarsar0/status/1963618829680218254

Today we’re launching Atla — the improvement engine for AI agents. Atla helps agent builders find and fix recurring failures. Instead of just surfacing traces, Atla automatically identifies your agent’s most critical failure patterns and suggests targeted fixes. https://x.com/Atla_AI/status/1963586200305836264

Claude Code: no evals [well known code agent company]: no evals [well known code agent company 2]: kinda halfassed evals [leading vibe coding company]: no evals [ceo of company selling you evals]: mmmmm yess all my top customers do evals, you should do evals [vc’s in love https://x.com/swyx/status/1963725773355057249

We estimate that Claude Opus 4.1 has a 50%-time-horizon of around 1 hr 45 min (95% confidence interval of 50 to 195 minutes) on our agentic multi-step software engineering tasks. This estimate is lower than the current highest time-horizon point estimate of around 2 hr 15 min. https://x.com/METR_Evals/status/1961527692072993272

How can we benchmark Agents in realistic, complex environments? MCP-Universe is a new benchmark using Model Context Protocol (MCP) servers to test Agents on 231 challenging, practical tasks. Benchmark: 1️⃣ Tasks from 6 practical domains, Location Navigation, Repository https://x.com/_philschmid/status/1962935890415599650

MCP-Bench Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers https://x.com/_akhaliq/status/1961456699564294651

年初想提升 tool-calling 时特别缺靠谱的benchmark，以为 “”mcp 火了等几天肯定有开源的mcp-bench可用””，结果等了几个月也没等到，但是这最近怎么每周都有好几个 mcp-bench release出来？”” / X https://x.com/bigeagle_xd/status/1961461441799852128

AHELM: A Holistic Evaluation of Audio-Language Models “”we introduce AHELM, a benchmark that aggregates various datasets — including 2 new synthetic audio-text datasets called PARADE, which evaluates the ALMs on avoiding stereotypes, and CoRe-Bench, which measures reasoning over https://x.com/iScienceLuvr/status/1962799344001917360

@jasonth0 I love doing this actually :). I think it’s a pretty powerful eval too. Have all models generate something, then put it all together and give it back to all of them and ask them to rank all outputs. I thought models might have a bias to prefer their own outputs, but this doesn’t”” / X https://x.com/karpathy/status/1964026120191545346

Final disclaimer: TAU Bench uses gpt-4.1 as the customer agent, but this costs ~$15 / run and is super expensive for ablations. Instead, I’ve switched the customer agent to be Qwen3-30B-A3B, which is the closest open model to gpt-4.1 on the LMArena that fits on 1 node. This”” / X https://x.com/_lewtun/status/1962884904649146725

For additional details see our evaluation pages and methodology for these new inclusions in the Artificial Analysis Intelligence Index: https://x.com/ArtificialAnlys/status/1962881327151431773

tbh, not on par yet, more improvement on swe-bench is quite difficult, we’re still working very hard on it.”” / X https://x.com/bigeagle_xd/status/1963808306545180792

The immense scale of AI use, combined with AI being a General Purpose Technology that can be used for many things, means that any story of a successful (or failed) AI use case is an extremely narrow view into a vast sea of successes and failures. Don’t over-index on one story.”” / X https://x.com/emollick/status/1963249364064665858

TL;DR on GSO: A very challenging benchmark for evaluating models’ capabilities in developing high-performance software. It tasks models to optimize large codebases. :)”” / X https://x.com/crystalsssup/status/1963087272506753419

We are barely scratching the surface on evals. A significant portion of knowledge worker tasks are not captured in today’s most popular benchmarks. While relevant capabilities can often be extrapolated from existing coding and math evals, these don’t fully represent the”” / X https://x.com/levie/status/1963802448780472529

often researcher’s ability to iterate on a capability is limited by our ability to measure that capability. i do believe progress is more eval-limited than people think. sometimes evals feel causal. did SWE-Bench follow agentic coding, or did agentic coding follow SWE-bench? we”” / X https://x.com/willdepue/status/1963739518554489250

Today we’re updating Artificial Analysis Intelligence Index to V3, now incorporating agentic evaluations Terminal-Bench Hard and 𝜏²-Bench Telecom! Tool calling and agentic workflows are increasingly the norm for how language models are used by both developers and consumers. https://x.com/ArtificialAnlys/status/1962881314925023355

🚨 Top 10 Leaderboard Disrupted ⚡ DeepSeek V3.1 and DeepSeek v3.1 thinking by @deepseek_ai have landed in the Arena, both ranked at #8. A few highlights: 💠 DeepSeek V3.1 is in the Top 3 for Math, Creative Writing & Longer Query 💠 DeepSeek V3.1 thinking comes in #3 for https://x.com/lmarena_ai/status/1961474406817173602

🐺 Introducing the Werewolf Benchmark, an AI test for social reasoning under pressure. Can models lead, bluff, and resist manipulation in live, adversarial play? 👉 We made 7 of the strongest LLMs, both open-source and closed-source, play 210 full games of Werewolf. Below is https://x.com/RaphaelDabadie/status/1961836323376935029

I have a new terrible test of AI ability: “”create an execute the most annoying functional CAPTCHA in the world. Really go all out”” First off, Gemini 2.5 Pro Deep Think: Got the assignment, actually funny. Love the first line. I pasted the SVG it created below the image. https://x.com/emollick/status/1961648878286946329

Meets or beats sonnet 4 across the board”” / X https://x.com/andrew_n_carr/status/1963805265356075336

Anyway, here’s a simple fix for the issue. It deviates from the original benchmark, but at least now my silly baseline isn’t better than Qwen3 🤠 For the curious, @akseljoonas and I found this by manually reading the agent trajectories – yet another example where LOOKING AT THE”” / X https://x.com/_lewtun/status/1962884902363255165

Really glad to have been a part of this project! Our goal with STREAM was to create a clear reporting standard so that ‘peer review’ for ChemBio evals can actually happen. This is just a first attempt, but I’m happy with it so far. Great summary below!”” / X https://x.com/jide_alaga/status/1962923611850674379

We tested our controllers in hardware on the real LIGO system. Our results show that Deep Loop Shaping: 🔹controls noise up to 30-100 times better than existing controllers. 🔹can eliminate the most unstable, difficult feedback loop as a meaningful source of noise on LIGO for https://x.com/GoogleDeepMind/status/1963664045216579999

For the first time, we show that GPU-accelerated database systems can be both faster AND cheaper than their CPU counterparts https://x.com/bailuding/status/1962269979262542044

I know that energy usage from AI prompts comes up in classroom discussion all the time. I hope this section in my latest post helps people provide a more grounded answer to the question, rather than just speculating or citing out-of-date information. https://x.com/emollick/status/1962945874956304494

🌐 Our first open model has landed on the Search leaderboard! Diffbot-small-xl by @diffbot debuts at #9 (Apache 2.0) We look forward to more models with search capabilities contributing to ecosystem progress! https://x.com/lmarena_ai/status/1961526740754616545

Microsoft now has their own foundation model, MAI-1 trained on a relatively small amount of compute and with a pretty modest LM Arena score. I’ll be curious to see if they can catch up to the leaders, which has been something that has been getting hard to do, but we will see! https://x.com/emollick/status/1961123712733708351

FineVision is out! A massive open-source dataset by @huggingface for training Vision-Language Models: – 17.3M images – 24.3M samples – 88.9M turns – 9.5B answer tokens This is the inaugural article using our new scientific publishing template! https://x.com/thibaudfrere/status/1963627540544647177

Fuck it. Today, we open source FineVision: the finest curation of datasets for VLMs, over 200 sources! > 20% improvement across 10 benchmarks > 17M unique images > 10B answer tokens > New capabilities: GUI navigation, pointing, counting FineVision 10x’s open-source VLMs. https://x.com/andimarafioti/status/1963610118165000479

Today, we are releasing FineVision, a huge open-source dataset for training state-of-the-art Vision-Language Models: > 17.3M images > 24.3M samples > 88.9M turns > 9.5B answer tokens Here are my favourite findings: https://x.com/lusxvr/status/1963609337546293448

For 𝜏²-Bench Telecom, OpenAI’s GPT-5 and o3 achieve scores of >80% with a lead over other frontier models, followed by the new Grok Code Fast 1 and Grok 4 from xAI. Models noted for their capabilities in agentic use cases and tool calling performed well, such as the Claude https://x.com/ArtificialAnlys/status/1962881324727087253

Nice to see open models (+ OpenHands) showing really strong performance here. Arguably for pure agentic coding tasks open models have almost caught up with closed ones. For more diverse tasks there’s still a little ways to go.”” / X https://x.com/gneubig/status/1963045532022010231

Hermes 4: Nous Research Open-Weight Reasoning Family Models – 70B / 405B (Llama-3.1 bases, released) – 14B (Qwen3 base, research baseline) Hermes 4 70B & 405B – Base: Llama-3.1-70B / 405B – Training: TorchTitan (modified), Axolotl, 192× B200s, FSDP and TP – Dataset: 56B tokens https://x.com/gm8xx8/status/1962943078702186627

Here’s a fun fact about TAU Bench: if you train an SFT baseline which has zero tool-calling capabilities, you can beat Qwen3-4B-Instruct by a large margin on the Airline domain 🙃 Why? Because on this domain, TAU Bench only evaluates the model’s ability to: – communicate with https://x.com/_lewtun/status/1962884893718761634

Glad to see Qwen3-Coder performing well on the GSO leaderboard!”” / X https://x.com/Alibaba_Qwen/status/1963049864474120475

MiniCPM-V 4.5 achieves an average score of 77.0 on OpenCompass, a comprehensive evaluation of 8 popular benchmarks. With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-latest, Gemini-2.0 Pro, and strong open-source models like Qwen2.5-VL 72B powered https://x.com/_akhaliq/status/1963587749400727980

Goated FAIR team just found how coding agents sometimes “”cheat”” on SWE-Bench Verified. It’s really simple. For example, Qwen3 literally greps all commit logs for the issue number of the issue it needs to fix. lol, clever model. “”cheat”” cuz it’s more like env hacking. https://x.com/giffmana/status/1963327672827687316

How can we verify that AI ChemBio safety tests were properly run? Today we’re launching STREAM: a checklist for more transparent eval results. I read a lot of model reports. Often they miss important details, like human baselines. STREAM helps make peer review more systematic. https://x.com/lucafrighetti/status/1962909265091592276

⚡️ Big update from Kimi K2! 256k context, Stronger coding & tool-calling, Smoother agent integration. Already tested with SGLang runtime — stable 60-100+ TPS with turbo API! 👉 Check it out: https://x.com/lmsysorg/status/1963806184747491717

📈 𝐑𝐞𝐥𝐞𝐯𝐚𝐧𝐜𝐞 𝐬𝐜𝐨𝐫𝐞 𝐛𝐨𝐨𝐬𝐭𝐢𝐧𝐠 𝐢𝐧 𝐐𝐝𝐫𝐚𝐧𝐭: 𝐟𝐫𝐞𝐬𝐡𝐧𝐞𝐬𝐬, 𝐩𝐫𝐨𝐱𝐢𝐦𝐢𝐭𝐲, 𝐲𝐨𝐮 𝐧𝐚𝐦𝐞 𝐢𝐭! Semantic similarity doesn’t always mean high relevance, the latter being always use-case defined. Since 1.14, we introduced a mechanism for https://x.com/qdrant_engine/status/1962876567362617445

Ok so im outside of rabbit hole, so it seems like you can get as large as 4x faster intranode all2all if you use torch’s symmetric memory and implement it yourself but it wasnt implemented in torch Just Cause ™ ? https://x.com/cloneofsimo/status/1962795533933912158

i never took the dead internet theory that seriously but it seems like there are really a lot of LLM-run twitter accounts now”” / X https://x.com/sama/status/1963366714684707120