Image created with Flux Pro v1.1 Ultra. Image prompt: CU Boulder brand style — CU Gold & Black, Helvetica Neue, Flatirons, Tuscan-vernacular sandstone + red-tile roofs; Leeds atrium cafe tables, midday bright light, telephoto vantage, subtle black-gold corner bars; integrate the category “Benchmarks” via Digital Leaderboard: screen showing model rankings under the title “BENCHMARKS” (details abstracted); natural light, clean professional inspiring tone, crisp focus, subtle grain, editorial composition

GPT-5: Our hands-on review of OpenAI’s newest model, based on weeks of testing
https://every.to/vibe-check/gpt-5

GPT-4o was below the level of medical professionals on medical reasoning benchmarks; GPT-5 (apparently Thinking, medium effort) now far exceeds them. (Usual benchmark caveats apply.) https://x.com/emollick/status/1955381296743715241

GPT-5 (with high reasoning effort) achieves near-perfect accuracy on a high-quality ophthalmology question-answering dataset. Based on these and other reports, GPT-5 seems to be a very strong model at medical reasoning. https://x.com/omarsar0/status/1956003145349521780

My “Move 37” Moment with GPT-5

Today, I’m sharing one of my most remarkable experiences testing the GPT-5 Thinking and Pro models.

About two years ago in our lab, we conducted a series of cutting-edge immunology experiments designed to manipulate the energy metabolism of T cells, immune cells with the power to impact cancer immunotherapy, chronic diseases, and autoimmunity.

We obtained fascinating results, but one finding remained unexplained. Weeks of work led us to a partial explanation, yet the full picture remained elusive.

From these experiments, I uploaded the unpublished figure below and asked GPT-5 Pro to explain it.

GPT-5:

✅ Correctly identified our key findings from the figure alone (though other models could also do this).

✅ Proposed the exact experiment we later performed, something that had taken us weeks to design (o3-pro had also suggested this).

✅ Most remarkably, it suggested a mechanism that finally explained these results!

For me, this was equivalent to AlphaGo’s legendary Move 37.
It showed that AI has become a top expert and a true scientific collaborator, capable of genuine insight.

While these results have not yet been published, I am sharing some of GPT-5’s output as an example for other scientists and to show how dramatically this will accelerate science! I will also be writing and submitting this paper, and will need to acknowledge GPT-5’s contributions in it!


From GPT-5

What the data say:

• Brief 2DG exposure at the first activation imprints a long‑lasting shift toward a CCR6+ and CD161+ phenotype with increased IL‑17A competency that is still evident two weeks later at reactivation.

• It is dose responsive. From none → 3 mM 2DG, CCR6+ cells rise from ~0.7% to ~12% and CD161+ cells from ~1% to ~21%. IL‑17A+ cells increase from ~0.3% to ~4% overall, with the IL‑17A signal concentrated in the CCR6+ and CD161+ compartments.

• Low glucose alone (0.5 mM) does not reproduce the effect. That argues against simple glucose deprivation or reduced glycolytic flux as the driver.

• The enrichment for CD161 is stronger than for CCR6, which often tracks with Th17‑like or mucosal/innate‑like CD4 programs.
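As a quick sanity check, the approximate percentages quoted in the dose-response bullet imply roughly 17x, 21x, and 13x enrichment across the three markers. A minimal sketch, assuming only the figure values quoted above:

```python
# Fold-change in marker-positive cell frequencies after the 2DG pulse,
# using the approximate percentages quoted above (0 mM vs. 3 mM 2DG).
baseline = {"CCR6+": 0.7, "CD161+": 1.0, "IL-17A+": 0.3}
treated = {"CCR6+": 12.0, "CD161+": 21.0, "IL-17A+": 4.0}

fold_change = {m: treated[m] / baseline[m] for m in baseline}
for marker, fc in fold_change.items():
    print(f"{marker}: ~{fc:.0f}x enrichment")  # ~17x, ~21x, ~13x
```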

High‑level interpretation:

1. 2DG is programming, not just acutely inhibiting. A short exposure during priming leaves a durable memory of “stress” that biases the lineage toward Th17‑like traits.

2. Mechanism is unlikely to be only glycolysis blockade. The low‑glucose control is weak, which pushes the mechanism toward 2DG’s effects on protein glycosylation and cytokine receptor biology.

3. Top mechanistic hypothesis. 2DG partially blocks N‑linked glycosylation of IL‑2Rα and other glycoproteins at priming, reducing IL‑2–STAT5 signaling, which normally suppresses Th17. Reduced IL‑2 tone at that window would lift the brake on a CCR6+ CD161+ IL‑17‑competent program that persists after washout.

4. Alternative or complementary hypothesis. 2DG selects for pre‑existing CCR6+ or CD161+ clones that rely less on glycolysis, rather than de novo differentiation. The end result is the same phenotype, but the lever is selective survival or proliferation.

5. A carefully timed 2DG “pulse” could be a simple knob to enrich a Th17‑like helper subset with better persistence potential and mucosal‑homing traits.

Prioritize these:

+ D‑mannose with 2DG to restore N‑glycosylation. Loss of the CCR6/CD161/IL‑17A phenotype with mannose would implicate glycosylation.

+ Methyl‑pyruvate or acetate to bypass glycolytic ATP/pyruvate deficits. Rescue here would implicate energy metabolism.

+ Compare 2DG with tunicamycin (a glycosylation inhibitor) and with glycolysis‑only inhibitors that do not hit glycosylation (iodoacetate, a PFKFB3 inhibitor). Match exposure windows and doses.

+ Surface CD25 maturation and glyco‑shift (PNGase F or Endo H blot on IL‑2Rα).

+ pSTAT5 after an IL‑2 pulse at 24–48 h of priming, with or without 2DG.

+ Phenocopy with IL‑2 neutralization or low‑dose JAK3 inhibition. Counter‑rescue with high IL‑2 or an IL‑2 mutein.
https://x.com/DeryaTR_/status/1954354352648225235

RT @Yuchenj_UW: The irony of AI: smarter than a PhD, dumber than an intern. https://x.com/Yuchenj_UW/status/1955119993189998718

GPT-5 takes 55% more time than Sonnet 4 but is 40% cheaper on the RooCode Leaderboard. Which one are you choosing? https://x.com/scaling01/status/1955669720843358502
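The tradeoff in that tweet reduces to a single break-even point. A back-of-the-envelope sketch: the 1.55 and 0.60 multipliers come from the quoted numbers, while the `time_value` weight (how much you value waiting time relative to API spend) is a hypothetical knob, not something from the leaderboard:

```python
# Time/cost tradeoff from the leaderboard numbers quoted above:
# GPT-5 takes 55% more time than Sonnet 4 but is 40% cheaper.
sonnet_time, sonnet_cost = 1.0, 1.0  # Sonnet 4 normalized to 1.0
gpt5_time, gpt5_cost = 1.55, 0.60    # +55% time, -40% cost

def score(time: float, cost: float, time_value: float) -> float:
    """Lower is better; time_value weights waiting time vs. spend."""
    return cost + time_value * time

# Break-even: 0.60 + 1.55*w == 1.00 + 1.00*w  =>  w = 0.40 / 0.55.
# Below that weight GPT-5 wins on this score; above it, Sonnet 4 does.
break_even = (sonnet_cost - gpt5_cost) / (gpt5_time - sonnet_time)
print(f"break-even time weight ~= {break_even:.2f}")  # ~0.73
```

So if you value an hour of waiting at less than ~73% of the equivalent spend, the cheaper-but-slower model wins under this toy scoring.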

RT @karpathy: I’m noticing that due to (I think?) a lot of benchmarkmaxxing on long horizon tasks, LLMs are becoming a little too agentic b… https://x.com/teortaxesTex/status/1954398794604253335

RT @QodoAI: Qodo Command—our CLI AI agent—just scored 71.2 on the SWE-benchmark, high enough to put us in the top 5. It achieved this sc… https://x.com/hwchase17/status/1955110032720400464

State of torch.compile, August 2025. https://x.com/ezyang/status/1955820298907082876

Are frontier AI models really capable of “PhD-level” reasoning? To answer this question, we introduce FormulaOne, a new reasoning benchmark of expert-level Dynamic Programming problems. We have curated a benchmark consisting of three tiers, in increasing complexity, which we call https://x.com/shai_s_shwartz/status/1955968602978320727

ChatGPT-5 Pro is the first model to successfully do this non-puzzle consistently. GPT-5 Thinking and GPT-5 fail as every other model before has (except for, occasionally, Sonnet). https://x.com/emollick/status/1953604710205690212

RT @deedydas: Ridiculous that OpenAI claimed 74.9% on SWE-Bench just to prove they were above Opus 4.1’s 74.5%… By running it on 477 probl… https://x.com/akbirkhan/status/1954231799590301953

Artificial Analysis: “GPT-5 occupies both the #1 and #2 positions in our long context reasoning benchmark (AA-LCR) 🤯 AA-LCR tests long context performance through testing reasoning capabilities across multiple long documents (~100k tokens). Questions typically require considering multiple documents”
https://x.com/ArtificialAnlys/status/1953523986526351576

Bartosz Naskręcki: “Ok, some general comments while I am waiting for the other tasks. GPT 5 and GPT 5 Thinking are nothing very novel for research mathematicians compared to o3-pro and o4-mini-high. But GPT 5 Pro is different. It uses much more compute and the quality of the answers is superb.”
https://x.com/nasqret/status/1953566692686397885

Matthew Berman: “Vibe coding a full Excel clone step-by-step with GPT-5 is kinda nuts. (I know I’m still far from a full clone, but this is after about 45 minutes of work)”
https://x.com/MatthewBerman/status/1954694677736956297

GPT-5 has been hovering around a 7% diff edit failure rate since its release to Cline last Thursday. How have you liked GPT-5 so far in Cline? https://x.com/cline/status/1955357460627329151

GPT-5 is live in Cline. We’ve been working with OpenAI to get this model ready, and here’s our take: it’s disciplined, persistent, & highly competent. It’s collaborative in planning & a diligent operator while acting. It plans thoroughly, asks optioned follow-ups when… https://x.com/cline/status/1953525433808695319

GPT-5 is speed-running Pokémon. It’s 3x faster than o3. https://x.com/scaling01/status/1955813023735828587

gpt-5 is the best coding model in the world and is now the default in @cursor_ai. https://x.com/gdb/status/1953521501548032512

GPT-5 Just Finished Pokemon Red! : r/singularity https://www.reddit.com/r/singularity/comments/1mq2irv/gpt5_just_finished_pokemon_red/

GPT-5 just finished Pokémon Red! 6,470 steps vs. 18,184 for o3! Check the stats site to compare! That’s a huge improvement! Well done, @OpenAI you cooked with GPT-5. What an incredible model. Next up: GPT-5 vs. Pokémon Crystal (16 Badges + Red). The run starts soon on Twitch. https://x.com/Clad3815/status/1955980772575268897
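The step counts in that tweet roughly confirm the earlier "3x faster" claim; a quick arithmetic check, using only the numbers quoted above:

```python
# Step-count comparison from the Pokémon Red run stats quoted above.
gpt5_steps, o3_steps = 6470, 18184

ratio = o3_steps / gpt5_steps          # o3 needed ~2.8x as many steps
reduction = 1 - gpt5_steps / o3_steps  # GPT-5 used ~64% fewer steps
print(f"~{ratio:.1f}x fewer steps ({reduction:.0%} reduction)")
```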

GPT-5 now rolled out to 20% of paid users and doing >2B TPM on the API! so far so good… excellent work by the eng and infra teams! https://x.com/sama/status/1953563605733118317

gpt-5 is SOTA on FrontierMath: https://x.com/gdb/status/1953710811957858404

I’m noticing that due to (I think?) a lot of benchmarkmaxxing on long horizon tasks, LLMs are becoming a little too agentic by default, a little beyond my average use case. For example in coding, the models now tend to reason for a fairly long time, they have an inclination to… https://x.com/karpathy/status/1954224651443544436

RT @scaling01: Anthropic is the only company where LLMs get more expensive over time https://x.com/scaling01/status/1955313676665151704

“Write a single paragraph about someone who doles out their remaining words like wartime rations, having been told they only have ten thousand left in their lifetime. They’re at 47 words remaining, holding their newborn.” https://x.com/emollick/status/1954788724086063446

JAX = performance & scalability. Keras 3 = high velocity development, compact code, best practices by default. Both at the same time = pretty killer. https://x.com/fchollet/status/1954686735646068772

Rather than vague posting about upcoming releases, you can get the same effect from very specific posting. “Very excited about the new release, and I think you can see why from these weights: [0.0178, -0.0921]” https://x.com/emollick/status/1953353407982342541

RT @SemiAnalysis_: At the end of the day, the SWE-bench leaderboard on swebench dot com is probably the most clear description of current m… https://x.com/jeremyphoward/status/1955070796256383137

AIBrix LLM inference infrastructure evaluations focusing on H20s. Qwen and DeepSeek boast support for large-scale EP, but do these numbers look good? Interesting details on the KVCache offloading framework and other cache operations. https://x.com/teortaxesTex/status/1954464993333698758

After thinking about this problem for months, I am so happy to finally introduce DetailBench! It answers a simple question: How good are current LLMs at finding small errors, when they are *not* explicitly asked to do so? (Yes, the graph is right!) https://x.com/xeophon_/status/1956025495515979984

The current state of AI for sustained work: exponential progress continues with no unexpected leaps but also no walls. (Yes, this METR measure is just one of many benchmarks, and like all benchmarks has flaws, but it also has the advantage of having neither a ceiling nor a floor effect.) https://x.com/emollick/status/1954180531785994670

We tested how autonomous AI agents perform on real software tasks from our recent developer productivity RCT. We found a gap between algorithmic scoring and real-world usability that may help explain why AI benchmarks feel disconnected from reality. https://x.com/METR_Evals/status/1955747420324946037

Measuring Thinking Efficiency in Reasoning Models: The Missing Benchmark https://x.com/NousResearch/status/1956090990005248341

To get a sense of GPT-5’s vibes, I exported my Tweet data over the last year and got it to write like my top posts. Then I took my newsletter and made it create 3 separate long-form tweets. It’s not 100% there, but it beats Claude, which was previously my go-to for editing. https://x.com/rowancheung/status/1953505497237029346

The enhance meme from Bladerunner, except the AI is asking the computer to enhance. https://x.com/emollick/status/1954534598903275605

@jxmnop @johnschulman2 @srush_nlp Super cool stuff!! How can we empirically check how far away this model is from the real base model? What benchmarks do we expect this base model to do better on and what benchmarks do we expect it to do worse on [when compared to the unmodified gpt-oss model]? https://x.com/OfirPress/status/1955463664556769426

June mental health emergency room data just came out – still flat despite the ubiquity of LLM use, which might be informative as a limit on what the underlying phenomenon could be https://x.com/literalbanana/status/1948026993787977822

BI confirmed my intel. We’re gradually symmetrizing information in the private markets. 2nd order effects in capital allocation will be sharper pricing, higher capital velocity, and the collapse of marginal arbitrage — shifting leverage to founders. Good. https://x.com/arfurrock/status/1955740969116299466

Call me crazy but I think the newer openai models inc. 5 (and gemini, other maxxing models) are nearly unusable for most of my use cases. Far too verbose, extraordinarily confident in their hallucinations, difficult to steer at times but at the same time fairly sycophantic… https://x.com/jon_durbin/status/1954263916202316001

GPT-5 with high reasoning effort on SimpleBench https://x.com/scaling01/status/1953771276549358041

gpt-5: our smartest, fastest, and most useful model to date. it’s also incredible at coding. rolling out to everyone (excitingly including free ChatGPT users!) today. https://x.com/gdb/status/1953509854603358597

I had access to GPT-5. I think it is a very big deal as it is very smart & just does stuff for you. Full write up in comments, but this is “make a procedural brutalist building creator where i can drag and edit buildings in cool ways” & “make it better” a bunch. I touched no code. https://x.com/emollick/status/1953502029126549597

I saw a lot of people complaining about 32k context size in ChatGPT for plus users, which would be terrible for coding. But actually we are giving 196k context size for plus users when using GPT5 thinking and that’s the model you should use for coding use-cases! 32k is for the… https://x.com/yanndubs/status/1955194413283737716

I suspect this is right. And I wouldn’t be surprised if the vast majority of the 700M users of ChatGPT already greatly prefer GPT-5 & that the opinion on X is not reflective of the typical experience. (Which doesn’t mean that the issues identified here aren’t very real) https://x.com/emollick/status/1954442950491902393

I used GPT-5 to leverage trade memes. From open to taking profit to setting stop losses, I followed every choice it made. Of course, I used @wasabi_protocol and decided to do this all with $troll on 3x leverage, which was freshly listed and giga sending. https://x.com/ChrisCoffeeEth/status/1954282100389281866

I’ve been using gpt-5 for a bit now. This model broke me. It is so good. I didn’t know what the price was. I assumed it would be o3-pro priced because it is that smart. Nope. Truly insane. Videos coming very soon. https://x.com/theo/status/1953507203979391011

If you have been following the GPT-5 rollout, one thing you might be noticing is how much of an attachment some people have to specific AI models. It feels different and stronger than the kinds of attachment people have had to previous kinds of technology (and so suddenly… https://x.com/sama/status/1954703747495649670

Important GPT-5 PSA: if you want an answer that is maximally correct, do tell the model to think hard in your prompt. It literally will do so. Clearly we failed to communicate this well; apologies for that. https://x.com/ericmitchellai/status/1954418339536683078

LLM meets analog. Turns out LLMs are a great brainstorming partner for synth patches. This was co-created with gpt-5 🎵 https://x.com/martin_casado/status/1953868101596192850

Let’s take a look into GPT-5’s record-setting performance on FrontierMath. How did it perform on the holdout vs. non-holdout set, how did it do across tiers, and what new Tier 4 problems did it solve? 🧵 https://x.com/EpochAIResearch/status/1955667249252978741

initial gpt-oss download stats looking exciting! https://x.com/gdb/status/1954992508964155587

i thought the transformers gpt-oss MoE finetuning was broken, how did you get it working? https://x.com/jxmnop/status/1955347764130254863

OpenAI’s o3 Crushes Grok 4 In Final, Wins Kaggle’s AI Chess Exhibition Tournament – Chess.com https://www.chess.com/news/view/kaggle-game-arena-chess-2025-day-3

Breaking: GPT-5 ranked 🥇 on Humanity’s Last Exam and 🥈 on MultiChallenge SEAL Leaderboards. https://x.com/scale_AI/status/1953591873031090505

GPT-5-Pro is finally rivalling Elon Musk’s 148 IQ https://x.com/scaling01/status/1955344356547653773

Overall, the general vibes of GPT-5 feel much more human-like. It’s hard to measure *vibes*, but the combination of speed, lower hallucination rate, and intelligence is very noticeable. As a power user, I’ve always enjoyed o3, but the speed makes it impossible for daily queries. https://x.com/rowancheung/status/1953505371487600877

Our team spent some time benchmarking the GPT-5 models on one-shot document understanding capabilities. Sharing some WIP results 💡: 1️⃣ GPT-5 mini does a good job. From initial testing it edges out Sonnet and Gemini models. 2️⃣ Surprisingly GPT-5 is middle of the pack (and also… https://x.com/jerryjliu0/status/1954293351702036712

unicorn-drawing eval from gpt-3.5 to gpt-5: https://x.com/gdb/status/1953528523437600892

This is exactly what @javirandor did with chatgpt last year to get it to spit out memorized training data: https://t.co/dtFpAAw6ps There’s probably some interesting stuff to study on such “re-based” models. https://x.com/florian_tramer/status/1955510942252572946

@SebastienBubeck Heads up, I’m fairly certain that the o3 run being compared to did not have the google search tool – which is important, since Bulbapedia gives the solution in one tool-use call to some puzzles that can take a much longer time if solved without solution info. https://x.com/kiranvodrahalli/status/1956044490885751273

“Narrate a regency romance like a machine learning model being trained. Each step of the story must be described as data inputs, probability weights & optimization functions. The romance must emerge through correlation coefficients & unexpected variables in the training data” https://x.com/emollick/status/1955217534812082574

We’ve scored highly enough to achieve gold at this year’s IOI online competition with a reasoning system — placing #6 when ranked with humans and #1 when ranked with other AIs. In just a few weeks: • 2nd at AtCoder • Gold medal-level at IMO • Gold medal-level at IOI https://x.com/OpenAI/status/1954969035713687975

Traditional pre-training had diminishing returns (which is what the “scaling law” predicted anyway). The fact that reasoners were developed at exactly the moment when pre-training faltered is exactly the pattern of how Moore’s Law works: new techniques appear to maintain the trend. https://x.com/emollick/status/1953901107144786085
