Image created with Flux Pro v1.1 Ultra. Image prompt: Ornate showgirl glamour in orange-and-teal tones, glitter scoreboard backdrop featuring stylized luminous performance bars, stylized text “Benchmarks” spelled in sequined marquee letters shimmering across the display; spotlit, dramatic contrast, vintage grain, cinematic, high-detail

Red teamers assemble! ⚔️💰 We’re putting $500K on the line to stress-test a just-released open-source model. Find novel risks, get your work reviewed by OpenAI, Anthropic, Google, UK AISI, Apollo, and help harden AI for everyone. https://x.com/woj_zaremba/status/1952886644090241209

We’re launching a $500K Red Teaming Challenge to strengthen open source safety. Researchers, developers, and enthusiasts worldwide are invited to help uncover novel risks—judged by experts from OpenAI and other leading labs. https://x.com/OpenAI/status/1952818694054355349

Claude Opus 4.1 (“claude-leopard-v2-02-prod”): “Opus 4.1 is here – Try our latest model for more problem solving power.” https://x.com/btibor91/status/1952366658326036781

Claude Opus 4.1 \ Anthropic https://www.anthropic.com/news/claude-opus-4-1

Claude Opus 4.1 beats GPT-5 on SWE bench https://x.com/Sauers_/status/1953504854044704973

Claude Opus 4.1 is available in Cursor! Let us know what you think. https://x.com/cursor_ai/status/1952782293925298655

Going live with the fellas @tbpn in an hour to talk about Opus 4.1 and Claude Code https://x.com/alexalbert__/status/1952801100299681959

GPT-5 (medium reasoning) is the new leader on the Short Story Creative Writing benchmark! GPT-5 mini (medium reasoning) is much better than o4-mini (medium reasoning). Claude Opus 4.1 shows gains over Opus 4. https://x.com/LechMazur/status/1953658077300875656

Gemini 2.5 Deep Think delivers state-of-the-art performance across many challenging benchmarks! https://x.com/demishassabis/status/1951468051578142848

Gemini 2.5: Deep Think is now rolling out https://blog.google/products/gemini/gemini-2-5-deep-think/

Gemini Deep Think, our SOTA model with parallel thinking that won the IMO Gold Medal 🥇, is now available in the Gemini App for Ultra subscribers!! Should we put it in the Gemini API next? https://x.com/OfficialLoganK/status/1951260803459338394

not enough people are talking about the delta between the parallel-thinking uplifts of OAI vs GDM. AIME: o3 pro +3% (90→93 on 2024), Deep Think +11.2% (88→99.2 on 2025). Knowledge: o3 pro +3% (on GPQA), Deep Think +13.2% (on HLE). Coding: o3 pro +9.1% (on Codeforces) https://x.com/swyx/status/1951460518293807241

Played with Deep Think, a dramatic improvement for Google. It is getting close to o3 Pro – I’d say, solidly the second-best model right now. Far less verbose! With limits of about 10 a day, it’s not ready for professional use, though. https://x.com/MParakhin/status/1952028947153371631

GPT-5 Hands-On: Welcome to the Stone Age https://www.latent.space/p/gpt-5-review

GPT-5’s Router: how it works and why Frontier Labs are now targeting the Pareto Frontier https://www.latent.space/p/gpt5-router

@aidan_mclau @cursor_ai The straight up GPT-5 in Codex CLI fixed a bug in 3 minutes that I was working on for three or four hours this morning…can’t wait to try in Cursor. https://x.com/sound4movement/status/1953583522587017345

💥 It’s here! GPT-5 is rolling out in ChatGPT for everyone, starting today. It’s a 🤯 good model, and we’ve simplified the UI alongside it. No more choosing between gpt-4o and o4-mini. When you ask a hard question and the model needs to think hard, it does. When it can give you… https://x.com/kevinweil/status/1953502681181618277

AMA with @sama + some members of the GPT-5 team Tomorrow 11am PT. https://x.com/OpenAI/status/1953548075760595186

Codex CLI + GPT-5: https://x.com/gdb/status/1953556751762288653

Does OpenAI not do basic integration testing? At the time of release, the first code sample provided in the GPT-5 docs could not be run, because someone accidentally deleted the `output_text` property. My CI notified me. Why didn’t theirs? https://x.com/jeremyphoward/status/1953610071654772985

going to try live-tweeting the GPT-5 livestream. first, GPT-5 is an integrated model, meaning no more model switcher and it decides when it needs to think harder or not. it is very smart, intuitive, and fast. it is available to everyone, including the free tier, w/reasoning! https://x.com/sama/status/1953502614676811865

GPT-5 (medium reasoning) sets a new record on the Confabulations/Hallucinations on Provided Texts benchmark! https://x.com/LechMazur/status/1953582063686434834

GPT-5 claims #1 spot on LiveBench https://x.com/scaling01/status/1953602929375813677

gpt-5 for long context reasoning: https://x.com/gdb/status/1953747271666819380

GPT-5 gets 74.9 on SWE-bench. Wonder what the budget per task is. https://x.com/OfirPress/status/1953502998627221519

GPT-5 in the high reasoning setting hit the 100K token limit for our evaluations on 10/290 Tier 1-3 samples (3%). This means our evaluation might slightly underestimate the reasoning capabilities of GPT-5. https://x.com/EpochAIResearch/status/1953615908695314564

GPT-5 is extremely sensitive to instructions. Either give it demonstrations or tell it explicitly how you want the output. Avoid doing both. If you do, GPT-5 will override the examples with your output instructions. Sharing more just in case you face this issue: https://x.com/omarsar0/status/1953876255037612531

GPT-5 is here – and it’s #1 across the board. 🥇#1 in Text, WebDev, and Vision Arena 🥇#1 in Hard Prompts, Coding, Math, Creativity, Long Queries, and more Tested under the codename “summit”, GPT-5 now holds the highest Arena score to date. Huge congrats to @OpenAI on this https://x.com/lmarena_ai/status/1953504958378356941

GPT-5 is here! 🚀 For the first time, users don’t have to choose between models — or even think about model names. Just one seamless, unified experience. It’s also the first time frontier intelligence is available to everyone, including free users! GPT-5 sets new highs across… https://x.com/ElaineYaLe6/status/1953607005144506454

GPT-5 is here. Rolling out to everyone starting today. https://x.com/OpenAI/status/1953504357821165774

GPT-5 is live in Cline. We’ve been working with OpenAI to get this model ready, and here’s our take: it’s disciplined, persistent, & highly competent. It’s collaborative in planning and a diligent operator while acting. It plans thoroughly, asks optioned follow-ups when… https://x.com/cline/status/1953525433808695319

GPT-5 is now available in Cursor. It’s the most intelligent coding model our team has tested. We’re launching it for free for the time being. Enjoy! https://x.com/cursor_ai/status/1953519580627742750

GPT-5 is now available on Perplexity and Comet for Max and Pro subscribers. Just ask. https://x.com/perplexity_ai/status/1953537170964459632

GPT-5 new SOTA on WeirdML beating o3-pro https://x.com/scaling01/status/1953919743842238472

GPT-5 only a 3% improvement over o3 at reproducing scientific papers https://x.com/scaling01/status/1953503883331846629

GPT-5 pricing is insane IT’S OVER https://x.com/scaling01/status/1953509084008710547

GPT-5 rollout updates: *We are going to double GPT-5 rate limits for ChatGPT Plus users as we finish rollout. *We will let Plus users choose to continue to use 4o. We will watch usage as we think about how long to offer legacy models for. *GPT-5 will seem smarter starting… https://x.com/sama/status/1953893841381273969

GPT-5 sentiment from the trenches (AKA 24 hours in Cline users’ hands): It’s a precision instrument, not a Swiss Army knife. Give it detailed prompts and it delivers exactly what you asked for — no tangents, no hallucinations about “finished” code. However, it’s less performant… https://x.com/cline/status/1953898747928441017

GPT-5 sets a new record on FrontierMath! On our scaffold, GPT-5 with high reasoning effort scores 24.8% (±2.5%) and 8.3% (±4.0%) in tiers 1-3 and 4, respectively. https://x.com/EpochAIResearch/status/1953615906535313664

GPT-5 system card capability evals reactions thread. First observation: ~no improvement on all the coding evals that aren’t SWEBench https://x.com/eli_lifland/status/1953507434238288230

GPT-5 Thinking is less deceptive than o3. However, when elicited to display deceptive behaviour, it jumps to 28% https://x.com/scaling01/status/1953504438691221856

GPT-5 was doing 2B tokens per minute 3 hours after launch 🤯 https://x.com/kevinweil/status/1953649263411704195

GPT-5 with big improvements in Tau-Bench except the airline category https://x.com/scaling01/status/1953505637242974695

GPT-5 with high reasoning effort on SimpleBench https://x.com/scaling01/status/1953771276549358041

GPT-5: $0.625/$5.00 with flex pricing is ridiculous https://x.com/scaling01/status/1953517149768593903
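The flex numbers in the tweet above are exactly half of GPT-5’s standard per-token launch pricing. A quick sketch of the arithmetic (the $1.25/M input price appears elsewhere in this roundup; the $10/M output price is my assumption from launch coverage, not from the tweet):

```python
# Per-million-token prices. Standard-tier output ($10/M) is an assumed
# figure from launch coverage; the rest are quoted in the tweets.
def cost_usd(input_tokens: int, output_tokens: int,
             in_per_m: float, out_per_m: float) -> float:
    """Total API cost in dollars for a given token mix."""
    return input_tokens / 1e6 * in_per_m + output_tokens / 1e6 * out_per_m

# Example job: 1M input tokens, 200K output tokens.
standard = cost_usd(1_000_000, 200_000, 1.25, 10.00)  # $1.25 + $2.00 = $3.25
flex = cost_usd(1_000_000, 200_000, 0.625, 5.00)      # exactly half: $1.625
```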

Hallucinations are almost gone with GPT-5 https://x.com/scaling01/status/1953507569609134506

ICYMI, OpenAI released an insane amount of guides on how to use GPT-5. > Examples > Prompting guide > New features guide > Reasoning tips > Setting verbosity > New tool calling features > Migration guide And much more. https://x.com/omarsar0/status/1953583336603234726

If GPT-5 made this chart I’m bearish 😭 https://x.com/iScienceLuvr/status/1953503815292092904

In a new report, we evaluate whether GPT-5 poses significant catastrophic risks via AI R&D acceleration, rogue replication, or sabotage of AI labs. We conclude that this seems unlikely. However, capability trends continue rapidly, and models display increasing eval awareness. https://x.com/METR_Evals/status/1953525150374150654

Introducing GPT-5 | OpenAI https://openai.com/index/introducing-gpt-5/

Introducing GPT-5 Our best AI system yet, rolling out to all ChatGPT users and developers starting today. https://x.com/OpenAI/status/1953526577297600557

Long context reasoning performance: A stand out is long context reasoning performance as shown by our AA-LCR evaluation whereby GPT-5 occupies the #1 and #2 positions. https://x.com/ArtificialAnlys/status/1953507713222422866

Lots of excitement about GPT-5 in Codex CLI via your ChatGPT plan. Some details: 1. Yes, if you sign in with ChatGPT, usage is included via your paid plan! 2. Still determining exact rate limits, but the goal is to be generous: — Pro users should basically not hit limits… https://x.com/embirico/status/1953590991870697896

made a little Sankey to show you why I’m fuming ChatGPT Plus before vs after the GPT-5 release https://x.com/scaling01/status/1953780931552031056

Markets disappointed by GPT-5 OpenAI getting crushed on Polymarket https://x.com/scaling01/status/1953515099257282763

model switching in gpt-5 very cool! https://x.com/sama/status/1953526708742537220

New in Notion AI’s toolbelt: @OpenAI’s GPT-5 It’s fast, thorough, and handles complex work 15% better than other models we’ve tested. A great choice for tasks with multiple moving parts. Gradual rollout starting today. https://x.com/NotionHQ/status/1953506907924443645

OpenAI GPT-5 System Card released: “GPT-5 is a unified system with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs…” https://x.com/iScienceLuvr/status/1953503173932724614

Priority Processing debuts with GPT-5. Under-hyped imo: for apps where milliseconds matter, pay extra and get our fastest token speeds. Just add “service_tier”: “priority” to your requests https://x.com/jeffintime/status/1953857260729643136

Quick PSA. Settings for minimizing GPT-5 latency (time to first token): “service_tier”: “priority”, “reasoning_effort”: “minimal”, “verbosity”: “low”. P50 TTFT with these settings is ~750ms. With the defaults, it’s >3s. The default settings are the right starting point for… https://x.com/kwindla/status/1953868672470331423
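For anyone wanting to try these latency settings, here is a minimal sketch of where the three fields from the tweet sit in a raw Chat Completions request body. Field names are as quoted in the tweet; the endpoint and model strings are assumptions, so check OpenAI’s API reference for the exact shape in your SDK.

```python
import json
import urllib.request

# Low-latency settings from the tweet, assembled into a raw request body.
payload = {
    "model": "gpt-5",
    "messages": [{"role": "user", "content": "ping"}],
    "service_tier": "priority",     # Priority Processing: fastest token speeds
    "reasoning_effort": "minimal",  # skip extended deliberation before answering
    "verbosity": "low",             # shorter answers, fewer output tokens
}

req = urllib.request.Request(
    "https://api.openai.com/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer YOUR_API_KEY",  # placeholder, not a real key
    },
)
# urllib.request.urlopen(req)  # uncomment with a real key to send
```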

Think harder is back! Routing changes in GPT-5 mean capability is moving from model selection to prompting https://x.com/dariusemrani/status/1953591404003045562

this is the detail of GPT-5 I’m most proud of. GPT-4 launched at $30/$60, no cache discount. Since then, it’s been an unrelenting cross-team push to collapse the cost of intelligence. We’re nowhere near done. https://x.com/jeffintime/status/1953534466854453751

We are actively evaluating GPT-5 models on document understanding capabilities 🔎📄 – specifically screenshotting the page and feeding it into the model. A WIP preliminary finding is that even though on paper GPT-5 is $1.25 per 1M tokens, it uses 4-5x more tokens than GPT-4.1… https://x.com/jerryjliu0/status/1953582723672814054

We’re also releasing v0.16 of the Codex CLI today. – GPT-5 is now the default model – Use with your ChatGPT plan – A new, refreshed terminal UI. `npm i -g @openai/codex` to update https://x.com/OpenAIDevs/status/1953559797883891735

We’ve put together some guides on how to get started with GPT-5: 💬 Prompting guide: https://x.com/OpenAIDevs/status/1953528513480347840

What the hell man, this is such a lame way to technically not lie. «A unified system» is… literally just SEPARATE CoT + non-CoT models + a router. > OpenAI reasoning models, including gpt-5-thinking, gpt-5-thinking-mini, and gpt-5-thinking-nano > gpt-5-main just fuck off washed https://x.com/teortaxesTex/status/1953512363031757048

This week, ChatGPT is on track to reach 700M weekly active users — up from 500M at the end of March and 4× since last year. Every day, people and teams are learning, creating, and solving harder problems. Big week ahead. Grateful to the team for making ChatGPT more useful and… https://x.com/nickaturley/status/1952385556664520875

8.6% of the world’s population uses ChatGPT weekly… https://x.com/emollick/status/1952389693502370198

Frontier models, capable of agentic reasoning, can now run on your MacBook Pro 🧑‍💻 @OpenAI’s release of GPT-OSS 20B and 120B are the biggest releases in open-source this year. Build agentic workflows with @llama_index that run 100% locally! Huge props to @LoganMarkewich and… https://x.com/jerryjliu0/status/1952883595787239563

GPT-OSS models seem to be slopmaxxed on math/coding and reasoning – they are great at that but they completely lack taste and common sense. at least that’s my vibe so far https://x.com/scaling01/status/1952881329772564764

I think gpt-oss was always expected to be put in an agent harness that uses search for all its world knowledge. I’ve always argued this is not a valid replacement; the rich connections it builds from actual backprop on the world’s knowledge – not just facts, but the aggregate… https://x.com/Teknium1/status/1953230352568467761

I was just about to make a post that GPT-OSS-120B is nonetheless an overall good for the very low end. But I honestly don’t know what it is good at, except benchmarks. Coding seems to suck, creative writing is terrible… So it’s just a math model? https://x.com/scaling01/status/1953047913954791696

I’m thrilled @OpenAI has released two open weight models. Thank you to all my friends at OpenAI for this gift! I’m also encouraged that from my quick tests gpt-oss-120b looks strong (though we should still wait for rigorous 3rd party evals). https://x.com/AndrewYNg/status/1952838045235126510

i’ve spent the last couple hours talking to gpt-oss and can safely say it’s unlike any model i’ve tested. one second it’s coding for me at a professional level, the next it’s making up basic facts and clinging to them no matter what i say. something very strange is going on https://x.com/jxmnop/status/1953216881361600729

I’ve written the full story of Attention Sinks — a technical deep-dive into how the mechanism was developed and how our research ended up being used in OpenAI’s new OSS models. For those interested in the details: https://x.com/Guangxuan_Xiao/status/1953656755109376040

ICYMI: you can vibe test the latest gpt-oss models on gpt-oss[.]com 💥 We partnered with @OpenAI to bring easy access to the model right down to a browser near you! https://x.com/reach_vb/status/1953041435999010916

Introducing gpt-oss | OpenAI https://openai.com/index/introducing-gpt-oss/

Is it over for gpt-oss? What are these Aider Polyglot scores? https://x.com/scaling01/status/1952780629772321257

It’s looking bad bois.. Aider Polyglot results for GPT-OSS-120B: 41.8% for comparison: Kimi-K2: 59.1% DeepSeek-R1: 56.9% Qwen3 32B: 40.0% https://x.com/scaling01/status/1953047534122713130

Our new @OpenAI open models https://x.com/polynoamial/status/1952778238368887184

Thank you @OpenAI for open-sourcing these great models! 🙌 We’re proud to be the official launch partner for gpt-oss (20B & 120B) – now supported in vLLM 🎉 ⚡ MXFP4 quant = fast & efficient 🌀 Hybrid attention (sliding + full) 🤖 Strong agentic abilities 🚀 Easy deployment 👉🏻 https://x.com/vllm_project/status/1952784530466849091

The gpt-oss models have been post-trained to use two specific first-party tools: 1. a web browser that can search, read pages, follow links, and cite sources 2. an interactive python notebook This will give gpt-oss based agents super powerful capabilities out of the box! https://x.com/corbtt/status/1952810876165312805

We fixed some issues for @OpenAI’s gpt-oss model! 1. Jinja template has extra `\n`s, didn’t parse thinking sections + tool calling wasn’t rendered correctly 2. Some versions miss <|channel|>final -> this is a must! 3. F16 infs: use F32+BF16! We made a few free Colab notebooks as… https://x.com/danielhanchen/status/1953901104150065544

Well, it took just 2 hours for gpt-oss to hit #1 on @huggingface. Don’t remember seeing anything rise that fast! https://x.com/fdaudens/status/1952814865795698954

RT @cb_doge: 🚨 BREAKING: Grok 4 defeats Google’s Gemini in the Kaggle AI Chess semi-final and moves on to the grand finale! 🤖♟️🔥 https://x.com/hyhieu226/status/1953220787084902888

Grok 4 Imagine generates images faster than I can scroll. How is this even possible? It’s so good. 😭 https://x.com/tetsuoai/status/1951444393065586840

RT @elonmusk: For the next few days, Grok Imagine video generation is free to all US users! Download the Grok app and try it out. https://x.com/Yuhu_ai_/status/1953367318521655594

Super fast image & video generation via Imagine in the @Grok app is now available to all 𝕏 Premium users https://x.com/elonmusk/status/1952535613560983757

You guys need to try imagine mode on grok app. It’s incredible. https://x.com/tobi/status/1951789462268391749

Grok 4 is still state-of-the-art on ARC-AGI-2 among frontier models. 15.9% for Grok 4 vs 9.9% for GPT-5. https://x.com/fchollet/status/1953511631054680085

Grok-4 BEATS GPT-5 on ARC-AGI-2 https://x.com/scaling01/status/1953509485453902173

concerning https://x.com/DZhang50/status/1953510507631071658

🚨New prompting report, from us: Don’t bother with threats. Does threatening an AI really make it perform better (the way Google founder Brin claimed)? How about offering to tip the AI? We find no impact of threats or tips on average performance (but variance at question level) https://x.com/emollick/status/1951289250915221589

Grok Imagine usage is growing like wildfire. 14 million images generated yesterday, now over 20 million today! https://x.com/elonmusk/status/1952636922477572324

Grok Imagine is now live to all SuperGrok and Premium+ subscribers. Update your Grok app to version 1.1.33 and try it out. https://x.com/chaitualuru/status/1952174534142067092

Grok-4 ranks #1 on LisanBench https://x.com/scaling01/status/1953843352366903622

Ethan Mollick on X: “Going almost two years with no substantive improvements to GPTs is surprising. I know the GPT store & consumer use was quietly abandoned by OpenAI, but when I talk to organizations they often view GPTs as an important tool for non-technical people to create & distribute AI uses.” / X
https://x.com/emollick/status/1953221298743578646

I wonder if the fact that there’s no frontier leap in Agentic evals like SWE Bench, OpenAI internal PRs and implementing ML papers implies that models are saturating and that agent scaffolds actually matter more than ever. There’s never been a better time to be an agent wrapper? https://x.com/nrehiew_/status/1953531014825095492

👀 we care a lot about correctness, ran many evals and stared at many tensors to compare them. numerics of vLLM on hopper should be solid and verified! if you run into any correctness issue on vLLM, we would love to know and debug them! https://x.com/vllm_project/status/1952940603773468926

Every single benchmark out there has problems either with the questions, the harness or something else. The alpha is from actually inspecting the outputs. For example, instead of overindexing on a few % on SWEBench, look at the rollouts to understand the shape of the intelligence. https://x.com/nrehiew_/status/1953657627294224732

How many more model releases do we need for folks to realize we are not getting to magical superintelligence with what we got? How many times do you have to see a model benchmaxxing to realize Humanity’s Last Exam is a freaking idiotic name and that answering questions on it… https://x.com/Dan_Jeffries1/status/1953567646248567029

I’m excited to introduce Hieroglyph, a new benchmark for lateral reasoning. Hieroglyph measures a model’s ability to identify the link between seemingly unrelated and often niche subjects. On the 20-question set of the hardest Only Connect questions, no model scores above 50%. https://x.com/synthwavedd/status/1951645151203324099

Introducing Kaggle Game Arena | Kaggle https://www.kaggle.com/blog/introducing-game-arena

One thing to pay attention to in benchmarking AI is how success is being measured. Models can be very fragile, getting the right answer rarely, but measurably more than chance, and look very good on benchmarks using PASS@10, but fail often in reality. https://x.com/emollick/status/1951046579009495249
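Mollick’s point is easy to make concrete with the standard unbiased pass@k estimator from the HumanEval paper (the numbers below are illustrative, not from the tweet): a model that solves a task on only 2 of 100 attempts looks weak at pass@1 but respectable at pass@10.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (HumanEval): probability that at least
    one of k samples, drawn from n attempts of which c are correct, passes."""
    if n - c < k:
        return 1.0  # fewer failures than draws, so some draw must succeed
    return 1.0 - comb(n - c, k) / comb(n, k)

p1 = pass_at_k(100, 2, 1)    # 0.02 -- near-chance per attempt
p10 = pass_at_k(100, 2, 10)  # ~0.19 -- looks far better on a pass@10 leaderboard
```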

Terminal-Bench results officially confirmed: a GLM-4.5 swimming among Claudes. https://x.com/Zai_org/status/1952411485742760324

Thrilled to announce the @Kaggle Game Arena, a new leaderboard testing how modern LLMs perform on games (spoiler: not very well atm!). AI systems play each other, making it an objective & evergreen benchmark that will scale in difficulty as they improve. https://x.com/demishassabis/status/1952436066524299432

We have a long history of using games to measure progress in AI. 🎮 That’s why we’re helping unveil the @Kaggle Game Arena: an open-source platform where models go head-to-head in complex games to help us gauge their capabilities. 🧵 https://x.com/GoogleDeepMind/status/1952406075996533077

We’re updating the Artificial Analysis Intelligence Index! We’ve added IFBench to cover instruction following, done some housekeeping and are adding more benchmarks over the next couple of weeks 👀 The focus of the Artificial Analysis Intelligence Index is to provide a useful… https://x.com/ArtificialAnlys/status/1952302030812483982

Yeah, they mogged. First thing I see here, besides insane scores to params ratio (20B is MoE too! 4.25 bpw!), is that they have open sourced their reasoning effort scaling that totally destroys every other attempt. Second thing is that this 120B is trained on *a lot* of tokens. https://x.com/teortaxesTex/status/1952790053241172229

When you’re helping someone build an LLM judge, the goal is to maximize the bits of information you get about a task per unit of human effort/frustration. Our work on ALHF shows that natural language feedback is both information-dense and ergonomic. Check it out in Agent Bricks! https://x.com/jefrankle/status/1953297527089897944

Fixed the SWE-Bench chart and added Claude to it https://x.com/nrehiew_/status/1953516694238114187

Evaluation is the hardest problem for physical AI systems: do you crash test cars every time you debug a new FSD build? Traditional game engine (sim 1.0) is an alternative, but it’s not possible to hard-code all edge cases. A neural net-based sim 2.0 is purely programmed by data… https://x.com/DrJimFan/status/1952755998197948667

ByteDance’s SeedProver scores 331/657 on PutnamBench, almost 4 times the previous SOTA. More impressively, it gets 201/657 under the *light* inference setting, ie equivalent to pass@64-256. DeepSeek-Prover-V2 is just 3 months old… Things go fast now. https://x.com/teortaxesTex/status/1951875052967739787

A fourth problem on FrontierMath Tier 4 has been solved by AI! Written by Dan Romik, it had won our prize for the best submission in the number theory category. https://x.com/EpochAIResearch/status/1951432847148888520

We built an open source game arena (RL environments) to put frontier models against each other head to head. Games are an area I’m super excited to see Gemini shine more! https://x.com/OfficialLoganK/status/1952538175106404768

The OpenAI open weights models are very impressive. These basically beat every model from eight months ago & the small one runs on a laptop. For example, when HLE came out in January, the top score was 3-4%. Been playing with the models and so far they feel like their scores. https://x.com/emollick/status/1952796976279662596

we have a lot of new stuff for you over the next few days! something big-but-small today, and then a big upgrade later this week. https://x.com/sama/status/1952759361417466016

We evaluated the new GPT models with a minimal agent on SWE-bench verified. GPT-5 scores 65%, mini 60%, nano 35%. Still behind Opus 4.1 (68%), on par with Sonnet 4 (65%). But a lot cheaper, especially mini! Complete cost breakdown + details in 🧵 https://x.com/KLieret/status/1953835750723584357

OpenAI has developed a “universal verifier” that could help translate its gains in domains like math and coding to other, more subjective domains like business decision-making or creative writing. We have the details here: https://x.com/steph_palazzolo/status/1952375778361954801

2.5 years later, OpenAI open source (smol) is not cracking the Dreaded Diamond Problem. https://x.com/teortaxesTex/status/1952822222298726786

🔥BREAKING: @Zai_org’s GLM-4.5 enters the top-5 in Arena! With 4K+ community votes, it now ranks #5 Overall in the Text Arena – matching DeepSeek-R1 and Kimi-K2 as the top open models. Huge congrats to the Zai team on this incredible milestone and contribution to the open… https://x.com/lmarena_ai/status/1952402506497020330

interesting swiglu variant from the gpt-oss model: clamps inputs and adds a skip connection https://x.com/vikhyatk/status/1952808827281391701
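As a rough scalar sketch of the variant that tweet describes, based on the public gpt-oss reference code: the gate pre-activation is clamped from above, the linear branch is clamped symmetrically, and a `+1` on the linear branch acts as the skip connection, letting the gated signal pass through even when the linear branch is zero. The constants `alpha` and `limit` here are assumed values, not verified.

```python
import math

def gpt_oss_glu(x_gate: float, x_linear: float,
                alpha: float = 1.702, limit: float = 7.0) -> float:
    """Scalar sketch of the clamped SwiGLU-style unit from the tweet.

    alpha and limit are assumptions, not confirmed constants.
    """
    x_gate = min(x_gate, limit)                        # clamp gate input from above
    x_linear = max(-limit, min(x_linear, limit))       # clamp linear input both ways
    gate = x_gate / (1.0 + math.exp(-alpha * x_gate))  # SiLU-like gating: x*sigmoid(ax)
    return gate * (x_linear + 1.0)                     # "+1" = skip connection
```

With `x_linear = 0` the output is the gate value itself rather than zero, which is what makes the `+1` term behave like a skip path around the multiplicative gate.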

New projects already being built on GPT OSS! Build your own with our Model APIs here -> https://x.com/basetenco/status/1952882156059148737

Next to Qwen3 of comparable size: Looks like gpt-oss is a wide (vs deep) model https://x.com/rasbt/status/1952842273848279364

One line of code is all it takes to fine-tune the gpt-oss models from @OpenAI 🔥 > Support to target the MoE expert layers with PEFT > Kernels for FlashAttention3 & MegaBlocks > Fast inference with MXFP4 quantization format In our testing, these models are extremely efficient https://x.com/_lewtun/status/1952788132908404941

openai/harmony: Renderer for the harmony response format to be used with gpt-oss https://github.com/openai/harmony

RT @CerebrasSystems: OpenAI GPT-OSS-120B is live on Cerebras 3,000 tokens/s – fastest OpenAI model on record 1 second reasoning time 131K c… https://x.com/cline/status/1952960760759632025

RT @mattshumer_: It’s over. OpenAI just crushed it. We have their o3-level open-source model running on @GroqInc at 500 tokens per second… https://x.com/JonathanRoss321/status/1953119620103381440

RT @reach_vb: BOOOOM! You can now run @OpenAI gpt-oss 20B natively in @GoogleColab T4 for FREE! 🔥 Powered by Transformers ⚡ The setup tak… https://x.com/_lewtun/status/1953441199253069936

RT @thanosthinking: running gpt-oss:20b on @ollama with Turbo and web search 🏎️ 💨 very happy with how the web search turned out 🙂 and o… https://x.com/ollama/status/1952882173255856223

We’re thrilled to announce Axolotl v0.12.0. We’re ramping up our distributed training featureset with ND parallel multi-node training, and FP8 support. We’ve also added fine-tuning for gpt-oss, FSDP support for TiledMLP, and many more exciting features. 1/5 https://x.com/axolotl_ai/status/1953845149391630472

The Harmony format from gpt-oss is now supported for datasets on the @huggingface Hub 🧘 Nifty feature by @calebfahlgren! https://x.com/_lewtun/status/1953870411050959110

OpenAI claims that GPT-5 is the leading agentic tool calling model. State-of-the-art performance (97%) on the Tau benchmark. GPT-5 also achieves significant improvements in instruction following across different benchmarks. https://x.com/omarsar0/status/1953516984672420041

Delphi (@withdelphi) solves a core need that I find myself struggling with when trying to write with AI – it is extremely hard and tedious getting ChatGPT/Claude to not only generate relevant content, but also match the style and tone of my blogs 📝. Delphi offers a concept of… https://x.com/jerryjliu0/status/1952889056200655206

How big of a paradigm shift was the rise of reasoning models? We dug into the data and found that at least on some benchmarks, reasoning models were likely as large of an algorithmic advance as the Transformer. https://x.com/EpochAIResearch/status/1951734757483487450

This is a really nicely done paper, and a good example of how experts can not only design good benchmarks but also better diagnose what the AI is doing wrong. In this case, not turning to lookup tables and not doing math with coding tools (both of these seem solvable in time) https://x.com/emollick/status/1952606285678915743

It’s wild how much performance can differ depending on provider implementation! (probably the outcome of too heavy quantization) I also had cases where some providers silently fail to respect tool calling or structured generation formats: always have your own mini-benchmark to… https://x.com/AymericRoucher/status/1953115586273394873

GPT-5 results on ARC-AGI 1 & 2! Top line: 65.7% on ARC-AGI-1, 9.9% on ARC-AGI-2 https://x.com/fchollet/status/1953509615624499571

Grok Imagine is now live to all X Premium users on the Grok app. Update to the latest version of the app (1.1.35) and give it a try! https://x.com/chaitualuru/status/1952483088510140670

LisanBenchV2 – Grok-4 smashes expectations Another version of the same leaderboard, that more accurately reflects the relative strength of the models: Models start with an Elo score of 1500. The worst model is Llama3.2-1B with an Elo of 1146. – o3 in 2nd ahead of GPT-5 – https://x.com/scaling01/status/1953843230564323443

The new Grok iOS App update now supports downloading both Generated video and Source image https://x.com/obeydulX/status/1951724900198367515

o3 just killed Grok-4 in the Kaggle Game Arena https://x.com/scaling01/status/1953522772308545574

Update your Grok app, Imagine is now live on Android. https://x.com/AndrewCurran_/status/1953177772719059224

Very proud of us @xai after seeing the GPT-5 release. With a much smaller team, we are ahead in many areas. Grok 4 is the world’s first unified model, and crushing GPT-5 in benchmarks like ARC-AGI. @OpenAI is a very respectful competitor and still the leader in many, but we’re fast and… https://x.com/Yuhu_ai_/status/1953551132921671712

@BasedBeffJezos It’s high time we open sourced Grok 2. Will make it happen next week. We’ve just been fighting fires and burning the 4am oil nonstop for a while now. https://x.com/elonmusk/status/1952988026617119075
