Image created with OpenAI GPT-Image-1. Image prompt: 1966 Kodachrome photo-look, thin white frame, forest-green title band in upper left with stacked yellow/white serif text reading “BENCHMARKS”, sun-flared late-afternoon palette scene featuring a clipboard with a colorful bar chart labelled “Benchmarks”; gentle film grain, overcast daylight

How @Uber used LangGraph to build AI developer agents that generate thousands of daily code fixes and saved 21,000+ hours — serving an organization of 5,000 developers working with hundreds of millions of lines of code. Watch their full session here: https://x.com/LangChainAI/status/1932493346498543898

Apple doesn’t report benchmarks for their AIs, instead reporting an ill-documented head-to-head evaluation. But even by their own standards, Apple’s latest on-device models are mostly worse than the open Gemma 3-4B from Google or Qwen 3-4B, and their server LLM is similar to Llama 4 Scout. https://x.com/emollick/status/1932420903515590997

Qualcomm strengthens AI portfolio with $2.4 billion Alphawave deal | Reuters https://www.reuters.com/world/uk/qualcomm-acquire-uks-alphawave-24-billion-2025-06-09/

o3 (left) got an idle question wrong, o3-pro nailed it. Good first impression 🙂 (Q: If a full-size crossbow shoots 160m, estimate what a half-size replica would shoot…) https://x.com/johnowhitaker/status/1932821323979632783

RT @WesRothMoney: o3 pro one-shotted the Tower of Hanoi 10 disk problem (one of the more contested problems in Apple’s “The Illusion of Th…” https://x.com/code_star/status/1932679839682867296

HOLY SHIT IT’S FUCKING REAL LET THE PRICE WARS BEGIN OpenAI updated their pricing page. o3 is now cheaper than GPT-4o, but more importantly, cheaper than Sonnet 4 and Gemini 2.5 Pro I would cry if I were Anthropic and Google! https://x.com/scaling01/status/1932488441100468438

OpenAI just killed Claude 4 and Gemini 2.5 Pro if that 80% price drop is true (docs still show old pricing). It would also mean o3 would be cheaper than GPT-4o? https://x.com/scaling01/status/1932437241592152161

After the o3 price reduction, we retested the o3-2025-04-16 model on ARC-AGI to determine whether its performance had changed. We compared the retest results with the original results and observed no difference in performance. https://x.com/arcprize/status/1932836756791177316

ARC-AGI-1 results for o3-pro and o3-high are in: o3-pro (high) does not beat o3-high despite being slightly more than 8 times as expensive. https://x.com/scaling01/status/1932539254703321399

ARC-AGI-2 results don’t look good for o3-pro (high): it does not beat o3-high despite being 9 times more expensive. https://x.com/scaling01/status/1932539573432684779

Been playing with o3-pro for a bit. It is quite smart. One problem it solved where every other model has failed is making word ladder from SPACE to EARTH. (Probably not contamination: the answer is different than the only online answer, which is for EARTH to SPACE in any case) https://x.com/emollick/status/1932533635984355792
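The word-ladder task is easy to state programmatically, which makes it a nice probe. A minimal breadth-first-search sketch of the puzzle type (the word list and the COLD→WARM instance here are illustrative stand-ins, not the SPACE→EARTH ladder from the tweet):

```python
from collections import deque

def word_ladder(start, goal, words):
    """Shortest chain of words where each step changes exactly one letter."""
    pool = set(words) | {start, goal}

    def neighbors(w):
        # Words in the pool differing from w in exactly one position.
        return [x for x in pool
                if x != w and sum(a != b for a, b in zip(w, x)) == 1]

    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbors(path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no ladder exists within this word list

print(word_ladder("cold", "warm", ["cord", "card", "ward", "warm", "cold"]))
# → ['cold', 'cord', 'card', 'ward', 'warm']
```

BFS guarantees the shortest ladder; the hard part for an LLM is the constrained search over a real dictionary, which is why it works as an informal model test.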

o3-pro is much stronger than o3. https://x.com/gdb/status/1932561536268329463

o3-pro is rolling out now for all chatgpt pro users and in the api. it is really smart! i didn’t believe the win rates relative to o3 the first time i saw them. https://x.com/sama/status/1932532561080975797

RT @GregKamradt: After the o3 price drop it made sense to test it on SnakeBench wow – it’s the new #1 model (out of 71 tested) It made th… https://x.com/imjaredz/status/1932898036466004317

The 80% price drop of o3 came with no performance trade-offs. https://x.com/emollick/status/1932846451681337674

i like this take: “The plan o3 gave us was plausible, reasonable; but the plan o3 Pro gave us was specific and rooted enough that it actually changed how we are thinking about our future.” https://x.com/sama/status/1932533208366608568

o3 was considerably less verbose in responses on our Artificial Analysis Intelligence Index eval set than Gemini 2.5 Pro & DeepSeek R1, but more verbose than Claude 4 Opus. https://x.com/ArtificialAnlys/status/1932489580592435301

In expert evaluations, reviewers consistently prefer OpenAI o3-pro over o3, highlighting its improved performance in key domains—including science, education, programming, data analysis, and writing. Reviewers also rated o3-pro consistently higher for clarity, comprehensiveness, https://x.com/OpenAI/status/1932530411651150013

New paper shows a familiar result on LLMs & medicine: Doctors given clinical vignettes produce significantly more accurate diagnoses when using a custom GPT built with the (obsolete) GPT-4 than doctors with Google/Pubmed but not AI. Yet AI alone is as accurate as doctors + AI. https://x.com/emollick/status/1931907652118069510

This graph shows a steep decline in organic search traffic. https://x.com/fdaudens/status/1932501681628905788

What an incredible trajectory of performance improvements for the reasoning models since the original o1-preview! 60%+ win rates are comparatively huge, and few model upgrades achieved this historically. https://x.com/BorisMPower/status/1932556016455201145

What “Working” Means in the Era of AI Apps | Andreessen Horowitz https://a16z.com/revenue-benchmarks-ai-apps/

ScreenSuite – The most comprehensive evaluation suite for GUI Agents! https://huggingface.co/blog/screensuite

Evals now supports tool use. 🛠️ You can now use tools and Structured Outputs when completing eval runs, and evaluate tool calls based on the arguments passed and responses returned. This supports tools that are OpenAI-hosted, MCP, and non-hosted. Read more in our guides below. https://x.com/OpenAIDevs/status/1932169029147557924

How do you evaluate Voice Agents? This is the talk for you, with @kwindla. He provides code/a GitHub repo & does some fun demos in this talk (links in the YT description). https://x.com/HamelHusain/status/1932204210994704625

📊 Benchmarking Multi-Agent Architectures As more systems become multi-agent, this raises the question: how do you best orchestrate across multiple agents? We did some initial benchmarking, including some improvements to our supervisor approach. Blog: https://x.com/LangChainAI/status/1932825652312600810

Glass with Deep Reasoning achieves new state-of-the-art performance on common clinical benchmarks. ✅ 97% on USMLE Steps 1–3 ✅ 98% on JAMA Clinical Challenge cases ✅ 90% on NEJM Clinicopathologic Case Conferences Available to clinicians at https://x.com/GlassHealthHQ/status/1933291603906736328

Super impressive to see the new Gemini 2.5 Pro (06-05) climbing the public leaderboards! 🚀  > Best model at 192k tokens on Live Fiction > #1 on SimpleBench with 62.4% > Strongest Document Processing model in IDP > Best cost-performance on Aider Kudos to everyone https://x.com/_philschmid/status/1932723220379049999

The new Gemini 2.5 Pro is SOTA at long context, especially capable on higher number of items being retrieved (needles) as shown below! https://x.com/OfficialLoganK/status/1931078494337073409

RT @xennygrimmato_: Gemini 2.5 Pro solved *all* JEE Advanced 2025 problems from the Mathematics Section (both Paper 1 and Paper 2)! * Goo… https://x.com/dilipkay/status/1932754214469402630

Interesting argument: solving the Tower of Hanoi requires thousands of moves. Reasoning models are trained so that they will not “think” long enough to solve the puzzle at high complexity, not because LLMs can’t do it, but because deployed models are built to not think that long. https://x.com/emollick/status/1932096155049169155

This is a fantastic application of applied interpretability! When using LLMs to review resumes, prior debiasing techniques break in more realistic settings. But simply finding and removing gender or race directions remains effective, beating existing baselines! https://x.com/NeelNanda5/status/1933645976889422110

I replicated the Anthropic alignment faking experiment on other models, and they didn’t fake alignment – LessWrong 2.0 viewer https://www.greaterwrong.com/posts/pCMmLiBcHbKohQgwA/i-replicated-the-anthropic-alignment-faking-experiment-on

eval work and staring at data are both incredibly important and incredibly boring. https://x.com/finbarrtimbers/status/1933278968859468161

Get 2x faster reward model serving and sequence classification inference through @UnslothAI! Nice benchmarks Kyle! https://x.com/danielhanchen/status/1932965003621204391

Interesting findings on potential limitations of reasoning models. https://x.com/emollick/status/1930720378361672130

my timeline is flooded with “haiku is cracked” tweets so i got out my spreadsheet again to see if this is legit. dear reader, haiku officially cracked the cost-elo frontier https://x.com/swyx/status/1772799201023557697

We Made Top AI Models Compete in a Game of Diplomacy. Here’s Who Won. https://every.to/diplomacy

Chris is joining us for our next (and last!) live evals course next month! https://x.com/HamelHusain/status/1932657675294421061

Big question in AI is whether new entrants can still hope to reach the state-of-the-art, or whether learning curve plus compute needs are high enough that this is impossible. xAI did it with massive compute & hiring investment. But otherwise is the list of competitors fixed? https://x.com/emollick/status/1932945672300314659

AI + continuously updated personal health data fed to it can change healthcare. Correlations that aren’t obvious to humans jump out fast. AI isn’t YET replacing your doctor. It’s augmenting your ability to ask better questions. Most people don’t track enough data to even ask https://x.com/rohanpaul_ai/status/1931298548831731769

Corporate AI adoption may be leveling off, according to Ramp data | TechCrunch https://techcrunch.com/2025/06/09/corporate-ai-adoption-may-be-leveling-off-according-to-ramp-data/

If I want to compare the generation costs, latency and other attributes for all the video models out there — is there a good resource you would recommend? Interested in everything from the Chinese open weights models to Veo 3 etc. https://x.com/bilawalsidhu/status/1932799827324064026

Announcing Magistral — @MistralAI’s first reasoning model — excelling in domain-specific, transparent, and multilingual reasoning. https://x.com/sophiamyang/status/1932451856447586312

Announcing Magistral, our first reasoning model designed to excel in domain-specific, transparent, and multilingual reasoning. https://x.com/MistralAI/status/1932441507262259564

Magistral | Mistral AI https://mistral.ai/news/magistral

Mistral just released their reasoning models: Magistral-Small and Magistral-Medium Magistral Small is open-source and based on the 24B Mistral-Small 3.1, and can run on a single RTX 4090. Unfortunately it gets crushed by Qwen3-32B and Qwen3-30B-A3B Download Link: https://x.com/scaling01/status/1932445360380612712

Mistral really cooked – 24B, Based on Mistral Small 3.1, Multilingual, 128K context (40k effective), Apache 2.0 licensed! 🔥 Works on MLX, llama.cpp, transformers, vllm and more ⚡ https://x.com/reach_vb/status/1932449015657836730

The Mistral team at it again with Magistral! GRPO with edits: 1. Removed KL Divergence 2. Normalize by total length (Dr. GRPO style) 3. Minibatch normalization for advantages 4. Relaxing trust region Paper: https://x.com/danielhanchen/status/1932451325398413518
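A rough sketch of how those four edits change the GRPO objective. This is only my reading of the bullet points, not Mistral’s actual implementation; the clip range, epsilon, and function names are illustrative placeholders:

```python
def advantages(group_rewards):
    """GRPO-style advantages: center by each prompt-group's mean reward,
    then (edit 3) std-normalize across the whole minibatch, not per group."""
    centered = [x - sum(g) / len(g) for g in group_rewards for x in g]
    mu = sum(centered) / len(centered)
    std = (sum((x - mu) ** 2 for x in centered) / len(centered)) ** 0.5
    return [(x - mu) / (std + 1e-8) for x in centered]

def grpo_loss(ratios, adv, total_tokens, clip_lo=0.8, clip_hi=1.3):
    """Clipped policy-gradient loss with Magistral's edits.
    Edit 1: no KL-divergence penalty term at all.
    Edit 4: a wide, asymmetric clip range 'relaxes' the trust region."""
    per_token = [
        -min(r * a, min(max(r, clip_lo), clip_hi) * a)
        for r, a in zip(ratios, adv)
    ]
    # Edit 2: normalize by the TOTAL token count across the batch
    # (Dr. GRPO style), rather than by each sequence's own length.
    return sum(per_token) / total_tokens
```

The point of edits 2 and 3 is to stop long sequences and low-variance groups from getting implicitly re-weighted; edits 1 and 4 let the policy drift further from the reference model per update.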

DesignBench provides a benchmark for multimodal LLMs evaluating front-end engineering across popular frameworks and tasks like generation, edit, and repair. Methods 🔧: → DesignBench contains 900 real-world webpage samples for HTML/CSS, React, Vue, and Angular frameworks. → https://x.com/rohanpaul_ai/status/1932279554954940445

RT @mattshumer_: I really can’t believe I’m saying this, but for non-code tasks, o3 Pro feels MILES ahead of Claude Opus 4 https://x.com/imjaredz/status/1932657322204987718

my minions returned and they observed no performance difference between o3 versions. https://x.com/scaling01/status/1932839048273670563

OpenAI slashed o3’s price by 80% and dropped a new o3-pro that: —Thinks longer to boost performance —Beats rivals on PhD-level math & science —Can do web search and data analysis (but no image generation or canvas) —Is available to ChatGPT Pro and Team users https://x.com/rowancheung/status/1932694638785122610

pretty embarrassing for OpenAI, at this point one could expect o3 to crush this kind of trickery with contempt. muh ARC-AGI! …I’ve started to fear that as models and datasets get bigger, we’ll be seeing persistent sloppiness because they fail to develop crisp general features. https://x.com/teortaxesTex/status/1933371065863909638

RT @flavioAd: I’ve been secretly testing o3-pro for a while now 👀 Extremely cheaper, faster, and way more precise than o1-pro (and coding… https://x.com/OpenAIDevs/status/1932538094168801492

RT @LechMazur: o3-pro sets a new record on the Extended NYT Connections, surpassing o1-pro! 82.5 → 87.3. This benchmark evaluates LLMs us… https://x.com/SebastienBubeck/status/1932656485341032719

We ran this eval yesterday before the price drop 😆🫠 @OpenAI https://x.com/StringChaos/status/1932642264163180844

BREAKING: Apple just proved AI “reasoning” models like Claude, DeepSeek-R1, and o3-mini don’t actually reason at all. They just memorize patterns really well. Here’s what Apple discovered: (hint: we’re not as close to AGI as the hype suggests) https://x.com/RubenHssd/status/1931389580105925115

Wow, end-to-end omni model from Ant Group: Ming Lite Omni – can hear, speak, and generate images – competitive with GPT-4o 🔥 Some notes on the paper and the release: > GUI tasks: +9% accuracy over Qwen2.5VL-7B on AITZ(EM) > Audio understanding: 6/13 SOTA results on public https://x.com/reach_vb/status/1933458455794229317

Most AI labs talk about merely “augmenting” humans at work. They say this because AI currently falls short, not out of some deep conviction. I’m not here to bullshit anyone. Mechanize’s explicit goal is to automate all work as quickly as possible. https://x.com/tamaybes/status/1932841955542904919

Hackaprompt 2.0 https://www.hackaprompt.com/track/pliny

Discover more from Ethan B. Holland
