Image created with gemini-2.5-flash-image, with a prompt drafted by claude-sonnet-4-5. Image prompt: Wide cinematic view of a massive orbital testing arena in deep space, holographic performance metrics and scoreboards floating in zero gravity, geometric test environments glowing in the distance, cold blue lighting with dramatic rim light, sleek futuristic military aesthetic, high contrast with deep blacks and muted neon highlights, sense of vast scale and clinical precision, Ender’s Game inspired tactical environment

GPT-5, Claude, Kimi, and Gemini: “I can travel back in time to any time before 1500 and change only one thing, what is the single thing you would change, nothing obvious.” https://x.com/emollick/status/1987355374928769395

“I finally reached human-level performance (85%) on ARC-AGI v1 for under $10k and within 12 hours. I use the same multi-agent collaboration with evolutionary test-time compute, now powered by GPT-5 pro with lower parallelism.” https://x.com/jerber888/status/1987982067116777521
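The tweet above gives no implementation details, but the core of evolutionary test-time compute is easy to sketch: sample many candidate solutions, score them against held-out training pairs, keep the fittest, and mutate them into the next generation. In this toy sketch the model call is stubbed as a seeded random guess and fitness is distance to a hidden target; every name here is hypothetical, not the author's actual system.

```python
import random

def propose(seed: int) -> int:
    """Stub for an LLM proposing a candidate solution (here: an integer guess)."""
    random.seed(seed)
    return random.randint(0, 100)

def score(candidate: int, target: int = 42) -> float:
    """Fitness: closeness to a hidden target (stands in for train-pair accuracy)."""
    return -abs(candidate - target)

def evolve(generations: int = 20, pop_size: int = 8) -> int:
    # Initial population of independent candidates.
    pop = [propose(i) for i in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        survivors = pop[: pop_size // 2]
        # Mutate survivors to refill the population (stands in for re-prompting
        # the model with the best candidates as context).
        children = [s + random.choice([-3, -1, 1, 3]) for s in survivors]
        pop = survivors + children
    return max(pop, key=score)

print(evolve())
```

Because survivors are always retained, the best fitness is monotonically non-decreasing across generations; in a real harness the scorer would be the ARC training pairs and the mutation step would be another model call.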

@OpenAI’s GPT-5.1 delivers a solid upgrade from GPT-5 for agentic coding. We’ve noticed that the model is more steerable, overthinks less, and is better at frontend design. The model is also faster on most tasks because it dynamically adjusts reasoning depth based on the… https://x.com/cognition/status/1989081722353529178

We’ve been testing Box AI with GPT-5.1 for the past week to compare it to GPT-5 for enterprise content use-cases. It’s a very strong upgrade from GPT-5. It’s super fast, performing ~2X (or more) faster on our tests on long documents (30,000+ tokens); and we saw an 8 percentage point gain in data extraction from our most challenging documents (across 1,000+ data fields) from a variety of content types. https://x.com/levie/status/1989051715207983511

GPT-5 on Sudoku-Bench 🧩 Since releasing Sudoku-Bench in May 2025, when no LLM could solve a classic 9×9 puzzle, we’ve been evaluating the latest generation of models. GPT-5 now leads our leaderboard with 33% of puzzles solved (approximately 2x the previous leader) and is the first… https://x.com/SakanaAILabs/status/1988080410392404021

GPT-5.1: A smarter, more conversational ChatGPT | OpenAI https://openai.com/index/gpt-5-1/

GPT-5.1 isn’t “GPT-5 but faster.” In our evals of the model, we found it’s the highest-precision model we’ve ever tested for code-related tasks like code review. Less noise, more fixes, reviews that read like patches again. https://x.com/coderabbitai/status/1989035006774354387

GPT-5.1 is a great new model that we think people are going to like more than 5. But with 800M+ people using ChatGPT, one default personality won’t work for everyone. We launched new preset personalities so people can make ChatGPT their own. https://x.com/fidjissimo/status/1988683216681889887

Moving beyond one-size-fits-all – Fidji Simo https://fidjisimo.substack.com/p/moving-beyond-one-size-fits-all

Baidu just dropped an open-source multimodal AI that it claims beats GPT-5 and Gemini | VentureBeat https://venturebeat.com/ai/baidu-just-dropped-an-open-source-multimodal-ai-that-it-claims-beats-gpt-5

🚨 Video Arena leaderboard update! 🎬 There is a new model provider in the Video Arena: Vidu Q2 Turbo and Vidu Q2 Pro by @ViduAI_official have just made their debut with strong initial performance, both landing in the Top 10 for Image-to-Video: 🔹 Vidu Q2 Turbo lands #6 with a… https://x.com/arena/status/1989056583872180298

I keep coming back to GDPval, there is a lot in that paper that sheds light on the coming impact of AI on knowledge work, especially as agentic work starts to become a real thing, replacing the back-and-forth cyborg/centaur prompting we have used for years https://x.com/emollick/status/1988088613125714402

The Next Stage of AI Coding Evaluation Is Here https://news.lmarena.ai/code-arena/

As AIs get smarter & more useful, our benchmarks become less useful. Measuring general knowledge or coding ability gives us only a glimpse into what an AI model can do. Anyone who wants to use AI seriously for real work will need to assess it themselves. https://x.com/emollick/status/1988440050716279110

Most models: think → tool call → think → tool call. K2 Thinking keeps tool calls inside the reasoning trace so multi-step workflows don’t drift. We’ll show how Moonshot post-trained for agentic tool calling and demo complex workflows running in one model call. https://x.com/togethercompute/status/1988009780149878904
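The pattern described above, tool calls appended into one running trace rather than alternating think/act turns, can be sketched in a few lines. The "model" and the tool below are stubs of my own invention, not Moonshot's API; a real agent would replace `stub_model` with a call to K2 Thinking.

```python
def stub_model(trace: str) -> str:
    """Stub policy: decide the next step from the single accumulated trace."""
    if "weather(" not in trace:
        return "CALL weather(Paris)"
    return "DONE It is sunny in Paris."

def weather(city: str) -> str:
    """Stub tool."""
    return f"{city}: sunny"

def run(query: str, max_steps: int = 5) -> str:
    trace = f"user: {query}\n"
    for _ in range(max_steps):
        step = stub_model(trace)
        if step.startswith("CALL "):
            call = step[len("CALL "):]                      # e.g. "weather(Paris)"
            city = call[call.index("(") + 1 : call.index(")")]
            # The tool call AND its result stay inside the one reasoning trace,
            # so the next step sees the full history in a single context.
            trace += f"tool: {call} -> {weather(city)}\n"
        else:
            trace += f"final: {step}\n"
            break
    return trace

print(run("weather in Paris?"))
```

The design point is that nothing is summarized away between steps: each decision conditions on the entire interleaved trace, which is what keeps long multi-step workflows from drifting.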

It turns out that Kimi K2 Thinking is also a beast at deep research. It can run 200-300 tool requests for impressive multi-agent capabilities. Would you like to see a code example of it? https://x.com/omarsar0/status/1987912692099682399

Kimi K2 Thinking is impressive. So I built a multi-agent deep researcher, Kimi Deep Researcher. It generates long research reports on any topic, powered by subagents (web searcher, analyzer, and synthesizer). It can do 100s of tool calls per session. Repo soon! https://x.com/omarsar0/status/1988974710592516454
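The repo above isn't out yet, but the subagent pattern it describes (searcher → analyzer → synthesizer) reduces to an orchestrator handing each stage's output to the next. This is a hypothetical sketch with stubbed subagents; in a real system each `run` would be a separate model call with its own tools.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Subagent:
    name: str
    run: Callable[[str], str]

def searcher(topic: str) -> str:
    # Stub: a real searcher would issue web-search tool calls.
    return f"[search notes on {topic}]"

def analyzer(notes: str) -> str:
    # Stub: a real analyzer would extract claims and evidence.
    return f"[analysis of {notes}]"

def synthesizer(analysis: str) -> str:
    # Stub: a real synthesizer would write the long-form report.
    return f"Report: {analysis}"

def deep_research(topic: str) -> str:
    pipeline = [
        Subagent("searcher", searcher),
        Subagent("analyzer", analyzer),
        Subagent("synthesizer", synthesizer),
    ]
    payload = topic
    for agent in pipeline:
        payload = agent.run(payload)
    return payload

print(deep_research("open-weights reasoning models"))
```

The hundreds of tool calls per session come from the searcher and analyzer stages looping internally; the orchestrator itself stays this simple.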

These are pretty impressive benchmarks from a Chinese open weights model. Especially big is the agentic capability, which has generally lagged in the open weights models. Be interesting to see independent confirmation soon, I found K2 a solid, but kind of weird, model to use. https://x.com/emollick/status/1986452925418270871

🚀 Hello, Kimi K2 Thinking! The Open-Source Thinking Agent Model is here. 🔹 SOTA on HLE (44.9%) and BrowseComp (60.2%) 🔹 Executes up to 200–300 sequential tool calls without human interference 🔹 Excels in reasoning, agentic search, and coding 🔹 256K context window Built… https://x.com/Kimi_Moonshot/status/1986449512538513505

🚀We’re going live with @Kimi_Moonshot on Nov 19 for a technical deep dive on Kimi K2 Thinking Learn about the 1T parameter MoE that allows your AI agent to make 300 tool calls in one run. Register: https://x.com/togethercompute/status/1988009777247510564

from Kimi AMA: – K3 will likely use KDA or some other hybrid attention mechanism – Kimi-K2 will get vision https://x.com/scaling01/status/1987916859400659011

I wonder if part of what makes Kimi K2 Thinking impressive is that it produces a lot more thinking tokens for even minor & non-technical queries than any model I have used. This is the thinking trace for “write me a really good sentence about cheese”; it is 1,595 tokens long! https://x.com/emollick/status/1987286609713107261

Try Kimi-K2-Thinking now on Together AI https://x.com/togethercompute/status/1988011880443470217

I’m sorry, Kimi bros. The problem is and was 100% the OpenRouter API, and it’s starting to piss me off that long reasoning always breaks. Just use the Kimi API for now, not OpenRouter, if you have requests that take a lot of reasoning tokens. Simpler requests work fine with… https://x.com/scaling01/status/1987938809628291168

Since testing Kimi-K2 Thinking I have become very wary of providers on OpenRouter; I might switch to original provider APIs only. They need to do quality testing for every model and provider. https://x.com/scaling01/status/1988399213563236810

Kimi K2 Thinking passes the Lem Test the first time; very few models have done so. Just like Kimi K2, however, this remains a very weird & interesting model in a way that is hard to benchmark. Its writing is often very good but sometimes doesn’t hold up under close investigation. https://x.com/emollick/status/1986552301922738651

Thanks everyone for testing Kimi K2 Thinking and sharing benchmark results! We’ve noticed that benchmark outcomes can vary across providers. Some third-party endpoints show substantial accuracy drops (e.g., 20+ pp), which has negatively affected scores on reasoning-heavy tasks… https://x.com/Kimi_Moonshot/status/1987892275092025635

Kimi AMA on K2 Thinking: 1. $4.6M training cost is not an official number 2. Trained on H800s (nerfed H100s) 3. KDA (Kimi Delta Attention) hybrids with NoPE MLA perform better than full MLA with RoPE 4. Muon scales well to 1T parameters. “There are tens of optimizers and…” https://x.com/Yuchenj_UW/status/1987940704929395187

Test out Kimi K2 Thinking vs. all the frontier models for yourself at: https://x.com/arena/status/1987947224173781185

Testing Kimi K-2 has reminded me of how insane it is that firms picking AIs are treating them as fungible based on benchmarks. Kimi & Grok & Claude & every other model have strengths, quirks & weaknesses that can make a big difference in aggregate. Develop your own benchmarks! https://x.com/emollick/status/1986604851770360213
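"Develop your own benchmarks" is less work than it sounds: a private eval is just a list of task/checker pairs scored against any callable model. This minimal sketch uses a canned stub model (my own invention); to compare providers you would swap in real API clients behind the same callable interface.

```python
def stub_model(prompt: str) -> str:
    """Stand-in for a real model client; answers from a canned table."""
    canned = {"Capital of France?": "Paris", "2+2?": "4"}
    return canned.get(prompt, "I don't know")

# Each task pairs a prompt with a checker over the model's raw output.
TASKS = [
    ("Capital of France?", lambda out: "paris" in out.lower()),
    ("2+2?", lambda out: "4" in out),
    ("Largest planet?", lambda out: "jupiter" in out.lower()),
]

def run_eval(model) -> float:
    """Fraction of tasks whose checker passes on the model's answer."""
    passed = sum(1 for prompt, check in TASKS if check(model(prompt)))
    return passed / len(TASKS)

print(f"stub model accuracy: {run_eval(stub_model):.0%}")
```

Keeping the checks as arbitrary callables (rather than exact-match strings) is what lets the same harness grade free-form answers, and keeping the task list private is what keeps it from being benchmaxxed.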

In our new Expert and Occupational leaderboards: The previous, non-thinking Kimi K2 is ranked #7 for Hard Prompts, particularly excelling in the ‘Legal & Government’ category under the ‘Occupational’ leaderboard, while falling behind in ‘Instruction Following’. Kimi K2 Thinking… https://x.com/arena/status/1987947222299013630

k2 vision is happening. this is not a drill. https://x.com/code_star/status/1987917177417289794

Whenever people ask me, “Is Muon optimizer just hype?” I need to show them this. Muon isn’t just verified and used in Kimi; other frontier labs like OpenAI are using it and its variants. It’s also in PyTorch stable now! https://x.com/Yuchenj_UW/status/1987955443420065816

Latest LisanBench results for Kimi-K2 Thinking: Kimi-K2 Thinking is the best open-source model and 7th best model overall, right between GPT-5 and GPT-5-Mini. Raw scores: Glicko-2 ratings are a better indicator of relative strength. Kimi-K2 Thinking managed to set new high scores… https://x.com/scaling01/status/1987952884927934966

🚨 Leaderboard Update! Kimi K2 Thinking by @Kimi_Moonshot has landed on the Text leaderboard as the #2 open source model (MIT modified), tied for #7 overall. These are real-world results. With only a six-point difference from @Zai_org’s GLM 4.6, the competition is tight. Kimi… https://x.com/arena/status/1987947219224526902

Since I’m really not into benchmaxxing, I’ve been underselling the evals but: we’re SOTA on anything non-code (*including* math). https://x.com/Dorialexander/status/1987977993440936433

Benchmark designers should “train on the test set” to expose exploitable non-visual shortcuts. https://x.com/sainingxie/status/1988019293926080611

New SWE/ML leaderboard just dropped, like WeirdML but with a human baseline. It turns out all LLMs slow you down compared to human experts in ML/HPC optimization tasks (measured by runtime). https://x.com/scaling01/status/1989338806575903109

We’re proud to support @arcprize’s mission to build rigorous interactive benchmarks that measure generalized intelligence https://x.com/NousResearch/status/1988733248693027053

Sudoku-Bench https://pub.sakana.ai/sudoku-gpt5/

A new addition to the ERNIE open-source model family is here! Meet ERNIE-4.5-VL-28B-A3B-Thinking, our lightweight multimodal reasoning model. > 3B active parameters with enhanced semantic alignment between visual and language modalities > Outperforming Gemini-2.5-Pro and… https://x.com/Baidu_Inc/status/1988182106359411178

Great new capability in Databricks powered by our AI research team! We trained a document parsing system that delivers leading quality at 3-5x lower cost and outperforms leading VLMs like GPT-5 and Claude. This is critical to connect AI to so many kinds of data. https://x.com/matei_zaharia/status/1988325177193885885

You can now get more Codex usage from your plan and credits with three updates today: 1️⃣ GPT-5-Codex-Mini — a more compact and cost-efficient version of GPT-5-Codex 2️⃣ 50% higher rate limits for ChatGPT Plus, Business, and Edu 3️⃣ Priority processing for ChatGPT Pro and… https://x.com/OpenAIDevs/status/1986861734619947305?s=20

GPT-5.1 is now live in Warp. It’s much faster (40% faster task completion on a subset of SWE-bench Verified) without compromising quality. GPT-5.1 is available to all Warp users, and is now the default model for all new users. https://x.com/warpdotdev/status/1989049715837829326

OpenAI’s $1 Trillion Infrastructure Spend | Tomasz Tunguz https://tomtunguz.com/openai-hardware-spending-2025-2035/

Everyone complained that the GPT-5.1 release yesterday had no benchmarks; now you have them. Note minor regressions in AIME and Taubench, which increases confidence that this is not benchmarkmaxxing. I think more generally model comms for a consumer AI model lab has to be split… https://x.com/swyx/status/1989047883639980141

Anthropic to Outpace OpenAI in Server Efficiency, Internal Projections Show — The Information https://www.theinformation.com/articles/anthropic-projects-cost-advantage-openai

We did a deep dive into how to evaluate & benchmark LLMs. Read our recent blog to get up to speed on: ⚪️The 5 principles of good LLM benchmarking ⚪️Identifying LLM capabilities and limitations ⚪️Types of model evaluation methods Full rundown: https://x.com/togethercompute/status/1987949723106557975
