Image created with Flux Pro v1.1 Ultra. Image prompt: Benchmarks, resort desk with notebook, printed grid comparison chart on a brochure, morning sun through sheer curtains, photorealistic, editorial, minimal, landscape, vacation, no text overlays

📊 @Kimi_Moonshot’s K2-0905 on @GroqInc scored 7th overall at 94% on Roo Code evals, the 1st open-source model to break the 90+ barrier. It’s also the fastest and cheapest in the top 10, while holding its own on accuracy. View the full leaderboard: https://x.com/roo_code/status/1965098976677658630

It feels like the coding agent frontier is now open-weights: GLM 4.5 costs only $3/month and is on par with Sonnet. Kimi K2.1 Turbo is 3x the speed and 7x cheaper vs Opus 4.1, but just as good. Kimi K2.1 feels clean; the best model for me. GPT-5 is only good for complicated specs, and too slow. https://x.com/Tim_Dettmers/status/1965021602267217972

Kimi K2 0905 upgrade: Substantial improvement in agentic capabilities, modest change in overall intelligence. Key takeaways: ➤ Intelligence increased +2 pts in our Artificial Analysis Intelligence Index ➤ Agentic capabilities substantially improved as shown by our two new https://x.com/ArtificialAnlys/status/1965010554499788841

🚨 Leaderboard Disrupted! Two new models have entered the Top 10 Text leaderboard: 🔸#6 Qwen3-max-preview (Proprietary) by @Alibaba_Qwen 🔸#8 Kimi-K2-0905-preview (Modified MIT) by @Kimi_Moonshot tied with 7 others. Note that this puts Kimi-K2-0905-preview in a tight race for https://x.com/arena/status/1965115050273976703

Seedream 4.0 is the new leading image model in both the Artificial Analysis Text to Image and Image Editing Arenas, surpassing Google’s Gemini 2.5 Flash (Nano-Banana) in both! Seedream 4.0 is the latest release from Bytedance Seed, and is a substantial improvement on https://x.com/ArtificialAnlys/status/1966167814512980210

LLMs do many things, to different levels of quality: the “jagged frontier” of ability that my coauthors and I discussed in 2023. One weak part of multimodal LLMs has been seeing fine visual details, so this is an interesting benchmark to watch to follow progress in this area. https://x.com/emollick/status/1964758268930379794

Warp Code launched yesterday — here’s what’s new: – Top coding agent: #3 SWE-bench, 52% Terminal-Bench – Built-in code review – Native editor – Slash Commands, Project Rules, and more We’re already seeing millions more lines of code shipped through Warp. https://x.com/warpdotdev/status/1963683282538688694

MBZUAI and G42 Launch K2 Think: A Leading Open-Source System for Advanced AI Reasoning https://www.prnewswire.com/news-releases/mbzuai-and-g42-launch-k2-think-a-leading-open-source-system-for-advanced-ai-reasoning-302551074.html

⚡️ Efficient weight updates for RL at trillion-parameter scale 💡 Best practice from Kimi @Kimi_Moonshot vLLM is proud to collaborate with checkpoint-engine: • Broadcast weight sync for 1T params in ~20s across 1000s of GPUs • Dynamic P2P updates for elastic clusters https://x.com/vllm_project/status/1965824120920342916

Introducing checkpoint-engine: our open-source, lightweight middleware for efficient, in-place weight updates in LLM inference engines, especially effective for RL. ✅ Update a 1T model on thousands of GPUs in ~20s ✅ Supports both broadcast (sync) & P2P (dynamic) updates ✅ https://x.com/Kimi_Moonshot/status/1965785427530629243
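Neither thread spells out the mechanics, but the in-place idea can be sketched in plain Python: the serving side keeps its parameter buffers alive while the trainer overwrites their contents chunk by chunk, so readers holding references see new weights without any reallocation or engine restart. Everything below (the buffer layout, chunk size, and function names) is a hypothetical illustration, not checkpoint-engine’s actual API.

```python
# Hypothetical sketch of in-place weight updates: the inference engine keeps
# stable references to parameter buffers; an RL trainer streams new values
# into those same buffers in fixed-size chunks (a stand-in for a broadcast).

def make_engine_params():
    # Stand-in for an inference engine's resident weight buffers.
    return {"layer0.w": [0.0] * 8, "layer0.b": [0.0] * 2}

def inplace_update(params, new_weights, chunk=4):
    # Overwrite each buffer's contents chunk by chunk without rebinding it,
    # so any reader holding a reference sees the new values immediately.
    for name, buf in params.items():
        src = new_weights[name]
        assert len(src) == len(buf), "shape mismatch"
        for start in range(0, len(buf), chunk):
            buf[start:start + chunk] = src[start:start + chunk]

engine = make_engine_params()
view = engine["layer0.w"]          # e.g. a reference held by a serving worker
inplace_update(engine, {"layer0.w": [1.0] * 8, "layer0.b": [0.5, 0.5]})
print(view)                        # the existing reference sees the new data
```

The key design point mirrored here is identity preservation: slice assignment mutates the buffer rather than replacing it, which is what lets a running engine pick up updates without re-registering tensors.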

Updated & turned my Big LLM Architecture Comparison article into a narrated video lecture. The 11 LLM architectures covered in this video: 1. DeepSeek V3/R1 2. OLMo 2 3. Gemma 3 4. Mistral Small 3.1 5. Llama 4 6. Qwen3 7. SmolLM3 8. Kimi 2 9. GPT-OSS 10. Grok 2.5 11. GLM-4.5 https://x.com/rasbt/status/1965798055141429523

People are obsessed with this “machine break out of the simulation” story but it’s just not real. This affected a few trajectories in 4 submissions. The bug has now been fixed by @_carlosejimenez. The overall picture and the trends on SWE-bench are not affected at all. https://x.com/OfirPress/status/1966227423252595056

🗣️ Evals now support native audio inputs and audio graders. Evaluate model audio responses, with no text transcription needed. Get started in the Cookbook guide: https://x.com/OpenAIDevs/status/1965923707085533368

Congrats to @Zai_org GLM-4.5 on getting the 7th spot on our SWE-bench Verified [Bash Only] leaderboard! w/ @KLieret @_carlosejimenez @jyangballin https://x.com/OfirPress/status/1965889864395899262

Results – Outperformed GPT-4o on web navigation (26% vs. 16%). – Did very well on deep search (38% vs. GPT-4o’s 26%), even topping some subtasks. – Reached 91% in the TextCraft game and was one of the few models to handle the hardest level. – Hit 96.7% in BabyAI (a simulated https://x.com/omarsar0/status/1966167191805734978

.@_carlosejimenez just merged a PR that fixes the SWE-bench bug that allowed agents to ‘look into the future’. Our analysis showed that this bug was only exploited by a few agents a handful of times. Version update coming soon with a bunch of other extra things! https://x.com/OfirPress/status/1965978758336163907

With the economic value at stake, it would be worthwhile to assemble a few standing bodies of experts to do very fast evaluations for new AI models so that the world doesn’t need to rely on benchmarks that consist of math problems, trivia questions, & the vibes of people like me. https://x.com/emollick/status/1964044908030816378

✨ NEW: Feature update: 🖼️ Image edit models now support multi-turn editing! Instead of trying to fit every edit into one mega-prompt, you can now refine your image step by step. Like a natural back-and-forth conversation. Do it in Battle mode, Side by Side or Direct. Just https://x.com/arena/status/1965150440401809436

Don’t forget: Multi-turn is now available in Image Edit! You don’t have to prompt for a one-shot. Iterate and edit via a natural conversation. https://x.com/arena/status/1965929101799399757

One challenge that no AI model has solved yet is “Create a compelling puzzle that is solvable by players for a D&D game that isn’t boring or trite and where choices matter.” It just involves too much planning & detail. Here, GPT-5 Pro comes very close, but there are still flaws. https://x.com/emollick/status/1964882159157784961

We just released AlgoPerf v0.6! 🎉 ✅ Rolling leaderboard ✅ Lower compute costs ✅ JAX jit migration ✅ Bug fixes & flexible API Coming soon: More contemporary baselines + an LM workload… https://x.com/algoperf/status/1965044626626342993

📢 New Model Drop: Seedream 4.0 is live on Yupp! This image model from ByteDance offers text-to-image generation as well as image editing. We dove in with some prompts: https://x.com/yupp_ai/status/1965827081826422990

🚨 ByteDance just released Seedream 4.0 — how does its AI image generation perform? Zhihu contributors share their feedback. Let’s have a quick view👇 🎨 Trisimo 崔思莫: ➤ Seed 4.0 vs Nano Banana: Different tech paths. • Nano Banana: multimodal, stronger understanding & https://x.com/ZhihuFrontier/status/1965681077231727069

🚨 New Model Alert! ByteDance’s latest Seedream 4 is ready in the Arena! 🖼️Seedream 4 merges the capabilities of Seedream 3 (Text-to-Image) with SeedEdit 3 (Image Edit). Come and test out your hardest Text-to-Image and Image Edit prompts! https://x.com/arena/status/1965929099370889432

DeepSeek V3.1 dynamic @UnslothAI quants on Aider Polyglot benchmarks are here! 1. 3-bit thinking gets 75.6% vs 76.1% un-quantized 2. Leaving attn_k_b in 8-bit gets +2% accuracy vs 4-bit 3. Dynamic quants beat other similar imatrix quants 4. AMA r/LocalLlama today 10AM PST! https://x.com/danielhanchen/status/1965800675105017980
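The pattern in point 2, keeping sensitive tensors such as attn_k_b at higher precision, follows from how round-trip error scales with bit width: fewer bits means a coarser grid and larger rounding error. A minimal sketch using plain symmetric round-to-nearest quantization (not Unsloth’s actual dynamic-quant scheme; the weight values are made up):

```python
# Symmetric round-to-nearest quantization to n bits, then dequantization.
# Fewer bits -> coarser grid -> larger worst-case round-trip error, which is
# why leaving a sensitive tensor (e.g. attn_k_b) in 8-bit can recover accuracy.

def quant_dequant(values, bits):
    levels = 2 ** (bits - 1) - 1            # signed integer range, e.g. 127 for 8-bit
    scale = max(abs(v) for v in values) / levels
    return [round(v / scale) * scale for v in values]

weights = [0.731, -0.402, 0.055, -0.918, 0.264]
for bits in (8, 4):
    err = max(abs(a - b) for a, b in zip(weights, quant_dequant(weights, bits)))
    print(f"{bits}-bit max round-trip error: {err:.5f}")
```

Running this shows the 4-bit error is an order of magnitude larger than the 8-bit error, which is the intuition behind mixing bit widths per tensor.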

We challenged ourselves to build the cleanest, highest-signal factuality benchmark out there. Today, we’re releasing the result: SimpleQA Verified ✅🥇 On this more reliable, 1,000-prompt eval, Gemini 2.5 Pro establishes a new SOTA, outperforming other frontier models. We’re https://x.com/lkshaas/status/1965799946621202719

SimpleQA Verified Benchmark! A New reliable factuality benchmark for measuring knowledge in LLMs from our Research Team @GoogleDeepMind! – Includes exactly 1,000 prompts for evaluating short-form factuality. – Reduced from original 4,326 questions to address incorrect labels. – https://x.com/_philschmid/status/1965806183970652368
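Short-form factuality evals like this typically grade answers with a model judge; as a naive stand-in for illustration only, the scoring loop can be sketched with normalized string matching. The dataset entries below are invented examples, not SimpleQA Verified prompts.

```python
# Naive stand-in for short-form factuality grading: normalize both strings
# and check containment. (SimpleQA-style evals actually use a model grader;
# this only illustrates the shape of the scoring loop.)
import string

def normalize(text):
    # Lowercase, trim, and strip punctuation so trivial formatting
    # differences don't count as errors.
    text = text.lower().strip()
    return text.translate(str.maketrans("", "", string.punctuation))

def grade(prediction, gold):
    return normalize(gold) in normalize(prediction)

evalset = [  # (prompt, gold answer, model prediction) -- invented examples
    ("Who wrote 'Dubliners'?", "James Joyce", "It was written by James Joyce."),
    ("What is the capital of Australia?", "Canberra", "Sydney."),
]
correct = sum(grade(pred, gold) for _, gold, pred in evalset)
print(f"accuracy: {correct}/{len(evalset)}")  # -> accuracy: 1/2
```

The hard part the SimpleQA Verified work addresses is upstream of this loop: making sure the gold labels themselves are correct, which is why the set was reduced from 4,326 to 1,000 prompts.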

4B OCR with Apache-2.0 license outperforming Mistral OCR 🔥 Tencent released Points-Reader, a new model first trained on Qwen2.5VL annotations and then self-trained on real data. On many benchmarks it performs better than Qwen2.5VL and Mistral OCR! https://x.com/mervenoyann/status/1966176133894098944

i’m hiring for a new team @openai: Applied Evals our goal is to build the world’s best evals for the economically valuable tasks our customers care about most. we’ll execute as a group of high-taste engineers, combining hands-on, unscalable efforts with systems that others can https://x.com/shyamalanadkat/status/1965807750916812803

OpenAI referenced Artificial Analysis’ Big Bench Audio benchmark in their recent GPT-Realtime release, where they secured the #1 position with a score of 83% Benchmark context: Big Bench Audio is the first dedicated dataset for evaluating reasoning performance of speech models. https://x.com/ArtificialAnlys/status/1966116575851028970

is “numerical determinism” worth a 60%* latency hit? 🤔 (* = unoptimized upper-bound by some of the best talent in the industry) https://x.com/suchenzang/status/1965914700786622533
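The determinism question exists because floating-point addition is not associative: parallel kernels that reduce in different orders can return different bits on identical inputs, and pinning the reduction order costs parallelism, hence latency. A one-line illustration using Python floats (IEEE 754 doubles):

```python
# IEEE 754 addition is not associative, so a parallel reduction's result
# depends on the order the hardware happens to sum in. "Numerical
# determinism" means pinning that order, at some cost in parallelism.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b, a, b)   # -> False 0.6000000000000001 0.6
```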
