Image created with gemini-2.5-flash-image, with the prompt written by claude-sonnet-4-5. Image prompt: Photorealistic wide shot of six limestone Ionic columns with completed classical entablature inscribed ‘BENCHMARKS’ in carved Roman serif letters, golden hour light, vintage brass surveying theodolite and measuring chains arranged on green grass foreground, red brick campus buildings background, clear blue sky, sharp architectural photography, warm beige stone texture, long soft shadows.

Anthropic just posted another banger guide, this one on building more efficient agents that handle more tools with leaner token usage. A must-read for AI devs (bookmark it). It helps with three major issues in AI agent tool calling: token costs, latency, and tool… https://x.com/omarsar0/status/1986099467914023194

@Kimi_Moonshot Congratulations to the entire Moonshot team — today is a great day for open source everywhere. We’re excited to continue supporting Kimi models with fast inference on Baseten. https://x.com/basetenco/status/1986494013109903362

@QuixiAI @Kimi_Moonshot a single H200 node is enough 😃 https://x.com/vllm_project/status/1986626058897269070

📢 New Model(s) Drop: Kimi K2 Thinking and Kimi K2 Thinking Turbo are now on Yupp! This pair of thinking models from @Kimi_Moonshot specialize in deep reasoning tasks. We explored their capabilities with some prompts on Yupp: https://x.com/yupp_ai/status/1986469027997491422

🚀 Hello, Kimi K2 Thinking! The Open-Source Thinking Agent Model is here. 🔹 SOTA on HLE (44.9%) and BrowseComp (60.2%) 🔹 Executes up to 200 – 300 sequential tool calls without human interference 🔹 Excels in reasoning, agentic search, and coding 🔹 256K context window Built https://x.com/Kimi_Moonshot/status/1986449512538513505

🚨 New Open Source Model Update! Touted for its reasoning and coding strengths, Kimi K2 Thinking by @Kimi_Moonshot is now live for both Text and WebDev in Battle, Side by Side and Direct. Bring your toughest prompts! 💪 The last time Kimi K2 was in the Arena with a new model, https://x.com/arena/status/1986482438768673107

5 Thoughts on Kimi K2 Thinking – by Nathan Lambert https://www.interconnects.ai/p/kimi-k2-thinking-what-it-means

70% on SWE-bench Verified and 30% on Terminal-Bench: those are two intuitive thresholds for an “actually useful and not frustrating” coding assistant. Kimi K2 Thinking got 71.3% on SWE-Bench Verified and 47.1% on Terminal-Bench. https://x.com/andrew_n_carr/status/1986538323876454461

Congrats to the Kimi K2 team on the great numbers on our SWE-bench Verified, SWE-bench Multilingual and SciCode benchmarks!! https://x.com/OfirPress/status/1986475891158040760

Kimi AI – Kimi K2 is Live https://www.kimi.com/

Kimi API is barely alive right now: kinda slow (~20 tok/s), and I get quite a few timeouts / network errors when I let the model reason for a long time. https://x.com/scaling01/status/1986476278908920061

Kimi K2 Thinking feels like a big milestone for open-source AI. The first time in a while that open-source gets ahead of proprietary APIs on their big area of focus (agents). Fun to see that it’s happening at a time when the proprietary APIs have the most money/attention. https://x.com/ClementDelangue/status/1986833436607160600

Kimi K2 Thinking https://moonshotai.github.io/Kimi-K2/thinking.html

Kimi K2 Thinking is now available in anycoder https://x.com/_akhaliq/status/1986468663600337125

Kimi K2 Thinking is the new leading open weights model: it demonstrates particular strength in agentic contexts but is very verbose, generating the most tokens of any model in completing our Intelligence Index evals @Kimi_Moonshot’s Kimi K2 Thinking achieves a 67 in the https://x.com/ArtificialAnlys/status/1986911675820446013

Kimi K2 Thinking just launched on Product Hunt! 🥳 Not chasing votes, just using PH as a clean milestone log for our model updates. 🙂 Huge thanks to the helpful team from @ProductHunt https://x.com/crystalsssup/status/1986714377983304137

Kimi-K2 is an exceptional base model: 77% on GPQA Diamond, where GPT-4.5 only got 71.4%. https://x.com/scaling01/status/1986112227875954967

Kimi-K2 Reasoning is coming very soon just got merged into VLLM LETS FUCKING GOOOO im so hyped im so hyped im so hyped https://x.com/scaling01/status/1986071916541870399

Kimi-K2 reasoning is landing soon; it just got merged into vLLM https://x.com/cedric_chee/status/1986073808672067725

Kimi-K2 Thinking ranks 19th on SimpleBench, improving Kimi-K2’s score from 26.3% (rank 33) to 39.6%. This makes it the 3rd-best open-source model on SimpleBench. Other Chinese open-source models like DeepSeek R1 0528 and DeepSeek V3.1 beat it by roughly 1%. https://x.com/scaling01/status/1986846212050362510

Live in Cline: kimi-k2-thinking https://x.com/cline/status/1986512739490275680

MoonshotAI has released Kimi K2 Thinking, a new reasoning variant of Kimi K2 that achieves #1 in the Tau2 Bench Telecom agentic benchmark and is potentially the new leading open weights model Kimi K2 Thinking is one of the largest open weights models ever, at 1T total parameters https://x.com/ArtificialAnlys/status/1986541785511043536

moonshotai/Kimi-K2-Thinking · Hugging Face https://huggingface.co/moonshotai/Kimi-K2-Thinking

ollama run kimi-k2-thinking:cloud Kimi K2 Thinking is Moonshot AI’s best open-source thinking model. Try it on Ollama’s cloud! https://x.com/ollama/status/1986640693108863271

Our first research paper: custom Mixture-of-Experts (MoE) kernels that make deployment of trillion-parameter models like Kimi K2 viable for the first time on AWS EFA https://x.com/AravSrinivas/status/1986106660386222592

Unsurprisingly, Kimi K2 Thinking is already number one trending on HF. The AI frontier is open-source! https://x.com/ClementDelangue/status/1986827413532057712

🚀 Day 0 support: Kimi K2 Thinking now running on vLLM! In partnership with @Kimi_Moonshot, we’re proud to deliver official support for the state-of-the-art open thinking model with 1T params, 32B active. Easy deploy in vLLM (nightly version) with OpenAI-compatible API: What https://x.com/vllm_project/status/1986455911066706160
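An OpenAI-compatible API means a standard chat-completions request works against a local vLLM server. A minimal sketch with the Python standard library; the base URL, port, and served model name below are assumptions for a typical local deployment, not values confirmed by the announcement:

```python
# Sketch: build an OpenAI-compatible chat request for a local vLLM server.
# The endpoint, port, and model id are assumptions for a typical local
# deployment; adjust to match your own vLLM launch flags.
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a /v1/chat/completions request in the OpenAI wire format."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request(
    "http://localhost:8000",            # assumed default vLLM port
    "moonshotai/Kimi-K2-Thinking",       # model id as served
    "Summarize the benefits of INT4 QAT.",
)
# req is ready to pass to urllib.request.urlopen(req) against a running server.
```

Any OpenAI SDK client pointed at the same base URL would produce an equivalent request.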

It even compares with GPT-5 Pro on some benches. Looks like Kimi’s interpretation of Pro mode is 8 samples + self-reflection. https://x.com/nrehiew_/status/1986453238552666320

From my tests, Kimi K2 Thinking is better than everything xAI, Anthropic, and Google have to offer atm. The only things better are GPT-5 Codex (at code) and GPT-5 Pro (at high-level algorithm design). It beats the SOTA at creative writing by a mile. Good work. https://x.com/karmay007/status/1986454592809529493

IndQA is a new benchmark designed to evaluate how well AI systems understand culture, context, and history to answer questions that matter to people in India. With 2,278 questions created in partnership with 250+ experts, IndQA dives deep into reasoning about everyday life… https://x.com/snsf/status/1985719755551158754

“ChatGPT-o1 & DeepSeek-R1 achieved diagnostic accuracy up to 93.75%. For context, this figure approaches the 96% accuracy benchmark reported for primary care physicians on the same vignette set.” Except they told folks to get urgent care too often. Not unexpected given alignment… https://x.com/emollick/status/1985164511947682070

Whisper no longer wears the open weights transcription accuracy crown with new entrants achieving better Artificial Analysis Word Error Rate scores Once considered the default choice for open weights transcription, OpenAI’s Whisper has now been surpassed by newer open weights https://x.com/ArtificialAnlys/status/1986100695989145649

Introducing IndQA — a new benchmark that evaluates how well AI systems understand Indian languages and everyday cultural context. https://x.com/OpenAI/status/1985950264525013210

🚨 WebDev Leaderboard Update MiniMax-M2 from @MiniMax__AI has landed as the #1 open model! A 230B MoE model with 10B-active-parameters, it’s an open source model built for efficient, high-performance coding, reasoning, and agentic-style tasks. It also ranks #4 in WebDev https://x.com/arena/status/1985465603206107318

Introducing SWE-1.5, our fast agent model. It achieves near-SOTA coding performance while setting a new standard for speed. Now available in Windsurf. https://x.com/windsurf/status/1983667319944712460

Thanks @_akhaliq for sharing our work! 🚀 Glad to introduce our newest work — VCode! 🎨 VCode: A Multimodal Coding Benchmark with SVG as Symbolic Visual Representation. For decades, RGB pixels have been the default medium for representing images. But in the agentic era, how can we… https://x.com/KevinQHLin/status/1986126304316411928

Today we’re releasing SWE-1.5, our fast agent model. It achieves near-SOTA coding performance while setting a new standard for speed. Now available in @windsurf. https://x.com/cognition/status/1983662836896448756

VCode a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation https://x.com/_akhaliq/status/1986073575216824650

Real shortage of good papers, even working papers, testing agentic (post-o3) and Deep Research AI outputs in law, medicine, business, coding, etc. Assume when a research paper discusses AI it means GPT-4o (with occasional Gemini 2.5 or o1) for the next year or so. https://x.com/emollick/status/1985417078749479212

Ant AQ-Team @AQ_MedAI @TheInclusionAI and SGLang RL Team @sgl_project just helped land Kimi-K2-Instruct RL on slime — fully wired up and running on 256× H20 141GB 🚀 Huge shout-out to @yngao016, @menlzy, @Yonah_x from the AQ Team and @Ji_Li_233, @Yefei_RL from the SGLang RL Team for… https://x.com/slime_framework/status/1986811354502906304

Fixed the token generation speed on https://x.com/Kimi_Moonshot/status/1986754111992451337

Here’s the command I ran: `mlx.launch --hosts first.ip,second.ip --env MLX_METAL_FAST_SYNCH=1 mlx-lm/mlx_lm/examples/pipeline_generate.py --model mlx-community/Kimi-K2-Thinking --prompt "Write an HTML and JavaScript page implementing space invaders" -m 16384` PR here: https://x.com/awnihannun/status/1986602098017116357

The new 1-trillion-parameter Kimi K2 Thinking model runs well on 2 M3 Ultras in its native format — no loss in quality! The model was quantization-aware trained (QAT) at INT4. Here it generated ~3500 tokens at 15 tok/s using pipeline parallelism in mlx-lm: https://x.com/awnihannun/status/1986601104130646266

Gemini 3.0 Ultra or Gemini 3.0 Pro? Which is it, and why do you think that? It sounds too big for Pro but too small for Ultra, but since models just get sparser and sparser I believe it’s Pro and very similar to Kimi-K2 1.2T@30B. Also, as of right now, Ultra is still just a… https://x.com/scaling01/status/1986161974883860486

Team from Ant Group @TheInclusionAI helped land the Kimi model @Kimi_Moonshot on @Zai_org’s slime framework! Open AIs help Open AIs ♥️ CN AIs help CN AIs ♥️ https://x.com/bigeagle_xd/status/1986815075785879723

Today we’re announcing ARC Prize Verified, a program to increase the rigor of evaluating frontier systems on ARC-AGI This program adds a third-party academic panel to audit our testing process We are also welcoming 5 new AI labs as sponsors of ARC-AGI-3 https://x.com/arcprize/status/1985802145300693140

I don’t think people are tracking how quickly this is happening, for better or worse. https://x.com/emollick/status/1985132904771399899

🚀 Introducing Arena Expert: a new LMArena evaluation framework to identify the toughest, most expert-level prompts from real users, powering a new Expert leaderboard. We also introduce Occupational Categories that underlie eight new leaderboards: 💻 Software & IT Services ✍️ https://x.com/arena/status/1986153162802368555

Congrats on the launch @nebiusai! Live benchmarks are available on Artificial Analysis for all the language model APIs served on Nebius Token Factory: https://x.com/ArtificialAnlys/status/1986174888080789509

Individual results across all evaluations in the Artificial Analysis Intelligence Index: MMLU-Pro, GPQA Diamond, Humanity’s Last Exam, LiveCodeBench, SciCode, AIME 2025, IFBench, AA-LCR, Terminal-Bench Hard, 𝜏²-Bench Telecom https://x.com/ArtificialAnlys/status/1986911685496746404

Interesting new AI benchmark combining elements of game environment testing with world model testing that finds large gaps between human and AI ability (& some behavioral differences as well) We need more grounded, unsaturated & hard benchmarks like this. https://x.com/emollick/status/1983914683976265755

It’s rare nowadays to find something that is intuitively important and not yet done well by any major language models. But *precisely aggregating lots of information over long contexts* is one of those things. Our new benchmark Oolong tests this ability, see the 🧵 for more! https://x.com/gneubig/status/1986851194862510102

New eval! Code duels for LMs ⚔️ Current evals test LMs on *tasks*: “fix this bug,” “write a test.” But we code to achieve *goals*: maximize revenue, cut costs, win users. Meet CodeClash: LMs compete via their codebases across multi-round tournaments to achieve high-level goals. https://x.com/jyangballin/status/1986093902122942700

ok we’re at 51% with “heavy” mode > Heavy Mode: K2 Thinking Heavy Mode employs an efficient parallel strategy: it first rolls out eight trajectories simultaneously, then reflectively aggregates all outputs to generate the final result. https://x.com/eliebakouch/status/1986447441668022471
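The Heavy Mode recipe described above (parallel rollouts, then aggregation) can be sketched with a stub sampler. Note the real K2 system aggregates reflectively with the model itself; here a simple majority vote stands in, and `sample_trajectory` is a hypothetical placeholder:

```python
# Sketch of a Heavy-Mode-style strategy: run n rollouts concurrently,
# then aggregate their final answers. sample_trajectory is a stub for a
# real model call; the majority vote stands in for reflective aggregation.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def sample_trajectory(prompt: str, seed: int) -> str:
    """Stub for one independent model rollout returning a final answer."""
    return "42" if seed % 8 != 0 else "41"  # one dissenting rollout

def heavy_mode(prompt: str, n: int = 8) -> str:
    # Roll out n trajectories simultaneously...
    with ThreadPoolExecutor(max_workers=n) as pool:
        answers = list(pool.map(lambda s: sample_trajectory(prompt, s), range(n)))
    # ...then aggregate all outputs into a single final result.
    return Counter(answers).most_common(1)[0][0]

final = heavy_mode("What is 6 * 7?")  # majority of the 8 rollouts agree
```

With a real model, the aggregation step would feed all eight candidate outputs back to the model for a reflective synthesis rather than voting.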

🏆NEW LMARENA LEADERBOARDS🏆 🤓Experts 💻 Software & IT Services ✍️ Writing, Literature, & Language 🔬 Life, Physical, & Social Science 🎭 Entertainment, Sports, & Media 📈 Business, Management, & Financial Ops 🧮 Mathematical ⚖️ Legal & Government 🩺 Medicine & Healthcare https://x.com/ml_angelopoulos/status/1986154276499104186

We looked at OSWorld, a popular evaluation of AI computer use capabilities. Our findings: tasks are simple, many don’t require GUIs, and success often hinges on interpreting ambiguous instructions. The benchmark is also not stable over time. See thread for details! https://x.com/EpochAIResearch/status/1985441059032478172

Remote Labor Index: Measuring AI Automation of Remote Work https://arxiv.org/pdf/2510.26787

When Visualizing is the First Step to Reasoning MIRA, a Benchmark for Visual Chain-of-Thought https://x.com/_akhaliq/status/1986075520962793672

If you’d like to win your own Dell Pro Max with GB300, we’re launching a new kernel competition with @NVIDIAAI @sestercegroup @Dell to optimize NVF4 kernels on B200. 2025 has seen a tremendous rise of Pythonic kernel DSLs; we got on-prem hardware to have reliable ncu benchmarking… https://x.com/GPU_MODE/status/1985436876384453128

We’ve released an early preview of Qwen3-Max-Thinking, an intermediate checkpoint still in training. Even at this stage, when augmented with tool use and scaled test-time compute, it achieves 100% on challenging reasoning benchmarks like AIME 2025 and HMMT. You can try the… https://x.com/Alibaba_Qwen/status/1985347830110970027

Super happy to see the next iteration of ViDoRe: ViDoRe v3 is built on human-created examples, covers more realistic RAG scenarios (including open-ended and multi-hop queries), and should be your new default benchmark for multimodal retrieval! 👀 Congrats to the team! https://x.com/tonywu_71/status/1986047154620633370

🚀 new 🌤️ lighteval release and our biggest yet! • new benchmark finder to explore all available tasks • inspect-ai integration from @AISecurityInst → more stable and easier to add benchmarks • share your evals and insights with the community on the @huggingface hub • new https://x.com/nathanhabib1011/status/1985720151673880923

Google DeepMind release: Towards Robust Mathematical Reasoning Introduces IMO-Bench, a suite of advanced reasoning benchmarks that played a crucial role in GDM’s IMO-gold journey. Vetted by a panel of IMO medalists and mathematicians. IMO-AnswerBench – a large-scale test on https://x.com/iScienceLuvr/status/1985685404276965481

While human expert evaluation remains the gold standard for mathematical proofs, its cost and time intensity limit scalable research. To address this, we built #ProofAutoGrader, an automatic grader for IMO-ProofBench. The autograder leverages Gemini 2.5 Pro, providing it with a https://x.com/lmthang/status/1985772094085595570

Continuing our IMO-gold journey, I’m delighted to share our #EMNLP2025 paper “Towards Robust Mathematical Reasoning”, which tells some of the key stories behind the success of our advanced Gemini #DeepThink at this year’s IMO. Finding the right north-star metrics was highly… https://x.com/lmthang/status/1985760224612057092

#1 on the MTEB multilingual leaderboard. https://x.com/fdaudens/status/1984541314063446191

> All benchmark results are reported under INT4 precision. Do you understand what a flex this was. They go toe to toe with GPT-5 on the heaviest, longest-range tasks, with hundreds of tool calls. ALL IN INT4. «Convert to fp8 if you need» Frontier lab. https://x.com/teortaxesTex/status/1986612178133123165

It’s SOTA, not only open-weights SOTA :) https://x.com/crystalsssup/status/1986627840310452366

Qwen3-VL Accuracy Differences on Ollama vs MLX Video: https://x.com/andrejusb/status/1985612661447331981

Butter-Bench: Evaluating LLM Controlled Robots for Practical Intelligence | Andon Labs https://andonlabs.com/evals/butter-bench

The “AI will replace radiologists” prediction remains a rich example. A lot of folks have pointed out the problems with confusing a task (“reading a scan”) with a job (“radiologist”) with many tasks. That is true. But there was a human problem. Radiologists rejected (pre-LLM) AI… https://x.com/emollick/status/1984696156140470530

Did you know that “OSWorld” does not really exist and everyone benches a different set of prompts, so you can’t really compare scores? …yeah, I couldn’t believe it either 🙃 Full review: https://x.com/xeophon_/status/1985441764132499883

the scores are insane, very cool to see native INT4 quantization for the MoE layers > “To overcome this challenge, we adopt Quantization-Aware Training (QAT) during the post-training phase, applying INT4 weight-only quantization to the MoE components. It allows K2 Thinking to…” https://x.com/eliebakouch/status/1986451219892646124
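For intuition on what INT4 weight-only quantization means numerically, here is a toy symmetric round-trip in NumPy. The group size and per-group scaling scheme are illustrative assumptions, not Moonshot’s actual QAT recipe (which learns to tolerate this error during training rather than applying it after the fact):

```python
# Toy symmetric INT4 weight quantization round-trip.
# Group size and per-group max-abs scaling are illustrative assumptions;
# this is not Moonshot's actual quantization-aware training recipe.
import numpy as np

def quantize_int4(w: np.ndarray, group: int = 32):
    """Quantize weights to signed INT4 values in [-8, 7] with per-group scales."""
    w = w.reshape(-1, group)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # map max |w| to 7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Reconstruct float weights from INT4 codes and per-group scales."""
    return (q.astype(np.float32) * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, s = quantize_int4(w)
w_hat = dequantize_int4(q, s)
err = np.abs(w - w_hat).max()  # bounded by half a quantization step per group
```

Each weight is stored in 4 bits plus a shared scale per group, roughly a 4x memory saving over FP16, which is what makes serving a 1T-parameter model on modest hardware plausible.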

Someone from xAI reached out and asked me to retest grok-4-fast, because they’ve improved the injected system prompts. Huge improvement! grok-4-fast-reasoning: 77.5% -> 94.1%; grok-4-fast-non-reasoning: 77.9% -> 97.9%. I really appreciate that xAI takes this topic seriously. https://x.com/xlr8harder/status/1986728144712380682

The results are in. LTX-2 is now ranked #3 video model on @ArtificialAnlys Video Arena. No surprise here. LTX-2 delivers on quality and speed. Huge credit to the LTX team here at @lightricks for making it happen. https://x.com/LTXStudio/status/1986442720534016449

New, extremely challenging visual reasoning benchmark “MIRA” where current models fail… great resource for researching reasoning with images/video 🌌 https://x.com/Muennighoff/status/1986519726823211129

Discover more from Ethan B. Holland
