Benchmarks: AI News Week Ending 02/06/2026

Image created with gemini-3.1-flash-image-preview with claude-sonnet-4-5. Image prompt: Flat cartoon illustration of a friendly coral-red lobster mascot wearing a referee whistle, holding a small clipboard with simple bar chart graphics, white speech bubble above saying BENCHMARKS in bold sans-serif font, dark charcoal background with subtle measurement grid lines, minimal web interface aesthetic, high contrast, kawaii mascot style.

Within just 10 months, performance on the ARC-AGI-2 benchmark surpassed 75%. Let that sink in. https://x.com/kimmonismus/status/2018800964891984181

Gemini now processes over 10 billion tokens per minute via direct API use by our customers and the Gemini App just crossed 750M monthly active users : ) https://x.com/OfficialLoganK/status/2019166152199459074

Google’s 52x AI Growth | Tomasz Tunguz https://tomtunguz.com/google-earnings-q4-2025/

Google’s Gemini app has surpassed 750M monthly active users | TechCrunch https://techcrunch.com/2026/02/04/googles-gemini-app-has-surpassed-750m-monthly-active-users/

Our Q4/FY’25 results are in. Thanks to our partners & employees, it was a tremendous quarter, exceeding $400B in annual revenue for the first time. Our full AI stack is fueling our progress, and Gemini 3 adoption has been faster than any other model in our history. We’re really https://x.com/sundarpichai/status/2019155348264042934

We’ve started to measure time horizons for recent models using our updated methodology. On this expanded suite of software tasks, we estimate that Gemini 3 Pro has a 50%-time-horizon of around 4 hrs (95% CI of 2 hr 10 mins to 7 hrs 20 mins). https://x.com/METR_Evals/status/2018752230376210586

The Gemini app hit 750M+ monthly active users in Q4 2025. ChatGPT was reported to have 810M monthly active users by the end of 2025. The gap is shockingly small. Gemini has a real shot at passing ChatGPT. https://x.com/Yuchenj_UW/status/2019157674143936980

It’s so exponential, it literally looks like a wall. GPT-5.2 high sets new record in task duration. And it’s not even xhigh https://x.com/kimmonismus/status/2019174066565849193?s=46

We estimate that GPT-5.2 with `high` (not `xhigh`) reasoning effort has a 50%-time-horizon of around 6.6 hrs (95% CI of 3 hr 20 min to 17 hr 30 min) on our expanded suite of software tasks. This is the highest estimate for a time horizon measurement we have reported to date. https://x.com/METR_Evals/status/2019169900317798857

🔎 Evaluating Deep Agents: Here’s What We Learned 🔎 Deep agents can’t be evaluated like simple LLM tasks. After building and testing 4 production agents over the past few months, we learned that evaluating deep agents requires: 1. Bespoke test logic for each datapoint — each https://x.com/LangChain/status/2018769968515404212

We have been shipping 🛳️❤️ 📦 Community Evals & Benchmark Datasets: Benchmark datasets host benchmark leaderboards, you can now contribute eval results by opening a PR to model repositories, all PRs are fed to benchmark datasets 📦 Chat with datasets: agents live in Data https://x.com/huggingface/status/2019754567685050384

🏆 Agent-Centric Benchmark Results 🟣 SWE-Bench Verified: Qwen3-Coder-Next >70% with the SWE-Agent scaffold 🟣 Efficient but strong: Despite a small active footprint, it matches or exceeds several much larger open-source models on a range of agent benchmarks https://x.com/Alibaba_Qwen/status/2018719026558664987

🚀What Benchmark Design Tells Us About the Result of Step 3.5 Flash? Here’s a detailed breakdown from model infra engineer & Zhihu contributor P2oileen, who worked directly on the benchmarking infrastructure. 💬 “If high scores can’t be reproduced, a tech report is just paper.” https://x.com/ZhihuFrontier/status/2019734062689304970

After my benchmark test, GLM OCR＞paddleOCR1.5＞deepseek OCR2 GLM OCR can even capture some small handwritten characters and is currently the state-of-the-art OCR model. https://x.com/bdsqlsz/status/2018663915404841212

Artificial Analysis released version 4.0 of its Intelligence Index, replacing saturated benchmarks with new tests focused on economically useful work, factual reliability, and reasoning. The update aims to better capture how large language models perform in business contexts, https://x.com/DeepLearningAI/status/2019169092024848512

Eval scores in 2026 are broken. MMLU at 91%+, GSM8K at 94%+, yet models still can’t handle basic multi-step tasks. And reported scores don’t even agree across model cards, papers, and platforms. We just shipped Community Evals on @huggingface: – Benchmark datasets now host live https://x.com/ben_burtenshaw/status/2019795723378942295

I don’t really think we’ll ever get to the point of ‘all benchmarks being saturated’, because we’ll always create harder ones, but we *definitely* aren’t there right now- SWE-bench Multilingual- best score: <80%. SciCode [subquestion]: 56% CritPt: 12% VideoGameBench: 1% https://x.com/OfirPress/status/2019755847149056456

I know it’s probably a great model I just wish they didnt cherry pick their benchmarks so much. Like where is MMLU, HLE, ARC AGI. Too bad @Huggingface shut down their leaderboard, and nobody else has stepped up. We also learned from @AIatMeta that we can’t just take the model https://x.com/QuixiAI/status/2018251816647938051

New SOTA public submission to ARC-AGI: – V1: 94.5%, $11.4/task – V2: 72.9%, $38.9/task Based on GPT 5.2, this bespoke refinement submission by @LandJohan ensembles many approaches together https://x.com/arcprize/status/2018746794310766668

Our DRACO Benchmark is fully open-source and we’re releasing the benchmark, rubrics, and methodology today. To learn more about methodology and detailed results, read the full paper: https://t.co/MDgnQ3E0kO The dataset is available on Hugging Face: https://x.com/perplexity_ai/status/2019126646054482294

very impressive coding model with a nice tech report, it’s only a 3B active MoE, with strong benchmark and hybrid linear attention (Gated DeltaNet) so efficient long context inference https://x.com/eliebakouch/status/2018730622358073384

we released Community Evals to fix transparency in evals 🤝 → Benchmark Datasets host leaderboards → create PRs to add eval result to the leaderboard, link models 🔗 leaderboards GPQA, HLE and MMLU-Pro are live, check how sota models like Kimi 2.5 compare 🙌🏻 https://x.com/mervenoyann/status/2019784907178811644

We tested this with “Oracle” experiments across multiple benchmarks: – Index the corpus at several chunk sizes – Let an oracle pick the best size per query (with ground truth) Result: 20-40% better recall than ANY fixed chunk size. The optimal choice is query-dependent. >> https://x.com/YuvalinTheDeep/status/2018297202066481445
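
The oracle comparison described in that post can be sketched roughly as follows. This is a minimal illustration, not the author's code: the recall table, the chunk sizes, and the helper names `best_fixed` and `oracle_mean` are all hypothetical, and the numbers are made up to show the arithmetic only.

```python
from statistics import mean

# Hypothetical recall per query, measured once per chunk size.
# Each list holds one value per query, in the same query order.
# These numbers are illustrative, not results from the linked post.
recall_by_chunk_size = {
    128: [0.9, 0.4, 0.5, 0.8],
    512: [0.6, 0.8, 0.5, 0.6],
    2048: [0.3, 0.6, 0.9, 0.4],
}

def best_fixed(recalls):
    """Return (chunk_size, mean recall) for the best single chunk size."""
    return max(
        ((size, mean(values)) for size, values in recalls.items()),
        key=lambda pair: pair[1],
    )

def oracle_mean(recalls):
    """Mean recall when an oracle picks the best chunk size for each query."""
    per_query_scores = zip(*recalls.values())
    return mean(max(scores) for scores in per_query_scores)

fixed_size, fixed_recall = best_fixed(recall_by_chunk_size)
oracle_recall = oracle_mean(recall_by_chunk_size)

# Relative improvement of the oracle over the best fixed chunk size.
relative_gain = (oracle_recall - fixed_recall) / fixed_recall
print(f"best fixed size: {fixed_size}, recall {fixed_recall:.3f}")
print(f"oracle recall: {oracle_recall:.3f}, relative gain {relative_gain:.1%}")
```

The oracle needs ground truth to choose, so it is an upper bound on what a query-dependent chunk-size selector could achieve rather than a deployable method.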

@finbarrtimbers DO NOT use FireworksAI to benchmark Kimi – They have failed to make any of it work right, tool calls aren’t parsed, model is shot up somehow in other ways https://x.com/Teknium/status/2018092504613285900

A week after PaddleOCR-VL-1.5 took the top spot on OmniDocBench, *another* 0.9B model dethrones it! GLM-OCR shows SOTA results on doc parsing benchmarks and it’s apparently 50-100% faster https://x.com/jerryjliu0/status/2018713059359899729

🚨 Top 10 Open Models in January: Text Arena Looking back last month, here are the rankings by provider for January: 🥇 #1 Kimi-K2.5-Thinking by @Kimi_Moonshot (Modified MIT) 🥈 #2 GLM-4.7 by @Zai_org (MIT) 🥉 #3 Qwen3-235b-a22b-instruct-2507 by @Alibaba_Qwen (Apache 2.0) https://x.com/arena/status/2018727506850033854

Today, we’re rolling out an Advanced version of Perplexity Deep Research, achieving state-of-the-art performance on external and internal benchmarks, beating every other deep research tool on accuracy, usability, and reliability across all verticals. https://x.com/AravSrinivas/status/2019129261584752909

We’ve upgraded Deep Research in Perplexity. Perplexity Deep Research achieves state-of-the-art performance on leading external benchmarks, outperforming other deep research tools on accuracy and reliability. Available now for Max users. Rolling out to Pro in the coming days. https://x.com/perplexity_ai/status/2019126571521761450

Humanoid whole-body control ASI benchmark https://x.com/TheHumanoidHub/status/2017293983115092168

Introducing the Artificial Analysis Video with Audio Arena! Compare video models with native audio generation including Veo 3.1, Grok Imagine, Sora 2, and Kling 2.6 Pro Since Google’s Veo 3 launched last May as the first major video model with native audio generation, many https://x.com/ArtificialAnlys/status/2019132516897288501

3/ These benchmarks aren’t just a leaderboard: They help us measure how AI models handle real-world skills like planning, communication and decision-making under uncertainty. https://x.com/Google/status/2019094601080992004

We’re advancing AI benchmarking systems by letting them play games 🕹️ Games like Poker & Werewolf in the @Kaggle Game Arena allow us to test AI capabilities and “soft skills” in controlled sandbox environments before they’re deployed 🧵 (1/4) ↓ https://x.com/Google/status/2019094596588839191

📉✂️Image Arena Pareto Frontier: Image Edit Now let’s take a look at image editing. Looking at Arena Score versus price per image lets us see which models sit on the Pareto frontier across both efficient and highly complex image editing. Top models on the Pareto frontier for https://x.com/arena/status/2018792314878234704

📉🖼️Image Arena Pareto Frontier Image use cases vary widely. Sometimes you want the highest quality, and sometimes you need something efficient enough to run at scale. Looking at Arena Score versus price per image lets us see which models sit on the Pareto frontier. Top https://x.com/arena/status/2018787949840896119
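
For readers unfamiliar with the term, a score-versus-price Pareto frontier can be computed with a short sweep. The sketch below is an assumption-laden illustration, not Arena's method: the `Model` type, the `pareto_frontier` helper, and every model name, price, and score are hypothetical placeholders.

```python
from typing import NamedTuple

class Model(NamedTuple):
    name: str
    price: float   # price per image, lower is better
    score: float   # arena-style score, higher is better

def pareto_frontier(models):
    """Keep models that no cheaper-or-equal model matches or beats on score."""
    # Sort by price ascending; break price ties by putting the higher score first.
    ordered = sorted(models, key=lambda m: (m.price, -m.score))
    frontier = []
    best_score = float("-inf")
    for model in ordered:
        # A model is on the frontier only if it strictly improves on the
        # best score seen so far among cheaper (or equally priced) models.
        if model.score > best_score:
            frontier.append(model)
            best_score = model.score
    return frontier

# Hypothetical data for illustration only.
candidates = [
    Model("model-a", 0.01, 1100),
    Model("model-b", 0.02, 1080),  # dominated: pricier and lower than model-a
    Model("model-c", 0.04, 1200),
    Model("model-d", 0.08, 1190),  # dominated: pricier and lower than model-c
]

frontier_names = [m.name for m in pareto_frontier(candidates)]
print(frontier_names)
```

With exact ties on both price and score, this sketch keeps only the first model encountered.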

🚨BREAKING: Kimi K2.5 by @Kimi_Moonshot is now the #1 open model in Code Arena! In Code Arena’s agentic coding evaluations, Kimi K2.5 is now: – #1 open model, surpassing GLM-4.7 – #5 overall, on par with top proprietary models like Gemini-3-Flash – The only open model in the top https://x.com/arena/status/2018355347485069800

We’re introducing WorldVQA, a new benchmark to measure atomic vision-centric world knowledge in Multimodal Large Language Models. Current evaluations often conflate visual knowledge retrieval with reasoning. In contrast, WorldVQA decouples these capabilities to strictly measure https://x.com/Kimi_Moonshot/status/2018697552456257945

BREAKING: @xAI’s Grok-Imagine-Video now #1 in Video Arena! For the first time, Grok-Imagine-Video-720p takes the top spot on the Image-to-Video leaderboard, overtaking Google’s Veo 3.1 while being 5x cheaper. Its 480p version released a few days ago ranks #4. Huge congrats to https://x.com/arena/status/2019204821551837665