Image created with gemini-2.5-flash-image and claude-sonnet-4-5. Image prompt: Seamless repeating wrapping paper pattern in deep navy and antique gold featuring Victorian scientific instruments, ornate measurement scales, precision calipers, and eye chart medallions arranged in elegant damask symmetry with ‘BENCHMARKS’ woven as decorative typography, museum-quality gift wrap design with subtle embossed texture, sophisticated vintage ephemera aesthetic.
Bloom – an open-source agentic tool that auto-generates behavioral evaluations for AI models, by @AnthropicAI. It turns what was once painstaking alignment work into a matter of configuration. Bloom crafts and judges hundreds of scenarios targeting specific traits like https://x.com/TheTuringPost/status/2003629256522498061
We estimate that, on our tasks, Claude Opus 4.5 has a 50%-time horizon of around 4 hrs 49 mins (95% confidence interval of 1 hr 49 mins to 20 hrs 25 mins). While we’re still working through evaluations for other recent models, this is our highest published time horizon to date. https://x.com/METR_Evals/status/2002203627377574113?s=20
Introducing Manus: the first general AI agent. Try Manus today and see the future of human-machine collaboration: https://x.com/ManusAI/status/1897294098945728752
Meta just bought the fastest-growing AI agent company in history for what’s probably $1-2B. The math tells you exactly why Zuckerberg did this deal today. Manus hit $100M ARR in eight months. That’s faster than ChatGPT, faster than Midjourney, faster than any AI product ever. https://x.com/aakashgupta/status/2005815184976417117
Manus Joins Meta: Accelerating AI Innovation for Businesses | Meta for Business https://www.facebook.com/business/news/manus-joins-meta-accelerating-ai-innovation-for-businesses
Meta just acquired Manus AI. Ramp Sheets modeled it out: Estimated price: $4-6B based on AI M&A comps Fastest to $100M ARR in history (8 months) Benchmark likely 8-12x’d in under a year https://x.com/RampLabs/status/2005807066351325470
Also, MSL is hiring in Singapore! We already have some amazing researchers and engineers there, buoyed now by the 100-strong Manus team, and we’re growing fast. Feel free to DM with a resumé if interested! https://x.com/alexandr_wang/status/2005766471516053736
Fun fact: Manus is currently SOTA on the Remote Labor Index (RLI) benchmark that @scale_AI and @ai_risks released earlier this year. https://x.com/alexandr_wang/status/2005766785237410107
Introducing Manus Design View https://manus.im/blog/manus-design-view
Manus Update: $100M ARR, $125M revenue run-rate https://manus.im/blog/manus-100m-arr
Meta just bought Manus for >$1B and it makes sense. ~8 Consumer AI apps hit $100M+ ARR that aren’t big labs: Perplexity: $20B ElevenLabs: $6.6B Lovable: $6.6B Replit: $3B+ Suno: $2.5B Gamma: $2.1B Character: $1B+ Manus: $500M Meta AI has ~no product. This was the cheapest, and https://x.com/deedydas/status/2005798365733478490
We finally had a moment to run our system with GPT-5.2 X-High on ARC-AGI-2! Using the same Poetiq harness as before, we saw results as high as 75% at under $8 / problem using GPT-5.2 X-High on the full PUBLIC-EVAL dataset. This beats the previous SOTA by ~15 percentage points. https://x.com/poetiq_ai/status/2003546910427361402
I would judge this a win by Gemini and a close second from Claude. ChatGPT-5.2 missed the reference (though, to be fair, it did write a surprising amount of successful code to actually enhance the image) and Grok wasn’t in the ballpark. https://x.com/emollick/status/2002961280534303206
So, Claude 4.5 came in far above trend in the much-watched METR measure of the task duration that AI can accomplish autonomously at 4 hours 49 minutes. Interestingly, at the harder 80% success threshold, it is GPT-5.1 Codex Max that breaks the trend. In 2023, GPT-4 was a minute. https://x.com/emollick/status/2002208335991337467
Documents: OpenAI has sold 700K+ ChatGPT licenses to ~35 US public universities for students and faculty, who used it 14M+ times in September, beating Copilot (Bloomberg) https://x.com/Techmeme/status/2001633781388648559
In fact the average ChatGPT query takes almost exactly as much energy as a Google search in 2008 (that is the last time Google clearly indicated the electrical consumption of a search). https://x.com/emollick/status/2003749085468311853
Similarweb: Gen AI Website Traffic Share, Key Takeaways: → Gemini is approaching the 20% share benchmark. → Grok’s momentum continues. → ChatGPT drops below the 70% mark. 🗓️ 12 Months Ago: ChatGPT: 87.2% Gemini: 5.4% Perplexity: 2.0% Claude: 1.6% Copilot: https://x.com/Similarweb/status/2004113864347029663?s=20
I’m on a singular mission to solve the Physical Turing Test for robotics. It’s the next, or perhaps THE last grand challenge of AI. Super-intelligence in text strings will win a Nobel prize before we have chimpanzee-intelligence in agility & dexterity. Moravec’s paradox is a https://x.com/DrJimFan/status/2003879965369290797
Another interesting thing is that ARC-AGI is also highly correlated with every other benchmark, even though it is conceptually trying to measure something different Either all benchmarks are measuring the same thing or else AI just continues to improve on every measure over time https://x.com/emollick/status/2002922711149142293
Benchmarking is crucial to keep track of AI progress. However, benchmarking is hard: each step of the pipeline involves moving parts that can affect the final headline result. Let’s dive deeper: https://x.com/EpochAIResearch/status/2003592566772822516
Benchmarks aside, I judge new models by how many of my Baseten teammates use them daily. GLM 4.7 has become the default coding model for many of them, because of strong reasoning and speed. And GLM 4.7 runs 20% faster on Baseten judging by tok/s and ttft. https://x.com/amiruci/status/2005697292326797740
For all the models we tested, some of the providers had errors that affected the benchmark score. Recently released models are more impacted. https://x.com/EpochAIResearch/status/2003592610569683089
interesting blogpost by our eval-police guy on evaluating providers and why benchmarking is hard https://x.com/dejavucoder/status/2003594248973930929
Why benchmarking is hard | Epoch AI https://epoch.ai/gradient-updates/why-benchmarking-is-hard
I think what has become clear over the past year is that the AGI label was never very useful. It is both true that (1) 2025-era Reasoners would meet many prior definitions of AGI & (2) the idea of a single dimensional “intelligence” factor does not help us understand AI impacts https://x.com/emollick/status/2001530531666628703
LLM Leaderboard for Code Quality & Security | Sonar https://www.sonarsource.com/the-coding-personalities-of-leading-llms/leaderboard/
Understanding AI Benchmarks – by Shrivu Shankar https://blog.sshh.io/p/understanding-ai-benchmarks
Excited to announce that @ManusAI has joined Meta to help us build amazing AI products! The Manus team in Singapore are world class at exploring the capability overhang of today’s models to scaffold powerful agents. Looking forward to working with you, @Red_Xiao_! https://x.com/alexandr_wang/status/2005766469771223106
Meta acquired Manus 👀 https://x.com/scaling01/status/2005768491740360722
18 months ago, I decided to join with @Red_Xiao_ and @peakji on my sofa. No one knew where it would lead. We just kept building, pivoting, and shipping–again and again–until now. Grateful to our team and every user who believed early. Day 1 isn’t over. We’ll keep shipping. https://x.com/hidecloud/status/2005766533910602183
.@OpenAI introduced a rigorous framework for evaluating “chain-of-thought monitorability” It’s a fancy way of asking: Can we understand what our AIs are thinking before they act? The answer: yes, but not without nuance. – Longer reasoning helps – Bigger models muddle things – https://x.com/TheTuringPost/status/2003636642767384639
I think there is likely too much emphasis on the METR long-task measurement as a sign of AI progress… … but it doesn’t matter. With a little help from GPT-5.2 Pro, I calculated the correlations between log(METR) & other key benchmarks, and they basically all correlate highly https://x.com/emollick/status/2002861706658398211
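The correlation exercise Mollick describes can be sketched in a few lines. Everything below is a hedged illustration: the model names, time horizons, and benchmark scores are made-up placeholder numbers, and a plain Pearson correlation between log(time horizon) and benchmark score stands in for whatever analysis he actually ran.

```python
# Sketch: correlate log(METR 50%-time-horizon) with scores on other
# benchmarks across models. All numbers are illustrative placeholders,
# NOT real benchmark data.
import math

# Hypothetical per-model metrics: METR time horizon in minutes,
# plus two made-up benchmark scores.
models = {
    "model_a": {"metr_minutes": 1,   "bench_x": 10, "bench_y": 12},
    "model_b": {"metr_minutes": 30,  "bench_x": 35, "bench_y": 30},
    "model_c": {"metr_minutes": 120, "bench_x": 55, "bench_y": 58},
    "model_d": {"metr_minutes": 289, "bench_x": 70, "bench_y": 66},
}

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Take logs of the time horizons, then correlate against each benchmark.
log_horizon = [math.log(m["metr_minutes"]) for m in models.values()]
for bench in ("bench_x", "bench_y"):
    scores = [m[bench] for m in models.values()]
    r = pearson(log_horizon, scores)
    print(f"corr(log METR, {bench}) = {r:.2f}")
```

On toy data like this, the correlations come out high simply because every metric improves monotonically across the (fabricated) model generations, which is exactly the ambiguity the original tweet points at.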
Context Arena Update: Added @ByteDance’s Seed 1.6 and Seed 1.6 Flash to the MRCR leaderboards. Seed 1.6 closely mimics the retrieval curve of @OpenAI ‘s reasoning models (o3 / o4-mini). It offers high fidelity at start, but follows a similar degradation slope as complexity https://x.com/DillonUzar/status/2005671520488640587
New Data on Code Quality: GPT-5.2 High, Opus 4.5, Gemini 3, and More | Sonar https://www.sonarsource.com/blog/new-data-on-code-quality-gpt-5-2-high-opus-4-5-gemini-3-and-more/
Leading open-weight models on both Artificial Analysis & Yupp Benchmarks 👇 https://x.com/fal/status/2005690259787366844
How to game the METR plot – by Shashwat Goel https://shash42.substack.com/p/how-to-game-the-metr-plot