Benchmarks: AI News Week Ending 12/12/2025

Image created with gemini-2.5-flash-image with claude-sonnet-4-5. Image prompt: Cinematic black and white photograph looking directly upward at horizontal stratified cloud layers creating natural bands at different altitudes, high contrast film grain, bold sans-serif text reading BENCHMARKS overlaid in lower third, minimal composition with pure sky background, dramatic tonal gradients between cloud strata suggesting measurement scales

Measuring AI Ability to Complete Long Tasks – METR
https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/

To measure these capabilities, we’re open-sourcing DeepSearchQA, a new benchmark to evaluate agents on complex web search tasks. Deep Research achieves state-of-the-art performance on this benchmark, as well as on the full Humanity’s Last Exam set (reasoning & knowledge), and https://x.com/GoogleDeepMind/status/1999165706231820297

We tested one of the most common prompting techniques: giving the AI a persona to make it more accurate. We found that telling the AI “you are a great physicist” doesn’t make it significantly more accurate at answering physics questions, nor does “you are a lawyer” make it worse. https://x.com/emollick/status/1998063517681799418

Yes, there is a leak. I had investigated this. Some of the ARC-AGI-1 public evaluation examples can be found in the ARC-AGI-2 training examples. So training on both ARC-AGI-1 and ARC-AGI-2 training data is cheating as it leads to crazy good accuracy for ARC-AGI-1. https://x.com/jm_alexia/status/1998487516182467055

Gemini 3 Pro: the frontier of vision AI https://blog.google/innovation-and-ai/technology/developers-tools/gemini-3-pro-vision/

We’ve developed the FACTS Benchmark Suite with @GoogleResearch. 📊 It’s the industry’s first comprehensive test evaluating LLM factuality across four dimensions: internal model knowledge, web search, grounding, and multimodal inputs. https://x.com/GoogleDeepMind/status/1998831084277313539

OpenAI testing new Image-2 models on LM Arena https://www.testingcatalog.com/openai-testing-new-image-2-models-on-lm-arena/

🚨BREAKING: New Model & WebDev Leaderboard Update! GPT-5.2 by @OpenAI has officially made its debut in the Arena, appearing on the WebDev leaderboard. Current leaderboard standings: 🥈 #2 for GPT-5.2-high in WebDev (score: 1486) 🔹 #6 for GPT-5.2 in WebDev (score: 1399) https://x.com/arena/status/1999183339283185878

Poetiq | Traversing the Frontier of Superintelligence https://poetiq.ai/posts/arcagi_announcement/

A year ago, we verified a preview of an unreleased version of @OpenAI o3 (High) that scored 88% on ARC-AGI-1 at est. $4.5k/task. Today, we’ve verified a new GPT-5.2 Pro (X-High) SOTA score of 90.5% at $11.64/task. This represents a ~390X efficiency improvement in one year. https://x.com/arcprize/status/1999182732845547795
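The efficiency figure in the post above can be sanity-checked with a short sketch. It assumes "efficiency" means the ratio of per-task cost only (the accuracy gain from 88% to 90.5% is ignored), and the variable and function names are illustrative, not taken from ARC Prize.

```python
# Per-task costs as quoted in the post; names are illustrative assumptions.
o3_preview_cost_per_task = 4500.00   # "est. $4.5k/task"
gpt_5_2_pro_cost_per_task = 11.64    # "$11.64/task"


def cost_ratio(old_cost: float, new_cost: float) -> float:
    """Return how many times cheaper new_cost is than old_cost."""
    if new_cost <= 0:
        raise ValueError("new_cost must be positive")
    return old_cost / new_cost


ratio = cost_ratio(o3_preview_cost_per_task, gpt_5_2_pro_cost_per_task)
print(f"{ratio:.1f}x cheaper per task")
```

Under that assumption the ratio comes out a little under 390, which is consistent with the rounded "~390X" in the post given that the $4.5k figure is itself an estimate.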

OpenAI’s latest model, GPT-5.2 Thinking, is still not beating Opus 4.5 at SWE-Bench Verified; however, SWE-Bench Pro is looking juicy, with an over 10% higher score than Sonnet 4.5. https://x.com/scaling01/status/1999182909144519019

I meet a lot of very smart AI critics who never seriously try to make AI work for them by spending a couple of hours with a frontier model. People can be (and should be & are) critical after realizing what AI can do, but experience leads to better-informed and sharper critiques. https://x.com/emollick/status/1998398372986736777

We released OfficeQA today — a hard benchmark for evaluating agents on grounded reasoning tasks. More details in our blog https://x.com/bemikelive/status/1998491671609405748

First large-scale field study of how people actually use AI agents in the wild. The hype says 2025 is the year of agentic AI. But systematic behavioral evidence on real-world agent adoption has been almost nonexistent until now. Researchers from Harvard and Perplexity analyzed https://x.com/dair_ai/status/1999117070576058415

Made this video to explain evals https://x.com/HamelHusain/status/1998452926935695649

📈Arena Trends Update We pulled Arena scores for the Top 10 labs since the beginning of 2025, and the top climbers may surprise you. With tighter confidence intervals and new entries in the mix, the Arena continues to shift. Stay tuned for more EOY insights and updates from the https://x.com/arena/status/1998536014000959497

🚨Text Arena Update ERNIE-5.0-Preview-1103 by Baidu @ernieforDevs has landed on the Text leaderboard with a score of 1431 putting it in the top 20 in the most competitive Arena. A few highlights: 🔹scores 1471 in the Software & IT Services Occupational field on par with https://x.com/arena/status/1998437959553716260

ARC Prize – Leaderboard https://arcprize.org/leaderboard

ARC Prize 2025 Results and Analysis https://arcprize.org/blog/arc-prize-2025-results-analysis

Individual AI benchmarks saturate too quickly to give us a long-run trend of AI progress. We can solve this by “stitching” them together. As @ansonwhho explains, this lets us forecast AI capabilities, quantify algorithmic improvements, and detect accelerations in AI progress. https://x.com/EpochAIResearch/status/1998823086473568277

Poetiq | ARC-AGI-2 SOTA at Half the Cost https://poetiq.ai/posts/arcagi_verified/

Takeaway 4: Process-verified outcome rewards mitigate reward hacking and enhance reasoning fidelity. We find that incorporating process verification into outcome rewards delivers: 1) More truthful, error-resistant reasoning and 2) Better generalization on complex, multi-step https://x.com/xiangyue96/status/1998489119660638257

SVG Contest time 🎉 Our biggest contest yet! This one is about prompting AIs on Yupp to produce brilliant SVG outputs. 3 categories, 15 winners, and nearly 1M Yupp credits as prizes! Hosted by renowned prompting master/AI red teamer @chetaslua. Full details in our Discord 👇 https://x.com/yupp_ai/status/1998120413285769302

Interesting study, but this is somewhat unexpected. (green is programming, yellow is role playing) https://x.com/emollick/status/1996758326877868268

Today we’re introducing OfficeQA, a new benchmark grounded in ~89,000 pages of U.S. Treasury Bulletins that reflects the complex, document-heavy tasks enterprises actually face. Unlike existing benchmarks, OfficeQA measures economically valuable, real-world reasoning: parsing https://x.com/databricks/status/1998424470881525822

Directly comparing a benchmark of Devstral2-123B on my hardware to MiniMax-M2 (230B-A10B) shows the difference in performance MoE can give. At 100 requests concurrently: MiniMax is 2x faster. At 2 requests concurrently: MiniMax is 3.5x faster. https://x.com/JustinWaugh/status/1998467712235028888

🎉 Introducing Parallel Coordinated Reasoning (PaCoRe) 📈 An 8B model beats GPT-5 on HMMT25 by unlocking parallel thinking for test-time scaling! 📂 Open-source deep think: data + model + inference code! 🆓 MIT-licensed — use it however you want 🔍Key findings: 1. Message https://x.com/CyouSakura/status/1998344501262533011

Juuuust a bit outside https://x.com/buccocapital/status/1999303168568754348

Gemini 3 Pro continues to be SOTA on most multi-modal benchmarks and use cases! https://x.com/OfficialLoganK/status/1997003665433838026

We just updated our suite of Gemini TTS models 🗣️, they now come with: – Richer tone versatility and stricter adherence to style prompts – Smarter context-aware speed adjustments and better instruction following – Consistent character voices in multi-speaker scenarios https://x.com/OfficialLoganK/status/1998884687457173580

Google tests new Gemini 3 models on LM Arena https://www.testingcatalog.com/google-tests-new-gemini-3-models-on-lm-arena/

It’s not perfect tho. Some post-training might still be needed – I did see a few loops (repeating the same text over and over again) in my testing. Overall this is a SOLID model – especially priced cheaper than gemini-2.5-flash, a model it beats hands down. What a time to be https://x.com/hrishioa/status/1998636284533944725

We evaluated 15 leading models. Gemini 3 Pro achieved the top score of 68.8%. While search and internal knowledge has improved, multimodal factuality remains an industry-wide challenge. We’re sharing these benchmarks on @kaggle to help the research community build more reliable https://x.com/GoogleDeepMind/status/1998831088324473025

Gemini 3 Pro scores 69% trust in blinded testing up from 16% for Gemini 2.5: The case for evaluating AI on real-world trust, not academic benchmarks | VentureBeat https://venturebeat.com/ai/gemini-3-pro-scores-69-trust-in-blinded-testing-up-from-16-for-gemini-2-5

🚀 New InferenceMAX results are live! The team at @NVIDIA has pushed the boundaries of sglang-dsr1-1k1k-FP8 on the @SemiAnalysis_ InferenceMAX dashboard. The new submission delivers: 🔹 20% higher peak throughput 🔹 4260 tok/s/GPU at 30 TPS/user 🔹 Interactivity extended to 102 https://x.com/lmsysorg/status/1998454089903226967

GPT-5.2 is weaker than GPT-5.1 Codex Max on CVE-Bench, an eval that tasks models with identifying and exploiting real-world web application vulnerabilities. https://x.com/scaling01/status/1999186361169871055

An important lesson that ARC-AGI has internalized, but not many others have, is that benchmark perf is a function of test-time compute. @OpenAI publishes single-number benchmark results because it’s simpler and people expect to see it, but ideally all evals would have an x-axis. https://x.com/polynoamial/status/1999189845164667132
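One way to read "all evals would have an x-axis" is to report each result as a (cost, score) pair and keep only the configurations that are not beaten by something cheaper. The sketch below is a minimal illustration under that reading; the `Result` type, the `pareto_frontier` helper, and every data point are hypothetical and are not taken from the post or from any real leaderboard.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    label: str
    cost_per_task: float  # test-time compute proxy, in USD (hypothetical)
    score: float          # benchmark accuracy, in percent (hypothetical)


def pareto_frontier(results: List[Result]) -> List[Result]:
    """Keep results that score strictly higher than every cheaper-or-equal result."""
    frontier: List[Result] = []
    best_score = float("-inf")
    # Sort by ascending cost; on ties, put the higher score first.
    for r in sorted(results, key=lambda r: (r.cost_per_task, -r.score)):
        if r.score > best_score:
            frontier.append(r)
            best_score = r.score
    return frontier


# Made-up configurations, purely for illustration.
results = [
    Result("cheap-config", 1.0, 40.0),
    Result("mid-config", 10.0, 70.0),
    Result("wasteful-config", 20.0, 65.0),
    Result("max-config", 100.0, 90.0),
]

for r in pareto_frontier(results):
    print(f"{r.label}: {r.score}% at ${r.cost_per_task}/task")
```

Here "wasteful-config" drops out because a cheaper configuration already scores higher, which is information a single headline number would hide.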

LisanBench results for GPT-5.2 Thinking: GPT-5.2 Thinking improves over GPT-5 and o3 but does not match other frontier models like Opus 4.5, Gemini 3 Pro, DeepSeek-V3.2 Speciale or Grok 4. GPT-5.2 Thinking improves over GPT-5 in average validity ratio, meaning it’s less likely to https://x.com/scaling01/status/1999240662147825876

The AI Consumer Index (ACE): Most AI benchmarks today focus on reasoning and coding. But most people use AI to shop, cook, and plan their weekends. In those domains, LLM hallucinations continue to be a real problem. 73% of ChatGPT messages (according to a recent report) are now https://x.com/omarsar0/status/1998039629556256995

Announcing GDPval-AA — our leaderboard and evaluation harness for comparing models on OpenAI’s GDPval dataset of real-world knowledge work tasks Earlier today, we announced our agentic harness called Stirrup, which we built to run GDPval tasks on any language model. We’re https://x.com/ArtificialAnlys/status/1998841566627246173

GPT-5.2 is a massive model scoring even higher than Gemini 3 Pro on GPQA Diamond (91.9%) https://x.com/scaling01/status/1999183900673798454

Holy moly, that’s insane: Nomos 1 is a 30B open-source model that just scored 87/120 on this year’s Putnam, good enough for an estimated #2/3988, showing that near-top human math performance is now possible with relatively small models plus good post-training and reasoning https://x.com/kimmonismus/status/1998749650984255985

Putnam, the world’s hardest college-level math test, ended yesterday 4p PT. Noon today, AxiomProver solved 9/12 problems in Lean autonomously (3:58p PT yesterday, it was 8/12). Our score would’ve been #1 of ~4000 participants last year and Putnam Fellow (top 5) in recent years https://x.com/axiommathai/status/1997767850279440715

Jina-VLM achieves state-of-the-art performance among open 2B-scale VLMs, leading with the highest average (72.3) across eight general VQA benchmarks, particularly strong on diagrams, charts, and scene text. Its multilingual capabilities stand out most, achieving best-in-class https://x.com/JinaAI_/status/1997926493456834978