Image created with gemini-2.5-flash-image with claude-sonnet-4-5. Image prompt: Photorealistic 35mm cinema shot of a 7-year-old child sitting on plush rug viewing multiple glowing TV screens in warm bedroom, side angle shallow depth of field, prominent wall-mounted wooden growth chart covered in detailed percentage scores and metrics behind child, small trophy shelf visible, warm peach and cream tones contrasted with cool blue screen glow, scattered newspapers, bold text ‘BENCHMARKS’ at top, cozy yet subtly unsettling atmosphere, natural skin texture and fabric detail
Sonnet 4.5 was underestimated on METR; its time horizon improves by around 20 minutes https://x.com/scaling01/status/2001476927362605354
We’re working on updating and improving our time horizon task suite. Recently, we found two issues with our tasks, one of which was differentially lowering the performance of Claude models. We think these also illustrate some interesting model behavior. https://x.com/METR_Evals/status/2001473506442375645
All the frontier AIs now pass all levels of the very challenging Chartered Financial Analyst (CFA) exam. The paper used paywalled, new mock exams to reduce the risk of leakage, but relied on AI grading for the essays. Interestingly, prompting strategy doesn’t matter for most question types https://x.com/emollick/status/2000605774695837711
BREAKING: OpenAI releases “GPT-Image-1.5” (ChatGPT Images) and it instantly takes the #1 spot on LMArena, beating Google’s Nano Banana Pro (r/singularity) https://www.reddit.com/r/singularity/comments/1po98xo/breaking_openai_releases_gptimage15_chatgpt/
GPT-5.2 is here and it’s the best model out there for everyday professional work. On GDPval, the thinking model beats or ties human experts on 70.9% of common professional tasks like spreadsheets, presentations, and document creation. It’s also better at general intelligence. https://x.com/fidjissimo/status/1999183159356006450
Today I ran two complex tasks through Codex with GPT 5.2 Extra High. The first ran for 2 hours 30 minutes; the second ran for 1 hour 45 minutes. Both resulted in: all acceptance criteria resolved, all test coverage complete, zero broken or non-working code. Amazing. https://x.com/nummanali/status/2000228337030152347
Whoa. This new GDPval score is a very big deal. Probably the most economically relevant measure of AI ability, suggesting that in head-to-head competition with human experts on tasks that take a human 4-8 hours, GPT-5.2 wins 71% of the time as judged by other humans https://x.com/emollick/status/1999189828756263359
GPT Image 1.5 achieves both #1 in Text to Image and Image Editing in the Artificial Analysis Image Arena, surpassing Nano Banana Pro GPT Image 1.5 is OpenAI’s newest flagship image generation model, demonstrating improved image quality and prompt fidelity relative to earlier https://x.com/ArtificialAnlys/status/2001016199094948185
GPT Image 1.5 is now available in the API: ✏️ More precise image editing and preservation of logos & faces 🎯 Better instruction following and adherence to prompts 🔤 Improved text rendering, particularly for denser and smaller text Learn more in docs: https://x.com/OpenAIDevs/status/2000992413402456485
Grace Li (@grx_xce): “This is the biggest jump in Image Arena that we’ve seen since Nano Banana. GPT-Image-1.5 has taken #1 on Image Arena with a significant lead. Huge congratulations to the team at @OpenAI for this achievement!” https://xcancel.com/grx_xce/status/2000993261914350070?s=20
Introducing ChatGPT Images, powered by our flagship new image generation model. – Stronger instruction following – Precise editing – Detail preservation – 4x faster than before Rolling out today in ChatGPT for all users, and in the API as GPT Image 1.5. https://x.com/OpenAI/status/2000990989629161873
The Image Arena is buzzing 👀 @OpenAI’s GPT-image-1.5 is live and already shaking up the leaderboard. Watch it in action below, then try your own prompt and share what you create 👇🎨 https://x.com/arena/status/2001014708254773549
The new ChatGPT Images is here | OpenAI https://openai.com/index/new-chatgpt-images-is-here/
xAI’s new Grok Voice Agent is the new leading Speech to Speech model, surpassing Gemini 2.5 Flash Native Audio and GPT Realtime in our Big Bench Audio benchmark The new model achieves a score of 92.3% on Big Bench Audio, just ahead of the previous leader, Google’s Gemini 2.5 https://x.com/ArtificialAnlys/status/2001388724987527353
🎥 Kling 2.6 Motion Control Feature Is Now Live! To celebrate the launch of Kling 2.6 Motion Control Feature, we’re kicking off a new contest – and the prizes are one post away from you! 🔥 Show us your creative power with Kling 2.6 Motion Control Feature – The Kling 2.6 Motion https://x.com/Kling_ai/status/2001891240359632965
🎥 Kling 2.6 Voice Control Feature Is Now Live! To celebrate the launch of Kling 2.6 Voice Control Feature, we’re kicking off a new contest – and the prizes are one post away from you! 🔥 Show us your creative power with Kling 2.6 Voice Control Feature – Use your signature voices https://x.com/Kling_ai/status/2001198609115628029
🚀 Motion Control, Leveled Up Newly upgraded Motion Control is now live in Kling VIDEO 2.6! Experience precise, full control over every action & expression ✅ Full-Body Motions — Body movements captured in stunning detail ✅ Fast & Complex Actions — From martial arts to https://x.com/Kling_ai/status/2001306445262823431
🚨 Kling O1 Video Standard is here on fal! 🎬 Same powerful editing model, 720P mode ✨ Start & end frame control for precision 🎯 3-10 second range for flexible videos 💰 Faster generation, lower cost https://x.com/fal/status/2000590369545744599
🚨Video Leaderboard Updates Kling 2.6 Pro by @kling_AI and the new Kandinsky 5.0 open models by @kandinskylab have now landed on the Video Arena leaderboard. Kling 2.6 Pro delivers a major 16-point jump over Kling-2.5-turbo-1080p. While Kandinsky 5.0 enters strong, taking the https://x.com/arena/status/1999530939886768205
A new prompt unlock? Multiple gliding rack-focus shots through a cyberpunk nightclub; yes, the characters in close-up are prompted (prompt to be shared in a later post). Not keyframes. Created in @Kling_ai 2.6 image-to-video. 🔊🔊🎧 https://x.com/StevieMac03/status/2002001196383391813
Do you want to create ultra-dynamic action animations with @Kling_ai 2.6? 🎬⚡️ After testing many prompts, I’ve noticed what works best. And here’s the key. 👉 What usually gives the best results is starting the prompt with “High-speed anime battle.” Other combinations that https://x.com/Artedeingenio/status/2001960379610767835
Kling 2.6 “Motion Control” tested on dance videos: full-body footwork and weight shifts look natural, and hair tracking is excellent. Dance and action clips like these seem to suit the feature best and let it show its strengths ✨ https://x.com/genel_ai/status/2001532885673873677
Oh my… Kling just dropped the next era of motion control. Kling VIDEO 2.6 can copy any action with perfect lip-sync, lifelike motion and expressive gesture. It outperforms Wan 2.2-Animate, Act-Two and DreamActor 1.5 across all metrics. More examples below. https://x.com/AngryTomtweets/status/2001569619375698199
Quick test of Kling 2.6 Motion Control Shall I keep going? 😭 https://x.com/blizaine/status/2001849003819098168
Your frames. Your timing. Kling VIDEO O1 now supports Start & End Frames generation with freely selectable durations from 3-10s, giving you smoother transitions and more control over pacing. From fast, high-impact moments to fully immersive cinematic shots, your story moves the https://x.com/Kling_ai/status/2000581619556421673
How good is AI for science? Yesterday, OpenAI released a benchmark, FrontierScience, to measure frontier model performance on scientific tasks. This is the most sophisticated benchmark for science I’ve seen. FrontierScience has 160 questions across various subdomains, https://x.com/jungofthewon/status/2001302379527114798
Measuring AI Ability to Complete Long Tasks – METR https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/
The updated time horizon numbers are live on the dashboard on our website: https://x.com/METR_Evals/status/2001473519197335899
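METR’s “time horizon” metric can be pictured as fitting a success-vs-task-length curve and reading off the 50% point. A minimal sketch of that idea, assuming a logistic curve over log2 task length fit by plain gradient descent (METR’s actual pipeline, task suite, and fitting procedure differ in details):

```python
# Sketch of a METR-style 50% time-horizon estimate (illustrative only).
import math

def fit_horizon(task_minutes, successes, lr=0.1, steps=5000):
    """Fit P(success) = sigmoid(a + b * log2(minutes)) by full-batch
    gradient descent, then return the length where P = 0.5."""
    xs = [math.log2(t) for t in task_minutes]
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        ga = gb = 0.0
        for x, y in zip(xs, successes):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            ga += p - y            # gradient of log-loss w.r.t. a
            gb += (p - y) * x      # gradient of log-loss w.r.t. b
        a -= lr * ga / n
        b -= lr * gb / n
    # P = 0.5 where a + b * log2(t) = 0, i.e. t = 2 ** (-a / b)
    return 2 ** (-a / b)

# Toy data: the model succeeds on short tasks, fails on long ones.
minutes = [1, 2, 4, 8, 16, 32, 64, 128]
success = [1, 1, 1, 1, 1, 0, 0, 0]
horizon = fit_horizon(minutes, success)  # lands between 16 and 32 min
```

The METR items above (task-suite fixes shifting Claude’s numbers, Sonnet 4.5’s horizon moving ~20 minutes) show how sensitive this single summary number is to the underlying task set.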
Best AI research of the week: ▪️ On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning LMs ▪️ Native Parallel Reasoner ▪️ Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving ▪️ DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent https://x.com/TheTuringPost/status/2000874193249034463
InternGeometry: An LLM agent tackles Olympiad-level geometry. This novel agent solves 44 of 50 International Math Olympiad problems, beating gold medalists with only 13K training examples. It uses iterative reasoning & Complexity-Boosting RL. https://x.com/HuggingPapers/status/1999572332906438987
All the most recent models now do this right first try. https://x.com/emollick/status/1999960137386361093
Can AI reviewers catch real bugs without flooding PRs? Akshay Utture, Applied AI Engineer at @augmentcode, and his team benchmarked 7 AI code review tools on large open-source projects. Here are the results: ▪️ They saw the same pattern: Missed issues came from missing https://x.com/TheTuringPost/status/1999619297057112275
Honestly weird that the frontier models do not diverge that much in terms of abilities, prompt adherence, and other factors. Whether you pick any of the big American closed source models or the Chinese and French open models, they are all very similar to each other, and have been https://x.com/emollick/status/1999712938861674798
I see multiple QTs saying “train on test”. But the way I understand it, I don’t think he is doing anything wrong? And this does not look like the classic “oops, I trained on test” to me. ARC-AGI is a meta-learning benchmark, but they don’t like to call it that. – On the left, he https://x.com/giffmana/status/2002111246225621296
Individual results across the 10 evals we run independently for the Artificial Analysis Intelligence Index: MMLU-Pro, GPQA Diamond, Humanity’s Last Exam, LiveCodeBench, SciCode, AIME 2025, IFBench, AA-LCR, Terminal-Bench Hard, 𝜏²-Bench Telecom https://x.com/ArtificialAnlys/status/2001335963952521243
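To make the aggregation concrete: a composite index over those 10 evals can be as simple as an equal-weight mean of normalized scores. This is a hypothetical illustration with made-up numbers; Artificial Analysis’s actual Intelligence Index weighting and normalization may differ:

```python
# Hypothetical equal-weight aggregation of per-eval scores (0-100 scale).
def intelligence_index(scores):
    """Average the per-eval scores into one composite number."""
    return sum(scores.values()) / len(scores)

model_scores = {  # illustrative values, not real results
    "MMLU-Pro": 85.0, "GPQA Diamond": 78.0, "Humanity's Last Exam": 25.0,
    "LiveCodeBench": 80.0, "SciCode": 45.0, "AIME 2025": 92.0,
    "IFBench": 60.0, "AA-LCR": 70.0, "Terminal-Bench Hard": 40.0,
    "tau2-Bench Telecom": 65.0,
}
index = intelligence_index(model_scores)  # → 64.0 for these toy inputs
```

The point of publishing the 10 individual results alongside the index is exactly that a single mean hides large per-eval spread (e.g. strong AIME vs. weak Terminal-Bench in the toy numbers above).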
Our lack of any reliable measures of human error rates across intellectually demanding tasks and fields is a huge hindrance to understanding the thresholds of hallucination and reliability that AI might cross incrementally that could lead to sudden leaps in usefulness & adoption. https://x.com/emollick/status/2001310462890160443
SimpleBench results are extremely disappointing for GPT-5.2: it scores below Sonnet 3.7, an almost 1-year-old model. GPT-5.2 Pro doesn’t fare much better, barely beating GPT-5 https://x.com/scaling01/status/1999466846563762290
Text to Image Leaderboard | Artificial Analysis https://artificialanalysis.ai/image/leaderboard/text-to-image
we desperately need new and better benchmarks. I think I need to sit down another 2 hours with Opus 4.5 and cook up a LisanBench follow-up. But I really want to see more benchmarks on complex games (well, ARC-AGI-3 is already going in that direction of dynamic environments) I https://x.com/scaling01/status/1999321464319754290
Were you right a year ago? Let’s see! Revisiting our early-2025 predictions. Last December, we made a bold bet: 2025 would be the “Year of Inference-Time Search.” Looking back, that prediction defined the entire year. ⬇️ 1. The Big Win: The “Thinking” Shift @fchollet nailed https://x.com/TheTuringPost/status/1999097028023062937
would like to clarify this: this work is actually _very interesting_ for exactly the reasons listed in the community note: 1) you can train purely on the ARC-AGI train set and get a new “pareto frontier” for ARC-AGI 2) the cost of doing (1) is so low that it’s effectively ~free to https://x.com/suchenzang/status/2002100653049753901
Zoom AI sets new state-of-the-art benchmark on Humanity’s Last Exam | Zoom https://www.zoom.com/en/blog/humanitys-last-exam-zoom-ai-breakthrough/
GDPval-AA Leaderboard: https://x.com/ArtificialAnlys/status/1999404589049872615
@OpenAI Super cool to see the eval on the Hugging Face hub too – OPEN SOURCE EVALS FTW! 🔥 https://x.com/reach_vb/status/2000982838171328882
Important new eval! https://x.com/sama/status/2000980694588383434
💡 LMArena Deep Dive: DeepSeek v3.2 (Text Arena) Leaderboard rank doesn’t always tell the full story. As previously reported, DeepSeek released v3.2 two weeks ago. Its results varied across categories and, overall, ranked lower than earlier v3.1 and v3.2 Experimental versions. https://x.com/arena/status/2000637978662821942
New benchmark from Google Research. Models get better at benchmarks, but do they actually get more factual? Previous evaluations focused on narrow slices: grounding to documents, answering from memory, or using search. A model excelling at one often fails at another. This new https://x.com/omarsar0/status/2000935220049273303
after testing GPT-5.2 I no longer think that it is a much larger model or anywhere near the size Gemini 3 Pro is https://x.com/scaling01/status/1999566015873569174
🚨BREAKING: Leaderboard updates for Text, Vision & WebDev. Gemini-3-Flash by @GoogleDeepMind is now ranked top 5 across Text, Vision, and WebDev, making it the most cost-efficient frontier model ($0.50 input and $3 output per million tokens). Gemini-3-Flash highlights: 🔹 Top 5 across Text, https://x.com/arena/status/2001322123730788698
I am asking, once again, for @GoogleDeepMind to provide benchmarks for different thinking levels. If they’re giving me low, medium, and high thinking level parameters, I wanna know as a builder how they compare. I don’t think that’s too much to ask @OfficialLoganK https://x.com/RobertHaisfield/status/2001327612887785904
🖼️🚨 Image Leaderboard Update Competition in the Arena continues to drive leaderboard movement, with Flux-2-Max making a competitive debut. 🔹 #3 on Text-to-Image (1167) 🔹 #7 on Image Edit (1247) The Text-to-Image leaderboard tightens as Flux-2-Max slots ahead of https://x.com/arena/status/2000947088738431408
🖼️🚨 Image Leaderboard Update Competition in the Arena continues to shake up the leaderboards. Flux-2-Dev lands on the board with solid early results. 🔹 #7 on Text-to-Image (1149) 🔹 #8 on Image Edit (1240) Margins remain slim on the Text-to-Image leaderboard, where https://x.com/arena/status/1999560495867793881
🚨 FLUX.2 [max] live on fal! ✨ Black Forest Labs’ top-tier: quality + edit consistency 🎯 Better than FLUX.2 [pro], easier prompting 🎨 Consistent edits: characters, objects, styles, backgrounds 💡 Most creative FLUX model: same prompt, varied outputs that still follow https://x.com/fal/status/2000945229977829784
🚀 The GeoAI QGIS Plugin is here 🔥 You can run Moondream vision-language models, object detection, image segmentation (SAM 3), and even train your own geospatial segmentation model end-to-end. Website: https://x.com/giswqs/status/1999536028282179721
GPT Image 1.5’s IQ is far behind Nano Banana Pro. It fails the math problem here (left: GPT, right: 🍌), and also other math/physics/maze problems. Nano Banana Pro is a multimodal model built on Gemini 3 Pro. I suspect GPT Image 1.5 is still stuck on the older GPT-4o architecture. https://x.com/Yuchenj_UW/status/2001023040763920870
GPT-5.2 below Opus 4.5 and Gemini 3 Pro on LiveBench https://x.com/scaling01/status/1999323401421488319
GPT-5.2 scores 152 on the Epoch Capabilities Index (ECI), our tool for aggregating benchmark scores. This puts it second only to Gemini 3 Pro. 🧵 with individual scores. https://x.com/EpochAIResearch/status/1999548496198926728
GPT-5.2 xhigh doing better than Gemini 3 Pro on MRCR long context eval https://x.com/scaling01/status/1999327512401527107
GPT-5.2 xhigh reasoning scores 89.3 on the Extended NYT Connections benchmark, compared with 77.9 for GPT-5.2 high reasoning. GPT-5.2 Pro scores lower (86.7) but above GPT-5 Pro (83.9). https://x.com/LechMazur/status/1999582591905583256
Ok GPT-5.2 is *much* stronger at proof-writing. It notices BS previous models wrote immediately (I like to test this between model iterations to see if they notice what I notice). It also has better sense for what problems seem more tractable, and makes further progress. https://x.com/AcerFur/status/1999314476320063546
Real user feedback matters in model evaluation. ✨GPT-5.2 Instant, meant for everyday work, is #1 on @yupp_ai’s Text Leaderboard while GPT-5.2 (High) is #1 on our SVG Leaderboard. @openai’s strategy of releasing model variants suited to the task looks sound. Congrats @openai! 🎉 https://x.com/lintool/status/2000368978708119958
Yeah it’s over. AI Explained specified that this GPT-5.2 result was with reasoning effort xhigh, aka 100k tokens spent thinking https://x.com/scaling01/status/1999535536130662576
GPT-5.2 just overtook Claude Opus 4.5 to achieve the highest score in GDPval-AA, a benchmark that focuses on performance in real-world economically valuable tasks However, GPT-5.2 is also the most expensive model to run GDPval-AA: GPT-5.2 cost $620, compared to Claude Opus 4.5’s https://x.com/ArtificialAnlys/status/1999404579599823091
@OpenAI NOTE: OpenAI calls their official results MRCRv2. I reported to them a few weeks ago that ~5%-10% of their tests in MRCRv1 had issues, which came from their generation. The results above are using corrected tests, similar to OpenAI’s MRCRv2. Here’s the MRCRv2 dataset from OpenAI https://x.com/DillonUzar/status/1999328225164431394
Evaluating chain-of-thought monitorability | OpenAI https://openai.com/index/evaluating-chain-of-thought-monitorability/
GPT 5.2 (xhigh) scores 72.2% and takes the lead on WeirdML, ahead of gemini 3 at 69.9%. 5.2 xhigh uses a lot of tokens (28k on avg, vs 7.8k for gemini and 3.7k for opus). It struggles with some tasks, but is really good at optimising the solutions to the other tasks to reliably https://x.com/htihle/status/2000571235734810805
GPT-5 Pro by @OpenAI is the Best Reasoning Model of 2025. 🏆 Calculated across SEAL’s reasoning leaderboards, GPT-5 Pro was the best at answering complicated questions, explaining its thinking, and solving multi-step problems. https://x.com/scale_AI/status/2000998950824968482
GPT-5.2 is a big improvement over GPT-5.1 on VendingBench-2 but barely beats Sonnet 4.5 and loses to Gemini 3 Pro and Claude 4.5 Opus https://x.com/scaling01/status/1999449402776387808
To preserve chain-of-thought (CoT) monitorability, we must be able to measure it. We built a framework + evaluation suite to measure CoT monitorability — 13 evaluations across 24 environments — so that we can actually tell when models verbalize targeted aspects of their https://x.com/OpenAI/status/2001791131353542788
Evaluating AI’s ability to perform scientific research tasks | OpenAI https://openai.com/index/frontierscience/
Science 🤝 GPT-5. Our new FrontierScience benchmark will be a valuable way to measure the performance of AI models on hard chemistry, biology, physics, and more. Plus, GPT-5 operating in a wet lab environment suggested experiments to increase a molecular cloning protocol’s https://x.com/kevinweil/status/2000982202067165253
We’re releasing a new eval to measure expert-level scientific reasoning: FrontierScience. This benchmark measures PhD-level scientific reasoning across physics, chemistry, and biology. It contains hard, expert-written questions (both olympiad-style problems and longer https://x.com/OpenAI/status/2000975293448905038
I wanted to compare Gemini 3 Pro and GPT 5.2 Thinking on the long-context eval MRCR v2, but I can’t make sense of the already high score reported by Gemini for GPT 5.1. Gemini is doing an average with samples < 128k, but I get 46.2% when doing that for GPT 5.1 (which is a 14% https://x.com/eliebakouch/status/1999482968717279441
I’m satisfied with GPT-5.2’s long-context capability. Up to now, I’ve always used Gemini to summarize podcasts, but I can now switch this use case over to ChatGPT. What I like is that, with the same prompt, it produces summaries with richer detail compared to Gemini. (That https://x.com/Hangsiin/status/2000738988378968224
@pli_cachete Man discovers unsupervised learning and confuses it with what “training on test” actually means. https://x.com/jeremyphoward/status/2002136723573387537
Man discovers training on test improves performance on test. 1.1k people cheer https://x.com/pli_cachete/status/2002068489386004596
useful lifetime of a benchmark these days is measured in months https://x.com/gdb/status/1999454952801075353
On Kling 2.6’s (@Kling_ai) motion control: the biggest appeal of v2v is getting performances that AI can’t reproduce from prompts alone. As a real example, I swallowed my pride and recreated one myself; please take a look. Movements like this are simply impossible with prompts. https://x.com/onofumi_AI/status/2001840428250022087