Image created with gemini-2.5-flash-image and claude-sonnet-4-5. Image prompt: Minimalist luxury trophy room with single gold trophy on white marble pedestal under dramatic spotlight, dozens of empty white pedestals receding into shadow, cold grey marble floors with reflections, architectural emptiness, cinematic lighting, the word BENCHMARKS in bold white sans-serif overlaid prominently
Amazon has launched a new speech-to-speech model, Nova Sonic 2.0, which ranks #2 on our Artificial Analysis Big Bench Audio Speech Reasoning benchmark! The new model achieves a reasoning accuracy score of 87.1% on Big Bench Audio, placing second overall behind Google’s Gemini https://x.com/ArtificialAnlys/status/1995950101068763393
Congrats to the ARC Prize 2025 winners! The Grand Prize remains unclaimed, but nevertheless 2025 saw remarkable progress on LLM-driven refinement loops, both with “local” models and with commercial frontier models. We also saw the rise of zero-pretraining DL approaches like HRM https://x.com/fchollet/status/1997011262723801106
Image Leaderboard Update: 🖼️📊 Our image leaderboard ranks image generation AIs according to user preference – and Seedream 4.5 from @BytePlusGlobal is speeding up the rankings! Seedream 4.5’s standard version comes in at #4, just below Nano Banana Pro – and the Max version is https://x.com/yupp_ai/status/1997032930846396466
Nano banana pro is hitting the threshold for images that Veo 4 will unlock for video. We’ll suddenly go from static infographics to pro-grade animated motion graphics — like having a custom youtube video essay on any topic imaginable. And just like that ai video will become a https://x.com/bilawalsidhu/status/1994110158138646693
Nano Banana Pro with 2k resolution is now #1 on the lmarena image editing leaderboard (with regular Nano Banana Pro at #2). It looks like users prefer higher resolution: who’d have thunk it?! https://x.com/JeffDean/status/1996457766349848753
Surprisingly good for the first try. Nano banana pro: “create a map of the US where every state is made out of its most famous food (the states should actually look like they are made of the food, not a picture of the food). Check carefully to make sure each state is right.” https://x.com/emollick/status/1995720976068137048
🚨BREAKING: Text Leaderboard Update: A new open source model has landed on the leaderboard! Mistral-Large-3 lands at #6 among open models and #28 overall on the Text leaderboard. Mistral 3 is the next generation of Mistral AI models and their most capable model family to date. https://x.com/arena/status/1995877395510051253
Introducing Mistral 3 | Mistral AI https://mistral.ai/news/mistral-3
Introducing Mistral Code | Mistral AI https://mistral.ai/news/mistral-code
Introducing the Mistral 3 family of models: Frontier intelligence at all sizes. Apache 2.0. Details in 🧵 https://x.com/MistralAI/status/1995872766177018340
Magistral | Mistral AI https://mistral.ai/news/magistral
Mistral Small 3 | Mistral AI https://mistral.ai/news/mistral-small-3
Mistral Small 3.1 | Mistral AI https://mistral.ai/news/mistral-small-3-1
Voxtral | Mistral AI https://mistral.ai/news/voxtral
Runway released Gen-4.5 today and it is already ranked first on the Video Arena leaderboard. We sat down with CEO @c_valenzuelab to discuss how a small team is currently beating Google and Meta in the race for state-of-the-art video generation. The full episode is below! https://x.com/wandb/status/1995548641801765249
CORE-Bench is solved (using Opus 4.5 with Claude Code) TL;DR: Last week, we released results for Opus 4.5 on CORE-Bench, a benchmark that tests agents on scientific reproducibility tasks. Earlier this week, Nicholas Carlini reached out to share that an updated scaffold that uses https://x.com/sayashk/status/1996334941832089732
📊 Evaluating DeepAgents CLI on Terminal Bench 2.0 📊 The DeepAgents CLI is a coding agent built on top of the Deep Agents SDK, offering an interactive terminal interface with shell execution, filesystem tools, and persistent memory. How well does it actually perform on https://x.com/LangChain/status/1997006806904984002
🚨New Models in the Arena! 🐳DeepSeek V3.2: a new family of reasoning-first, agent-oriented models from @deepseek_ai are now live in the Arena. Standard, Thinking, and Speciale are all in the Text Arena, waiting for your toughest prompts! Get your votes in: we’ll see how they https://x.com/arena/status/1995564824718442620
At this point, papers testing whether AI can or cannot do something should try to test the strongest case, as well as a default. It is fine to say Llama 2 failed, but did a serious attempt to use GPT-5.1 Thinking in an agentic harness work? It would help better map the frontier. https://x.com/emollick/status/1994913383871586563
WOW! @AnthropicAI released interviews with 1,250 professionals about how they use AI for work. You can find it on @huggingface as an open dataset! https://x.com/calebfahlgren/status/1996646452509266266
📢 If you’re interested in working at @arena please ping me. I will be at NeurIPS today and part of tomorrow. 📢 We are looking for excellent researchers (ICs and leaders) in machine learning, statistics, and evaluation. We can promise an intense, high-performance, https://x.com/ml_angelopoulos/status/1997006962522021992
AI Adoption Rates Starting to Flatten Out – Apollo Academy https://www.apolloacademy.com/ai-adoption-rates-starting-to-flatten-out/
Arena Expert launched last month as a new system for identifying the most difficult prompts–the kinds of questions people at the forefront of their fields are expected to ask. Since the launch, we looked at how “thinking” and “non-thinking” models perform across both general and https://x.com/arena/status/1997018150068801911
cloudflare down again 🙃 https://x.com/crystalsssup/status/1996869639608164505
Hey twitter! I’m releasing the LLM Evaluation Guidebook v2! Updated, nicer to read, interactive graphics, etc! https://x.com/clefourrier/status/1996250279033839918
How do the Top 10 open models really compare? We ran the “SF sea lion in front of the Golden Gate Bridge, as an SVG” test to find out. Prompt: “SVG sea-lion balancing a beach ball on its nose with the Golden Gate bridge in the background” https://x.com/arena/status/1995534738485129706
Interesting all AIs struggle with: “Updated version of the fighting temeraire with the same style and feel but entirely different subject appropriate for today” All four models get the idea of a retiring technology but miss most of the symbolism of what is being retired and how https://x.com/emollick/status/1994945921076138012
Introducing the Artificial Analysis Openness Index: a standardized and independently assessed measure of AI model openness across availability and transparency Openness is not just the ability to download model weights. It is also licensing, data and methodology – we developed a https://x.com/ArtificialAnlys/status/1995523178521846191
Most AI benchmarks share a common flaw: they saturate too quickly to study long-run trends. Our solution: “stitch” many benchmarks together. This lets us compare models across a wide range of capabilities on a single unified scale. Here’s how this works.🧵 https://x.com/EpochAIResearch/status/1996248575400132794
This is a remarkable result. V3.2 has high Pass@1 on Tool Decathlon, mid (ie GPT-5 tier) Pass^3 (all three trajectories correct), but Pass@3 is #2, right behind new Opus. Do you know what this looks like? Like a model that’s *still* not RL’d anywhere close to its ceiling. https://x.com/teortaxesTex/status/1995538676332278238
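The three pass metrics in that tweet are easy to conflate. As a quick illustration with hypothetical toy data (not actual Tool Decathlon results): Pass@1 scores only the first sampled trajectory, Pass^k requires all k trajectories to succeed, and Pass@k requires at least one to succeed.

```python
from typing import List

def pass_at_1(trials: List[List[bool]]) -> float:
    # Fraction of tasks whose first sampled trajectory succeeds.
    return sum(t[0] for t in trials) / len(trials)

def pass_hat_k(trials: List[List[bool]], k: int) -> float:
    # "Pass^k": all k sampled trajectories must succeed (a consistency measure).
    return sum(all(t[:k]) for t in trials) / len(trials)

def pass_at_k(trials: List[List[bool]], k: int) -> float:
    # "Pass@k": at least one of k sampled trajectories succeeds.
    return sum(any(t[:k]) for t in trials) / len(trials)

# Toy results: 4 tasks, 3 trajectories each (True = trajectory correct).
results = [
    [True, True, True],
    [True, False, True],
    [False, True, True],
    [True, True, False],
]
print(pass_at_1(results))      # 0.75
print(pass_hat_k(results, 3))  # 0.25
print(pass_at_k(results, 3))   # 1.0
```

A high Pass@k with a middling Pass^k is exactly the pattern the tweet describes: the capability is there in at least one sample, but the model hasn’t been trained to produce it reliably every time.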
Three years of the Lem Test, from the release of ChatGPT-3.5 (though it was not called that at the time) to Claude Opus 4.5 last week. https://x.com/emollick/status/1995025704870887652
Yupp LIVE leaderboard news 📰📊 A live model searches the Internet, integrating the latest real-world information into its responses. No dreaded old “knowledge cut-off time!” The new Claude Opus 4.5 Online and Claude Opus 4.5 (Thinking) Online have quickly risen to the top. https://x.com/yupp_ai/status/1996963861455593829
Yupp’s SVG AI Leaderboard is live! Why? It’s one of the clearest ways to demonstrate models’ reasoning and coding capabilities. https://x.com/yupp_ai/status/1996697775585787924
this one chart explains EVERYTHING about why OpenAI, xAI and Deepmind dropped everything to go chase after the grand prize in koding usecases as i said at AIE CODE and in my cogpost, Code AGI will be achieved in 20% of the time of full AGI, and capture 80% of the value of AGI. https://x.com/swyx/status/1996760294614507929
Unlock the secret to AI success | Forrester study https://miro.com/events/secret-to-ai-success-forrester-study/
🚨Top 10 Open Models by Provider for November The open model race continues with new models entering the Text Arena. Confidence intervals are getting tighter and the competition is heating up! Here are the November Top 3: 🥇 #1 Kimi-K2-Thinking-Turbo by @Kimi_Moonshot (Modified https://x.com/arena/status/1995534475070243043
Announcing the ARC Prize 2025 Top Score & Paper Award winners The Grand Prize remains unclaimed Our analysis on AGI progress marking 2025 the year of the refinement loop https://x.com/arcprize/status/1997010070585201068
🚨🖼️ Image Leaderboard Update Seedream 4.5 by Bytedance has officially entered the Arena on both the Image Edit and Text-to-Image leaderboards. Here is where it landed: 🔹 #3 on Image Edit (score: 1338) 🔹 #7 on Text-to-Image (score: 1146) This update delivers a 27-pt increase https://x.com/arena/status/1996641968005566876
🚨BREAKING: Text Leaderboard Update 🐳 Deepseek-v3.2 enters the leaderboard at #38, and Deepseek-v3.2-thinking lands at #41. For comparison, previous versions ranked higher: 🔹 v3.2 ranks #38 (-5 pts v3.1 and -14 pts v3.2-exp) 🔹 v3.2-thinking ranks #41 (-7 pts vs v3.1-thinking https://x.com/arena/status/1996707563208167881
Compare how DeepSeek V3.2 performs relative to models you are using or considering at: https://x.com/ArtificialAnlys/status/1996110266065715249
DeepSeek’s new DeepSeekMath-V2 hits gold-medal performance on IMO and Putnam. It’s the first open model that can check its own proofs, fix mistakes, and improve itself. DeepSeekMath-V2 uses two “minds” in one model: ▪️ A verifier – Reads a proof and points out issues. – https://x.com/TheTuringPost/status/1994926897248288813
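The verifier/generator split described above boils down to an alternating verify-and-refine loop. Here is a toy sketch of that control flow; the function names, the `{"text", "valid"}` proof representation, and the repair logic are all hypothetical stand-ins, not DeepSeekMath-V2’s actual interface.

```python
def verify(proof: list) -> list:
    # Hypothetical verifier pass: return indices of flawed proof steps.
    return [i for i, step in enumerate(proof) if not step["valid"]]

def refine(proof: list, issues: list) -> list:
    # Hypothetical generator pass: rewrite each flagged step.
    for i in issues:
        proof[i] = {"text": proof[i]["text"] + " (revised)", "valid": True}
    return proof

def prove(proof: list, max_rounds: int = 3) -> list:
    # Alternate verification and refinement until no issues remain
    # or the round budget is exhausted.
    for _ in range(max_rounds):
        issues = verify(proof)
        if not issues:
            break
        proof = refine(proof, issues)
    return proof

draft = [
    {"text": "step 1", "valid": True},
    {"text": "step 2", "valid": False},
]
final = prove(draft)
print(verify(final))  # [] — no remaining issues
```

The interesting part of the real system is that both roles live in one model, so improving the verifier directly improves the generator’s ability to catch and fix its own mistakes.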
Gemini 3 Deep Think mode is now live in the Gemini app for Ultra users. 🚀 Building on the technology that reached a gold-medal level at the ICPC World Finals & IMO, it uses parallel thinking to excel at difficult coding and scientific tasks. https://x.com/quocleix/status/1996659461851885936
Gemini 3 Pro is the frontier of multimodal AI, delivering SOTA performance across document, screen, spatial, and video understanding. Read our deep dive on how we’ve pushed our core capabilities to power hero use cases across: + Docs: “derender” complex docs into structured https://x.com/googleaidevs/status/1996973083467333736
Google out here building the Borg cube for real https://x.com/bilawalsidhu/status/1995650915785986491
Happy to share that the @GoogleDeepMind Gemini team is starting a new research team in Singapore! This new team will be focused on advanced reasoning, LLM/RL and improving bleeding edge SOTA models such as Gemini, Gemini Deep Think and beyond. 🔥 This team will be led by yours https://x.com/YiTayML/status/1996640869584445882
I was in Singapore earlier this year to visit the office, and this is going to be a very-high impact part of the Gemini team! If you’re interested in working on Gemini and want to be in Singapore working with awesome people like @YiTayML and @quocleix, see below ⬇️ https://x.com/JeffDean/status/1996644208854388983
Opera rolls out Gemini-powered AI features across its browsers – 9to5Mac https://9to5mac.com/2025/12/01/opera-browsers-get-google-gemini-integration/
Our Gemini 3 Vibe Code hackathon has started! Build applications using the new Gemini 3 Pro model with a prize pool of $500k. 🤯 > Top 50 winners receive $10,000 in Gemini API credits each. > Access Gemini 3 Pro Preview directly in Google AI Studio. > Leverage advanced reasoning https://x.com/_philschmid/status/1996990062836244732
Take an early look at how Google Gemini projects will work – Android Authority https://www.androidauthority.com/google-gemini-projects-2-3620950/
Today, we’re rolling out an updated Deep Think mode available in the Gemini app for Google AI Ultra subscribers. Here’s what you need to know: — Gemini 3 Deep Think mode pushes the boundaries of intelligence even further, delivering meaningful improvement in reasoning https://x.com/GoogleAI/status/1996657213390155927
Ultra users, ready to try Gemini 3 Deep Think mode? Here’s how: 1) Select ‘Deep Think’ in the prompt bar 2) Select ‘Thinking’ from the model drop down 3) Type your prompt & submit https://x.com/GeminiApp/status/1996670867770953894
We’re hiring research scientists & student researchers at Google DeepMind. DM or email me if you’re interested! I’ll be at NeurIPS this week. Happy to chat in person! https://x.com/RuiqiGao/status/1995572419218796567
We’re pushing the boundaries of intelligence even further with Gemini 3 Deep Think. 🧠 This mode meaningfully improves reasoning capabilities by exploring many hypotheses simultaneously to solve problems. Here’s how it coded a simulated dominoes game from a single prompt ⬇️ https://x.com/GoogleDeepMind/status/1996658401233842624
With state-of-the-art reasoning, richer visuals, and deeper interactivity, Gemini 3 is more intuitive, more powerful, and more personalized. Start exploring at https://x.com/GeminiApp/status/1995534313044238347
👀Introducing a brand new @yupp_ai SVG leaderboard ranking frontier models on the generation of coherent and visually appealing SVGs! Gemini 3 Pro by @GoogleDeepMind takes the crown as the most powerful model! 👏 We’re also releasing a public SVG dataset. Details in🧵 https://x.com/lintool/status/1996696157985398812
Gemini 3 Deep Think mode is live for Ultra users today. When using parallel lines of thought in this mode, Gemini shows meaningful improvement on key reasoning benchmarks such as ARC-AGI-2 & HLE. I think you will be deeply impressed. https://x.com/NoamShazeer/status/1996679619031060680
@venturetwins Feels a bit underwhelming? Not sure but a lot of hype for something that has existed for months and with better results. Nano Banana for video is Aleph: https://x.com/c_valenzuelab/status/1995559319962783919
Nano Banana Pro keeps getting more SOTA (support for 2K and 4K is available in the API!) 🍌 https://x.com/OfficialLoganK/status/1996036187979678088
Editing video models (think nano banana for video) will cause a boom in faithful remasters of old classics. Suddenly you can afford decisions that used to be cost prohibitive. Maybe even tackle cult favorites that never had the fan base to justify the expense (Stargate SG-1 https://x.com/bilawalsidhu/status/1995883669358006526
And Mistral Large 3, a frontier class open source MoE. https://x.com/MistralAI/status/1995872771516354828
🎉 Congratulations to the Mistral team on launching the Mistral 3 family! We’re proud to share that @MistralAI, @NVIDIAAIDev, @RedHat_AI, and vLLM worked closely together to deliver full Day-0 support for the entire Mistral 3 lineup. This collaboration enabled: • NVFP4 https://x.com/vllm_project/status/1995890057224618154
Europe still has one frontier model maker that can generally keep pace with Chinese open weights models, though no reasoner for Mistral 3 yet means they are behind the curve of actual performance – DeepSeek r1 got 71.5% on GPQA Diamond (& 1-shot, not 5-shot) back in January. https://x.com/emollick/status/1996068920596594932
I want to especially thank @MistralAI for releasing the base models for Mistral 3. Fewer companies are sharing base models and this opens many use cases from custom instruct to non-instruct cases https://x.com/QuixiAI/status/1996272948378804326
Meet the Ministral 3 models from @MistralAI! – 3B, 8B, and 14B models – Instruct, reasoning, and base variants – Supports tool use and vision input – Open-weights, Apache 2.0 licensed https://x.com/lmstudio/status/1995908228526604451
Mistral 3 is now available on Ollama v0.13.1 (currently in pre-release on GitHub). 14B: ollama run ministral-3:14b 8B: ollama run ministral-3:8b 3B: ollama run ministral-3:3b Please update to the latest Ollama. https://x.com/ollama/status/1995885696360566885
Mistral releases Ministral 3, their new reasoning and instruct models! 🔥 Ministral 3 comes in 3B, 8B, and 14B with vision support and best-in-class performance. Run the 14B models locally with 24GB RAM. Guide + Notebook: https://x.com/UnslothAI/status/1995874975631503479
NEW: @MistralAI released a fantastic family of multimodal models, Ministral 3. You can fine-tune them for free on Colab using TRL ⚡️, supporting both SFT and GRPO https://x.com/SergioPaniego/status/1996257877871509896
NEW: @MistralAI releases Mistral 3, a family of multimodal models, including three state-of-the-art dense models (3B, 8B, and 14B) and Mistral Large 3 (675B, 41B active). All Apache 2.0! 🤗 Surprisingly, the 3B is small enough to run 100% locally in your browser on WebGPU! 🤯 https://x.com/xenovacom/status/1995879338583945635
Run Mistral Large 3 on Ollama’s cloud: ollama run mistral-large-3:675b-cloud https://x.com/ollama/status/1996682858933768691
Super nice to see Mistral Large 3 as the #1 OSS model for coding on lmarena 🥳😎🙌 And the spoiler alert! 👀👀 https://x.com/sophiamyang/status/1996587296666128398
Support for running Mistral Large 3 locally will be available in Ollama soon. https://x.com/ollama/status/1996683156817416667
The Bert-Nebulon Alpha Stealth model is live now as @MistralAI’s new Mistral Large 3! Try the full release now on OpenRouter: https://x.com/OpenRouterAI/status/1995904288560988617
The world’s best small models–Ministral 3 (14B, 8B, 3B), each released with base, instruct and reasoning versions. https://x.com/MistralAI/status/1995872768601325836
Mistral Large 3 debuts as the #1 open source coding model on the @arena leaderboard. We’d love for you to try it! More on coding in a few days… 👀 https://x.com/MistralAI/status/1996580307336638951
Mistral AI raises 1.7B€ to accelerate technological progress with AI | Mistral AI https://mistral.ai/news/mistral-ai-raises-1-7-b-to-accelerate-technological-progress-with-ai
NVIDIA Shatters MoE AI Performance Records With a Massive 10x Leap on GB200 ‘Blackwell’ NVL72 Servers, Fueled by Co-Design Breakthroughs https://wccftech.com/nvidia-shatters-moe-ai-performance-records-with-a-massive-10x-leap-on-gb200-nvl72/
Curiosity is a requirement for greatness. You win when you keep asking new questions every day. That’s why I am proud to announce my investment in Perplexity. Perplexity is powering the world’s curiosity, and together we will inspire everyone to ask more ambitious questions. https://x.com/Cristiano/status/1996626923720462425
Robotics keeps hitting the same wall. Single task RL works, but… it does not scale to hundreds of tasks or new embodiments. This new paper looks like a real step toward fixing that. The team introduces MMBench, a benchmark with 200 tasks across many domains and robots, and https://x.com/IlirAliu_/status/1994695830612447330
🌍 Global MMLU was released exactly a year ago and has already become a key reference for multilingual evaluation. Today, we’re introducing Global MMLU 2.0 now covering more languages and refining the benchmark for what’s next. Excited for what’s yet to come in 2026 🚀🚀🚀 https://x.com/mziizm/status/1996517093039382879
Are we in a GPT-4-style leap that evals can’t see? – Martin Alderson https://martinalderson.com/posts/are-we-in-a-gpt4-style-leap-that-evals-cant-see/
Cohere Labs x NeurIPS 2025: “The Leaderboard Illusion” The Leaderboard Illusion highlights how private testing, selective score retraction, and data access gaps can distort leaderboard rankings, affecting AI model evaluation reliability. Congrats to authors https://x.com/Cohere_Labs/status/1996593263609045458
Today we are announcing the creation of the AI Evaluator Forum: a consortium of leading AI research organizations focused on independent, third-party evaluations. Founding AEF members: @TransluceAI @METR_Evals @RANDCorporation @halevals @SecureBio @collect_intel @Miles_Brundage https://x.com/aievalforum/status/1996641899332198403
We need rigorous, transparent evaluation if we want the world to understand advanced AI capabilities and risks. We’re excited to join with other independent evaluators through the AI Evaluator Forum to raise the bar on measurement best practices. https://x.com/METR_Evals/status/1996656514774524054
The mystery is over. It’s Runway with a leaderboard topping video model. Can’t wait to give it a test drive. https://x.com/bilawalsidhu/status/1995541831103512965
With Gen-4.5 you can achieve an unprecedented level of cinematic realism while still achieving novel creative concepts. The model is exceptionally good at generating objects that move with realistic weight, momentum and force. Even when suspended in zero gravity. Gen-4.5 early https://x.com/runwayml/status/1995857775771918574
With Gen-4.5 you can explore worlds that represent very specific points of view and aesthetic characteristics. The model allows you to precisely generate the look, feel and atmosphere of the world you want to create and the stories you want to tell. https://x.com/runwayml/status/1996942421121191987