Image created with gemini-3.1-flash-image-preview, with the image prompt written by claude-sonnet-4-5. Image prompt: Photorealistic 4K nature shot of wooden measuring sticks and rulers frozen vertically into a winter bay at dusk, some upright and some tilted, trapped in clear blue ice with visible depth layers, golden hour sunlight illuminating the measurement markings and casting long shadows across the frozen surface, gradient sky from deep blue to warm orange, dark water visible through ice cracks, landscape format with bold sans-serif ‘BENCHMARKS’ title text.
MiniMax-M2.5 is a surprising new step in open coding models: the first model where I’ve been able to independently confirm that it’s better than the most recent Claude Sonnet. It showed up in our benchmarks below, and in my vibe checks it felt strong and diverse.”” https://x.com/gneubig/status/2021988250240598108
80.2% on SWE-Bench Verified and 76.3% on BrowseComp is quite impressive. Try @MiniMax_AI M2.5 on @Eigent_AI”” https://x.com/guohao_li/status/2021984827923476922
M2.5 runs at 100 tokens per second. That’s 3x faster than Opus. At $0.06/M blended with caching, you can run subagents in the CLI and just leave them going. Fast models exist. Cheap models exist. Both at SOTA performance is new.”” https://x.com/cline/status/2022034678065373693
So far “telling a satisfying and well-written medium-length story” has proved far harder for LLMs than mathematical proofs, music generation, research reports, code, and many other forms of work. The technical reasons are pretty clear, but they are supposed to be language models”” https://x.com/emollick/status/2020993610540605560
An updated Gemini 3 Deep Think is out today: 📈 Achieves SOTA on ARC-AGI-2, MMMU-Pro, and HLE. 🥇Gold-medal level on Physics & Chemistry Olympiads. It turns out the best way to solve hard problems is still to think about them. Read more: https://x.com/NoamShazeer/status/2021988459519652089
Gemini 3 Deep Think (2/26) Semi Private Eval – ARC-AGI-1: 96.0%, $7.17/task – ARC-AGI-2: 84.6% $13.62/task New ARC-AGI SOTA model from @GoogleDeepMind”” https://x.com/arcprize/status/2021985585066652039
Gemini 3 Deep Think scores 84.6% on ARC-AGI-2”” https://x.com/scaling01/status/2021981766249328888
Sundar buried the real story in the cost data. Gemini 3 Deep Think went from 45.1% to 84.6% on ARC-AGI-2 in under 3 months. That’s an 88% improvement on a benchmark specifically designed to resist brute-force scaling. The number that matters: $13.62 per task. The previous Deep”” https://x.com/aakashgupta/status/2022025020839801186
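The “88% improvement” in the quote above is a relative gain, which is easy to verify from the two scores it cites:

```python
# Verify the claimed ~88% relative improvement on ARC-AGI-2.
old_score = 45.1  # Deep Think's ARC-AGI-2 score roughly 3 months prior (%)
new_score = 84.6  # updated Gemini 3 Deep Think (%)

relative_gain = (new_score - old_score) / old_score
print(f"{relative_gain:.1%}")  # 87.6%, i.e. roughly an 88% relative improvement
```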
The new Gemini Deep Think is achieving some truly incredible numbers on ARC-AGI-2. We certified these scores in the past few days.”” https://x.com/fchollet/status/2021983310541729894
Thrilled to announce a big upgrade to Gemini 3 Deep Think that hits new records on the most rigorous benchmarks in maths, science & reasoning – including 84.6% on ARC-AGI-2, 48.4% Humanity’s Last Exam without tools, and 3455 Elo rating on Codeforces!”” https://x.com/demishassabis/status/2022053593910821164
Today, we updated Gemini 3 Deep Think to further accelerate modern science, research and engineering. With 84.6% on ARC-AGI-2 and a new standard on Humanity’s Last Exam, see how this specialized reasoning mode is advancing research & development 🧵↓”” https://x.com/Google/status/2021982003818823944
We updated Gemini 3 Deep Think in @GeminiApp. Available for Ultra subscribers and slowly opening Gemini API access (fill out form below). – 48.4%, without tools on Humanity’s Last Exam. – 84.6% on ARC-AGI-2, verified by the ARC Prize Foundation. – Elo of 3455 on Codeforces. -“” https://x.com/_philschmid/status/2021989093110927798
An updated & faster Gemini 3 Deep Think is taking off! 🚀 Our smartest mode to date!™️ PhD-level reasoning to the most rigorous STEM challenges (models’ gotta think harder). Gold medal-level results on Physics & Chemistry Olympiads. 🧪💻 Full details: https://x.com/OriolVinyalsML/status/2021982720860233992
Anupam Pathak, a Google R&D lead in Google’s Platforms and Devices division, tested Deep Think’s ability to speed up the design of physical components. It’s proving that deep reasoning can translate directly into faster, more efficient prototyping.”” https://x.com/Google/status/2022007994897379809
At Duke University, the Wang Lab used Deep Think to optimize crystal growth for new semiconductors. Deep Think designed a recipe to grow thin films larger than 100 μm — hitting a precision target that previous methods struggled to reach.”” https://x.com/Google/status/2022007988823973977
Gemini 3 Deep Think: AI model update designed for science https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-deep-think/
nano-banana is Gemini‑2.5‑Flash‑Image, beating Flux Kontext by 170 Elo with SOTA Consistency, Editing, and Multi-Image Fusion | AINews https://news.smol.ai/issues/25-08-26-nano-banana
The upgraded Gemini 3 DeepThink is now live! 🚀 We’re already seeing engineers and researchers leverage it as a partner in their design and development processes I love this example of Anupam Pathak using DeepThink to go from prompt to physical prototype–actually designing”” https://x.com/tulseedoshi/status/2021997867305775324
We’ve updated Gemini 3 Deep Think to better tackle the complexity of real-world research, science, and engineering. ♊ 🚀 It achieves gold-medal standards on the written portions of the Physics and Chemistry Olympiads, building on gold-level performance at IMO and ICPC and has”” https://x.com/JeffDean/status/2021989820604539250
We’ve upgraded our specialized reasoning mode Gemini 3 Deep Think to help solve modern science, research, and engineering challenges – pushing the frontier of intelligence. 🧠 Watch how the Wang Lab at Duke University is using it to design new semiconductor materials. 🧵”” https://x.com/GoogleDeepMind/status/2021981510400709092
What’s ahead for commercial experiences in 2026 https://blog.google/products/ads-commerce/digital-advertising-commerce-2026/
people sleep on last week’s open multimodal releases > GLM-OCR: sota OCR model > MiniCPM-o-4.5: Gemini 2.5-flash level Omni model that runs on your phone > InternS1: efficient generalist VLM outperforming on science tasks all allow commercial use freely 🔥”” https://x.com/mervenoyann/status/2021233480957304913
This is batshit insane. Gemini 3 Deep Think just scored a 3455 on Codeforces, equivalent to the #8 best competitive programmer in the world. The previous best was 2727 (#175) from OpenAI o3. This is an absolutely superhuman result for AI and technology at large.”” https://x.com/deedydas/status/2022021396768133336?s=46
GLM-5: From Vibe Coding to Agentic Engineering https://simonwillison.net/2026/Feb/11/glm-5/
GLM-5: From Vibe Coding to Agentic Engineering https://z.ai/blog/glm-5
Introducing GLM-5: From Vibe Coding to Agentic Engineering GLM-5 is built for complex systems engineering and long-horizon agentic tasks. Compared to GLM-4.5, it scales from 355B params (32B active) to 744B (40B active), with pre-training data growing from 23T to 28.5T tokens.”” https://x.com/Zai_org/status/2021638634739527773
GLM-5 was pre-trained on 28.5T tokens and uses DeepSeek Sparse Attention”” https://x.com/scaling01/status/2021627498451370331
A glance at MiniMax 2.5, are you ready?”” https://x.com/SkylerMiao7/status/2021578926884053084
Congrats @MiniMax_AI! 🎉 Free for 3 days on Qoder, it’s time to put M2.5 through some serious coding sessions!”” https://x.com/qoder_ai_ide/status/2021983111161213365
MiniMax just dropped M2.5 and it’s on par with Opus 4.6 while being 20x cheaper and 3x faster???”” https://x.com/shydev69/status/2021989925143597123
Folks claim to set the state of the art on ARC-AGI-2 using an RLM, a deeply recursive one, to manage the long horizon. “Other agent harnesses keep everything in the model’s context window. We don’t. Agentica uses a stateful REPL to manage context. This is an RLM-style loop.”” https://x.com/lateinteraction/status/2021994073675247816
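The tweet doesn’t share code, but the core idea — carrying state in a persistent REPL namespace instead of the model’s context window — can be sketched in a few lines. This is an illustrative toy, not Agentica’s implementation; `run_rlm_loop` and the stubbed `toy_model` are hypothetical names:

```python
# Minimal sketch of an RLM-style agent loop: state lives in a persistent
# REPL namespace that survives across steps, rather than in the model's
# ever-growing context window.

def run_rlm_loop(model, task, max_steps=10):
    repl_state = {}   # persistent namespace shared across all steps
    last_output = ""
    for _ in range(max_steps):
        # The model sees only the task and the *latest* REPL output,
        # not the full history -- the REPL carries the state instead.
        action = model(task, last_output)
        if action["kind"] == "answer":
            return action["text"]
        # Execute the model-written code in the shared namespace.
        exec(action["code"], repl_state)
        last_output = repr(repl_state.get("result"))
    return last_output

# Toy stand-in for an LLM call: write code once, then answer.
def toy_model(task, last_output):
    if last_output == "":
        return {"kind": "code", "code": "result = sum(range(5))"}
    return {"kind": "answer", "text": f"sum is {last_output}"}

print(run_rlm_loop(toy_model, "sum the integers 0..4"))  # sum is 10
```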
What the hell happened with AGI timelines in 2025? | 80,000 Hours https://80000hours.org/podcast/episodes/agi-timelines-in-2025/
Announcing our $10M seed round and pitch deck | Adapt https://adapt.com/blog/pitch-deck
For those who don’t know: babushkin – in Russian – means “grandma’s” Grandma’s Ventures in AI – feels so cosy! so “come in, don’t stand in the cold,” so quietly judgmental about your benchmark charts, but still loving unconditionally”” https://x.com/TheTuringPost/status/2019541218355790191
Lots of folks spread false narratives about how ARC-1 was created in response to LLMs, or how ARC-2 was only created because ARC-1 was saturated. Setting the record straight: 1. ARC-1 was designed 2017-2019 and released in 2019 (pre LLMs). 2. The coming of ARC-2 was announced”” https://x.com/fchollet/status/2022036543582638517
Our ability to measure AI has been outpaced by our ability to develop it, and this evaluation gap is one of the most important problems in AI. Today we’re launching Open Benchmarks Grants — a $3M commitment to fund open benchmarks for frontier AI and close the evaluation gap.”” https://x.com/vincentsunnchen/status/2021663737716125781
Public benchmarks lag behind what frontier labs are using internally to test and develop LLMs, yet they are the key driver of progress for LLMs. This needs to change! Excited to work with @SnorkelAI who are investing $3M to build out the evaluation ecosystem with the community.”” https://x.com/lvwerra/status/2021671530108006705
StepFun-Flash-3.5 is now the #1 model on MathArena 🧮🏆 Fast enough to think. Reliable enough to reason. More updates coming soon. We are so back. 🚀 MathArena: https://t.co/b09fJSVecL OpenRouter: https://t.co/ZIaNfkCu7j Website: https://t.co/HcGbiBN8po Blog:”” https://x.com/CyouSakura/status/2021511358626554322
They don’t say it in the top level post but this is a recursive language model getting SOTA on ARC-AGI-2″” https://x.com/deepfates/status/2021991526856110252
$3M to support the development of open benchmarks!”” https://x.com/percyliang/status/2021701152333877681
AI needs better evaluations. Today we’re announcing Arena’s Academic Partnerships Program to fund independent academic research in AI evaluation and measurement. ▫️Up to $50K/project. Q1 Deadline: March 31, 2026. See more in thread for details and how to apply 👇”” https://x.com/arena/status/2021268433619374336
If you are in any situation where being right matters, you would, at this point, be making a mistake to not ask a frontier LLM for help. That can mean checking your own work, second opinions on other experts, or getting help with a complex problem. Have judgement, but use them”” https://x.com/emollick/status/2021052930410021335
Can just a 4B model solve IMO-level proof problems at the level of much stronger LLMs like Gemini 3 Pro? Yes, if you can train the LLM to scale test-time compute well! We’re very excited to release our 4B model “QED-Nano”, built via an awesome open collab! Details below🧵⬇️”” https://x.com/aviral_kumar2/status/2022057927368995097
Early testers of Gemini 3 Deep Think are already seeing results. We partnered with researchers to explore how this model could tackle rigorous, real-world applications — from spotting hidden flaws in research papers to optimizing semiconductor growth. Here’s how early testers”” https://x.com/Google/status/2022007977419415958
If you’re an Ultra subscriber, you can try the latest in the Gemini App, but we’re also making Deep Think available for the first time in the Gemini API! Request early access here:”” https://x.com/tulseedoshi/status/2021997870858350640
@GeminiApp Do people realize how crazy that thing is??”” https://x.com/LexnLin/status/2021986194780041394
The Codeforces result is “no tools”? So Gemini 3.0 Deep Think cannot write test cases to test its solution before submission? I guess even the top-1 human can’t get 3455 under this condition.”” https://x.com/YouJiacheng/status/2021985843074994534
Gemini 3 Deep Think benchmarks look amazing! On Codeforces, it scored 3,455 Elo. Apparently, only 7 humans in the world have a higher coding Elo score! A friend just sent me an output about a cancer mechanism that was so great that I am now resubscribing to Ultra for DT access!”” https://x.com/DeryaTR_/status/2022030594037989493
Gemini 3 Deep Think can help make things. 🧠 Here’s our side project: We sketched a laptop stand and Deep Think coded that into an interactive prototyping tool. We used that tool to generate a STL file, which we sent to @fleet_ai. And now I have a new laptop stand! What will”” https://x.com/joshwoodward/status/2022001967795777996
Gemini 3 Deep Think is available now in the @GeminiApp for Google AI Ultra subscribers and via the Gemini API to select researchers, engineers and enterprises through our early access program. Learn more ↓”” https://x.com/Google/status/2021982018679312829
Gemini 3 Deep Think is getting a significant upgrade. We’ve refined Deep Think in close partnership with scientists and researchers to tackle tough, real-world challenges. And it’s pushing the frontier across the most challenging benchmarks, achieving an unprecedented 84.6% on”” https://x.com/sundarpichai/status/2022002445027873257
Gemini 3 Deep Think now excels across scientific domains like chemistry and physics — achieving gold medal-level results on the written sections of the 2025 International Physics and Chemistry Olympiads.”” https://x.com/Google/status/2021982010739503138
Parsing PDFs at scale with LLMs is cost prohibitive. Newer models (e.g. Gemini 3) are good at reading PDFs, but you burn unnecessary vision tokens even when the page is text heavy. We’ve built a “cost-optimizer” into LlamaParse that will dynamically route pages to”” https://x.com/jerryjliu0/status/2021267495123140760
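The routing idea is simple to sketch. This is not LlamaParse’s actual implementation — just an illustrative heuristic where text-heavy pages skip the vision model entirely; the threshold, function name, and route labels are all made up:

```python
# Illustrative page router for PDF parsing: send text-heavy pages to
# cheap text extraction, and spend vision tokens only on image-heavy
# pages. Threshold and names are hypothetical.

TEXT_CHARS_THRESHOLD = 200  # min extractable chars to trust the text layer

def route_page(extracted_text, n_images):
    """Decide which parser a single PDF page should go to."""
    text_heavy = len(extracted_text.strip()) >= TEXT_CHARS_THRESHOLD
    if text_heavy and n_images == 0:
        return "text"      # no vision tokens spent at all
    if text_heavy:
        return "hybrid"    # text layer plus vision only for the figures
    return "vision"        # scanned or figure-only page

print(route_page("lorem " * 100, n_images=0))  # -> text
print(route_page("", n_images=3))              # -> vision
```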
The upgraded Deep Think mode is rolling out now in the @GeminiApp for Google AI Ultra subscribers. For scientific researchers and developers, we’re opening a Vertex AI Early Access Program for the API. Start discovering → https://x.com/GoogleDeepMind/status/2021981517791342807
There are only 7 people on the planet who can beat Gemini 3 Deep Think in coding competitions. It has an Elo of 3455. A bit over a year ago the best systems were at 2727 (o3-preview).”” https://x.com/scaling01/status/2021983388442509478
Today, we’re releasing a significant upgrade to our specialized reasoning mode, Gemini 3 Deep Think. Deep Think is built to drive practical applications, enabling researchers to interpret complex data and engineers to model physical systems through code. With the updated Deep”” https://x.com/GeminiApp/status/2021985731577852282
Opus 4.6 dethroned GPT-5.2-xhigh on WeirdML and is now in clear first place! Opus finds much shorter (so presumably simpler and more elegant) solutions to the problems. But code execution times went up. So maybe the difference in code length is due to optimizations? Would love”” https://x.com/scaling01/status/2020847174909665712
Opus 4.6, Codex 5.3, and the post-benchmark era https://www.interconnects.ai/p/opus-46-vs-codex-53
🤖 From this week’s issue: Official blog post announcing Qwen3-Coder-Next, an 80B-parameter coding model achieving competitive performance on SWE-Bench (70.6% on Verified) while enabling 10x higher throughput for repository-level agentic workflows.”” https://x.com/dl_weekly/status/2021690941879250945
📄We just launched PDF uploads in Arena. Upload PDFs with your prompts to add richer context and test models on document reasoning, bringing evaluations closer to real-world use. ▪️Ask questions directly against documents ▪️Digest complex, technical content in minutes ▪️Extract”” https://x.com/arena/status/2021300537711526113
Good thread on how we need academic research to be fast+updated. Imagine if the contribution of the SWE-Bench paper was “”AI can’t do software engineering””, and then the paper came out a year after the experiments were run.”” https://x.com/gneubig/status/2021370741237694705
Intelligence too cheap to meter: This 10 minute clip took 8 hours to create and cost around $60. That’s fast and inexpensive for an excellent anime clip. Soon everyone will be a movie director. And I think many still don’t understand what that means. We’ve crossed the”” https://x.com/kimmonismus/status/2021604639557464134
Imho SeeDance looks the most natural, the most human. It’s the little things: the wine moving in the glass, the facial expressions, the details. SeeDance is forcing Google and OpenAI to quickly update their models to Sora 2.5 / Veo 3.2, thus boosting performance.”” https://x.com/kimmonismus/status/2021176568563785908
❤️ GLM-5 is on Ollama’s cloud! It’s free to start, and with higher limits available on the paid plans. ollama run glm-5:cloud It’s fast. You can connect it to Claude Code, Codex, OpenCode, OpenClaw via ollama launch! Claude: ollama launch claude –model glm-5:cloud”” https://x.com/ollama/status/2021667631405674845
🎉 The mysterious Pony Alpha is finally revealed, congrats to @Zai_org on releasing GLM-5! SGLang is ready to support on day-0. 🛠️ 744B params (40B active) model built for complex systems engineering & long-horizon agentic tasks 📚 28.5T tokens pretraining for a stronger”” https://x.com/lmsysorg/status/2021639499374375014
🔥Congrats to @Zai_org on launching GLM-5 — 744B parameters (40B active), trained on 28.5T tokens, integrating DeepSeek Sparse Attention to keep deployment cost manageable while preserving long-context capacity. vLLM has day-0 support for GLM-5-FP8 with: 📖 DeepSeek Sparse”” https://x.com/vllm_project/status/2021656482698387852
🚀 Zhipu AI GLM-5: A Real Step Into the Top Tier? Zhihu contributor toyama nao offers a concise verdict: “A hard road upward — the stairway to godhood.” 🔮 From recovery to contention Over the past six months (4.5 → 5.0), Zhipu has climbed back into China’s first tier and now”” https://x.com/ZhihuFrontier/status/2022161058321047681
GLM-5 by @Zai_org is now the #1 open model in Code Arena, tied with Kimi-K2.5-Thinking! Overall #6 on par with Gemini-3-pro, 100+pts below Claude-Opus-4.6 in agentic webdev tasks. Congrats to the @Zai_org GLM team on the new milestone! 👏”” https://x.com/arena/status/2021996281141629219
GLM-5 from @Zai_org just climbed to #1 among open models in Text Arena! ▫️#1 open model on par with claude-sonnet-4.5 & gpt-5.1-high ▫️#11 overall; scoring 1452, +11pts over GLM-4.7 Test it out in the Code Arena and keep voting, we’ll see how GLM-5 performs for agentic coding”” https://x.com/arena/status/2021725350481526904
GLM-5 is coming to Coding Plan Pro users within one week, and we’re working to bring it to everyone after that. To be upfront: compute is very tight. Even before the GLM-5 launch, we were pushing every chip to its limit just to serve inference. We appreciate your understanding”” https://x.com/Zai_org/status/2021656633320018365
GLM-5 is now on AI Gateway. Better long-range planning, multiple thinking modes, and improved multi-step agent tasks versus previous https://t.co/Yqx8kVZ3i8 models. Use 𝚖𝚘𝚍𝚎𝚕: ‘𝚣𝚊𝚒/𝚐𝚕𝚖-𝟻’ to get started.”” https://x.com/vercel_dev/status/2021655129347539117
GLM-5 is the new leading open weights model! GLM-5 leads the Artificial Analysis Intelligence Index amongst open weights models and makes large gains over GLM-4.7 in GDPval-AA, our agentic benchmark focused on economically valuable work tasks GLM-5 is @Zai_org’s first new”” https://x.com/ArtificialAnlys/status/2021678229418066004
GLM-5 is ZAI’s new flagship. 744B params (40B active), trained on 28.5T tokens, and built for complex systems engineering and long-horizon agentic tasks. Two things worth paying attention to: 1. They integrated DeepSeek Sparse Attention to cut deployment costs while keeping”” https://x.com/cline/status/2021999167875555694
GLM-5 just launched — now available in Qoder. On Qoder Bench — our benchmark for real-world software engineering tasks — GLM-5 outperforms Sonnet 4.5 and approaches Opus 4.5. At a fraction of the cost. High demand expected — brief waits possible during peak hours. Scaling in”” https://x.com/qoder_ai_ide/status/2021639227814092802
GLM-5, the latest frontier open model from @Zai_org, is available now on Modal. We partnered with https://t.co/nhqgwNEWkB to release an endpoint that will be free for a limited time.”” https://x.com/modal/status/2021645783733616800
Pony Alpha Stealth model reveal: GLM-5 from @Zai_org GLM-5 is a new 744B foundation model for coding and agentic usecases. It achieves SOTA scores on top agent benchmarks, and has been used successfully in many agent flows during its Stealth period. Live now on OpenRouter!”” https://x.com/OpenRouter/status/2021639702789730631
Average Throughput of GLM-5 on Openrouter is 14 tps”” https://x.com/scaling01/status/2021981416452764058
Build more. Spend less. GLM-5 is now on YouWare. Landing pages, portfolios, prototypes. All handled fast, with a 200K context window. Save your premium credits for the big builds.”” https://x.com/YouWareAI/status/2021982784948936874
Congrats @Zai_org on GLM-5! Love the permissive MIT license (vs K2.5’s modified MIT). Haven’t chatted with it yet so no vibes, but from the numbers I’m not compelled to switch from @Kimi_Moonshot K2.5: • Similar evals, but GLM-5’s are at bf16 while K2.5’s are at int4 – GLM-5″” https://x.com/QuixiAI/status/2021651135615184988
Day-0 with @Zai_org: GLM-5 is live on DeepInfra 🔥 Built for long-horizon agents that plan, orchestrate, and self-correct. Serving ~100 TPS at launch and as usual the best price on the market!”” https://x.com/DeepInfra/status/2021666854088110318
GLM 5 is 2x the total parameter of GLM 4.5 + deepseek sparse attention for efficient long context this is going to be a crazy model”” https://x.com/eliebakouch/status/2020824645868630065
“GLM MoE DSA” is landing in transformers 👀”” https://x.com/xeophon/status/2020815776890909052
GLM-4.7-Flash-GGUF is now the most downloaded model on @UnslothAI.”” https://x.com/Zai_org/status/2021207517557051627
GLM-5 already available on OpenRouter (with even lower prices)”” https://x.com/scaling01/status/2021637257103651040
GLM-5 has a 200k context length and maximum output of 128k”” https://x.com/scaling01/status/2021628691357298928
GLM-5 is massive. 745B params. LETS FUCKING GOOOOO This should be fun!”” https://x.com/scaling01/status/2020840989947298156
GLM-5 Pricing $1 and $3.2 Output There is also a GLM-5 Code variant that is more expensive👀 almost 8 times cheaper than Opus”” https://x.com/scaling01/status/2021628971939418522
GLM-5 runs with mlx-lm on a single 512GB M3 Ultra in Q4. It’s quite good in my initial testing and pretty fast as well. It generated a highly functional space invaders game using 7.1k tokens at 15.4 tok/s and 419GB memory. Thanks to @ActuallyIsaak and @kernelpool for the port.”” https://x.com/awnihannun/status/2022007608811696158
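The 419 GB figure in the tweet above is about what a back-of-envelope Q4 estimate predicts — roughly half a byte per parameter for the weights, with quantization scales, KV cache, and activations making up the rest:

```python
# Back-of-envelope memory estimate for GLM-5 quantized to 4 bits.
params = 744e9           # total parameters
bytes_per_param = 0.5    # 4 bits per weight
weights_gb = params * bytes_per_param / 1e9
print(f"weights alone: ~{weights_gb:.0f} GB")  # ~372 GB

# Quantization group scales, KV cache, and runtime activations plausibly
# account for the gap up to the ~419 GB observed on the M3 Ultra.
```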
https://t.co/ctlyPtiB3j GLM-5 architecture is out: ~740B parameters ~50B active 78 layers, MLA attention lifted from DeepSeek V3, plus DeepSeek V3.2’s sparse attention indexer for 200k context. Basically DeepSeek V3 scale with DSA bolted on.”” https://x.com/QuixiAI/status/2021111352895393960
GLM-5 is out on @huggingface 🔥 > A40B/744B, trained on more tokens (28.5T) > outperforms/on par with closed sota > allows commercial use (MIT licensed) 💗 use with vLLM/SGLang locally or through HF Inference Providers thanks to @novita_labs and @Zai_org 📦”” https://x.com/mervenoyann/status/2021642658188538348
DeepSeek V4-lite, MiniMax 2.5, GLM-5: what a bloodbath. Will Qwen accelerate the release of 3.5?”” https://x.com/teortaxesTex/status/2021586965594857487




