Image created with gemini-3.1-flash-image-preview, prompt by claude-sonnet-4-5. Image prompt: Wide static shot of a surveyor with measuring tape between numbered concrete pillars in a half-demolished Chinese industrial yard, desaturated gray-blue palette, a chestnut horse standing calmly among construction debris in middle distance, overcast flat daylight, observational realism, white text overlay reading BENCHMARKS in upper third, Jia Zhangke documentary stillness, weathered surfaces, decelerated moment.
This is a new separate estimate for LLM time horizon doubling times, and it mostly agrees with METR: in this case, ~4.8-5.7 months. https://x.com/scaling01/status/2023350946139435357
Spotify’s Top Developers Haven’t Written Code Since December, CEO Says – Business Insider https://www.businessinsider.com/spotify-developers-not-writing-code-ai-2026-2
141 days for Sonnet to go from 13.6% to 60.4% on ARC-AGI-2. https://x.com/scaling01/status/2023850250662969587
Sonnet 4.6 benchmarks: 79.6% SWE-Bench Verified, 58.3% ARC-AGI-2. https://x.com/scaling01/status/2023818940112327101
We estimate that Claude Opus 4.6 has a 50%-time-horizon of around 14.5 hours (95% CI of 6 hrs to 98 hrs) on software tasks. While this is the highest point estimate we’ve reported, this measurement is extremely noisy because our current task suite is nearly saturated. https://x.com/METR_Evals/status/2024923422867030027
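For readers new to these numbers, a 50%-time-horizon is typically derived by fitting a logistic curve of task success against log task length, then reading off the length at which predicted success crosses 50%. A minimal sketch of that fit is below; the data points are invented for illustration, and this is not METR’s actual code or task suite.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (task length in minutes, observed success rate) pairs.
# Real measurements would come from many agentic tasks with human
# baseline times; these numbers are made up for illustration.
lengths = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480, 960], dtype=float)
success = np.array([1.0, 1.0, 0.9, 0.9, 0.8, 0.7, 0.6, 0.5, 0.35, 0.2, 0.1])

def logistic(log2_t, log2_h50, slope):
    # P(success) as a function of log2(task length); h50 is the task
    # length at which predicted success is exactly 50%.
    return 1.0 / (1.0 + np.exp(slope * (log2_t - log2_h50)))

params, _ = curve_fit(logistic, np.log2(lengths), success, p0=[7.0, 1.0])
log2_h50, slope = params
print(f"50% time horizon ~= {2 ** log2_h50:.0f} minutes")
```

The wide confidence interval METR reports falls out of the same picture: near saturation, few tasks sit on the steep part of the curve, so the fitted crossing point moves a lot with small changes in the task mix.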
Announcing Spreadsheet Arena | Meridian https://www.meridian.ai/blog/all/spreadsheet-arena
Excited to launch Gemini 3.1 Pro! Major improvements across the board including in core reasoning and problem solving. For example scoring 77.1% on the ARC-AGI-2 benchmark – more than 2x the performance of 3 Pro. Rolling out today in @GeminiApp, @antigravity and more – enjoy! https://x.com/demishassabis/status/2024519780976177645
Gemini 3.1 Pro benchmarks: 77.1% ARC-AGI-2, 80.6% SWE-Bench Verified. https://x.com/scaling01/status/2024514798470181370
Gemini 3.1 Pro is here. Hitting 77.1% on ARC-AGI-2, it’s a step forward in core reasoning (more than 2x 3 Pro). With a more capable baseline, it’s great for super complex tasks like visualizing difficult concepts, synthesizing data into a single view, or bringing creative… https://x.com/sundarpichai/status/2024516418855981298
Gemini 3.1 Pro landed today. This is based on the same model behind the agentic DeepThink released last week; it is now available to all Gemini users on many apps. This is a really good model especially in reasoning and multimodal understanding/generation. Try it out. https://x.com/mirrokni/status/2024525808501477568
Gemini 3.1 Pro: Announcing our latest Gemini AI model https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/
Holy sh*t, that’s what I call an improvement! Gemini 3.1 Pro is insane: – ARC-AGI-2 77% – SWE-bench Verified 80% – HLE 44%/51% https://x.com/kimmonismus/status/2024521970184868000
To the Scientist, the Engineer, and the Developer: Gemini 3.1 Pro has arrived in @GeminiApp. It’s a significant leap in complex reasoning (77.1% on ARC-AGI-2) so it’s great at agentic tasks, intricate coding, and data synthesis projects. You should see fewer errors, better… https://x.com/joshwoodward/status/2024515741819842623
Today, we’re continuing to push the boundaries of AI with our release of Gemini 3.1 Pro. This updated model scores 77.1% on ARC-AGI-2, more than double the reasoning performance of its predecessor, Gemini 3 Pro. Check out the visible improvement in this side-by-side comparison… https://x.com/JeffDean/status/2024525132266688757
Gemini 3.1 Pro is here! It’s top 3 across Text and Vision Arena, and #6 in Code Arena, tied closely with Claude Opus 4.5. Highlights: ▪️Tied #1 in Text (scoring 1500), 4 pts from Opus 4.6 ▪️Top 3 in Arena Expert Leaderboard (scoring 1538), just behind Opus 4.6 ▪️#6 in Code… https://x.com/arena/status/2024519891295089063
Gemini 3.1 Pro WebDev Arena results: – 6th place behind Opus 4.5/4.6 and GPT-5.2-high. https://x.com/scaling01/status/2024522048312054142
Multimodal function calling is now available in the Gemini Interactions API: build agents that can see and process images natively. 🖼️ Tools return actual images, not text descriptions 👁️ Gemini 3 natively processes returned images 🛠️ Function results support mixed text and… https://x.com/_philschmid/status/2022349886318928158
Update regarding Gemini 3.1 Pro: -Ranked #1 among all Gemini models released to date. -Ranked #1 among all models I have tested so far. (GPT-5.2 high 165.9 vs Gemini 3.1 Pro 166.6) However, please note that my testing has limitations due to budget constraints: -I have not… https://x.com/Hangsiin/status/2024605310913216614
Introducing Lyria 3, our latest and most advanced music model, available in the Gemini App starting today : ) Go from idea, image, or video to music in seconds! https://x.com/OfficialLoganK/status/2024153948488118513
Meet Lyria 3, our latest music generation model from @GoogleDeepMind. 🎶 Now, you can create custom music tracks in the @GeminiApp — just by describing an idea or uploading an image or video. https://x.com/Google/status/2024154379838705920
We just launched Lyria 3! Our most advanced AI music model in the @GeminiApp 🎵 – Generates 30-second tracks from text or image prompts. – Supports custom lyrics, vocals, and cover art. – Supports 8 languages including English, Japanese, and Korean. – All outputs watermarked with… https://x.com/_philschmid/status/2024154542061805988
Use Lyria 3 to create music tracks in the Gemini app https://blog.google/innovation-and-ai/products/gemini-app/lyria-3/
Introducing EVMbench | OpenAI https://openai.com/index/introducing-evmbench/
Introducing EVMbench–a new benchmark that measures how well AI agents can detect, exploit, and patch high-severity smart contract vulnerabilities. https://x.com/OpenAI/status/2024193883748651102
How efficient is MiniMax M2.5? We benchmarked on 8xH200 TEP8 with @vllm_project. At a reasonable 10-25s TTFT, M2.5 is able to sustain ~2500 tok/s/GPU throughput. For decode, it’s still possible to reach ~20 tok/s/GPU throughput at a strict 20 tok/s/user interactivity with 10K+… https://x.com/SemiAnalysis_/status/2023418414203646066
MLX MiniMax 2.5 running LOCALLY on a single M3 Ultra 512GB! Writing a poem on LLMs at 6bit quantization! 🔥 Let’s start some coding, context and distributed tests! Generation: 40.2 tokens-per-sec Peak memory: 186 GB. https://x.com/ivanfioravanti/status/2022338870172684655
Alibaba Yunqi: 7 models released in 4 days (Qwen3-Max, Qwen3-Omni, Qwen3-VL) and $52B roadmap | AINews https://news.smol.ai/issues/25-09-23-alibaba-yunqi
Alibaba’s new Qwen3.5-397B-A17B is the #3 open weights model in the Artificial Analysis Intelligence Index – a significant upgrade from Qwen3-235B-A22B-2507, and achieved with fewer active parameters than leading peers. Qwen3.5-397B-A17B is the first model released by Alibaba… https://x.com/ArtificialAnlys/status/2023794497055060262
Qwen https://qwen.ai/blog?id=qwen3.5#spatial-intelligence
Qwen3.5’s thinking is downright excessive. https://x.com/QuixiAI/status/2023995215690781143
Announcing AA-WER v2.0 Speech to Text accuracy benchmark, and AA-AgentTalk, a new proprietary dataset focused on speech directed at voice agents. AA-AgentTalk focuses on the speech that matters most to voice agents. As a held-out, proprietary dataset, AA-AgentTalk also mitigates… https://x.com/ArtificialAnlys/status/2024157398139883729
Small update to the leaderboard at https://t.co/AU0F7BjYEh: it’s now all results from running with mini-SWE-agent v2, an upgrade over v1 that gets more juice out of the base models. https://x.com/OfirPress/status/2024177059895877802
We just updated the official SWE-bench leaderboard comparing all models with the exact same scaffold (mini-SWE-agent v2). Detailed cost analysis & links to browsable trajectories in 🧵 https://x.com/KLieret/status/2024176335782826336
On evaluating multi-step scientific tool use in LLM agents. SciAgentGym provides an interactive environment with 1,780 specialized tools across 4 scientific disciplines. The core finding: even advanced models like GPT-5 see success rates drop sharply from 60.6% to 30.9% as… https://x.com/dair_ai/status/2023404773031166320
The crazy part is that the AI Labs have generally been right. Like, the stuff they hyped in 2023 turned out to be real and working today. That doesn’t mean that the stuff they are predicting for 2028 will also be real, but it is probably worth noting those predictions & watching. https://x.com/emollick/status/2023257496069046563
📊Let’s dive deeper into @AnthropicAI’s Sonnet 4.6 vs 4.5. Overall: Sonnet 4.6 ranks 3 places higher (#13 vs #16). Where Sonnet 4.6 gains: Code: ▪️WebDev (+19 for Sonnet 4.6: #3 vs #22) Text: ▪️Instruction Following (+6, #5 vs #11) ▪️English (+5, #9 vs #14) ▪️Hard Prompts (+5, …) https://x.com/arena/status/2024892330743124246
Claude Sonnet 4.6 (medium) scores 66.1% on WeirdML, matching Opus 4.6 (no thinking) and a big advance from Sonnet 4.5 at 47.7%. I had to run it on medium reasoning level because the default (high) constantly hit the 64k max tokens limit. Even at medium it uses as many output… https://x.com/htihle/status/2024764946051907659
Claude Sonnet 4.6 takes second place in the Artificial Analysis Intelligence Index (behind Opus 4.6), but used ~3x more output tokens than Claude Sonnet 4.5 in its max effort mode. Sonnet 4.6 leads all models in GDPval-AA and TerminalBench, including a slight lead over Opus 4.6. https://x.com/ArtificialAnlys/status/2024259812176121952
When I joined METR I was really skeptical that we were evaling models using simple OS scaffolds rather than Claude Code / Codex / etc. I really appreciate Nikola looking into this and I’m surprised it still doesn’t seem to make much difference for CC on Opus 4.5. https://x.com/ajeya_cotra/status/2022419978495127828
GLM-5 scores 48.2% on WeirdML, beating Claude Sonnet 4.5 and tying gpt-oss-120b (high) for the best open model. This is a clear advance but still far from Opus-4.6 at 78% and gpt-5.2 at 72%. https://x.com/htihle/status/2023734346943775179
OpenAI and Anthropic are much further ahead than what benchmarks show. While you are token constrained, they are blasting millions of tokens at 4x the API speed without batting an eye, and they scaffold like they are trying to build a skyscraper. https://x.com/scaling01/status/2023837889478758495
I looked into how Claude Code and Codex compare to the default scaffolds METR uses for time horizon measurements. It looks like they don’t significantly outperform our default scaffolds on any models we’ve tried them on so far. https://x.com/nikolaj2030/status/2022398669337825737
We’re officially opening our Bengaluru office–our new home base in India, and Anthropic’s second office in Asia-Pacific. India is our second-largest market for https://t.co/RxKnLNNcNR. We’re launching new partnerships to deepen our long-term commitment: https://x.com/AnthropicAI/status/2023322514206957688
WeirdML Time Horizons! Inspired by @METR_Evals I found time-horizons for the WeirdML tasks, using LLM-estimated human completion times. We find horizons of ~24 min (GPT-4) to ~38 hours (Opus 4.6), doubling time ~5 months. Links to blog post, git-repo + nice figures in thread. https://x.com/htihle/status/2023349189271572975
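As a sanity check on that doubling-time arithmetic: ~24 minutes to ~38 hours is a factor of about 95, or roughly 6.6 doublings, so a ~5 month doubling time is what you get if the two endpoints are about three years apart. A quick sketch (the GPT-4 and Opus 4.6 release dates are approximate assumptions; the horizon values are the ones quoted above):

```python
import math
from datetime import date

# Two (release date, 50% horizon in minutes) endpoints from the quoted
# WeirdML analysis; the dates are approximate assumptions.
points = [
    (date(2023, 3, 14), 24.0),         # GPT-4: ~24 minutes
    (date(2026, 2, 5), 38.0 * 60.0),   # Opus 4.6: ~38 hours (date assumed)
]

(d0, h0), (d1, h1) = points
months = (d1 - d0).days / 30.44        # average month length in days
doublings = math.log2(h1 / h0)
print(f"{doublings:.1f} doublings over {months:.0f} months "
      f"-> doubling time ~= {months / doublings:.1f} months")
```

With these assumed dates the script lands around 5.3 months per doubling, consistent with both the WeirdML estimate and the ~4.8-5.7 month range quoted earlier in this section.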
Exclusive: Peter Thiel-backed industrial AI startup emerges from stealth with funding from a16z | Fortune https://fortune.com/2026/02/09/exclusive-peter-thiel-alexis-ohanian-new-ai-industrial-startup-emanate-kiara-nirghin/
GDPval remains one of the best benchmarks for doing complex real world agentic tasks. But worth noting that GDPval-AA is not the same thing. It only uses the public problem set, and all evaluation is done by Gemini, not by humans/specialized graders like in the real GDPval. https://x.com/emollick/status/2023854803328311722
I think people are overinterpreting these time horizon evals. They are very impressive! But when error rates are near zero, and tasks require many successful steps in order to complete, small absolute improvements in error rate have a multiplicative effect. Consider a task… https://x.com/xlr8harder/status/2024946945232445710
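The arithmetic behind that point: if a task takes n independent steps and each step succeeds with probability (1 - e), the whole task succeeds with probability (1 - e)^n, so the number of steps a model can chain at 50% reliability is ln 0.5 / ln(1 - e), roughly 0.69/e for small e. Halving the per-step error rate therefore roughly doubles the horizon even though each individual step barely improves. A toy illustration (the error rates are invented):

```python
import math

def horizon_steps(step_error: float, target: float = 0.5) -> float:
    # Number of sequential steps a model can chain before overall
    # success probability drops to `target`, assuming independent steps.
    return math.log(target) / math.log(1.0 - step_error)

for e in [0.04, 0.02, 0.01, 0.005]:
    print(f"per-step error {e:>5.1%} -> ~{horizon_steps(e):6.0f} steps at 50%")
```

Each halving of e (4% to 2% to 1% to 0.5%) roughly doubles the 50% horizon (about 17, 34, 69, 138 steps), which is why smooth exponential horizon growth can coexist with per-step progress that looks incremental.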
Last year, we noticed a gap between scores in our SWE-bench Verified runs and scores reported elsewhere. We’ve now updated our evaluation methodology. For most models, we’re seeing scores close to those reported by the original model developers. https://x.com/EpochAIResearch/status/2024924403142910137
Seems like a lot of people are taking this as gospel–when we say the measurement is extremely noisy, we really mean it. Concretely, if the task distribution we’re using here was just a tiny bit different, we could’ve measured a time horizon of 8 hours, or 20 hours. https://x.com/idavidrein/status/2024938968434049117
Very curious that the extra reasoning that everyone observed during testing isn’t showing up on the AA-index. https://x.com/scaling01/status/2024519669680320659
We went from AI systems that struggled to do grade school math to AI systems that can solve research-level math problems in just a few years. I agree with Jakub this is perhaps the most important eval now. I am also pretty sure the main reaction will be “it’s not that hard” :) https://x.com/sama/status/2022729068949717182
The model is a step forward in reasoning, designed for workflows where a simple answer isn’t enough. On ARC-AGI-2 – which tests for novel logic patterns – it more than doubles 3 Pro’s score. This means it can help you visualize complex topics, organize scattered data, and bring… https://x.com/GoogleDeepMind/status/2024516467618656357
Earlier today I wanted to doom about Gemini 3.1 Pro completely failing ARC-AGI-3. Turns out this was due to a bug in the config introduced by GPT-5.3. It was still calling Gemini 3.0 Pro instead of 3.1. I fixed it, made the harness simpler and spent $120. Performance of Gemini… https://x.com/scaling01/status/2024642220096442772
Gemini 3.1 is the faster horse. It’s like a horse with rocket fuel. Truly insane. Everyone else makes cars now. https://x.com/theo/status/2024808734053347608
Gemini 3.1 Pro on ARC-AGI Semi-Private Eval @GoogleDeepMind – ARC-AGI-1: 98%, $0.52/task – ARC-AGI-2: 77%, $0.96/task. Gemini to push the Pareto Frontier of performance and efficiency. https://x.com/arcprize/status/2024522812728496470
Gemini 3.1 Pro Preview scored highest in the Artificial Analysis Intelligence Index but its most significant advantage might be its price and token efficiency. Our evaluations cost <50% to run on Gemini 3.1 Pro Preview compared to Claude Opus 4.6 (max) and GPT-5.2 (xhigh). Gemini… https://x.com/ArtificialAnlys/status/2024677979390169536
Gemini Pro 3.1 (& other frontier models) are still terrible at Connect 4, yet smashing ARC-AGI-2. That is weird, right? ARC was built to be resistant to overfitting. I guess the fully generalised world of ARC-AGI puzzles is still a very narrow slice of spatial reasoning. https://x.com/paul_cal/status/2024748708223402120
I gave Gemini 3.1 Pro an ARC-AGI-2 challenge WITH solution and it bombed it … SVGs might have been successfully sloptimized. GPT-5.2 Thinking realizes after 14s of thinking that I gave it the solution in the input and just repeats it. Gemini 3.1 Pro thought for 8 minutes… https://x.com/scaling01/status/2024268831321993590
Loving Gemini 3.1 Pro! It made 3 huge improvements to my compiler and saw things that even ChatGPT 5.2 Pro Extended and Claude Opus 4.6 Extended couldn’t see. https://x.com/QuixiAI/status/2024545096532733967
Oh, and ARC-AGI-3 is crazy expensive to run. https://x.com/scaling01/status/2024650634746610041
By the way, the recent Gemini 3.1 Pro is also a really good model for RLMs. Claude Opus 4.6 is the worst of the ones I tested. Probably not optimized for the type of decomposition that RLMs need. I am just impressed by GPT-5.2-Codex. The strategies it uses are brilliant. https://x.com/omarsar0/status/2024973182436831629
Claude Sonnet 5: The “Fennec” Leaks – Fennec Codename: Leaked internal codename for Claude Sonnet 5, reportedly one full generation ahead of Gemini’s “Snow Bunny.” – Imminent Release: A Vertex AI error log lists claude-sonnet-5@20260203, pointing to a February 3, 2026 release. https://x.com/pankajkumar_dev/status/2018187650927349976
Gemini 3.1 Pro will be a massive step-up! There’s a decent chance it’s on par with Opus 4.6 and GPT-5.3. The main reason for that: similarly to Claude 4.6 and GPT-5.2/5.3, it thinks much longer than Gemini 3 Pro. The same request on aistudio, tested multiple times, had 6… https://x.com/scaling01/status/2024251668771066362
Google is once again the leader in AI: Gemini 3.1 Pro Preview leads the Artificial Analysis Intelligence Index, 4 points ahead of Claude Opus 4.6 while costing less than half as much to run. @GoogleDeepMind gave us pre-release access to Gemini 3.1 Pro Preview. It leads 6 of the… https://x.com/ArtificialAnlys/status/2024518545510662602
In Arena Expert, with expert level prompts, Gemini 3.1 Pro Preview lands in the top 3 (scoring 1538), just behind Claude Opus 4.6. https://x.com/arena/status/2024519895623598423
Sonnet 4.6 crushes Gemini 3 and GPT-5.2 on Vending-Bench 2. https://x.com/scaling01/status/2023833660546499053
Claude Sonnet 4.6 has landed #3 in Code and #13 in Text Arena! Highlights: ▪️+130 pts jump in Code Arena (#22 -> #3) compared to Sonnet 4.5, surpassing top-tier thinking models like Gemini-3.1 and GPT-5.2 ▪️Strong gains in Text categories: Math (#4) and Instruction Following… https://x.com/arena/status/2024883614249615394
📊 Let’s dive deeper into Gemini 3.1 Pro gains. It ranks 13 points above Gemini 3 Pro overall. We see the largest rank gains for @GoogleDeepMind’s latest model in the following categories: Text: ▪️Coding (+5) ▪️Math (+4) ▪️Expert (+3) ▪️Instruction Following (+3) ▪️Multi-Turn… https://x.com/arena/status/2024588456463389040
Check out the skills for the Gemini API! More soon! https://x.com/osanseviero/status/2022259577232785866
Context Arena Update: Added @Google’s Gemini 3.1 Pro Preview to the MRCR leaderboards (2-, 4-, 8-needle)! Meant to send this out earlier today. Thanks to @GoogleDeepMind and others over there for early access! Thinking budget barely matters on simpler retrieval – 2-needle AUC… https://x.com/DillonUzar/status/2024655613293215855
Gemini 3.1 Pro has landed! Amazing performance / capabilities across the board. Beyond SOTA, the best are all the things that evals can’t measure. E.g. SVG has gotten so much better (see 🧵) https://x.com/OriolVinyalsML/status/2024519605570720185
Gemini 3.1 Pro in 1st place on the Artificial Analysis Leaderboard. https://x.com/scaling01/status/2024517196727099847
Gemini 3.1 Pro is rolling out now in the @GeminiApp, and exclusively to Google AI Pro and Ultra users in @NotebookLM. Developers can access it in preview via the API in @GoogleAIStudio. Find out more → https://x.com/GoogleDeepMind/status/2024516471720743295
Gemini 3.1 Pro’s GDPval scores are concerning. https://x.com/scaling01/status/2024515061163704336
Gemini Deep Think 3 is the world’s most capable model by many measures, with huge amounts of progress on reasoning benchmarks and more. Available right now via the Gemini App for Ultra subscribers and in the API soon : ) https://x.com/OfficialLoganK/status/2021996626144080015
Good news: Google AI Studio and the Gemini API are now live in Moldova, Andorra, San Marino, and Vatican City! 🌍 https://x.com/OfficialLoganK/status/2022688445957820610
Google is back on the intelligence-cost frontier with Gemini 3.1 Pro. https://x.com/scaling01/status/2024519007018373202
Google tests NotebookLM integration for Opal workflows https://www.testingcatalog.com/google-test-notebooklm-integration-for-opal-workflows/
I would expect only a few models to make progress with this rather simple harness: GPT-5.2-xhigh, Opus 4.5, Opus 4.6 and Gemini 3.1 Pro. Other models will have a very hard time. https://x.com/scaling01/status/2024661145286557872
Last week we upgraded Gemini 3 Deep Think. Today, we’re shipping the core intelligence that makes those breakthroughs possible: Gemini 3.1 Pro. A noticeably smarter, more capable baseline for your hardest challenges. Available now: https://x.com/NoamShazeer/status/2024519946764734574
Multimodal Function Calling with Gemini 3 and Interactions API https://www.philschmid.de/interactions-multimodal-fc
My vibe is unchanged: Gemini 3.1 is a previous gen model. It naively lives in a context-universe engineered by the God-User. Opus is a friend-type AI. It sits with you in a KFC. 5.2 sees a vast expanse of thought. Below there’s a given context. A user makes some noise, perhaps. https://x.com/teortaxesTex/status/2024574416747671556
Saw Gemini 3.1 announcement, got super excited. Tried Google Antigravity… not available. Tried Gemini CLI… not available. Tried Gemini Code Assist… not available. @OfficialLoganK put AI Studio in an Electron Shell and just launch it. You will deliver these faster. https://x.com/matvelloso/status/2024548414198091922
Today we’re releasing a preview of Gemini 3.1 Pro and making it available to our users and developers. Very excited to bring the upgraded core we used in Deep Think to everyone. Learn more about Gemini 3.1 Pro: https://x.com/koraykv/status/2024517699595124902
We just made paying for the Gemini API 10x easier : ) You can now upgrade to a paid Gemini API account without leaving AI Studio, track your usage, filter spend by model, and much more to come! https://x.com/OfficialLoganK/status/2022409335465480346
We made a skill for the Gemini API! https://x.com/OfficialLoganK/status/2022123808296251451
Here are some useful prompting tips to get the most out of our new music generation model in Gemini, Lyria 3 ↓ https://x.com/GeminiApp/status/2024167107538407783
Introducing Lyria 3, our new music generation model in Gemini that lets you turn any idea, photo, or video into a high-fidelity track with custom lyrics. From funny jingles to lo-fi beats, you can create custom 30-second soundtracks for any moment. See how it works. 🧵 https://x.com/GeminiApp/status/2024152863967240529
Impressive benchmarks for the new Chinese LLM. The system card notes some gaps with US closed source models in code generation & wide knowledge, so I’ll be interested to see it in operation. Not clear it is open weights though? If not, that represents a large shift in the AI market. https://x.com/emollick/status/2022658647378268361
Blog about @MiniMax_AI’s Forge RL system. Core takeaways: 1. still CISPO 2. process reward, completion time reward 3. multi-level prefix cache 4. rollout uses 60% compute 5. millions of trajectories per day https://t.co/IrKDOoiKAB cc @teortaxesTex https://x.com/YouJiacheng/status/2022339475049947576
The dark side of reinforcement learning. @olive_jy_song, senior researcher at @MiniMax_AI, on RL models that try to hack rewards and why alignment fails in practice. This conversation is an inside look at how Chinese AI labs move fast – testing new models overnight, debugging… https://x.com/TheTuringPost/status/2022961676799398337
🤔Has MiniMax finally stabilized its path in reasoning and coding? Still a hot review from Zhihu contributor toyama nao, and he calls it: “Root downward, grow upward.” 🔥 After the flawed M2.1 (stronger coding, weaker logic), M2.5 fixes the technical issues and restores balance… https://x.com/ZhihuFrontier/status/2022214461415993817
$1 per hour with 100 tps. https://x.com/MiniMax_AI/status/2022379949336957254
It’s been a few days since onboarding @MiniMax_AI’s latest model, M2.5, in standard and Lightning variants. Results are showing on our leaderboard. With over 3K votes, M2.5 Lightning ranks eighth among open models, with Standard following closely behind! Let’s run some prompts: https://x.com/yupp_ai/status/2024165671136059892
MiniMax M2.5 casually responding at ~50 tok/s with MLX (M3 Ultra). The model was released one hour ago 🥳 https://x.com/pcuenq/status/2022336556326060341
Nice independent look at SWE-bench Verified by @simonw. MiniMax M2.5 showing strong results under the same evaluation setup. Worth a read. https://x.com/MiniMax_AI/status/2024646767325958285
People were saying as early as Oct 2024 that SWE-bench was saturated, when scores were just ~50%. Awesome chat from the MiniMax team that shows otherwise. We’re certainly much, much closer, but there’s evidence that some room remains. Tiny 🧵 https://x.com/jyangballin/status/2022367240293949772
RL often throws away useful signal at intermediate steps, or as @karpathy put it, it’s like “sucking supervision through a straw.” MiniMax M2.5 solves this with per-token process rewards. The result is frontier coding performance at 1/10th the cost of closed source. https://x.com/basetenco/status/2022456010049495213
RL shouldn’t waste signal. M2.5’s per-token process rewards improve signal utilization across reasoning steps, delivering frontier coding performance with dramatically better cost efficiency. Thanks @basetenco for the deep dive and day-0 hosting! https://x.com/MiniMax_AI/status/2023470874708549941
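To make the “straw” metaphor concrete: outcome-only RL broadcasts a single end-of-trajectory scalar across every token, while per-token process rewards score each step individually so credit is localized. The sketch below illustrates only that difference in credit assignment; the rewards are random stand-ins, and this is not MiniMax’s actual Forge implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 8  # tokens (or steps) in one rollout

# Outcome-only: a single end-of-episode reward is the entire learning
# signal; every token inherits the same scalar.
outcome_reward = 1.0
outcome_advantages = np.full(T, outcome_reward)

# Per-token process rewards: each step gets its own score (here from a
# stand-in random "process reward model"), so credit is localized and
# a good step inside a failed trajectory still gets positive signal.
process_rewards = rng.uniform(-1.0, 1.0, size=T)
gamma = 0.99  # discount for reward-to-go
process_advantages = np.array(
    [sum(gamma**k * process_rewards[t + k] for k in range(T - t))
     for t in range(T)]
)

print("outcome-only advantages:", outcome_advantages)
print("per-token advantages:   ", process_advantages.round(2))
```

The practical claim in the quotes above is that this denser signal buys sample efficiency: fewer rollouts wasted relearning which steps in a long trajectory actually mattered.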
Qwen3.5-397B-A17B SVG results: I have seen better. DeepSeek-V3.2 and GLM-5 both beat it. https://x.com/scaling01/status/2023364296277721300
🚀 Qwen3.5-397B-A17B-FP8 weights are now open! It took some time to adapt the inference frameworks, but here we are: ✅ SGLang support is merged 🔄 vLLM PR submitted → https://t.co/rJkuitOBWs Check the model card for example code. vLLM support landing in the next couple of days! https://x.com/Alibaba_Qwen/status/2024161147537232110
🚩Cerebras’s MiniMax-M2 GGUF 2-bit model: https://t.co/udlviJQZqQ Qwen3-Coder-Next INT4 model: … https://x.com/HaihaoShen/status/2022293472796180676
A clarification of Qwen3.5 Plus and 397B: 1. for opensource, we follow the tradition to make parameters apparent so we use the name with the number of total parameters and active params. 2. Qwen3-Plus is a hosted API version of 397B. As the model natively supports 256K tokens… https://x.com/JustinLin610/status/2023340126479569140
It’s Qwen 3.5 day today! 🥳 State of the art 800 GB model. Runs _locally_ with MLX using Q4, taking 225 GB of RAM. https://x.com/pcuenq/status/2023369902011121869
Let’s do the KV cache math for Qwen3.5: – KV heads: 2 – Head dimension: 256 – gated attention layers: 15 – bytes per element (BF16): 2. So 2 × 256 × 15 × 2 = 15,360 bytes. This is the same for K and V, so we multiply by 2: 30,720 bytes, roughly 31 KB per token of context. Meaning at max… https://x.com/bnjmn_marie/status/2023424404504342608
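The same arithmetic as a short script, using only the dimensions quoted above (2 KV heads, head dimension 256, 15 gated attention layers, BF16), plus the 256K native context mentioned in the Qwen3.5 Plus note:

```python
# KV cache footprint per token, from the quoted Qwen3.5 dimensions.
kv_heads = 2
head_dim = 256
attn_layers = 15      # only the gated attention layers carry KV cache
bytes_per_elem = 2    # BF16

per_token_k = kv_heads * head_dim * attn_layers * bytes_per_elem  # 15,360
per_token_kv = 2 * per_token_k                                    # K and V
print(f"{per_token_kv} bytes (~{per_token_kv / 1024:.0f} KiB) per token")

# At the 256K native context mentioned above:
ctx = 256 * 1024
print(f"full 256K context: ~{per_token_kv * ctx / 2**30:.1f} GiB")
```

That works out to about 7.5 GiB of KV cache at full context, which is why the hybrid design barely moves memory use as context grows (see the mlx-lm observation below).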
ollama run qwen3.5:cloud Qwen3.5-397B-A17B is the first open-weight model in the series. It’s available on Ollama’s cloud right now! Give it a try. Let’s go! 🚀🚀🚀 https://x.com/ollama/status/2023334181804069099
Qwen 3.5 Plus is now available on AI Gateway. Thanks @vercel_dev team. 🤝 Use model: ‘alibaba/qwen3.5-plus’ Try it now! https://x.com/Alibaba_Qwen/status/2024029499541909920
Qwen3.5 runs quite well in mlx-lm. Awesome that we have a frontier-level hybrid model. The context gets longer but the inference speed and memory use barely change. Here’s the Q4 generating a space invaders game on an M3 Ultra. Generated 4,120 tokens at 37.6 tok/s. https://x.com/awnihannun/status/2023462412092059679
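For anyone wanting to reproduce that kind of local run, here is a minimal mlx-lm sketch following the library’s standard load/generate pattern. The model repo id is a placeholder assumption; substitute whichever 4-bit MLX conversion actually exists.

```python
# pip install mlx-lm  (Apple Silicon only)
from mlx_lm import load, generate

# Placeholder repo id (assumption): use the real 4-bit MLX conversion.
model, tokenizer = load("mlx-community/Qwen3.5-397B-A17B-4bit")

messages = [{"role": "user", "content": "Write a space invaders game in pygame."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# With verbose=True, mlx-lm streams tokens and reports tok/s and peak memory.
response = generate(model, tokenizer, prompt=prompt, max_tokens=4096, verbose=True)
```

Note the RAM requirement quoted above (~225 GB for the Q4 weights), so this is realistically a 256 GB-plus unified-memory machine.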
So speaking of benchmarks, what can be said of the new open Qwen? First, it completely destroys Qwen3-VL-235B ofc, but more surprisingly it outscores Qwen3-Max-thinking. All the while it’s the same model as “Plus”. Plus just has 1M context and some more bells and whistles. https://x.com/teortaxesTex/status/2023331885402009779
The new chonky Qwen 3.5 looks pretty solid, beating their own Qwen3-Max model everywhere and much better at vision benchmarks than Qwen3-235B-A22B-VL. Now what I sadly haven’t seen is anything on reasoning efficiency. https://x.com/scaling01/status/2023343368399704506
Kimi K2‑0905 and Qwen3‑Max preview: two 1T open weights models launched | AINews https://news.smol.ai/issues/25-09-05-1t-models