Image created with gemini-3.1-flash-image-preview, prompt written with claude-sonnet-4-5. Image prompt: Wide static shot of a muted Chinese electronics factory interior, workers assembling circuit boards at cluttered workbenches, natural overcast light through grimy windows, a chestnut horse standing naturally among the assembly line workers, desaturated concrete and metal surfaces, documentary realism, bold white text overlay reading TECH in poster style, Jia Zhangke observational composition, flat industrial lighting, human-scale intimacy.

This is a new, separate estimate of LLM time-horizon doubling times, and it mostly agrees with METR; in this case, ~4.8-5.7 months. https://x.com/scaling01/status/2023350946139435357

Spotify’s Top Developers Haven’t Written Code Since December, CEO Says – Business Insider https://www.businessinsider.com/spotify-developers-not-writing-code-ai-2026-2

141 days for Sonnet to go from 13.6% to 60.4% on ARC-AGI-2. https://x.com/scaling01/status/2023850250662969587

Sonnet 4.6 benchmarks: 79.6% SWE-Bench Verified, 58.3% ARC-AGI-2. https://x.com/scaling01/status/2023818940112327101

We estimate that Claude Opus 4.6 has a 50%-time-horizon of around 14.5 hours (95% CI of 6 hrs to 98 hrs) on software tasks. While this is the highest point estimate we’ve reported, this measurement is extremely noisy because our current task suite is nearly saturated. https://x.com/METR_Evals/status/2024923422867030027

Announcing Spreadsheet Arena | Meridian https://www.meridian.ai/blog/all/spreadsheet-arena

Excited to launch Gemini 3.1 Pro! Major improvements across the board including in core reasoning and problem solving. For example scoring 77.1% on the ARC-AGI-2 benchmark – more than 2x the performance of 3 Pro. Rolling out today in @GeminiApp, @antigravity and more – enjoy! https://x.com/demishassabis/status/2024519780976177645

Gemini 3.1 Pro benchmarks: 77.1% ARC-AGI-2, 80.6% SWE-Bench Verified. https://x.com/scaling01/status/2024514798470181370

Gemini 3.1 Pro is here. Hitting 77.1% on ARC-AGI-2, it’s a step forward in core reasoning (more than 2x 3 Pro). With a more capable baseline, it’s great for super complex tasks like visualizing difficult concepts, synthesizing data into a single view, or bringing creative… https://x.com/sundarpichai/status/2024516418855981298

Gemini 3.1 Pro landed today. This is based on the same model behind the agentic DeepThink released last week; it is now available to all Gemini users on many apps. This is a really good model especially in reasoning and multimodal understanding/generation. Try it out. https://x.com/mirrokni/status/2024525808501477568

Gemini 3.1 Pro: Announcing our latest Gemini AI model https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/

Holy sh*t, that’s what I call an improvement! Gemini 3.1 Pro is insane: – ARC-AGI-2 77% – SWE-bench Verified 80% – HLE 44%/51% https://x.com/kimmonismus/status/2024521970184868000

To the Scientist, the Engineer, and the Developer: Gemini 3.1 Pro has arrived in @GeminiApp. It’s a significant leap in complex reasoning (77.1% on ARC-AGI-2) so it’s great at agentic tasks, intricate coding, and data synthesis projects. You should see fewer errors, better… https://x.com/joshwoodward/status/2024515741819842623

Today, we’re continuing to push the boundaries of AI with our release of Gemini 3.1 Pro. This updated model scores 77.1% on ARC-AGI-2, more than double the reasoning performance of its predecessor, Gemini 3 Pro. Check out the visible improvement in this side-by-side comparison… https://x.com/JeffDean/status/2024525132266688757

Gemini 3.1 Pro is here! It’s top 3 across Text and Vision Arena, and #6 in Code Arena, tied closely with Claude Opus 4.5. Highlights: ▪️Tied #1 in Text (scoring 1500), 4 pts from Opus 4.6 ▪️Top 3 in Arena Expert Leaderboard (scoring 1538), just behind Opus 4.6 ▪️#6 in Code… https://x.com/arena/status/2024519891295089063

Gemini 3.1 Pro WebDev Arena results: – 6th place behind Opus 4.5/4.6 and GPT-5.2-high. https://x.com/scaling01/status/2024522048312054142

Multimodal function calling is now available in the Gemini Interactions API; build agents that can see and process images natively. 🖼️ Tools return actual images, not text descriptions 👁️ Gemini 3 natively processes returned images 🛠️ Function results support mixed text and… https://x.com/_philschmid/status/2022349886318928158

Update regarding Gemini 3.1 Pro: – Ranked #1 among all Gemini models released to date. – Ranked #1 among all models I have tested so far (GPT-5.2 high 165.9 vs Gemini 3.1 Pro 166.6). However, please note that my testing has limitations due to budget constraints: I have not… https://x.com/Hangsiin/status/2024605310913216614

Introducing Lyria 3, our latest and most advanced music model, available in the Gemini App starting today : ) Go from idea, image, or video to music in seconds! https://x.com/OfficialLoganK/status/2024153948488118513

Meet Lyria 3, our latest music generation model from @GoogleDeepMind. 🎶 Now, you can create custom music tracks in the @GeminiApp — just by describing an idea or uploading an image or video. https://x.com/Google/status/2024154379838705920

We just launched Lyria 3! Our most advanced AI music model in the @GeminiApp 🎵 – Generates 30-second tracks from text or image prompts. – Supports custom lyrics, vocals, and cover art. – Supports 8 languages including English, Japanese, and Korean. – All outputs watermarked with… https://x.com/_philschmid/status/2024154542061805988

Use Lyria 3 to create music tracks in the Gemini app https://blog.google/innovation-and-ai/products/gemini-app/lyria-3/

Introducing EVMbench | OpenAI https://openai.com/index/introducing-evmbench/

Introducing EVMbench – a new benchmark that measures how well AI agents can detect, exploit, and patch high-severity smart contract vulnerabilities. https://x.com/OpenAI/status/2024193883748651102

How efficient is MiniMax M2.5? We benchmarked on 8xH200 TEP8 with @vllm_project. At a reasonable 10-25s TTFT, M2.5 is able to sustain ~2500 tok/s/GPU throughput. For decode, it’s still possible to reach ~20 tok/s/GPU throughput at a strict 20 tok/s/user interactivity with 10K+… https://x.com/SemiAnalysis_/status/2023418414203646066

MLX MiniMax 2.5 running LOCALLY on a single M3 Ultra 512GB! Writing a poem on LLMs at 6bit quantization! 🔥 Let’s start some coding, context and distributed tests! Generation: 40.2 tokens-per-sec Peak memory: 186 GB https://x.com/ivanfioravanti/status/2022338870172684655

Alibaba Yunqi: 7 models released in 4 days (Qwen3-Max, Qwen3-Omni, Qwen3-VL) and $52B roadmap | AINews https://news.smol.ai/issues/25-09-23-alibaba-yunqi

Alibaba’s new Qwen3.5-397B-A17B is the #3 open weights model in the Artificial Analysis Intelligence Index – a significant upgrade from Qwen3-235B-A22B-2507, and achieved with fewer active parameters than leading peers. Qwen3.5-397B-A17B is the first model released by Alibaba… https://x.com/ArtificialAnlys/status/2023794497055060262

Qwen https://qwen.ai/blog?id=qwen3.5#spatial-intelligence

Qwen3.5’s thinking is downright excessive. https://x.com/QuixiAI/status/2023995215690781143

.@mattshumer_’s “Something Big is Happening” article now has 83 million views. Clearly, it hit a nerve. I also want to argue with it, even if that puts me on the unpopular side of the timeline. Because his piece gave me real, unproductive anxiety. https://x.com/TheTuringPost/status/2023743799042666989

The Future of Design Is Code and Canvas | Figma Blog https://www.figma.com/blog/the-future-of-design-is-code-and-canvas/

Announcing AA-WER v2.0 Speech to Text accuracy benchmark, and AA-AgentTalk, a new proprietary dataset focused on speech directed at voice agents. AA-AgentTalk focuses on the speech that matters most to voice agents. As a held-out, proprietary dataset, AA-AgentTalk also mitigates… https://x.com/ArtificialAnlys/status/2024157398139883729

Small update to the leaderboard at https://t.co/AU0F7BjYEh: it’s now all results from running with mini-SWE-agent v2, an upgrade over v1 that gets more juice out of the base models. https://x.com/OfirPress/status/2024177059895877802

We just updated the official SWE-bench leaderboard comparing all models with the exact same scaffold (mini-SWE-agent v2). Detailed cost analysis & links to browsable trajectories in 🧵 https://x.com/KLieret/status/2024176335782826336

On evaluating multi-step scientific tool use in LLM agents. SciAgentGym provides an interactive environment with 1,780 specialized tools across 4 scientific disciplines. The core finding: even advanced models like GPT-5 see success rates drop sharply from 60.6% to 30.9% as… https://x.com/dair_ai/status/2023404773031166320

[2602.16301] Multi-agent cooperation through in-context co-player inference https://arxiv.org/abs/2602.16301

The crazy part is that the AI Labs have generally been right. Like, the stuff they hyped in 2023 turned out to be real and working today. That doesn’t mean that the stuff they are predicting for 2028 will also be real, but it is probably worth noting those predictions & watching. https://x.com/emollick/status/2023257496069046563

📊Let’s dive deeper into @AnthropicAI’s Sonnet 4.6 vs 4.5. Overall: Sonnet 4.6 ranks 3 places higher (#13 vs #16) Where Sonnet 4.6 gains: Code: ▪️WebDev (+19 for Sonnet 4.6: #3 vs #22) Text: ▪️Instruction Following (+6, #5 vs #11) ▪️English (+5, #9 vs #14) ▪️Hard Prompts (+5,… https://x.com/arena/status/2024892330743124246

Claude Sonnet 4.6 (medium) scores 66.1% on WeirdML, matching Opus 4.6 (no thinking) and a big advance from Sonnet 4.5 at 47.7%. I had to run it on medium reasoning level because the default (high) constantly hit the 64k max tokens limit. Even at medium it uses as many output… https://x.com/htihle/status/2024764946051907659

Claude Sonnet 4.6 takes second place in the Artificial Analysis Intelligence Index (behind Opus 4.6), but used ~3x more output tokens than Claude Sonnet 4.5 in its max effort mode. Sonnet 4.6 leads all models in GDPval-AA and TerminalBench, including a slight lead over Opus 4.6. https://x.com/ArtificialAnlys/status/2024259812176121952

When I joined METR I was really skeptical that we were evaling models using simple OS scaffolds rather than Claude Code / Codex / etc. I really appreciate Nikola looking into this and I’m surprised it still doesn’t seem to make much difference for CC on Opus 4.5. https://x.com/ajeya_cotra/status/2022419978495127828

GLM-5 scores 48.2% on WeirdML, beating Claude Sonnet 4.5 and tying gpt-oss-120b (high) for the best open model. This is a clear advance but still far from Opus-4.6 at 78% and gpt-5.2 at 72%. https://x.com/htihle/status/2023734346943775179

OpenAI and Anthropic are much further ahead than what benchmarks show. While you are token constrained they are blasting millions of tokens at 4x the API speed without batting an eye and they scaffold like they are trying to build a skyscraper. https://x.com/scaling01/status/2023837889478758495

I looked into how Claude Code and Codex compare to the default scaffolds METR uses for time horizon measurements. It looks like they don’t significantly outperform our default scaffolds on any models we’ve tried them on so far. https://x.com/nikolaj2030/status/2022398669337825737

A paper worth paying close attention to. It presents Lossless Context Management (LCM), which reframes how agents handle long contexts. It outperforms Claude Code on long-context tasks. Recursive Language Models give the model full autonomy to write its own memory scripts. LCM… https://x.com/dair_ai/status/2023765147970662761

We’re officially opening our Bengaluru office – our new home base in India, and Anthropic’s second office in Asia-Pacific. India is our second-largest market for https://t.co/RxKnLNNcNR. We’re launching new partnerships to deepen our long-term commitment:… https://x.com/AnthropicAI/status/2023322514206957688

WeirdML Time Horizons! Inspired by @METR_Evals I found time-horizons for the WeirdML tasks, using LLM-estimated human completion times. We find horizons of ~24 min (GPT-4) to ~38 hours (Opus 4.6), doubling time ~5 months. Links to blog post, git-repo + nice figures in thread. https://x.com/htihle/status/2023349189271572975
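As a back-of-envelope check of that doubling time, here is the arithmetic in Python. The ~35-month gap between GPT-4 and Opus 4.6 (roughly March 2023 to February 2026) is my assumption, not a number from the thread.

```python
import math

# Horizons from the tweet: ~24 min (GPT-4) to ~38 hours (Opus 4.6).
h0_min, h1_min = 24, 38 * 60            # horizons in minutes
months = 35                              # assumed gap between the two models

doublings = math.log2(h1_min / h0_min)   # ~6.6 doublings of the horizon
print(f"doubling time ≈ {months / doublings:.1f} months")  # ≈ 5.3, consistent with ~5
```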

Exclusive: Peter Thiel-backed industrial AI startup emerges from stealth with funding from a16z | Fortune https://fortune.com/2026/02/09/exclusive-peter-thiel-alexis-ohanian-new-ai-industrial-startup-emanate-kiara-nirghin/

GDPval remains one of the best benchmarks for doing complex real world agentic tasks. But worth noting that GDPval-AA is not the same thing. It only uses the public problem set, and all evaluation is done by Gemini, not by humans/specialized graders like in the real GDPval. https://x.com/emollick/status/2023854803328311722

I think people are overinterpreting these time horizon evals. They are very impressive! But when error rates are near zero, and tasks require many successful steps in order to complete, small absolute improvements in error rate have a multiplicative effect. Consider a task… https://x.com/xlr8harder/status/2024946945232445710
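A quick numeric sketch of that multiplicative point: with per-step success rate p and n sequential steps, task success is p**n. The rates below are illustrative, not numbers from the tweet.

```python
# Small absolute gains in per-step reliability compound over long tasks.
for p in (0.99, 0.995, 0.999):
    for n in (100, 500, 1000):
        print(f"p={p}, n={n}: task success = {p**n:.1%}")

# e.g. at n=500 steps, cutting per-step error from 1% to 0.5% lifts
# task success from ~0.7% to ~8%; at 0.1% error it reaches ~61%.
```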

Last year, we noticed a gap between scores in our SWE-bench Verified runs and scores reported elsewhere. We’ve now updated our evaluation methodology. For most models, we’re seeing scores close to those reported by the original model developers. https://x.com/EpochAIResearch/status/2024924403142910137

Seems like a lot of people are taking this as gospel – when we say the measurement is extremely noisy, we really mean it. Concretely, if the task distribution we’re using here was just a tiny bit different, we could’ve measured a time horizon of 8 hours, or 20 hours. https://x.com/idavidrein/status/2024938968434049117

Very curious that the extra reasoning that everyone observed during testing isn’t showing up on AA-index. https://x.com/scaling01/status/2024519669680320659

We went from AI systems that struggled to do grade school math to AI systems that can solve research-level math problems in just a few years. I agree with Jakub this is perhaps the most important eval now. I am also pretty sure the main reaction will be “it’s not that hard” :) https://x.com/sama/status/2022729068949717182

The model is a step forward in reasoning, designed for workflows where a simple answer isn’t enough. On ARC-AGI-2 – which tests for novel logic patterns – it more than doubles 3 Pro’s score. This means it can help you visualize complex topics, organize scattered data, and bring… https://x.com/GoogleDeepMind/status/2024516467618656357

Earlier today I wanted to doom about Gemini 3.1 Pro completely failing ARC-AGI-3. Turns out this was due to a bug in the config introduced by GPT-5.3. It was still calling Gemini 3.0 Pro instead of 3.1. I fixed it, made the harness simpler, and spent $120. Performance of Gemini… https://x.com/scaling01/status/2024642220096442772

Gemini 3.1 is the faster horse. It’s like a horse with rocket fuel. Truly insane. Everyone else makes cars now. https://x.com/theo/status/2024808734053347608

Gemini 3.1 Pro on ARC-AGI Semi-Private Eval @GoogleDeepMind – ARC-AGI-1: 98%, $0.52/task – ARC-AGI-2: 77%, $0.96/task Gemini to push the Pareto Frontier of performance and efficiency. https://x.com/arcprize/status/2024522812728496470

Gemini 3.1 Pro Preview scored highest in the Artificial Analysis Intelligence Index, but its most significant advantage might be its price and token efficiency. Our evaluations cost <50% to run on Gemini 3.1 Pro Preview compared to Claude Opus 4.6 (max) and GPT-5.2 (xhigh). Gemini… https://x.com/ArtificialAnlys/status/2024677979390169536

Gemini Pro 3.1 (& other frontier models) are still terrible at Connect 4. Yet smashing ARC-AGI-2. That is weird, right? ARC was built to be resistant to overfitting. I guess the fully generalised world of ARC-AGI puzzles is still a very narrow slice of spatial reasoning. https://x.com/paul_cal/status/2024748708223402120

I gave Gemini 3.1 Pro an ARC-AGI-2 challenge WITH solution and it bombed it … SVGs might have been successfully sloptimized. GPT-5.2 Thinking realizes after 14s of thinking that I gave it the solution in the input and just repeats it; Gemini 3.1 Pro thought for 8 minutes. https://x.com/scaling01/status/2024268831321993590

Loving Gemini 3.1 Pro! It made 3 huge improvements to my compiler and saw things that even ChatGPT 5.2 Pro Extended and Claude Opus 4.6 Extended couldn’t see. https://x.com/QuixiAI/status/2024545096532733967

Oh, and ARC-AGI-3 is crazy expensive to run. https://x.com/scaling01/status/2024650634746610041

By the way, the recent Gemini 3.1 Pro is also a really good model for RLMs. Claude Opus 4.6 is the worst of the ones I tested. Probably not optimized for the type of decomposition that RLMs need. I am just impressed by GPT-5.2-Codex. The strategies it uses are brilliant. https://x.com/omarsar0/status/2024973182436831629

Claude Sonnet 5: The “Fennec” Leaks – Fennec Codename: Leaked internal codename for Claude Sonnet 5, reportedly one full generation ahead of Gemini’s “Snow Bunny.” – Imminent Release: A Vertex AI error log lists claude-sonnet-5@20260203, pointing to a February 3, 2026 release. https://x.com/pankajkumar_dev/status/2018187650927349976?s=46

Gemini 3.1 Pro will be a massive step-up! There’s a decent chance it’s on par with Opus 4.6 and GPT-5.3. The main reason for that: similarly to Claude 4.6 and GPT-5.2/5.3, it thinks much longer than Gemini 3 Pro. The same request on aistudio, tested multiple times, had 6… https://x.com/scaling01/status/2024251668771066362

Google is once again the leader in AI: Gemini 3.1 Pro Preview leads the Artificial Analysis Intelligence Index, 4 points ahead of Claude Opus 4.6 while costing less than half as much to run. @GoogleDeepMind gave us pre-release access to Gemini 3.1 Pro Preview. It leads 6 of the… https://x.com/ArtificialAnlys/status/2024518545510662602

In Arena Expert, with expert level prompts, Gemini 3.1 Pro Preview lands in the top 3 (scoring 1538), just behind Claude Opus 4.6. https://x.com/arena/status/2024519895623598423

Sonnet 4.6 crushes Gemini 3 and GPT-5.2 on Vending-Bench 2. https://x.com/scaling01/status/2023833660546499053

Claude Sonnet 4.6 has landed #3 in Code and #13 in Text Arena! Highlights: ▪️+130 pts jump in Code Arena (#22 -> #3) compared to Sonnet 4.5, surpassing top-tier thinking models like Gemini-3.1 and GPT-5.2 ▪️Strong gains in Text categories: Math (#4) and Instruction Following… https://x.com/arena/status/2024883614249615394

📊 Let’s dive deeper into Gemini 3.1 Pro gains. It ranks 13 points above Gemini 3 Pro overall. We see the largest rank gains for @GoogleDeepMind’s latest model in the following categories: Text: ▪️Coding (+5) ▪️Math (+4) ▪️Expert (+3) ▪️Instruction Following (+3) ▪️Multi-Turn… https://x.com/arena/status/2024588456463389040

Check out the skills for the Gemini API! More soon! https://x.com/osanseviero/status/2022259577232785866

Context Arena Update: Added @Google’s Gemini 3.1 Pro Preview to the MRCR leaderboards (2-, 4-, 8-needle)! Meant to send this out earlier today. Thanks to @GoogleDeepMind and others over there for early access! Thinking budget barely matters on simpler retrieval – 2-needle AUC… https://x.com/DillonUzar/status/2024655613293215855

Gemini 3.1 Pro has landed! Amazing performance / capabilities across the board. Beyond SOTA, the best are all the things that evals can’t measure. E.g. SVG has gotten so much better (see 🧵) https://x.com/OriolVinyalsML/status/2024519605570720185

Gemini 3.1 Pro in 1st place on the Artificial Analysis Leaderboard. https://x.com/scaling01/status/2024517196727099847

Gemini 3.1 Pro is rolling out now in the @GeminiApp, and exclusively to Google AI Pro and Ultra users in @NotebookLM. Developers can access it in preview via the API in @GoogleAIStudio. Find out more → https://x.com/GoogleDeepMind/status/2024516471720743295

Gemini 3.1 Pro’s GDPval scores are concerning. https://x.com/scaling01/status/2024515061163704336

Gemini Deep Think 3 is the world’s most capable model by many measures, huge amounts of progress on reasoning benchmarks and more. Available right now via the Gemini App for Ultra subscribers and in the API soon : ) https://x.com/OfficialLoganK/status/2021996626144080015

Good news: Google AI Studio and the Gemini API are now live in Moldova, Andorra, San Marino, and Vatican City! 🌍 https://x.com/OfficialLoganK/status/2022688445957820610

Google is back on the intelligence-cost frontier with Gemini 3.1 Pro. https://x.com/scaling01/status/2024519007018373202

Google tests NotebookLM integration for Opal workflows https://www.testingcatalog.com/google-test-notebooklm-integration-for-opal-workflows/

I would expect only a few models to make progress with this rather simple harness: GPT-5.2-xhigh, Opus 4.5, Opus 4.6, and Gemini 3.1 Pro; other models will have a very hard time. https://x.com/scaling01/status/2024661145286557872

Last week we upgraded Gemini 3 Deep Think. Today, we’re shipping the core intelligence that makes those breakthroughs possible: Gemini 3.1 Pro. A noticeably smarter, more capable baseline for your hardest challenges. Available now: https://x.com/NoamShazeer/status/2024519946764734574

Multimodal Function Calling with Gemini 3 and Interactions API https://www.philschmid.de/interactions-multimodal-fc

My vibe is unchanged: Gemini 3.1 is a previous gen model. It naively lives in a context-universe engineered by the God-User. Opus is a friend-type AI. It sits with you in a KFC. 5.2 sees a vast expanse of thought. Below there’s a given context. A user makes some noise, perhaps. https://x.com/teortaxesTex/status/2024574416747671556

Saw Gemini 3.1 announcement, got super excited. Tried Google Antigravity… not available. Tried Gemini CLI… not available. Tried Gemini Code Assist… not available. @OfficialLoganK put AI Studio in an Electron Shell and just launch it. You will deliver these faster. https://x.com/matvelloso/status/2024548414198091922

Today we’re releasing a preview of Gemini 3.1 Pro and making it available to our users and developers. Very excited to bring the upgraded core we used in Deep Think to everyone. Learn more about Gemini 3.1 Pro: https://x.com/koraykv/status/2024517699595124902

We just made paying for the Gemini API 10x easier : ) You can now upgrade to a paid Gemini API account without leaving AI Studio, track your usage, filter spend by model, and much more to come! https://x.com/OfficialLoganK/status/2022409335465480346

We made a skill for the Gemini API! https://x.com/OfficialLoganK/status/2022123808296251451

Here are some useful prompting tips to get the most out of our new music generation model in Gemini, Lyria 3 ↓ https://x.com/GeminiApp/status/2024167107538407783

Introducing Lyria 3, our new music generation model in Gemini that lets you turn any idea, photo, or video into a high-fidelity track with custom lyrics. From funny jingles to lo-fi beats, you can create custom 30-second soundtracks for any moment. See how it works. 🧵 https://x.com/GeminiApp/status/2024152863967240529

Impressive benchmarks for the new Chinese LLM. The system card notes some gaps with US closed source models in code generation & wide knowledge, so I’d be interested to see it in operation. Not clear it is open weights though? If not, it represents a large shift in the AI market. https://x.com/emollick/status/2022658647378268361

Blog about @MiniMax_AI’s Forge RL system. Core takeaways: 1. still CISPO 2. process reward, completion time reward 3. multi-level prefix cache 4. rollout uses 60% compute 5. millions of trajectories per day https://t.co/IrKDOoiKAB cc @teortaxesTex https://x.com/YouJiacheng/status/2022339475049947576

The dark side of reinforcement learning: @olive_jy_song, senior researcher at @MiniMax_AI, on RL models that try to hack rewards and why alignment fails in practice. This conversation is an inside look at how Chinese AI labs move fast – testing new models overnight, debugging… https://x.com/TheTuringPost/status/2022961676799398337

🤔Has MiniMax finally stabilized its path in reasoning and coding? Still a hot review from Zhihu contributor toyama nao, and he calls it: “Root downward, grow upward.” 🔥 After the flawed M2.1 (stronger coding, weaker logic), M2.5 fixes the technical issues and restores balance… https://x.com/ZhihuFrontier/status/2022214461415993817

$1 per hour with 100 tps. https://x.com/MiniMax_AI/status/2022379949336957254

It’s been a few days since onboarding @MiniMax_AI’s latest model, M2.5, in standard and Lightning variants. Results are showing on our leaderboard. With over 3K votes, M2.5 Lightning ranks eighth among open models, with Standard following closely behind! Let’s run some prompts:… https://x.com/yupp_ai/status/2024165671136059892

MiniMax M2.5 casually responding at ~50 tok/s with MLX (M3 Ultra). The model was released one hour ago 🥳”” https://x.com/pcuenq/status/2022336556326060341

A nice independent look at SWE-bench Verified by @simonw. MiniMax M2.5 showing strong results under the same evaluation setup. Worth a read. https://x.com/MiniMax_AI/status/2024646767325958285

People were saying as early as Oct 2024 that SWE-bench was saturated, when scores were just ~50%. An awesome chat from the MiniMax team shows otherwise. We’re certainly much, much closer, but there’s evidence that some room remains. Tiny 🧵 https://x.com/jyangballin/status/2022367240293949772

RL often throws away useful signal at intermediate steps, or, as @karpathy put it, it’s like “sucking supervision through a straw.” MiniMax M2.5 solves this with per-token process rewards. The result is frontier coding performance at a tenth (or less) of the cost of closed source. https://x.com/basetenco/status/2022456010049495213

RL shouldn’t waste signal. M2.5’s per-token process rewards improve signal utilization across reasoning steps, delivering frontier coding performance with dramatically better cost efficiency. Thanks @basetenco for the deep dive and day-0 hosting! https://x.com/MiniMax_AI/status/2023470874708549941

Qwen3.5-397B-A17B SVG results: I have seen better. DeepSeek-V3.2 and GLM-5 both beat it. https://x.com/scaling01/status/2023364296277721300

🚀 Qwen3.5-397B-A17B-FP8 weights are now open! It took some time to adapt the inference frameworks, but here we are: ✅ SGLang support is merged 🔄 vLLM PR submitted → https://t.co/rJkuitOBWs Check the model card for example code. vLLM support landing in the next couple of days! https://x.com/Alibaba_Qwen/status/2024161147537232110

🚩Cerebras’s MiniMax-M2 GGUF 2-bit model: https://t.co/udlviJQZqQ Qwen3-Coder-Next INT4 model:… https://x.com/HaihaoShen/status/2022293472796180676

A clarification of Qwen3.5 Plus and 397B: 1. For open source, we follow the tradition of making parameters apparent, so we use the name with the number of total parameters and active params. 2. Qwen3-Plus is a hosted API version of 397B. As the model natively supports 256K tokens… https://x.com/JustinLin610/status/2023340126479569140

It’s Qwen 3.5 day today! 🥳 State of the art 800 GB model. Runs _locally_ with MLX using Q4, taking 225 GB of RAM. https://x.com/pcuenq/status/2023369902011121869

Let’s do the KV cache math for Qwen3.5: – KV heads: 2 – Head dimension: 256 – Gated attention layers: 15 – Bytes per element (BF16): 2. So 2 × 256 × 15 × 2 = 15,360. This is the same for K and V, so we multiply by 2: 30,720 bytes. Roughly 31 kB per token of context. Meaning at max… https://x.com/bnjmn_marie/status/2023424404504342608
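The same arithmetic as a runnable check, using the numbers from the tweet; the 256K context figure comes from the Qwen3.5 clarification a couple of entries above.

```python
# KV-cache size per token for Qwen3.5, per the tweet's numbers:
# 2 KV heads, head dim 256, 15 gated-attention layers, BF16 = 2 bytes.
kv_heads, head_dim, layers, bytes_per_elem = 2, 256, 15, 2

per_token = kv_heads * head_dim * layers * bytes_per_elem * 2  # x2 for K and V
print(per_token)             # 30720 bytes ≈ 31 kB per token
print(per_token * 262_144)   # ≈ 8 GB of KV cache at the full 256K context
```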

ollama run qwen3.5:cloud Qwen3.5-397B-A17B is the first open-weight model in the series. It’s available on Ollama’s cloud right now! Give it a try. Let’s go! 🚀🚀🚀 https://x.com/ollama/status/2023334181804069099

Qwen 3.5 Plus is now available on AI Gateway. Thanks @vercel_dev team. 🤝 Use model: ‘alibaba/qwen3.5-plus’ Try it now! https://x.com/Alibaba_Qwen/status/2024029499541909920

Qwen3.5 runs quite well in mlx-lm. Awesome that we have a frontier-level hybrid model. The context gets longer but the inference speed and memory use barely change. Here’s the Q4 generating a space invaders game on an M3 Ultra. Generated 4,120 tokens at 37.6 tok/s. https://x.com/awnihannun/status/2023462412092059679

So speaking of benchmarks, what can be said of the new open Qwen? First, it completely destroys Qwen3-VL-235B ofc, but more surprisingly it outscores Qwen3-Max-thinking. All the while it’s the same model as “Plus”. Plus just has 1M context and some more bells and whistles. https://x.com/teortaxesTex/status/2023331885402009779

The new chonky Qwen 3.5 looks pretty solid, beating their own Qwen3-Max model everywhere and is much better at vision benchmarks than Qwen3-235B-A22B-VL. Now what I sadly haven’t seen is anything on reasoning efficiency. https://x.com/scaling01/status/2023343368399704506

Kimi K2‑0905 and Qwen3‑Max preview: two 1T open weights models launched | AINews https://news.smol.ai/issues/25-09-05-1t-models

(1/7) We’re releasing ThunderKittens 2.0! Faster kernels, cleaner code, industry contributions, and new state-of-the-art BF16 / MXFP8 / NVFP4 GEMMs that match or surpass cuBLAS! Alongside this release, we’re equally excited to share some insights we learned while squeezing every… https://x.com/stuart_sul/status/2024897621874422125

[2602.11865] Intelligent AI Delegation https://arxiv.org/abs/2602.11865

[2602.12036] Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models https://arxiv.org/abs/2602.12036

[2602.13949] Experiential Reinforcement Learning https://arxiv.org/abs/2602.13949

[2602.15322] On Surprising Effectiveness of Masking Updates in Adaptive Optimizers https://arxiv.org/abs/2602.15322

🤯 With 11B active parameters (196B MoE), Step 3.5 Flash is going toe-to-toe with the best closed models. The efficiency curve is getting absurd. https://x.com/fdaudens/status/2021949479771861100

1/ We’ve released a report on our work on multilingual data curation @datologyai. tl;dr: We shift the performance-compute Pareto frontier for multilingual models. Entirely by improving data quality and composition. arxiv: https://t.co/bLv8IySa8G blog:… https://x.com/agcrnz/status/2024207781524623690

10 must-read books and surveys about AI and Machine Learning ▪️ Machine Learning Systems by Vijay Janapa Reddi ▪️ Understanding Deep Learning by Simon J.D. Prince ▪️ Interpretable Machine Learning by Christoph Molnar ▪️ Foundations of LLMs ▪️ A Survey on Post-training of LLMs… https://x.com/TheTuringPost/status/2023058041864888324

13 foundational types of AI models ▪️ LLM ▪️ SLM ▪️ VLM ▪️ MLLM ▪️ VLA ▪️ LAM ▪️ RLM ▪️ MoE ▪️ SSM ▪️ RNN ▪️ CNN ▪️ SAM ▪️ LNN Save the list and check this out for explanations and useful resource links: https://x.com/TheTuringPost/status/2022599637623038442

24 dedicated people. $30M spent on development. Extreme specialization, speed, and power efficiency. Today we launch Taalas’ first product. Check it out: Details: https://t.co/88CA0XAL71 Demo chatbot: https://t.co/ec4ladcKnw API:… https://x.com/taalas_inc/status/2024516399251456150

5 years later @github finally implements my request. 🎉 Pull Requests can be disabled! 🎊 https://x.com/joshmanders/status/2022170444116414790

A small PSA: if you’re using vLLM, you might find SGLang is faster on H100s and B200s. A little rabbit hole, plus some help from the vLLM folks, and we figured out it’s because vLLM would choose DeepGEMM on some models, which isn’t the best (Triton is). Set VLLM_USE_DEEP_GEMM=0! https://x.com/TheZachMueller/status/2024619480580510117
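A minimal sketch of applying that flag. I am assuming setting it in-process before importing vLLM is sufficient (exporting VLLM_USE_DEEP_GEMM=0 in the shell before launch is the safer route), and the model id below is just a placeholder.

```python
import os

# Disable DeepGEMM per the PSA above; must happen before vLLM initializes.
os.environ["VLLM_USE_DEEP_GEMM"] = "0"

from vllm import LLM  # import only after the env var is set

llm = LLM(model="MiniMaxAI/MiniMax-M2.5")  # hypothetical model id, for illustration
print(llm.generate("Hello")[0].outputs[0].text)
```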

Additionally, code execution, web fetch, memory, programmatic tool calling, tool search, and tool use examples are now generally available. Read more:… https://x.com/alexalbert__/status/2023834875678298535?s=46

Announcing 🌇HumanLM, a RL framework that trains LLMs to simulate human users’ responses, along with 🌆Humanual, a comprehensive user simulation benchmark https://t.co/5TZ9WOOFB8 🌄 One thing that’s fascinating about our society: human users shape the world and determine the… https://x.com/ShirleyYXWu/status/2022374624676421676

As I’ve mentioned before, since my testing methodology has nearly reached a saturation point, it may not reflect the actual user experience as closely as it used to. Also, since this is fundamentally not a coding or scientific benchmark, the results may not directly correlate… https://x.com/Hangsiin/status/2024605313744458043

As usual, we’re releasing everything under Apache 2.0: all models (including intermediate checkpoints and prompt exploration) and PyLate training scripts 📦 Models: https://t.co/oiGjYu08dT 💻 Code: https://t.co/b9LBCRxSh5 📄 Paper:… https://x.com/antoine_chaffin/status/2024516823685730690

Crusoe Managed Inference: Low latency and breakthrough speed https://www.crusoe.ai/cloud/managed-inference

Day Zero for Multi-Vector Retrieval. Today we’re flipping the retrieval playbook: no dense model adaptation, no retrofit. 🏗️Multi-vector from scratch, powered by PyLate. Meet ColBERT-Zero. In collaboration with @EPFL and the Swiss AI initiative, @LightOnIO pre-trained it… https://x.com/LightOnIO/status/2024517870785282545

Even if you are literally training a model to do formal verification, RLAIF (in the form of “rubrics as rewards”, https://t.co/waawRPS1Lw) seems to beat RLVR. Put formal verifiers: 🟢 in your agent loop/harness 🟢 as part of the training loop ❌ in control of the reward signal. https://x.com/davidad/status/2022361016995319850

FastMCP 3.0 is out! I don’t even know how to fit it in a tweet… 🔌 Build servers from directories, APIs, remote servers — anything 🎭 Per-session context & progressive disclosure 🖥️ Full CLI: list, call, generate clients 🔐 Versioning, granular auth, OTEL ⚡️ DX galore! https://x.com/jlowin/status/2024242656377700618

A good article on how the SWE-fficiency ranking is broken. https://x.com/scaling01/status/2024171017929638061

How persistent is the inference cost burden? – by JS Denain https://epochai.substack.com/p/how-persistent-is-the-inference-cost

I think it must be a very interesting time to be in programming languages and formal methods because LLMs change the whole constraints landscape of software completely. Hints of this can already be seen, e.g. in the rising momentum behind porting C to Rust or the growing interest… https://x.com/karpathy/status/2023476423055601903

Inference compute scarcity seems plausible. Hyper growth in demand with limited hardware or data center energy supply means token shortage. This could be a reason to run more AI locally. https://x.com/awnihannun/status/2024664226837778490

It is cool that 5.2 “with a scaffold” can think for 12 hours *productively*. At pleb speeds that’s 1.8M tokens, and at non-pleb speeds I don’t know what the point of giving a wall clock time is. In any case that’s well into the zone of diminishing returns on normal code tasks. https://x.com/teortaxesTex/status/2022401945429000685

jina-embeddings-v5-text is here! Our fifth generation of jina embeddings, pushing the quality-efficiency frontier for sub-1B multilingual embeddings. Two versions: small & nano, available today on Elastic Inference Service, vLLM, GGUF and MLX. https://x.com/JinaAI_/status/2024505342277964129

Jitendra Malik rants about parallel-jaw grippers being inadequate; he believes multi-fingered hands with tactile sensing are necessary for advanced dexterous manipulation. Malik is a Professor at UC Berkeley and a Distinguished Scientist at Amazon. https://x.com/TheHumanoidHub/status/2023138332952363354

LLMs process text from left to right — each token can only look back at what came before it, never forward. This means that when you write a long prompt with context at the beginning and a question at the end, the model answers the question having “seen” the context, but the… https://x.com/burkov/status/2023822767284490263
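A toy sketch of that ordering point (it also connects to the prompt-repetition entries further down); the layouts and the mitigation phrasing here are mine, not the tweet's.

```python
# With causal attention, question tokens placed after the context can
# attend back to it, but the context tokens were already encoded with
# no knowledge of the question. Repeating the question on both sides
# gives the context a chance to be processed "question-aware".
context = "...long document..."
question = "What was Q3 revenue?"

standard = f"{context}\n\nQ: {question}"
question_aware = f"Q: {question}\n\n{context}\n\nQ: {question}"
```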

M-courtyard looks like a neat no-code app for fine-tuning LLMs locally with MLX and then exporting them for use with Ollama: https://x.com/awnihannun/status/2022327214218657948

MonoLoss: A Training Objective for Interpretable Monosemantic Representations. “we introduce the Monosemanticity Loss (MonoLoss), a plug-in objective that directly rewards semantically consistent activations for learning interpretable monosemantic representations. Across SAEs…” https://x.com/iScienceLuvr/status/2023303520057745501

New paper on a long-shot I’ve been obsessed with for a year: How much are AI reasoning gains confounded by expanding the training corpus 10000x? How much LLM performance is down to “local” generalisation (pattern-matching to hard-to-detect semantically equivalent training data)? https://x.com/g_leech_/status/2023384075537432662

New paper on skills. The conclusions hold up exactly 1-to-1 with our experience. Skills are better than docs, but only when made with care. Less is more. Models are really bad at making skills. 2 paragraphs of human-written condensed instructions (or best practices) are better… https://x.com/hrishioa/status/2024713140769083461

Nobody is Talking About Generalized Hill-Climbing (at Runtime) | Daniel Miessler https://danielmiessler.com/blog/nobody-is-talking-about-generalized-hill-climbing

optimize_anything: A Universal API for Optimizing any Text Parameter – GEPA https://gepa-ai.github.io/gepa/blog/2026/02/18/introducing-optimize-anything/

Orchestration design is now a first-class optimization target, independent of model scaling. As LLMs from different providers converge toward comparable benchmark performance, picking the best model yields diminishing returns. The real lever is orchestration topology, where you… https://x.com/omarsar0/status/2024847274157945035

PPO vs. new DPPO (Divergence PPO) – a workflow breakdown of the algorithms ➡️ Proximal Policy Optimization (PPO): Control via token ratios. PPO is the default choice for RL fine-tuning LLMs. It controls learning by clipping how much individual token probabilities can change… https://x.com/TheTuringPost/status/2022326245745377562
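For reference, a minimal sketch of the clipped objective being described; the 0.2 clip range is an illustrative default, not a number from the thread.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO's clipped surrogate loss over per-token log-probs."""
    ratio = torch.exp(logp_new - logp_old)            # pi_new / pi_old per token
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)    # limit how far each token moves
    # Pessimistic min of unclipped vs clipped objective, negated for minimization.
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```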

Prompt repetition is way overplayed. If you put the question first, the gains from repetition vanish or reduce dramatically. The biggest gains are on tasks where they didn’t report question-first variants… I wonder why. https://x.com/paul_cal/status/2024053549965934886

Pull Requests on @GitHub can now be limited to repo collaborators or disabled entirely. This should help cut down on unwanted noise and give maintainers more control over their experience. https://x.com/jaredpalmer/status/2022395520623480970

RL with evolving rubrics (RLER) in Dr. Tulu is a great step in the direction I expect Rubrics-as-Rewards (RaR) to go. At a high level, the goal of using rubrics for RL is to generalize RL with verifiable rewards to non-verifiable domains. Instead of using deterministic rules, we… https://x.com/cwolferesearch/status/2022384365049892974

Semantic closure: why compilers know when they are right and LLMs do not https://sderosiaux.substack.com/p/semantic-closure-why-compilers-know

Software as Wiki, Mutable Software – exe.dev blog https://blog.exe.dev/software-as-wiki

SpargeAttention2: Trainable Sparse Attention via Hybrid Top-k + Top-p Masking and Distillation Fine-Tuning. Paper: https://x.com/_akhaliq/status/2024873795173892483

State-of-the-art ColBERT models are trained by applying knowledge distillation on top of dense pre-trained models. What if we run the whole pre-training in the multi-vector setting? Introducing ColBERT-Zero, a model that sets a new SOTA on BEIR using only public data. https://x.com/antoine_chaffin/status/2024516779129626820

Strong coding and SOTA reasoning: -> no SWE-bench Verified SOTA -> but ARC-AGI-2 SOTA. https://x.com/scaling01/status/2024505232969928952

Super excited to share our new paper, on ∆Belief-RL! When pursuing open-ended goals (like science), we need to: – efficiently explore the environment and seek novel evidence – judge our actions by whether we think they took us closer to the target. Inspired by this, we train… https://x.com/ShashwatGoel7/status/2022341054939185345

Test-time reasoning models often converge too early. Achieving broader reasoning coverage requires longer sequences, yet the probability of sampling such sequences decays exponentially during autoregressive generation. The authors call this the “Shallow Exploration Trap.” This… https://x.com/dair_ai/status/2022360649817526275
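A tiny numeric illustration of that exponential decay, with a made-up per-step stop probability.

```python
# If the model ends its reasoning at each step with probability p_stop,
# the chance of sampling a chain of at least n steps is (1 - p_stop)**n.
p_stop = 0.02  # illustrative, not from the paper
for n in (50, 200, 500):
    print(f"P(length >= {n}) = {(1 - p_stop) ** n:.3%}")
# 50 -> ~36%, 200 -> ~1.8%, 500 -> ~0.004%: long chains become vanishingly rare.
```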

That’s what I see when talking to developers and companies: AI tools are becoming more capable and less rigid than older automation workflows. Many businesses are super interested, but they still lack the skills or time to set them up properly. The opportunity to help companies… https://x.com/TheTuringPost/status/2022048357427163279

The AI Quality Ceiling: Why Domain Expertise Is Appreciating – philippdubach.com https://philippdubach.com/posts/the-impossible-backhand/

The decode speedup is pretty ridiculous with this new sparse MoE + GatedDeltaNet architecture. https://x.com/scaling01/status/2023343837079572955

The Economics of LLM Inference: Batch Sizes, Latency Tiers, and Why Model Labs Have an Advantage https://mlechner.substack.com/p/the-economics-of-llm-inference-batch

The Long Tail of LLM-Assisted Decompilation | Chris’ Blog https://blog.chrislewis.au/the-long-tail-of-llm-assisted-decompilation/

The path to ubiquitous AI | Taalas https://taalas.com/the-path-to-ubiquitous-ai/

The Scarcity Trap: Why AI Still Feels Like a Metered Utility https://productics.substack.com/p/the-scarcity-trap-why-ai-still-feels

The Tiny Aya technical report is full of gems 💡 We go deep into design decisions and evaluation choices. Multilingual performance lives in the details. https://x.com/mziizm/status/2023775027754365044

Thrilled to have GGML with us going forward! 🤗❤️🦙 Read the announcement blog https://x.com/huggingface/status/2024871487753044243

Two different tricks for fast LLM inference https://www.seangoedecke.com/fast-llm-inference/

ÜberWeb: Insights from Multilingual Curation for a 20-Trillion-Token Dataset https://www.datologyai.com/blog/berweb-insights-from-multilingual-curation-for-a-20-trillion-token-dataset

v5-text uses decoder-only backbones with last-token pooling instead of mean pooling. Four lightweight LoRA adapters are injected at each transformer layer, handling retrieval, text-matching, classification, and clustering independently. Users select the appropriate adapter at… https://x.com/JinaAI_/status/2024505349181755760
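A minimal sketch of last-token pooling as described, using standard Hugging-Face-style shapes; this is illustrative, not Jina's actual implementation.

```python
import torch

def last_token_pool(hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Take each sequence's final non-padding hidden state as its embedding.

    hidden: (batch, seq, dim) hidden states from a decoder-only backbone.
    mask:   (batch, seq) attention mask with 1 for real tokens, 0 for padding.
    """
    last_idx = mask.sum(dim=1) - 1  # index of the last real token per sequence
    return hidden[torch.arange(hidden.size(0)), last_idx]  # (batch, dim)
```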

The Voxtral Realtime paper is out! The model is released under the Apache 2 license and achieves state-of-the-art transcription performance at sub-500ms latency. https://x.com/GuillaumeLample/status/2024445949733384638

We just shipped Baseline Experiments 🚀 You can now pin any experiment as your baseline in LangSmith. This allows you to track performance deltas, anchor your results, and quickly identify improvements or regressions in an experiment list. Docs: https://x.com/LangChain/status/2024208662936650152

What is On-Policy Self-Distillation (OPSD)? Now models are powerful enough to self-critique, comparing their own reasoning against a privileged, better version of themselves. This is a process of self-distillation, and OPSD performs it this way. One model plays two roles:… https://x.com/TheTuringPost/status/2022608611340677330

What people miss about RLMs, and what makes the idea beautiful to me, is that it is a harness that can implement pretty much any other model harness or workflow emergently. https://x.com/HammadTime/status/2024694115372499026

WO2025117006 SIMULATION OF A USER OF A SOCIAL NETWORKING SYSTEM USING A LANGUAGE MODEL https://patentscope.wipo.int/search/en/WO2025117006

Writeup on Rubric-Based RL is out now: https://t.co/io6zEeeEAZ Covers 15+ papers, the path from LLM-as-a-Judge to rubrics, and how we can use rubrics to extend RLVR beyond verifiable domains (with tons of tips / tricks from recent research). Hope it’s helpful! https://x.com/cwolferesearch/status/2023408158065188894

you guys have not felt the agi until you have vibe designed your 6000 person conference website at the climbing gym in between projects without reading a single line of code including 99% video asset performance optimization because why the heck not, it’s 2026 https://x.com/swyx/status/2021498862012334274

Skills are literally just markdown files, how the hell can they have downtime??? https://x.com/theo/status/2024785367896072599

Two views on AGI from @natolambert: – What we have is AGI – A drop-in replacement for a remote worker. Open models can contribute to this. They expand access, reduce power concentration, and keep the path to different forms of AGI open and transparent. Without open models… https://x.com/TheTuringPost/status/2023375354740809823

Why I don’t think AGI is imminent https://dlants.me/agi-not-imminent.html

lol what: Researchers found that repeating the exact same prompt twice dramatically improves LLM performance (one model improved from 21% to 97% accuracy on a name-search task) without longer outputs, slower responses, fine-tuning, or fancy prompt engineering. Because models… https://x.com/kimmonismus/status/2024069380162936992
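The trick itself is almost trivially simple; a hypothetical sketch (the helper name and example prompt are mine):

```python
def repeat_prompt(prompt: str, n: int = 2) -> str:
    """Concatenate the prompt with itself so the second copy is processed
    with the first copy already in context."""
    return "\n\n".join([prompt] * n)

print(repeat_prompt("Find the person named Eliza in this list: ..."))
```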

Looking Inside: a Maliciousness Classifier Based on the LLM’s Internals https://labs.zenity.io/p/looking-inside-a-maliciousness-classifier-based-on-the-llm-s-internals

Repeating Prompts https://daoudclarke.net/2026/02/19/repeating-prompt

MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE. TL;DR: a video diffusion-based model that jointly reconstructs dense 3D geometry and scene motion from monocular video in a unified 4D latent space. https://x.com/Almorgand/status/2023815479534723172

17,000 tokens per second!! Read that again! The LLM is hard-wired directly into silicon: no HBM, no liquid cooling, just raw specialized hardware. 10x faster and 20x cheaper than a B200. The “waiting for the LLM to think” era is dead. Code generates at the speed of human thought. https://x.com/wildmindai/status/2024810128487096357

Multilingual data finally moves from “data collection” to “data curation”. UberWeb breaches the compute-performance frontier for multilingual data. We spell out all our learnings from this year-long effort of training multilingual models at the 20-trillion-token data scale. https://x.com/pratyushmaini/status/2024157352862376280

Our recent preprint on gluon amplitudes has sparked a lot of discussion, so I want to share the backstory — including how AI helped crack a problem that had stumped us for a year. I’ll also be giving a public lecture at Harvard this week. Details at the end. https://x.com/ALupsasca/status/2023402422320926762
