Image created with gemini-3.1-flash-image-preview and claude-sonnet-4-5. Image prompt: Wide static shot of a muted Chinese electronics factory interior, workers assembling circuit boards at cluttered workbenches, natural overcast light through grimy windows, a chestnut horse standing naturally among the assembly line workers, desaturated concrete and metal surfaces, documentary realism, bold white text overlay reading TECH in poster style, Jia Zhangke observational composition, flat industrial lighting, human-scale intimacy.
This is a new separate estimate for LLM time horizon doubling times, and it mostly agrees with METR. In this case: ~4.8-5.7 months. https://x.com/scaling01/status/2023350946139435357
Spotify’s Top Developers Haven’t Written Code Since December, CEO Says – Business Insider https://www.businessinsider.com/spotify-developers-not-writing-code-ai-2026-2
141 days for Sonnet to go from 13.6% to 60.4% on ARC-AGI-2 https://x.com/scaling01/status/2023850250662969587
Sonnet 4.6 benchmarks: 79.6% SWE-Bench Verified, 58.3% ARC-AGI-2 https://x.com/scaling01/status/2023818940112327101
We estimate that Claude Opus 4.6 has a 50%-time-horizon of around 14.5 hours (95% CI of 6 hrs to 98 hrs) on software tasks. While this is the highest point estimate we’ve reported, this measurement is extremely noisy because our current task suite is nearly saturated. https://x.com/METR_Evals/status/2024923422867030027
Announcing Spreadsheet Arena | Meridian https://www.meridian.ai/blog/all/spreadsheet-arena
Excited to launch Gemini 3.1 Pro! Major improvements across the board including in core reasoning and problem solving. For example scoring 77.1% on the ARC-AGI-2 benchmark – more than 2x the performance of 3 Pro. Rolling out today in @GeminiApp, @antigravity and more – enjoy! https://x.com/demishassabis/status/2024519780976177645
Gemini 3.1 Pro benchmarks: 77.1% ARC-AGI-2, 80.6% SWE-Bench Verified https://x.com/scaling01/status/2024514798470181370
Gemini 3.1 Pro is here. Hitting 77.1% on ARC-AGI-2, it’s a step forward in core reasoning (more than 2x 3 Pro). With a more capable baseline, it’s great for super complex tasks like visualizing difficult concepts, synthesizing data into a single view, or bringing creative… https://x.com/sundarpichai/status/2024516418855981298
Gemini 3.1 Pro landed today. This is based on the same model behind the agentic DeepThink released last week; it is now available to all Gemini users on many apps. This is a really good model especially in reasoning and multimodal understanding/generation. Try it out. https://x.com/mirrokni/status/2024525808501477568
Gemini 3.1 Pro: Announcing our latest Gemini AI model https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/
Holy sh*t, that’s what I call an improvement! Gemini 3.1 Pro is insane: – ARC-AGI-2 77% – SWE verified 80% – HLE 44%/51% https://x.com/kimmonismus/status/2024521970184868000
To the Scientist, the Engineer, and the Developer: Gemini 3.1 Pro has arrived in @GeminiApp It’s a significant leap in complex reasoning (77.1% on ARC-AGI-2) so it’s great at agentic tasks, intricate coding, and data synthesis projects. You should see fewer errors, better… https://x.com/joshwoodward/status/2024515741819842623
Today, we’re continuing to push the boundaries of AI with our release of Gemini 3.1 Pro. This updated model scores 77.1% on ARC-AGI-2, more than double the reasoning performance of its predecessor, Gemini 3 Pro. Check out the visible improvement in this side-by-side comparison… https://x.com/JeffDean/status/2024525132266688757
Gemini 3.1 Pro is here! It’s top 3 across Text and Vision Arena, and #6 in Code Arena, tied closely with Claude Opus 4.5. Highlights: ▪️Tied #1 in Text (scoring 1500), 4 pts from Opus 4.6 ▪️Top 3 in Arena Expert Leaderboard (scoring 1538), just behind Opus 4.6 ▪️#6 in Code… https://x.com/arena/status/2024519891295089063
Gemini 3.1 Pro WebDev Arena results: – 6th place behind Opus 4.5/4.6 and GPT-5.2-high https://x.com/scaling01/status/2024522048312054142
Multimodal function calling is now available in the Gemini Interactions API, build agents that can see and process images natively. 🖼️ Tools return actual images, not text descriptions 👁️ Gemini 3 natively processes returned images 🛠️ Function results support mixed text and… https://x.com/_philschmid/status/2022349886318928158
Update regarding Gemini 3.1 Pro: -Ranked #1 among all Gemini models released to date. -Ranked #1 among all models I have tested so far. (GPT-5.2 high 165.9 vs Gemini 3.1 Pro 166.6) However, please note that my testing has limitations due to budget constraints: -I have not… https://x.com/Hangsiin/status/2024605310913216614
Introducing Lyria 3, our latest and most advanced music model, available in the Gemini App starting today : ) Go from idea, image, or video to music in seconds! https://x.com/OfficialLoganK/status/2024153948488118513
Meet Lyria 3, our latest music generation model from @GoogleDeepMind. 🎶 Now, you can create custom music tracks in the @GeminiApp — just by describing an idea or uploading an image or video. https://x.com/Google/status/2024154379838705920
We just launched Lyria 3! Our most advanced AI music model in the @GeminiApp 🎵 – Generates 30-second tracks from text or image prompts. – Supports custom lyrics, vocals, and cover art. – Supports 8 languages including English, Japanese, and Korean. – All outputs watermarked with… https://x.com/_philschmid/status/2024154542061805988
Use Lyria 3 to create music tracks in the Gemini app https://blog.google/innovation-and-ai/products/gemini-app/lyria-3/
Introducing EVMbench | OpenAI https://openai.com/index/introducing-evmbench/
Introducing EVMbench–a new benchmark that measures how well AI agents can detect, exploit, and patch high-severity smart contract vulnerabilities. https://x.com/OpenAI/status/2024193883748651102
How efficient is MiniMax M2.5? We benchmarked on 8xH200 TEP8 with @vllm_project . At a reasonable 10-25s TTFT, M2.5 is able to sustain ~2500 tok/s/GPU throughput. For decode, it’s still possible to reach ~20 tok/s/GPU throughput at a strict 20 tok/s/user interactivity with 10K+… https://x.com/SemiAnalysis_/status/2023418414203646066
MLX MiniMax 2.5 running LOCALLY on a single M3 Ultra 512GB! Writing a poem on LLMs at 6bit quantization! 🔥 Let’s start some coding, context and distributed tests! Generation: 40.2 tokens-per-sec Peak memory: 186 GB https://x.com/ivanfioravanti/status/2022338870172684655
Alibaba Yunqi: 7 models released in 4 days (Qwen3-Max, Qwen3-Omni, Qwen3-VL) and $52B roadmap | AINews https://news.smol.ai/issues/25-09-23-alibaba-yunqi
Alibaba’s new Qwen3.5-397B-A17B is the #3 open weights model in the Artificial Analysis Intelligence Index – a significant upgrade from Qwen3-235B-A22B-2507, and achieved with fewer active parameters than leading peers. Qwen3.5-397B-A17B is the first model released by Alibaba… https://x.com/ArtificialAnlys/status/2023794497055060262
Qwen https://qwen.ai/blog?id=qwen3.5#spatial-intelligence
Qwen3.5’s thinking is downright excessive. https://x.com/QuixiAI/status/2023995215690781143
.@mattshumer_ “Something Big is Happening” article now has 83 million views. Clearly, it hit a nerve. I also want to argue with it, even if that puts me on the unpopular side of the timeline. Because his piece gave me real, unproductive anxiety. https://x.com/TheTuringPost/status/2023743799042666989
The Future of Design Is Code and Canvas | Figma Blog https://www.figma.com/blog/the-future-of-design-is-code-and-canvas/
Announcing AA-WER v2.0 Speech to Text accuracy benchmark, and AA-AgentTalk, a new proprietary dataset focused on speech directed at voice agents. AA-AgentTalk focuses on the speech that matters most to voice agents. As a held-out, proprietary dataset, AA-AgentTalk also mitigates… https://x.com/ArtificialAnlys/status/2024157398139883729
Small update to the leaderboard at https://t.co/AU0F7BjYEh: it’s now all results from running with mini-SWE-agent v2, an upgrade over v1 that gets more juice out of the base models. https://x.com/OfirPress/status/2024177059895877802
We just updated the official SWE-bench leaderboard comparing all models with the exact same scaffold (mini-SWE-agent v2). Detailed cost analysis & links to browsable trajectories in 🧵 https://x.com/KLieret/status/2024176335782826336
On evaluating multi-step scientific tool use in LLM agents. SciAgentGym provides an interactive environment with 1,780 specialized tools across 4 scientific disciplines. The core finding: even advanced models like GPT-5 see success rates drop sharply from 60.6% to 30.9% as… https://x.com/dair_ai/status/2023404773031166320
[2602.16301] Multi-agent cooperation through in-context co-player inference https://arxiv.org/abs/2602.16301
The crazy part is that the AI Labs have generally been right. Like, the stuff they hyped in 2023 turned out to be real and working today. That doesn’t mean that the stuff they are predicting for 2028 will also be real, but it is probably worth noting those predictions & watching. https://x.com/emollick/status/2023257496069046563
📊Let’s dive deeper into @AnthropicAI’s Sonnet 4.6 vs 4.5. Overall: Sonnet 4.6 ranks 3 places higher (#13 vs #16) Where Sonnet 4.6 gains: Code: ▪️WebDev (+19 for Sonnet 4.6: #3 vs #22) Text: ▪️Instruction Following (+6, #5 vs #11) ▪️English (+5, #9 vs #14) ▪️Hard Prompts (+5)… https://x.com/arena/status/2024892330743124246
Claude Sonnet 4.6 (medium) scores 66.1% on WeirdML, matching Opus 4.6 (no thinking) and a big advance from Sonnet 4.5 at 47.7%. I had to run it on medium reasoning level because the default (high) constantly hit the 64k max tokens limit. Even at medium it uses as many output… https://x.com/htihle/status/2024764946051907659
Claude Sonnet 4.6 takes second place in the Artificial Analysis Intelligence Index (behind Opus 4.6), but used ~3x more output tokens than Claude Sonnet 4.5 in its max effort mode. Sonnet 4.6 leads all models in GDPval-AA and TerminalBench, including a slight lead over Opus 4.6 https://x.com/ArtificialAnlys/status/2024259812176121952
When I joined METR I was really skeptical that we were evaling models using simple OS scaffolds rather than Claude Code / Codex / etc. I really appreciate Nikola looking into this and I’m surprised it still doesn’t seem to make much difference for CC on Opus 4.5 https://x.com/ajeya_cotra/status/2022419978495127828
GLM-5 scores 48.2% on WeirdML, beating Claude Sonnet 4.5 and tying gpt-oss-120b (high) for the best open model. This is a clear advance but still far from Opus-4.6 at 78% and gpt-5.2 at 72%. https://x.com/htihle/status/2023734346943775179
OpenAI and Anthropic are much further ahead than what benchmarks show. While you are token constrained they are blasting millions of tokens at 4x the API speed without batting an eye and they scaffold like they are trying to build a skyscraper. https://x.com/scaling01/status/2023837889478758495
I looked into how Claude Code and Codex compare to the default scaffolds METR uses for time horizon measurements. It looks like they don’t significantly outperform our default scaffolds on any models we’ve tried them on so far. https://x.com/nikolaj2030/status/2022398669337825737
A paper worth paying close attention to. It presents Lossless Context Management (LCM), which reframes how agents handle long contexts. It outperforms Claude Code on long-context tasks. Recursive Language Models give the model full autonomy to write its own memory scripts. LCM… https://x.com/dair_ai/status/2023765147970662761
We’re officially opening our Bengaluru office–our new home base in India, and Anthropic’s second office in Asia-Pacific. India is our second-largest market for https://t.co/RxKnLNNcNR. We’re launching new partnerships to deepen our long-term commitment: https://x.com/AnthropicAI/status/2023322514206957688
WeirdML Time Horizons! Inspired by @METR_Evals I found time-horizons for the WeirdML tasks, using LLM-estimated human completion times. We find horizons of ~24 min (GPT-4) to ~38 hours (Opus 4.6), doubling time ~5 months. Links to blog post, git-repo + nice figures in thread. https://x.com/htihle/status/2023349189271572975
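The ~5-month doubling time in that thread can be sanity-checked with a quick log calculation. The two horizons (~24 min, ~38 h) come from the tweet; the ~35 months elapsed between GPT-4 and Opus 4.6 is my rough assumption, not a figure from the source.

```python
import math

# Back-of-envelope check of the ~5-month horizon doubling time quoted above.
# Horizons are from the tweet; the elapsed time is an assumption.
h_start_min = 24          # ~24 min horizon (GPT-4), in minutes
h_end_min = 38 * 60       # ~38 h horizon (Opus 4.6), in minutes
months_elapsed = 35       # assumed ~35 months between the two models

# Number of times the horizon doubled over that span.
doublings = math.log2(h_end_min / h_start_min)  # ~6.6 doublings

print(f"{months_elapsed / doublings:.1f} months per doubling")  # → 5.3 months per doubling
```

A slightly different assumed gap (say 33-38 months) still lands in the ~5-6 month range, consistent with the thread's figure.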
Exclusive: Peter Thiel-backed industrial AI startup emerges from stealth with funding from a16z | Fortune https://fortune.com/2026/02/09/exclusive-peter-thiel-alexis-ohanian-new-ai-industrial-startup-emanate-kiara-nirghin/
GDPval remains one of the best benchmarks for doing complex real world agentic tasks. But worth noting that GDPval-AA is not the same thing. It only uses the public problem set, and all evaluation is done by Gemini, not by humans/specialized graders like in the real GDPval. https://x.com/emollick/status/2023854803328311722
I think people are overinterpreting these time horizon evals. They are very impressive! But when error rates are near zero, and tasks require many successful steps in order to complete, small absolute improvements in error rate have a multiplicative effect. Consider a task… https://x.com/xlr8harder/status/2024946945232445710
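The multiplicative effect described in that thread is easy to make concrete: if a task requires n successful steps and each step succeeds with probability p, the whole task succeeds with probability p^n. The specific numbers below are illustrative, not from the thread.

```python
# Small per-step error improvements compound multiplicatively over long tasks.
# Hypothetical numbers for illustration only.

def task_success(per_step_success: float, n_steps: int) -> float:
    """Probability of completing all n_steps without a single failure."""
    return per_step_success ** n_steps

# For a 100-step task, halving the per-step error rate (1% -> 0.5%)
# nearly doubles the end-to-end success rate.
p_before = task_success(0.99, 100)    # ~0.366
p_after = task_success(0.995, 100)    # ~0.606
print(f"1.0% error/step: {p_before:.3f}")
print(f"0.5% error/step: {p_after:.3f}")
```

This is why a modest-looking drop in per-step error rate can show up as a dramatic jump in measured time horizon.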
Last year, we noticed a gap between scores in our SWE-bench Verified runs and scores reported elsewhere. We’ve now updated our evaluation methodology. For most models, we’re seeing scores close to those reported by the original model developers. https://x.com/EpochAIResearch/status/2024924403142910137
Seems like a lot of people are taking this as gospel–when we say the measurement is extremely noisy, we really mean it. Concretely, if the task distribution we’re using here was just a tiny bit different, we could’ve measured a time horizon of 8 hours, or 20 hours. https://x.com/idavidrein/status/2024938968434049117
very curious that the extra reasoning that everyone observed during testing isn’t showing up on AA-index https://x.com/scaling01/status/2024519669680320659
We went from AI systems that struggled to do grade school math to AI systems that can solve research-level math problems in just a few years. I agree with Jakub this is perhaps the most important eval now. I am also pretty sure the main reaction will be “it’s not that hard” :) https://x.com/sama/status/2022729068949717182
The model is a step forward in reasoning, designed for workflows where a simple answer isn’t enough. On ARC-AGI-2 – which tests for novel logic patterns – it more than doubles 3 Pro’s score. This means it can help you visualize complex topics, organize scattered data, and bring… https://x.com/GoogleDeepMind/status/2024516467618656357
Earlier today I wanted to doom about Gemini 3.1 Pro completely failing ARC-AGI-3. Turns out this was due to a bug in the config introduced by GPT-5.3. It was still calling Gemini 3.0 Pro instead of 3.1. I fixed it, made the harness simpler and spent $120. Performance of Gemini… https://x.com/scaling01/status/2024642220096442772
Gemini 3.1 is the faster horse. It’s like a horse with rocket fuel. Truly insane. Everyone else makes cars now. https://x.com/theo/status/2024808734053347608
Gemini 3.1 Pro on ARC-AGI Semi-Private Eval @GoogleDeepMind – ARC-AGI-1: 98%, $0.52/task – ARC-AGI-2: 77%, $0.96/task Gemini continues to push the Pareto frontier of performance and efficiency https://x.com/arcprize/status/2024522812728496470
Gemini 3.1 Pro Preview scored highest in the Artificial Analysis Intelligence Index but its most significant advantage might be its price and token efficiency. Our evaluations cost <50% to run on Gemini 3.1 Pro Preview compared to Claude Opus 4.6 (max) and GPT-5.2 (xhigh). Gemini… https://x.com/ArtificialAnlys/status/2024677979390169536
Gemini Pro 3.1 (& other frontier models) are still terrible at Connect 4. Yet smashing ARC-AGI-2. That is weird, right? ARC was built to be resistant to overfitting. I guess the fully generalised world of ARC-AGI puzzles is still a very narrow slice of spatial reasoning https://x.com/paul_cal/status/2024748708223402120
I gave Gemini 3.1 Pro an ARC-AGI-2 challenge WITH solution and it bombed it … SVGs might have been successfully sloptimized. GPT-5.2 Thinking realizes after 14s of Thinking that I gave it the solution in the input and just repeats it. Gemini 3.1 Pro thought for 8 minutes https://x.com/scaling01/status/2024268831321993590
Loving Gemini 3.1 Pro! It made 3 huge improvements to my compiler and saw things that even ChatGPT 5.2 Pro Extended and Claude Opus 4.6 Extended couldn’t see. https://x.com/QuixiAI/status/2024545096532733967
oh and ARC-AGI-3 is crazy expensive to run https://x.com/scaling01/status/2024650634746610041
By the way, the recent Gemini 3.1 Pro is also a really good model for RLMs. Claude Opus 4.6 is the worst of the ones I tested. Probably not optimized for the type of decomposition that RLMs need. I am just impressed by GPT-5.2-Codex. The strategies it uses are brilliant. https://x.com/omarsar0/status/2024973182436831629
Claude Sonnet 5: The “Fennec” Leaks – Fennec Codename: Leaked internal codename for Claude Sonnet 5, reportedly one full generation ahead of Gemini’s “Snow Bunny.” – Imminent Release: A Vertex AI error log lists claude-sonnet-5@20260203, pointing to a February 3, 2026 release https://x.com/pankajkumar_dev/status/2018187650927349976?s=46
Gemini 3.1 Pro will be a massive step-up! There’s a decent chance it’s on par with Opus 4.6 and GPT-5.3. The main reason for that: similarly to Claude 4.6 and GPT-5.2/5.3 it thinks much longer than Gemini 3 Pro. The same request on aistudio, tested multiple times, had 6… https://x.com/scaling01/status/2024251668771066362
Google is once again the leader in AI: Gemini 3.1 Pro Preview leads the Artificial Analysis Intelligence Index, 4 points ahead of Claude Opus 4.6 while costing less than half as much to run. @GoogleDeepMind gave us pre-release access to Gemini 3.1 Pro Preview. It leads 6 of the… https://x.com/ArtificialAnlys/status/2024518545510662602
In Arena Expert, with expert level prompts, Gemini 3.1 Pro Preview lands in the top 3 (scoring 1538), just behind Claude Opus 4.6 https://x.com/arena/status/2024519895623598423
Sonnet 4.6 crushes Gemini 3 and GPT-5.2 on Vending-Bench 2 https://x.com/scaling01/status/2023833660546499053
Claude Sonnet 4.6 has landed #3 in Code and #13 in Text Arena! Highlights: ▪️+130 pts jump in Code Arena (#22 -> #3) compared to Sonnet 4.5, surpassing top-tier thinking models like Gemini-3.1 and GPT-5.2 ▪️Strong gains in Text categories: Math (#4) and Instruction Following… https://x.com/arena/status/2024883614249615394
📊 Let’s dive deeper into Gemini 3.1 Pro gains. It ranks 13 points above Gemini 3 Pro overall. We see the largest rank gains for @GoogleDeepMind’s latest model in the following categories: Text: ▪️Coding (+5) ▪️Math (+4) ▪️Expert (+3) ▪️Instruction Following (+3) ▪️Multi-Turn… https://x.com/arena/status/2024588456463389040
Check out the skills for the Gemini API! More soon! https://x.com/osanseviero/status/2022259577232785866
Context Arena Update: Added @Google’s Gemini 3.1 Pro Preview to the MRCR leaderboards (2-, 4-, 8-needle)! Meant to send this out earlier today. Thanks to @GoogleDeepMind and others over there for early access! Thinking budget barely matters on simpler retrieval – 2-needle AUC… https://x.com/DillonUzar/status/2024655613293215855
Gemini 3.1 Pro has landed! Amazing performance / capabilities across the board. Beyond SOTA, the best are all the things that evals can’t measure. E.g. SVG has gotten so much better (see 🧵) https://x.com/OriolVinyalsML/status/2024519605570720185
Gemini 3.1 Pro in 1st place on the Artificial Analysis Leaderboard https://x.com/scaling01/status/2024517196727099847
Gemini 3.1 Pro is rolling out now in the @GeminiApp, and exclusively to Google AI Pro and Ultra users in @NotebookLM. Developers can access it in preview via the API in @GoogleAIStudio. Find out more → https://x.com/GoogleDeepMind/status/2024516471720743295
Gemini 3.1 Pro’s GDPval scores are concerning https://x.com/scaling01/status/2024515061163704336
Gemini Deep Think 3 is the world’s most capable model by many measures, huge amounts of progress on reasoning benchmarks and more. Available right now via the Gemini App for Ultra subscribers and in the API soon : ) https://x.com/OfficialLoganK/status/2021996626144080015
Good news: Google AI Studio and the Gemini API are now live in Moldova, Andorra, San Marino, and Vatican City! 🌍 https://x.com/OfficialLoganK/status/2022688445957820610
Google is back on the intelligence-cost frontier with Gemini 3.1 Pro https://x.com/scaling01/status/2024519007018373202
Google tests NotebookLM integration for Opal workflows https://www.testingcatalog.com/google-test-notebooklm-integration-for-opal-workflows/
I would expect only a few models to make progress with this rather simple harness: GPT-5.2-xhigh, Opus 4.5 and Opus 4.6 and Gemini 3.1 Pro. Other models will have a very hard time https://x.com/scaling01/status/2024661145286557872
Last week we upgraded Gemini 3 Deep Think. Today, we’re shipping the core intelligence that makes those breakthroughs possible: Gemini 3.1 Pro. A noticeably smarter, more capable baseline for your hardest challenges. Available now: https://x.com/NoamShazeer/status/2024519946764734574
Multimodal Function Calling with Gemini 3 and Interactions API https://www.philschmid.de/interactions-multimodal-fc
My vibe is unchanged: Gemini 3.1 is a previous gen model. It naively lives in a context-universe engineered by the God-User. Opus is a friend-type AI. It sits with you in a KFC. 5.2 sees a vast expanse of thought. Below there’s a given context. A user makes some noise, perhaps. https://x.com/teortaxesTex/status/2024574416747671556
Saw Gemini 3.1 announcement, got super excited. Tried Google Antigravity… not available. Tried Gemini CLI… not available. Tried Gemini Code Assist… not available. @OfficialLoganK put AI Studio in an Electron Shell and just launch it. You will deliver these faster. https://x.com/matvelloso/status/2024548414198091922
Today we’re releasing a preview of Gemini 3.1 Pro and making it available to our users and developers. Very excited to bring the upgraded core we used in Deep Think to everyone. Learn more about Gemini 3.1 Pro: https://x.com/koraykv/status/2024517699595124902
We just made paying for the Gemini API 10x easier : ) You can now upgrade to a paid Gemini API account without leaving AI Studio, track your usage, filter spend by model, and much more to come! https://x.com/OfficialLoganK/status/2022409335465480346
We made a skill for the Gemini API! https://x.com/OfficialLoganK/status/2022123808296251451
Here are some useful prompting tips to get the most out of our new music generation model in Gemini, Lyria 3 ↓ https://x.com/GeminiApp/status/2024167107538407783
Introducing Lyria 3, our new music generation model in Gemini that lets you turn any idea, photo, or video into a high-fidelity track with custom lyrics. From funny jingles to lo-fi beats, you can create custom 30-second soundtracks for any moment. See how it works. 🧵 https://x.com/GeminiApp/status/2024152863967240529
Impressive benchmarks for the new Chinese LLM. The system card notes some gaps with US closed source models in code generation & wide knowledge, so be interested to see it in operation. Not clear it is open weights though? If not, represents a large shift in the AI market. https://x.com/emollick/status/2022658647378268361
Blog about @MiniMax_AI ’s Forge RL system. Core takeaways: 1. still CISPO 2. process reward, completion time reward 3. multi-level prefix cache 4. rollout uses 60% compute 5. millions of trajectories per day https://t.co/IrKDOoiKAB cc @teortaxesTex https://x.com/YouJiacheng/status/2022339475049947576
The dark side of reinforcement learning @olive_jy_song, senior researcher at @MiniMax_AI, about RL models that try to hack rewards and why alignment fails in practice. This conversation is an inside look at how Chinese AI labs move fast – testing new models overnight, debugging… https://x.com/TheTuringPost/status/2022961676799398337
🤔Has MiniMax finally stabilized its path in reasoning and coding? Still a hot review from Zhihu contributor toyama nao, and he calls it: “Root downward, grow upward.” 🔥 After the flawed M2.1 (stronger coding, weaker logic), M2.5 fixes the technical issues and restores balance… https://x.com/ZhihuFrontier/status/2022214461415993817
$1 per hour with 100 tps https://x.com/MiniMax_AI/status/2022379949336957254
It’s been a few days since onboarding @MiniMax_AI’s latest model, M2.5, in standard and Lightning variants. Results are showing on our leaderboard. With over 3K votes, M2.5 Lightning ranks eighth among open models, with Standard following closely behind! Let’s run some prompts: https://x.com/yupp_ai/status/2024165671136059892
MiniMax M2.5 casually responding at ~50 tok/s with MLX (M3 Ultra). The model was released one hour ago 🥳 https://x.com/pcuenq/status/2022336556326060341
Nice independent look at SWE-bench Verified by @simonw. MiniMax M2.5 showing strong results under the same evaluation setup. Worth a read https://x.com/MiniMax_AI/status/2024646767325958285
People were saying as early as Oct 2024 that SWE-bench was saturated when scores were just ~50%. Awesome chat from Minimax team that shows otherwise. We’re certainly much, much closer, but there’s evidence that some room remains. Tiny 🧵 https://x.com/jyangballin/status/2022367240293949772
RL often throws away useful signal at intermediate steps, or as @karpathy put it, it’s like “sucking supervision through a straw.” MiniMax M2.5 solves this with per-token process rewards. The result is frontier coding performance at least 1/10th the cost of closed source. https://x.com/basetenco/status/2022456010049495213
RL shouldn’t waste signal. M2.5’s per-token process rewards improve signal utilization across reasoning steps, delivering frontier coding performance with dramatically better cost efficiency. Thanks @basetenco for the deep dive and day-0 hosting! https://x.com/MiniMax_AI/status/2023470874708549941
Qwen3.5-397B-A17B SVG results: I have seen better. DeepSeek-V3.2 and GLM-5 both beat it. https://x.com/scaling01/status/2023364296277721300
🚀 Qwen3.5-397B-A17B-FP8 weights are now open! It took some time to adapt the inference frameworks, but here we are: ✅ SGLang support is merged 🔄 vLLM PR submitted → https://t.co/rJkuitOBWs Check the model card for example code. vLLM support landing in the next couple of days! https://x.com/Alibaba_Qwen/status/2024161147537232110
🚩Cerebras’s MiniMax-M2 GGUF 2-bit model: https://t.co/udlviJQZqQ Qwen3-Coder-Next INT4 model: https://x.com/HaihaoShen/status/2022293472796180676
A clarification of Qwen3.5 Plus and 397B: 1. for opensource, we follow the tradition to make parameters apparent so we use the name with the number of total parameters and active params. 2. Qwen3-Plus is a hosted API version of 397B. As the model natively supports 256K tokens… https://x.com/JustinLin610/status/2023340126479569140
It’s Qwen 3.5 day today! 🥳 State of the art 800 GB model. Runs _locally_ with MLX using Q4, taking 225 GB of RAM. https://x.com/pcuenq/status/2023369902011121869
Let’s do the KV cache math for Qwen3.5: – KV heads: 2 – Head dimension: 256 – gated attention layers: 15 – bytes per element (BF16): 2. So: 2 x 256 x 15 x 2 = 15,360 bytes. This is the same for K and V, so we multiply by 2: 30,720 bytes, roughly 31 kB per token of context. Meaning at max… https://x.com/bnjmn_marie/status/2023424404504342608
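That back-of-envelope KV-cache calculation can be reproduced directly; the head count, head dimension, and layer count below are the tweet's figures, not independently verified specs.

```python
# Per-token KV-cache size for Qwen3.5, using the figures quoted above.
kv_heads = 2          # KV heads (grouped-query attention)
head_dim = 256        # dimension per head
attn_layers = 15      # gated attention layers that keep a KV cache
bytes_per_elem = 2    # BF16

# Bytes per token for the K cache alone, then double for K and V.
k_bytes = kv_heads * head_dim * attn_layers * bytes_per_elem  # 15,360 bytes
kv_bytes = 2 * k_bytes                                        # 30,720 bytes
print(f"{kv_bytes} bytes (~{kv_bytes / 1000:.0f} kB) per token of context")
```

At ~31 kB per token, a full 256K-token context would need roughly 8 GB of KV cache, which is why the hybrid architecture's small KV footprint matters.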
ollama run qwen3.5:cloud Qwen3.5-397B-A17B is the first open-weight model in the series. It’s available on Ollama’s cloud right now! Give it a try. Let’s go! 🚀🚀🚀 https://x.com/ollama/status/2023334181804069099
Qwen 3.5 Plus is now available on AI Gateway. Thanks @vercel_dev team. 🤝 Use model: ‘alibaba/qwen3.5-plus’ Try it now! https://x.com/Alibaba_Qwen/status/2024029499541909920
Qwen3.5 runs quite well in mlx-lm. Awesome that we have a frontier-level hybrid model. The context gets longer but the inference speed and memory use barely change. Here’s the Q4 generating a space invaders game on an M3 Ultra. Generated 4,120 tokens at 37.6 tok/s. https://x.com/awnihannun/status/2023462412092059679
So speaking of benchmarks, what can be said of the new open Qwen? First, it completely destroys Qwen3-VL-235B ofc, but more surprisingly it outscores Qwen3-Max-thinking. All the while it’s the same model as “Plus”. Plus just has 1M context and some more bells and whistles. https://x.com/teortaxesTex/status/2023331885402009779
The new chonky Qwen 3.5 looks pretty solid, beating their own Qwen3-Max model everywhere and is much better at vision benchmarks than Qwen3-235B-A22B-VL. Now what I sadly haven’t seen is anything on reasoning efficiency. https://x.com/scaling01/status/2023343368399704506
Kimi K2‑0905 and Qwen3‑Max preview: two 1T open weights models launched | AINews https://news.smol.ai/issues/25-09-05-1t-models
(1/7) We’re releasing ThunderKittens 2.0! Faster kernels, cleaner code, industry contributions, and new state-of-the-art BF16 / MXFP8 / NVFP4 GEMMs that match or surpass cuBLAS! Alongside this release, we’re equally excited to share some insights we learned while squeezing every… https://x.com/stuart_sul/status/2024897621874422125
[2602.11865] Intelligent AI Delegation https://arxiv.org/abs/2602.11865
[2602.12036] Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models https://arxiv.org/abs/2602.12036
[2602.13949] Experiential Reinforcement Learning https://arxiv.org/abs/2602.13949
[2602.15322] On Surprising Effectiveness of Masking Updates in Adaptive Optimizers https://arxiv.org/abs/2602.15322
🤯 With 11B active parameters (196B MoE), Step 3.5 Flash is going toe-to-toe with the best closed models. The efficiency curve is getting absurd. https://x.com/fdaudens/status/2021949479771861100
1/ We’ve released a report on our work on multilingual data curation @datologyai. tl;dr: We shift the performance-compute Pareto frontier for multilingual models. Entirely by improving data quality and composition. arxiv: https://t.co/bLv8IySa8G blog: https://x.com/agcrnz/status/2024207781524623690
10 must-read books and surveys about AI and Machine Learning ▪️ Machine Learning Systems by Vijay Janapa Reddi ▪️ Understanding Deep Learning by Simon J.D. Prince ▪️ Interpretable Machine Learning by Christoph Molnar ▪️ Foundations of LLMs ▪️ A Survey on Post-training of LLMs… https://x.com/TheTuringPost/status/2023058041864888324
13 foundational types of AI models ▪️ LLM ▪️ SLM ▪️ VLM ▪️ MLLM ▪️ VLA ▪️ LAM ▪️ RLM ▪️ MoE ▪️ SSM ▪️ RNN ▪️ CNN ▪️ SAM ▪️ LNN Save the list and check this out for explanations and useful resource links: https://x.com/TheTuringPost/status/2022599637623038442
24 dedicated people. $30M spent on development. Extreme specialization, speed, and power efficiency. Today we launch Taalas’ first product. Check it out: Details: https://t.co/88CA0XAL71 Demo chatbot: https://t.co/ec4ladcKnw API: https://x.com/taalas_inc/status/2024516399251456150
5 years later @github finally implements my request. 🎉 Pull Requests can be disabled! 🎊 https://x.com/joshmanders/status/2022170444116414790
A small PSA: if you’re using vLLM, you might find SGLang is faster on H100s and B200s. A little rabbit hole + some help from the vLLM folks, and we figured out it’s because vLLM would choose DeepGEMM on some models, which isn’t the best (Triton is). Set VLLM_USE_DEEP_GEMM=0! https://x.com/TheZachMueller/status/2024619480580510117
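As a sketch, the workaround from that thread amounts to setting the environment variable before launching the server. `VLLM_USE_DEEP_GEMM` is the variable named in the tweet; the model name below is a placeholder, and whether the Triton path wins depends on your GPU and model.

```shell
# Force vLLM to skip DeepGEMM and fall back to its Triton GEMM path.
# <your-model> is a placeholder for the model you serve.
export VLLM_USE_DEEP_GEMM=0
vllm serve <your-model>
```

Worth benchmarking both settings on your own workload before committing to either.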
Additionally, code execution, web fetch, memory, programmatic tool calling, tool search, and tool use examples are now generally available. Read more: https://x.com/alexalbert__/status/2023834875678298535?s=46
Announcing 🌇HumanLM, a RL framework that trains LLMs to simulate human users’ responses, along with 🌆Humanual, a comprehensive user simulation benchmark https://t.co/5TZ9WOOFB8 🌄 One thing that’s fascinating about our society: human users shape the world and determine the… https://x.com/ShirleyYXWu/status/2022374624676421676
As I’ve mentioned before, since my testing methodology has nearly reached a saturation point, it may not reflect the actual user experience as closely as it used to. Also, since this is fundamentally not a coding or scientific benchmark, the results may not directly correlate… https://x.com/Hangsiin/status/2024605313744458043
As usual, we’re releasing everything under Apache 2.0: all models (including intermediate checkpoint and prompt exploration) and PyLate training scripts 📦 Models: https://t.co/oiGjYu08dT 💻 Code: https://t.co/b9LBCRxSh5 📄 Paper: https://x.com/antoine_chaffin/status/2024516823685730690
Crusoe Managed Inference: Low latency and breakthrough speed https://www.crusoe.ai/cloud/managed-inference
Day Zero for Multi-Vector Retrieval. Today we’re flipping the retrieval playbook: no dense model adaptation, no retrofit. 🏗️Multi-vector from scratch, powered by PyLate. Meet ColBERT-Zero In collaboration with @EPFL and the Swiss AI initiative, @LightOnIO pre-trained it … https://x.com/LightOnIO/status/2024517870785282545
Even if you are literally training a model to do formal verification, RLAIF (in the form of “rubrics as rewards”, https://t.co/waawRPS1Lw) seems to beat RLVR. Put formal verifiers: 🟢 in your agent loop/harness 🟢 as part of the training loop ❌ in control of the reward signal https://x.com/davidad/status/2022361016995319850
FastMCP 3.0 is out! I don’t even know how to fit it in a tweet… 🔌 Build servers from directories, APIs, remote servers — anything 🎭 Per-session context & progressive disclosure 🖥️ Full CLI: list, call, generate clients 🔐 Versioning, granular auth, OTEL ⚡️ DX galore https://x.com/jlowin/status/2024242656377700618
Good article on how the SWE-fficiency ranking is broken https://x.com/scaling01/status/2024171017929638061
How persistent is the inference cost burden? – by JS Denain https://epochai.substack.com/p/how-persistent-is-the-inference-cost
I think it must be a very interesting time to be in programming languages and formal methods because LLMs change the whole constraints landscape of software completely. Hints of this can already be seen, e.g. in the rising momentum behind porting C to Rust or the growing interest … https://x.com/karpathy/status/2023476423055601903
Inference compute scarcity seems plausible. Hyper growth in demand with limited hardware or data center energy supply means token shortage. This could be a reason to run more AI locally. https://x.com/awnihannun/status/2024664226837778490
It is cool that 5.2 “with a scaffold” can think for 12 hours *productively*. At pleb speeds that’s 1.8M tokens, and at non-pleb speeds I don’t know what the point of giving a wall clock time is. In any case that’s well into the zone of diminishing returns on normal code tasks. https://x.com/teortaxesTex/status/2022401945429000685
jina-embeddings-v5-text is here! Our fifth generation of jina embeddings, pushing the quality-efficiency frontier for sub-1B multilingual embeddings. Two versions: small & nano, available today on Elastic Inference Service, vLLM, GGUF and MLX. https://x.com/JinaAI_/status/2024505342277964129
Jitendra Malik rants about parallel-jaw grippers being inadequate; he believes multi-fingered hands with tactile sensing are necessary for advanced dexterous manipulation. Malik is a Professor at UC Berkeley and a Distinguished Scientist at Amazon. https://x.com/TheHumanoidHub/status/2023138332952363354
LLMs process text from left to right — each token can only look back at what came before it, never forward. This means that when you write a long prompt with context at the beginning and a question at the end, the model answers the question having “seen” the context, but the … https://x.com/burkov/status/2023822767284490263
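The ordering point above can be made concrete with a small sketch. `build_prompt` is a hypothetical helper, not from the tweet: with causal attention, putting the question before the context lets every context token be processed with the question already in scope.

```python
def build_prompt(context: str, question: str, question_first: bool = True) -> str:
    """Assemble a prompt for a causal (left-to-right) LLM.

    Tokens only attend backwards, so stating the question up front lets the
    model read the context already knowing what it is looking for; with the
    question last, the context is encoded 'blind' to it.
    """
    if question_first:
        return f"Question: {question}\n\nContext:\n{context}\n\nAnswer:"
    return f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"
```

A sketch under the stated assumption, not a universal rule: instruction-tuned models differ in which layout they were trained to expect.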
M-courtyard looks like a neat no-code app to fine-tune LLMs locally with MLX and then export them for use with Ollama: https://x.com/awnihannun/status/2022327214218657948
MonoLoss: A Training Objective for Interpretable Monosemantic Representations “we introduce the Monosemanticity Loss (MonoLoss), a plug-in objective that directly rewards semantically consistent activations for learning interpretable monosemantic representations. Across SAEs …” https://x.com/iScienceLuvr/status/2023303520057745501
New paper on a long-shot I’ve been obsessed with for a year: How much are AI reasoning gains confounded by expanding the training corpus 10000x? How much LLM performance is down to “local” generalisation (pattern-matching to hard-to-detect semantically equivalent training data)? https://x.com/g_leech_/status/2023384075537432662
New paper on skills. The conclusions hold up exactly 1-to-1 with our experience. Skills are better than docs, but only when made with care. Less is more. Models are really bad at making skills. 2 paragraphs of human-written condensed instructions (or best practices) are better … https://x.com/hrishioa/status/2024713140769083461
Nobody is Talking About Generalized Hill-Climbing (at Runtime) | Daniel Miessler https://danielmiessler.com/blog/nobody-is-talking-about-generalized-hill-climbing
optimize_anything: A Universal API for Optimizing any Text Parameter – GEPA https://gepa-ai.github.io/gepa/blog/2026/02/18/introducing-optimize-anything/
Orchestration design is now a first-class optimization target, independent of model scaling. As LLMs from different providers converge toward comparable benchmark performance, picking the best model yields diminishing returns. The real lever is orchestration topology, where you … https://x.com/omarsar0/status/2024847274157945035
PPO vs. new DPPO (Divergence PPO) – a workflow breakdown of the algorithms ➡️ Proximal Policy Optimization (PPO): Control via token ratios PPO is the default choice for RL fine-tuning LLMs. It controls learning by clipping how much individual token probabilities can change. ▪️ … https://x.com/TheTuringPost/status/2022326245745377562
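The clipping mechanism the PPO entry describes can be sketched in a few lines. This is the standard per-token clipped surrogate on scalar values for illustration; real RLHF implementations compute it over tensors of log-probability ratios, and nothing here is specific to DPPO.

```python
def ppo_clipped_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-token PPO surrogate objective.

    ratio = pi_new(token) / pi_old(token). Take the pessimistic minimum of
    the raw and clipped importance-weighted advantage, so moving the policy
    further than [1-eps, 1+eps] from the old one earns no extra reward.
    """
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)
```

For example, with a positive advantage a ratio of 1.5 is credited only as 1.2 (clipped), while a ratio inside the trust region passes through unchanged.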
Prompt repetition is way overplayed. If you put the question first, the gains from repetition vanish or reduce dramatically. The biggest gains are on tasks where they didn’t report question-first variants… I wonder why https://x.com/paul_cal/status/2024053549965934886
Pull Requests on @GitHub can now be limited to repo collaborators or disabled entirely. This should help cut down on unwanted noise and give maintainers more control over their experience https://x.com/jaredpalmer/status/2022395520623480970
RL with evolving rubrics (RLER) in Dr. Tulu is a great step in the direction I expect Rubrics-as-Rewards (RaR) to go. At a high level, the goal of using rubrics for RL is to generalize RL with verifiable rewards to non-verifiable domains. Instead of using deterministic rules, we … https://x.com/cwolferesearch/status/2022384365049892974
Semantic closure: why compilers know when they are right and LLMs do not https://sderosiaux.substack.com/p/semantic-closure-why-compilers-know
Software as Wiki, Mutable Software – exe.dev blog https://blog.exe.dev/software-as-wiki
SpargeAttention2 Trainable Sparse Attention via Hybrid Top-k+Top-p Masking and Distillation Fine-Tuning paper: https://x.com/_akhaliq/status/2024873795173892483
State-of-the-art ColBERT models are trained by applying knowledge distillation on top of dense pre-trained models. What if we run the whole pre-training in the multi-vector setting? Introducing ColBERT-Zero, a model that sets a new SOTA on BEIR, using only public data https://x.com/antoine_chaffin/status/2024516779129626820
strong coding and SOTA reasoning: -> no SWE-Bench Verified SOTA -> but ARC-AGI-2 SOTA https://x.com/scaling01/status/2024505232969928952
Super excited to share our new paper, on ∆Belief-RL! When pursuing open-ended goals (like science), we need to: – efficiently explore the environment and seek novel evidence – judge our actions by whether we think they took us closer to the target. Inspired by this, we train … https://x.com/ShashwatGoel7/status/2022341054939185345
Test-time reasoning models often converge too early. Achieving broader reasoning coverage requires longer sequences, yet the probability of sampling such sequences decays exponentially during autoregressive generation. The authors call this the “Shallow Exploration Trap.” This … https://x.com/dair_ai/status/2022360649817526275
That’s what I see when talking to developers and companies: AI tools are becoming more capable and less rigid than older automation workflows. Many businesses are super interested, but they still lack the skills or time to set them up properly. The opportunity to help companies … https://x.com/TheTuringPost/status/2022048357427163279
The AI Quality Ceiling: Why Domain Expertise Is Appreciating – philippdubach.com https://philippdubach.com/posts/the-impossible-backhand/
The decode speedup is pretty ridiculous with this new sparse MoE + GatedDeltaNet architecture https://x.com/scaling01/status/2023343837079572955
The Economics of LLM Inference: Batch Sizes, Latency Tiers, and Why Model Labs Have an Advantage https://mlechner.substack.com/p/the-economics-of-llm-inference-batch
The Long Tail of LLM-Assisted Decompilation | Chris’ Blog https://blog.chrislewis.au/the-long-tail-of-llm-assisted-decompilation/
The path to ubiquitous AI | Taalas https://taalas.com/the-path-to-ubiquitous-ai/
The Scarcity Trap: Why AI Still Feels Like a Metered Utility https://productics.substack.com/p/the-scarcity-trap-why-ai-still-feels
The Tiny Aya technical report is full of gems 💡 We go deep into design decisions and evaluation choices. Multilingual performance lives in the details. https://x.com/mziizm/status/2023775027754365044
Thrilled to have GGML with us going forward! 🤗❤️🦙 Read the announcement blog https://x.com/huggingface/status/2024871487753044243
Two different tricks for fast LLM inference https://www.seangoedecke.com/fast-llm-inference/
ÜberWeb: Insights from Multilingual Curation for a 20-Trillion-Token Dataset https://www.datologyai.com/blog/berweb-insights-from-multilingual-curation-for-a-20-trillion-token-dataset
v5-text uses decoder-only backbones with last-token pooling instead of mean pooling. Four lightweight LoRA adapters are injected at each transformer layer, handling retrieval, text-matching, classification, and clustering independently. Users select the appropriate adapter at … https://x.com/JinaAI_/status/2024505349181755760
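A toy illustration of the pooling distinction mentioned above, not Jina’s actual code: with a causal decoder backbone, only the final position has attended to the whole sequence, which is the usual motivation for last-token pooling over mean pooling.

```python
def mean_pooling(hidden: list[list[float]]) -> list[float]:
    """Average token hidden states (the common encoder-model recipe)."""
    dim = len(hidden[0])
    return [sum(tok[d] for tok in hidden) / len(hidden) for d in range(dim)]

def last_token_pooling(hidden: list[list[float]]) -> list[float]:
    """Take the final token's hidden state. In a causal decoder, this is the
    only position whose representation has seen every preceding token."""
    return hidden[-1]
```

Hidden states are modeled here as plain `[seq_len][dim]` lists to keep the sketch dependency-free; real implementations pool over framework tensors with attention-mask handling.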
Voxtral Realtime paper is out! The model is released under the Apache 2 license and achieves state-of-the-art transcription performance at sub-500ms latency. https://x.com/GuillaumeLample/status/2024445949733384638
We just shipped Baseline Experiments 🚀 You can now pin any experiment as your baseline in LangSmith. This allows you track performance deltas, anchor your results, and quickly identify improvements or regressions in an experiment list. Docs: https://x.com/LangChain/status/2024208662936650152
What is On-Policy Self-Distillation (OPSD)? Now models are powerful enough to self-critique, comparing their own reasoning against a privileged, better version of themselves. This is a process of self-distillation, and OPSD performs it this way. One model plays two roles: … https://x.com/TheTuringPost/status/2022608611340677330
What people miss about RLMs and what makes the idea beautiful to me is that it is a harness that can implement pretty much any other model harness or workflow emergently. https://x.com/HammadTime/status/2024694115372499026
WO2025117006 SIMULATION OF A USER OF A SOCIAL NETWORKING SYSTEM USING A LANGUAGE MODEL https://patentscope.wipo.int/search/en/WO2025117006
Writeup on Rubric-Based RL is out now: https://t.co/io6zEeeEAZ Covers 15+ papers, the path from LLM-as-a-Judge to rubrics, and how we can use rubrics to extend RLVR beyond verifiable domains (with tons of tips / tricks from recent research). Hope it’s helpful! https://x.com/cwolferesearch/status/2023408158065188894
you guys have not felt the agi until you have vibe designed your 6000 person conference website at the climbing gym in between projects without reading a single line of code including 99% video asset performance optimization because why the heck not its 2026 https://x.com/swyx/status/2021498862012334274
Skills are literally just markdown files how the hell can they have downtime??? https://x.com/theo/status/2024785367896072599
Two views on AGI from @natolambert – What we have is AGI – A drop-in replacement for a remote worker Open models can contribute to this. They expand access, reduce power concentration, and keep the path to different forms of AGI open and transparent. Without open models, … https://x.com/TheTuringPost/status/2023375354740809823
Why I don’t think AGI is imminent https://dlants.me/agi-not-imminent.html
lol what: Researchers found that repeating the exact same prompt twice dramatically improves LLM performance (one model improved from 21% to 97% accuracy on a name-search task) without longer outputs, slower responses, fine-tuning, or fancy prompt engineering. Because models … https://x.com/kimmonismus/status/2024069380162936992
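The trick itself is trivially mechanical, which is part of why the result is surprising. `repeated_prompt` is a hypothetical helper sketching it; the proposed explanation (later copies of the context can attend back to earlier copies) comes from the discussion around the result, not from this code.

```python
def repeated_prompt(prompt: str, n: int = 2, sep: str = "\n\n") -> str:
    """Duplicate the full prompt verbatim n times before sending it to the
    model. With causal attention, tokens in the second copy can attend back
    to the entire first copy, effectively giving a second reading pass."""
    return sep.join([prompt] * n)
```

Note the question-first caveat from the earlier entry in this list: when the question already leads the prompt, the reported gains from repetition largely vanish.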
Looking Inside: a Maliciousness Classifier Based on the LLM’s Internals https://labs.zenity.io/p/looking-inside-a-maliciousness-classifier-based-on-the-llm-s-internals
Repeating Prompts https://daoudclarke.net/2026/02/19/repeating-prompt
MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE TL;DR: a video diffusion-based model that jointly reconstructs dense 3D geometry and scene motion from monocular video in a unified 4D latent space. https://x.com/Almorgand/status/2023815479534723172
17,000 tokens per second!! Read that again! The LLM is hard-wired directly into silicon. No HBM, no liquid cooling, just raw specialized hardware. 10x faster and 20x cheaper than a B200. The “waiting for the LLM to think” era is dead. Code generates at the speed of human thought. https://x.com/wildmindai/status/2024810128487096357
Multilingual data finally moves from “data collection” to “data curation”. ÜberWeb pushes the compute-performance frontier for multilingual data. We spell out all our learnings from this year-long effort at training multilingual models at the 20-trillion-token data scale. https://x.com/pratyushmaini/status/2024157352862376280
Our recent preprint on gluon amplitudes has sparked a lot of discussion, so I want to share the backstory — including how AI helped crack a problem that had stumped us for a year. I’ll also be giving a public lecture at Harvard this week. Details at the end. https://x.com/ALupsasca/status/2023402422320926762