Image created with gemini-3.1-flash-image-preview, with the prompt written by claude-sonnet-4-5. Image prompt: Photorealistic aerial view of a frozen winter bay at dusk where natural ice fractures form intricate circuit board patterns with geometric traces and nodes, warm sunset light glowing through thin ice cracks creating copper-gold pathways against deep blue shadows, floating ice chunks reveal layered technological geometry, National Geographic quality, 4K resolution, physically grounded natural phenomenon, bold sans-serif ‘Tech’ title overlaid.

MiniMax-M2.5 is a surprising new step in open coding models: the first model where I’ve been able to independently confirm that it’s better than the most recent Claude Sonnet. It showed up in our benchmarks below, and in my vibe checks it felt strong and diverse. https://x.com/gneubig/status/2021988250240598108

80.2% on SWE-Bench Verified and 76.3% on BrowseComp is quite impressive. Try @MiniMax_AI M2.5 on @Eigent_AI https://x.com/guohao_li/status/2021984827923476922

M2.5 runs at 100 tokens per second. That’s 3x faster than Opus. At $0.06/M blended with caching, you can run subagents in the CLI and just leave them going. Fast models exist. Cheap models exist. Both at SOTA performance is new. https://x.com/cline/status/2022034678065373693

A sane but extremely bull case on OpenClaw (Clawdbot) | Brandon Wang https://brandon.wang/2026/clawdbot

Apple’s iOS 26.4 Siri Update Runs Into Snags in Internal Testing; iOS 26.5, 27 – Bloomberg https://www.bloomberg.com/news/articles/2026-02-11/apple-s-ios-26-4-siri-update-runs-into-snags-in-internal-testing-ios-26-5-27

Meta AI prepares Avacado, Manus Agent, OpenClaw integration https://www.testingcatalog.com/meta-ai-redies-avacado-manus-agent-and-openclaw-integration/

So far “telling a satisfying and well-written medium-length story” has proved far harder for LLMs than mathematical proofs, music generation, research reports, code, and many other forms of work. The technical reasons are pretty clear, but they are supposed to be language models https://x.com/emollick/status/2020993610540605560

An updated Gemini 3 Deep Think is out today: 📈 Achieves SOTA on ARC-AGI-2, MMMU-Pro, and HLE. 🥇Gold-medal level on Physics & Chemistry Olympiads. It turns out the best way to solve hard problems is still to think about them. Read more: https://x.com/NoamShazeer/status/2021988459519652089

Gemini 3 Deep Think (2/26) Semi Private Eval – ARC-AGI-1: 96.0%, $7.17/task – ARC-AGI-2: 84.6%, $13.62/task. New ARC-AGI SOTA model from @GoogleDeepMind https://x.com/arcprize/status/2021985585066652039

Gemini 3 Deep Think scores 84.6% on ARC-AGI-2. https://x.com/scaling01/status/2021981766249328888

Sundar buried the real story in the cost data. Gemini 3 Deep Think went from 45.1% to 84.6% on ARC-AGI-2 in under 3 months. That’s an 88% improvement on a benchmark specifically designed to resist brute-force scaling. The number that matters: $13.62 per task. The previous Deep… https://x.com/aakashgupta/status/2022025020839801186

The new Gemini Deep Think is achieving some truly incredible numbers on ARC-AGI-2. We certified these scores in the past few days. https://x.com/fchollet/status/2021983310541729894

Thrilled to announce a big upgrade to Gemini 3 Deep Think that hits new records on the most rigorous benchmarks in maths, science & reasoning – including 84.6% on ARC-AGI-2, 48.4% on Humanity’s Last Exam without tools, and a 3455 Elo rating on Codeforces! https://x.com/demishassabis/status/2022053593910821164

Today, we updated Gemini 3 Deep Think to further accelerate modern science, research and engineering. With 84.6% on ARC-AGI-2 and a new standard on Humanity’s Last Exam, see how this specialized reasoning mode is advancing research & development 🧵↓ https://x.com/Google/status/2021982003818823944

We updated Gemini 3 Deep Think in @GeminiApp. Available for Ultra subscribers and slowly opening Gemini API access (fill out form below). – 48.4% without tools on Humanity’s Last Exam. – 84.6% on ARC-AGI-2, verified by the ARC Prize Foundation. – Elo of 3455 on Codeforces. … https://x.com/_philschmid/status/2021989093110927798

An updated & faster Gemini 3 Deep Think is taking off! 🚀 Our smartest mode to date!™️ PhD-level reasoning to the most rigorous STEM challenges (models’ gotta think harder). Gold medal-level results on Physics & Chemistry Olympiads. 🧪💻 Full details: https://x.com/OriolVinyalsML/status/2021982720860233992

Anupam Pathak, a Google R&D lead in Google’s Platforms and Devices division, tested Deep Think’s ability to speed up the design of physical components. It’s proving that deep reasoning can translate directly into faster, more efficient prototyping. https://x.com/Google/status/2022007994897379809

At Duke University, the Wang Lab used Deep Think to optimize crystal growth for new semiconductors. Deep Think designed a recipe to grow thin films larger than 100 μm — hitting a precision target that previous methods had struggled to hit. https://x.com/Google/status/2022007988823973977

Gemini 3 Deep Think: AI model update designed for science https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-deep-think/

nano-banana is Gemini‑2.5‑Flash‑Image, beating Flux Kontext by 170 Elo with SOTA Consistency, Editing, and Multi-Image Fusion | AINews https://news.smol.ai/issues/25-08-26-nano-banana

The upgraded Gemini 3 DeepThink is now live! 🚀 We’re already seeing engineers and researchers leverage it as a partner in their design and development processes. I love this example of Anupam Pathak using DeepThink to go from prompt to physical prototype – actually designing… https://x.com/tulseedoshi/status/2021997867305775324

We’ve updated Gemini 3 Deep Think to better tackle the complexity of real-world research, science, and engineering. ♊ 🚀 It achieves gold-medal standards on the written portions of the Physics and Chemistry Olympiads, building on gold-level performance at IMO and ICPC, and has… https://x.com/JeffDean/status/2021989820604539250

We’ve upgraded our specialized reasoning mode Gemini 3 Deep Think to help solve modern science, research, and engineering challenges – pushing the frontier of intelligence. 🧠 Watch how the Wang Lab at Duke University is using it to design new semiconductor materials. 🧵 https://x.com/GoogleDeepMind/status/2021981510400709092

What’s ahead for commercial experiences in 2026 https://blog.google/products/ads-commerce/digital-advertising-commerce-2026/

people sleep on last week’s open multimodal releases > GLM-OCR: SOTA OCR model > MiniCPM-o-4.5: Gemini 2.5-Flash-level omni model that runs on your phone > InternS1: efficient generalist VLM outperforming on science tasks. All allow commercial use freely 🔥 https://x.com/mervenoyann/status/2021233480957304913

This is batshit insane. Gemini 3 Deep Think just scored a 3455 on Codeforces, equivalent to the #8 best competitive programmer in the world. The previous best was 2727 (#175) from OpenAI o3. This is an absolutely superhuman result for AI and technology at large. https://x.com/deedydas/status/2022021396768133336?s=46

GLM-5: From Vibe Coding to Agentic Engineering https://simonwillison.net/2026/Feb/11/glm-5/

GLM-5: From Vibe Coding to Agentic Engineering https://z.ai/blog/glm-5

Introducing GLM-5: From Vibe Coding to Agentic Engineering. GLM-5 is built for complex systems engineering and long-horizon agentic tasks. Compared to GLM-4.5, it scales from 355B params (32B active) to 744B (40B active), with pre-training data growing from 23T to 28.5T tokens. https://x.com/Zai_org/status/2021638634739527773

GLM-5 was pre-trained on 28.5T tokens and uses DeepSeek Sparse Attention. https://x.com/scaling01/status/2021627498451370331

A glance at MiniMax 2.5 – are you ready? https://x.com/SkylerMiao7/status/2021578926884053084

Congrats @MiniMax_AI! 🎉 Free for 3 days on Qoder, it’s time to put M2.5 through some serious coding sessions! https://x.com/qoder_ai_ide/status/2021983111161213365

MiniMax just dropped M2.5 and it’s on par with Opus 4.6 while being 20x cheaper and 3x faster??? https://x.com/shydev69/status/2021989925143597123

// Automating Sub-Agent Creation for Agentic Orchestration // Multi-agent systems are powerful but inflexible. Building agentic systems today relies on static, predefined roles. For example, an agentic AI coder might have a coder agent, a searcher agent, a reviewer agent. https://x.com/dair_ai/status/2021215864557797608

Agent Labs: Welcome to GPT Wrapper Summer – by swyx (Shawn) https://www.latent.space/p/agent-labs

Agents @ Work: Dust.tt – Latent.Space https://www.latent.space/p/dust

Agents @ Work: Lindy.ai – Latent.Space https://www.latent.space/p/lindy

Before LangChain, teams stitched together a patchwork: a framework (or bespoke glue code), generic observability (logs/APM), a spreadsheet of prompts and test cases, and a deployment stack designed for stateless APIs. That approach fails for agents for a reason LangChain keeps… https://x.com/marvinvista/status/2021605778285814092

CooperBench update: we gave agents git. It didn’t cure the curse of coordination, but we found more interesting cases of miscoordination. We set up self-hosted git servers so agent pairs could actually see and share each other’s code. Cooperation improves marginally, but new… https://x.com/_Hao_Zhu/status/2021252996848550005

deepagents now supports BYO sandboxes, giving your agents the power to execute code in an isolated env. You can use our builtin integrations for @modal, @daytonaio, and @RunloopAI, or bring your own sandbox provider! docs: https://x.com/sydneyrunkle/status/2022025934774374503

Does the sandbox run your agent? Or does your agent run the sandbox? Sounds arcane. It’s not. Agent-in-Sandbox: fast to ship, but LLM-generated code has the same permissions as your whole agent. Sandbox-as-Tool: agent calls out to sandboxes for execution only. You can give… https://x.com/chriscorcoran/status/2021631151970865530
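
The sandbox-as-tool pattern is easy to sketch. Below is a minimal, illustrative Python version (not any vendor's actual API): the agent process keeps its own permissions and hands generated code to a throwaway subprocess that gets execution only, with a scratch directory, an isolated interpreter, and a timeout.

```python
import subprocess
import sys
import tempfile

def run_in_sandbox(code: str, timeout: float = 5.0) -> str:
    """Execute untrusted code in a separate process with its own scratch
    directory; the agent process never runs the code directly."""
    with tempfile.TemporaryDirectory() as scratch:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode, no user site-packages
            cwd=scratch, capture_output=True, text=True, timeout=timeout,
        )
    return proc.stdout.strip() if proc.returncode == 0 else f"error: {proc.stderr.strip()}"

# The agent exposes this function as a tool: generated code gets
# execution only, not the agent's credentials or filesystem.
result = run_in_sandbox("print(2 + 2)")  # -> "4"
```

A real deployment would swap the subprocess for a remote sandbox service, but the permission boundary is the same.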

Everyone is building “data agents” but nobody agrees on what that means. The term gets applied to everything from a simple SQL chatbot to a fully autonomous data scientist. This ambiguity makes it impossible for users and builders to reason about what a system can actually do. https://x.com/dair_ai/status/2021252863150924244

Expanding our long-running agents research preview · Cursor https://cursor.com/blog/long-running-agents

How Cognition Uses Devin to Build Devin – by Nader Dabit https://nader.substack.com/p/how-cognition-uses-devin-to-build

I hold this truth to be self-evident: putting the agent in a different container than the environment makes a lot more architectural sense. https://x.com/bernhardsson/status/2021527682534760709

I simply do not see how Open Claw and systems like it won’t completely disrupt virtual assistant businesses like Athena etc. It’s been an absolute game changer, allowing me to context switch like a madman without dropping a beat. VA doesn’t even do it justice. It’s like I have… https://x.com/bilawalsidhu/status/2019612006811095199

I think one of the most important questions in multi-agent AI right now is one almost nobody is asking: when you add more agents, are you actually getting collaboration, or are you just spending more compute? Collaboration and communication are huge bottlenecks for multi-agent… https://x.com/omarsar0/status/2021013257348419670

Long-running agents are now available at https://t.co/3PT8c7azU3 for Ultra, Teams, and Enterprise plans. With our new harness, agents can complete much larger tasks. https://x.com/cursor_ai/status/2022046178708492445

Minions: Stripe’s one-shot, end-to-end coding agents | Stripe Dot Dev Blog https://stripe.dev/blog/minions-stripes-one-shot-end-to-end-coding-agents

More and more agents need a workspace: a container to execute code and other processes. We see two different ways of setting this up: 1. Agent IN a sandbox 2. Sandbox as a tool. Wrote up the pros and cons of each! Ty to @nfcampos @RunloopAI @e2b @0thernet for their insights https://x.com/hwchase17/status/2021265779803521245

New course: A2A: The Agent2Agent Protocol, built with @googlecloudtech and @IBMResearch, and taught by Holt Skinner, @ivnardini, and Sandi Besen. Connecting agents built with different frameworks usually requires extensive custom integration. This short course teaches you A2A… https://x.com/AndrewYNg/status/2021985280102973931

OpenEnv in Practice: Evaluating Tool-Using Agents in Real-World Environments https://huggingface.co/blog/openenv-turing

Personal, Local, Private AI Agents: Soumith Chintala – YouTube https://www.youtube.com/watch?v=jMoAaZP_Kkw

Pi: The Minimal Agent Within OpenClaw | Armin Ronacher’s Thoughts and Writings https://lucumr.pocoo.org/2026/1/31/pi/

tl;dr Today, we’re announcing our new company @EntireHQ to build the next developer platform for agent-human collaboration. Open, scalable, independent, and backed by a $60M seed round. Plus, we are shipping Checkpoints to automatically capture agent context. In the last three… https://x.com/ashtom/status/2021255786966708280

Welcome to the team, @cognition. The dedicated AI coding agent company joins us as a Global Partner. Find out more: https://x.com/AstonMartinF1/status/2020845510345830653

Why Agentic AI Breaks Legacy Identity — and What Infrastructure Leaders Must Do Next | Teleport https://goteleport.com/why-agentic-ai-breaks-legacy-identity/

Don’t Build Agents, Build Skills Instead – Barry Zhang & Mahesh Murag, Anthropic – YouTube https://www.youtube.com/watch?v=CEvIs9y1uog&t=715s

Folks claim to set the state of the art on ARC-AGI-2 using an RLM, a deeply recursive one, to manage the long horizon. “Other agent harnesses keep everything in the model’s context window. We don’t. Agentica uses a stateful REPL to manage context. This is an RLM-style loop.” https://x.com/lateinteraction/status/2021994073675247816

🤖 From this week’s issue: A research article presenting Google’s evaluation of 180 agent configurations, revealing multi-agent systems boost parallelizable tasks by 81% but degrade sequential tasks by 70%. https://x.com/dl_weekly/status/2020935994787143726

Kimi Agent Swarm blog is here 🐝 https://t.co/XjPeoRVNxG Kimi can spawn a team of specialists to: – Scale output: multi-file generation (Word, Excel, PDFs, slides) – Scale research: parallel analysis of news from 2000-2025 – Scale creativity: a book in 20 writing styles https://x.com/Kimi_Moonshot/status/2021141949416362381

Kimi Agent Swarm: 100 Sub-Agents at Scale https://www.kimi.com/blog/agent-swarm

Launching mini-SWE-agent 2.0, the simplest coding agent. Near-SOTA performance, with the agent/model/environment only ~100 lines each. Powering benchmarks and RL training at NVIDIA, Anyscale, Stanford and many more! https://x.com/KLieret/status/2021606142699356215
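
The tweet doesn't include the code, but the shape of a ~100-line coding agent is roughly this: a loop where the model reads the transcript, proposes one shell command, and the environment executes it and appends the observation. A hedged Python sketch (not mini-SWE-agent's actual implementation), with a deterministic stub standing in for the model:

```python
import subprocess

def agent_loop(model, task: str, max_steps: int = 10) -> str:
    """Minimal agent: the model sees the transcript, emits one shell
    command per turn, and 'submit' ends the episode."""
    transcript = [f"Task: {task}"]
    for _ in range(max_steps):
        action = model(transcript)  # the model proposes the next command
        if action == "submit":
            break
        result = subprocess.run(action, shell=True, capture_output=True, text=True)
        transcript.append(f"$ {action}\n{result.stdout}{result.stderr}")
    return "\n".join(transcript)

# Stub "model": echo once, then submit (a real agent would call an LLM here).
actions = iter(["echo hello", "submit"])
log = agent_loop(lambda transcript: next(actions), "say hello")
```

The environment is just `subprocess.run`, which is what makes the agent, model, and environment each fit in about a hundred lines.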

The companies that succeed in the future are going to make very heavy use of AI. People will manage teams of agents to do very complex things. Today we are launching Frontier, a new platform to enable these companies. https://x.com/sama/status/2019441198734209374

What the hell happened with AGI timelines in 2025? | 80,000 Hours https://80000hours.org/podcast/episodes/agi-timelines-in-2025/

Announcing our $10M seed round and pitch deck | Adapt https://adapt.com/blog/pitch-deck

For those who don’t know: “babushkin” – in Russian – means “grandma’s.” Grandma’s Ventures in AI – feels so cosy! So “come in, don’t stand in the cold,” so quietly judgmental about your benchmark charts, but still loving unconditionally. https://x.com/TheTuringPost/status/2019541218355790191

Lots of folks spread false narratives about how ARC-1 was created in response to LLMs, or how ARC-2 was only created because ARC-1 was saturated. Setting the record straight: 1. ARC-1 was designed 2017-2019 and released in 2019 (pre-LLMs). 2. The coming of ARC-2 was announced… https://x.com/fchollet/status/2022036543582638517

Our ability to measure AI has been outpaced by our ability to develop it, and this evaluation gap is one of the most important problems in AI. Today we’re launching Open Benchmarks Grants — a $3M commitment to fund open benchmarks for frontier AI and close the evaluation gap. https://x.com/vincentsunnchen/status/2021663737716125781

Public benchmarks lag behind what frontier labs are using internally to test and develop LLMs, yet they are the key driver of progress for LLMs. This needs to change! Excited to work with @SnorkelAI who are investing $3M to build out the evaluation ecosystem with the community. https://x.com/lvwerra/status/2021671530108006705

StepFun-Flash-3.5 is now the #1 model on MathArena 🧮🏆 Fast enough to think. Reliable enough to reason. More updates coming soon. We are so back. 🚀 MathArena: https://t.co/b09fJSVecL OpenRouter: https://t.co/ZIaNfkCu7j Website: https://t.co/HcGbiBN8po Blog: https://x.com/CyouSakura/status/2021511358626554322

They don’t say it in the top-level post, but this is a recursive language model getting SOTA on ARC-AGI-2. https://x.com/deepfates/status/2021991526856110252

$3M to support the development of open benchmarks! https://x.com/percyliang/status/2021701152333877681

Short post about Engram, a recent paper by DeepSeek: It is essentially very similar to SCONE (link below), where authors train embeddings for a large number of n-grams (e.g. 1B common n-grams like “Alexander the Great”). [1/2] https://x.com/gabriberton/status/2020612533502222459

AI needs better evaluations. Today we’re announcing Arena’s Academic Partnerships Program to fund independent academic research in AI evaluation and measurement. ▫️Up to $50K/project. Q1 Deadline: March 31, 2026. See more in thread for details and how to apply 👇 https://x.com/arena/status/2021268433619374336

1/ AxiomProver has solved Fel’s open conjecture on syzygies of numerical semigroups, autonomously generating a formal proof in Lean with zero human guidance. This is the first time an AI system has settled an unsolved research problem in theory-building math, and the proof self-verifies. https://x.com/axiommathai/status/2019449659807219884?s=20

If you are in any situation where being right matters, you would, at this point, be making a mistake to not ask a frontier LLM for help. That can mean checking your own work, second opinions on other experts, or getting help with a complex problem. Have judgement, but use them… https://x.com/emollick/status/2021052930410021335

Can just a 4B model solve IMO-level proof problems at the level of much stronger LLMs like Gemini 3 Pro? Yes, if you can train the LLM to scale test-time compute well! We’re very excited to release our 4B model “QED-Nano”, built via an awesome open collab! Details below 🧵⬇️ https://x.com/aviral_kumar2/status/2022057927368995097

Early testers of Gemini 3 Deep Think are already seeing results. We partnered with researchers to explore how this model could tackle rigorous, real-world applications — from spotting hidden flaws in research papers to optimizing semiconductor growth. Here’s how early testers… https://x.com/Google/status/2022007977419415958

If you’re an Ultra subscriber, you can try the latest in the Gemini App, but we’re also making Deep Think available for the first time in the Gemini API! Request early access here: https://x.com/tulseedoshi/status/2021997870858350640

@GeminiApp Do people realize how crazy that thing is?? https://x.com/LexnLin/status/2021986194780041394

Codeforces result is “no tools”? So Gemini 3.0 Deep Think cannot write test cases to test its solution before submission? I guess even the top-1 human can’t get 3455 under this condition. https://x.com/YouJiacheng/status/2021985843074994534

Gemini 3 Deep Think benchmarks look amazing! On Codeforces, it scored 3,455 Elo. Apparently, only 7 humans in the world have a higher coding Elo score! A friend just sent me an output about a cancer mechanism that was so great that I am now resubscribing to Ultra for DT access! https://x.com/DeryaTR_/status/2022030594037989493

Gemini 3 Deep Think can help make things. 🧠 Here’s our side project: We sketched a laptop stand and Deep Think coded that into an interactive prototyping tool. We used that tool to generate a STL file, which we sent to @fleet_ai. And now I have a new laptop stand! What will… https://x.com/joshwoodward/status/2022001967795777996

Gemini 3 Deep Think is available now in the @GeminiApp for Google AI Ultra subscribers and via the Gemini API to select researchers, engineers and enterprises through our early access program. Learn more ↓ https://x.com/Google/status/2021982018679312829

Gemini 3 Deep Think is getting a significant upgrade. We’ve refined Deep Think in close partnership with scientists and researchers to tackle tough, real-world challenges. And it’s pushing the frontier across the most challenging benchmarks, achieving an unprecedented 84.6% on… https://x.com/sundarpichai/status/2022002445027873257

Gemini 3 Deep Think now excels across scientific domains like chemistry and physics — achieving gold medal-level results on the written sections of the 2025 International Physics and Chemistry Olympiads. https://x.com/Google/status/2021982010739503138

Parsing PDFs at scale with LLMs is cost-prohibitive. Newer models (e.g. Gemini 3) are good at reading PDFs, but you burn unnecessary vision tokens even when the page is text-heavy. We’ve built a “cost-optimizer” into LlamaParse that will dynamically route pages to… https://x.com/jerryjliu0/status/2021267495123140760
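
LlamaParse's optimizer itself isn't public in this excerpt, but the routing idea is simple to illustrate: if a page already has a usable extracted-text layer, a cheap text model can handle it and vision tokens are spent only on scans. A toy sketch where the threshold and route names are invented for illustration:

```python
def route_page(page_text: str, min_chars: int = 200) -> str:
    """Send pages with a usable text layer to a cheap text model; fall
    back to a vision model only for scans / image-heavy pages."""
    return "text-model" if len(page_text.strip()) >= min_chars else "vision-model"

pages = [
    "A" * 500,  # text-heavy page: the extracted text layer is plenty
    "",         # scanned page: no text layer, needs vision tokens
]
routes = [route_page(p) for p in pages]  # ["text-model", "vision-model"]
```

A production router would also look at images, tables, and layout confidence per page, but the cost saving comes from this one branch.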

The upgraded Deep Think mode is rolling out now in the @GeminiApp for Google AI Ultra subscribers. For scientific researchers and developers, we’re opening a Vertex AI Early Access Program for the API. Start discovering → https://x.com/GoogleDeepMind/status/2021981517791342807

There are only 7 people on the planet who can beat Gemini 3 Deep Think in coding competitions. It has an Elo of 3455. A bit over a year ago the best systems were at 2727 (o3-preview). https://x.com/scaling01/status/2021983388442509478

Today, we’re releasing a significant upgrade to our specialized reasoning mode, Gemini 3 Deep Think. Deep Think is built to drive practical applications, enabling researchers to interpret complex data and engineers to model physical systems through code. With the updated Deep… https://x.com/GeminiApp/status/2021985731577852282

Opus 4.6 dethroned GPT-5.2-xhigh on WeirdML and is now in clear first place! Opus finds much shorter (so presumably simpler and more elegant) solutions to the problems. But code execution times went up. So maybe the difference in code length is due to optimizations? Would love… https://x.com/scaling01/status/2020847174909665712

Opus 4.6, Codex 5.3, and the post-benchmark era https://www.interconnects.ai/p/opus-46-vs-codex-53

🤖 From this week’s issue: Official blog post announcing Qwen3-Coder-Next, an 80B-parameter coding model achieving competitive performance on SWE-Bench (70.6% on Verified) while enabling 10x higher throughput for repository-level agentic workflows. https://x.com/dl_weekly/status/2021690941879250945

A training-free framework that guides robot behavior in real time. [📍 Project, paper & videos below 👇] VLS runs uncut, steering pretrained policies across long-horizon tasks. Most robots don’t fail because they lack skill. They fail because their behavior isn’t aligned with… https://x.com/IlirAliu_/status/2019696630283260003

«Head Parallel achieves O(1) communication volume regardless of the number of activated experts, perfectly balanced traffic across GPUs, and deterministic communication patterns.» Pretty crazy. This kind of work makes models with >1000 experts and extreme sparsity inevitable. https://x.com/teortaxesTex/status/2020767825715929332

🔦 A methodology to evaluate LLMs on genuine research-level mathematics @Stanford, @UTAustin, @Harvard and others released 10 unpublished math questions for AI systems – each solvable with short proofs unknown online. They span: – algebra – topology – analysis – numerical… https://x.com/TheTuringPost/status/2021198248728502354

1/5 Go Big or Go OOM: The Art of Scaling vLLM 🎯. We doubled throughput and cut latency in half on the same GPUs with just a better vLLM config, then added smart autoscaling to handle traffic bursts. Here’s what we learned optimizing LLM-as-a-Judge for GRPO training. 🧵 https://x.com/AI21Labs/status/2020787359285944746

5. Results: With smart IO-aware optimizations, Multi-Head LatentMoE trains: • 1.61× faster than standard MoE • Same model quality • Up to 4× less inter-GPU communication (k=4) With finer-grained experts it achieves higher overall accuracy, being still 1.11× faster than… https://x.com/TheTuringPost/status/2020884105886593325

A new design: Multi-Head LatentMoE + Head Parallelism (HP) strategy ➡️ Each token is split into smaller pieces (heads) and those are evenly sent to GPUs before routing. Then, each GPU does routing + expert work locally. Why does this new MoE type help? • It’s up to 1.61×… https://x.com/TheTuringPost/status/2020884031630610484
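
The head-parallel dispatch described in the thread can be sketched in a few lines. The point is that GPU g always receives head g of every token, so communication volume per token is fixed no matter which experts the router later picks. A toy illustration of just the split-and-dispatch step (no latent projections or expert FFNs, and the names are mine, not the paper's):

```python
def split_heads(token: list[float], n_heads: int) -> list[list[float]]:
    """Split one token's hidden vector into equal head slices."""
    d = len(token) // n_heads
    return [token[i * d:(i + 1) * d] for i in range(n_heads)]

def dispatch(tokens: list[list[float]], n_heads: int) -> list[list[list[float]]]:
    """Head parallelism: GPU g always receives head g of every token, so
    traffic is perfectly balanced and independent of expert choice;
    routing and expert work then happen locally on each GPU."""
    return [[split_heads(t, n_heads)[g] for t in tokens] for g in range(n_heads)]

tokens = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
per_gpu = dispatch(tokens, n_heads=2)
# GPU 0 holds head 0 of both tokens, GPU 1 holds head 1 of both.
```

Contrast with standard expert parallelism, where a token's destination depends on the router's top-k choice, so traffic is data-dependent and can be badly imbalanced.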

A new paper from CDS’ @ylecun and Brown’s @randall_balestr introduces LeJEPA, a simpler way to train AI systems without labels. The method drops many common training tricks, scales efficiently, and still performs well on benchmarks like ImageNet. https://x.com/NYUDataScience/status/2021983784577745065

A Social Filesystem — overreacted https://overreacted.io/a-social-filesystem/#a-record

DFlash: Block Diffusion for Flash Speculative Decoding – Z Lab https://z-lab.ai/projects/dflash/

ERNIE 5.0 tech report is out https://t.co/WCeiJ27gDy Although it wasn’t a very good model, there still might be some interesting stuff in here. https://x.com/scaling01/status/2020863398162972822

Experts Have World Models. LLMs Have Word Models. https://www.latent.space/p/adversarial-reasoning

Great writeup from @AI21Labs on scaling vLLM for high-throughput, bursty workloads. TL;DR: systematic config tuning + queue-based autoscaling = 2x throughput from the same GPUs. 🚀 Useful for anyone running vLLM in production with variable traffic patterns. Thanks to the… https://x.com/vllm_project/status/2021196826058338321

here is how sparsity evolved in recent large open MoEs. there are two ways to think about sparsity: > expert sparsity: number of selected experts (top-k + shared expert) / total number of experts > parameter sparsity: active parameters / total parameters. models included: … https://x.com/eliebakouch/status/2020956220694171718
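
Both notions are one-line ratios. As a worked example, plugging in the GLM-5 figures quoted earlier in this issue (744B total parameters, 40B active); the top-k/shared numbers below are purely illustrative:

```python
def expert_sparsity(top_k: int, shared: int, total_experts: int) -> float:
    """Fraction of experts a token activates: (selected + shared) / total."""
    return (top_k + shared) / total_experts

def parameter_sparsity(active_params: float, total_params: float) -> float:
    """Fraction of parameters actually used per token."""
    return active_params / total_params

# GLM-5 figures quoted earlier in this issue: 744B total, 40B active,
# so roughly 5.4% of the weights run for each token.
glm5_param_sparsity = parameter_sparsity(40e9, 744e9)

# Illustrative expert-level example: top-8 routing plus 1 shared expert
# out of 256 experts activates 9/256 of the experts per token.
example_expert_sparsity = expert_sparsity(top_k=8, shared=1, total_experts=256)
```

The two ratios differ because experts are not the whole model: attention, embeddings, and any shared expert are always active, which is why parameter sparsity is the number that predicts serving cost.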

How Rob Pike got spammed with an AI slop “act of kindness” https://simonwillison.net/2025/Dec/26/slop-acts-of-kindness/

I finally had a chance to read this paper. I am now convinced that Recursive Language Models (RLMs) are going to be the next big thing in AI advances! Attention is shifting toward very large context windows. Very impressive paper! Congrats to Alex, who is a new PhD student at MIT. https://x.com/DeryaTR_/status/2020978003963244838

i’m publishing a new blog post on this insanely useful feature of triton: it is what makes the custom triton NVFP4 quant kernel go hand-in-hand with, or beat, CUDA. many people may not be aware of it, so go read! https://x.com/maharshii/status/2021266717641474194

iGRPO: Self-Feedback-Driven LLM Reasoning. “In Stage 1, iGRPO samples multiple exploratory drafts and selects the highest-reward draft using the same scalar reward signal used for optimization. In Stage 2, it appends this best draft to the original prompt and applies a…” https://x.com/iScienceLuvr/status/2021160967774634071
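
The quote is cut off, but the two stages it does describe are easy to sketch: sample several drafts, keep the one the scalar reward prefers, then generate again conditioned on the prompt plus the best draft. A toy Python version with deterministic stubs in place of the LLM and reward model; the actual GRPO policy update, and everything after "applies a…", is omitted:

```python
def igrpo_step(generate, reward, prompt: str, n_drafts: int = 4) -> str:
    """Two-stage sketch: sample drafts, keep the highest-reward one,
    then condition a second generation on prompt + best draft."""
    drafts = [generate(prompt) for _ in range(n_drafts)]
    best = max(drafts, key=reward)  # Stage 1: select by scalar reward
    return generate(f"{prompt}\n\nDraft to refine:\n{best}")  # Stage 2: refine

# Deterministic stubs standing in for an LLM and a reward model.
outputs = iter(["draft-a", "draft-b", "draft-c", "draft-d", "refined"])
answer = igrpo_step(lambda p: next(outputs), lambda d: len(d), "solve x")
```

The interesting property is that both stages reuse the same scalar reward signal, so no separate critique model is needed for the self-feedback.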

In order to effectively evaluate AI traces across thousands of SQL questions, I had to build a test harness where I could quickly evaluate correctness visually. https://x.com/matsonj/status/2020630608029036764

Inference is the New Sales & Marketing Spend | SaaStr https://www.saastr.com/inference-is-the-new-sales-marketing-spend/

Inspired by this tweet, I tested the model using ‘Expedition 33,’ which is my go-to topic for long-context testing. I have to agree – this model is impressive. It demonstrated a high level of comprehension, especially considering it didn’t spend much time reasoning. The… https://x.com/Hangsiin/status/2021599414457188666

It’s actually a super interesting insight into ultra-scale LLM training, even though they’re withholding a lot of stuff, and they are clearly inept at post-training. This base model had tremendous potential, just like Burkina-Faso. https://x.com/teortaxesTex/status/2020867552356778427

John Carmack muses using a long fiber line as an L2 cache for streaming AI data — programmer imagines fiber as alternative to DRAM | Tom’s Hardware https://www.tomshardware.com/pc-components/ram/john-carmack-muses-using-a-long-fiber-line-as-as-an-l2-cache-for-streaming-ai-data-programmer-imagines-fiber-as-alternative-to-dram

Learning to Self-Verify Makes Language Models Better Reasoners. “learning to self-verify alone can significantly improve generation performance” “Learning to self-verify requires significantly fewer tokens to solve the same problems.” https://x.com/iScienceLuvr/status/2021164018132505081

Legibility Explained (by People Who Don’t Hate Legibility) — with Steve Krouse and Henrik Karlsson – YouTube https://www.youtube.com/watch?v=96S_64ipHOA

LLMs could be, but shouldn’t be compilers https://alperenkeles.com/posts/llms-could-be-but-shouldnt-be-compilers/

LLMs tripled new book releases since 2022. Average quality fell: most new entries are, indeed, slop BUT books 100-1,000 per category are actually better than before, & pre-LLM authors got more productive. And since people only read the good books, it is net positive for readers.”” https://x.com/emollick/status/2021287459016053083

LoRA but with Only 13 Parameters?? – by Benjamin Marie https://kaitchup.substack.com/p/lora-but-with-only-13-parameters

Lots of good threads on RLMs the last couple days. Have a blog post on the potential of RLMs, in plain language, I’ll try to get out tomorrow. In short:”” https://x.com/dbreunig/status/2020723909491114294

Magic Tricks, Moats, and the Three-Body Problem of AI Networks https://www.caseyaccidental.com/p/magic-tricks-moats-and-the-three

Mooncake Joins PyTorch Ecosystem – PyTorch https://pytorch.org/blog/mooncake-joins-pytorch-ecosystem/

Must-read AI research of the week: ▪️ Golden Goose: Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text ▪️ Reinforced Attention Learning ▪️ Rethinking the Trust Region in LLM Reinforcement Learning ▪️ GRP-Obliteration: Unaligning LLMs With a Single Unlabeled Prompt”” https://x.com/TheTuringPost/status/2021176608506499127

My entire net worth is in third order Rapture derivatives. If the chance of “the chance of “the chance of the Rapture exceeds 5%” exceeds 5%” exceeds 5%, i lose my house”” https://x.com/it_is_fareed/status/2021281774819496154

Next Concept Prediction in Discrete Latent Space Leads to Stronger Language Models “”Our model, ConceptLM, quantizes hidden states using Vector Quantization and constructs a concept vocabulary.”” “”Results on 13 benchmarks show that NCP yields consistent performance gains over”” https://x.com/iScienceLuvr/status/2021161792110559311

Oh, MoE should definitely die. Like what do you mean “”mixture of experts””. No experts, just tiny FFNs with a non-differentiable top-k between them, all day they collapse, eat hot chip and hallucinate. No, we want a unified latent space and flexible conditional computation. Alas…”” https://x.com/teortaxesTex/status/2020915555151040829

On DeepWiki and increasing malleability of software. This starts as partially a post on appreciation to DeepWiki, which I routinely find very useful and I think more people would find useful to know about. I went through a few iterations of use: Their first feature was that it”” https://x.com/karpathy/status/2021633574089416993

Optimal Timing for Superintelligence https://nickbostrom.com/optimal.pdf

Our new paper "Deriving neural scaling laws from the statistics of natural language" https://t.co/I40lAO5Il1 led by @Fraccagnetta & @AllanRaventos w/ Matthieu Wyart makes a breakthrough! We can predict data-limited neural scaling law exponents from first principles using the"" https://x.com/SuryaGanguli/status/2021291213639516184

queue.json on object storage is all you need to build a reliable distributed job queue → FIFO execution → at-least-once delivery → 10x lower tail latencies”” https://x.com/turbopuffer/status/2022014743322800384
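The "queue.json" idea above can be sketched in a few lines. This is my own toy illustration, not turbopuffer's implementation: the entire queue is one JSON document rewritten atomically on each claim, with a local file standing in for the object store (a real version would use conditional PUTs with ETag/If-Match for the compare-and-swap). Function and field names here are illustrative.

```python
import json
import time
import uuid
from pathlib import Path

# The whole queue lives in one JSON document. A local file stands in for
# object storage; leases make delivery at-least-once rather than at-most-once.
QUEUE = Path("queue.json")

def init_queue(jobs):
    QUEUE.write_text(json.dumps({"pending": list(jobs), "in_flight": {}}))

def claim_job(worker_id, lease_s=30.0):
    """Pop the head of the FIFO and record an in-flight lease. At-least-once:
    if a worker dies, its lease expires and the job becomes claimable again."""
    state = json.loads(QUEUE.read_text())
    now = time.time()
    # Return expired leases to the front of the queue before claiming.
    for jid, lease in list(state["in_flight"].items()):
        if lease["expires"] < now:
            state["pending"].insert(0, lease["job"])
            del state["in_flight"][jid]
    if not state["pending"]:
        QUEUE.write_text(json.dumps(state))
        return None, None
    job = state["pending"].pop(0)  # FIFO: strict head-of-line order
    jid = str(uuid.uuid4())
    state["in_flight"][jid] = {"job": job, "worker": worker_id,
                               "expires": now + lease_s}
    QUEUE.write_text(json.dumps(state))
    return jid, job

def ack_job(jid):
    state = json.loads(QUEUE.read_text())
    state["in_flight"].pop(jid, None)  # done: drop the lease
    QUEUE.write_text(json.dumps(state))
```

With the state consolidated into one object, a single conditional write per claim is what buys the low tail latencies the tweet alludes to.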

Really huge: Isomorphic Labs unveiled its IsoDDE engine, claiming 2×+ accuracy gains over AlphaFold 3, 2.3× better antibody predictions, and binding-affinity results that beat physics-based gold standards, all at a fraction of the time and cost. The platform generalizes to truly”” https://x.com/kimmonismus/status/2021206410755707307

Submitted 6 apps to the iOS App Store last week. 4 got approved, and the remaining 2 I expect will be approved shortly. Next, it’s time to figure out distribution. So I’ve generated 6 AI influencers promoting one app. It’s pretty crazy how real these look and sound!”” https://x.com/philo01/status/2008880081456996510?s=46

TBD, but this might be the first time I see plausible evidence that nested (depth >= 2) recursion can be useful for RLMs in practice. I chuckled at the authors referring to "traditional" RLMs though, since those are only a few months old. Congrats folks!"" https://x.com/lateinteraction/status/2021995467564020095

Tencent HY Research https://hy.tencent.com/research/100025?langVersion=en

The core unlock is having 2 pools for context, token space & programmatic space, & giving the LLM the ability to move context to token space as it sees fit. This turns long context problems into tasks that can be solved with coding, which LLMs are very good at.”” https://x.com/dbreunig/status/2020723910724174283
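The two-pool idea above can be made concrete with a toy sketch (my illustration, not dbreunig's implementation): the long document stays in "programmatic space" as a plain variable the model runs code against, and only small, relevant excerpts get promoted into "token space", the prompt the model actually reads. All class and method names are hypothetical.

```python
import re

class TwoPoolContext:
    """Toy two-pool context: a huge document in programmatic space,
    a small curated excerpt list in token space."""

    def __init__(self, document):
        self.doc = document      # programmatic space: never sent to the model
        self.token_space = []    # excerpts promoted into the prompt

    def grep(self, pattern, window=40):
        """Code the model can call instead of reading the whole document:
        return each regex match with `window` chars of surrounding context."""
        return [self.doc[max(0, m.start() - window): m.end() + window]
                for m in re.finditer(pattern, self.doc)]

    def promote(self, excerpt):
        """Move a snippet from programmatic space into token space."""
        self.token_space.append(excerpt)

    def prompt(self, question):
        """Only the promoted excerpts ever reach the model's context window."""
        context = "\n---\n".join(self.token_space)
        return f"Context:\n{context}\n\nQuestion: {question}"

# A needle-in-haystack long-context problem becomes search-then-promote:
doc = ("filler " * 10000) + "the launch code is 4721. " + ("filler " * 10000)
ctx = TwoPoolContext(doc)
for hit in ctx.grep(r"launch code is \d+"):
    ctx.promote(hit)
```

The haystack here is ~140k characters, but the prompt the model would see is a couple hundred: the long-context problem has been turned into a coding problem, which is the unlock the tweet describes.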

The gentle obsolescence – by Benn Stancil – benn.substack https://benn.substack.com/p/the-gentle-obsolescence

The Iso team has cooked something incredible: our new technical report unveils the latest results from our drug design engine, the IsoDDE, progressing far beyond AlphaFold 3. This breaks new ground compared to AF and other similar methods by a significant degree across all key”” https://x.com/maxjaderberg/status/2021170265242173677

The Limit in the Loop | Weaviate https://weaviate.io/blog/limit-in-the-loop

The LLM Context Tax: Best Tips for Tax Avoidance https://www.nicolasbustamante.com/p/the-llm-context-tax-best-tips-for

The many masks LLMs wear – by Kai Williams https://www.understandingai.org/p/the-many-masks-that-llms-wear

The more I read up, the more impressive the breakthrough Isomorphic labs has made here. Isomorphic Labs’ IsoDDE doesn’t just predict protein structures better than AlphaFold 3, it can find hidden binding pockets in seconds that used to take six months of lab work, and predict how”” https://x.com/kimmonismus/status/2021217873708917087

The Potential of RLMs https://www.dbreunig.com/2026/02/09/the-potential-of-rlms.html

This probably isn’t vanilla DSA; feels more like the promise of NSA, except I can’t believe they’re getting this performance with block attention. A mature, fully flexible form of their sparse approach. MODEL1 with adaptive top-k? Something better? so awesome"" https://x.com/teortaxesTex/status/2021547122223460495

Transformers made multimodal architectures trivial. Imagine implementing BLIP with a CNN and an LSTM, a nightmare! This is why transformers won in vision too. Not because of slightly better results"" https://x.com/gabriberton/status/2020595051609698764

Triton is cool for prototyping ideas, but what if you can beat hand-written CUDA kernels with just a sprinkle of inline elementwise assembly when needed? The next post on the @fal performance blog by @maharshii goes deep into this."" https://x.com/isidentical/status/2021264421163590085

Unwrap: AI-Powered Customer Intelligence https://www.unwrap.ai/

V4 Lite now live in the app. 1M context length. Text-only. Muon + mHC confirmed. Larger version is still on the way. @zephyr_z9 @teortaxesTex”” https://x.com/yifan_zhang_/status/2021574517089321284

Vending-Bench 2 | Andon Labs https://andonlabs.com/evals/vending-bench-2

WarpGrep: Fast, Parallel Code Retrieval with RL | Morph https://www.morphllm.com/blog/fast-context-rl-retrieval

We post-trained QED-Nano using RL with rubrics as rewards, along with a neat trick to enable efficient use of test-time compute. Today, we open source the model and will share the full training recipe and data very soon :)”” https://x.com/_lewtun/status/2022003877407818222

We’re excited to welcome Mooncake to the PyTorch Ecosystem! Mooncake is designed to solve the “memory wall” in LLM serving. By integrating Mooncake’s high performance KVCache transfer and storage capabilities with PyTorch native inference engines like SGLang, vLLM, and”” https://x.com/PyTorch/status/2022079425001504933

What if your model could learn from its own drafts during RL training? 🚀 🔥 New paper: iGRPO: Iterative Group Relative Policy Optimization We add a self-feedback loop to GRPO: the model drafts multiple solutions, picks its best one, then learns to refine beyond it. Core idea:”” https://x.com/ahatamiz1/status/2021116982029123874#m
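The mechanics behind GRPO-style training, which the iGRPO tweet builds on, are easy to sketch. This is a minimal illustration under my own assumptions, not the paper's code: GRPO scores a group of sampled completions and uses each reward's deviation from the group mean as its advantage, with no learned value function; the self-feedback twist then checks whether a refined draft beats the best original one. All function names are hypothetical.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantage: A_i = (r_i - mean(r)) / std(r).
    The group itself is the baseline, so no critic network is needed."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard the all-equal case
    return [(r - mu) / sigma for r in rewards]

def igrpo_step(draft_rewards, refined_reward):
    """Toy iGRPO-flavored step: score the drafts, then score the model's
    refinement against the same group baseline. A positive advantage beyond
    the best draft means the self-feedback loop actually paid off."""
    best = max(draft_rewards)
    advs = group_relative_advantages(draft_rewards + [refined_reward])
    return {
        "best_draft": best,
        "refined_advantage": advs[-1],
        "improved": refined_reward > best,
    }
```

The key property is that advantages are relative within the sampled group, so "refine beyond your best draft" translates directly into "earn a larger advantage than any sibling completion."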

Why haven’t SuperApps conquered the US Yet? | LinkedIn https://www.linkedin.com/pulse/why-havent-superapps-conquered-us-yet-cassandra-king-qsxge/

Why I’m excited about RLMs: they’re a simple, generally applicable test-time strategy with tons of low-hanging fruit for optimization. Great for dealing with long contexts today. Tomorrow? Could be so much more…”” https://x.com/dbreunig/status/2020994879078400408

wrote a custom CUDA kernel that uses 256-bit gmem loads starting from version 12.9 on blackwell and for smaller shapes it is indeed faster than triton until the shape (8192, 8192). it still seems to be slower on larger shapes, i bet there is room for more speed.”” https://x.com/maharshii/status/2021241686031008119

📄We just launched PDF uploads in Arena. Upload PDFs with your prompts to add richer context and test models on document reasoning, bringing evaluations closer to real-world use. ▪️Ask questions directly against documents ▪️Digest complex, technical content in minutes ▪️Extract”” https://x.com/arena/status/2021300537711526113

Good thread on how we need academic research to be fast + updated. Imagine if the contribution of the SWE-Bench paper was "AI can't do software engineering", and then the paper came out a year after the experiments were run."" https://x.com/gneubig/status/2021370741237694705

Intelligence too cheap to meter: This 10 minute clip took 8 hours to create and cost around $60. That’s fast and inexpensive for an excellent anime clip. Soon everyone will be a movie-director. And I think many still don’t understand what that means. We’ve crossed the"" https://x.com/kimmonismus/status/2021604639557464134

Your data is the real battlefield. And the fight to keep context with the user is worth having. Raffi Krikorian, CTO of @mozilla, explains why → #TP_interview”” https://x.com/TheTuringPost/status/2021101204059849039

Say hello to the new @GoogleAIStudio home page : ) We made it way easier to quickly get back to past chats, vibe coded apps, check project usage, and quickly start building with the new Omnibar. And this is just the start!”” https://x.com/OfficialLoganK/status/2021640117220520289

Leading Inference Providers Cut AI Costs by up to 10x With Open Source Models on NVIDIA Blackwell | NVIDIA Blog https://blogs.nvidia.com/blog/inference-open-source-models-blackwell-reduce-cost-per-token/

Today we share a technical report demonstrating how our drug design engine achieves a step-change in accuracy for predicting biomolecular structures, more than doubling the performance of AlphaFold 3 on key benchmarks and unlocking rational drug design even for examples it has”” https://x.com/IsomorphicLabs/status/2021162400494264517

Very excited to share QED-Nano: the smallest theorem proving model to date 🤏At just 4B parameters, it matches the performance of much larger models on the challenging IMO-ProofBench benchmark and operates entirely in natural language, with no reliance on Lean or external tools.”” https://x.com/_lewtun/status/2022003874500845813

Imho SeeDance looks the most natural, the most human. It’s the little things: the wine moving in the glass, the facial expressions, the details. SeeDance is forcing Google and OpenAI to quickly update their models to Sora 2.5 / Veo 3.2, thus boosting performance.”” https://x.com/kimmonismus/status/2021176568563785908

❤️ GLM-5 is on Ollama’s cloud! It’s free to start, and with higher limits available on the paid plans. ollama run glm-5:cloud It’s fast. You can connect it to Claude Code, Codex, OpenCode, OpenClaw via ollama launch! Claude: ollama launch claude --model glm-5:cloud"" https://x.com/ollama/status/2021667631405674845

🎉 The mysterious Pony Alpha is finally revealed, congrats to @Zai_org on releasing GLM-5! SGLang is ready to support on day-0. 🛠️ 744B params (40B active) model built for complex systems engineering & long-horizon agentic tasks 📚 28.5T tokens pretraining for a stronger”” https://x.com/lmsysorg/status/2021639499374375014

🔥Congrats to @Zai_org on launching GLM-5 — 744B parameters (40B active), trained on 28.5T tokens, integrating DeepSeek Sparse Attention to keep deployment cost manageable while preserving long-context capacity. vLLM has day-0 support for GLM-5-FP8 with: 📖 DeepSeek Sparse”” https://x.com/vllm_project/status/2021656482698387852

🚀 Zhipu AI GLM-5: A Real Step Into the Top Tier? Zhihu contributor toyama nao offers a concise verdict: "A hard road upward — the stairway to godhood." 🔮From recovery to contention Over the past six months (4.5 → 5.0), Zhipu has climbed back into China’s first tier and now"" https://x.com/ZhihuFrontier/status/2022161058321047681

GLM-5 by @Zai_org is now the #1 open model in Code Arena, tied with Kimi-K2.5-Thinking! Overall #6 on par with Gemini-3-pro, 100+pts below Claude-Opus-4.6 in agentic webdev tasks. Congrats to the @Zai_org GLM team on the new milestone! 👏”” https://x.com/arena/status/2021996281141629219

GLM-5 from @Zai_org just climbed to #1 among open models in Text Arena! ▫️#1 open model on par with claude-sonnet-4.5 & gpt-5.1-high ▫️#11 overall; scoring 1452, +11pts over GLM-4.7 Test it out in the Code Arena and keep voting, we’ll see how GLM-5 performs for agentic coding”” https://x.com/arena/status/2021725350481526904

GLM-5 is coming to Coding Plan Pro users within one week, and we’re working to bring it to everyone after that. To be upfront: compute is very tight. Even before the GLM-5 launch, we were pushing every chip to its limit just to serve inference. We appreciate your understanding”” https://x.com/Zai_org/status/2021656633320018365

GLM-5 is now on AI Gateway. Better long-range planning, multiple thinking modes, and improved multi-step agent tasks versus previous https://t.co/Yqx8kVZ3i8 models. Use model: 'zai/glm-5' to get started."" https://x.com/vercel_dev/status/2021655129347539117

GLM-5 is the new leading open weights model! GLM-5 leads the Artificial Analysis Intelligence Index amongst open weights models and makes large gains over GLM-4.7 in GDPval-AA, our agentic benchmark focused on economically valuable work tasks GLM-5 is @Zai_org’s first new”” https://x.com/ArtificialAnlys/status/2021678229418066004

GLM-5 is ZAI’s new flagship. 744B params (40B active), trained on 28.5T tokens, and built for complex systems engineering and long-horizon agentic tasks. Two things worth paying attention to: 1. They integrated DeepSeek Sparse Attention to cut deployment costs while keeping”” https://x.com/cline/status/2021999167875555694

GLM-5 just launched — now available in Qoder. On Qoder Bench — our benchmark for real-world software engineering tasks — GLM-5 outperforms Sonnet 4.5 and approaches Opus 4.5. At a fraction of the cost. High demand expected — brief waits possible during peak hours. Scaling in”” https://x.com/qoder_ai_ide/status/2021639227814092802

GLM-5, the latest frontier open model from @Zai_org, is available now on Modal. We partnered with https://t.co/nhqgwNEWkB to release an endpoint that will be free for a limited time.”” https://x.com/modal/status/2021645783733616800

Pony Alpha Stealth model reveal: GLM-5 from @Zai_org GLM-5 is a new 744B foundation model for coding and agentic usecases. It achieves SOTA scores on top agent benchmarks, and has been used successfully in many agent flows during its Stealth period. Live now on OpenRouter!”” https://x.com/OpenRouter/status/2021639702789730631

Average Throughput of GLM-5 on Openrouter is 14 tps”” https://x.com/scaling01/status/2021981416452764058

Build more. Spend less. GLM-5 is now on YouWare. Landing pages, portfolios, prototypes. All handled fast, with a 200K context window. Save your premium credits for the big builds.”” https://x.com/YouWareAI/status/2021982784948936874

Congrats @Zai_org on GLM-5! Love the permissive MIT license (vs K2.5’s modified MIT). Haven’t chatted with it yet so no vibes, but from the numbers I’m not compelled to switch from @Kimi_Moonshot K2.5: • Similar evals, but GLM-5’s are at bf16 while K2.5’s are at int4 – GLM-5"" https://x.com/QuixiAI/status/2021651135615184988

Day-0 with @Zai_org: GLM-5 is live on DeepInfra 🔥 Built for long-horizon agents that plan, orchestrate, and self-correct. Serving ~100 TPS at launch and as usual the best price on the market!”” https://x.com/DeepInfra/status/2021666854088110318

GLM 5 is 2x the total parameter of GLM 4.5 + deepseek sparse attention for efficient long context this is going to be a crazy model”” https://x.com/eliebakouch/status/2020824645868630065

"GLM MoE DSA" is landing in transformers 👀"" https://x.com/xeophon/status/2020815776890909052

GLM-4.7-Flash-GGUF is now the most downloaded model on @UnslothAI.”” https://x.com/Zai_org/status/2021207517557051627

GLM-5 already available on OpenRouter (with even lower prices)”” https://x.com/scaling01/status/2021637257103651040

GLM-5 has a 200k context length and maximum output of 128k”” https://x.com/scaling01/status/2021628691357298928

GLM-5 is massive. 745B params. LETS FUCKING GOOOOO This should be fun!”” https://x.com/scaling01/status/2020840989947298156

GLM-5 Pricing $1 and $3.2 Output There is also a GLM-5 Code variant that is more expensive👀 almost 8 times cheaper than Opus”” https://x.com/scaling01/status/2021628971939418522

GLM-5 runs with mlx-lm on a single 512GB M3 Ultra in Q4. It’s quite good in my initial testing and pretty fast as well. It generated a highly functional space invaders game using 7.1k tokens at 15.4 tok/s and 419GB memory. Thanks to @ActuallyIsaak and @kernelpool for the port.”” https://x.com/awnihannun/status/2022007608811696158

https://t.co/ctlyPtiB3j GLM-5 architecture is out: ~740B parameters ~50B active 78 layers, MLA attention lifted from DeepSeek V3, plus DeepSeek V3.2’s sparse attention indexer for 200k context. Basically DeepSeek V3 scale with DSA bolted on.”” https://x.com/QuixiAI/status/2021111352895393960

GLM-5 is out on @huggingface 🔥 > A40B/744B, trained on more tokens (28.5T) > outperforms/on par with closed sota > allows commercial use (MIT licensed) 💗 use with vLLM/SGLang locally or through HF Inference Providers thanks to @novita_labs and @Zai_org 📦”” https://x.com/mervenoyann/status/2021642658188538348

DeepSeek V4-lite, Minimax 2.5, GLM-5 what a bloodbath will Qwen accelerate the release of 3.5?”” https://x.com/teortaxesTex/status/2021586965594857487


Discover more from Ethan B. Holland
