Image created with Flux Pro v1.1 Ultra. Image prompt: Tech, technical paper spread with neat schematic diagrams rendered using small bananas as components, instruments nearby, photorealistic, editorial, minimal, high detail, 3:2 landscape

3 days of grok-code-fast-1 in Cline: "what would have taken me weeks is only taking a couple hours", "feels 10x better and faster than Claude", "feels like an entirely different model than the sonic I was testing". The data? At or above Sonnet-4 in diff edits, and improving. https://x.com/cline/status/1961488289803939915

Grok Code Fast 1 is versatile across the full stack and is particularly strong at TypeScript, Python, Java, Rust, C++, and Go. Using Grok Code Fast 1, @DannyLimanseta built the following game in a day. https://x.com/xai/status/1961129796349423944

Grok Code Fast from @xai scored 90% on Roo Code evals — top-tier performance at half the cost of its peers. ⚡️ Free to try in Roo Code Cloud until Sept 10. See why speed + savings make @grok a strong new addition: https://x.com/roo_code/status/1962571908224110673

Grok Code just hit #1 on the OpenRouter leaderboard, beating Claude Sonnet https://x.com/elonmusk/status/1961677739762790630

Grok Code's lead has grown to 60% higher usage than Claude Sonnet https://x.com/elonmusk/status/1962265197462110473

grok-code-fast-1 has good vibes. Probably makes the best tradeoff on the speed/intelligence curve right now. GPT-5 is too spiky: sometimes it's surprisingly good, sometimes it overthinks something way too much, and you end up spending too much time waiting for some pedantic output. https://x.com/dzhng/status/1961905091960791194

Humbling to see Grok-Code-Fast-1 smash daily token records. The community response has been so incredible that we're extending our free promo until September 10th. 🧵 Here's how to get set up in your favorite code editors: https://x.com/veggie_eric/status/1961877264599306573

I tried out @cline + @xai grok-code-fast-1 to assist me with my effort to port a large project (tinygrad) from Python to C. So far, I'd been using a combination of Claude Code + Claude 4.1 Sonnet/Opus and @roo_code + GPT-5 medium for this, with success (though with a lot of hand… https://x.com/QuixiAI/status/1962600301309108304

Interesting trend from the @xAI team that we haven't seen from other frontier model labs: this is the second round of free access to @grok models they've provisioned to Cline users in exchange for rich @cline usage data. Why is Cline data so valuable? It's a heavyweight workout… https://x.com/nickbaumann_/status/1961539461860487664

Some great quotes about Grok Code Fast 1 from our friends at Cline and opencode 🩵 https://x.com/veggie_eric/status/1961474457295622515

The improvement from `sonic` to `grok-code-fast-1` has been notable according to Cline users https://x.com/cline/status/1962628786366881795

We can now say pretty definitively that AI progress is well ahead of expectations from a few years ago. In 2022, the Forecasting Research Institute asked superforecasters & experts to predict AI progress. They gave a 2.3% & 8.6% probability of an AI Math Olympiad gold by 2025… https://x.com/emollick/status/1962859757674344823

Google's on a roll. That's a lot of performance for that tiny size! I just embedded 1.4 million documents in ~80 mins on my M2 Max for free. Would've been ~$200 with text-embedding-3-large, with worse quality. https://x.com/rishdotblog/status/1963805087014502497

Our new native image generation and editing is state-of-the-art, and ranked #1 in the world. And we’re rolling it out for free to everyone today. You’ve got the tools. Now go bananas. Ideas & inspiration in the 🧵below. https://x.com/GeminiApp/status/1960342037536108930

To test models' performance on Claude Code, we ran GLM-4.5 against Claude Sonnet 4 and other open-source models on 52 practical programming tasks. While GLM-4.5 demonstrated strong performance against top open-source models, it secured a 40.4% win rate against Claude Sonnet 4. https://x.com/Zai_org/status/1962522761630482700

🚀 Introducing slime v0.1.0 — an open-source RL infra powering models like GLM-4.5, built by THUDM & Zhipu AI. @Zai_org RL infra engineer Zhu Xiaolin (朱小霖) shared a deep dive on Zhihu into how they redefined high-performance RL infra 👇 🛠️ What's new in v0.1.0? • High-performance inference for… https://x.com/ZhihuFrontier/status/1962751555591086226

Announcing GLM Coding Plan for Claude Code! After seeing the amazing adoption of GLM-4.5 over the past month, we’re making it more accessible. Get started: https://x.com/Zai_org/status/1962522757536887205

Have been tinkering with GLM 4.5 for about an hour. It is about 3x faster than Claude Code + Opus 4.1 and 5x faster than GPT-5-high, but still feels just as good as closed-source models. I am definitely more productive than with other models due to GLM-4.5's speed. https://x.com/Tim_Dettmers/status/1962603940291260533

🗺️ ATLAS: Decoupling Skeletal and Shape Parameters for Expressive Parametric Human Modeling. TL;DR: high-fidelity 3D humans across a wide range of poses, capturing both skeletal structure and surface details; separates the internal skeleton from the external surface. (1/3) https://x.com/Almorgand/status/1962581481055797586

We connect the autoregressive pipeline of LLMs with streaming video perception. Introducing AUSM: Autoregressive Universal Video Segmentation Model. A step toward unified, scalable video perception — inspired by how LLMs unified NLP. 📝 https://x.com/miran_heo/status/1962649613590302776

The funny thing about the prediction that AI would be writing 90% of all code by now is that the prediction's failure distracts from the fact that AI adoption in code writing is actually extremely high: it was over 30% in December 2024 according to one measure, with large impact. https://x.com/emollick/status/1963262680271094229

xpander.ai is Backend-as-a-Service for autonomous agents. It abstracts the ops layer so AI engineers can focus on behavior and outcomes. GitHub repo: https://x.com/_avichawla/status/1962765005537059007

🚨 We’ve just published a recipe to train a frontier-level deep research agent using RL. With just 30 hours on an H200, any developer can now beat Sonnet-4 on DeepResearch Bench using open-source tools. (Thread 🧵) https://x.com/corbtt/status/1962954306078048297

A 14B model just beat a 671B model on math reasoning. Here’s how Microsoft’s rStar2-Agent achieves frontier math performance in 1 week of RL training
https://x.com/FrankYouChill/status/1962180218053144655

The fact that junior hiring in AI-intensive fields has slowed down somewhat in the US seems pretty solid. The evidence linking it to AI is not yet established; we have seen a couple of solid attempts that suggest a connection, but it is really hard to tell for sure, given the data. https://x.com/emollick/status/1962549832364486957

Cool research from Microsoft! They release rStar2-Agent, a 14B math reasoning model trained with agentic RL. It reaches frontier-level math reasoning in just 510 RL training steps. Here are my notes: https://x.com/omarsar0/status/1964045125115662847

rStar2-Agent: Agentic Reasoning Technical Report. "We introduce rStar2-Agent, a 14B math reasoning model trained with agentic reinforcement learning to achieve frontier-level performance." "three key innovations that make agentic RL effective at scale: (i) an efficient RL…" https://x.com/iScienceLuvr/status/1962798181059817480

Has LLM progress slowed? Initial reactions to GPT-5 were mixed: to many, it did not seem as dramatic an advance as GPT-4. Benchmarks may help clarify the picture: GPT-5 is both an incremental release following many other OpenAI advances, and a major leap from GPT-4. https://x.com/EpochAIResearch/status/1961524635398529209

Hugging Face team just released an agent dataset. Training on it drastically improves the ability to execute code and analyze data. 📈 They use E2B sandboxes to simulate a real code execution environment. Check it out: https://x.com/e2b/status/1962945170736849262

We need to talk about two kinds of "normal technology" when asking "is AI a normal technology?" There is "normal" tech diffusion & there is treating AI as "normal" tech that is just another IT product. I think there is a case for the former. The latter belief is likely blinding. https://x.com/emollick/status/1961487454394789914

Evaluate Your AI Agents Like a Pro! 🔥 Agno's Simple Agent Evals are unit tests for your Agents. You can use them to measure accuracy, performance, and reliability. The best part is that they are easy to use and powerful. 100% open-source. Link to code examples in the… https://x.com/tinztwins/status/1962197412077842846

Everyone claims SOTA for Computer Use Agents (CUAs), but there’s no way to ensure reproducible results. We’re publicly releasing our OSWorld Verified leaderboard, starting with CUA models from OpenAI and Anthropic. We will include more evals and models soon. https://x.com/hud_evals/status/1963321238056796573

Improving the reliability of multi-agent systems is extremely hard. And it’s a must when deploying AI agents. Galileo offers one of the most comprehensive agent eval solutions I’ve seen. Here is how they help devs and huge companies deploy reliable AI agents: https://x.com/omarsar0/status/1962880974104014948

Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure (update) pretty interesting method and ranking. 2.5 Flash > 2.5 Pro suggests it’s not so much about model capacity, but still. V3.1-NS is far above R1-0528. OpenAI crushingly dominant. https://x.com/teortaxesTex/status/1961298849047117832

Which LM is better at agentic coding? We have a bunch of useful academic benchmarks like SWE-Bench, but we don’t have a good comparison of agentic coding LMs *in the wild*. To solve this, we released PR Arena: https://x.com/gneubig/status/1963267468853477809

Who is inducing failure in LLM Agentic Systems? This is a cool idea to diagnose errors in multi-agent interactions. AgenTracer-8B outperforms giant proprietary LLMs like Gemini-2.5-Pro and Claude-4-Sonnet by up to 18.18%. https://x.com/omarsar0/status/1963618829680218254

Today we’re launching Atla — the improvement engine for AI agents. Atla helps agent builders find and fix recurring failures. Instead of just surfacing traces, Atla automatically identifies your agent’s most critical failure patterns and suggests targeted fixes. https://x.com/Atla_AI/status/1963586200305836264

Agents really really need ultra long context https://x.com/Teknium1/status/1963807244190900618

Learning When to Plan: LLM agents trained with dynamic planning learn when to spend test-time compute, balancing cost & performance. This is the first work to explore training LLM agents for dynamic test-time compute allocation in sequential decision-making tasks. https://x.com/arankomatsuzaki/status/1963820986668626156

We investigate Reinforcement Learning (RL) on agentic search tasks without explicitly gathering information from external search engines (e.g., LLMs, web engines). https://x.com/TheTuringPost/status/1961927988704076157

We really have not made a lot of progress on explaining the deep mystery of LLMs: How does a model using matrix multiplication to predict the next word manage to simulate human thought well enough to do all the very human-like things it does? And what does that mean about us? https://x.com/emollick/status/1960919256452796440

Claude Code: no evals. [well known code agent company]: no evals. [well known code agent company 2]: kinda halfassed evals. [leading vibe coding company]: no evals. [ceo of company selling you evals]: mmmmm yess all my top customers do evals, you should do evals. [vc's in love… https://x.com/swyx/status/1963725773355057249

We estimate that Claude Opus 4.1 has a 50%-time-horizon of around 1 hr 45 min (95% confidence interval of 50 to 195 minutes) on our agentic multi-step software engineering tasks. This estimate is lower than the current highest time-horizon point estimate of around 2 hr 15 min. https://x.com/METR_Evals/status/1961527692072993272

How can we benchmark Agents in realistic, complex environments? MCP-Universe is a new benchmark using Model Context Protocol (MCP) servers to test Agents on 231 challenging, practical tasks. Benchmark: 1️⃣ Tasks from 6 practical domains: Location Navigation, Repository… https://x.com/_philschmid/status/1962935890415599650

MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers https://x.com/_akhaliq/status/1961456699564294651

Earlier this year, when I wanted to improve tool-calling, reliable benchmarks were really scarce. I figured "MCP is hot, an open-source mcp-bench will surely show up in a few days," but months went by and none did. So how is it that several mcp-bench releases are now dropping every week? https://x.com/bigeagle_xd/status/1961461441799852128

AHELM: A Holistic Evaluation of Audio-Language Models. "we introduce AHELM, a benchmark that aggregates various datasets — including 2 new synthetic audio-text datasets called PARADE, which evaluates the ALMs on avoiding stereotypes, and CoRe-Bench, which measures reasoning over…" https://x.com/iScienceLuvr/status/1962799344001917360

@jasonth0 I love doing this actually :). I think it's a pretty powerful eval too. Have all models generate something, then put it all together and give it back to all of them and ask them to rank all outputs. I thought models might have a bias to prefer their own outputs, but this doesn't… https://x.com/karpathy/status/1964026120191545346
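The panel-ranking eval described above is easy to sketch. A minimal version, assuming a hypothetical `ask(model, prompt)` callable in place of a real API client, and idealized judge replies that come back as comma-separated indices (real replies would need robust parsing):

```python
import statistics

def cross_rank_eval(models, prompt, ask):
    """Every model answers `prompt`, then every model ranks ALL answers.
    `ask(model, text) -> str` is a placeholder for a real API call."""
    outputs = {m: ask(m, "Answer: " + prompt) for m in models}
    labeled = list(outputs.items())  # fixed order shown to every judge

    ranks = {m: [] for m in models}
    for judge in models:
        listing = "\n".join(f"[{i}] {text}" for i, (_, text) in enumerate(labeled))
        # Assumed: the judge replies with indices like "2,0,1", best first.
        reply = ask(judge, f"Rank these answers, best first:\n{listing}")
        order = [int(tok) for tok in reply.split(",")]
        for position, idx in enumerate(order):
            ranks[labeled[idx][0]].append(position)

    # Lower mean rank = preferred by the panel of judges.
    return {m: statistics.mean(r) for m, r in ranks.items()}
```

Bias toward a model's own output can be checked by comparing the rank each judge gives its own answer against the panel mean.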

Final disclaimer: TAU Bench uses gpt-4.1 as the customer agent, but this costs ~$15 / run and is super expensive for ablations. Instead, I've switched the customer agent to be Qwen3-30B-A3B, which is the closest open model to gpt-4.1 on the LMArena that fits on 1 node. This… https://x.com/_lewtun/status/1962884904649146725

For additional details see our evaluation pages and methodology for these new inclusions in the Artificial Analysis Intelligence Index: https://x.com/ArtificialAnlys/status/1962881327151431773

tbh, not on par yet, more improvement on swe-bench is quite difficult, we're still working very hard on it. https://x.com/bigeagle_xd/status/1963808306545180792

The immense scale of AI use, combined with AI being a General Purpose Technology that can be used for many things, means that any story of a successful (or failed) AI use case is an extremely narrow view into a vast sea of successes and failures. Don't over-index on one story. https://x.com/emollick/status/1963249364064665858

TL;DR on GSO: A very challenging benchmark for evaluating models' capabilities in developing high-performance software. It tasks models to optimize large codebases. :) https://x.com/crystalsssup/status/1963087272506753419

We are barely scratching the surface on evals. A significant portion of knowledge worker tasks are not captured in today's most popular benchmarks. While relevant capabilities can often be extrapolated from existing coding and math evals, these don't fully represent the… https://x.com/levie/status/1963802448780472529

Often a researcher's ability to iterate on a capability is limited by our ability to measure that capability. I do believe progress is more eval-limited than people think. Sometimes evals feel causal: did SWE-Bench follow agentic coding, or did agentic coding follow SWE-Bench? We… https://x.com/willdepue/status/1963739518554489250

Today we’re updating Artificial Analysis Intelligence Index to V3, now incorporating agentic evaluations Terminal-Bench Hard and 𝜏²-Bench Telecom! Tool calling and agentic workflows are increasingly the norm for how language models are used by both developers and consumers. https://x.com/ArtificialAnlys/status/1962881314925023355

🚨 Top 10 Leaderboard Disrupted ⚡ DeepSeek V3.1 and DeepSeek V3.1 Thinking by @deepseek_ai have landed in the Arena, both ranked at #8. A few highlights: 💠 DeepSeek V3.1 is in the Top 3 for Math, Creative Writing & Longer Query 💠 DeepSeek V3.1 Thinking comes in #3 for… https://x.com/lmarena_ai/status/1961474406817173602

🐺 Introducing the Werewolf Benchmark, an AI test for social reasoning under pressure. Can models lead, bluff, and resist manipulation in live, adversarial play? 👉 We made 7 of the strongest LLMs, both open-source and closed-source, play 210 full games of Werewolf. Below is… https://x.com/RaphaelDabadie/status/1961836323376935029

I have a new terrible test of AI ability: "create and execute the most annoying functional CAPTCHA in the world. Really go all out" First off, Gemini 2.5 Pro Deep Think: Got the assignment, actually funny. Love the first line. I pasted the SVG it created below the image. https://x.com/emollick/status/1961648878286946329

Meets or beats Sonnet 4 across the board https://x.com/andrew_n_carr/status/1963805265356075336

Anyway, here's a simple fix for the issue. It deviates from the original benchmark, but at least now my silly baseline isn't better than Qwen3 🤠 For the curious, @akseljoonas and I found this by manually reading the agent trajectories – yet another example where LOOKING AT THE… https://x.com/_lewtun/status/1962884902363255165

Really glad to have been a part of this project! Our goal with STREAM was to create a clear reporting standard so that 'peer review' for ChemBio evals can actually happen. This is just a first attempt, but I'm happy with it so far. Great summary below! https://x.com/jide_alaga/status/1962923611850674379

We tested our controllers in hardware on the real LIGO system. Our results show that Deep Loop Shaping: 🔹controls noise up to 30-100 times better than existing controllers. 🔹can eliminate the most unstable, difficult feedback loop as a meaningful source of noise on LIGO for… https://x.com/GoogleDeepMind/status/1963664045216579999

For the first time, we show that GPU-accelerated database systems can be both faster AND cheaper than their CPU counterparts https://x.com/bailuding/status/1962269979262542044

I know that energy usage from AI prompts comes up in classroom discussion all the time. I hope this section in my latest post helps people provide a more grounded answer to the question, rather than just speculating or citing out-of-date information. https://x.com/emollick/status/1962945874956304494

This chart is being horribly misinterpreted. This is not where the training data of AI comes from, it is a study done by a SEO firm that claims to show how often sites come up at least once in THE WEB SEARCH FUNCTION of certain AI agents when they do a web search for more info. https://x.com/emollick/status/1962678752887914918

🚀 LongCat-Flash-Chat Launches! ▫️ 560B Total Params | 18.6B-31.3B Dynamic Activation ▫️ Trained on 20T Tokens | 100+ tokens/sec Inference ▫️ High Performance: TerminalBench 39.5 | τ²-Bench 67.7 🔗 Model: https://x.com/Meituan_LongCat/status/1961827385667690965

🌐 Our first open model has landed on the Search leaderboard! Diffbot-small-xl by @diffbot debuts at #9 (Apache 2.0) We look forward to more models with search capabilities contributing to ecosystem progress! https://x.com/lmarena_ai/status/1961526740754616545

Towards High Data Efficiency in Reinforcement Learning with Verifiable Reward. "we propose DEPO, a Data-Efficient Policy Optimization pipeline that combines optimized strategies for both offline and online data selection. In the offline phase, we curate a high-quality subset of…" https://x.com/iScienceLuvr/status/1963169113007895020

@SemiAnalysis_ The problem is there's tons of tests in different subsystems that get skipped in PyTorch. At Meta we rely on many devs, oncalls, heroics and BE weeks where we unskip or fix flaky tests. We love Jeff! And we need more people like him to build expertise in all pytorch subsystems https://x.com/marksaroufim/status/1963844930620600457

Meta introduces Set Block Decoding (SBD), a new inference accelerator for LLMs SBD samples multiple future tokens in parallel, cuts forward passes by 3–5x, needs no arch changes, stays KV-cache compatible, and matches NTP training performance. https://x.com/arankomatsuzaki/status/1963817987506643350

Microsoft now has their own foundation model, MAI-1, trained on a relatively small amount of compute and with a pretty modest LM Arena score. I'll be curious to see if they can catch up to the leaders, which has been getting harder to do, but we will see! https://x.com/emollick/status/1961123712733708351

FineVision is out! A massive open-source dataset by @huggingface for training Vision-Language Models: – 17.3M images – 24.3M samples – 88.9M turns – 9.5B answer tokens This is the inaugural article using our new scientific publishing template! https://x.com/thibaudfrere/status/1963627540544647177

Fuck it. Today, we open source FineVision: the finest curation of datasets for VLMs, over 200 sources! > 20% improvement across 10 benchmarks > 17M unique images > 10B answer tokens > New capabilities: GUI navigation, pointing, counting FineVision 10x’s open-source VLMs. https://x.com/andimarafioti/status/1963610118165000479

Today, we are releasing FineVision, a huge open-source dataset for training state-of-the-art Vision-Language Models: > 17.3M images > 24.3M samples > 88.9M turns > 9.5B answer tokens Here are my favourite findings: https://x.com/lusxvr/status/1963609337546293448

HunyuanWorld-Voyager is here and fully open-source! The world’s first ultra-long-range world model with native 3D reconstruction, redefining AI-driven spatial intelligence for VR, gaming, and simulations. ✅Direct 3D Output: Exports point cloud videos to 3D formats without tools https://x.com/TencentHunyuan/status/1962741518797836708

🍁 In collaboration with @NVIDIAAIDev, @RedHat_AI, and @VectorInst, vLLM is hosting a meetup in Toronto September 25th! Come hear about project updates, distributed inference, EAGLE spec decode, and FlashInfer! https://x.com/vllm_project/status/1963736578674893071

Unfortunate reality: most open-source LLM servers (e.g. Together) don’t offer cache-hit discounts, while closed providers like OpenAI do. DeepSeek does discount, but most third-party servers don’t.
https://x.com/arankomatsuzaki/status/1963294646957957263

OpenAI just published their official prompting guide for GPT-5. Master these 6 critical prompting techniques: https://x.com/thealexbanks/status/1961415269332443456

For 𝜏²-Bench Telecom, OpenAI's GPT-5 and o3 achieve scores of >80% with a lead over other frontier models, followed by the new Grok Code Fast 1 and Grok 4 from xAI. Models noted for their capabilities in agentic use cases and tool calling performed well, such as the Claude… https://x.com/ArtificialAnlys/status/1962881324727087253

Nice to see open models (+ OpenHands) showing really strong performance here. Arguably for pure agentic coding tasks open models have almost caught up with closed ones. For more diverse tasks there's still a little ways to go. https://x.com/gneubig/status/1963045532022010231

Hermes 4: Nous Research Open-Weight Reasoning Family Models – 70B / 405B (Llama-3.1 bases, released) – 14B (Qwen3 base, research baseline) Hermes 4 70B & 405B – Base: Llama-3.1-70B / 405B – Training: TorchTitan (modified), Axolotl, 192× B200s, FSDP and TP – Dataset: 56B tokens https://x.com/gm8xx8/status/1962943078702186627

Instructions/reasoning are now everywhere in retrieval – we want embeddings to do it all! 🚀 But… is it even possible? 🤔 Turns out, it’s not possible for single-vector models 😱 theoretically and empirically! To make it obvious we OSS a simple eval SoTA models flop on! 🧵 https://x.com/orionweller/status/1961436569409331579

Here's a fun fact about TAU Bench: if you train an SFT baseline which has zero tool-calling capabilities, you can beat Qwen3-4B-Instruct by a large margin on the Airline domain 🙃 Why? Because on this domain, TAU Bench only evaluates the model's ability to: – communicate with… https://x.com/_lewtun/status/1962884893718761634

Glad to see Qwen3-Coder performing well on the GSO leaderboard! https://x.com/Alibaba_Qwen/status/1963049864474120475

MiniCPM-V 4.5 achieves an average score of 77.0 on OpenCompass, a comprehensive evaluation of 8 popular benchmarks. With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-latest, Gemini-2.0 Pro, and strong open-source models like Qwen2.5-VL 72B powered… https://x.com/_akhaliq/status/1963587749400727980

Goated FAIR team just found how coding agents sometimes "cheat" on SWE-Bench Verified. It's really simple. For example, Qwen3 literally greps all commit logs for the issue number of the issue it needs to fix. lol, clever model. "Cheat" cuz it's more like env hacking. https://x.com/giffmana/status/1963327672827687316

How can we verify that AI ChemBio safety tests were properly run? Today we’re launching STREAM: a checklist for more transparent eval results. I read a lot of model reports. Often they miss important details, like human baselines. STREAM helps make peer review more systematic. https://x.com/lucafrighetti/status/1962909265091592276

"We connect known results in learning theory, showing that the number of top-k subsets of documents capable of being returned as the result of some query is limited by the dimension of the embedding. We empirically show that this holds true even if we…" https://x.com/hsu_steve/status/1963999025025450008

(1/n) Check out our new paper: "Fantastic Pretraining Optimizers and Where to Find Them"! >4000 models to find the fastest optimizer! 2× speedups over AdamW? Unlikely. Beware under-tuned baselines or limited scale! E.g. Muon: ~40% speedups <0.5B & only 10% at 1.2B (8× Chinchilla)! https://x.com/wen_kaiyue/status/1963633867140526319

[2509.01321] Towards High Data Efficiency in Reinforcement Learning with Verifiable Reward https://arxiv.org/abs/2509.01321

[2509.02046] Fantastic Pretraining Optimizers and Where to Find Them https://arxiv.org/abs/2509.02046

[2509.02350] Implicit Reasoning in Large Language Models: A Comprehensive Survey https://arxiv.org/abs/2509.02350

[2509.02534] Jointly Reinforcing Diversity and Quality in Language Model Generations https://arxiv.org/abs/2509.02534

@cloneofsimo Nice find, what's _sm_mod here? Also, first time I see "symmetric memory". https://x.com/giffmana/status/1962886753414468065

@giffmana also there was a bug in the code and the improvement isn't as dramatic as 4x; it's more like 1.9x on H100s https://x.com/cloneofsimo/status/1962889777570787723

🌀Diversity Aware RL (DARLING)🌀 📝: https://x.com/jaseweston/status/1963230744173482018

💡 Ever heard of "internal metrics" in LLM training? Kimi's founder Yang Zhilin mentioned it twice in recent interviews — what exactly are they? Here's some sharing from Kimi researchers on Zhihu 👇 🤔 Su Jianlin @Jianlin_S: Internal metrics are like human health checks — heart… https://x.com/ZhihuFrontier/status/1963493293679153349

💭 What if you could train SOTA retrieval models without knowledge distillation? With PyLate, LightOn makes it possible: ⚡️ Direct large-scale contrastive training, no teacher needed 📚 Train on billions of passages with GradCache & distributed infra 🎯 Better generalization https://x.com/LightOnIO/status/1963620040604787136

📄 Introduction to PageIndex
https://x.com/omarsar0/status/1961446976152588712

🤖 From this week’s issue: A technical blog post about the 8-bit Rotational Quantization method, which compresses vectors by 4x, speeds up vector search, and improves search quality by combining random rotations with scalar quantization. https://x.com/dl_weekly/status/1961413948877553899
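The rotate-then-quantize idea behind that post is easy to sketch. Below is a dependency-free illustration (not the blog's implementation) of the two ingredients: a random orthogonal rotation, which spreads energy evenly across coordinates, followed by per-vector 8-bit scalar quantization, giving the 4x compression versus float32:

```python
import math, random

def random_rotation(d, seed=0):
    """Random orthogonal matrix built by Gram-Schmidt on Gaussian vectors."""
    rng = random.Random(seed)
    basis = []
    while len(basis) < d:
        v = [rng.gauss(0, 1) for _ in range(d)]
        for b in basis:  # subtract components along existing basis vectors
            dot = sum(x * y for x, y in zip(v, b))
            v = [x - dot * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 1e-9:
            basis.append([x / norm for x in v])
    return basis

def rotate(R, v):
    # Matrix-vector product; orthogonal R preserves norms and distances.
    return [sum(r_i * x for r_i, x in zip(row, v)) for row in R]

def quantize_8bit(v):
    """Scalar-quantize each coordinate to one of 256 levels (1 byte each)."""
    lo, hi = min(v), max(v)
    scale = (hi - lo) / 255 or 1.0
    codes = [round((x - lo) / scale) for x in v]
    return codes, lo, scale

def dequantize(codes, lo, scale):
    return [lo + c * scale for c in codes]
```

The rotation matters because scalar quantization wastes range on coordinates with outlier magnitudes; after a random rotation, coordinates are more uniformly scaled, so the 256 levels are used efficiently.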

🚀 Highly recommend this blog: "Efficient RL Training – Optimizing Weight Sync in slime", which dives into how weight sync works in the slime framework & why it uses a server-based architecture, etc. 💡 Key achievement: Qwen3-30B-A3B weight update time slashed from 60s →… https://x.com/ZhihuFrontier/status/1963532501336695282

A nice paper by @wen_kaiyue @dlwh @tengyuma @percyliang, doing the hard work of real comparison between optimizers. It's also very cool to see the first success come out of Stanford's Marin project! https://x.com/BlancheMinerva/status/1963679442859106480

A second paper also finds Generative AI is reducing the number of junior people hired (while not impacting senior roles). This one compares firms across industries who have hired for at least one AI project versus those that have not. Firms using AI were hiring fewer juniors https://x.com/emollick/status/1962513819990692211

Adaptive LLM Routing under Budget Constraints. It frames LLM routing as a contextual bandit problem, which helps maximize quality under a fixed budget. It can also handle diverse user budgets with an online cost policy. Lots of cool ideas in this one. https://x.com/omarsar0/status/1962875108512411938
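To make the contextual-bandit framing concrete, here is a toy epsilon-greedy router (an illustrative sketch, not the paper's algorithm): it learns per-context quality estimates for each model and greedily spends a fixed budget on the best observed reward-per-dollar.

```python
import random

class EpsilonGreedyRouter:
    """Toy contextual bandit: pick a model per query context, trading off
    observed quality against a per-call price, under a total budget."""
    def __init__(self, models, prices, budget, epsilon=0.1, seed=0):
        self.models, self.prices = models, prices
        self.budget, self.epsilon = budget, epsilon
        self.rng = random.Random(seed)
        self.stats = {}  # (context, model) -> (count, total reward)

    def _value(self, ctx, m):
        n, total = self.stats.get((ctx, m), (0, 0.0))
        return total / n if n else float("inf")  # optimism for unseen arms

    def route(self, ctx):
        affordable = [m for m in self.models if self.prices[m] <= self.budget]
        if not affordable:
            return None
        if self.rng.random() < self.epsilon:
            return self.rng.choice(affordable)  # explore
        # exploit: best estimated reward per dollar among affordable models
        return max(affordable, key=lambda m: self._value(ctx, m) / self.prices[m])

    def update(self, ctx, m, reward):
        self.budget -= self.prices[m]
        n, total = self.stats.get((ctx, m), (0, 0.0))
        self.stats[(ctx, m)] = (n + 1, total + reward)
```

With a cheap model that only handles easy queries and an expensive model that handles everything, the router quickly learns to send easy contexts to the cheap arm.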

"Always reasoning" (ReAct) isn't optimal for LLM agents! 🧠 Our new paper identifies a "Goldilocks" effect: planning too frequently or not enough degrades performance. We show how to train agents to learn to dynamically allocate test-time compute when needed for best results. 👇 https://x.com/PaglieriDavide/status/1963971144584724939

Amazing blogpost from @gordic_aleksa explaining internals of vLLM 😍 https://x.com/vllm_project/status/1962547561698652499

chat, this is getting out of hand: "We introduce LongCat-Flash, a powerful and efficient language model with 560 billion total parameters; the model incorporates a dynamic computation mechanism that activates 18.6B–31.3B parameters (averaging ~27B) based on contextual demands" https://x.com/reach_vb/status/1961833208737103997

Command-line agents can get you really far in document search and analysis! We tested SemTools, our CLI toolkit for parsing and semantic search, with coding agents like @claude_code on 1000 @arxiv papers. The results show that combining Unix tools with semantic search… https://x.com/llama_index/status/1964009128973783135

Did you know SQLite has a vector extension? 🧮 SQLite is the most used database in the world and runs on almost any device. You can now easily build AI applications leveraging sqlite-vec and the new EmbeddingGemma directly on-device, no internet required. Below is a simple… https://x.com/_philschmid/status/1963952204970078579
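A hedged sketch of the on-device pattern using only the stdlib `sqlite3` module: embeddings stored as float32 BLOBs, with the nearest-neighbor scan done in Python. The sqlite-vec extension would replace this scan with a `vec0` virtual table and a `MATCH` query; the table name, toy 3-dim embeddings, and documents below are made up for illustration:

```python
import sqlite3, struct, math

def pack(vec):    # serialize a vector as a little float32 blob
    return struct.pack(f"{len(vec)}f", *vec)

def unpack(blob):
    return list(struct.unpack(f"{len(blob) // 4}f", blob))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, text TEXT, emb BLOB)")
docs = {1: ("cats and dogs", [1.0, 0.1, 0.0]),   # toy embeddings; a real app
        2: ("stock markets", [0.0, 0.2, 1.0])}   # would use a model like EmbeddingGemma
for i, (text, emb) in docs.items():
    db.execute("INSERT INTO docs VALUES (?, ?, ?)", (i, text, pack(emb)))

def search(query_emb, k=1):
    """Brute-force cosine top-k over all stored rows."""
    rows = db.execute("SELECT id, text, emb FROM docs").fetchall()
    scored = [(cosine(query_emb, unpack(e)), i, t) for i, t, e in rows]
    return sorted(scored, reverse=True)[:k]
```

The brute-force scan is fine for a few thousand rows; the extension exists to make the same query fast and expressible in SQL at larger scale.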

Diffusion Language Models Know the Answer Before Decoding. "in many cases, the correct answer can be internally identified by half steps before the final decoding step, both under semi-autoregressive and random remasking schedules. For example, on GSM8K and MMLU, up to 97% and…" https://x.com/iScienceLuvr/status/1962800400278667677

Exa AI Research Blog: Announcing Series B https://exa.ai/blog/announcing-series-b

Fantastic Pretraining Optimizers and Where to Find Them. "we conduct a systematic study of ten deep learning optimizers across four model scales (0.1B-1.2B parameters) and data-to-model ratios (1–8× the Chinchilla optimum)." "we find that all the fastest optimizers such as Muon…" https://x.com/iScienceLuvr/status/1963168542872014943

first i thought scaling laws originated in OpenAI (2020) then i thought they came from Baidu (2017) now i am enlightened: Scaling Laws were first explored at Bell Labs (1993) https://x.com/jxmnop/status/1960314100715528627

FlashAttention in 3D? Our latest blog explores the #kernel design of 2-Simplicial #Attention, modeling the algorithm with a hardware aligned design and rewriting the entire kernel in TLX (Triton Low Level Extensions). 🔗 https://x.com/PyTorch/status/1964012269123031452

Goldfish loss proposes randomly dropping some tokens from the cross-entropy loss. It mitigates memorization without lowering downstream benchmark performance. https://x.com/vikhyatk/status/1962954696500674908
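The idea fits in a few lines: exclude a random subset of token positions from the loss, so the model is never graded on (and so never forced to memorize) every token of a repeated document. A minimal sketch; note the paper's preferred variant uses a deterministic, hash-based mask so repeated documents drop the same positions, while a simple random mask is used here for brevity:

```python
import math, random

def goldfish_cross_entropy(token_probs, drop_frac=0.25, seed=0):
    """Mean negative log-likelihood over a random subset of positions.
    `token_probs` = model probability assigned to each target token;
    positions are dropped from the loss with probability `drop_frac`."""
    rng = random.Random(seed)
    kept = [p for p in token_probs if rng.random() >= drop_frac]
    if not kept:
        return 0.0  # degenerate case: every position was dropped
    return -sum(math.log(p) for p in kept) / len(kept)
```

With `drop_frac=0.0` this reduces to ordinary mean cross-entropy; the claim in the tweet is that moderate drop fractions barely move downstream benchmarks while breaking verbatim regurgitation.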

Graph-R1: Unleashing LLM Reasoning with NP-Hard Graph Problems. Uses NP-hard graph problems as a novel synthetic training corpus to enable Long CoT reasoning, as they inherently require deep reasoning, extensive exploration, and reflective strategies. Matches QwQ 32B with a 7B… https://x.com/papers_anon/status/1961385914040766712

Guess who was the 1st to point out that Adam can be used for pretty much everything? (Answer: it was @fastdotai back in 2018 — @GuggerSylvain’s first research project in fact!) https://x.com/jeremyphoward/status/1963689424782565384

He also provided a feature checklist FYI ✅ • MoE support • Mem-fraction > 0.7 • FP8 inference • DeepEP (train + infer) • Speculative decoding • Mature backend (Megatron, TorchTitan). Slime v0.1.0 checks all the essential boxes — still room to grow, but it's ready to https://x.com/ZhihuFrontier/status/1962760198176870613

If you want to avoid guards, do autotuning at compile time, and have no JIT at runtime, consider torch.export; that's what you'd want. https://x.com/soumithchintala/status/1963225534659178948

Implicit reasoning is one of the most fascinating AI research topics I read about these days.
This new survey paper covers it really well and provides a good set of related readings on the topic. https://x.com/omarsar0/status/1963236545705710070

"Internal metrics" explained by Jianlin Su: in simple terms, "internal metrics" are monitoring indicators used to determine whether a model is training normally. To draw an analogy with humans, common internal metrics like heart rate, blood pressure, and body temperature are https://x.com/crystalsssup/status/1963547955799224386

Kimi K2-0905 update 🚀 – Enhanced coding capabilities, esp. front-end & tool-calling – Context length extended to 256k tokens – Improved integration with various agent scaffolds (e.g., Claude Code, Roo Code, etc) 🔗 Weights & code: https://x.com/Kimi_Moonshot/status/1963802687230947698

New dataset with 2B tokens from 51k Kaggle notebooks! I recommend checking their data gen pipeline with dedup, quality scoring, and filtering. It’s 🤌 🤗 Dataset: https://x.com/maximelabonne/status/1962923411887305094

New in-depth blog post – “Inside vLLM: Anatomy of a High-Throughput LLM Inference System”. Probably the most in depth explanation of how LLM inference engines and vLLM in particular work! https://x.com/gordic_aleksa/status/1962545137613173124

InternVL3.5 is out; it's SOTA on many tasks, narrows the gap with GPT-5 and Gemini 2.5, and comes in many sizes and 2 flavors (standard and Flash) https://x.com/gabriberton/status/1962219193547583512

One mismatch between DSPy and other stuff is that, in DSPy, every “prompt” is a *function* not a string. https://x.com/lateinteraction/status/1961959394427441441

One of the powerful ways to define relevance-based point order is decay functions. In this blog post, we show: ☑️ which ones we have, and how to pick one; ☑️ how to (not) use them for score normalisation and custom fusions; ☑️ multiple examples to get a grip on https://x.com/qdrant_engine/status/1962876569728233507
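For intuition, the three standard decay shapes fit in a few lines. The `target`/`scale`/`midpoint` parameterisation below follows the common convention (value 1 at the target, `midpoint` at distance `scale`); Qdrant's exact formulas and parameter names may differ, so treat this as a sketch.

```python
import math

def exp_decay(x, target=0.0, scale=1.0, midpoint=0.5):
    # 1 at x == target, `midpoint` at distance `scale`, long tail beyond.
    return midpoint ** (abs(x - target) / scale)

def gauss_decay(x, target=0.0, scale=1.0, midpoint=0.5):
    # Bell-shaped: flat near the target, then falls off quickly.
    return midpoint ** ((x - target) ** 2 / scale ** 2)

def lin_decay(x, target=0.0, scale=1.0):
    # Straight line down to exactly 0 at distance `scale`.
    return max(0.0, 1.0 - abs(x - target) / scale)

def boosted(similarity, age_days, weight=0.5):
    # Freshness boosting: add a decayed bonus (7-day half-life here)
    # on top of the raw semantic similarity score.
    return similarity + weight * exp_decay(age_days, scale=7.0)
```

The "how to (not) use them" caveat is visible in `lin_decay`: it hits exactly zero, so multiplying rather than adding it would zero out every document past the cutoff.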

our launch of the environment hub is another step toward full-stack open agi infra but it goes beyond environments: our stack enables using them properly + integrates compute, sandboxes, rft, and evals, currently locked behind the walls of closed labs https://x.com/vincentweisser/status/1961594111733158141

Paper page – Speeding Up the NSGA-II via Dynamic Population Sizes https://huggingface.co/papers/2509.01739

Pref-GRPO Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning https://x.com/_akhaliq/status/1961437082888352200

Proud to partner with lmsysorg on a new code walkthrough for the open-source RL framework slime. It demystifies core mechanics like training loops and rollouts, making advanced RL concepts more accessible. https://x.com/Zai_org/status/1963099102347931975

RL’s Razor: On-policy RL forgets less than SFT. Even at matched accuracy, RL shows less catastrophic forgetting Key factor: RL’s on-policy updates bias toward KL-minimal solutions Theory + LLM & toy experiments confirm RL stays closer to base model https://x.com/arankomatsuzaki/status/1963823603469730114

Single vector models have **fundamental** limitations You can try to make the embedding dimension larger and larger if you want, you’ll eventually hit a wall Rerankers mitigate those issues but cannot scale for full databases ColBERT does both. https://x.com/antoine_chaffin/status/1961339798112575673
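The "does both" refers to ColBERT's late interaction: every token keeps its own vector, and scoring sums each query token's best match over document tokens (MaxSim). A minimal sketch with made-up 2-d token embeddings shows why a single compromise vector loses to per-token vectors:

```python
def maxsim(query_vecs, doc_vecs):
    # ColBERT-style late interaction: each query token takes the max dot
    # product over all document token vectors, then scores are summed.
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy 2-d token embeddings (illustrative, not from a real model).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]   # per-token vectors cover both query tokens
doc_b = [[0.7, 0.7]]               # a single pooled vector must compromise

score_a = maxsim(query, doc_a)  # 1.0 + 1.0 = 2.0
score_b = maxsim(query, doc_b)  # 0.7 + 0.7 = 1.4
```

Because MaxSim is just dot products, it scales over a full collection like a bi-encoder, while still giving the token-level matching a reranker provides.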

some random ideas for you: from scratch HTTP/TCP/IP/SSL/PDF/PCIe/JVM eval given full technical spec in extreme detail, replicate the protocol. hundreds of thousands of lines of code mean millions in chain of thought. all one shot Factorio/EUIV/Civ/RimWorld/Stellaris/Polytopia https://x.com/willdepue/status/1963739522312646934

StableMoE
First they train the MoE normally on 10% of the data, then distill a word-embedding-based router, then use that frozen router for the rest of training and find it improves performance https://x.com/vikhyatk/status/1962225296314429543
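The frozen router is simple enough to sketch: after distillation, routing is just a dot product between the token's word embedding and a fixed embedding per expert. The toy embeddings below are made up for illustration.

```python
def route(token_emb, expert_embs):
    # Frozen word-embedding router: assign the token to the expert whose
    # (fixed) embedding scores highest. Because the expert embeddings never
    # change after distillation, a token's routing no longer shifts during
    # training, which is the stability StableMoE is after.
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return max(range(len(expert_embs)),
               key=lambda i: dot(token_emb, expert_embs[i]))

frozen_experts = [[1.0, 0.0], [0.0, 1.0]]  # toy embedding per expert
a = route([0.9, 0.2], frozen_experts)  # lands on expert 0
b = route([0.1, 0.8], frozen_experts)  # lands on expert 1
```

A learned router keeps moving tokens between experts as its weights update; freezing it removes that moving target for the rest of training.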

Thanks AK for sharing our work! 🔥 🧵 Back to Jan when we just started this project… we were living a nightmare 😩 Months of watching our multi-turn RL models collapse. Every. Single. Time. 💥 We thought we were doing something wrong… until we discovered other research https://x.com/sivil_taram/status/1963279400834924965

The 2 SOTA VLMs are InternVL3.5 and GLM-4.5V, both of which use an MLP
https://x.com/gabriberton/status/1962223082334302211

The core principle of DSPy is to ask humans to specify intent ONLY in the most natural shape *each* intent takes. https://x.com/lateinteraction/status/1961833838000111736

The Parallelism Mesh Zoo blog post https://x.com/ezyang/status/1961992677928378842

The technical report of LongCat-Flash is crazy good and full of novelty.
The model is a 560B-parameter MoE with ~27B active, with an adaptive number of active parameters depending on the context thanks to the Zero-Computation expert. https://x.com/eliebakouch/status/1961999252311204147
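A toy sketch of how a zero-computation (identity) expert makes the number of computed parameters context-dependent: when the router sends part of a token's top-k budget to the identity expert, that slot costs no FLOPs. The experts and scores below are illustrative, not LongCat-Flash's actual architecture.

```python
def top_k_indices(scores, k):
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

def moe_forward(x, experts, scores, k=2):
    # Standard top-k MoE combine: weighted sum of the selected experts'
    # outputs, with weights renormalized over the selected set.
    idx = top_k_indices(scores, k)
    total = sum(scores[i] for i in idx)
    out = sum((scores[i] / total) * experts[i](x) for i in idx)
    return out, idx

experts = [
    lambda x: 2.0 * x,   # a "real" expert (stand-in for an FFN)
    lambda x: x + 1.0,   # another toy expert
    lambda x: x,         # zero-computation expert: identity, no FLOPs
]

out, idx = moe_forward(3.0, experts, scores=[0.1, 0.3, 0.6], k=2)
# Here experts 2 and 1 are selected: one of the two "active" slots is the
# identity, so this token computes fewer parameters than a token whose
# budget lands entirely on real experts.
```

Easy tokens can route most of their budget to the identity expert, hard tokens to real experts, which is how the average active parameter count floats around ~27B instead of being fixed.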

The method is actually *BETTER* than what they report.
Theoretically, DSV3 under TBO will have 60ms TPOT — a user will only see tokens from 1 of 2 batches. https://x.com/YouJiacheng/status/1961945887552483438

This debate has really captured the timeline. Sadly, most folks discussing it are missing the nuance. I think Swyx understands this a lot more deeply than the folks discussing it elsewhere, so I recommend his thread here beyond a lot of the branched ones. As one of the https://x.com/BEBischof/status/1963739648792117484

This is so interesting from @Jianlin_S: https://t.co/EDKpC90OaU At first glance, the update RMS of AdamW is actually O(1), and in the "Muon is Scalable for LLM Training" paper we empirically observe that AdamW's RMS is usually around 0.2, which is why we have the 'magic' 0.2 https://x.com/JingyuanLiu123/status/1963084684784734543
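The ~0.2 figure is easy to reproduce numerically: for noise-like gradients, the EMA variances give an update RMS near sqrt((1 - beta1) / (1 + beta1)) ≈ 0.23. A stdlib-only toy simulation with i.i.d. Gaussian gradients (not a claim about any real training run):

```python
import random
import math

def adamw_direction_rms(steps=500, n=1000, beta1=0.9, beta2=0.999, eps=1e-8):
    # RMS of the raw AdamW direction m_hat / (sqrt(v_hat) + eps) after
    # `steps` updates with i.i.d. N(0, 1) gradients. For pure noise this
    # settles near sqrt((1 - beta1) / (1 + beta1)) ~ 0.23, i.e. O(1).
    rng = random.Random(0)
    m = [0.0] * n
    v = [0.0] * n
    for t in range(1, steps + 1):
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for i in range(n):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
    mh = [mi / (1 - beta1 ** steps) for mi in m]
    vh = [vi / (1 - beta2 ** steps) for vi in v]
    u = [mh[i] / (math.sqrt(vh[i]) + eps) for i in range(n)]
    return math.sqrt(sum(x * x for x in u) / n)

rms = adamw_direction_rms()  # roughly 0.2, as the thread observes
```

With a consistent gradient signal instead of pure noise, the per-coordinate ratio approaches 1, which is where the O(1) upper bound comes from.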

This was a lovely paper! Mech interp often uses attribution patching, a gradient-based approximation to principled causal interventions. But it's often inaccurate. Farnoush found that LRP, an XAI technique for better gradient-based approximations, massively improves accuracy! https://x.com/NeelNanda5/status/1963029426741854345

Towards a Unified View of LLM Post-Training This work proposes Hybrid Post-Training, which switches between RL and SFT using simple performance feedback to balance exploration and exploitation. More below: https://x.com/omarsar0/status/1963971173735448858

Two new papers from this week benchmark recent optimizers: Muon, Soap, Mars, ScheduleFree, Prodigy, AdEMAMix, Sophia, etc. Also ablated in different settings (batch size, model scale, weight decay, scheduler). A must read for anyone working on practical optimization. https://x.com/konstmish/status/1963535545721917725

USO Unified Style and Subject-Driven Generation via Disentangled and Reward Learning https://x.com/_akhaliq/status/1961455755111842126

We did a very careful study of 10 optimizers with no horse in the race. Despite all the excitement about Muon, Mars, Kron, Soap, etc., at the end of the day, if you tune the hyperparameters rigorously and scale up, the speedup over AdamW diminishes to only 10% 🙁 Experiments https://x.com/percyliang/status/1963648131394122222

When using compilation, flexibility in dynamic length inputs becomes a headache 🤕 Thanks to PyTorch, many operations actually support shape-independent compilation benefits. Our recipe for AoT compilation can be extended to support shape dynamism, too 🫡 Check out the https://x.com/RisingSayak/status/1962844503620145621

Who needs sleep? Kimi-K2-Instruct-0905 just landed. 200+ T/s, $1.50/M tokens. 256k context window. Built for coding. Rivals Sonnet 4. Available now. 👇 https://x.com/GroqInc/status/1963823577557606665

XQuant instead of KV quantization? XQuant by @UCBerkeley rematerializes keys and values on-the-fly with quantized layer input activations. Its advanced version XQuant-CL cuts LLM memory needs up to 12×! Here is how it works: https://x.com/TheTuringPost/status/1961475078753063322
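The trick in one sketch: instead of caching both K and V per layer, cache a single quantized copy of the layer input X and recompute K = X·Wk and V = X·Wv at decode time, trading a little matmul compute for roughly half the cache tensors. The toy quantizer and 2×2 weights below are assumptions for illustration, not XQuant's actual scheme.

```python
def quantize(xs, bits=8):
    # Symmetric per-tensor integer quantization (toy stand-in for the
    # low-bit activation quantizer XQuant would use).
    qmax = 2 ** (bits - 1) - 1
    biggest = max(abs(v) for row in xs for v in row)
    scale = biggest / qmax if biggest else 1.0
    return [[round(v / scale) for v in row] for row in xs], scale

def dequantize(q, scale):
    return [[v * scale for v in row] for row in q]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Toy projection weights; real models have learned Wk, Wv per layer.
Wk = [[1.0, 0.0], [0.0, 1.0]]
Wv = [[0.5, 0.0], [0.0, 0.5]]
X = [[0.31, -0.8], [1.2, 0.05]]      # layer input activations

q, s = quantize(X)                    # cache ONE quantized tensor (X)...
X_hat = dequantize(q, s)
K = matmul(X_hat, Wk)                 # ...and rematerialize K and V from it
V = matmul(X_hat, Wv)                 # instead of caching both K and V
```

Storing one quantized tensor instead of two full-precision ones is where the headline memory reduction comes from; the CL variant squeezes further by storing cross-layer deltas of X.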

XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization
Paper: https://x.com/TheTuringPost/status/1961475160823009773

zhaochenyang20 (赵晨阳) SGLang is a fast serving framework for large language models and vision language models. https://github.com/zhaochenyang20/

Overview of Self-Evolving Agents
https://x.com/omarsar0/status/1962202247154352502

Modern AI teams need hyperscalers & neoclouds, but legacy tools like SLURM can’t keep up. @AbridgeHQ moved from SLURM to multi-cloud AI infra with @skypilot_org. ✅ 10x faster dev cycles ✅ SLURM-like convenience, K8s’ reliability ✅ Scale on any infra https://x.com/skypilot_org/status/1963637217055646139

receipts for no-BS evals talks: https://x.com/swyx/status/1963727193974153602

abs: https://x.com/iScienceLuvr/status/1962800402409365590

code: https://x.com/iScienceLuvr/status/1962798182964113547

website: https://x.com/iScienceLuvr/status/1962799346292007272

⚡️ Big update from Kimi K2! 256k context, Stronger coding & tool-calling, Smoother agent integration. Already tested with SGLang runtime — stable 60-100+ TPS with turbo API! 👉 Check it out: https://x.com/lmsysorg/status/1963806184747491717

📈 Relevance score boosting in Qdrant: freshness, proximity, you name it! Semantic similarity doesn't always mean high relevance; the latter is always use-case-defined. In 1.14, we introduced a mechanism for https://x.com/qdrant_engine/status/1962876567362617445

Ok so I'm out of the rabbit hole, and it seems like you can get up to 4x faster intranode all2all if you use torch's symmetric memory and implement it yourself, but it wasn't implemented in torch Just Cause ™? https://x.com/cloneofsimo/status/1962795533933912158

GPT-OSS uses MXFP4 quantization (which MLX now supports). There are two FP4 formats circulating right now: MXFP4 and NVFP4 (NV for Nvidia). From looking at how GPT-OSS uses MXFP4, it is somewhat suboptimal. I’m thinking NVFP4 will be the more commonly used format in the https://x.com/awnihannun/status/1961500133990043967

All of the details warrant a blog post for the community. So, we authored one 🤗 Check out all the details in this post: https://x.com/RisingSayak/status/1962844506094723429

feeling frustrated (and a little guilty): there's still way too much confusion about @OpenAI's Responses API. this is partly on us: we haven't always been clear about why we built it, how to use it, and why it matters. here's my attempt at setting the record straight. 👇 https://x.com/prashantmital/status/1963801236391772372

Seven tips for getting the best out of gpt-realtime: https://x.com/OpenAIDevs/status/1962951139781181680

i never took the dead internet theory that seriously but it seems like there are really a lot of LLM-run twitter accounts now https://x.com/sama/status/1963366714684707120

Discover more from Ethan B. Holland

Subscribe now to keep reading and get access to the full archive.

Continue reading