Technical and Dev: AI News Week Ending 09/19/2025

Image created with gemini-2.5-flash-image with claude-sonnet-4-5-20250929. Image prompt: A cinematic photograph of ornate brass scales of justice on an oak desk in a wood-paneled English legal chamber, one side holding an illuminated circuit board and the other holding leather-bound law books, perfectly balanced, lit by warm amber lamplight with rich shadows suggesting Lincoln’s Inn gravitas.

🏖️ Summarization Middleware As agent loops get long (either because lots of messages or lots of tool calls) you want to summarize what has occurred so you don’t overflow context (and break your workflow). LangChain’s new middleware automatically summarizes history to keep you https://x.com/sydneyrunkle/status/1967991069368275282

Avoid overflowing context windows with LangChain’s SummarizationMiddleware. This is especially important for long running conversations that have lots of messages and agent loops with lots of tool calls.”” / X https://x.com/LangChainAI/status/1967993889958031560

Reasoning models (apparently without tool use) scored #1 (OpenAI) & tied for #2 (Google) in the International Collegiate Programming Contest Its been one year since reasoners were first announced, it is genuinely surprising how good they have gotten at hard problems, so quickly https://x.com/emollick/status/1968402884627697950

Last week, our reasoning models took part in the 2025 International Collegiate Programming Contest (ICPC), the world’s premier university-level programming competition. Our system solved all 12 out of 12 problems, a performance that would have placed first in the world (the best”” / X https://x.com/merettm/status/1968363783820353587

Our general-purpose reasoning models solved all 12 problems at the 2025 International Collegiate Programming Contest (ICPC) World Finals, the world’s top university programming competition which was enough for a 1st-place human ranking.”” / X https://x.com/OpenAI/status/1968368133024231902

1/n I’m really excited to share that our @OpenAI reasoning system got a perfect score of 12/12 during the 2025 ICPC World Finals, the premier collegiate programming competition where top university teams from around the world solve complex algorithmic problems. This would have https://x.com/MostafaRohani/status/1968360976379703569

A third of American adults use AI “many times a day to almost constantly” & another third several times a week. I can’t usefully add much to discussions of valuation bubbles, but if “bubble” means a disappointing technology that is overhyped & not useful, that doesn’t match data https://x.com/emollick/status/1968418031123501452

Demis Hassabis: calling today’s chatbots “PhD intelligences” is nonsense. They can dazzle at a PhD level one moment and fail high school math the next. True AGI won’t make trivial mistakes. It will reason, adapt, and learn continuously. We’re still 5–10 years away. https://x.com/vitrupo/status/1966752552025792739

Google Gemini is the top free iPhone app https://9to5google.com/2025/09/13/gemini-top-free-apple-app-store/

Made it to no.1 in the App Store. Congrats to the @GeminiApp team for all their hard work, and this is just the start, so much more to come!”” / X https://x.com/demishassabis/status/1966931091346125026

(1/3) Thrilled to announce a new Gemini breakthrough! Building on our success at IMO this year, an advanced version of Gemini Deep Think achieved gold-medal level performance at the ICPC 2025 World Finals – one of the world’s leading competitive programming competitions.”” / X https://x.com/quocleix/status/1968361041487904855

(2/3) Our model solved 10 out of 12 problems to achieve gold medal level. We were able to achieve this through breakthroughs in parallel thoughts, multi-step reasoning, and novel reinforcement learning techniques. You can find Gemini’s solutions here: https://x.com/quocleix/status/1968361222849642929

An advanced version of Gemini 2.5 Deep Think has achieved gold-medal level performance at the ICPC 2025 – one of the world’s most prestigious programming contests. 🏅 Building on the model’s success in math at the IMO, this marks another historic milestone for advanced AI. 🧵 https://x.com/GoogleDeepMind/status/1968361776321323420

Incredible milestone: an advanced version of Gemini 2.5 Deep Think achieved gold-medal performance at the ICPC World Finals, a top global programming competition, solving an impressive 10/12 problems. Such a profound leap in abstract problem-solving – congrats to @googledeepmind!”” / X https://x.com/sundarpichai/status/1968365605851218328

AI has officially beaten me at the ICPC World Finals. It reminds me of a rare ICPC skill: being able to quickly read a teammate’s code and spot bugs. This skill takes years to train, and explains why AI often makes coding slower (see arXiv:2507.09089). No matter how strong AI”” / X https://x.com/ZeyuanAllenZhu/status/1968568919482089764

amazing to get all 12 problems correct!”” / X https://x.com/sama/status/1968474300026859561

ICPC is a very hard and meaningful challenge:”” / X https://x.com/gdb/status/1968415631906324792

perfect score on the 2025 ICPC programming competition from our latest reasoning system:”” / X https://x.com/gdb/status/1968404060001968429

🐻Qwen3-Next just dropped on Together AI 80B parameters, 3B activated. Two models: ⚡Thinking: Outperforms Gemini-2.5-Flash-Thinking on reasoning benchmarks 🧠Instruct: Matches 235B model performance on key tasks Available now via our API 🚀 https://x.com/togethercompute/status/1966932629078634543

Qwen3 Next 80B A3B Thinking outperforms higher-cost and closed models like Gemini 2.5 Flash Thinking on benchmarks, nearing Qwen’s flagship model quality at a fraction the size. We have it ready to deploy in our model library, running on @nvidia and the Baseten Inference Stack. https://x.com/basetenco/status/1967688601640288288

📢 @Alibaba_Qwen new open-source model Qwen3-Next-80B-A3B is making waves. With a hybrid architecture & strong long-context reasoning, it’s sparking intense debate in the Zhihu community🔥 🔧 Zhihu contributor toyama nao with evalution： TLDR: A new “”gatekeeper”” for open-source https://x.com/ZhihuFrontier/status/1966415278922989813

🚨 Top 10 Open Model Leaderboard Update New open models have entered the Text Arena, and the top 10 rankings by provider have shifted for September! 🔹Qwen-3-235b-a22b-instruct from @Alibaba_Qwen holds the crown at #1 🏆 🔹Longcat-flash-chat from @Meituan_LongCat makes a strong https://x.com/arena/status/1968705194868535749

Alibaba has released Qwen3 Next 80B: an open weights hybrid reasoning model that achieves DeepSeek V3.1-level intelligence with only 3B active parameters Key takeaways: 💡 Novel architecture: First model to introduce @Alibaba_Qwen’s ‘Qwen3-Next’ foundation models, with several https://x.com/ArtificialAnlys/status/1966523300781428788

The new open-source Qwen3-Next Instruct and Thinking models put state-of-the-art long-context reasoning into the hands of everyone. We collaborated with #opensource frameworks from SGLang (@lmsysorg) and @vllm_project to enable communities to deploy Qwen3-Next across the https://x.com/NVIDIAAIDev/status/1967575419638468667

(1/n) Scheming has been a key concern in AI safety for 20+ years. It’s when an AI acts aligned while hiding true goals. New OpenAI + Apollo research found scheming in every tested frontier model, though no harmful scheming has been seen in production traffic.”” / X https://x.com/woj_zaremba/status/1968360708808278470

Today we’re releasing research with @apolloaievals. In controlled tests, we found behaviors consistent with scheming in frontier models—and tested a way to reduce it. While we believe these behaviors aren’t causing serious harm today, this is a future risk we’re preparing”” / X https://x.com/OpenAI/status/1968361701784568200

This is significant progress, but we have more work to do. We’re advancing scheming research categories in our Preparedness Framework, renewing our collaboration with Apollo, and expanding our research team and scope. And because solving scheming will go beyond any single lab,”” / X https://x.com/OpenAI/status/1968361716770816398

This OpenAI update on anti-scheming is exceptionally good for an AIco, clearing an (extremely low) bar of “”Exhibiting some idea of some problems that might arise in scaling the work to ASI”” and “”Not immediately claiming to have fixed everything already.”” https://x.com/ESYudkowsky/status/1968388335354921351

IntrEx: The first dataset for engagement modeling in educational dialogues It provides sequence-level annotations for interestingness & expected interestingness in teacher-student chats, collected from over 100 second-language learners. https://x.com/HuggingPapers/status/1967562091570827588

I think the significance of this is under-appreciated: the assumption has often been that AI agents are brittle as one failure in a chain breaks a task But this paper shows smart models are self-correcting & that small gains in accuracy lead to exponential gains in task horizons”” / X https://x.com/emollick/status/1968365586628694101

Alibaba’s WebSailor-V2: SOTA open-source web agents arrive A groundbreaking framework, powered by synthetic data and a dual-environment RL pipeline, achieves state-of-the-art results on BrowseComp & HLE. It outperforms existing open-source models and closes the gap to https://x.com/HuggingPapers/status/1968346179894235444

1/7 We’re launching Tongyi DeepResearch, the first fully open-source Web Agent to achieve performance on par with OpenAI’s Deep Research with only 30B (Activated 3B) parameters! Tongyi DeepResearch agent demonstrates state-of-the-art results, scoring 32.9 on Humanity’s Last Exam, https://x.com/Ali_TongyiLab/status/1967988004179546451

🚀 Kimi K2 Official Turbo API — 50% OFF for 30 days Code faster, ship sooner. Try it now: https://x.com/Kimi_Moonshot/status/1967829577037910427

Our engineer wrote about the thinking and technical story behind Checkpoint Engine. 👉 https://x.com/Kimi_Moonshot/status/1967923416008462785

Excited to release a preview of Moondream 3. A 9B param, 2B active MoE vision language model that makes no compromises; offering state-of-the-art visual reasoning while still retaining an efficient and deployment-friendly form factor. https://x.com/vikhyatk/status/1968800178640429496

when chatgpt said moondream wasn’t a frontier model, i took it personally”” / X https://x.com/vikhyatk/status/1968811248381784167

OpenAI claims hallucinations persist because evaluations reward guessing and that GPT-5 is better calibrated. Do results from HAL support this conclusion? On AssistantBench, a general web search benchmark, GPT-5 has higher precision and lower guess rates than o3! https://x.com/PKirgis/status/1966547382033936577

OpenAI has finally fixed their SWEBench errors and we can now finally apples to apples compare their scores over the entire 500 sample set (the fact that it took this long says alot about how much they care about SWEBench internally and maybe there’s a lesson here) https://x.com/nrehiew_/status/1967781400528245221

OpenAI just revealed that they have an internal unreleased SWE-bench-style benchmark for large ‘refactoring’ PRs, like the one mentioned here that edits 3.5k lines across 232 files. Their new model gets 51% accuracy on this benchmark. Who wants to make a public version of this? https://x.com/OfirPress/status/1967652031704994131

OpenAI’s Models Are Getting Too Smart For Their Human Teachers — The Information https://www.theinformation.com/articles/openais-models-getting-smart-human-teachers

GPT-5 is the best model for code quality out there 2 years ago, we created the world’s hardest software design quiz. Only 5 questions, multiple choice. Yet only about 3% of software engineers get them. The average score is somewhere between 2 and 3. Supposedly brilliant models https://x.com/jimmykoppel/status/1968683689421701413

Agent benchmarks lose *most* of their resolution because we throw out the logs and only look at accuracy. I’m very excited that HAL is incorporating @TransluceAI’s Docent to analyze agent logs in depth. Peter’s thread is a simple example of the type of analysis this enables,”” / X https://x.com/sayashk/status/1966550402129592738

For AI devs The main takeaway is that a simple RL recipe, plus smart context and length management, can unlock single-agent research abilities that rival multi-agent scaffolds. If you’re building agents, consider: – Limiting tools to force strategy learning. – Normalizing RL https://x.com/omarsar0/status/1966900784844730562

ReSum: Long-Horizon Web Agents Without Context Limits • Problem: ReAct hits context limits in long searches (32k tokens) • Solution: ReSum periodically compresses history → compact reasoning states • ReSumTool-30B: specialized summarizer extracts key evidence & gaps • https://x.com/arankomatsuzaki/status/1968161796642279549

The jaggedness of AI remains even as models have rapidly come to exceed human abilities in many of the hardest timed math & science contests. Yet there is much less progress on good puns. True AGI would be figure out our limits in more than calculus (sorry, but also seriously). https://x.com/emollick/status/1968447706969329718

LiveMCP-101 This paper introduces LiveMCP-101, a novel real-time evaluation framework with a benchmark designed to stress-test agents on complex, real-world tasks. It moves beyond the mock data and synthetic environments of previous works. More notes ↓ https://x.com/omarsar0/status/1966525731082768782

Hey Claude, ChatGPT, Gemini: “”I am time traveling back to the 75 BC Rome for one day. I can’t bring anything back. What is the one thing I could learn that would most advance today’s knowledge and what is one thing I could do there that would make me richest today”” Pretty good https://x.com/emollick/status/1967009330789589077

Evals now support native audio inputs and audio graders. Evaluate model audio responses, with no text transcription needed. Get started in the Cookbook guide: https://t.co/V8qD5XFNqt https://t.co/tZuaCYccnQ” / X
https://x.com/OpenAIDevs/status/1965923707085533368

Clementine just dropped an incredibly useful guide to evaluations in 2025 ✨ The key insight: we’re transitioning from testing knowledge retention to measuring practical problem-solving ability. Her framework spans core capabilities, integrated assistant tasks, adaptive”” / X https://x.com/joelniklaus/status/1968596729852231813

New SOTA on ARC-AGI – V1: 79.6%, $8.42/task – V2: 29.4%, $30.40/task Custom submissions by @jerber888 and @_eric_pang_ are now the best known solutions to ARC-AGI Both: * Are open source * Use Grok 4 * Implement program-synthesis outer loops with test-time adaptation https://x.com/arcprize/status/1967998885701538060

Fireworks passed ASIC speed! First time, GPU based inference crossed an ASIC provider. Benchmark credit to AA Model: GPT-OSS-120B Speed: 540 TPS Legend: Purple – Fireworks on B200; Orange – Groq https://x.com/lqiao/status/1967641702484807695

Last week we found an issue with SWE-Bench, allowing agents to cheat by looking at future commits. Instead of celebrating the SWE-Bench Devs for quickly fixing the issue and being transparent, the HN crowd is dunking on them and drawing wildly inaccurate conclusions about”” / X https://x.com/TacoCohen/status/1966421688846778561

ARC just published new #1 and #2 reproducible SOTA scores on our public leaderboard from @jerber888 and @_eric_pang_. And their code is now open source! My analysis below — includes suggestions for application layer AI and future research directions. New SOTA: – v1: 79.6%,”” / X https://x.com/mikeknoop/status/1967999305983381630

Frontier Models Struggle The results are revealing: even the most advanced LLMs achieve a task success rate below 60%. Performance degrades substantially as task difficulty increases, with the top model, GPT-5, scoring only 39.02% on hard tasks. https://x.com/omarsar0/status/1966525793586360384

GenExam: The first multidisciplinary text-to-image exam is now on Hugging Face This new benchmark challenges T2I models with 1,000 rigorous, exam-style prompts across 10 subjects. It comes with ground-truth images and detailed scoring for semantic correctness and visual https://x.com/HuggingPapers/status/1968527551703433595

Qwen3 Next 80B used ~100M tokens with reasoning and ~25M without reasoning to run the Artificial Analysis Intelligence Index, slightly less verbose than Qwen3 235B 2507 with reasoning, and similar to it without reasoning https://x.com/ArtificialAnlys/status/1966523306338893979

Check out Seedream 4 and Seedream 4 High Res in the Arena in Battle, Side by Side and Direct modes here: https://x.com/arena/status/1966673632069132770

including one more new feature, you can configure client idle detection with a single parameter. If this threshold is hit, the input_audio_buffer.timeout_triggered event will be fired. https://x.com/juberti/status/1968105091002667356

Towards a Physics Foundation Model Proposes GPhyT (General Physics Transformer), a large transformer trained on 1.8 TB of simulation data across fluid flows, shock waves, heat transfer, and multiphase dynamics. Here are a few key notes: https://x.com/omarsar0/status/1968681177189077366

Excited to share what friends and I have been working on at @Standard_Kernel We’ve raised from General Catalyst (@generalcatalyst), Felicis (@felicis), and a group of exceptional angels. We have some great H100 BF16 kernels in pure CUDA+PTX, featuring: – Matmul 102%-105% perf https://x.com/anneouyang/status/1967610221712519612

Congrats to @deepseek_ai ! DeepSeek-R1 was published in Nature yesterday as the cover article, and vLLM is proud to have supported its RL training and inference🥰 https://x.com/vllm_project/status/1968506474709270844

DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning | Nature https://www.nature.com/articles/s41586-025-09422-z

Nature Portfolio also addressed this on Zhihu: Publishing this paper is itself a significant milestone👏 🤔 DeepSeek-R1 learns step-by-step reasoning with minimal human help: • Reinforcement learning: correct answers get rewards, mistakes penalized • Learns to self-verify & https://x.com/ZhihuFrontier/status/1968603082167828494

The issue with the “PhD level intelligence” discussion over AI is that they are graduate-level in a widening range of valuable but narrow areas right now (examples in my post) But they are also jagged in ability, inconsistent & fail at many simple things https://x.com/emollick/status/1966897516923834597

To use GEPA (best optimizer in DSPy) you need two things 1. Hand labeled data set that’s accurate 2. Text explanation for WHY it’s accurate Here’s a worked example applied to tagging financial statements. This approach can save you both time and money. 🧵”” / X https://x.com/AsfiShaheen/status/1967866903331999807

You’re in a ML Inference engineer interview at Google, and the interviewer asks: “”What’s the real bottleneck in LLM serving throughput? How can PagedAttention help?”” Here’s how you can answer:”” / X https://x.com/athleticKoder/status/1967925267864928669

There was something deeply satisfying about ImageNet. It had a well curated training set. A clearly defined testing protocol. A competition that rallied the best researchers. And a leaderboard that spawned ResNets and ViTs, and ultimately changed the field for good. Then NLP”” / X https://x.com/DrJimFan/status/1966877464598094334

With over 4.5k dedicated votes in the Text-to-Image modality, Seedream 4 ranks at #5. 🥇Gemini 2.5 Flash Image is tied with Image 4.0 Ultra Generate for #1. 🥉GPT-Image-1 and Image 4.0 Generate Preview rank tied for #3. Check out the leaderboard details for Image Edit and https://x.com/arena/status/1966562486897029274

🚨 Leaderboard Update: With over 43k votes collected, the community has spoken! 🥈 Seedream 4 by ByteDance has landed at #2 on the Image Edit Leaderboard 🔸 It is also ranked #5 for Text-to-Image Real prompts and votes at scale illustrate sharper confidence intervals and more https://x.com/arena/status/1966562484506230922

🚨New Model update before the weekend 📣 By popular demand, we’ve added a “”High Res”” version of Seedream 4 that supports an output at 4096×4096 dimensions. We’ll see how this version of Seedream 4 stacks up vs. all the other top Image generation models soon. https://x.com/arena/status/1966673628327801255

Prompt for Nano Banana / Seedream: “Imagine what an entity sees that exists outside of time in a higher dimension that can concurrently visualize everything that has ever happened or will ever happen when looking at [insert point of interest]. Now generate that image projected https://x.com/bilawalsidhu/status/1966191138530013661

Our lightweight open-source eval library “”lighteval”” now ships with 7,000+ (!!) benchmarks baked in. Running it locally is literally a one-liner: >> lighteval vllm “”model_name=gpt2″” “”leaderboard|truthfulqa:mc|0″” (there is also a Python API for in/post-training evals ofc)”” / X https://x.com/Thom_Wolf/status/1967926861889163304

Disaggregated Inference at Scale with #PyTorch & #vLLM: Meta’s vLLM disagg implementation improves inference efficiency in latency & throughput vs its internal stack, with optimizations now being upstreamed to the vLLM community. 🔗 https://x.com/PyTorch/status/1966546293733437799

We are learning from OpenAI and Anthropic about how people use AI for work. It is primarily for high-level tasks – critical thinking, the interpretation of information, getting/giving advice & being creative (both companies categorize a little differently, but similar patterns) https://x.com/emollick/status/1967800804301283452

HunyuanImage 2.1 is the new leading open weights text to image model from @TencentHunyuan , surpassing HiDream-I1-Dev and Qwen-Image in the Artificial Analysis Image Arena! HunyuanImage 2.1 is the latest release from Tencent – a 17B DiT text-to-image model natively supporting https://x.com/ArtificialAnlys/status/1967800071115903358

First test of MLX batch generation PR on Mac Studio M3 Ultra 512GB with Qwen3-1.7B (4K ctx, 64 tokens) 🔥 Batch generation = WOW bf16 vs 4bit (avg of 3 runs) Batch of 1 → 127 vs 237 t/s 5 → 365 vs 515 t/s 10 → 556 vs 625 t/s 15 → 672 vs 617 t/s MLX vllm not a dream anymore! https://x.com/ivanfioravanti/status/1966903782400545196

LM Studio now supports Qwen3-Next with MLX on Mac! 🧵 https://x.com/lmstudio/status/1967985102845366280

Woah, 66 tok/s on a Macbook M4 Max 64GB with qwen3-next-80b-a3b-instruct-mlx@4bit, which uses about 41GB. Amazing job to the folks working on MLX, aware of at least these guys: @ivanfioravanti @ActuallyIsaak @awnihannun https://x.com/rwojo/status/1967767157250592899

Check out the actual speed (not yet the final version) of Qwen3-Next-80B-A3B-Instruct on Apple MLX! 🔥 4-bit: 67 TPS 8-bit: 58 TPS bf16: 48 TPS Movie normal speed, only waiting times removed. @awnihannun and @ActuallyIsaak did it and I bet there is still room for improvement 💪 https://x.com/ivanfioravanti/status/1966866942461177925

@Alibaba_Qwen Massive efficiency gains for long contexts. 262K context native, extensible to 1M+ tokens. Perfect for: ⚡ Repository-scale code analysis 🧠 Complex reasoning tasks 📄 Long document processing Both models available now → Instruct: https://x.com/togethercompute/status/1966933240683319556

We’re announcing a major advance in the study of fluid dynamics with AI 💧 in a joint paper with researchers from @BrownUniversity, @nyuniversity and @Stanford. https://x.com/GoogleDeepMind/status/1968691852678173044

Learning the natural history of human disease with generative transformers | Nature https://www.nature.com/articles/s41586-025-09529-3

Big day for AI agents! Tongyi Lab (@Ali_TongyiLab) just dropped half a dozen new papers, most focused on Deep Research agents. I’ll walk you through the highlights in this thread. (1/N) https://x.com/arankomatsuzaki/status/1968161775712620628

Prefix cache-aware routing is now available in Ray 2.49 🚀 Scaling input token-heavy workloads (like multi-turn convos & agent loops) requires maintaining prefix cache hit rate across 100s of vLLM engine replicas, and PrefixCacheAffinityRouter makes it easy. Here’s how it https://x.com/seiji_________/status/1967639835381993488

RL done right is no joke! The most interesting AI paper I read this week. It trains a top minimal single-agent model for deep research. Great example of simple RL-optimized single agents beating complex multi-agent scaffolds. Now let’s break it down: https://x.com/omarsar0/status/1966900691009720455

We managed to save >50% VRAM for multimodal RL in @UnslothAI with vLLM weight sharing! GSPO, Dr GRPO and much longer contexts are also possible! Our notebook shows how you can design good reward functions to teach a VLM to answer hard maths and logic questions with RL Currently”” / X https://x.com/danielhanchen/status/1967993163500622266

I have heard of several folks using torchtitan internally for RL training. However, torchtitan doesn’t directly support GRPO, which means folks are adding an implementation themselves. A few questions: 1. Are there any good open-source torchtitan forks with GRPO support? 2. What”” / X https://x.com/iScienceLuvr/status/1968509941578338560

🚀 Big news: we’re moving towards the v5 release of transformers! After months of teasing, it’s finally happening 🎉 What to expect in v5: ✨ Cutting-edge stack — fast models, with fast kernels ✨ Smarter defaults — better out-of-the-box experience ✨ Cleaner codebase —”” / X https://x.com/art_zucker/status/1966470835558093226

(ofc online rl has been around for a while, but i haven’t seen it much with language models. i don’t see why we can’t move from months -> weeks -> hours for continuous training of production chat assistants)”” / X https://x.com/willdepue/status/1966878536247243260

has anyone published a good RL paper since GRPO, or was the last one before the onset of RL slop”” / X https://x.com/vikhyatk/status/1967375151638716810

Important insights on long-horizon tasks in LLMs ⬇️ (from a recent paper “”The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs””) – Tiny boosts in step accuracy → exponential gains in how long a model runs without failing. – Failures of LLMs come from https://x.com/TheTuringPost/status/1967374791700369451

New LLM releases are now announced as pull requests to the Transformers repo.”” / X https://x.com/lvwerra/status/1966451134727352326

QoL update: We now display the total repo size on the repo itself! 👀 This has been an ask from the community for a long time, thanks to @mishig25 it’s available to everyone now! 🤗 https://x.com/reach_vb/status/1968614454725075443

SearchInstruct: Data-Efficient SFT for LLM Domain Adaptation A new framework for high-quality instruction dataset creation. It expands human questions & retrieves domain resources for precise answers. Significantly boosts LLM performance and facilitates model editing. https://x.com/HuggingPapers/status/1967983770717335804

SimpleVLA-RL Scaling VLA Training via Reinforcement Learning https://x.com/_akhaliq/status/1966883040627769511

SpikingBrain Technical Report: Spiking Brain-inspired Large Models https://arxiv.org/pdf/2509.05276

The illusion of diminishing returns in LLMs is just that—an illusion Our new paper reveals that marginal gains in single-step accuracy compound into exponential leaps for long-horizon tasks. We identify “”self-conditioning”” on errors as a key barrier, overcome by thinking models. https://x.com/HuggingPapers/status/1967440503189754190

Training long-context LLMs is getting easier! TRL now supports Context Parallelism (CP), letting you scale sequences across multiple GPUs, even multi-node setups, seamlessly 💆 Combine TRL and accelerate to run it effortlessly! https://x.com/SergioPaniego/status/1967974475892510820

we’ve been pushing commits to transformers discretely, time to talk about we’ve been cooking the last few months: ⚡️ Continuous Batching is in transformers ⚡️ this will simplify, most notably, evaluation and your training loop: no need for extra dependencies or infra to get https://x.com/LucSGeorges/status/1966550465769775305

The new @code release dropped yesterday – check out our latest video from @JamesMontemagno telling you exactly what you need to know! https://x.com/code/status/1966546512717946894

Make your ZeroGPU Spaces go brrr with ahead-of-time compilation https://x.com/RisingSayak/status/1966447207688569028

In the past 2 weeks 7 new model architectures were added to MLX LM. Of those 7, 6 are MoEs. Of those 6 MoEs, 3 are hybrid SSM / attention models. Architectures change slowly then suddenly.”” / X https://x.com/awnihannun/status/1966936728469729546

It turned out that model collapse didn’t happen. I think there are many reasons to be skeptical of AI lab claims (and point out bad predictions & watch for bubbles) but I also think it is worth reflecting that “”AI development is going to stop”” arguments have been wrong so far.”” / X https://x.com/emollick/status/1967301317145296954

lighteval supports MMMU 🤠💗”” / X https://x.com/mervenoyann/status/1967854864098361786

Many people think LLMs are non-deterministic. This is often not true! You just need 3 lines of code to make your LLM deterministic LLMs (as any PyTorch model) are non-deterministic only when they include certain operations or when using multiple GPUs Try the code yourself https://x.com/gabriberton/status/1968559505966350705

MLX batch generation is gold! A full MMLU Pro on M3 Ultra 512 with 8192 max tokens in 6 hours and 40 minutes! Instead of 30+ hours! 🔥🔥🔥 https://x.com/ivanfioravanti/status/1967229451806318904

MoEs are too popular for us not to invest in them! (even tho I must admit… I personally HATE MoEs, that’ s another story) ⚡️ We are refactoring all moes in `transformers` to use kernels natively! ⚡️ Some of the benches are quite incredible! https://x.com/art_zucker/status/1967923948999618961

New: Batch Inference API updates 🚀 • Intuitive UI • Works with all models • 3000× higher rate limits (30B tokens) • 50% discount on most serverless models The easiest way to run massive AI workloads. Try it today or read more below. https://x.com/togethercompute/status/1967624765625315393

probably beating a dead horse being the #efficiency guy but token parsimony is now a key battle ground for koding models. I dont fully endorse the below as my personal experience, but obviously developers are going to prefer the model that sips or spends tokens according to https://x.com/swyx/status/1967662188962910709

solid in-depth explanation of paged attention in this blog. https://x.com/novasarc01/status/1966413957679428054

Some perf related must-reads: • How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: https://x.com/fleetwood___/status/1968716580621271076

We have spiced up the post, including 2 new things 🔥 * Regional compilation with AOT – slay the compilation times like a pro 🔪 * Share and load precompiled graphs from the Hub 🤗 Both of these will help you attain speed fast and easy with ZeroGPU 🚀 Links ⬇️ https://x.com/RisingSayak/status/1966447203381092675

With the latest @code release, the Language Model Chat Provider extension API is finalized – what does that mean for you? Models can now be contributed via extensions, so with a single install, you get more model choice via Bring Your Own Key. Here are some extensions you can”” / X https://x.com/code/status/1966638511269794238

With Together Instant Clusters, you can keep responses fast during launch‑day spikes. ⚡🚀 https://x.com/togethercompute/status/1968661658617692379

Paper2Agent brings research papers ‘to life.’ This open tool from @Stanford transforms static papers into interactive AI assistants that can explain and apply their methods. It builds on the MCP and works in 2 layers: – Paper2MCP: Extracts the paper’s methods and code into an https://x.com/TheTuringPost/status/1968829219858956774

This thread misses the point. Evals are not a scam. They’re misunderstood. Here’s how I think about evals ➡️ Logging is not evals. That’s like saying completing an exam is the same as getting graded on the exam. Imagine you take an exam. You start by quizzing yourself. You run”” / X https://x.com/rebeccatqian/status/1967758557174174027

Why Agents Fail The paper provides a fine-grained failure analysis, identifying seven common error types: ignoring requirements, overconfident self-solving, unproductive thinking, wrong tool selection, syntactic errors, semantic errors, and output parsing errors. Paper: https://x.com/omarsar0/status/1966525809302417436

🔋 Naveen Rao is leaving Databricks (a $100Bn startup) to build a next generation computer to shrink AI compute costs, and Databricks plans to invest. Databricks sits around $100B and just raised $1B. A core problem for the AI industry is that large models are limited by memory https://x.com/rohanpaul_ai/status/1966378718009635087

SAPO – Swarm sAmpling Policy Optimization – is a new RL training method by @gensynai. It works in a decentralized “swarm” of computers instead of synchronized GPU clusters: – Each computer (node) trains its own model – Nodes share rollouts with others in plain text – Any device https://x.com/TheTuringPost/status/1967575689844166834

Build an LLM from scratch https://x.com/rasbt/status/1966876565788135837

1. Model agnostic. 2. Inference agnostic. & now, 3. Platform agnostic. Cline for JetBrains is here. (install it below) https://x.com/cline/status/1968360125686759505

We just merged LeRobotDataset v3.0 + Streaming 🔥 A important change of our dataset format: > 📦Chunked episodes for massive scale (OXE-level) > 📽️ Efficient video storage + streaming > ⚡️ Faster loading > 📈Unified parquet metadata (no more scattered JSONs) This makes LeRobot”” / X https://x.com/LeRobotHF/status/1967985390117343737

This is Ray3. The world’s first reasoning video model, and the first to generate studio-grade HDR. Now with an all-new Draft Mode for rapid iteration in creative workflows, and state of the art physics and consistency. Available now for free in Dream Machine. https://x.com/LumaLabsAI/status/1968684330034606372