Image created with gemini-2.5-flash-image and claude-sonnet-4-5. Image prompt: Photorealistic 35mm cinema shot of child aged 6-8 viewed from side angle sitting on plush bedroom rug surrounded by panoramic arc of glowing TV screens, warm domestic lighting with cool blue screen glow, scattered disassembled circuit boards and Raspberry Pi components on floor, soldering iron and tech manuals mixed with newspapers, one screen showing code or chip schematics, shallow depth of field, soft focus, cozy yet subtly technical atmosphere, bold text ‘TECH’ at top of frame, warm pastels with LED blue highlights
Sonnet 4.5 was underestimated on METR; its time horizon improves by around 20 minutes https://x.com/scaling01/status/2001476927362605354
We’re working on updating and improving our time horizon task suite. Recently, we found two issues with our tasks, one of which was differentially lowering the performance of Claude models. We think these also illustrate some interesting model behavior. https://x.com/METR_Evals/status/2001473506442375645
All the frontier AIs now pass all levels of the very challenging Chartered Financial Analyst (CFA) exam. The paper used paywalled, new mock exams to reduce the risk of leakage, but relied on AI grading for the essays. Interestingly, prompting strategy doesn’t matter for most question types https://x.com/emollick/status/2000605774695837711
BREAKING: OpenAI releases “GPT-Image-1.5” (ChatGPT Images) & it instantly takes the #1 spot on LMArena, beating Google’s Nano Banana Pro : r/singularity https://www.reddit.com/r/singularity/comments/1po98xo/breaking_openai_releases_gptimage15_chatgpt/
NVIDIA Debuts Nemotron 3 Family of Open Models | NVIDIA Newsroom https://nvidianews.nvidia.com/news/nvidia-debuts-nemotron-3-family-of-open-models
GPT-5.2 is here and it’s the best model out there for everyday professional work. On GDPval, the thinking model beats or ties human experts on 70.9% of common professional tasks like spreadsheets, presentations, and document creation. It’s also better at general intelligence. https://x.com/fidjissimo/status/1999183159356006450
Today I ran two complex tasks through Codex with GPT-5.2 Extra High. The first ran for 2 hours 30 minutes; the second ran for 1 hour 45 minutes. Both resulted in: all acceptance criteria resolved, all test coverage complete, zero broken or non-working code. Amazing. https://x.com/nummanali/status/2000228337030152347
Whoa. This new GDPval score is a very big deal. Probably the most economically relevant measure of AI ability suggesting that in head-to-head competition with human experts on tasks that require 4-8 hours for a human to do, GPT-5.2 wins 71% of the time as judged by other humans https://x.com/emollick/status/1999189828756263359
GPT Image 1.5 achieves both #1 in Text to Image and Image Editing in the Artificial Analysis Image Arena, surpassing Nano Banana Pro GPT Image 1.5 is OpenAI’s newest flagship image generation model, demonstrating improved image quality and prompt fidelity relative to earlier https://x.com/ArtificialAnlys/status/2001016199094948185
GPT Image 1.5 is now available in the API: ✏️ More precise image editing and preservation of logos & faces 🎯 Better instruction following and adherence to prompts 🔤 Improved text rendering, particularly for denser and smaller text Learn more in docs: https://x.com/OpenAIDevs/status/2000992413402456485
Introducing ChatGPT Images, powered by our flagship new image generation model. – Stronger instruction following – Precise editing – Detail preservation – 4x faster than before Rolling out today in ChatGPT for all users, and in the API as GPT Image 1.5. https://x.com/OpenAI/status/2000990989629161873
The Image Arena is buzzing 👀 @OpenAI’s GPT-image-1.5 is live and already shaking up the leaderboard. Watch it in action below, then try your own prompt and share what you create 👇🎨 https://x.com/arena/status/2001014708254773549
The new ChatGPT Images is here | OpenAI https://openai.com/index/new-chatgpt-images-is-here/
This is the biggest jump in Image Arena that we’ve seen since Nano Banana GPT-Image-1.5 has taken #1 on Image Arena with a significant lead Huge congratulations to the team at @OpenAI for this achievement! https://x.com/grx_xce/status/2000993261914350070
Reasoning Models Ace the CFA Exams https://arxiv.org/pdf/2512.08270
xAI’s new Grok Voice Agent is the new leading Speech to Speech model, surpassing Gemini 2.5 Flash Native Audio and GPT Realtime in our Big Bench Audio benchmark The new model achieves a score of 92.3% on Big Bench Audio, just ahead of the previous leader, Google’s Gemini 2.5 https://x.com/ArtificialAnlys/status/2001388724987527353
🎥 Kling 2.6 Motion Control Feature Is Now Live! To celebrate the launch of Kling 2.6 Motion Control Feature, we’re kicking off a new contest – and the prizes are one post away from you! 🔥 Show us your creative power with Kling 2.6 Motion Control Feature – The Kling 2.6 Motion https://x.com/Kling_ai/status/2001891240359632965
🎥 Kling 2.6 Voice Control Feature Is Now Live! To celebrate the launch of Kling 2.6 Voice Control Feature, we’re kicking off a new contest – and the prizes are one post away from you! 🔥 Show us your creative power with Kling 2.6 Voice Control Feature – Use your signature voices https://x.com/Kling_ai/status/2001198609115628029
🚀 Motion Control, Leveled Up Newly upgraded Motion Control is now live in Kling VIDEO 2.6! Experience precise, full control over every action & expression ✅ Full-Body Motions — Body movements captured in stunning detail ✅ Fast & Complex Actions — From martial arts to https://x.com/Kling_ai/status/2001306445262823431
🚨 Kling O1 Video Standard is here on fal! 🎬 Same powerful editing model, 720P mode ✨ Start & end frame control for precision 🎯 3-10 second range for flexible videos 💰 Faster generation, lower cost https://x.com/fal/status/2000590369545744599
🚨Video Leaderboard Updates Kling 2.6 Pro by @kling_AI and the new Kandinsky 5.0 open models by @kandinskylab have now landed on the Video Arena leaderboard. Kling 2.6 Pro delivers a major 16-point jump over Kling-2.5-turbo-1080p. While Kandinsky 5.0 enters strong, taking the https://x.com/arena/status/1999530939886768205
A new prompt unlock? Multiple gliding rack-focus shots through a cyberpunk nightclub. Yes, the characters in close-up are prompted; prompt shared in a later post. Not keyframes. Created in @Kling_ai 2.6 image-to-video. 🔊🔊🎧 https://x.com/StevieMac03/status/2002001196383391813
Do you want to create ultra-dynamic action animations with @Kling_ai 2.6? 🎬⚡️ After testing many prompts, I’ve noticed what works best. And here’s the key. 👉 What usually gives the best results is starting the prompt with “High-speed anime battle.” Other combinations that https://x.com/Artedeingenio/status/2001960379610767835
Kling 2.6 “Motion Control” tested on dance videos: full-body steps and weight shifts look natural, and hair tracking performance is excellent. Dance and action content like this seems to be where the feature plays to its strengths ✨ https://x.com/genel_ai/status/2001532885673873677
Oh my… Kling just dropped the next era of motion control. Kling VIDEO 2.6 can copy any action with perfect lip-sync, lifelike motion and expressive gesture. It outperforms Wan 2.2-Animate, Act-Two and DreamActor 1.5 across all metrics. More examples below. https://x.com/AngryTomtweets/status/2001569619375698199
Quick test of Kling 2.6 Motion Control Shall I keep going? 😭 https://x.com/blizaine/status/2001849003819098168
Your frames. Your timing. Kling VIDEO O1 now supports Start & End Frames generation with freely selectable durations from 3–10s, giving you smoother transitions and more control over pacing. From fast, high-impact moments to fully immersive cinematic shots–your story moves the https://x.com/Kling_ai/status/2000581619556421673
How good is AI for science? Yesterday, OpenAI released a benchmark, FrontierScience, to measure frontier model performance on scientific tasks. This is the most sophisticated benchmark for science I’ve seen. FrontierScience has 160 questions across various subdomains, https://x.com/jungofthewon/status/2001302379527114798
⚖️ Pairwise Annotations: Scores are hard, preferences are easy. Agents handle tasks that are tough to score but easy to compare: support responses where tone matters, code refactors where both work but one feels cleaner, product specs where “good” is subjective. In practice, https://x.com/LangChain/status/2001361753851203724
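The pairwise-preference idea above can be made concrete: given only win/loss judgments between outputs, a Bradley–Terry fit recovers per-item scores without anyone ever assigning an absolute grade. A minimal sketch (the item names and vote counts are invented for illustration):

```python
def bradley_terry(items, comparisons, iters=200):
    """Fit Bradley-Terry strengths from pairwise wins via the MM update.

    comparisons: list of (winner, loser) pairs.
    Returns dict item -> strength (higher = more preferred), normalized to sum 1.
    """
    strength = {it: 1.0 for it in items}
    wins = {it: 0 for it in items}
    for w, _ in comparisons:
        wins[w] += 1
    for _ in range(iters):
        new = {}
        for it in items:
            denom = 0.0
            for a, b in comparisons:
                if it in (a, b):
                    other = b if it == a else a
                    denom += 1.0 / (strength[it] + strength[other])
            # MM update: wins / sum of pairwise "exposure" terms.
            new[it] = wins[it] / denom if denom else strength[it]
        total = sum(new.values())
        strength = {it: v / total for it, v in new.items()}
    return strength

# Hypothetical judge verdicts: refactor_a beats refactor_b 3 times, loses once.
votes = [("refactor_a", "refactor_b")] * 3 + [("refactor_b", "refactor_a")]
scores = bradley_terry(["refactor_a", "refactor_b"], votes)
assert scores["refactor_a"] > scores["refactor_b"]
```

This is the same model family behind arena-style leaderboards: many noisy pairwise votes, one latent strength per model.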
AI concepts you HAVE to know about at the end of 2025 – Reinforcement learning – RLHF variations: DPO, RRHF, RLAIF – Continual learning – Test-time scaling – Neuro-Symbolic AI – Hardware that powers AI: GPU, CPU, TPU, ASICs, APU, NPUs and others – Robotics Find everything from https://x.com/TheTuringPost/status/2001441981780890063
Concepts and Methods you HAVE to Know About -> AI 101 Recap
https://www.turingpost.com/p/2025-concept-method-recap
Measuring AI Ability to Complete Long Tasks – METR https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/
The updated time horizon numbers are live on the dashboard on our website: https://x.com/METR_Evals/status/2001473519197335899
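METR's time-horizon metric works roughly like this: fit a logistic curve of task success probability against log task length, then solve for the length at which predicted success falls to 50%. A toy sketch with invented data points (not METR's actual tasks, numbers, or fitting code):

```python
import math

def fit_horizon(lengths_min, successes, lr=0.1, steps=20000):
    """Logistic regression of success on log2(task length in minutes);
    returns the length at which predicted success probability is 50%."""
    xs = [math.log2(t) for t in lengths_min]
    a, b = 0.0, 0.0  # success ~ sigmoid(a + b * x)
    n = len(xs)
    for _ in range(steps):
        ga = gb = 0.0
        for x, y in zip(xs, successes):
            p = 1 / (1 + math.exp(-(a + b * x)))
            ga += (y - p)          # gradient of log-likelihood wrt a
            gb += (y - p) * x      # gradient wrt b
        a += lr * ga / n
        b += lr * gb / n
    # sigmoid crosses 0.5 where a + b * x = 0, i.e. x = -a / b
    return 2 ** (-a / b)

# Invented outcomes: reliable on short tasks, unreliable on long ones.
lengths = [1, 2, 4, 8, 15, 30, 60, 120, 240, 480]
success = [1, 1, 1, 1, 1, 1, 0, 1, 0, 0]
h = fit_horizon(lengths, success)
assert 20 < h < 300  # 50% horizon lands inside the mixed-outcome region
```

The headline "time horizon" numbers in the links above are the 50% points of curves like this, fit over many tasks per model.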
Best AI research of the week: ▪️ On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning LMs ▪️ Native Parallel Reasoner ▪️ Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving ▪️ DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent https://x.com/TheTuringPost/status/2000874193249034463
The Adoption and Usage of AI Agents: Early Evidence from Perplexity https://arxiv.org/pdf/2512.07828
InternGeometry: An LLM agent tackles Olympiad-level geometry. This novel agent solves 44 of 50 International Math Olympiad problems, beating gold medalists with only 13K training examples. It uses iterative reasoning & Complexity-Boosting RL. https://x.com/HuggingPapers/status/1999572332906438987
Inference Economics 101: Reserved Compute versus Inference APIs https://www.datagravity.dev/p/inference-economics-101-reserved
NEW Research from Apple. When you think about it, RAG systems are fundamentally broken. Retrieval and generation are optimized separately, retrieval selects documents based on surface-level similarity while generators produce answers without feedback about what information is https://x.com/omarsar0/status/2000570838920434037
All the most recent models now do this right first try. https://x.com/emollick/status/1999960137386361093
Can AI reviewers catch real bugs without flooding PRs? Akshay Utture, Applied AI Engineer at @augmentcode, and his team benchmarked 7 AI code review tools on large open-source projects. Here are the results: ▪️ They saw the same pattern: Missed issues came from missing https://x.com/TheTuringPost/status/1999619297057112275
Honestly weird that the frontier models do not diverge that much in terms of abilities, prompt adherence, and other factors. Whether you pick any of the big American closed source models or the Chinese and French open models, they are all very similar to each other, and have been https://x.com/emollick/status/1999712938861674798
I see multiple QTs saying “train on test” But the way I understand it, I don’t think he is doing anything wrong? And this does not look like the classic “oops i trained on test” to me? Arc-agi is a meta-learning benchmark, but they don’t like to call it that. – On the left, he https://x.com/giffmana/status/2002111246225621296
Individual results across the 10 evals we run independently for the Artificial Analysis Intelligence Index: MMLU-Pro, GPQA Diamond, Humanity’s Last Exam, LiveCodeBench, SciCode, AIME 2025, IFBench, AA-LCR, Terminal-Bench Hard, 𝜏²-Bench Telecom https://x.com/ArtificialAnlys/status/2001335963952521243
Our lack of any reliable measures of human error rates across intellectually demanding tasks and fields is a huge hindrance to understanding the thresholds of hallucination and reliability that AI might cross incrementally that could lead to sudden leaps in usefulness & adoption. https://x.com/emollick/status/2001310462890160443
SimpleBench results extremely disappointing for GPT-5.2 GPT-5.2 scores below Sonnet 3.7, an almost 1 year old model GPT-5.2 Pro doesn’t fare much better, barely beating GPT-5 https://x.com/scaling01/status/1999466846563762290
Text to Image Leaderboard | Artificial Analysis https://artificialanalysis.ai/image/leaderboard/text-to-image
we desperately need new and better benchmarks I think I need to sit down another 2 hours with Opus 4.5 and cook up a LisanBench follow up But I really want to see more benchmarks on complex games (well ARC-AGI-3 is already going in that direction of dynamic environments) I https://x.com/scaling01/status/1999321464319754290
Were you right a year ago? Let’s see! Revisiting our early-2025 predictions Last December, we made a bold bet: 2025 would be the “Year of Inference-Time Search.” Looking back, that prediction defined the entire year. ⬇️ 1. The Big Win: The “Thinking” Shift @fchollet nailed https://x.com/TheTuringPost/status/1999097028023062937
would like to clarify this this work is actually _very interesting_ for exactly the reasons listed in the community note: 1) you can train purely on arc agi train set and get a new “pareto frontier” for arc agi 2) the cost of doing (1) is so low that it’s effectively ~free to https://x.com/suchenzang/status/2002100653049753901
Zoom AI sets new state-of-the-art benchmark on Humanity’s Last Exam | Zoom https://www.zoom.com/en/blog/humanitys-last-exam-zoom-ai-breakthrough/
GDPval-AA Leaderboard: https://x.com/ArtificialAnlys/status/1999404589049872615
@OpenAI Super cool to see the eval on the Hugging Face hub too – OPEN SOURCE EVALS FTW! 🔥 https://x.com/reach_vb/status/2000982838171328882
Important new eval! https://x.com/sama/status/2000980694588383434
Tinker is now open to everyone! We are also adding: – Vision support with Qwen3-VL – New model: Kimi K2 Thinking (1T params) – OpenAI API-compatible inference Start training models within minutes: https://x.com/dchaplot/status/1999543675765031289
Turn any Autoregressive LLM into a Diffusion LM. dLLM is a Python library that unifies the training & evaluation of diffusion language models. You can also use it to turn ANY autoregressive LM into a diffusion LM with minimal compute. 100% open-source. https://x.com/akshay_pachaar/status/2001562985043783908
We are deeply grateful for the message and example set by AI research leaders who have pledged $1M through this public letter supporting OpenReview. Their encouragement strengthens the infrastructure behind open scientific dialogue and peer review innovation. Others are welcome https://x.com/openreviewnet/status/2001837352692675007
we’re open-sourcing a new frontier science eval in biology, chemistry, and physics. there are 2 tracks: olympiad level and advanced research level. as models become saturated on GPQA, this is a nice unsaturated alternative with clean test-time compute scaling. kudos to https://x.com/tejalpatwardhan/status/2000982763500175683
💡 LMArena Deep Dive: DeepSeek v3.2 (Text Arena) Leaderboard rank doesn’t always tell the full story. As previously reported, DeepSeek released v3.2 two weeks ago. Its results varied across categories and, overall, ranked lower than earlier v3.1 and v3.2 Experimental versions. https://x.com/arena/status/2000637978662821942
We spun up a new GitHub repo for all things MCP at @Google. Get info on our remote managed MCP servers, open source MCP servers, examples, and learning resources. https://x.com/rseroter/status/2000607267675410609
New benchmark from Google Research. Models get better at benchmarks, but do they actually get more factual? Previous evaluations focused on narrow slices: grounding to documents, answering from memory, or using search. A model excelling at one often fails at another. This new https://x.com/omarsar0/status/2000935220049273303
after testing GPT-5.2 I no longer think that it is a much larger model or anywhere near the size Gemini 3 Pro is https://x.com/scaling01/status/1999566015873569174
🚨BREAKING: Leaderboard updates for Text, Vision & WebDev Gemini-3-Flash by @GoogleDeepMind is now ranked top 5 across Text, Vision, and WebDev, making it the most cost-efficient frontier model (input $0.5 and output $3/MTokens). Gemini-3-Flash highlights: 🔹 Top 5 across Text, https://x.com/arena/status/2001322123730788698
Gemma Scope 2: Helping the AI Safety Community Deepen Understanding of Complex Language Model Behavior – Google DeepMind https://deepmind.google/blog/gemma-scope-2-helping-the-ai-safety-community-deepen-understanding-of-complex-language-model-behavior/
Introducing Gemma Scope 2 🤗Largest open release of interpretability tools (over 1 trillion parameters trained!) 🔬Works as a microscope to analyze all Gemma 3 models’ internal activations 🗣️Advanced tools for analyzing chat behaviors https://x.com/osanseviero/status/2001989567998836818
FunctionGemma has day-0 support on MLX 🔥🚀 A tiny but mighty single-turn function calling model. Great for on-device tool use, MCP, RAG, routing and more. Get started today: > pip install -U mlx-lm Or run it on your iPhone using MLX-Swift. Notebook example: https://x.com/Prince_Canuma/status/2001713991115026738
I am asking, once again, for @GoogleDeepMind to provide benchmarks for different thinking levels. If they’re giving me low, medium, and high thinking level parameters, I wanna know as a builder how they compare. I don’t think that’s too much to ask @OfficialLoganK https://x.com/RobertHaisfield/status/2001327612887785904
🖼️🚨 Image Leaderboard Update Competition in the Arena continues to drive leaderboard movement, with Flux-2-Max making a competitive debut. 🔹 #3 on Text-to-Image (1167) 🔹 #7 on Image Edit (1247) The Text-to-Image leaderboard tightens as Flux-2-Max slots ahead of https://x.com/arena/status/2000947088738431408
🖼️🚨 Image Leaderboard Update Competition in the Arena continues to shake up the leaderboards. Flux-2-Dev lands on the board with solid early results. 🔹 #7 on Text-to-Image (1149) 🔹 #8 on Image Edit (1240) Margins remain slim on the Text-to-Image leaderboard, where https://x.com/arena/status/1999560495867793881
🚨 FLUX.2 [max] live on fal! ✨ Black Forest Labs’ top-tier: quality + edit consistency 🎯 Better than FLUX.2 [pro], easier prompting 🎨 Consistent edits: characters, objects, styles, backgrounds 💡 Most creative FLUX model: same prompt, varied outputs that still follow https://x.com/fal/status/2000945229977829784
🚀 The GeoAI QGIS Plugin is here 🔥 You can run Moondream vision-language models, object detection, image segmentation (SAM 3), and even train your own geospatial segmentation model end-to-end. Website: https://x.com/giswqs/status/1999536028282179721
GPT Image 1.5’s IQ is far behind Nano Banana Pro. It fails the math problem here (left: GPT, right: 🍌), as well as other math/physics/maze problems. Nano Banana Pro is a multimodal model built on Gemini 3 Pro. I suspect GPT Image 1.5 is still stuck on the older GPT-4o architecture. https://x.com/Yuchenj_UW/status/2001023040763920870
GPT-5.2 below Opus 4.5 and Gemini 3 Pro on LiveBench https://x.com/scaling01/status/1999323401421488319
GPT-5.2 scores 152 on the Epoch Capabilities Index (ECI), our tool for aggregating benchmark scores. This puts it second only to Gemini 3 Pro. 🧵 with individual scores. https://x.com/EpochAIResearch/status/1999548496198926728
GPT-5.2 xhigh doing better than Gemini 3 Pro on MRCR long context eval https://x.com/scaling01/status/1999327512401527107
Autoregressive generation can be seen as a special case of block diffusion where the block size is just one token. @PKU1898 and @huaweitechnolgy presented a gradual way for this autoregressive (AR) → block-diffusion transition: To make it work, they: – Use an attention pattern https://x.com/TheTuringPost/status/2001697220387913818
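The claim that autoregressive decoding is a special case of block diffusion can be seen directly in a generation-loop sketch: shrink the block to one token and the loop becomes ordinary token-by-token decoding. The `denoise_block` callable below is a hypothetical stand-in for a model, not the paper's API:

```python
def generate(prompt, denoise_block, block_size, n_blocks):
    """Block-diffusion style generation: produce one block at a time,
    each block conditioned on everything generated so far.
    With block_size == 1 this loop is exactly autoregressive decoding:
    each 'block' is a single token sampled given the prefix."""
    seq = list(prompt)
    for _ in range(n_blocks):
        # Denoise (i.e. generate) the next block given the current prefix.
        block = denoise_block(seq, block_size)
        seq.extend(block)
    return seq

# Toy stand-in "denoiser" for illustration: repeats the last token k times.
toy = lambda prefix, k: [prefix[-1]] * k

# block_size=1 behaves like plain AR decoding, one token per step.
assert generate([1, 2], toy, block_size=1, n_blocks=3) == [1, 2, 2, 2, 2]
```

The AR → block-diffusion transition described in the tweet amounts to growing that block size while adapting the attention pattern accordingly.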
MiniMax (Hailuo) Video Team Has Open Sourced VTP (Visual Tokenizer Pre-training)! VTP is a scalable pre-training framework for visual tokenizers, built for next-gen generative models. It challenges the conventional belief in Latent Diffusion Models that scaling the stage-1 https://x.com/MiniMax_AI/status/2000935213506171197
🚀🚀🚀 We’re excited to support @NVIDIA and their new open family of models: NVIDIA Nemotron 3! Open in weights, data, tools, and training, Nemotron 3 is built for multi-agent apps and features: ⚡️An efficient hybrid Mamba‑Transformer MoE architecture 🧾1M token context for https://x.com/vllm_project/status/2000623058076492276
Agent demos often fail for reasons that are hard to see: unclear tool traces, silent failures, and changes that improve one behavior but break another. Our new course with @Nvidia shows how to use their NeMo Agent Toolkit to surface these issues with OpenTelemetry tracing, run https://x.com/DeepLearningAI/status/2001329113622073611
Baseten supports @nvidia Nemotron 3 Nano on day zero Up to 4× faster token generation, high accuracy, and predictable inference built for agentic AI. Available to deploy today on Baseten for high-performance inference. Read more here: https://x.com/basetenco/status/2000582868532121688
Introducing NVIDIA Nemotron 3 Nano, a fully open 30B with 3B active parameter hybrid MoE model engineered for maximum efficiency and benchmark-leading accuracy. AI natives can now use Nemotron 3 Nano on Together AI — with fast, reliable inference for specialized agentic systems https://x.com/togethercompute/status/2000572943718314392
NEWS: NVIDIA announces the NVIDIA Nemotron 3 family of open models, data, and libraries, offering a transparent and efficient foundation for building specialized agentic AI across industries. Nemotron 3 features a hybrid mixture-of-experts (MoE) architecture and new open https://x.com/nvidianewsroom/status/2000588337896198481
.@nvidia Nemotron 3 Nano is now available on Ollama! Local ollama run nemotron-3-nano Cloud ollama run nemotron-3-nano:30b-cloud https://x.com/ollama/status/2000820163231232167
🚀 Day-0 support for @NVIDIA Nemotron 3 Nano in SGLang SGLang now supports Nemotron 3 Nano on Day 0 🎉 A highly efficient, fully open Hybrid MoE model with 1M context, thinking budget, and industry-leading accuracy per compute. ✅ Open weights, data, and recipes ⚡ Fast, https://x.com/lmsysorg/status/2000567938949243111
As AI Grows More Complex, Model Builders Rely on NVIDIA | NVIDIA Blog https://blogs.nvidia.com/blog/leading-models-nvidia/
BREAKING CUDA MOAT EXPANDS: Today, NVIDIA has acquired SchedMD, makers of SLURM, a widely used “open source” workload scheduler. Many AI companies such as Mistral, Thinking Machines, parts of Meta’s FAIR division, university academic labs use SLURM. NVIDIA’s acquisition expands https://x.com/SemiAnalysis_/status/2000620209262985641
BREAKING: NVIDIA just dropped an open 30B model that beats GPT-OSS and Qwen3-30B — and runs 2.2-3.3× faster Nemotron 3 Nano: • Up to 1M-token context • MoE: 31.6B total params, 3.6B active • Best-in-class performance for SWE-Bench • Open weights + training recipe + https://x.com/AskPerplexity/status/2000589984818954719
First time I see a major org release @huggingface collections inside collections 🤯 Kudos @nvidia for this brilliant release https://x.com/NielsRogge/status/2000639749514760465
In collaboration with NVIDIA, the new Nemotron 3 Nano model is fully supported in llama.cpp. Nemotron 3 Nano features an efficient hybrid Mamba MoE architecture. It’s a promising model, suitable for local AI applications on mid-range hardware. The large context window makes it https://x.com/ggerganov/status/2000574990425415765
Inside NVIDIA Nemotron 3: Techniques, Tools, and Data That Make It Efficient and Accurate | NVIDIA Technical Blog https://developer.nvidia.com/blog/inside-nvidia-nemotron-3-techniques-tools-and-data-that-make-it-efficient-and-accurate/
New mlx-lm release: pip install -U mlx-lm Includes support for a few new models: – Nemotron 3 Nano (Nvidia) – Devstral (Mistral) – rnj-1 (Essential AI) https://x.com/awnihannun/status/2000974327660077298
Nvidia continues to put out some of the strongest and fastest open models. Pretraining and post training data are released as well, something very few orgs have done https://x.com/tri_dao/status/2000707760288092655
NVIDIA has just released Nemotron 3 Nano, a ~30B MoE model that scores 52 on the Artificial Analysis Intelligence Index with just ~3B active parameters Hybrid Mamba-Transformer architecture: Nemotron 3 Nano combines the hybrid Mamba-Transformer approach @NVIDIAAI has used on https://x.com/ArtificialAnlys/status/2000602570092675402
NVIDIA just released Nemotron-Agentic-v1 on Hugging Face This dataset empowers LLMs as interactive, tool-using agents for multi-turn conversations and reliable task completion. Ready for commercial use. https://x.com/HuggingPapers/status/2000628009049760072
NVIDIA just released Nemotron-Cascade-8B on Hugging Face A powerful 8B general-purpose reasoning model that achieves best-in-class performance across diverse benchmarks, from math to coding, by using novel Cascade RL. https://x.com/HuggingPapers/status/2001065870676603333
NVIDIA releases Nemotron 3 Nano, a new 30B hybrid reasoning model! 🔥 Nemotron 3 has a 1M context window and the best in class performance for SWE-Bench, reasoning and chat. Run the MoE model locally with 24GB RAM. Guide: https://x.com/UnslothAI/status/2000568378407452746
Really impressive release from NVIDIA, who not only went head-to-head with Qwen3, but: – innovated on the architecture (risky for most open labs) – did legit multi-env RL, complete with agentic evals (first time I see this from an open lab) – plan to open source the pretraining https://x.com/_lewtun/status/2000599470099099990
SemiAnalysis InferenceMAX showing GPT OSS on Blackwell is 33% more tokens per $ in just 1 month thanks to the awesome work of @vllm_project and @nvidia https://x.com/dylan522p/status/2002135815233970295
This is not just another strong open model. Nemotron actually releases training data (!), RL environments, and training code. This is a big difference: almost all model developers just want people to use their models; NVIDIA is enabling people to make their own models. We are https://x.com/percyliang/status/2000608134205985169
Today, @NVIDIA is launching the open Nemotron 3 model family, starting with Nano (30B-3A), which pushes the frontier of accuracy and inference efficiency with a novel hybrid SSM Mixture of Experts architecture. Super and Ultra are coming in the next few months. https://x.com/ctnzr/status/2000567572065091791
vLLM delivers even more inference performance with the same GPU platform. In just 1 month, we’ve worked with NVIDIA to increase @nvidia Blackwell maximum throughput per GPU by up to 33% — significantly reducing cost per token — while also enabling even higher peak speed for https://x.com/vllm_project/status/2001449658984632699
When @NVIDIA announced Nemotron 3 – it marked a symbolic turning point in a year that fundamentally reshaped open-source AI leadership. Is NVIDIA the new open-source king? What’s behind this strategy? Let’s see. ▪️ It releases 3 trillion tokens of new pretraining, 18 million https://x.com/TheTuringPost/status/2001087448299065372
GPT-5.2 xhigh reasoning scores 89.3 on the Extended NYT Connections benchmark, compared with 77.9 for GPT-5.2 high reasoning. GPT-5.2 Pro scores lower (86.7) but above GPT-5 Pro (83.9). https://x.com/LechMazur/status/1999582591905583256
Ok GPT-5.2 is *much* stronger at proof-writing. It notices BS previous models wrote immediately (I like to test this between model iterations to see if they notice what I notice). It also has better sense for what problems seem more tractable, and makes further progress. https://x.com/AcerFur/status/1999314476320063546
Real user feedback matters in model evaluation. ✨GPT-5.2 Instant, meant for everyday work, is #1 on @yupp_ai’s Text Leaderboard while GPT-5.2 (High) is #1 on our SVG Leaderboard. @openai’s strategy of releasing model variants suited to the task looks sound. Congrats @openai! 🎉 https://x.com/lintool/status/2000368978708119958
Yeah, it’s over. AI Explained specified that this GPT-5.2 result was with reasoning effort xhigh, i.e. 100k tokens spent thinking https://x.com/scaling01/status/1999535536130662576
GPT-5.2 just overtook Claude Opus 4.5 to achieve the highest score in GDPval-AA, a benchmark that focuses on performance in real-world economically valuable tasks However, GPT-5.2 is also the most expensive model to run GDPval-AA: GPT-5.2 cost $620, compared to Claude Opus 4.5’s https://x.com/ArtificialAnlys/status/1999404579599823091
@OpenAI NOTE: OpenAI calls their official results MRCRv2. I reported to them a few weeks ago that ~5%-10% of their tests in MRCRv1 had issues, which came from their generation. The results above are using corrected tests, similar to OpenAI’s MRCRv2. Here’s the MRCRv2 dataset from OpenAI https://x.com/DillonUzar/status/1999328225164431394
Evaluating chain-of-thought monitorability | OpenAI https://openai.com/index/evaluating-chain-of-thought-monitorability/
GPT 5.2 (xhigh) scores 72.2% and takes the lead on WeirdML, ahead of gemini 3 at 69.9%. 5.2 xhigh uses a lot of tokens (28k on avg, vs 7.8k for gemini and 3.7k for opus). It struggles with some tasks, but is really good at optimising the solutions to the other tasks to reliably https://x.com/htihle/status/2000571235734810805
GPT-5 Pro by @OpenAI is the Best Reasoning Model of 2025. 🏆 Calculated across SEAL’s reasoning leaderboards, GPT-5 Pro was the best at answering complicated questions, explaining its thinking, and solving multi-step problems. https://x.com/scale_AI/status/2000998950824968482
GPT-5.2 is a big improvement over GPT-5.1 on VendingBench-2 but barely beats Sonnet 4.5 and loses to Gemini 3 Pro and Claude 4.5 Opus https://x.com/scaling01/status/1999449402776387808
To preserve chain-of-thought (CoT) monitorability, we must be able to measure it. We built a framework + evaluation suite to measure CoT monitorability — 13 evaluations across 24 environments — so that we can actually tell when models verbalize targeted aspects of their https://x.com/OpenAI/status/2001791131353542788
Evaluating AI’s ability to perform scientific research tasks | OpenAI https://openai.com/index/frontierscience/
Science 🤝 GPT-5. Our new FrontierScience benchmark will be a valuable way to measure the performance of AI models on hard chemistry, biology, physics, and more. Plus, GPT-5 operating in a wet lab environment suggested experiments to increase a molecular cloning protocol’s https://x.com/kevinweil/status/2000982202067165253
We’re releasing a new eval to measure expert-level scientific reasoning: FrontierScience. This benchmark measures PhD-level scientific reasoning across physics, chemistry, and biology. It contains hard, expert-written questions (both olympiad-style problems and longer https://x.com/OpenAI/status/2000975293448905038
i wanted to compare gemini 3 pro and gpt 5.2 thinking on the long context eval MRCR v2, but i can’t make sense of the already high score reported by gemini for gpt 5.1? gemini is doing an average with samples < 128k, but i get 46.2% when doing that for gpt 5.1 (which is a 14% https://x.com/eliebakouch/status/1999482968717279441
I’m satisfied with GPT-5.2’s long-context capability. Up to now, I’ve always used Gemini to summarize podcasts, but I can now switch this use case over to ChatGPT. What I like is that, with the same prompt, it produces summaries with richer detail compared to Gemini. (That https://x.com/Hangsiin/status/2000738988378968224
OpenAI just released circuit-sparsity https://x.com/_akhaliq/status/1999528833490239864
You can now fine-tune LLMs and deploy them directly on your phone! 🚀 We collabed with PyTorch so you can export and run your trained model 100% locally on your iOS or Android device. Deploy Qwen3 on Pixel 8 and iPhone 15 Pro at ~40 tokens/sec. Guide: https://x.com/UnslothAI/status/2001305185206091917
moe helps sample efficiency; rl helps sample efficiency; reasoning helps sample efficiency. The tradeoff is you spend more compute on finetuning, but compute is cheap and data labeling is expensive. https://x.com/vikhyatk/status/2001233256356962512
[2507.10524] Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation https://arxiv.org/abs/2507.10524
[2512.10685] Sharp Monocular View Synthesis in Less Than a Second https://arxiv.org/abs/2512.10685
@srush_nlp I have some thoughts on this, and I mostly agree with you. But the picture is nuanced… Longer reply below: 👇 I agree that when RL is done right — with the right prompt mixture, possibly a curriculum, and a sufficiently high-capacity base model — we should generally expect… https://x.com/aviral_kumar2/status/2001855734485582239
🚀 New paper alert — Nemotron-Cascade We introduce Cascaded Domain-Wise RL (Cascade RL), a sequential RL pipeline that trains the model across domains one stage at a time, simplifying training while optimizing performance in each domain. 🔥 Key wins: Cascade RL yields… https://x.com/zihan_johan_liu/status/2001011462979117138
200k Tokens Is Plenty – Amp https://ampcode.com/200k-tokens-is-plenty
Ai2 Playground https://playground.allenai.org/
Also some nice improvements to the CUDA back-end including: – Quantize-quantize matmuls for NVFP4 and MXFP8 by @NasFilippova – You can now `pip install mlx[cuda13]` for x86 and arm (e.g. DGX Spark) – Much faster LLM prefill and training thanks to @zcbenz and @angeloskath. https://x.com/awnihannun/status/2001679244917907912
Announcing Angular v21. Authors: Jens Kuehlers, Mark “Techson”… | by Angular | Nov, 2025 | Angular Blog https://blog.angular.dev/announcing-angular-v21-57946c34f14b
Baldur’s Gate 3 Dev Embraces Machine Learning For “Tasks That Nobody Wants To Do” – GameSpot https://www.gamespot.com/articles/baldurs-gate-3-dev-embraces-machine-learning-for-tasks-that-nobody-wants-to-do/1100-6531123/
Congratulations to the @XiaomiMiMo team on the release of MiMo-V2-Flash! MiMo-V2-Flash is live now on OpenRouter, free for a limited time. https://x.com/OpenRouterAI/status/2000956004675281094
Contra DSPy and GEPA https://benanderson.work/blog/contra-dspy-gepa/
DEER Draft with Diffusion, Verify with Autoregressive Models https://x.com/_akhaliq/status/2001685493919158362
every fp4 value (e2m1) as a list: [0, 0.5, 1, 1.5, 2, 3, 4, 6, -0, -0.5, -1, -1.5, -2, -3, -4, -6] — the index in the list corresponds to the 4-bit encoding of the value (the top bit is the sign), e.g. 0001 = 0.5 and 1001 = -0.5. https://x.com/maharshii/status/2000475239835455750
Great paper from @a_karvonen! A nice example of meta-models work, a promising research area: Can we train networks to take activations as input and write natural language explanations? The bitter lesson says, in the long-run, scalable methods win. Does that apply to interp too? https://x.com/NeelNanda5/status/2001795630973493279
here is the start of a series of things i didnt know: torch has a grouped gemm fn called torch._grouped_mm it also has a (slow, which is why i didn’t use it) fp8 version in torch._scaled_* https://x.com/_xjdr/status/2001231675066396837
How long have you been “planning to understand” how modern LLM inference works? We just gave you a readable version of SGLang you can finish over the weekend. Introducing mini-SGLang ⚡ We distilled SGLang from 300K lines down to 5,000. Kept the core design, cut the complexity. https://x.com/lmsysorg/status/2001356624855023669
I assume that many cases of “degradation” that people think they’re experiencing with newer models are really that they frequently reach a flow state in these tools and so think the models can kind of read their minds. https://x.com/kylebrussell/status/2002018579957346680
i can confidently say nvfp4 training is solved at least for MoEs. when properly applied, it can be better https://x.com/_xjdr/status/2001234330236940444
I’m glad this paper of ours is getting attention. It shows that there are more efficient and effective ways for models to use their thinking tokens than generating a long uninterrupted thinking trace. Our PDR (parallel/distill/refine) orchestration gives much better final… https://x.com/prfsanjeevarora/status/2001302776966533396
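The parallel/distill/refine idea can be sketched as an orchestration loop over any prompt-to-text model. This is an illustration of the pattern only, not the paper's implementation; the function and prompt wording are mine:

```python
from typing import Callable, List

def pdr_round(llm: Callable[[str], str], question: str, n_parallel: int = 4) -> str:
    """One parallel/distill/refine round over a generic prompt->text model."""
    # Parallel: draft several independent short reasoning traces.
    drafts: List[str] = [llm(f"Think briefly about: {question}") for _ in range(n_parallel)]
    # Distill: compress the drafts into a compact shared workspace.
    workspace = llm("Distill the key facts from these drafts:\n" + "\n---\n".join(drafts))
    # Refine: answer from the short distilled workspace instead of one long trace.
    return llm(f"Using these notes:\n{workspace}\nAnswer: {question}")
```

The point of the shape is that the final answer is conditioned on a short distilled context rather than an ever-growing chain of thought, so total context per call stays bounded even as total thinking compute grows.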
If busting through Pareto optimality also makes you say “oh yeah”, check out our blogpost, preprint, codebase, and model release: Blog: https://x.com/lukemerrick_/status/1999516722030870542
Introducing Bolmo, a new family of byte-level language models built by “byteifying” our open Olmo 3, and to our knowledge, the first fully open byte-level LM to match or surpass SOTA subword models across a wide range of tasks. 🧵 https://x.com/allen_ai/status/2000616646042399047
Introducing Manus 1.6: Max Performance, Mobile Dev, and Design View https://manus.im/blog/manus-max-release
Is resumable LLM streaming hard? No, it’s just annoying. | Stardrift Blog https://stardrift.ai/blog/streaming-resumptions
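The "annoying" part of resumable streaming is mostly bookkeeping: buffer chunks server-side under a stream id and let a reconnecting client re-read from its last acknowledged offset. A minimal sketch of that idea, with illustrative names that are not from the Stardrift post:

```python
class ResumableStream:
    """Buffer every generated chunk so a disconnected client can resume by offset."""

    def __init__(self) -> None:
        self.buffer: list[str] = []
        self.done = False  # set True when generation finishes

    def append(self, chunk: str) -> None:
        # The generation loop writes here regardless of who is connected.
        self.buffer.append(chunk)

    def read_from(self, offset: int) -> tuple[list[str], int]:
        # A (re)connecting client passes the offset it last acknowledged
        # and receives only the chunks it has not yet seen.
        chunks = self.buffer[offset:]
        return chunks, offset + len(chunks)

# Simulated disconnect/reconnect:
s = ResumableStream()
for tok in ["Hello", ", ", "world"]:
    s.append(tok)
first, off = s.read_from(0)   # first connection drains everything so far
s.append("!")                  # generation continues while client is away
rest, off = s.read_from(off)  # resume: only the new chunk is re-sent
```

A real server would key streams by id, expire buffers, and expose this over SSE or chunked HTTP, but the offset contract is the whole trick.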
It’s actually a terrible meme because the logic is completely backwards: you would make training 61320 times slower. https://x.com/scaling01/status/1999456392495923555
Lots of discussion on Jevons Paradox for AI: does cheaper AI lead to more total usage? New paper finds short-run elasticity ~1 (so no short-run paradox) but prices fell 1000x in two years & demand exploded. So Jevons happens over time, as firms gradually adopt AI at lower prices https://x.com/emollick/status/1999672997121265867
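Under a constant-elasticity demand model (a stylized illustration, not the paper's estimation method), quantity scales as price^(-e), so spending scales as price^(1-e): e = 1 means a price cut raises usage proportionally but leaves total spending flat, which is why short-run elasticity ~1 means no short-run Jevons paradox. A toy computation:

```python
def demand_multiplier(price_ratio: float, elasticity: float) -> float:
    """Quantity scaling under constant-elasticity demand: q1/q0 = (p1/p0) ** (-e)."""
    return price_ratio ** (-elasticity)

def spending_multiplier(price_ratio: float, elasticity: float) -> float:
    """Total spending scales as p * q = (p1/p0) ** (1 - e)."""
    return price_ratio * demand_multiplier(price_ratio, elasticity)

# With unit elasticity, a 1000x price cut means ~1000x more usage, flat spending:
demand_multiplier(1 / 1000, 1.0)    # 1000x more usage
spending_multiplier(1 / 1000, 1.0)  # spending unchanged
```

Jevons-style behavior needs e > 1: for example with e = 2, halving price doubles spending. The paper's point, as summarized in the tweet, is that elasticity measured over a short window understates adoption that unfolds over years.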
Maybe consider putting “cutlass” in your CUDA/Triton kernels | Henry Zhu https://maknee.github.io/blog/2025/Maybe-Consider-Putting-Cutlass-In-Your-CUDA-Kernels/
MiMo-V2-Flash is live. It’s just step 2 on our AGI roadmap, but I wanted to dump some notes on the engineering choices that actually moved the needle. Architecture: We settled on a Hybrid SWA. It’s simple, elegant, and in our internal benchmarks, it outperformed other Linear… https://x.com/_LuoFuli/status/2001002838953222653
mlx-lm is becoming quite a powerful little inference framework! The latest release adds tensor-parallel LLM inference for use with the new low-latency JACCL back-end in MLX (h/t @angeloskath). Also updated to support Transformers V5! https://x.com/awnihannun/status/2001781067880239597
More documentation on how to use the new back-end here: https://x.com/awnihannun/status/2001672689325609028
nano 3 30b a3b is surprisingly good. I’m so used to benchmaxxed models, it’s refreshing to talk to a solid open model. https://x.com/andrew_n_carr/status/2000630563015905608
No signs of an end to rapid gains in AI ability at ever-decreasing costs (which is a log scale) yet. I have to update this monthly or more frequently at this point. All AI benchmarks are flawed, but GPQA Diamond has been a pretty good one, though likely close to being maxed out. https://x.com/emollick/status/2001387039858823325
Olmo 3.1 32B Think shows that not just frontier labs can scale RL. My favorite RL run yet over 7+ years of doing RL. The biggest fully open RL run ever? We left the same RL job running from our v3 Think for an extra 3 weeks. When we were releasing Olmo 3 32B on Nov. 20th we had… https://x.com/natolambert/status/1999528636085649532
Olmo 3.1 is here. We extended our strongest RL run and scaled our instruct recipe to 32B, releasing Olmo 3.1 Think 32B & Olmo 3.1 Instruct 32B, our most capable models yet. 🧵 https://x.com/allen_ai/status/1999528336318509316
OpenReview is one of the most important pillars supporting AI research and knowledge sharing, through open peer review and publishing. But as a non-profit, it needs our community’s support. Please consider making a donation to this great institution! https://x.com/AndrewYNg/status/2001842857070743613
Paper: https://x.com/TheTuringPost/status/2001697302562685034
People keep running into this kind of bug in LLM-RL codebases, and it’s exactly why I wrote our Env API from the get go to 1) use a tokens-in-tokens-out signature, 2) have a Trajectory class that ensures consistency between contexts seen at inference and training time. I also… https://x.com/TacoCohen/status/2001242003581870337
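The tokens-in-tokens-out discipline can be sketched as a tiny record type; the names here are illustrative, not the actual Env API. The key property is that the trainer consumes exactly the token IDs the policy sampled against, with no detokenize/retokenize round trip that could silently diverge:

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    """Stores token IDs end to end so inference and training contexts match exactly."""
    prompt_ids: list[int]
    completion_ids: list[int] = field(default_factory=list)

    def step(self, new_ids: list[int]) -> None:
        # Each environment step appends raw token IDs, never re-tokenized strings.
        self.completion_ids.extend(new_ids)

    def training_context(self) -> list[int]:
        # The trainer sees the identical IDs the policy generated against.
        return self.prompt_ids + self.completion_ids

traj = Trajectory(prompt_ids=[1, 2, 3])
traj.step([10, 11])
assert traj.training_context() == [1, 2, 3, 10, 11]
```

The classic bug this prevents: decoding the rollout to a string and re-tokenizing it for training, where a different merge of the same text shifts every position and corrupts the loss mask.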
Performance Hints: Over the years, my colleague Sanjay Ghemawat and I have done a fair bit of diving into performance tuning of various pieces of code. We wrote an internal Performance Hints document a couple of years ago as a way of identifying some general principles and we’ve… https://x.com/JeffDean/status/2002089534188892256
Pre-pretraining LMs on formal, rule-based languages before natural language can really help them learn human language better – found by @nyuniversity But not every formal language works equally well. It needs to: – Have similar structure to natural language (especially… https://x.com/TheTuringPost/status/2000673904013271338
Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains https://arxiv.org/pdf/2507.17746
Sample-Level Debugging: For your MLOps Pipeline | Voxel51 https://voxel51.com/blog/sample-level-debugging-the-missing-layer-in-your-mlops-pipeline
Scaling MoE inference is often communication + KV-cache bound: once you push expert parallelism, decode can become dominated by collectives and imbalance, and prefill stragglers can stall an entire EP group. New community benchmark results for vLLM wide-EP on multi-node H200 https://x.com/vllm_project/status/2001695354983723361
Self Improving Text2Sql Agent with Dynamic Context and Continuous Learning https://www.ashpreetbedi.com/articles/sql-agent
So excited to see new models with inference-oriented optimizations, like sliding window attention (SWA) and multi-token prediction (MTP). With joint optimization between SGLang @lmsysorg and Xiaomi @XiaomiMiMo, one can now serve a 300B model at 150+ tokens/s, while achieving open… https://x.com/BanghuaZ/status/2000981251575181723
Stanford AI Experts Predict What Will Happen in 2026 | Stanford HAI https://hai.stanford.edu/news/stanford-ai-experts-predict-what-will-happen-in-2026
Stronger Normalization-Free Transformers – new paper. We introduce Derf (Dynamic erf), a simple point-wise layer that lets norm-free Transformers not only work, but actually outperform their normalized counterparts. https://x.com/liuzhuang1234/status/1999321116641497355
Structured Outputs Create False Confidence | BAML Blog https://boundaryml.com/blog/structured-outputs-create-false-confidence
testing moondream RL lora ft on “is the gate open or closed”. 97 training samples with reasoning, it goes from 84.2% -> 94.7% on the test set after 17 mins of training (17 epochs). without reasoning, it trains for 2 minutes and goes from 84.2% -> 78.9% (1.2 epochs) fascinating https://x.com/vikhyatk/status/2001232634584948878
The first workshop on late interaction invites your submissions on training recipes, theoretical understanding, analysis, tooling, and applications. The workshop will be at ECIR 2026 in Delft, Netherlands. Check out @bclavie’s thread for details! https://x.com/lateinteraction/status/2001306319001616798
The Late Interaction Workshop is officially coming to ECIR 2026, and the CFP is up! It’ll be a venue to discuss all things multi-vector retrieval, including multimodality and future development, and will feature a keynote by the one and only @lateinteraction. Info & Link 🔽 https://x.com/bclavie/status/2001297672741790024
The latest MLX is out! And it has a new distributed back-end (JACCL) that uses RDMA over TB5 for super low-latency communication across multiple Macs. Thanks to @angeloskath https://x.com/awnihannun/status/2001667839539978580
What we have – and what we can carry from 2025 into 2026 – in reinforcement learning? ▪️ RLHF became classic ▪️ But there’s also a shift from human to AI judgement with RLAIF ▪️ RLVR – a promise that faces significant controversy ▪️ Though @karpathy says that RL is overall… https://x.com/TheTuringPost/status/1999149244180250747
When you serve vLLM at scale, request distribution is not a stateless problem. KV cache locality matters for conversational traffic, and prefill/decode (P/D) disaggregation introduces two specialized worker pools with very different bottlenecks. Generic load balancers typically https://x.com/vllm_project/status/2000882750010876179
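One simple form of cache-aware routing is session affinity: hash the conversation id so every turn of a conversation lands on the worker that already holds its KV cache. This is only a sketch of the idea; real routers (including vLLM's) also weigh live load and actual prefix-cache contents:

```python
import hashlib

def route(conversation_id: str, workers: list[str]) -> str:
    """Deterministically pin a conversation to one worker so its KV cache is reused."""
    # A stable hash (unlike Python's seeded hash()) gives the same answer
    # on every process, so all frontends agree on the assignment.
    h = int(hashlib.sha256(conversation_id.encode()).hexdigest(), 16)
    return workers[h % len(workers)]

workers = ["decode-0", "decode-1", "decode-2"]
route("conv-123", workers)  # same conversation id -> same worker every turn
```

The modulo scheme reshuffles almost everything when the pool resizes; consistent hashing or a sticky session table fixes that, but the affinity principle is the same.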
While working on the GitHub ui modernization/migration, I’ve spent half of the time on harness engineering (skills, agents, commands, test setup, verification etc.). It reminds me of game development where you spend most of the time on the engine and level builder.”” / X https://x.com/jaredpalmer/status/2001831913129226341
Yes, AGI Can Happen – A Computational Perspective https://danfu.org/notes/agi/
Seer is a small repo for interp researchers working on/with agents. Makes it easier to set up environments, equip agents with your techniques, and build on papers. Fixes a lot of the annoying stuff from using Claude Code out of the box. https://x.com/AJakkli/status/2002019487797711064
Note that our 90% prediction intervals are quite wide, spanning a factor of 2x longer or shorter than our central estimate. Also, ECI underestimated previous Claude models on Time Horizons by 30% on average. If we account for that, we predict Opus 4.5 will get 3.8 hours. https://x.com/EpochAIResearch/status/1999585243003781413
Efficiently Reconstructing Dynamic Scenes One 🎯 D4RT at a Time”” TL;DR: self-attention encoder transforms the input video into the latent Global Scene Representation; decoder can query 3D position P of any given 2D point (u, v) from the source timestep at target timestep 1/2 https://x.com/Almorgand/status/1999138551972221358
@pli_cachete Man discovers unsupervised learning and confuses it with what “training on test” actually means. https://x.com/jeremyphoward/status/2002136723573387537
Man discovers training on test improves performance on test. 1.1k people cheer https://x.com/pli_cachete/status/2002068489386004596
useful lifetime of a benchmark these days is measured in months. https://x.com/gdb/status/1999454952801075353
Just dropped a new text embedding methodology. Fast as heck on CPU only and still great for document similarity analysis, clustering, and classification. How? Use a tiny ReLU network to approximate a big transformer from lexical (term frequency / bag of words) features. https://x.com/lukemerrick_/status/1999516702808375791
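The recipe, as I read it, is plain distillation: regress a small ReLU network on bag-of-words counts against a frozen teacher's embeddings. A toy NumPy sketch with random stand-in data (dimensions, data, and targets are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy distillation setup: random stand-ins for real term counts and teacher embeddings.
vocab, hidden, dim, n = 1000, 256, 64, 512
X = rng.poisson(0.05, size=(n, vocab)).astype(np.float32)  # bag-of-words counts
Y = rng.normal(size=(n, dim)).astype(np.float32)           # frozen teacher embeddings

W1 = rng.normal(0, 0.05, size=(vocab, hidden)).astype(np.float32)
W2 = rng.normal(0, 0.05, size=(hidden, dim)).astype(np.float32)

lr, losses = 1e-1, []
for _ in range(100):                    # full-batch MSE regression to the teacher
    H = np.maximum(X @ W1, 0.0)         # one ReLU hidden layer
    err = H @ W2 - Y                    # residual vs. teacher embeddings
    losses.append(float((err ** 2).mean()))
    gW2 = H.T @ err / n                 # manual backprop through the two layers
    gH = (err @ W2.T) * (H > 0)
    gW1 = X.T @ gH / n
    W1 -= lr * gW1
    W2 -= lr * gW2
```

At inference the student is just two matmuls and a ReLU over sparse counts, which is why it is "fast as heck on CPU only": no attention, no tokenizer-length quadratic cost.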
NEW Research from Meta Superintelligence Labs and collaborators. The default approach to improving LLM reasoning today remains extending chain-of-thought sequences. Longer reasoning traces aren’t always better. Longer traces conflate reasoning depth with sequence length and… https://x.com/dair_ai/status/2000581380733030703
Low latency communication is crucial for tensor parallel inference which is now available on the latest mlx-lm (not on pypi yet). In the following video Devstral is generating a quicksort in C++ 1.7x faster on 2 M3 Ultras (right) vs on 1 (left). https://x.com/angeloskath/status/2001739468425040002
Great paper on why RL actually works for LLM reasoning. Apparently, “aha moments” during training aren’t random. They’re markers of something deeper. Researchers analyzed RL training dynamics across eight models, including Qwen, LLaMA, and vision-language models. The findings… https://x.com/omarsar0/status/1999483394963701911
Mech interp question: do the new nemotron models make use of negative zero for any circuits? https://x.com/andrew_n_carr/status/2000744793480270236
I’m open-sourcing jax-js — a machine learning library for the web, in pure JavaScript jax-js is the first ML compiler that runs in the browser, generating fast WebGPU kernels. Built from scratch over the past year as a personal side project Details: https://x.com/ekzhang1/status/2001680771363254646
On Kling 2.6’s (@Kling_ai) motion control: the biggest appeal of v2v is making it perform acting that AI can’t reproduce on its own. As a real example, I swallowed my embarrassment and recreated one myself, so please take a look. Movement like this is impossible with prompts alone. https://x.com/onofumi_AI/status/2001840428250022087