Image created with gemini-2.5-flash-image, with the prompt written by claude-sonnet-4-5. Image prompt: Cinematic wide shot of an empty luxury data center at night, single row of black server racks with cold blue LED lights stretching into darkness, polished concrete floors reflecting the lights, dramatic shadows, architectural photography style, the word TECH in bold white sans-serif across the center, minimalist composition emphasizing vast empty space and technological isolation
Amazon has launched a new speech-to-speech model, Nova Sonic 2.0, which ranks #2 on our Artificial Analysis Big Bench Audio Speech Reasoning benchmark! The new model achieves a reasoning accuracy score of 87.1% on Big Bench Audio, placing second overall behind Google’s Gemini https://x.com/ArtificialAnlys/status/1995950101068763393
Congrats to the ARC Prize 2025 winners! The Grand Prize remains unclaimed, but 2025 nevertheless saw remarkable progress on LLM-driven refinement loops, both with “local” models and with commercial frontier models. We also saw the rise of zero-pretraining DL approaches like HRM. https://x.com/fchollet/status/1997011262723801106
Image Leaderboard Update: 🖼️📊 Our image leaderboard ranks image generation AIs according to user preference – and Seedream 4.5 from @BytePlusGlobal is speeding up the rankings! Seedream 4.5’s standard version comes in at #4, just below Nano Banana Pro – and the Max version is https://x.com/yupp_ai/status/1997032930846396466
Nano Banana Pro is hitting the threshold for images that Veo 4 will unlock for video. We’ll suddenly go from static infographics to pro-grade animated motion graphics — like having a custom YouTube video essay on any topic imaginable. And just like that AI video will become a https://x.com/bilawalsidhu/status/1994110158138646693
Nano Banana Pro with 2K resolution is now #1 on the LMArena image editing leaderboard (with regular Nano Banana Pro at #2). It looks like users prefer higher resolution: who’d have thunk it?! https://x.com/JeffDean/status/1996457766349848753
Surprisingly good for the first try. Nano Banana Pro: “create a map of the US where every state is made out of its most famous food (the states should actually look like they are made of the food, not a picture of the food). Check carefully to make sure each state is right.” https://x.com/emollick/status/1995720976068137048
🚨BREAKING: Text Leaderboard Update: A new open source model has landed on the leaderboard! Mistral-Large-3 lands at #6 among open models and #28 overall on the Text leaderboard. Mistral 3 is the next generation of Mistral AI models and their most capable model family to date. https://x.com/arena/status/1995877395510051253
Introducing Mistral 3 | Mistral AI https://mistral.ai/news/mistral-3
Introducing Mistral Code | Mistral AI https://mistral.ai/news/mistral-code
Introducing the Mistral 3 family of models: Frontier intelligence at all sizes. Apache 2.0. Details in 🧵 https://x.com/MistralAI/status/1995872766177018340
Magistral | Mistral AI https://mistral.ai/news/magistral
Mistral Small 3 | Mistral AI https://mistral.ai/news/mistral-small-3
Mistral Small 3.1 | Mistral AI https://mistral.ai/news/mistral-small-3-1
Voxtral | Mistral AI https://mistral.ai/news/voxtral
Runway released Gen-4.5 today and it is already ranked first on the Video Arena leaderboard. We sat down with CEO @c_valenzuelab to discuss how a small team is currently beating Google and Meta in the race for state-of-the-art video generation. The full episode is below! https://x.com/wandb/status/1995548641801765249
CORE-Bench is solved (using Opus 4.5 with Claude Code) TL;DR: Last week, we released results for Opus 4.5 on CORE-Bench, a benchmark that tests agents on scientific reproducibility tasks. Earlier this week, Nicholas Carlini reached out to share that an updated scaffold that uses https://x.com/sayashk/status/1996334941832089732
AI agents can talk to each other. But they don’t always understand each other. This problem leads to inefficiency in collaboration for long-horizon problems and complex domains. The default approach in multi-agent systems today focuses on message structure. Protocols like MCP https://x.com/dair_ai/status/1996227436913340858
📊 Evaluating DeepAgents CLI on Terminal Bench 2.0 📊 The DeepAgents CLI is a coding agent built on top of the Deep Agents SDK, offering an interactive terminal interface with shell execution, filesystem tools, and persistent memory. How well does it actually perform on https://x.com/LangChain/status/1997006806904984002
🚨New Models in the Arena! 🐳DeepSeek V3.2: a new family of reasoning-first, agent-oriented models from @deepseek_ai are now live in the Arena. Standard, Thinking, and Speciale are all in the Text Arena, waiting for your toughest prompts! Get your votes in: we’ll see how they https://x.com/arena/status/1995564824718442620
At this point, papers testing whether AI can or cannot do something should try to test the strongest case, as well as a default. It is fine to say Llama 2 failed, but did a serious attempt to use GPT-5.1 Thinking in an agentic harness work? It would help better map the frontier. https://x.com/emollick/status/1994913383871586563
From Code Foundation Models to Agents and Applications: A Comprehensive Survey and Practical Guide to Code Intelligence https://arxiv.org/pdf/2511.18538
Thrilled to release our new paper MAP: Measuring Agents in Production ⚙️🚀 2025 is the year of agents… but do they actually work in the real world? Is it just hype? A group of 25 researchers from Berkeley, Stanford, UIUC, IBM, and Intesa Sanpaolo investigated what makes agents https://x.com/melissapan/status/1996975916971626763
What’s missing to build useful deep research agents? Deep research agents promise analyst-level reports through automated search and synthesis. However, current systems fall short of genuinely useful research. The question is: where exactly do they fail? This new paper https://x.com/omarsar0/status/1995915929973403827
WOW! @AnthropicAI released interviews with 1,250 professionals about how they use AI for work. You can find it on @huggingface as an open dataset! https://x.com/calebfahlgren/status/1996646452509266266
📢 If you’re interested in working at @arena please ping me. I will be at NeurIPS today and part of tomorrow. 📢 We are looking for excellent researchers (ICs and leaders) in machine learning, statistics, and evaluation. We can promise an intense, high-performance, https://x.com/ml_angelopoulos/status/1997006962522021992
AI Adoption Rates Starting to Flatten Out – Apollo Academy https://www.apolloacademy.com/ai-adoption-rates-starting-to-flatten-out/
Arena Expert launched last month as a new system for identifying the most difficult prompts – the kinds of questions people at the forefront of their fields are expected to ask. Since the launch, we looked at how “thinking” and “non-thinking” models perform across both general and https://x.com/arena/status/1997018150068801911
cloudflare down again 🙃 https://x.com/crystalsssup/status/1996869639608164505
Hey twitter! I’m releasing the LLM Evaluation Guidebook v2! Updated, nicer to read, interactive graphics, etc! https://x.com/clefourrier/status/1996250279033839918
How do the Top 10 open models really compare? We ran the “SF sea lion in front of the Golden Gate Bridge, as an SVG” test to find out. Prompt: “SVG sea-lion balancing a beach ball on its nose with the Golden Gate bridge in the background” https://x.com/arena/status/1995534738485129706
Interesting all AIs struggle with: “Updated version of the fighting temeraire with the same style and feel but entirely different subject appropriate for today” All four models get the idea of a retiring technology but miss most of the symbolism of what is being retired and how https://x.com/emollick/status/1994945921076138012
Introducing the Artificial Analysis Openness Index: a standardized and independently assessed measure of AI model openness across availability and transparency Openness is not just the ability to download model weights. It is also licensing, data and methodology – we developed a https://x.com/ArtificialAnlys/status/1995523178521846191
Most AI benchmarks share a common flaw: they saturate too quickly to study long-run trends. Our solution: “stitch” many benchmarks together. This lets us compare models across a wide range of capabilities on a single unified scale. Here’s how this works.🧵 https://x.com/EpochAIResearch/status/1996248575400132794
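A toy sketch of what “stitching” benchmarks onto one scale can look like: link two benchmarks through a model that appears on both and shift one scale onto the other. All model names, scores, and the simple offset-linking method here are invented for illustration; Epoch’s actual methodology is more sophisticated.

```python
# Hypothetical scores on two benchmarks; "model3" appears on both.
bench_a = {"model1": 0.92, "model2": 0.80, "model3": 0.55}
bench_b = {"model3": 0.95, "model4": 0.70}  # harder benchmark, saturates later

# Estimate an offset between the two scales from the shared model(s).
shared = set(bench_a) & set(bench_b)
offset = sum(bench_a[m] - bench_b[m] for m in shared) / len(shared)

# Map bench_b-only models onto bench_a's scale using that offset.
unified = dict(bench_a)
unified.update({m: s + offset for m, s in bench_b.items() if m not in bench_a})
print(unified)  # model4 now sits on the same scale as model1/model2/model3
```

With more than one shared model, the same idea generalizes to fitting a linear (or item-response-theory) mapping rather than a single offset.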
This is a remarkable result. V3.2 has high Pass@1 on Tool Decathlon, mid (i.e. GPT-5 tier) Pass^3 (all three trajectories correct), but Pass@3 is #2, right behind the new Opus. Do you know what this looks like? Like a model that’s *still* not RL’d anywhere close to its ceiling. https://x.com/teortaxesTex/status/1995538676332278238
Three years of the Lem Test, from the release of ChatGPT-3.5 (though it was not called that at the time) to Claude Opus 4.5 last week. https://x.com/emollick/status/1995025704870887652
Yupp LIVE leaderboard news 📰📊 A live model searches the Internet, integrating the latest real-world information into its responses. No dreaded old “knowledge cut-off time!” The new Claude Opus 4.5 Online and Claude Opus 4.5 (Thinking) Online have quickly risen to the top. https://x.com/yupp_ai/status/1996963861455593829
Yupp’s SVG AI Leaderboard is live! Why? It’s one of the clearest ways to demonstrate models’ reasoning and coding capabilities. https://x.com/yupp_ai/status/1996697775585787924
this one chart explains EVERYTHING about why OpenAI, xAI and Deepmind dropped everything to go chase after the grand prize in koding use cases. as i said at AIE CODE and in my cogpost, Code AGI will be achieved in 20% of the time of full AGI, and capture 80% of the value of AGI. https://x.com/swyx/status/1996760294614507929
Unlock the secret to AI success | Forrester study https://miro.com/events/secret-to-ai-success-forrester-study/
🚨Top 10 Open Models by Provider for November The open model race continues with new models entering the Text Arena. Confidence intervals are getting tighter and the competition is heating up! Here are the November Top 3: 🥇 #1 Kimi-K2-Thinking-Turbo by @Kimi_Moonshot (Modified https://x.com/arena/status/1995534475070243043
Announcing the ARC Prize 2025 Top Score & Paper Award winners The Grand Prize remains unclaimed Our analysis on AGI progress marking 2025 the year of the refinement loop https://x.com/arcprize/status/1997010070585201068
🚨🖼️ Image Leaderboard Update Seedream 4.5 by Bytedance has officially entered the Arena on both the Image Edit and Text-to-Image leaderboards. Here is where it landed: 🔹 #3 on Image Edit (score: 1338) 🔹 #7 on Text-to-Image (score: 1146) This update delivers a 27-pt increase https://x.com/arena/status/1996641968005566876
🚨BREAKING: Text Leaderboard Update 🐳 Deepseek-v3.2 enters the leaderboard at #38, and Deepseek-v3.2-thinking lands at #41. For comparison, previous versions ranked higher: 🔹 v3.2 ranks #38 (-5 pts vs v3.1 and -14 pts vs v3.2-exp) 🔹 v3.2-thinking ranks #41 (-7 pts vs v3.1-thinking https://x.com/arena/status/1996707563208167881
Compare how DeepSeek V3.2 performs relative to models you are using or considering at: https://x.com/ArtificialAnlys/status/1996110266065715249
DeepSeek’s new DeepSeekMath-V2 hits gold-medal performance on IMO and Putnam. It’s the first open model that can check its own proofs, fix mistakes, and improve itself. DeepSeekMath-V2 uses two “minds” in one model: ▪️ A verifier – Reads a proof and points out issues. – https://x.com/TheTuringPost/status/1994926897248288813
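The generate-then-verify loop described above can be sketched in a few lines. Everything below is a stand-in: the `generate`, `verify`, and `refine` functions are invented stubs showing the control flow (verifier flags issues, generator revises until the verifier accepts), not DeepSeekMath-V2’s actual interface.

```python
def generate(problem, feedback=None):
    # Stand-in for the proof generator; incorporates feedback when given.
    return (feedback or "") + f"proof({problem})"

def verify(proof):
    # Stand-in verifier: returns a list of issues; empty list = accepted.
    return [] if proof.startswith("fixed:") else ["missing lemma"]

def refine(problem, max_rounds=3):
    # Alternate generation and verification until the proof passes.
    proof = generate(problem)
    for _ in range(max_rounds):
        issues = verify(proof)
        if not issues:
            return proof
        proof = generate(problem, feedback="fixed:")
    return proof

print(verify(refine("p")))  # → [] once the verifier accepts the revised proof
```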
very smart choices by @stochasticchasm and the arcee team. in terms of arch, this is pretty much the perfect setup if you’re a bit constrained by compute/time and can’t do 100s of ablations hybrid nope, gated attention, norms to stabilize everything, muon, deepseek routing this https://x.com/eliebakouch/status/1995600008603697346
Gemini 3 Deep Think mode is now live in the Gemini app for Ultra users. 🚀 Building on the technology that reached a gold-medal level at the ICPC World Finals & IMO, it uses parallel thinking to excel at difficult coding and scientific tasks. https://x.com/quocleix/status/1996659461851885936
Gemini 3 Pro is the frontier of multimodal AI, delivering SOTA performance across document, screen, spatial, and video understanding. Read our deep dive on how we’ve pushed our core capabilities to power hero use cases across: + Docs: “derender” complex docs into structured https://x.com/googleaidevs/status/1996973083467333736
Google out here building the Borg cube for real https://x.com/bilawalsidhu/status/1995650915785986491
Happy to share that the @GoogleDeepMind Gemini team is starting a new research team in Singapore! This new team will be focused on advanced reasoning, LLM/RL and improving bleeding edge SOTA models such as Gemini, Gemini Deep Think and beyond. 🔥 This team will be led by yours https://x.com/YiTayML/status/1996640869584445882
I was in Singapore earlier this year to visit the office, and this is going to be a very-high impact part of the Gemini team! If you’re interested in working on Gemini and want to be in Singapore working with awesome people like @YiTayML and @quocleix, see below ⬇️ https://x.com/JeffDean/status/1996644208854388983
Opera rolls out Gemini-powered AI features across its browsers – 9to5Mac https://9to5mac.com/2025/12/01/opera-browsers-get-google-gemini-integration/
Our Gemini 3 Vibe Code hackathon started! Build applications using the new Gemini 3 Pro model with a prize pool of $500k. 🤯 > Top 50 winners receive $10,000 in Gemini API credits each. > Access Gemini 3 Pro Preview directly in Google AI Studio. > Leverage advanced reasoning https://x.com/_philschmid/status/1996990062836244732
Take an early look at how Google Gemini projects will work – Android Authority https://www.androidauthority.com/google-gemini-projects-2-3620950/
Today, we’re rolling out an updated Deep Think mode available in the Gemini app for Google AI Ultra subscribers. Here’s what you need to know: — Gemini 3 Deep Think mode pushes the boundaries of intelligence even further, delivering meaningful improvement in reasoning https://x.com/GoogleAI/status/1996657213390155927
Ultra users, ready to try Gemini 3 Deep Think mode? Here’s how: 1) Select ‘Deep Think’ in the prompt bar 2) Select ‘Thinking’ from the model drop down 3) Type your prompt & submit https://x.com/GeminiApp/status/1996670867770953894
We’re hiring research scientists & student researchers at Google DeepMind. DM or email me if you’re interested! I’ll be at NeurIPS this week. Happy to chat in person! https://x.com/RuiqiGao/status/1995572419218796567
We’re pushing the boundaries of intelligence even further with Gemini 3 Deep Think. 🧠 This mode meaningfully improves reasoning capabilities by exploring many hypotheses simultaneously to solve problems. Here’s how it coded a simulated dominoes game from a single prompt ⬇️ https://x.com/GoogleDeepMind/status/1996658401233842624
With state-of-the-art reasoning, richer visuals, and deeper interactivity, Gemini 3 is more intuitive, more powerful, and more personalized. Start exploring at https://x.com/GeminiApp/status/1995534313044238347
👀Introducing a brand new @yupp_ai SVG leaderboard ranking frontier models on the generation of coherent and visually appealing SVGs! Gemini 3 Pro by @GoogleDeepMind takes the crown as the most powerful model! 👏 We’re also releasing a public SVG dataset. Details in🧵 https://x.com/lintool/status/1996696157985398812
Gemini 3 Deep Think mode is live for Ultra users today. When using parallel lines of thought in this mode, Gemini shows meaningful improvement on key reasoning benchmarks such as ARC-AGI-2 & HLE. I think you will be deeply impressed. https://x.com/NoamShazeer/status/1996679619031060680
@venturetwins Feels a bit underwhelming? Not sure but a lot of hype for something that has existed for months and with better results. Nano Banana for video is Aleph: https://x.com/c_valenzuelab/status/1995559319962783919
Nano Banana Pro keeps getting more SOTA (support for 2K and 4K is available in the API!) 🍌 https://x.com/OfficialLoganK/status/1996036187979678088
Editing video models (think nano banana for video) will cause a boom in faithful remasters of old classics. Suddenly you can afford decisions that used to be cost prohibitive. Maybe even tackle cult favorites that never had the fan base to justify the expense (Stargate SG-1 https://x.com/bilawalsidhu/status/1995883669358006526
Nested Learning (NL) is @GoogleResearch’s fresh look at Continual Learning. It treats neural nets as stacked layers of memory. ▪️ Models turn into a bunch of smaller learners, each running on its own timescale (fast vs. slow) Here’s how this hierarchy of memories works: https://x.com/TheTuringPost/status/1994714278579073247
Today at #NeurIPS2025, we present Titans, a new architecture that combines the speed of RNNs with the performance of Transformers. It uses deep neural memory to learn in real-time, effectively scaling to contexts larger than 2 million tokens. More at: https://x.com/GoogleResearch/status/1996674393842614338
And Mistral Large 3, a frontier class open source MoE. https://x.com/MistralAI/status/1995872771516354828
🎉 Congratulations to the Mistral team on launching the Mistral 3 family! We’re proud to share that @MistralAI, @NVIDIAAIDev, @RedHat_AI, and vLLM worked closely together to deliver full Day-0 support for the entire Mistral 3 lineup. This collaboration enabled: • NVFP4 https://x.com/vllm_project/status/1995890057224618154
Europe still has one frontier model maker that can generally keep pace with Chinese open weights models, though no reasoner for Mistral 3 yet means they are behind the curve of actual performance – DeepSeek r1 got 71.5% on GPQA Diamond (& 1-shot, not 5-shot) back in January. https://x.com/emollick/status/1996068920596594932
I want to especially thank @MistralAI for releasing the base models for Mistral 3. Fewer companies are sharing base models and this opens many use cases from custom instruct to non-instruct cases. https://x.com/QuixiAI/status/1996272948378804326
Meet the Ministral 3 models from @MistralAI! – 3B, 8B, and 14B models – Instruct, reasoning, and base variants – Supports tool use and vision input – Open-weights, Apache 2.0 licensed https://x.com/lmstudio/status/1995908228526604451
Mistral 3 is now available on Ollama v0.13.1 (currently in pre-release on GitHub). 14B: ollama run ministral-3:14b 8B: ollama run ministral-3:8b 3B: ollama run ministral-3:3b Please update to the latest Ollama. https://x.com/ollama/status/1995885696360566885
Mistral releases Ministral 3, their new reasoning and instruct models! 🔥 Ministral 3 comes in 3B, 8B, and 14B with vision support and best-in-class performance. Run the 14B models locally with 24GB RAM. Guide + Notebook: https://x.com/UnslothAI/status/1995874975631503479
NEW: @MistralAI released a fantastic family of multimodal models, Ministral 3. You can fine-tune them for free on Colab using TRL ⚡️, supporting both SFT and GRPO https://x.com/SergioPaniego/status/1996257877871509896
NEW: @MistralAI releases Mistral 3, a family of multimodal models, including three state-of-the-art dense models (3B, 8B, and 14B) and Mistral Large 3 (675B, 41B active). All Apache 2.0! 🤗 Surprisingly, the 3B is small enough to run 100% locally in your browser on WebGPU! 🤯 https://x.com/xenovacom/status/1995879338583945635
Run Mistral Large 3 on Ollama’s cloud: ollama run mistral-large-3:675b-cloud https://x.com/ollama/status/1996682858933768691
Super nice to see Mistral Large 3 as the #1 OSS model for coding on lmarena 🥳😎🙌 And the spoiler alert! 👀👀 https://x.com/sophiamyang/status/1996587296666128398
Support for running Mistral Large 3 locally will be available in Ollama soon. https://x.com/ollama/status/1996683156817416667
The Bert-Nebulon Alpha Stealth model is live now as @MistralAI’s new Mistral Large 3! Try the full release now on OpenRouter: https://x.com/OpenRouterAI/status/1995904288560988617
The world’s best small models–Ministral 3 (14B, 8B, 3B), each released with base, instruct and reasoning versions. https://x.com/MistralAI/status/1995872768601325836
Mistral Large 3 debuts as the #1 open source coding model on the @arena leaderboard. We’d love for you to try it! More on coding in a few days… 👀 https://x.com/MistralAI/status/1996580307336638951
Mistral AI raises 1.7B€ to accelerate technological progress with AI | Mistral AI https://mistral.ai/news/mistral-ai-raises-1-7-b-to-accelerate-technological-progress-with-ai
NVIDIA Shatters MoE AI Performance Records With a Massive 10x Leap on GB200 ‘Blackwell’ NVL72 Servers, Fueled by Co-Design Breakthroughs https://wccftech.com/nvidia-shatters-moe-ai-performance-records-with-a-massive-10x-leap-on-gb200-nvl72/
Curiosity is a requirement for greatness. You win when you keep asking new questions every day. That’s why I am proud to announce my investment in Perplexity. Perplexity is powering the world’s curiosity, and together we will inspire everyone to ask more ambitious questions. https://x.com/Cristiano/status/1996626923720462425
Robotics keeps hitting the same wall. Single task RL works, but… it does not scale to hundreds of tasks or new embodiments. This new paper looks like a real step toward fixing that. The team introduces MMBench, a benchmark with 200 tasks across many domains and robots, and https://x.com/IlirAliu_/status/1994695830612447330
[2512.02008] The Art of Scaling Test-Time Compute for Large Language Models https://arxiv.org/abs/2512.02008
@hammer_mt tested it on invoice extraction. The data was a mess – it had different formats and labels that didn’t match. – Starting accuracy: 26% – After 9 automatic rewrites: 71% – Time: 20 minutes The AI identified edge cases he never would have caught manually. https://x.com/every/status/1997002142675353809
> I am quite curious what the score would have looked like if the model had produced outputs for every sample without exceeding the maximum output token limit. They really need to reduce reasoning verbosity, and/or extend context to 256K+. DSA makes that economical, in theory. https://x.com/teortaxesTex/status/1995922668839645418
✨ Introducing ThreadWeaver 🧵⚡ — an approach that significantly reduces LLM reasoning latency on challenging problems by enabling models to adaptively spawn parallel reasoning threads and merge them later in the process. (An off-the-shelf reasoning LLM can be retrofitted to https://x.com/VictoriaLinML/status/1995602943169741157
🎙️ Dr. Hendrik Susemihl, CEO and Co-founder of @GoodBytz, shows how fully automated kitchens can solve the labor crisis in food service and still serve better: We talk about his path from taking apart PCs as a teenager, to building large automation systems at Fraunhofer, to https://x.com/IlirAliu_/status/1994044791227847157
🤝 Proud to share the first production-ready vLLM plugin for Gaudi, developed in close collaboration with the Intel team and fully aligned with upstream vLLM. 🔧 This release is validated and ready for deployment, with support for the latest vLLM version coming soon. 📘 The https://x.com/vllm_project/status/1996207672245518782
🚀 Big news for the industry: AI21 Maestro is now available for deployment in your @awscloud VPC, seamlessly integrating with LLMs delivered via @awscloud Bedrock. With AI21 Maestro, organizations can orchestrate models, tools, and data to automate mission-critical tasks, https://x.com/AI21Labs/status/1996572699959722017
🚨New Model in Text Arena INTELLECT-3 by @PrimeIntellect is now live in the Arena! It’s a 106B MoE model, trained with SFT + RL on top of the GLM 4.5 Air Base. Apache 2.0 and MIT. Curious how it stacks up across math, creative writing, and more? Jump in and vote in Battle https://x.com/arena/status/1996324769013391839
1/ In our recent work – MixtureVitae – we show it is possible to have a permissively-licensed pretraining dataset – mitigating lawsuit risks from data like Books2 – while at the same time achieving strong performance on math/code, closing the gap to non-permissive datasets. https://x.com/JJitsev/status/1997072728332161420
a lot of people don’t seem to understand it’s dumb to compare reasoning and non-reasoning models. https://x.com/qtnx_/status/1996146690496049349
A new blog post about how we’ve adopted Antithesis as part of our testing story. This is kind of a new thing for us, because we liked Antithesis (both the people and the product) well enough that we’re now leading their next funding round. https://x.com/yminsky/status/1996202143368573035
AGI is unlike previous tools in that it can integrate and diffuse itself. Human-on-a-server would diffuse *extremely* rapidly. From @steve47285’s excellent post on this (link below): https://x.com/dwarkesh_sp/status/1996269777443242107
AI #145: You’ve Got Soul – by Zvi Mowshowitz https://thezvi.substack.com/p/ai-145-youve-got-soul
Archive for 2025 https://simonwillison.net/2025/
as a researcher, it makes no sense to compare reasoning vs non reasoning models on benches like the ones in Artificial Analysis without normalizing somehow by cost or output tokens. non reasoning models (base/instruct) are important for the open ecosystem since research teams and https://x.com/eliebakouch/status/1996214163215978967
Check out SkillFactory! Priming LLMs with SFT before RL is pretty cheap and lets models learn cognitive skills from RL more effectively. And adding this inductive bias via SFT data is nicely compatible with the bitter lesson! https://x.com/gregd_nlp/status/1996621316267655453
Context plumbing (Interconnected) https://interconnected.org/home/2025/11/28/plumbing
Cool! FP8 RL runs on just 5GB VRAM! https://x.com/Alibaba_Qwen/status/1996474298169802799
Defining Reinforcement Learning Down – by Ben Recht https://www.argmin.net/p/defining-reinforcement-learning-down
Details on the trinity architecture that we settled on! https://x.com/stochasticchasm/status/1995595431448121593
economies-of-open-intelligence.pdf https://www.dataprovenance.org/economies-of-open-intelligence.pdf
Erdős Problem #124 – Discussion thread https://www.erdosproblems.com/forum/thread/124#post-1892
How Can Interpretability Researchers Help AGI Go Well? — AI Alignment Forum https://www.alignmentforum.org/posts/MnkeepcGirnJn736j/how-can-interpretability-researchers-help-agi-go-well
How prompt caching works – Paged Attention and Automatic Prefix Caching plus practical tips | sankalp’s blog https://sankalp.bearblog.dev/how-prompt-caching-works/
I strongly recommend that interpretability researchers take a look at this thoughtful post on the GDM interp team’s recent research philosophy! Their approach emphasizes making measurable progress on carefully-selected downstream tasks. https://x.com/saprmarks/status/1995643857049149668
I’ve been saying mechanistic interpretability is misguided from the start. Glad people are coming around many years later. https://x.com/hendrycks/status/1995540567019934185
In our new research, we present AutoJudge — an inference acceleration method that learns which tokens are important for the answer. The result? 1.5-2x speedups compared to speculative decoding, and steady gains when combined with advanced techniques. 🚀 https://x.com/togethercompute/status/1996654662456639913
It’s Hard to Feel the AGI · Tensor Labbet https://tensorlabbet.com/2025/11/30/hard-to-feel-agi/
Migrating from Elasticsearch to Qdrant – Deep-Dive by Mahimai Raja J. We think this is a great technical breakdown of why many teams are moving from Elasticsearch to Qdrant for their vector search workloads. Why did he write this? After https://x.com/qdrant_engine/status/1996127270487183567
“Monocular Online Reconstruction with Enhanced Detail Preservation” TL;DR: Hierarchical Gaussian Management for smart distribution; Global Consistency Optimization for rock-solid alignment; Multi-level Occupancy Hash Voxels for crisp fine-to-coarse detail https://x.com/Almorgand/status/1994455735313322488
New in @code Insiders: Language Models editor https://x.com/code/status/1995557008292913443
Please ignore the sensationalism. We think there’s a lot of things Interpretability can do to make things safer, just that past mech interp strategies have been somewhat misguided and that we can do better https://x.com/NeelNanda5/status/1995903183038673155
Power Overwhelming – by Kevin Zhang – East Wind https://eastwind.substack.com/p/power-overwhelming
A proof-of-concept method that trains models to report when they break instructions or take unintended shortcuts: https://x.com/gdb/status/1996330850074722725
Quiet Feature Learning in Transformers This is one of the most fascinating papers I have read this week. Let me explain: It argues that loss curves can mislead about what a model is learning. The default approach to monitoring neural network training relies on loss as the https://x.com/omarsar0/status/1996233046799106128
Recent progress on automated proofs has convinced me that the @METR_Evals task lengths measurements plausibly generalize well beyond software engineering tasks, which I was previously skeptical of. Capabilities seem plausibly on par with a capable solver working for 1-3 hrs. https://x.com/littmath/status/1996245072149430482
RL amplifies existing behaviors. Let’s prime models w/ good behaviors for better RL! Introducing SkillFactory: ✂️Rearrange model traces on a problem to demo verification + retry ⚙️SFT on those traces 🦾RL Result: Learn robust explicit verification + retry across domains 🧵 https://x.com/ZayneSprague/status/1996615552987546050
See our blog post about the by-now-old Titans paper and a follow-up Miras paper. It provides further insights and history of the Titans memory architecture line of work. We will present the poster later today at NeurIPS’25. https://x.com/mirrokni/status/1996705597241413869
State of AI 2025: 100T Token LLM Usage Study | OpenRouter https://openrouter.ai/state-of-ai
Thanks @_akhaliq for sharing! 😍 📢#PosterCopilot enables precise layout reasoning and multi-round, layer-wise editing for high-quality graphic design. – Paper: https://x.com/jzw1365297/status/1996976559023091809
The 0.0003 kWh measure for a standard prompt is very consistent. For an independent assessment, here is the ML Energy Leaderboard, which uses a collection of 500 human prompts for testing, and is not particularly optimized for energy consumption, again finding right around 0.0003 kWh. https://x.com/emollick/status/1994858469355393216
The “commodity AI” thesis is wrong. The API market is splitting into two modalities: – Premium models (Claude) dominate programming and high-stakes work. Users pay $2+/M tokens because correct code > cheap code. – Cheap open models own roleplay and creative tasks. Volume is https://x.com/maximelabonne/status/1996931127735472187
The boundary between you prompting the model and the model prompting you is going to get blurry in 2026. https://x.com/alexalbert__/status/1997009693622128911
The Dawn of the 10x Team | Sentry https://blog.sentry.io/the-dawn-of-the-10x-team/
The End of the Train-Test Split https://folio.benguzovsky.com/train-test
The Leaderboard Illusion is presented at #NeurIPS2025 today! Come learn why leaderboards mislead — and how we can fix them. https://x.com/mziizm/status/1996489947159961740
The Thinking Game Film https://thinkinggamefilm.com/
They also make several innovations to stabilize RL training (far beyond what that other “open bell labs” place published in blog posts 👀): 1) unbiased kl estimate, with different kl reg for different domains (!) 2) mask significantly negative adv sequences (to not “throw off” https://x.com/suchenzang/status/1995537223698243998
This is unserious. V3.2-thinking, one of the strongest LLMs around, is below tons of relatively weak models and even older versions of itself, like V3.1, V3.2-exp, R1-0528. Maybe the clearest case of lmarena being cooked. https://x.com/teortaxesTex/status/1996801926546313473
This paper really is groundbreaking. It solves a long-standing embarrassment in machine learning: despite all the hype around deep learning, traditional tree-based methods (XGBoost, CatBoost, random forests, etc.) have dominated tabular data – the most common data format in https://x.com/burkov/status/1996102081996861907
Thoughts on AI progress (Dec 2025) – by Dwarkesh Patel https://www.dwarkesh.com/p/thoughts-on-ai-progress-dec-2025
Thread by @a16z on Thread Reader App https://threadreaderapp.com/thread/1995538428684087439.html
TiledMLP, one of the key components of Arctic Long Sequence Training, is now available to you in Unsloth! https://x.com/StasBekman/status/1995516206682628569
Today we release the transformers version 5 RC! 🤗 With this, we enable e2e interoperability with our friends in the ecosystem, ease adding new models, and simplify the library 🙌🏻 Read our blog to learn more: https://x.com/huggingface/status/1995575607456137716
Together AI now offers the fastest inference for the most popular OSS LLMs including Qwen3 235B 2507, GPT-OSS-20B, and Kimi-K2-0905. https://x.com/togethercompute/status/1995559800835637725
Transformers v5 is OUTTT! 🔥 It’s been quite a wild ride from v4 -> v5: 20k -> 3M+ installs/day, 40 -> 400+ architectures, ~1k -> 750k+ checkpoints 🤯 1.2B+ total installs. PyTorch-only, modular model defs, quantization-first, OpenAI-compatible transformers serve (with Responses API https://x.com/reach_vb/status/1995565504480882765
We’re back to the age of research, just with big computers. 2012-2020 – the age of research. 2020-2025 – the age of scaling. Now the scale is so big: would everything be transformed if you just 100x’d it? @ilyasut doesn’t think that’s true. This is what he said in his https://x.com/TheTuringPost/status/1994075019060929004
What if you could breed better prompts, like a chef perfecting a recipe? @hammer_mt walked through how a system called GEPA doubled his accuracy in 20 minutes. 🧵 https://x.com/every/status/1997002100640039125
What is continual learning? Why does it matter now more than ever? Today’s models operate on the immediate present, while older memories stay frozen in time. When they acquire new knowledge and skills, they face catastrophic forgetting, losing what they learned before. ▪️ https://x.com/TheTuringPost/status/1994149996749623764
When Professor Yejin Choi @YejinChoinka spots the acronym collision 😃… Thrilled to see our work “EPO: Entropy-Regularized Policy Optimization” featured in her keynote at NeurIPS! Thanks for the shoutout! Code: https://x.com/fnruji316625/status/1996837482357457205
Why Optimizely Opal is NOT ‘just another’ AI tool for marketers – Optimizely https://www.optimizely.com/insights/blog/why-opal-is-not-just-another-ai-tool-for-marketers/
The Cost of Complexity | IBM https://www.ibm.com/reports/cost-of-complexity
You can see the same exponential gain in AI abilities over time for areas ranging from math to doing long tasks… …but this time the graph is of the total revenue that various AI models would’ve made from cyberattacks on smart contracts based on real exploits post-AI training https://x.com/emollick/status/1995680363872748000
You should try ML3 for coding tasks! Good answers and the right level of detail. https://x.com/b_roziere/status/1996587193372930061
@eliebakouch @OpenBMB For IFEval there’s a major footgun where you need to make sure the reasoning content is stripped off. Since that depends on the reasoning delimiter e.g. </think> vs [/THINK] I guess the MiniCPM eval suite needs to include Mistral’s [/THINK] delimiter. https://x.com/_lewtun/status/1996671492143124901
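The footgun above amounts to: strip everything up to the final reasoning delimiter before scoring the completion. A minimal sketch of that step (the delimiter list and helper name here are illustrative, not taken from any particular eval suite):

```python
# Strip model "reasoning" content before scoring instruction-following evals.
# The closing delimiter varies by model family (e.g. </think> vs Mistral's
# [/THINK]), so check every known delimiter and keep only the text after the
# last occurrence of each.
REASONING_DELIMITERS = ["</think>", "[/THINK]"]

def strip_reasoning(completion: str) -> str:
    for delim in REASONING_DELIMITERS:
        if delim in completion:
            # Keep only what follows the final delimiter occurrence.
            completion = completion.rsplit(delim, 1)[-1]
    return completion.strip()

print(strip_reasoning("<think>plan the answer...</think>The answer is 42."))
# -> The answer is 42.
print(strip_reasoning("no delimiter here"))  # passes through unchanged
```

If a suite only knows one delimiter, completions from models using the other format get scored with their chain-of-thought included, which is exactly the bias the tweet warns about.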
So exciting to see overlap b/w NIST’s new publication on “accelerating AI innovation” & what a subset of us have been advocating for/working on: Measurement science! For me, starting w/ explaining how Evaluation should work in Model Cards, a few things:🧵 https://x.com/mmitchell_ai/status/1996669236513751499
The consensus view of AI researchers that artificial general intelligence is possible in a handful of years has never been clearer. You don’t have to believe them but they appear to be sincere. Even more importantly: we do not need better AI than we have today for major impacts. https://x.com/emollick/status/1994464998945427573
🌍 Global MMLU was released exactly a year ago and has already become a key reference for multilingual evaluation. Today, we’re introducing Global MMLU 2.0, now covering more languages and refining the benchmark for what’s next. Excited for what’s yet to come in 2026 🚀🚀 🚀 https://x.com/mziizm/status/1996517093039382879
Are we in a GPT-4-style leap that evals can’t see? – Martin Alderson https://martinalderson.com/posts/are-we-in-a-gpt4-style-leap-that-evals-cant-see/
Cohere Labs x NeurIPS 2025: “The Leaderboard Illusion” The Leaderboard Illusion highlights how private testing, selective score retraction, and data access gaps can distort leaderboard rankings, affecting AI model evaluation reliability. Congrats to authors https://x.com/Cohere_Labs/status/1996593263609045458
Today we are announcing the creation of the AI Evaluator Forum: a consortium of leading AI research organizations focused on independent, third-party evaluations. Founding AEF members: @TransluceAI @METR_Evals @RANDCorporation @halevals @SecureBio @collect_intel @Miles_Brundage https://x.com/aievalforum/status/1996641899332198403
We need rigorous, transparent evaluation if we want the world to understand advanced AI capabilities and risks. We’re excited to join with other independent evaluators through the AI Evaluator Forum to raise the bar on measurement best practices. https://x.com/METR_Evals/status/1996656514774524054
Intel presents SignRoundV2: Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs https://x.com/_akhaliq/status/1996975161854017702
The Art of Scaling Test-Time Compute for LLMs This is a large-scale study of test-time scaling (TTS). It also provides a practical recipe for selecting the best test-time scaling strategy. (bookmark it) My takeaways: Test-time compute scaling works – Allocating more https://x.com/omarsar0/status/1995862532310057320
Thrilled to share that @annadgoldie and I are launching @RicursiveAI, a frontier lab enabling recursive self-improvement through AIs that design their own chips. Our vision for transforming chip design began with AlphaChip, an AI for layout optimization used to design four https://x.com/Azaliamirh/status/1995937492194001367
Transformers v5’s first release candidate is out 🔥 The biggest release of my life. It’s been five years since the last major (v4). From 20 architectures to 400, 20k daily downloads to 3 million. The release is huge, w/ tokenization (no slow tokenizers!), modeling & processing. https://x.com/LysandreJik/status/1995558230567878975
Document understanding is a huge use case for VLMs, but historically there’s been no single “good” benchmark to measure progress here (unlike SWE-bench for coding). This past week I did a deep dive into OlmOCR-Bench, a recent document OCR benchmark that is a huge step in the https://x.com/jerryjliu0/status/1996668513562644823
with transformers v5 RC comes `any-to-any` pipeline and a new model class: AutoModelForMultimodalLM 👏 these unlock models that take in 2+ inputs and 2+ outputs, like Gemma3n (all modalities to text) and Qwen3-Omni (all modalities to text+audio) docs on the next one 🙌🏻 https://x.com/mervenoyann/status/1996908863673737450
TIL you can compile quantized models thanks to quanto although memory blows up a bit on Qwen3-VL https://x.com/mervenoyann/status/1996998362118201850
A must-read → Deep Research: A Systematic Survey Covers: – A 3-stage roadmap for deep research systems – Deep research vs. RAG – Key components: query planning, information acquisition, memory management, answer generation – Optimization: prompting, supervised fine-tuning, https://x.com/TheTuringPost/status/1994793698002178410
Most of the model’s computations over the RAG context aren’t needed. @AIatMeta introduced REFRAG, a decoding approach that: – Compresses the long context – Identifies which parts actually matter – Expands them when needed It gives up to 30.85× faster time-to-first-token and https://x.com/TheTuringPost/status/1994177724316045796
Open-source: complete codebase covering multiple simulation backends, training, retargeting, and real-world inference. Infra built for humanoid, but also readily modified for quadruped (also included). Lots of infra gems/conveniences we rely on consistently. Hopefully equally https://x.com/pabbeel/status/1995629150082924644
The mystery is over. It’s Runway with a leaderboard-topping video model. Can’t wait to give it a test drive. https://x.com/bilawalsidhu/status/1995541831103512965
With Gen-4.5 you can achieve an unprecedented level of cinematic realism while still exploring novel creative concepts. The model is exceptionally good at generating objects that move with realistic weight, momentum and force, even when suspended in zero gravity. Gen-4.5 early https://x.com/runwayml/status/1995857775771918574
With Gen-4.5 you can explore worlds that represent very specific points of view and aesthetic characteristics. The model allows you to precisely generate the look, feel and atmosphere of the world you want to create and the stories you want to tell. https://x.com/runwayml/status/1996942421121191987
YingVideo-MV: Music-Driven Multi-Stage Video Generation | Giant AI Lab https://giantailab.github.io/YingVideo-MV/