Technical and Dev: AI News Week Ending 12/12/2025

Image created with gemini-2.5-flash-image with claude-sonnet-4-5. Image prompt: Black and white cinematic photograph of fast-moving cirrus clouds streaking horizontally across bright sky, wispy formations stretched by wind into forward motion, slight motion blur, high contrast, film grain texture, bold sans-serif ‘TECH’ title card in lower third, contemplative skyward perspective, square format

Measuring AI Ability to Complete Long Tasks – METR
https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/

To measure these capabilities, we’re open-sourcing DeepSearchQA, a new benchmark to evaluate agents on complex web search tasks. Deep Research achieves state-of-the-art performance on this benchmark, as well as on the full Humanity’s Last Exam set (reasoning & knowledge), and https://x.com/GoogleDeepMind/status/1999165706231820297

We tested one of the most common prompting techniques: giving the AI a persona to make it more accurate. We found that telling the AI “you are a great physicist” doesn’t make it significantly more accurate at answering physics questions, nor does “you are a lawyer” make it worse. https://x.com/emollick/status/1998063517681799418

Yes, there is a leak. I had investigated this. Some of the ARC-AGI-1 public evaluation examples can be found in the ARC-AGI-2 training examples. So training on both ARC-AGI-1 and ARC-AGI-2 training data is cheating as it leads to crazy good accuracy for ARC-AGI-1. https://x.com/jm_alexia/status/1998487516182467055

Gemini 3 Pro: the frontier of vision AI https://blog.google/innovation-and-ai/technology/developers-tools/gemini-3-pro-vision/

We’ve developed the FACTS Benchmark Suite with @GoogleResearch. 📊 It’s the industry’s first comprehensive test evaluating LLM factuality across four dimensions: internal model knowledge, web search, grounding, and multimodal inputs. https://x.com/GoogleDeepMind/status/1998831084277313539

OpenAI testing new Image-2 models on LM Arena https://www.testingcatalog.com/openai-testing-new-image-2-models-on-lm-arena/

🚨BREAKING: New Model & WebDev Leaderboard Update! GPT-5.2 by @OpenAI has officially made its debut in the Arena, appearing on the WebDev leaderboard. Current leaderboard standings: 🥈 #2 for GPT-5.2-high in WebDev (score: 1486) 🔹 #6 for GPT-5.2 in WebDev (score: 1399) https://x.com/arena/status/1999183339283185878

Poetiq | Traversing the Frontier of Superintelligence https://poetiq.ai/posts/arcagi_announcement/

A year ago, we verified a preview of an unreleased version of @OpenAI o3 (High) that scored 88% on ARC-AGI-1 at est. $4.5k/task. Today, we’ve verified a new GPT-5.2 Pro (X-High) SOTA score of 90.5% at $11.64/task. This represents a ~390X efficiency improvement in one year. https://x.com/arcprize/status/1999182732845547795
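
The ~390X figure follows from dividing the two per-task costs quoted above. A minimal check, assuming the estimated $4.5k/task and the verified $11.64/task are directly comparable (the function name is illustrative and not from ARC Prize):

```python
def efficiency_gain(old_cost_per_task: float, new_cost_per_task: float) -> float:
    """Ratio of old to new cost per task; larger means cheaper now."""
    if new_cost_per_task <= 0:
        raise ValueError("new_cost_per_task must be positive")
    return old_cost_per_task / new_cost_per_task


# Figures quoted in the post: est. $4.5k/task (o3 preview) vs $11.64/task (GPT-5.2 Pro).
gain = efficiency_gain(4500.0, 11.64)
print(f"{gain:.1f}x cheaper per task")
```

The ratio lands a little under 390, consistent with the rounded figure in the post. It measures cost only and ignores the accompanying score change from 88% to 90.5%.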

OpenAI’s latest model, GPT-5.2 Thinking, is still not beating Opus 4.5 at SWE-Bench Verified; however, SWE-Bench Pro is looking juicy, with an over 10% higher score than Sonnet 4.5 https://x.com/scaling01/status/1999182909144519019

the-state-of-enterprise-ai_2025-report.pdf https://cdn.openai.com/pdf/7ef17d82-96bf-4dd1-9df2-228f7f377a29/the-state-of-enterprise-ai_2025-report.pdf

I meet a lot of very smart AI critics who never seriously try to make AI work for them by spending a couple of hours with a frontier model. People can be (and should be & are) critical after realizing what AI can do, but experience leads to better-informed and sharper critiques. https://x.com/emollick/status/1998398372986736777

We released OfficeQA today — a hard benchmark for evaluating agents on grounded reasoning tasks. More details in our blog https://x.com/bemikelive/status/1998491671609405748

[2512.08296] Towards a Science of Scaling Agent Systems https://arxiv.org/abs/2512.08296

❓How are evals and observability different for AI agents compared to simpler LLM applications? Come join me and Nick this Thursday as we discuss patterns we are seeing in the wild. Will be a combo of presentation with a chunk of Q&A at the end! https://x.com/hwchase17/status/1998176795737383033

📦 New in LangChain 1.1: dynamically triggered context compaction ➗ Select the proportion of the context window at which you’d like to trigger summarization and the fraction you want to retain. 🧠 In DeepAgents, we’ve seen success when compacting at 85% and retaining 10%. https://x.com/sydneyrunkle/status/1998011509482647676
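
The trigger-and-retain idea can be sketched without any framework. The code below is a rough illustration, not LangChain's API: the function names are hypothetical, and it assumes "retaining 10%" means keeping the most recent messages that fit in 10% of the context window while everything older is handed to a summarizer.

```python
from typing import List, Tuple


def should_compact(used_tokens: int, window: int, trigger: float = 0.85) -> bool:
    """True once the conversation fills at least `trigger` of the context window."""
    return used_tokens >= trigger * window


def split_for_compaction(
    token_counts: List[int], window: int, retain: float = 0.10
) -> Tuple[int, int]:
    """Return (n_to_summarize, n_to_keep) for messages ordered oldest to newest.

    Walks backwards from the newest message and keeps messages verbatim
    until the retain budget (a fraction of the window) would be exceeded.
    """
    budget = retain * window
    kept_tokens = 0
    kept = 0
    for count in reversed(token_counts):
        if kept_tokens + count > budget:
            break
        kept_tokens += count
        kept += 1
    return len(token_counts) - kept, kept


# Hypothetical conversation: per-message token counts, oldest first.
counts = [300, 300, 200, 50, 40]
window = 1000
if should_compact(sum(counts), window):
    n_summarize, n_keep = split_for_compaction(counts, window)
    print(f"summarize {n_summarize} oldest messages, keep {n_keep} newest")
```

Splitting on message boundaries rather than raw token offsets avoids cutting a message in half, at the cost of sometimes retaining slightly less than the budget allows.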

🔍@OpenRouterAI just made observability even easier. Their new Broadcast feature lets you send all your traces directly to LangSmith with no code changes required — whether you’re tracing with LangChain, provider SDKs, or the OpenRouter SDK. Demo: https://x.com/LangChain/status/1999168118833512783

🕵️‍♀️Turn your coding agent into an AI engineer `langsmith-fetch` is a CLI for pulling down LangSmith data. Can easily be given to Claude Code and other coding agents to let them help debug and improve your agents Repo: https://x.com/hwchase17/status/1999159856071401692

🦜Meet Polly – your new AI assistant for agent engineering We’re launching Polly in LangSmith today. It can help with three workflows that are becoming increasingly popular for debugging agents: 1. Debug a trace 2. Suggest prompt improvements 3. Analyze a conversation 🧵 https://x.com/hwchase17/status/1998809833693467100

🚀 Deep Agents: The Weekly Roundup 🚀 Dive into our latest resources to help you build Deep Agents capable of handling complex, long-running tasks. ✏️ Evaluating Deep Agents – Deep Agents can’t be evaluated like simple LLM tasks. After building and testing 4 production agents https://x.com/LangChain/status/1997843687376904400

100 Notion AI Agent Use Cases https://info.notion.so/resources/100-notion-ai-agent-use-cases

Agent Engineering: A New Discipline https://www.blog.langchain.com/agent-engineering-a-new-discipline/

Agent engineering: A new discipline Traditional software assumes known inputs and predictable behavior. Agents give you neither. That’s why teams shipping reliable agents are adopting a new discipline: agent engineering. Agent engineering is driven by a few core ideas: 🔹 https://x.com/LangChain/status/1998458777696350393

Agent HQ in @code has everything you need for agentic development with today’s release! – Manage all of your agents – local, background, or cloud – from a single view in VS Code – Run multiple isolated backgrounded agents at the same time – Delegate tasks to background or cloud https://x.com/pierceboggan/status/1998829467649937690

AI21 Maestro is @AI21Labs’ orchestration platform for building reliable end-to-end AI workflows. It combines multi-step planning, automatic compute scaling, built-in validation, proprietary RAG, and execution graphs/scorecards to keep AI agents accurate and transparent in https://x.com/AI21Labs/status/1998014705638523267

AWS did it again! They have introduced a novel way for developers to build Agents. Today, when you build an Agent, you start with a simple goal, then end up juggling prompts, routing logic, error handling, tool orchestration, and fallback flows. One unexpected user input and https://x.com/_avichawla/status/1998279303902244942

🔌 LangChain MCP Adapters 0.2.0 is out! This new release features: 🖼️ Multimodal tool support using LangChain’s standard content blocks ❓Elicitation support via callbacks 🏗️ Structured content for tools, stored as an artifact on tool results 🛠️ Tool name prefixes, preventing https://x.com/sydneyrunkle/status/1998380720016789938

🔊 The STT → Agent → TTS “sandwich” is a standard voice agent pattern. It’s easy to get started, tough to build reliable systems. 😵‍💫 Learn how to debug voice agents: We created a voice agent with @pipecat_ai and sent traces to LangSmith to show exactly how to get visibility https://x.com/LangChainAI/status/1998814975033487822

Build a voice agent with LangChain – Docs by LangChain https://docs.langchain.com/oss/javascript/langchain/voice-agent

Build a voice agent with LangChain – YouTube https://www.youtube.com/watch?v=kDPzdyX76cg

Clients – Agent Client Protocol https://agentclientprotocol.com/overview/clients

Cursor can now fix your hardest bugs. Debug Mode instruments your code, spins up a server to capture logs, and streams runtime data to the agent. Also in 2.2: Plan Mode improvements, multi-agent judging, and more. https://x.com/cursor_ai/status/1998821350333440133

First large-scale field study of how people actually use AI agents in the wild. The hype says 2025 is the year of agentic AI. But systematic behavioral evidence on real-world agent adoption has been almost nonexistent until now. Researchers from Harvard and Perplexity analyzed https://x.com/dair_ai/status/1999117070576058415

Congrats to the @MistralAI team on the launch of Devstral 2! 🚀 vLLM now delivers Day-0 support for the Devstral 2 Instruct models — optimized for agentic coding, deep codebase exploration, and multi-file editing at scale. Feel free to reach out 👇 https://x.com/vllm_project/status/1998428798891765926

Major new research from Google and MIT. “More agents is all you need” has become a mantra for AI developers. We know multi-agent systems can be effective, but we do this mostly based on heuristics. The default approach to building complex AI systems today remains adding more https://x.com/omarsar0/status/1999135611392053586

😂Check out my own demonstration of running #AutoGLM on my phone for “Like the top 3 posts from Andrej Karpathy on X app, and summarize them to me” @karpathy 🚀You can try ANY AGENT TASKS you want on your Android phone now! https://x.com/ShawLiu12/status/1999123320269402374

Made this video to explain evals https://x.com/HamelHusain/status/1998452926935695649

“CLM: Removing the GPU Memory Barrier for 3D Gaussian Splatting” TL;DR: CPU-offloading + Trains 102M Gaussians on a single RTX 4090, enabling city-scale reconstruction that previously required multi-GPU setup + Achieves 55-97% of GPU-only training throughput https://x.com/Almorgand/status/1998429866794918310

“Content-Aware Texturing for Gaussian Splatting” TL;DR: Texturing instead of millions of Gaussians for fine details; fixed texel size https://x.com/Almorgand/status/1996569599597305908

Diving deeper into SAM 3 & SAM 3D in my new YouTube video. Check it out here: https://x.com/bilawalsidhu/status/1997351635920847237

New blog: Learning to love mesh-oriented sharding https://x.com/ezyang/status/1997902916384932112

“SPFSplatV2: Efficient Self-Supervised Pose-Free 3D Gaussian Splatting from Sparse Views” TL;DR: feed-forward framework for 3DGS (sparse multi-view images); no ground-truth poses during training and inference; attention mechanism for target poses; reprojection loss https://x.com/Almorgand/status/1998703073309413869

“SplatPainter: Interactive Authoring of 3D Gaussians from 2D Edits via Test-Time Training” TL;DR: state-aware feedforward model that enables continuous editing of 3D Gaussian assets from user-provided 2D view(s) https://x.com/Almorgand/status/1998785583032971282

📈Arena Trends Update We pulled Arena scores for the Top 10 labs since the beginning of 2025, and the top climbers may surprise you. With tighter confidence intervals and new entries in the mix, the Arena continues to shift. Stay tuned for more EOY insights and updates from the https://x.com/arena/status/1998536014000959497

🚨Text Arena Update ERNIE-5.0-Preview-1103 by Baidu @ernieforDevs has landed on the Text leaderboard with a score of 1431 putting it in the top 20 in the most competitive Arena. A few highlights: 🔹scores 1471 in the Software & IT Services Occupational field on par with https://x.com/arena/status/1998437959553716260

ARC Prize – Leaderboard https://arcprize.org/leaderboard

ARC Prize 2025 Results and Analysis https://arcprize.org/blog/arc-prize-2025-results-analysis

Individual AI benchmarks saturate too quickly to give us a long-run trend of AI progress. We can solve this by “stitching” them together. As @ansonwhho explains, this lets us forecast AI capabilities, quantify algorithmic improvements, and detect accelerations in AI progress. https://x.com/EpochAIResearch/status/1998823086473568277

Poetiq | ARC-AGI-2 SOTA at Half the Cost https://poetiq.ai/posts/arcagi_verified/

Takeaway 4: Process-verified outcome rewards mitigate reward hacking and enhance reasoning fidelity. We find that incorporating process verification into outcome rewards delivers: 1) More truthful, error-resistant reasoning and 2) Better generalization on complex, multi-step https://x.com/xiangyue96/status/1998489119660638257

SVG Contest time 🎉 Our biggest contest yet! This one is about prompting AIs on Yupp to produce brilliant SVG outputs. 3 categories, 15 winners, and nearly 1M Yupp credits as prizes! Hosted by renowned prompting master/AI red teamer @chetaslua. Full details in our Discord 👇 https://x.com/yupp_ai/status/1998120413285769302

Interesting study, but this is somewhat unexpected. (green is programming, yellow is role playing) https://x.com/emollick/status/1996758326877868268

Today we’re introducing OfficeQA, a new benchmark grounded in ~89,000 pages of U.S. Treasury Bulletins that reflects the complex, document-heavy tasks enterprises actually face. Unlike existing benchmarks, OfficeQA measures economically valuable, real-world reasoning: parsing https://x.com/databricks/status/1998424470881525822

Directly comparing a benchmark of Devstral2-123B on my hardware to MiniMax-M2 (230B-A10B) shows the difference in performance MoE can give. At 100 requests concurrently: MiniMax is 2x faster At 2 requests concurrently: MiniMax is 3.5x faster https://x.com/JustinWaugh/status/1998467712235028888

🎉 Introducing Parallel Coordinated Reasoning (PaCoRe) 📈 An 8B model beats GPT-5 on HMMT25 by unlocking parallel thinking for test-time scaling! 📂 Open-source deep think: data + model + inference code! 🆓 MIT-licensed — use it however you want 🔍Key findings: 1. Message https://x.com/CyouSakura/status/1998344501262533011

Low-bit LLM quantization doesn’t have to mean painful accuracy trade-offs or massive tuning runs. Intel’s AutoRound PTQ algorithm is now integrated into LLM Compressor, producing W4A16 compressed-tensor checkpoints you can serve directly with vLLM across Intel Xeon, Gaudi, Arc https://x.com/vllm_project/status/1998710451312771532

Juuuust a bit outside https://x.com/buccocapital/status/1999303168568754348

Gemini 3 Pro continues to be SOTA on most multi-modal benchmarks and use cases! https://x.com/OfficialLoganK/status/1997003665433838026

We just updated our suite of Gemini TTS models 🗣️, they now come with: – Richer tone versatility and stricter adherence to style prompts – Smarter context-aware speed adjustments and better instruction following – Consistent character voices in multi-speaker scenarios https://x.com/OfficialLoganK/status/1998884687457173580

Google tests new Gemini 3 models on LM Arena https://www.testingcatalog.com/google-tests-new-gemini-3-models-on-lm-arena/

It’s not perfect tho. Some post-training might still be needed – I did see a few loops (repeating the same text over and over again) in my testing. Overall this is a SOLID model – especially priced cheaper than gemini-2.5-flash, a model it beats hands down. What a time to be https://x.com/hrishioa/status/1998636284533944725

We evaluated 15 leading models. Gemini 3 Pro achieved the top score of 68.8%. While search and internal knowledge has improved, multimodal factuality remains an industry-wide challenge. We’re sharing these benchmarks on @kaggle to help the research community build more reliable https://x.com/GoogleDeepMind/status/1998831088324473025

Gemini 3 Pro scores 69% trust in blinded testing up from 16% for Gemini 2.5: The case for evaluating AI on real-world trust, not academic benchmarks | VentureBeat https://venturebeat.com/ai/gemini-3-pro-scores-69-trust-in-blinded-testing-up-from-16-for-gemini-2-5

This Google paper presented at #NeurIPS2025 is a true gem. In their search for a better backbone for sequence models, they: • Reframe Transformers & RNNs as associative memory systems driven by attentional bias • Reinterpret “forgetting” as retention regularization, not as https://x.com/TheTuringPost/status/1997808277116338266

Massive achievement by @NousResearch To make this clear: this is actually a 3b active parameter model that works on any new MacBook Air/Mini that would be #2 on Putnam, which is harder than IMO Now think of all the tasks that are easier than that (a lot) Implications => https://x.com/EMostaque/status/1998686465279025190

For a long time, Yann LeCun and others believed in gradient-based planning, but it didn’t work very well … until now. Here’s how we did it using incredibly simple techniques. But first, an introduction to gradient-based planning: 🧵1/11 https://x.com/micahgoldblum/status/1999149319925227786

Multimodal fusion is key to building AI that truly understands the world. But it’s still hard to find the right way to do it, partly because diffusion is dynamic while text is static. @AIatMeta and @AI_KAUST proposed MoS – Mixture of States, which fixes this mismatch by routing https://x.com/TheTuringPost/status/1996585873652203808

GigaTIME: Scaling tumor microenvironment modeling using virtual population generated by multimodal AI – Microsoft Research https://www.microsoft.com/en-us/research/blog/gigatime-scaling-tumor-microenvironment-modeling-using-virtual-population-generated-by-multimodal-ai/

🚀 New InferenceMAX results are live! The team at @NVIDIA has pushed the boundaries of sglang-dsr1-1k1k-FP8 on the @SemiAnalysis_ InferenceMAX dashboard. The new submission delivers: 🔹 20% higher peak throughput 🔹 4260 tok/s/GPU at 30 TPS/user 🔹 Interactivity extended to 102 https://x.com/lmsysorg/status/1998454089903226967

GPT-5.2 weaker than GPT-5.1 Codex Max on CVE-Bench an eval that tasks models with identifying and exploiting real-world web application vulnerabilities https://x.com/scaling01/status/1999186361169871055

An important lesson that ARC-AGI has internalized, but not many others have, is that benchmark perf is a function of test-time compute. @OpenAI publishes single-number benchmark results because it’s simpler and people expect to see it, but ideally all evals would have an x-axis. https://x.com/polynoamial/status/1999189845164667132

LisanBench results for GPT-5.2 Thinking GPT-5.2 Thinking improves over GPT-5 and o3 but does not match other frontier models like Opus 4.5, Gemini 3 Pro, DeepSeek-V3.2 Speciale or Grok 4 GPT-5.2 Thinking improves over GPT-5 in average validity ratio, meaning it’s less likely to https://x.com/scaling01/status/1999240662147825876

The AI Consumer Index (ACE) Most AI benchmarks today focus on reasoning and coding. But most people use AI to shop, cook, and plan their weekends. In those domains, LLM hallucinations continue to be a real problem. 73% of ChatGPT messages (according to a recent report) are now https://x.com/omarsar0/status/1998039629556256995

Announcing GDPval-AA — our leaderboard and evaluation harness for comparing models on OpenAI’s GDPval dataset of real-world knowledge work tasks Earlier today, we announced our agentic harness called Stirrup, which we built to run GDPval tasks on any language model. We’re https://x.com/ArtificialAnlys/status/1998841566627246173

GPT-5.2 is a massive model scoring even higher than Gemini 3 Pro on GPQA Diamond (91.9%) https://x.com/scaling01/status/1999183900673798454

Holy moly, that’s insane: Nomos 1 is a 30B open-source model that just scored 87/120 on this year’s Putnam, good enough for an estimated #2/3988, showing that near-top human math performance is now possible with relatively small models plus good post-training and reasoning https://x.com/kimmonismus/status/1998749650984255985

Putnam, the world’s hardest college-level math test, ended yesterday 4p PT. Noon today, AxiomProver solved 9/12 problems in Lean autonomously (3:58p PT yesterday, it was 8/12). Our score would’ve been #1 of ~4000 participants last year and Putnam Fellow (top 5) in recent years https://x.com/axiommathai/status/1997767850279440715

.@essential_ai’s rnj-1 model is now on Ollama! `ollama run rnj-1` 8B parameter, open-weight dense model trained from scratch. The model is optimized for code and STEM with capabilities on par with other state of the art open-weight models. Let’s go! 🚀🚀🚀 https://x.com/ollama/status/1998305925762048030

“random improvement property of RLVR disappears with proper data decontamination” makes sense thank you @natolambert and the rest of Olmo team https://x.com/teortaxesTex/status/1998302405080055993

Deep Dive into LLMs like ChatGPT – YouTube https://www.youtube.com/watch?v=7xTGNNLPyMI

[2504.13173] It’s All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization https://arxiv.org/abs/2504.13173

[2512.05145] Self-Improving VLM Judges Without Human Annotations https://arxiv.org/abs/2512.05145

@axiommathai To be specific, AxiomMath used Tinker to do RL in developing AxiomProver! https://x.com/thinkymachines/status/1998925489084498094

@StasBekman @awscloud Isn’t NVLink usually spec’d in gigabytes/sec and EFA gigabits/sec? I also imagine the total switching capacity is much higher on the NVLink https://x.com/wightmanr/status/1998915115744428369

> Today, we’re building an infrastructure-first, deep-tech company with a simple and ambitious mission: “”Make frontier-level AI infrastructure open and accessible to everyone.”” this is very very exciting 🥹”” / X https://x.com/eliebakouch/status/1998081613213954475

🎙️ In episode #91 of Building Deep Tech, I talk with @___Harald___ , CTO at @comma_ai, where he and the team are building one of the most interesting autonomy efforts in the world: They work on end to end driving and generative world models is changing how small teams can https://x.com/IlirAliu_/status/1996656248700522727

🪦text-generation-inference is now in maintenance mode. Going forward, we will accept pull requests for minor bug fixes, documentation improvements and lightweight maintenance tasks. TGI has initiated the movement for optimized inference engines to rely on a transformers https://x.com/LysandreJik/status/1999137874378125436

🚀 We introduce Soft Adaptive Policy Optimization (SAPO): a smooth, stable, and highly effective RL method for training large language models. Why SAPO? 🔹 Hard clipping is brittle: gradients vanish or explode 🔹 MoE models amplify variance, making training even more unstable https://x.com/Alibaba_Qwen/status/1998300361514500554

1/5 🚀Apriel-1.6-15B-Thinker: a 15B multimodal reasoner scoring 57 on the Artificial Analysis Intelligence Index – approaching the performance of ~200B-scale frontier models while remaining an order of magnitude smaller. 🧠Model weights: https://x.com/ServiceNowRSRCH/status/1998482927597007313

15 Outstanding Research Papers from NeurIPS 2025 ▪️ Faster R-CNN ▪️ Artificial Hivemind: The Open-Ended Homogeneity of LMs (and Beyond) ▪️ Optimal Mistake Bounds for Transductive Online Learning ▪️ Gated Attention for LLMs ▪️ Superposition Yields Robust Neural Scaling ▪️ Why https://x.com/TheTuringPost/status/1997647379932278997

200k Tokens Is Plenty – Amp https://ampcode.com/200k-tokens-is-plenty

2025 was the year when artificial intelligence’s full potential roared into view, and when it became clear that there will be no turning back. For delivering the age of thinking machines, for wowing and worrying humanity, for transforming the present and transcending the https://x.com/TIME/status/1999097617633189955

A lot of discussion on open weights models seems to assume there is a clear incentive for building them. I don’t see how that is the case. Unless you have no need for money (government sponsored?), there are no real ways to capture value from your model even as model cost increases. https://x.com/emollick/status/1998884862858854417

A must-read → A Comprehensive Survey and Practical Guide to Code Intelligence Covers: – Full lifecycle of code LLMs: data, pre-training, SFT, RL, prompting – General vs. specialized code models – Key challenges: security, large context, real-world workflows – Design choices – https://x.com/TheTuringPost/status/1998050879807828055

Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models https://arxiv.org/pdf/2511.15304

AI #145: You’ve Got Soul – by Zvi Mowshowitz https://thezvi.substack.com/p/ai-145-youve-got-soul

Axiom https://axiommath.ai/territory/learning-collatz-the-mother-of-all-rabbit-holes

Coming soon to Diffusers! If you have always wanted pipelines in Diffusers to benefit from context parallelism, you might wanna check this PR out 🔥 Note that we already support Ring and Ulysses. Cc: @maharshii I think you will like this. https://x.com/RisingSayak/status/1998333353419026501

Congratulations to @axiommathai on their achievement! AxiomProver, a mathematics model fine-tuned with Tinker, got top scores on the Putnam Math Competition. https://x.com/thinkymachines/status/1998903749000180183

Don’t think of LLMs as entities but as simulators. For example, when exploring a topic, don’t ask: “What do you think about xyz”? There is no “you”. Next time try: “What would be a good group of people to explore xyz? What would they say?” The LLM can channel/simulate many https://x.com/karpathy/status/1997731268969304070

Easy to miss because it’s on the last page of the paper, but Olmo 3 RL-Zero has a really nice sub-section on RL with random rewards! Prior papers (Shao et al – “Spurious Rewards: Rethinking Training Signals in RLVR”) show RLVR still improves performance on math problems even https://x.com/cwolferesearch/status/1998289169052045516

Efficiently Reconstructing Dynamic Scenes One D4RT at a Time https://x.com/_akhaliq/status/1998763356883452031

Excited to share a preview of Isaac-0.2 open source 1B and 2B hybrid reasoning models built for Perception, many new capabilities (try the demo)! Isaac-0.1 was a proof point of our infrastructure, Our goal with Isaac 0.2 was to establish the foundation of the perception stack - https://x.com/AkshatS07/status/1998818590405935468

Folks managed to use GEPA right after it came out and then submit to a NeurIPS workshop, demonstrating 12.5% -> 62.5% gains on their task 📸 https://x.com/DSPyOSS/status/1997879916583391705

For those wondering, this was part of the 2.0.64 release: https://x.com/omarsar0/status/1998777320434290729

I hate to keep bringing this up, but studies cannot lump reasoners with earlier models when considering AI abilities And while studies don’t need to always use the latest models, they should test to see if there are trends in ability as model size scales to anticipate the future https://x.com/emollick/status/1998119842268738040

Introducing @ServiceNow AI’s Apriel-1.6-15B-Thinker, a 15B multimodal reasoning model that matches the performance of 235B models while being 15x smaller. AI natives can now use Apriel-1.6-15B-Thinker on Together AI — and benefit from reliable inference at production scale. https://x.com/togethercompute/status/1998484754417725637

It’s great to see that inter-node network speeds are starting to catch up with intra-node speeds. e.g. B300 instances on @awscloud are: – 800GBps inter-node EFA v4 – 900GBps intra-node NVlink-5 So inter-node is gradually becoming less of a bottleneck. https://x.com/StasBekman/status/1998821183844938000
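
How close the two numbers really are depends on the units behind them, a point raised in a reply elsewhere in this roundup. The sketch below only performs the unit conversion; which unit each vendor figure actually uses is an assumption to check against the hardware specifications, and the variable names are illustrative.

```python
BITS_PER_BYTE = 8


def gigabits_to_gigabytes(gbit_per_s: float) -> float:
    """Convert a rate in gigabits/s to gigabytes/s."""
    return gbit_per_s / BITS_PER_BYTE


nvlink_gbyte_s = 900.0  # intra-node figure quoted in the post, read as gigabytes/s

# Reading 1: the 800 inter-node figure is also gigabytes/s.
efa_if_bytes = 800.0
# Reading 2: the 800 inter-node figure is gigabits/s.
efa_if_bits = gigabits_to_gigabytes(800.0)

ratio_if_bytes = nvlink_gbyte_s / efa_if_bytes
ratio_if_bits = nvlink_gbyte_s / efa_if_bits
print(f"same units: NVLink is {ratio_if_bytes:.3f}x EFA")
print(f"EFA in gigabits: NVLink is {ratio_if_bits:.1f}x EFA")
```

Under the first reading the gap is about 12%; under the second it is a factor of nine, which would change the conclusion about the bottleneck.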

Key ways to fuse modalities: ▪️ Attention-based fusion – uses attention mechanisms to control which parts of each modality the model should focus on. • Cross-attention – turns the final text hidden state into key-value vectors. Text and image tokens interact through added https://x.com/TheTuringPost/status/1997239959024185562

Let Tensors Fly — Accelerating Large Model Weight Loading with R-Fork | LMSYS Org https://lmsys.org/blog/2025-12-10-rfork/

Model page: https://x.com/ollama/status/1998293405668180297

MoE performance optimisation finally coming to transformers: https://x.com/art_zucker/status/1998326537586651558

our neurips paper offers one potential avenue to mitigate this; use an estimator for unactivated experts, like the sample mean. if you pass this forward and backwards you’re no longer doing true sparsity. https://x.com/PandaAshwinee/status/1998294930125701433

Power Up FSDP2 as a Flexible Training Backend for Miles | LMSYS Org https://lmsys.org/blog/2025-12-03-miles-fsdp/

Power your AI with Human Insight https://www.originalvoices.ai/

Prompts for Open Problems – by Ben Recht – arg min https://www.argmin.net/p/prompts-for-open-problems

Reasoning with Sampling: Your Base Model is Smarter Than You Think https://arxiv.org/html/2510.14901

Saber: Scaling Zero-Shot Reference-to-Video Generation https://franciszzj.github.io/Saber/

Sample packing in the best case makes training X times faster, depending on the ratio of short sequences to long ones. For example, if 80% are short, then training is at most 1/0.2 = 5x faster! Update unsloth via `pip install --upgrade unsloth` Details @UnslothAI blog https://x.com/danielhanchen/status/1998770352646914146
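
The 1/0.2 = 5x bound can be reproduced with a simple token count. This is a back-of-the-envelope sketch, not Unsloth's implementation: it assumes compute scales linearly with tokens processed, that unpacked training pads every sequence to the maximum length, and that packing removes all padding. The function name and the example lengths are hypothetical.

```python
from typing import List


def packing_speedup_bound(lengths: List[int], max_len: int) -> float:
    """Padded tokens divided by real tokens: an upper bound on packing speedup."""
    if not lengths or any(n <= 0 or n > max_len for n in lengths):
        raise ValueError("lengths must be non-empty and within (0, max_len]")
    padded_tokens = len(lengths) * max_len  # every sequence padded to max_len
    real_tokens = sum(lengths)              # what packing would actually process
    return padded_tokens / real_tokens


# 80% very short sequences, 20% at the maximum length.
lengths = [1] * 80 + [2048] * 20
bound = packing_speedup_bound(lengths, max_len=2048)
print(f"at most {bound:.2f}x faster")
```

As the short sequences shrink toward zero length the bound approaches 5x; with longer "short" sequences it falls below that, which is why the post calls 5x the maximum.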

SGLang is the best inference framework for LLMs. We heavily used it at xAI because it allowed us to run large models faster than anyone else. The developers have now started RadixArk to expand the mission and serve all your infrastructure needs. Excited to see what they will https://x.com/ibab/status/1998098312051011817

So I tested the bigger model with my typical standard test queries which are not… | Hacker News https://news.ycombinator.com/item?id=46213498

𝐒𝐨𝐥𝐯𝐢𝐧𝐠 𝐭𝐡𝐞 “𝐙𝐞𝐫𝐨 𝐑𝐞𝐬𝐮𝐥𝐭𝐬” 𝐏𝐫𝐨𝐛𝐥𝐞𝐦 𝐢𝐧 𝐕𝐞𝐜𝐭𝐨𝐫 𝐒𝐞𝐚𝐫𝐜𝐡 𝐰𝐢𝐭𝐡 𝐐𝐝𝐫𝐚𝐧𝐭 + 𝐀𝐂𝐎𝐑𝐍 Shared by our Qdrant star Niranjan Akella 🌟 If you’ve ever applied multiple strict filters on top of semantic search and ended up with zero results, https://x.com/qdrant_engine/status/1998976425018405322

Tangle – Visual ML Pipeline Editor | Tangle https://tangleml.com/

Technically | Justin | Substack https://read.technically.dev/for-ai-at-work-start-with-something

Tensor Parallelism (TP) in Transformers: 5 Minutes to Understand https://huggingface.co/blog/qgallouedec/tp

The Boring Phase of AI – by James Wang – Weighty Thoughts https://weightythoughts.com/p/the-boring-phase-of-ai

The multiverse of movie madness https://x.com/bilawalsidhu/status/1998508447667773464

The New Skill in AI is Not Prompting, It’s Context Engineering https://www.philschmid.de/context-engineering

The return of symbolic AI? “Three things coming together makes now really good timing to build an AI mathematician using a hybrid method of formal verification and also informal reasoners.” @CarinaLHong, co-founder and CEO of @axiommathai https://x.com/TheTuringPost/status/1997971709996212561

There is no data-generating distribution – by Ben Recht https://www.argmin.net/p/there-is-no-data-generating-distribution

There’s got to be a better way! – by Ben Recht – arg min https://www.argmin.net/p/theres-got-to-be-a-better-way

These three paragraphs from Kahneman in 2017 (pre-LLM) are something else – full of, as James says, “painful claims” that are grounded in a lifetime of research. https://x.com/emollick/status/1997688647764848640

turbopuffer queries are strongly consistent, which requires scanning new writes in the WAL while indexing happens async tpuf’s WAL scan is now up to 2x faster https://x.com/turbopuffer/status/1998058954149208096

Vectorized MAXSCORE over WAND, especially for long LLM-generated queries https://turbopuffer.com/blog/fts-v2-maxscore

Very neat paper combining art and economic history. https://x.com/emollick/status/1997188746727325950

Wan-Move Motion-controllable Video Generation via Latent Trajectory Guidance https://x.com/_akhaliq/status/1998606187500097588

We’re releasing new fused varlen RoPE + int64 triton kernels for 3x faster training with no accuracy degradation. We enabled auto padding free training, automatically making all runs >2x faster with no loss curve changes & uncontaminated packing via FA3, Xformers, SDPA backends! https://x.com/danielhanchen/status/1998770347081109864

When everyone flew to NeurIPS, I went to Art Basel Miami to see how AI is doing in the wild (art). Why it’s good when machines hallucinate, and how much a robodog with Elon Musk’s head costs ↓ https://x.com/TheTuringPost/status/1998152928943833199

When HNSW fails under filters, ACORN keeps the search alive. 🕵 Kameshwara Pavan Kumar Mantha just shared a great deep-dive on ACORN, and why it’s a breakthrough for filtered vector search in Qdrant. And here’s a quick breakdown of what it’s about ⬇️ 🔍 The Problem Traditional https://x.com/qdrant_engine/status/1997939453965336741

Whitepaper: Practitioner’s guide to reinforcement learning – Weights & Biases https://wandb.ai/site/resources/whitepapers/reinforcement-learning-ebook/

In today’s episode of programming horror… In the Python docs of random.seed() def, we’re told “If a is an int, it is used directly.” [1] But if you seed with 3 or -3, you actually get the exact same rng object, producing the same streams. (TIL). In nanochat I was using the https://x.com/karpathy/status/1998236299862659485
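
The behavior is easy to reproduce. The sketch below assumes CPython, where integer seeds are reduced to their absolute value before seeding. The `signed_seed` helper is a hypothetical workaround for illustration, not something taken from the post or from nanochat.

```python
import random
from typing import List


def stream(seed: int, n: int = 5) -> List[float]:
    """First n floats from a fresh generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


def signed_seed(seed: int) -> int:
    """Map signed ints to distinct non-negative ints (0, -1, 1, -2, ... -> 0, 1, 2, 3, ...)."""
    return 2 * seed if seed >= 0 else -2 * seed - 1


# On CPython, 3 and -3 give identical streams.
print(stream(3) == stream(-3))
# After remapping, the two seeds give different streams.
print(stream(signed_seed(3)) == stream(signed_seed(-3)))
```

Any injective map from signed to non-negative integers would work; the interleaving used here keeps small seeds small.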

New paper from @GoogleDeepMind explored principles for optimizing agent harnesses based on measurable task properties. •Evaluated 180 configurations across 5 harnesses. •Centralized coordination improves parallelizable by 80.9%. •Coordination yields negative returns once https://x.com/_philschmid/status/1998957966343446844

I have been pretty frustrated with the current focus of interpretability research. Promising to see the focus on scalability and generalization. Without these two properties, works often end up being neuron interpretation overfit to a single model and not particularly https://x.com/sarahookr/status/1997795206096429415

Large-scale experiments in UK, US & Poland where people chatted with LLMs about political topics found AI is very good at persuasion, primarily by providing lots of fact-based claims. Plus, AI is getting more persuasive as models grow bigger & persuasion effects lasted over time. https://x.com/emollick/status/1996770000389169205

This radiance meshes paper is very cool https://x.com/bilawalsidhu/status/1996680321908252778

Jina-VLM achieves state-of-the-art performance among open 2B-scale VLMs, leading with the highest average (72.3) across eight general VQA benchmarks, particularly strong on diagrams, charts, and scene text. Its multilingual capabilities stand out most, achieving best-in-class https://x.com/JinaAI_/status/1997926493456834978

Amp, Inc. – Amp https://ampcode.com/news/amp-inc?

Titans + MIRAS: Helping AI have long-term memory https://research.google/blog/titans-miras-helping-ai-have-long-term-memory/

window seat reflection removal Reflection Removal through Efficient Adaptation of Diffusion Transformers https://x.com/_akhaliq/status/1998752500673888409

Chris Olah’s talk is happening right now at the NeurIPS mech interp workshop, room 30, top floor. Called “reflections on interpretability”! Followed by invited lightning talks at 16:00 https://x.com/NeelNanda5/status/1997812818788467157

We just dropped what we believe is the world’s largest study of AI conversations + it found what you talk to AI about has a lot to do with what time it is. @MicrosoftAI researchers found 3 different trends by day, time, and month and 1 rock solid constant https://x.com/mustafasuleyman/status/1998833489333100814

Releasing jina-VLM: our new 2B vision language model achieves SOTA on multilingual visual question answering and document understanding among open 2B-scale VLMs. https://x.com/JinaAI_/status/1997926488843190481