Image created with gemini-2.5-flash-image, with the prompt drafted by claude-sonnet-4-5. Image prompt: Seamless repeating damask pattern on navy background featuring elegant circuit board traces, microchips, and neural network nodes rendered as ornamental Victorian flourishes, with copper and silver metallic accents and embossed TECH monogram integrated into decorative cartouches, sophisticated textile design quality with subtle texture, premium gift wrap aesthetic

2025 LLM Year in Review | karpathy https://karpathy.bearblog.dev/year-in-review-2025/

Evaluating chain-of-thought monitorability | OpenAI https://openai.com/index/evaluating-chain-of-thought-monitorability/

Background Coding Agents: Context Engineering (Part 2) | Spotify Engineering
https://engineering.atspotify.com/2025/11/context-engineering-background-coding-agents-part-2

This paper asked 25 different AI models to write a metaphor about time. Nearly all said “time is a river” or “time is a weaver.” It is not completely clear why: likely overlapping training, alignment processes, and synthetic data contamination. More idea diversity would be good https://x.com/emollick/status/2002183640453685280

This is a brilliant read for anyone building with code agents like Codex / Claude Code. Quick notes: default to building CLIs first (easier for agents to verify) and progressively add other surfaces (UI); for macOS/iOS apps, default to using Swift build tooling + codex… https://x.com/reach_vb/status/2005554360307065023

Researchers proposed Sample-Efficient Modality Integration (SEMI), which plugs any pretrained encoder (image, audio, video, sensors, graphs) into an LLM using one projector plus LoRA adapters generated from a handful of paired examples. Trained on data-rich domains, SEMI… https://x.com/DeepLearningAI/status/2003593131132916204

Understanding AI Benchmarks – by Shrivu Shankar https://blog.sshh.io/p/understanding-ai-benchmarks

Interesting research from Google showing that neural networks don’t just memorize facts; they build internal maps of how those facts relate to each other. The view of how transformers store knowledge is associative: co-occurring entities get stored in a weight… https://x.com/dair_ai/status/2005480659209400789

Learnings from RL training of an LLM – Google Docs https://docs.google.com/document/d/1Sm-XUZ4MvYHcOw7gsoIpdEu38GhCpgNCMnx6Fa0grks/edit?tab=t.3awwxw6mhl75#heading=h.xy9wi236lxm

LiquidAI/LFM2-2.6B-Exp · Hugging Face https://huggingface.co/LiquidAI/LFM2-2.6B-Exp

77% of machine learning applications in science rely on traditional techniques like Random Forest, XGBoost, and CatBoost, not transformers or diffusion models. The gap between AI headlines and lab reality is much larger than you might think. Here is @Marktechpost’s analysis https://x.com/TheTuringPost/status/2002168088834552278

The WAU effect | Mobile Dev Memo by Eric Seufert https://mobiledevmemo.com/the-wau-effect/

Why is scaling attention scores by 1/√d_k so critical for transformers? Without this scaling, dot products (q·k) grow unstable as dimension increases. Here’s a bit of the math behind it: in multi-head attention, each head operates on dimension d_k… https://x.com/viplismism/status/2003807608571076782
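The variance argument behind the 1/√d_k factor is easy to check numerically. A minimal sketch (assumes NumPy; q and k entries drawn i.i.d. with unit variance, so Var(q·k) = d_k and the raw score std grows as √d_k):

```python
import numpy as np

rng = np.random.default_rng(0)
stds = {}
for d_k in (16, 256, 4096):
    q = rng.standard_normal((10_000, d_k))
    k = rng.standard_normal((10_000, d_k))
    raw = (q * k).sum(axis=1)        # unscaled dot products q·k
    scaled = raw / np.sqrt(d_k)      # the 1/sqrt(d_k) correction
    stds[d_k] = (raw.std(), scaled.std())
    print(f"d_k={d_k:5d}  raw std={raw.std():7.1f}  scaled std={scaled.std():.2f}")
```

The raw std tracks √d_k (about 4, 16, 64 here), while the scaled std stays near 1, keeping softmax inputs in a well-conditioned range regardless of head dimension.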

🧵Five pretraining tricks from CAI. Before the Google deal, @character_ai was running pretraining on GCP H100-TCPX, which has 1/4 the bandwidth of InfiniBand (!). @NoamShazeer invented a gradient compression algorithm called “Squinch” that maintained SOTA MFU despite the poor networking. https://x.com/simon_mo_/status/2003608325624406482

2026 Predictions – by FD – Robonomics https://robonomics.substack.com/p/2026-predictions

A must-read → A Survey of Context Engineering for LLMs Covers: – Why LLM performance is shaped at inference time – What’s beyond prompt design – Core components of Context Engineering: retrieval & generation, processing, memory & compression – System implementations: RAG, … https://x.com/TheTuringPost/status/2002154397132833127

“ACE-SLAM: Scene Coordinate Regression for Real-Time SLAM” TL;DR: the first neural implicit SLAM system to use a Scene Coordinate Regression network as the scene representation, achieving strict real-time performance on live streams. https://x.com/Almorgand/status/2002059078617739372

AI methods you really HAVE to know about at the end of 2025 – Switching BF16 → FP16 precision – Modular Manifolds – XQuant and XQuant-CL – Multimodal fusion, including Mixture of States (MoS) method – Mixture-of-Recursions (MoR) – Causal Attention with Lookahead Keys (CASTLE) https://x.com/TheTuringPost/status/2002303731468304522

Autoregressive generation can be seen as a special case of block diffusion where the block size is just one token. @PKU1898 and @huaweitechnolgy presented a gradual way for this autoregressive (AR) → block-diffusion transition. To make it work, they: – Use an attention pattern… https://x.com/TheTuringPost/status/2001697220387913818
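The reduction to the AR special case is easy to see in the attention mask. A hedged sketch of one common block-diffusion masking scheme (bidirectional within a block, causal across blocks; `block_diffusion_mask` is an illustrative name, not from the paper):

```python
import numpy as np

def block_diffusion_mask(n: int, b: int) -> np.ndarray:
    """Token i may attend to token j iff j's block is not after i's block:
    full attention inside each size-b block, causal across blocks."""
    blocks = np.arange(n) // b
    return (blocks[None, :] <= blocks[:, None]).astype(int)

causal = np.tril(np.ones((6, 6), dtype=int))
print(block_diffusion_mask(6, 2))  # 2-token blocks: bidirectional within, causal across
print(block_diffusion_mask(6, 1))  # block size 1 recovers the causal AR mask
```

With b = 1 each token is its own block, so the mask collapses to the standard lower-triangular causal mask, which is exactly the AR special case.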

Chatterbox Turbo | Resemble AI https://www.resemble.ai/chatterbox-turbo/

Even if transformers hadn’t led to LLMs, they would still hold tremendous promise as a new method of modelling complex data. Interesting post on using transformers for economic modelling. https://x.com/emollick/status/2003172860064633321

Excited to release a new paper today: “End-to-End Test-Time Training for Long Context”. Our method, TTT-E2E, enables models to continue learning at test-time via next-token prediction on the given context – compressing context into model weights. For our main result, we extend… https://x.com/arnuvtandon/status/2005704949381095828

For even higher throughput and lower latency: batch generation + tensor parallel with mlx-lm and mlx.distributed. Here it’s generating at 63 tok/sec (throughput) with GLM 4.7 in 6-bit and batch size 4 on 4 M3 Ultras: https://x.com/awnihannun/status/2003854411848904937

Interestingly, it has been hard to tell where AI models gain better reasoning – during pre-training, mid-training, or RL. Researchers at @CarnegieMellon found that each plays a distinct role: – RL truly improves reasoning only in specific conditions – Generalizing across… https://x.com/TheTuringPost/status/2002555031942226127

Is there a standardized, unified wrapper over every single API provider library? https://x.com/Teknium/status/2005608503269093549

Looking Ahead to 2026 | sn scratchpad https://snscratchpad.com/posts/looking-ahead-2026/

Love the boring stuff. The things everyone hates to do. Do onboarding. Do documentation. Do onboarding that works without (!) you. Do documentation people actually (!) use. Do logging you can actually use in the field. And please, do not let your engineers design the… https://x.com/IlirAliu_/status/2001654181229400517

Must-read AI research of the week: ▪️ MMGR: Multi-Modal Generative Reasoning ▪️ Are We on the Right Way to Assessing LLM-as-a-Judge? ▪️ Nemotron-Cascade: Scaling Cascaded RL for General-Purpose Reasoning Models ▪️ Fast and Accurate Causal Parallel Decoding using Jacobi Forcing https://x.com/TheTuringPost/status/2003239230022254955

Nature is Laughing at the AI Build Out – https://markmaunder.com/2025/nature-is-laughing-at-the-ai-build-out/

One of the underrated papers this year: “Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful” https://x.com/rasbt/status/2005667911013441753
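The trade-off the title points at can be sketched on a toy problem: gradient accumulation spends k micro-batch gradients on a single parameter update, while vanilla SGD spends the same gradient evaluations on k updates. A minimal illustration (toy 1-D quadratic loss, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
targets = rng.standard_normal(8)   # one "sample" per target, loss 0.5*(w - t)^2
lr = 0.5
w_sgd, w_acc = 0.0, 0.0

# (a) vanilla small-batch SGD: one update per gradient evaluation
for t in targets:
    w_sgd -= lr * (w_sgd - t)      # 8 gradient evals, 8 optimizer steps

# (b) gradient accumulation: average the same gradients, single update
grad = np.mean([w_acc - t for t in targets])
w_acc -= lr * grad                 # 8 gradient evals, 1 optimizer step
```

Both variants touch every sample once, but (a) takes eight optimizer steps for the price of (b)'s one; the title's claim is that at small batch sizes those extra steps, not the averaged gradient, are what the compute should buy.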

Our new paper shows that RoPE – the positional encoding used in most modern LLMs like Qwen, Gemma, and DeepSeek – has a fundamental flaw: it entangles “what” (content) and “where” (position) information. Our fix (PoPE) is simple but powerful. Paper: … https://x.com/agopal42/status/2003900815560659303
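For reference, the entanglement comes from RoPE baking position into the q/k vectors themselves via position-dependent rotations. A minimal NumPy sketch of the standard rotation (NeoX-style half-split pairing; illustrative, not the paper's PoPE fix):

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate each feature pair (x[i], x[i+half]) by angle pos * freq_i."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    ang = pos * freqs
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * np.cos(ang) - x2 * np.sin(ang),
                           x1 * np.sin(ang) + x2 * np.cos(ang)])

rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)
# The q·k score depends only on the relative offset m - n ...
s_a = rope(q, 5) @ rope(k, 2)    # positions 5 and 2, offset 3
s_b = rope(q, 10) @ rope(k, 7)   # positions 10 and 7, offset 3
# ... but the rotation mixes content components within the same feature
# dimensions, which is the "what"/"where" entanglement the paper targets.
```

The two scores agree up to float error because each rotation pair contributes q_pair · R((n−m)θ_i) k_pair, a function of the offset alone.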

Our new paper, “End-to-End Test-Time Training for Long Context,” is a step towards continual learning in language models. We introduce a new method that blurs the boundary between training and inference. At test-time, our model continues learning from the given context using the… https://x.com/karansdalal/status/2005704608996540887

Pretty cool post, but L2-normalizing attention weights is variance-preserving only in the special case where the value vectors are effectively uncorrelated across positions. When values are correlated, the output variance depends on both the L2 and L1 norms of A (the vector of… https://x.com/ArmenAgha/status/2003918120881475832
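The claim can be made precise. Write the attention output as o = Σᵢ aᵢvᵢ with scalar weights aᵢ and values vᵢ of variance σ². If the values are uncorrelated across positions,

```latex
\operatorname{Var}(o)=\operatorname{Var}\!\Big(\sum_i a_i v_i\Big)=\sigma^2\sum_i a_i^2=\sigma^2\,\lVert a\rVert_2^2 ,
```

so L2-normalizing the weights preserves variance. But if the values are perfectly correlated ($v_i = v$ for all $i$),

```latex
o=\Big(\sum_i a_i\Big)v \quad\Rightarrow\quad \operatorname{Var}(o)=\sigma^2\Big(\sum_i a_i\Big)^2=\sigma^2\,\lVert a\rVert_1^2 \qquad (a_i \ge 0),
```

which is the L1-norm dependence the reply points out; real value sequences sit between the two extremes.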

Prompt engineering ⟶ context engineering. The main design patterns you should know about. 9 popular techniques for prompting: ▪️ Zero-shot ▪️ Few-shot ▪️ Role prompting ▪️ Instruction-based ▪️ Chain-of-Thought (CoT) ▪️ Tree-of-Thought (ToT) ▪️ Reasoning-action prompting… https://x.com/TheTuringPost/status/2002765247900262620
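Three of the listed patterns differ only in what surrounds the question. A minimal sketch with hypothetical prompt strings, just to make the patterns concrete:

```python
question = "A store sells pens at 3 for $2. How much do 12 pens cost?"

# Zero-shot: the bare task, no examples or scaffolding.
zero_shot = question

# Few-shot: worked examples set the answer format before the real query.
few_shot = (
    "Q: Apples are 4 for $3. Cost of 8 apples?\nA: $6\n"
    f"Q: {question}\nA:"
)

# Chain-of-Thought: an instruction that elicits intermediate reasoning.
chain_of_thought = f"{question}\nLet's think step by step."
```

The model, decoding settings, and everything else stay fixed; only the surrounding context changes, which is the sense in which prompting patterns are a special case of context engineering.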

Reading about hypertext apps from the ’80s–’90s, back when the role was called “hypertext designer” rather than web designer or developer. Many came out of academia, and most failed, but there’s something exciting about seeing such a diverse range of forms the web could take. https://x.com/poetengineer__/status/2005511136037474635

SpecBundle & SpecForge v0.2: Production-Ready Speculative Decoding Models and Framework | LMSYS Org https://lmsys.org/blog/2025-12-23-spec-bundle-phase-1/

tcgen05 for dummies – gau-nernst’s blog https://gau-nernst.github.io/tcgen05/

Test, don’t (just) verify https://alperenkeles.com/posts/test-dont-verify/

Thanks for diving deep into vLLM and sharing your findings. 🫶 We’re working to make more beginner-friendly documentation available. In the meantime, we recommend using the `Search (AI)` button on our… https://x.com/vllm_project/status/2005640089133830371

The authors ask whether an N-layer ViT can be rewritten using just K << N layers applied recurrently. Remarkably, they match DINOv2 performance with only 2-3 layers. The paper also offers a rich dynamical-systems analysis. Very cool work! 🔗 https://x.com/f14bertolotti/status/2003760506214158693
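The construction is simple to state: keep K parameter sets and cycle them for N layer applications. A toy sketch of the weight-tying idea (toy tanh "layers", not the paper's ViT blocks; all names are illustrative):

```python
import numpy as np

def layer(x: np.ndarray, W: np.ndarray) -> np.ndarray:
    # Stand-in for a transformer block: residual + pointwise nonlinearity.
    return x + np.tanh(x @ W)

def recurrent_forward(x: np.ndarray, Ws: list, depth: int) -> np.ndarray:
    # Emulate a depth-N network by cycling K << N shared layers.
    for i in range(depth):
        x = layer(x, Ws[i % len(Ws)])
    return x

rng = np.random.default_rng(0)
Ws = [rng.standard_normal((16, 16)) * 0.05 for _ in range(2)]  # K = 2
x = rng.standard_normal((4, 16))
y = recurrent_forward(x, Ws, depth=12)  # 12 layer applications, 2 parameter sets
n_params = sum(W.size for W in Ws)      # 2 * 256, vs 12 * 256 for an untied depth-12 net
```

The compute (number of layer applications) is unchanged; only the parameter count shrinks by a factor of N/K, which is why matching DINOv2 with 2-3 recurring layers is notable.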

The changing drivers of LLM adoption https://epochai.substack.com/p/the-changing-drivers-of-llm-adoption

the fastest GLM 4.7 available to try today on Baseten https://x.com/basetenco/status/2005699615379841325

The new @z_ai model GLM-4.7 is now available in Roo Code. You can access it through Zai’s GLM coding plans and our Roo Code cloud and other providers now! https://x.com/roocode/status/2003652972555997560

The Shape of AI: Jaggedness, Bottlenecks and Salients https://www.oneusefulthing.org/p/the-shape-of-ai-jaggedness-bottlenecks

The Shape of Artificial Intelligence – by Alberto Romero https://www.thealgorithmicbridge.com/p/the-shape-of-artificial-intelligence

This is how Noam and the pre-training team at c.ai did knowledge distillation! https://x.com/eliebakouch/status/2003632344159424562

Universal Reasoning Model. Universal Transformers crush standard Transformers on reasoning tasks. But why? Prior work attributed the gains to elaborate architectural innovations like hierarchical designs and complex gating mechanisms. But these researchers found a simpler… https://x.com/omarsar0/status/2005640015964250267

We’re closing out the year with a new release that brings Object Time-to-Live (TTL), multimodal document embeddings, and both Java v6 Client and Flat Index RQ general availability! Here’s what’s new: … https://x.com/weaviate_io/status/2005673260344877186

Final Sarah Paine lecture: Why Russia lost the Cold War. To me, the most interesting question is not why the Soviet Union ultimately collapsed – it’s how a brutal, centrally planned, stupendously inefficient, colonial land empire survived for so long. I was surprised to learn… https://x.com/dwarkesh_sp/status/2002075498101551554

“Generative Refocusing: Flexible Defocus Control from a Single Image” TL;DR: a two-step process; DeblurNet recovers all-in-focus images from various inputs + BokehNet for controllable bokeh; semi-supervised training. https://x.com/Almorgand/status/2003140933223919815

How to game the METR plot – by Shashwat Goel https://shash42.substack.com/p/how-to-game-the-metr-plot

Fantastic piece by @muradhem on the AI talent wars @the_logic. My 2c: talented people have many options. In the end, what matters to the people who drive innovation is finding like-minded people who are pushing what is possible. Our approach at @adaptionlabs is very simple. https://x.com/sarahookr/status/2003581788850127276

Strong new open 32B VLM from Korea, with good English and very good Korean benchmark scores! The Artificial Analysis score is partly due to a very high taubench score, but the other benchmarks are good as well, and vision understanding also seems strong. Now the fun part… https://x.com/eliebakouch/status/2005549508063559876

One of the weirdest law review titles ever. And the title refers to an actual thing that is weirder than the title of the paper. But also the paper has some interesting things to say about new approaches to IP protection that might be especially relevant in the time of AI. https://x.com/emollick/status/2002632525953605799

Survey on vision encoders in VLMs. The encoder side is weirdly understudied: everyone’s busy scaling the LM while recycling the same 400M CLIP from 2021. We looked at 70+ models and found that training methodology beats scale. A well-trained 400M encoder outperforms a 6B one. https://x.com/JinaAI_/status/2005646823201951849

The Sparks paper was an innovative attempt at finding ways of pointing at GPT-4 and saying “there is something unexpected here that is hard to measure right now.” I think Early Science Acceleration feels similar: a blurry picture that will become clearer in the coming years. https://x.com/emollick/status/2001456094418256077
