“Evolution and The Knightian Blindspot of Machine Learning” Fascinating paper. Makes the case that the harsh filter of evolution enabled organisms (including us) to navigate unexpected events (“unknown unknowns”), something that AI systems struggle with. https://x.com/hardmaru/status/1888958032039813469
“Hierarchical LLM Reasoning ReasonFlux is a hierarchical reasoning framework for LLMs that optimizes complex problem-solving using scaling thought templates. It outperforms state-of-the-art models in mathematical reasoning. Key contributions include: • Structured Thought https://x.com/omarsar0/status/1889343676272525600
(WIP) A Little Bit of Reinforcement Learning from Human Feedback https://rlhfbook.com/
“Current version of @lovable_dev is about 10% as good as it will be in 6 months; still, people say it’s the best way to go from idea -> software.” https://x.com/antonosika/status/1886446548843962515
From surviving to thriving with GenAI in production: Lessons to successfully scale your data layer – Unstructured https://unstructured.io/webinars/elastic-webinar-genai-data-thriving-in-production
Half of “prompt engineering” was actually just prompting LLMs to act like Reasoners before the labs realized that was a thing. https://x.com/emollick/status/1889805704011383140
Many of the original AI benchmarks are genuinely bad: the reason AI stopped getting better at them is, in large part, because they were filled with errors that made getting the right answer impossible. https://x.com/emollick/status/1887636560327172289
Only those who dug deep into offline RL know the importance of online RL https://x.com/shaneguML/status/1889505192229609864
Rethinking Mixture-of-Agents This work investigates whether mixing different LLMs is truly beneficial. They also propose Self-MoA, an ensemble method that aggregates outputs from only the single top-performing LLM. Self-MoA outperforms standard MoA, which mixes different LLMs https://x.com/omarsar0/status/1886792384954163347
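In case the mechanism isn't obvious from the blurb above, here is a minimal Python sketch of the Self-MoA idea: sample several candidates from the single top-performing model, then have that same model aggregate them. The `generate(prompt, temperature)` callable is a hypothetical stand-in for whatever inference API is in use; nothing model-specific is assumed.

```python
from typing import Callable

def self_moa(
    question: str,
    generate: Callable[[str, float], str],
    n_samples: int = 4,
) -> str:
    # 1) Draw diverse candidates from the one top model (high temperature).
    candidates = [generate(question, 0.9) for _ in range(n_samples)]

    # 2) Aggregate with the same model acting as synthesizer (low temperature).
    numbered = "\n\n".join(
        f"[Candidate {i + 1}]\n{c}" for i, c in enumerate(candidates)
    )
    aggregation_prompt = (
        f"Question:\n{question}\n\n"
        f"Candidate answers:\n{numbered}\n\n"
        "Synthesize these into a single best final answer."
    )
    return generate(aggregation_prompt, 0.0)
```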
“To boost performance, we pulled out all the tricks in the book: fine-tuning, filtering on test cases, prompting the model to generate additional test cases, clustering solutions by similarity, ranking the clusters, etc. But the bitter lesson remains as bitter as ever.” https://x.com/alexwei_/status/1889727571106918694
UTF-8 🤦‍♂️ I already knew about the “confusables”, e.g.: e vs. е. Which look ~same but are different. But you can also smuggle arbitrary byte streams in any character via “variation selectors”. So this emoji: 😀󠅧󠅕󠄐󠅑󠅢󠅕󠄐󠅓󠅟󠅟󠅛󠅕󠅔 is 53 tokens. Yay https://x.com/karpathy/status/1889714240878940659
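For reference, a small Python sketch of how the byte smuggling works, assuming the commonly used mapping of bytes 0-15 to variation selectors U+FE00..U+FE0F and bytes 16-255 to U+E0100..U+E01EF. The payload below is illustrative, not the one hidden in the tweet's emoji.

```python
# The selectors render as nothing, but they survive copy-paste and are still
# counted by the tokenizer, which is why one "emoji" can be dozens of tokens.

def encode(base: str, data: bytes) -> str:
    hidden = "".join(
        chr(0xFE00 + b) if b < 16 else chr(0xE0100 + b - 16) for b in data
    )
    return base + hidden

def decode(text: str) -> bytes:
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xFE00 <= cp <= 0xFE0F:
            out.append(cp - 0xFE00)
        elif 0xE0100 <= cp <= 0xE01EF:
            out.append(cp - 0xE0100 + 16)
    return bytes(out)

stego = encode("😀", b"hidden payload")
print(len(stego), decode(stego))  # looks like one emoji, decodes to the bytes
```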
You can tell the RL is done properly when the models cease to speak English in their chain of thought https://x.com/karpathy/status/1835561952258723930
A few implications of tricks like this: 1) We are still VERY early in the development of Reasoners 2) There is high value in understanding how humans solve problems & applying that to AI 3) Higher possibility of further exponential growth in AI capabilities as techniques compound https://x.com/emollick/status/1887884562958569969
2502.04896 https://arxiv.org/pdf/2502.04896
PlayMate111 Homepage https://playmate111.github.io/
SYNTHETIC-1: Scaling Distributed Synthetic Data Generation for Verified Reasoning https://www.primeintellect.ai/blog/synthetic-1
[2502.06282] Jakiro: Boosting Speculative Decoding with Decoupled Multi-Head via MoE https://arxiv.org/abs/2502.06282
[2408.04301v1] Tackling Noisy Clients in Federated Learning with End-to-end Label Correction https://arxiv.org/abs/2408.04301v1
GitHub releases: https://x.com/ollama/status/1890130798353031389
“New Work on InSTA: A pipeline for Internet-scale training of web agents across 150k diverse websites without human annotations. Paper + Code:” https://x.com/rsalakhu/status/1889492471630946662
[2502.04891] GNNs Getting ComFy: Community and Feature Similarity Guided Rewiring https://arxiv.org/abs/2502.04891
[2502.06733] Dynamic Loss-Based Sample Reweighting for Improved Large Language Model Pretraining https://arxiv.org/abs/2502.06733
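The abstract isn't quoted above, so the following is only a generic sketch of loss-based sample reweighting within a batch (a temperature-controlled softmax over per-sample losses), not the paper's exact rule.

```python
import torch
import torch.nn.functional as F

def reweighted_lm_loss(logits: torch.Tensor, targets: torch.Tensor,
                       temperature: float = 1.0) -> torch.Tensor:
    # logits: [batch, seq, vocab], targets: [batch, seq]
    per_token = F.cross_entropy(
        logits.transpose(1, 2), targets, reduction="none"
    )                                   # [batch, seq]
    per_sample = per_token.mean(dim=1)  # [batch]

    # Higher-loss samples receive more weight; detach so the weights are not
    # themselves differentiated through.
    weights = F.softmax(per_sample.detach() / temperature, dim=0)
    return (weights * per_sample).sum()
```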
2502.03544 https://arxiv.org/pdf/2502.03544
[2502.06807] Competitive Programming with Large Reasoning Models https://arxiv.org/abs/2502.06807
DynVFX: Augmenting Real Videos with Dynamic Content https://dynvfx.github.io/
Goku https://saiyan-world.github.io/goku/
[2502.03032] Analyze Feature Flow to Enhance Interpretation and Steering in Language Models https://arxiv.org/abs/2502.03032
HumanDiT https://agnjason.github.io/HumanDiT-page/
Large Memory Models for Long-Context Reasoning This paper focuses on improving long-context reasoning with explicit memory mechanisms. It presents LM2, a Transformer-based architecture equipped with a dedicated memory module to enhance long-context reasoning, multi-hop https://x.com/omarsar0/status/1889681118913577345
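The blurb leaves the memory design unspecified, so below is only a hedged sketch of the general "explicit memory module" pattern (tokens cross-attending into a learnable memory bank, with a gated write back into the residual stream), not LM2's actual module.

```python
import torch
import torch.nn as nn

class MemoryRead(nn.Module):
    def __init__(self, d_model: int, n_slots: int = 64, n_heads: int = 8):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(n_slots, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq, d_model]; the memory bank is shared across the batch.
        mem = self.memory.unsqueeze(0).expand(x.size(0), -1, -1)
        read, _ = self.attn(query=x, key=mem, value=mem)
        # Gate controls how much of the memory read enters the residual stream.
        return x + torch.sigmoid(self.gate(x)) * read
```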
“TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models” has been accepted at #ICLR2025 as a Spotlight Paper ✨ Blog post (English): https://x.com/SakanaAILabs/status/1889708905280028809
RepoChat Blog & Dataset Release! 🚀 Since launching in Nov, RepoChat has collected 11K+ conversations and 4K+ votes, allowing users to chat with their GitHub repos! Our blog dives into: 📌 How users interact with RepoChat 📌 Leaderboard results (both retriever & answer models) 📌 https://x.com/lmarena_ai/status/1889741525808193635
[2502.04403] Agency Is Frame-Dependent https://arxiv.org/abs/2502.04403
How I Warped Your Noise https://warpyournoise.github.io/
Enabling Autoregressive Models to Fill In Masked Tokens Hybrid autoregressive and masked language model for infilling by training a linear decoder that takes their concatenated hidden states as input. Provides faster inference with KV caching. MARIA significantly outperforms https://x.com/iScienceLuvr/status/1889542518465077557
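A rough sketch of the hybrid head described above, with assumed names and tensor shapes: a small linear decoder takes the concatenated hidden states of an autoregressive LM and a masked LM for the same sequence and predicts the infilled tokens.

```python
import torch
import torch.nn as nn

class HybridInfillingHead(nn.Module):
    def __init__(self, d_ar: int, d_mlm: int, vocab_size: int):
        super().__init__()
        self.decoder = nn.Linear(d_ar + d_mlm, vocab_size)

    def forward(self, h_ar: torch.Tensor, h_mlm: torch.Tensor) -> torch.Tensor:
        # h_ar:  [batch, seq, d_ar]   hidden states from the autoregressive LM
        # h_mlm: [batch, seq, d_mlm]  hidden states from the masked LM
        return self.decoder(torch.cat([h_ar, h_mlm], dim=-1))  # token logits
```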
This meme summarizes the paper nicely https://x.com/polynoamial/status/1889541408065028421
[2502.02996] Building Bridges between Regression, Clustering, and Classification https://arxiv.org/abs/2502.02996
Similarity affects Oversight https://model-similarity.github.io/
On the Emergence of Thinking in LLMs Is there a truly emergent behavior akin to “move 37” that surpasses human reasoning, or is at least unexpected? This work investigates how to enable reasoning in LLMs while focusing on simple and scalable search. It proposes a post-training https://x.com/omarsar0/status/1889697727703134544
This paper is wild – a Stanford team shows the simplest way to make an open LLM into a reasoning model. They used just 1,000 carefully curated reasoning examples & a trick where if the model tries to stop thinking, they append “Wait” to force it to continue. Near o1 at math. https://x.com/emollick/status/1887696014829641983
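A minimal sketch of that trick (often called budget forcing), with a hypothetical `generate_until` helper and an assumed end-of-thinking delimiter standing in for the actual inference stack.

```python
from typing import Callable

END_OF_THINKING = "</think>"  # assumed delimiter; depends on the chat template

def think_with_budget(
    prompt: str,
    generate_until: Callable[[str, str], str],
    min_extensions: int = 2,
) -> str:
    trace = ""
    for _ in range(min_extensions):
        # The model tried to stop thinking; drop the marker and append "Wait"
        # so it is forced to keep reasoning.
        trace += generate_until(prompt + trace, END_OF_THINKING)
        trace += " Wait"
    # Final continuation, allowed to finish its reasoning this time.
    trace += generate_until(prompt + trace, END_OF_THINKING)
    return trace
```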
Good test of the ability of AIs on hard math problems they haven’t seen before*: the brand new qualifying math olympiad test. Reasoners are a big breakthrough. And DeepSeek does well but loses to o3-mini on ability & cost. * Though contamination is always impossible to rule out. https://x.com/emollick/status/1888058246264418465
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach We study a novel language model architecture that is capable of scaling test-time computation by implicitly reasoning in latent space. Our model works by iterating a recurrent block, thereby unrolling https://x.com/iScienceLuvr/status/1888792081382137966
[2502.05171] Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach https://arxiv.org/abs/2502.05171
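A toy sketch of the recurrent-depth idea from the abstract above: a prelude embeds the input, one shared block is iterated a chosen number of times at test time, and a coda decodes the final latent state. State initialization and input injection are simplified relative to the paper.

```python
import torch
import torch.nn as nn

class RecurrentDepthLM(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.prelude = nn.Embedding(vocab_size, d_model)
        self.block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.coda = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor, n_iters: int = 4) -> torch.Tensor:
        e = self.prelude(tokens)      # [batch, seq, d_model]
        s = torch.zeros_like(e)       # latent state
        for _ in range(n_iters):      # more iterations = more test-time compute
            s = self.block(s + e)     # re-inject the input each iteration
        return self.coda(s)           # logits: [batch, seq, vocab]
```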
Automating GPU Kernel Generation with DeepSeek-R1 and Inference Time Scaling | NVIDIA Technical Blog https://developer.nvidia.com/blog/automating-gpu-kernel-generation-with-deepseek-r1-and-inference-time-scaling/
[2412.06769] Training Large Language Models to Reason in a Continuous Latent Space https://arxiv.org/abs/2412.06769
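A hedged sketch in the spirit of that paper: intermediate "thoughts" stay in latent space by feeding the model's last hidden state back as the next input embedding for a few steps instead of decoding it into tokens. The Hugging Face style `inputs_embeds` / `output_hidden_states` interface is an assumption about the model wrapper.

```python
import torch

@torch.no_grad()
def latent_thought_steps(model, input_embeds: torch.Tensor, n_steps: int = 3):
    # input_embeds: [batch, seq, d_model]
    embeds = input_embeds
    for _ in range(n_steps):
        out = model(inputs_embeds=embeds, output_hidden_states=True)
        # Take the final layer's state at the last position: [batch, 1, d_model].
        last_hidden = out.hidden_states[-1][:, -1:, :]
        # Feed the continuous state back in place of a sampled token embedding.
        embeds = torch.cat([embeds, last_hidden], dim=1)
    return embeds
```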




