“‘techno optimism’ is no longer a real philosophy, it’s co-opted into the language of incurious merchants. you need to be actively describing the kind of lightcone you wish to be painting. obviously technology is ascendant, now what?” / X
https://x.com/tszzl/status/1881108195432940007
“Inference time compute is a terrible name for an important concept and it would be great if the AI community came up with something better soon. I would suggest “letting the AI think” but anthropomorphism is dangerous (except that AI already uses “training” “reasoning” etc.)” / X
https://x.com/emollick/status/1880123595361522069
[2501.04765] TREAD: Token Routing for Efficient Architecture-agnostic Diffusion Training
https://arxiv.org/abs/2501.04765
“Any good commentary on this? This uses QwQ (worse than o1 mini) to generate synth data that is then: – filtered to remove incorrect answers according to the datasets – “We then rewrite QwQ traces with GPT-4o-mini into a well-formatted version” And then this matches o1. Seems” / X
https://x.com/TrelisResearch/status/1879530546038022623
“brief realization: 2025 will be a lot “slower” than 2024 if 2024 was about making inference as fast as possible (“lightning”, “realtime”, “mini”, “lite”, “turbo”) then 2025 is letting it run as long as it meaningfully improves performance (*1, level 2-4 agents,” / X
https://x.com/swyx/status/1882104864509190632
“The specific RL alg doesn’t matter much We tried PPO, GRPO and PRIME. Long cot all emerge and they seem all work well. We haven’t got the time to tune the hyper-parameters, so don’t want to make quantitative conclusions about which alg works better.
https://x.com/jiayi_pirate/status/1882839504899420517
“To name one highlight for CUA and Operator each: – CUA is *long-horizon* — it could act 20min autonomously if needed! – Operator uses *remote VMs*, which is good for managing safety and access, and means you can parallelize and save time! Soon, you’ll have 100 24/7 interns!” / X
https://x.com/ShunyuYao12/status/1882507506557288816
“There has been a lot of criticism of LLM scaling recently, but what exactly does the science behind scaling laws tell us? Power laws. LLM scaling laws are based upon the idea of a power law, which is defined via the function below: y = ax^{p} Here, we have two quantities–x and
https://x.com/cwolferesearch/status/1878929929611448588
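The power law quoted above, y = ax^p, can be checked numerically: taking logs turns it into a straight line, so the exponent drops out of an ordinary linear fit. A minimal sketch with synthetic data (a = 10, p = -0.5 are invented values, not from the thread):

```python
import numpy as np

# Sketch: recovering a power law y = a * x**p from data by linear
# regression in log space. The data here are synthetic (a = 10,
# p = -0.5), standing in for e.g. loss-vs-compute measurements.
x = np.array([1e18, 1e19, 1e20, 1e21, 1e22])
y = 10.0 * x ** -0.5

# log y = log a + p * log x, so a degree-1 polyfit on the logs
# returns the exponent p as its slope and log a as its intercept.
p, log_a = np.polyfit(np.log(x), np.log(y), 1)
a = np.exp(log_a)
print(p, a)  # ~ -0.5 and ~ 10.0
```

This log-log fit is the standard way scaling-law exponents are estimated in practice.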
[2501.12570v1] O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning
https://arxiv.org/abs/2501.12570v1
“🎉 Easier evals are here! Evals are a vital part of bringing LLM apps into production, but are often ignored because they are tedious to set up. This new DX with Pytest/Vitest reduces that friction! Really excited to ship this w/ @baga_tur! Let us know what you think.” / X
https://x.com/Hacubu/status/1882134158916600187
“write LLM evals like you write software tests (pytest/vitest/jest) writing software tests is standard practice. writing evals for LLMs is equally important, but we don’t see it as commonplace yet we hope this helps bridge the gap!” / X
https://x.com/hwchase17/status/1882141857012199427
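A minimal sketch of what “evals as software tests” can look like with plain pytest conventions. `call_model` is a hypothetical stub standing in for a real LLM client, not the actual API the thread announces:

```python
# A minimal sketch of writing an LLM eval the way you'd write a unit
# test, runnable with `pytest`. `call_model` is a hypothetical stub
# standing in for a real LLM client (swap in your actual SDK call).
def call_model(prompt: str) -> str:
    # Stub so the sketch runs offline; a real eval would call an API.
    return "The capital of France is Paris."

def test_answers_factual_question():
    # Simplest possible eval: assert a substring of the expected answer.
    answer = call_model("What is the capital of France?")
    assert "Paris" in answer

def test_answer_is_concise():
    # Evals can also encode soft quality bars, e.g. length budgets.
    assert len(call_model("What is the capital of France?")) < 200
```

Because these are ordinary test functions, they slot into existing CI with no extra harness.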
“I’ve been corrected. embarrassing. However, here I present to you: ZERO vs V3. How’s that for the RLHF hypothesis? (Yes Zero also invents a bizarre multi-tier response format)
https://x.com/teortaxesTex/status/1882198981637500930
“For those trying to understand @deepseek_ai Group Relative Policy Optimization (GRPO). Here, in simple steps: 1️⃣ Generate multiple outputs for each prompt using the current policy 2️⃣ Score these outputs using a reward model (rule or outcome) 3️⃣ Average the rewards and use it as
https://x.com/_philschmid/status/1881423639741960416
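The steps above can be sketched numerically. A minimal illustration of GRPO's group-relative advantage with invented rewards (the group mean acts as the baseline, so no learned value function is needed):

```python
import numpy as np

# Sketch of GRPO's group-relative advantage (steps 1-3 above), with
# invented rewards. For one prompt, sample a group of G outputs under
# the current policy and score each with a rule- or outcome-based
# reward; here, pass/fail scores for G = 5 sampled outputs.
rewards = np.array([0.0, 1.0, 1.0, 0.0, 1.0])

# Normalize within the group: the group mean is the baseline and the
# group std rescales, so no learned value function is needed
# (unlike PPO). Epsilon guards against a zero-variance group.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(advantages)
```

Outputs scored above the group average get positive advantages, those below get negative ones, and the advantages sum to zero within each group.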
The second wave of AI coding is here | MIT Technology Review
https://www.technologyreview.com/2025/01/20/1110180/the-second-wave-of-ai-coding-is-here/
“We’re excited to introduce Transformer², a machine learning system that dynamically adjusts its weights for various tasks!
https://x.com/SakanaAILabs/status/1879325924887613931
“here’s a good thread about Implicit CoT from its author:
https://x.com/jxmnop/status/1882830393373774310
AI can write improved code, but you have to know how to ask • The Register
https://www.theregister.com/2025/01/07/ai_can_write_improved_code_research/
“Either base or instruct model works – Instruct model learns faster, but converges to about same performance as base – Instruct model’s output are more structured and readable So extra instruction tuning isn’t necessary, which supports R1-Zero’s design decision
https://x.com/jiayi_pirate/status/1882839494828896730
“Here is the performance of CUA on the OSWorld and WebArena benchmarks. CUA performs better than previous SoTA but still has a long way to go when compared to human performance.
https://x.com/omarsar0/status/1882501699757379666
“LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation Author’s Explanation:
https://x.com/TheAITimeline/status/1881211041247359146
[2501.03840v1] Machine learning applications in archaeological practices: a review
https://arxiv.org/abs/2501.03840v1
“Why you want to look at ModernBERT embedding and ranking models! 👀 ModernBERT correctly associated Corona with the virus, while the previous open sota old model didn’t. Easy choice, right? No, both perform roughly equally well on the BEIR benchmark! 🫠 This is why relying
https://x.com/_philschmid/status/1882074406534385848
[2501.09891] Evolving Deeper LLM Thinking
https://arxiv.org/abs/2501.09891
“Can LLMs demonstrate behavioral self-awareness? This new paper shows that after fine-tuning LLMs on behaviors like outputting insecure code, the LLMs show behavioral self-awareness. In other words, without being explicitly trained to do so, the model that was tuned to output insecure
https://x.com/omarsar0/status/1882079780918747303
“We’re releasing Humanity’s Last Exam, a dataset with 3,000 questions developed with hundreds of subject matter experts to capture the human frontier of knowledge and reasoning. State-of-the-art AIs get <10% accuracy and are highly overconfident. @ai_risk @scaleai
https://x.com/DanHendrycks/status/1882433928407241155
[2209.07663] Monolith: Real Time Recommendation System With Collisionless Embedding Table
https://arxiv.org/abs/2209.07663
“Physical Intelligence released Fast, a tokenizer that compresses actions to train Transformers on robotic control Fast’s compression enables 5x faster VLA training, even on dexterous tasks like folding laundry, bussing tables, and packing bags
https://x.com/adcock_brett/status/1881024696306557134
LOKI
https://opendatalab.github.io/LOKI/
[2501.09466v1] DEFOM-Stereo: Depth Foundation Model Based Stereo Matching
https://arxiv.org/abs/2501.09466v1
“3/ Documentation release agents. Create release notes and documentation updates based on code changes to your repository Tools: – @firecrawl_dev – Github API + workflows Framework – LangGraph” / X
https://x.com/AlexReibman/status/1879702354393751619
FoundationStereo: Zero-Shot Stereo Matching
https://nvlabs.github.io/FoundationStereo/
[2410.01795v1] Knowledge-Driven Feature Selection and Engineering for Genotype Data with Large Language Models
https://arxiv.org/abs/2410.01795v1
“AIs working in the world can create complex systems with unexpected & risky feedback loops. Each output affects the world, affecting inputs, affecting outputs… Even simple loops can drive optimization behavior, making systems turn extreme. No training needed, just interaction.
https://x.com/emollick/status/1881027771734093912
“🚀 New Approach to Training MoE Models! We’ve made a key change: switching from micro-batches to global-batches for better load balancing. This simple tweak lets experts specialize more effectively, leading to: ✅ Improved model performance ✅ Better handling of real-world
https://x.com/Alibaba_Qwen/status/1882064440159596725
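The micro-batch vs. global-batch distinction can be illustrated with toy routing counts (all numbers invented, not from Qwen's post): a router that specializes each expert by domain looks badly imbalanced inside any single micro-batch, but balanced once assignments are pooled over the global batch.

```python
import numpy as np

# Toy illustration of micro-batch vs. global-batch MoE load balancing.
# 4 experts; each "micro-batch" is an array of expert assignments for
# its tokens. All numbers are invented for the sketch.
num_experts = 4

# Extreme specialization: each micro-batch routes all of its tokens to
# a single expert (e.g. one domain per micro-batch).
micro_batches = [np.full(4, e) for e in range(num_experts)]

def expert_fractions(assignments):
    # Fraction of tokens routed to each expert.
    counts = np.bincount(assignments, minlength=num_experts)
    return counts / counts.sum()

# Per-micro-batch view: a balance penalty computed here would punish
# this routing hard, since each batch uses only one expert.
per_micro = [expert_fractions(mb) for mb in micro_batches]

# Global-batch view: pooled over all micro-batches the load is
# perfectly even, so specialization is no longer penalized.
global_frac = expert_fractions(np.concatenate(micro_batches))
print(global_frac)  # [0.25 0.25 0.25 0.25]
```

Computing the balance statistic over the global batch thus leaves room for experts to specialize without tripping the load-balancing loss.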
“V3 Base → R1 Zero (Stage 0/4) ⚙️GRPO: “PPO without a value function using monte carlo estimates of the advantage” – @natolambert 🔍 Data Strategy: Verified prompts via rule-based rewards (IFEval/Tülu 3) + test cases (math/code). 💡Emergent: reasoning/reflection + long CoT.
https://x.com/casper_hansen_/status/1881404608591085817