Image created with GPT Image 1. Image prompt: stark white plane with fine black baseline rule, Substance monochrome palette, minimalist graphic design inspired by New Order’s ‘Substance’, metaphor for silicon valley skyline circuitry, flat color, subtle texture, 1980s Saville typography style
Salesforce Signs Definitive Agreement to Acquire Convergence.ai – Salesforce https://www.salesforce.com/news/stories/salesforce-signs-definitive-agreement-to-acquire-convergence-ai/
In September 2024, physicians working with AI did better on the HealthBench doctor benchmark than either AI or physicians alone. With the release of o3 and GPT-4.1, AI answers are no longer improved on by physicians, and error rates appear to be dropping for newer AI models. https://x.com/emollick/status/1922145507461197934
Beyond Text-Only AI: On-Demand UI Generation for Better Conversational Experiences – fka.dev https://blog.fka.dev/blog/2025-05-16-beyond-text-only-ai-on-demand-ui-generation-for-better-conversational-experiences/
The AI labs spent a few years quietly scaling up supervised learning, where the best-case outcome was obvious: an excellent simulator of human text. Now they are scaling up reinforcement learning, which is something fundamentally different, and no one knows what happens next. https://x.com/jxmnop/status/1922078186864566491
Report: Spring 2025 AI Model Usage Trends – Poe https://poe.com/blog/spring-2025-ai-model-usage-trends
A common question is “can an AI make money?” This benchmark, where AIs run a simulated vending machine over time, suggests yes, with an important caveat: on average, Claude 3.5 & o3-mini beat a human, but they are high-variance and fail at random times for complex reasons. https://x.com/emollick/status/1921048218353197470
🎉 Congratulations to the FlashInfer team – their technical paper, “FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving,” just won best paper at #MLSys2025. 🏆 NVIDIA is now backing FlashInfer. https://x.com/NVIDIAAIDev/status/1922354691251294644
Evaluations are essential to understanding how models perform in health settings. HealthBench is a new evaluation benchmark, developed with input from 250+ physicians from around the world, now available in our GitHub repository. https://x.com/OpenAI/status/1921983050138718531
Introducing Continuous Thought Machines – new blog: https://x.com/SakanaAILabs/status/1921749814829871522
How far can reasoning models scale? | Epoch AI https://epoch.ai/gradient-updates/how-far-can-reasoning-models-scale
FlyLoop – AI Agent for Scheduling Meetings and Managing Your Calendar | Hacker News https://news.ycombinator.com/item?id=43972660
Introducing a new abstraction for agentic memory 💫 Memory is a huge topic for agentic systems – how do you store and retain information over time? While we’ve seen a lot of approaches on this topic, we spent a lot of time figuring out the right abstraction. https://x.com/jerryjliu0/status/1922460369345511781
(Unproven) arguments on why AI might have more rapid impact than previous tech: 1) broad-based & fast diffusion means more impact 2) knowledge work is different, transforms faster 3) agentic systems shortcut the need for complex adaptation of systems 4) supersmart AI just does the work https://x.com/emollick/status/1921983757281571021
One of the best articles on multi-agent design patterns: it explains each workflow with diagrams, theory, and code, covering prompt chaining, the reflection pattern, the tool-use pattern, agent orchestration, and the multi-agent pattern. https://x.com/Hesamation/status/1919810226473046292
Creating evaluations is the most effective way to improve model performance in any domain! https://x.com/BorisMPower/status/1922080385514504572
We’re back! ⚖️ @hwchase17 is setting the stage for the afternoon talks at Interrupt 2025 🦜🚀, centered around evals, quality, and reliability! – Quality is still the biggest blocker to bringing agents to production – “Great evals start with great observability” – OpenEvals https://x.com/LangChainAI/status/1922745714246906086
surpassing SOTA on 20% of the problems it was applied to is actually nuts https://x.com/bio_bootloader/status/1923121148864164123
Check out the full leaderboard at: https://x.com/lmarena_ai/status/1921966654256197814
Hunyuan-Turbos ranks in top-10 across all categories (except for style control #13) https://x.com/lmarena_ai/status/1921966651655717217
How to simulate and evaluate multi-turn conversations 💬 Most LLM applications today are chat-based. How would you evaluate the conversations? 🔧 We’re excited to launch OpenEvals — a set of utilities to simulate full conversations and evaluate your LLM application’s https://x.com/LangChainAI/status/1922747560483226041
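The simulate-then-evaluate pattern behind tools like OpenEvals can be sketched generically: a scripted "simulated user" drives the application turn by turn, and the resulting transcript is what downstream evaluators score. The function and message format below are illustrative assumptions, not the OpenEvals API:

```python
def simulate_conversation(app, simulated_user, max_turns=5):
    """Drive an app through a multi-turn chat with a scripted user,
    collecting the transcript for downstream evaluators. Generic
    sketch of the pattern, not the OpenEvals API itself."""
    history = []
    for _ in range(max_turns):
        user_msg = simulated_user(history)
        if user_msg is None:  # simulated user is done
            break
        history.append({"role": "user", "content": user_msg})
        history.append({"role": "assistant", "content": app(history)})
    return history

# Toy run: a user who asks two questions, an app that echoes.
script = iter(["hi", "bye"])
user = lambda h: next(script, None)
app = lambda h: f"you said: {h[-1]['content']}"
transcript = simulate_conversation(app, user)
print(len(transcript))  # 4 messages: two user turns, two replies
```

In a real harness, `simulated_user` would itself be an LLM prompted with a persona and goal, and the transcript would feed conversation-level evaluators.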
Jensen Huang is worried: Tariff war will create a vacuum https://semiconductorsinsight.com/jensen-huang-is-worried-about-china/
We’ve added support for the Responses API in the Evals API and dashboard. 🧭 https://x.com/OpenAIDevs/status/1923048126002102530
The Physical Turing Test: your house is a complete mess after a Sunday hackathon. On Monday night, you come home to an immaculate living room and a candlelight dinner, and you couldn’t tell whether a human or a machine had been there. Deceptively simple, insanely hard. https://x.com/DrJimFan/status/1920504375925223669
We’ve just released HealthBench – a new eval for AI systems for health. Developed with 262 physicians who have practiced in 60 countries. https://x.com/gdb/status/1921987974356443595
II-Medical – Intelligent Internet https://ii.inc/web/blog/post/ii-medical
Deep Transformer models have high parameter counts. This paper proposes Intra-Layer Recurrence (ILR), selectively reusing individual layers. This enhances performance without adding parameters, particularly when reapplying earlier layers. https://x.com/rohanpaul_ai/status/1921863923071926350
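The core mechanism is simple to sketch: a forward pass that reapplies selected layers according to a per-layer recurrence map. The toy affine "layers" below are stand-ins for Transformer blocks, not the paper’s architecture:

```python
import numpy as np

def forward_with_ilr(x, layers, repeats):
    """Apply each layer `repeats[i]` times before moving on.
    `layers` is a list of callables (toy stand-ins for Transformer
    blocks); `repeats` is the per-layer recurrence map."""
    for layer, r in zip(layers, repeats):
        for _ in range(r):
            x = layer(x)
    return x

# Toy example: two "layers" as simple maps; reapply the first one.
layers = [lambda x: x + 1.0, lambda x: x * 2.0]
out = forward_with_ilr(np.array([0.0]), layers, repeats=[3, 1])
# x -> +1 applied three times -> 3.0, then *2 once -> 6.0
```

Because the reused layers share weights, the parameter count is unchanged; only compute per forward pass grows with the recurrence counts.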
A neat feature of KerasHub: you can create KerasHub pretrained components straight from the base classes (tokenizers/backbones/classifiers/etc). Write code once, and switch the model later. Similar vibe to Transformers autoclasses and pipelines. https://x.com/fchollet/status/1922719664859381922
🔔 New blog post on how we can attain large speedups for our inference customers using custom speculators! 🚀 Key benefits of customization: ✅ ~1.3x faster inference ✅ ~25% cost reduction ✅ Gets better as you generate more responses https://x.com/togethercompute/status/1921983794573197538
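The accept/verify step that custom speculators accelerate can be sketched in a simplified greedy form: keep the longest prefix of the draft model’s tokens that the target model agrees with, then take the target’s correction at the first mismatch. Production systems accept or reject probabilistically; `speculative_accept` is an illustrative name:

```python
def speculative_accept(draft, target):
    """Greedy sketch of speculative decoding's verify step: `draft`
    is what the cheap speculator proposed, `target` is what the big
    model would emit at each position. Matching prefix tokens are
    accepted for free; the first mismatch is replaced by the target's
    token and drafting restarts from there."""
    out = []
    for d, t in zip(draft, target):
        if d != t:
            out.append(t)  # target model's correction
            break
        out.append(d)      # accepted draft token (no extra target steps)
    return out

print(speculative_accept(["the", "cat", "sat"], ["the", "cat", "ran"]))
# ['the', 'cat', 'ran'] -- two tokens accepted, one corrected
```

The speedup comes from verifying all drafted tokens in one target-model forward pass; the better the speculator matches the target, the longer the accepted prefixes.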
Google’s recently announced Gemini 2.0 Flash Preview Image Generation delivers a modest upgrade over the 2.0 Flash Experimental release in the latest Artificial Analysis Image Arena rankings, although the improvements can be subtle in individual comparisons. https://x.com/ArtificialAnlys/status/1922659105048821984
Here are the top AI papers of the week (May 5–11): – ZeroSearch – Discuss-RAG – Absolute Zero – Llama-Nemotron – The Leaderboard Illusion – Reward Modeling as Reasoning. Read on for more: https://x.com/dair_ai/status/1921606662214787114
Predicting and explaining AI model performance: A new approach to evaluation – Microsoft Research https://www.microsoft.com/en-us/research/blog/predicting-and-explaining-ai-model-performance-a-new-approach-to-evaluation/
X-REASONER: Towards Generalizable Reasoning Across Modalities and Domains. “General-domain text-based post-training can enable such strong generalizable reasoning.” X-REASONER is a vision-language model post-trained solely on general-domain text. https://x.com/iScienceLuvr/status/1920435270824178089
[2505.09568v1] BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset https://arxiv.org/abs/2505.09568v1
From zero to hero on all things vision-language models – from multimodal to reasoning to MoEs to benchmarks and more 🔥 One definitive blog post to bring you up to speed on all things VLMs – enjoy! 🤗 https://x.com/reach_vb/status/1921974792242016591
Super excited to work with @fidjissimo even more closely. Welcome to @OpenAI! It’s fun and wild and inspiring. https://x.com/kevinweil/status/1920348319856943114
Famously, GPT-4o makes up citations to papers (though error rates appear far lower for citations generated by Deep Research models). How often does it do that? This clever large-scale study gives us a clear picture. The AI is also biased towards shorter titles & famous papers. https://x.com/emollick/status/1920319164993933511
OpenAI introduces HealthBench, a new open-source LLM benchmark for health! Across frontier models, o3 is the best performing model with a score of 60%, followed by Grok 3 (54%) and Gemini 2.5 Pro (52%). A deeper dive: https://x.com/iScienceLuvr/status/1922013874687246756
Introducing HealthBench | OpenAI https://openai.com/index/healthbench/
Good debate on idea generation and AI: 1) an experimental paper finds that using the old GPT-3.5 helps people generate better ideas 2) a response paper finds that the AI’s ideas are all quite similar to each other 3) a response to that argues it may not matter, as the results are good https://x.com/emollick/status/1922717797848613068
Please check out our Qwen3 Technical Report. 👇🏻 https://x.com/Alibaba_Qwen/status/1922265772811825413
After supervising 20+ papers, I have highly opinionated views on writing great ML papers. When I entered the field I found this all frustratingly opaque So I wrote a guide on turning research into high-quality papers with scientific integrity! Hopefully still useful for NeurIPS https://x.com/NeelNanda5/status/1921928364790833651
The @vercel Chat SDK now features stream resumption. This makes AI conversations resilient to network hiccups and to reloading or sharing a chat mid-generation. This is especially valuable for long responses (e.g., Deep Research). No proprietary APIs, no sticky load balancing. https://x.com/rauchg/status/1921168985900372081
Democratizing AI: The Psyche Network Architecture – NOUS RESEARCH https://nousresearch.com/nous-psyche/
We’re missing (at least one) major paradigm for LLM learning. Not sure what to call it – possibly it has a name – system prompt learning? Pretraining is for knowledge. Finetuning (SL/RL) is for habitual behavior. Both of these involve a change in parameters. https://x.com/karpathy/status/1921368644069765486
Hot take: auto-regression sucks and is impressive only as a parlor trick. Any spark of intelligence from an LLM reflects that it moved beyond, and built a factorized model with meaningful latents. https://x.com/francoisfleuret/status/1922174021619097741
P.S. CoT [is trying to] address this by sampling the said meaningful latents as regular tokens. I claim this is a poor man’s version of “the real thing”. https://x.com/francoisfleuret/status/1922892961680896238
Deep learning is ~10% idea and ~90% implementation. https://x.com/hyhieu226/status/1922707456771195390
Type-constrained Code Generation with Large Language Models: use the LSP/type system to constrain valid output tokens during code generation. Reduces compilation errors by >50% with 30B models. https://x.com/mathemagic1an/status/1922449795425198209
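The idea can be sketched as masked greedy decoding: before picking the next token, drop every candidate that would make the prefix invalid. Here a toy balanced-parentheses predicate stands in for a real LSP/type-system check, and all names are illustrative:

```python
def constrained_decode(candidates_per_step, prefix_ok):
    """Greedy decoding with a validity mask. At each step, tokens
    that would make the prefix invalid (per `prefix_ok`, a stand-in
    for a type-system check) are masked before picking the
    highest-scoring remaining token."""
    out = []
    for candidates in candidates_per_step:  # [(token, score), ...]
        valid = [(t, s) for t, s in candidates if prefix_ok(out + [t])]
        if not valid:
            break
        token, _ = max(valid, key=lambda ts: ts[1])
        out.append(token)
    return out

def balanced_prefix(tokens):
    """Toy validity check: never close more parens than were opened."""
    depth = 0
    for t in tokens:
        depth += {"(": 1, ")": -1}.get(t, 0)
        if depth < 0:
            return False
    return True

steps = [[(")", 0.9), ("(", 0.5)],  # ")" scores higher but is invalid
         [("x", 0.7), (")", 0.6)],
         [(")", 0.8), ("(", 0.2)]]
print(constrained_decode(steps, balanced_prefix))  # ['(', 'x', ')']
```

The masking guarantees every emitted prefix stays well-formed, which is why constrained decoding can cut compilation errors so sharply: the model is never allowed to commit an ill-typed token in the first place.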
🚀 Big memory upgrade in LlamaIndex! The new, flexible Memory API blends short-term chat history and long-term memory via plug-and-play blocks: ➡️ StaticMemoryBlock for non-changing static information ➡️ FactExtractionMemoryBlock that keeps track of a list of useful facts https://x.com/llama_index/status/1922340015499313543
So many updates and the hackathon hasn’t even begun! You can now train with Atropos using @axolotl_ai too https://x.com/Teknium1/status/1922435846751584771
Does your eval loop survive contact with real humans? Most don’t. @weave_wb closes the gap by letting you layer human annotations right onto every trace. https://x.com/weights_biases/status/1922722359795916943
this should make you feel something https://x.com/arithmoquine/status/1922751330474500530
also, embeddings *are* underrated https://x.com/jxmnop/status/1922468210256879786
(the fact flash-attention somehow works with uv gives me hope this is possible) https://x.com/typedfemale/status/1922427558924001672
🚨 This week’s top AI/ML research papers: – Absolute Zero – RM-R1 – Seed-Coder – Flow-GRPO – ZeroSearch – Ming-Lite-Uni – A Survey on Large Multimodal Reasoning Models – On Path to Multimodal Generalist – HunyuanCustom – Unified Multimodal CoT Reward Model https://x.com/TheAITimeline/status/1921626740675248338
This is just so incredible: they ran RL training for a massive LLM without owning a cluster, and it worked. INTELLECT-2: The First Globally Distributed Reinforcement Learning Training of a 32B Parameter Model. https://x.com/rohanpaul_ai/status/1922224996291879358
INTELLECT_2_Technical_Report.pdf https://storage.googleapis.com/public-technical-paper/INTELLECT_2_Technical_Report.pdf#page=11.25
NousResearch/DisTrO: Distributed Training Over-The-Internet https://github.com/NousResearch/DisTrO
Built this nice and clean LMS using @lovable_dev. An LMS doesn’t have to be complex: a simple user experience makes it easy to find resources, manage assignments, and track learning. For 2.0, added some fun: • Weekly Challenges • Study Groups • Achievements • Learn from mentors online https://x.com/DhruvalGolakiya/status/1916559636091855277
[2505.04572] Stow: Robotic Packing of Items into Fabric Pods https://arxiv.org/abs/2505.04572
Chain-of-Thought (CoT) training applies uniform inference budgets. This causes inefficient stochastic gradient estimation due to varying prompt difficulty. This paper presents GVM-RAFT. It dynamically allocates sample budgets to prompts, minimizing stochastic gradient variance https://x.com/rohanpaul_ai/status/1921894876951572611
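The allocation idea can be sketched as a Neyman-style rule: give each prompt a share of the total rollout budget proportional to its estimated gradient-noise scale, so hard, noisy prompts get more samples than easy ones. This is a simplification of GVM-RAFT’s variance-minimizing allocation, and `allocate_budget` is an illustrative helper:

```python
def allocate_budget(stds, total):
    """Split a total rollout budget across prompts in proportion to
    their estimated per-prompt gradient std (Neyman-style allocation;
    a simplification of the paper's variance-minimizing rule)."""
    weight = sum(stds)
    raw = [total * s / weight for s in stds]
    return [max(1, round(r)) for r in raw]  # at least one sample each

# Hard prompts (noisy gradient estimates) get more rollouts.
print(allocate_budget([4.0, 1.0, 1.0], total=12))  # [8, 2, 2]
```

Compared with a uniform budget, this spends the same total compute while reducing the variance of the aggregate gradient estimate, which is the inefficiency the tweet describes.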
Parameter-Efficient Fine-Tuning (PEFT) in LLMs leaks private data from gradients. ReCIT is a novel attack recovering full private data and Personally Identifiable Information (PII) with high fidelity. 📌 Malicious pre-training biases LLM memory to recall PII. https://x.com/rohanpaul_ai/status/1921520409176141854
This paper offers two new Reinforcement Learning (RL) methods, Stochastic Group Relative Policy Optimization (S-GRPO) and Token-Specific Prefix Matching Optimization (T-SPMO), for Low-Rank Adaptation (LoRA) fine-tuning. They improve SVAMP benchmark accuracy from 46% to over 70% https://x.com/rohanpaul_ai/status/1921533495920455854
INTELLECT-2 Release: The First Globally Trained 32B Parameter Model Reinforcement Learning Training Run https://www.primeintellect.ai/blog/intellect-2-release
[2505.03335] Absolute Zero: Reinforced Self-play Reasoning with Zero Data https://arxiv.org/abs/2505.03335
I know you have always secretly craved a cool distillation script that actually gets results. That time has come 🤯 In collaboration w/ @lawrence_cjs & Shuchen Xue, we present a Diffusers-compatible training script for SANA Sprint 🏃 Links ⬇️ https://x.com/RisingSayak/status/1922213888168173960
LLMs in recommenders often struggle with limited context, inefficient item handling, and position bias, where item display order skews results. To address this, the paper introduces a hybrid system: a traditional model first selects top items, then an LLM reranks them. https://x.com/rohanpaul_ai/status/1921905698314875093
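The retrieve-then-rerank pipeline can be sketched as follows, with toy stand-ins (`score_fn`, `llm_rerank`) for the traditional scorer and the LLM; the function names and signatures are illustrative, not from the paper:

```python
def recommend(user, items, score_fn, llm_rerank, k=20, n=5):
    """Two-stage hybrid pipeline: a cheap traditional scorer
    shortlists the top-k items, then the LLM reranks only that
    shortlist -- sidestepping context-length limits and position
    bias over the full catalog."""
    shortlist = sorted(items, key=lambda i: score_fn(user, i), reverse=True)[:k]
    return llm_rerank(user, shortlist)[:n]

# Toy example: stage 1 is a popularity prior, stage 2 reranks by
# closeness to the user's preferred item id.
items = list(range(10))
pop = lambda u, i: i
prefer = lambda u, xs: sorted(xs, key=lambda i: abs(i - u))
print(recommend(7, items, pop, prefer, k=4, n=2))  # [7, 8]
```

The key design point is that the LLM only ever sees `k` items, so its context window and any position bias apply to a short, pre-filtered list rather than the whole catalog.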
PrimeIntellect-ai/prime-rl: prime-rl is a codebase for decentralized RL training at scale https://github.com/PrimeIntellect-ai/prime-rl
HuB is a unified framework that enables humanoids to execute complex balancing tasks. It integrates reference motion refinement, balance-aware policy learning, and sim-to-real robustness training. Project page: https://x.com/TheHumanoidHub/status/1922153643555406058
Researchers at Tsinghua University introduced Absolute Zero, a new method for AI training It enables models to learn and master complex reasoning tasks on their own through self-play Can be a strong alternative to training with costly human-labeled data https://x.com/rowancheung/status/1921815757886775804
Absolute Zero: Reinforced Self-play Reasoning with Zero Data https://www.arxiv.org/pdf/2505.03335
LLMs Get Lost in Multi-turn Conversation. The cat is out of the bag – pay attention, devs. This is one of the most common issues when building with LLMs today. Glad there is now a paper sharing insights. Here are my notes: https://x.com/omarsar0/status/1922755721428598988
SVAD https://yc4ny.github.io/SVAD/
Deep Transformer models are computationally demanding due to their many layers. This paper introduces ReplaceMe, a training-free depth pruning method. It substitutes Transformer blocks with a single linear operation, estimated from calibration data, achieving up to 25% pruning https://x.com/rohanpaul_ai/status/1921919540671230093
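The training-free replacement step can be sketched as an ordinary least-squares fit on calibration activations captured at the pruned block’s input and output (this omits ReplaceMe’s handling of residual streams and normalization):

```python
import numpy as np

def fit_replacement(X_in, X_out):
    """Estimate a single linear map W such that X_in @ W ≈ X_out,
    from calibration activations captured before and after the block
    of layers being pruned -- the training-free step, simplified."""
    W, *_ = np.linalg.lstsq(X_in, X_out, rcond=None)
    return W

# Sanity check: if the pruned block really were linear, the
# calibration fit recovers it exactly.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 16))       # 256 calibration activations
W_true = rng.normal(size=(16, 16))
W_hat = fit_replacement(X, X @ W_true)
print(np.allclose(W_hat, W_true))    # True
```

In practice the replaced block is nonlinear, so `W_hat` is only the best linear approximation on the calibration distribution; the paper’s result is that this approximation is good enough to prune up to ~25% of depth.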
LLMs Get Lost In Multi-Turn Conversation https://arxiv.org/pdf/2505.06120
UC Berkeley researchers also introduced PyRoki, a modular, extensible, and cross-platform toolkit for kinematic optimization It solves inverse kinematics, trajectory optimization, and motion retargeting for a wide range of robots, including humanoids https://x.com/adcock_brett/status/1921597243628360079
RL for Search-Efficient LLMs Presents a new post-training RL framework that explicitly trains LLMs to optimize search usage. Recipe: structured reasoning template and reward policy + GRPO Leads to smarter and more efficient reasoning and retrieval of external knowledge. https://x.com/omarsar0/status/1922665313117552664
Introducing Continuous Thought Machines https://sakana.ai/ctm/
Generating Physically Stable and Buildable LEGO Designs from Text https://avalovelace1.github.io/LegoGPT/
Main reasons LLMs get “lost”: – They make premature and often incorrect assumptions early in the conversation. – They attempt full solutions before having all necessary information, leading to “bloated” or off-target answers. – They over-rely on their previous (possibly incorrect) answers. https://x.com/omarsar0/status/1922755800843550833
Severe performance drop in multi-turn settings: all tested LLMs show significantly worse performance in multi-turn, underspecified conversations compared to single-turn, fully-specified instructions. The average performance drop is 39% across six tasks, even for SoTA models. https://x.com/omarsar0/status/1922755768585158785
Stanford researchers debuted a Teleoperated Whole-Body Imitation System (TWIST) It enables coordinated, versatile, whole-body movements of humanoids, using a single neural network This will enable functional general-purpose robots in different domains! https://x.com/adcock_brett/status/1921597153626984597



