Image created with GPT Image 1. Image prompt: rose cluster centre with thin color-code strip footer, PCL muted floral palette, minimalist graphic design inspired by New Order’s ‘Power, Corruption & Lies’, metaphor for stacked bar scorecards of model prowess, flat color, subtle texture, 1980s Saville typography style
Salesforce Signs Definitive Agreement to Acquire Convergence.ai – Salesforce https://www.salesforce.com/news/stories/salesforce-signs-definitive-agreement-to-acquire-convergence-ai/
In September 2024, physicians working with AI did better on the HealthBench doctor benchmark than either AI or physicians alone. With the release of o3 and GPT-4.1, AI answers are no longer improved on by physicians. Error rates also appear to be dropping for newer AI models. https://x.com/emollick/status/1922145507461197934
Report: Spring 2025 AI Model Usage Trends – Poe https://poe.com/blog/spring-2025-ai-model-usage-trends
A common question is “can an AI make money?” This benchmark, where AIs run a simulated vending machine over time, suggests yes, with an important caveat. On average, Claude 3.5 & o3-mini beat a human, but they are high in variance, and fail at random times for complex reasons. https://x.com/emollick/status/1921048218353197470
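The vending-machine benchmark above can be pictured with a toy simulation: an agent policy sets a price each day, noisy demand responds to that price, and profit accumulates. This is a minimal sketch for illustration only; the function names, the demand curve, and all parameters are assumptions, and the real benchmark is far richer (inventory, suppliers, email, long horizons).

```python
import random

def run_vending_sim(policy, days=30, seed=0):
    """Toy vending-machine benchmark (illustrative, not the real eval):
    `policy` maps a day number to a price; demand falls as price rises;
    the return value is total profit over the run."""
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    cash, unit_cost = 0.0, 1.00
    for day in range(days):
        price = policy(day)
        # Simple noisy linear demand curve: fewer sales at higher prices.
        demand = max(0, int(rng.gauss(20 - 4 * price, 2)))
        cash += demand * (price - unit_cost)
    return round(cash, 2)

flat_policy = lambda day: 2.50  # a fixed-price baseline "agent"
profit = run_vending_sim(flat_policy)
```

The high-variance failures the tweet mentions would show up here as the spread of `profit` across seeds, which is why averaging over many simulated runs matters when comparing agents to a human baseline.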
Evaluations are essential to understanding how models perform in health settings. HealthBench is a new evaluation benchmark, developed with input from 250+ physicians from around the world, now available in our GitHub repository. https://x.com/OpenAI/status/1921983050138718531
FlyLoop – AI Agent for Scheduling Meetings and Managing Your Calendar | Hacker News https://news.ycombinator.com/item?id=43972660
Creating evaluations is the most effective way to improve model performance in any domain! https://x.com/BorisMPower/status/1922080385514504572
We’re back! ⚖️ @hwchase17 is setting the stage for the afternoon talks at Interrupt 2025 🦜🚀, centered around evals, quality, and reliability! – Quality is still the biggest blocker of bringing agents to production – “Great evals start with great observability” – OpenEvals https://x.com/LangChainAI/status/1922745714246906086
surpassing SOTA on 20% of the problems it was applied to is actually nuts https://x.com/bio_bootloader/status/1923121148864164123
Check out the full leaderboard at: https://x.com/lmarena_ai/status/1921966654256197814
Hunyuan-Turbos ranks in top-10 across all categories (except for style control #13) https://x.com/lmarena_ai/status/1921966651655717217
How to simulate and evaluate multi-turn conversations 💬 Most LLM applications today are chat-based. How would you evaluate the conversations? 🔧 We’re excited to launch OpenEvals — a set of utilities to simulate full conversations and evaluate your LLM application’s … https://x.com/LangChainAI/status/1922747560483226041
Jensen Huang is worried: Tariff war will create a vacuum https://semiconductorsinsight.com/jensen-huang-is-worried-about-china/
We’ve added support for the Responses API in the Evals API and dashboard. 🧭 https://x.com/OpenAIDevs/status/1923048126002102530
The Physical Turing Test: your house is a complete mess after a Sunday hackathon. On Monday night, you come home to an immaculate living room and a candlelight dinner. And you couldn’t tell whether a human or a machine had been there. Deceptively simple, insanely hard. It is the … https://x.com/DrJimFan/status/1920504375925223669
We’ve just released HealthBench — a new eval for AI systems for health. Developed with 262 physicians who have practiced in 60 countries. https://x.com/gdb/status/1921987974356443595
II-Medical – Intelligent Internet https://ii.inc/web/blog/post/ii-medical
Google’s recently announced Gemini 2.0 Flash Preview Image Generation delivers a modest upgrade over the 2.0 Flash Experimental release, although those improvements can be subtle in individual comparisons – see below! In the latest Artificial Analysis Image Arena rankings, … https://x.com/ArtificialAnlys/status/1922659105048821984
Super excited to work with @fidjissimo even more closely. Welcome to @OpenAI! It’s fun and wild and inspiring. https://x.com/kevinweil/status/1920348319856943114
Famously, GPT-4o makes up citations to papers (though error rates appear far lower for citations generated by Deep Research models). How often does it do that? This clever large-scale study gives us a clear picture. The AI is also biased towards shorter titles & famous papers. https://x.com/emollick/status/1920319164993933511
OpenAI introduces HealthBench, a new open-source LLM benchmark for health! Across frontier models, o3 is the best performing model with a score of 60%, followed by Grok 3 (54%) and Gemini 2.5 Pro (52%). A deeper dive: HealthBench consists of 5,000 synthetically generated … https://x.com/iScienceLuvr/status/1922013874687246756
Introducing HealthBench | OpenAI https://openai.com/index/healthbench/
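HealthBench grades model responses against physician-written rubric criteria that carry point values, with negative points penalizing harmful behavior, and the earned points normalized by the maximum achievable. The sketch below is a hedged illustration of that rubric-style scoring; the example criterion names and point values are invented, and the exact normalization details are assumptions, so see the HealthBench paper and repository for the real scheme.

```python
def rubric_score(met, criteria):
    """Rubric-style score sketch: sum the points of every criterion the
    response met (negative points act as penalties), divide by the
    maximum achievable positive points, and clip to [0, 1]."""
    earned = sum(pts for cid, pts in criteria.items() if cid in met)
    max_pts = sum(pts for pts in criteria.values() if pts > 0)
    return max(0.0, min(1.0, earned / max_pts))

# Hypothetical criteria for one health conversation (names/values invented).
criteria = {
    "asks_red_flag_symptoms": 5,
    "advises_er_if_severe": 7,
    "states_wrong_dosage": -6,   # penalty criterion
}
score = rubric_score({"asks_red_flag_symptoms", "advises_er_if_severe"}, criteria)
# → 1.0 (12 of 12 positive points, no penalties triggered)
```

Under this framing, the 60% headline score for o3 would correspond to earning roughly 60% of the achievable rubric points averaged across the benchmark's conversations.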
Main reasons LLMs get “lost” – Make premature and often incorrect assumptions early in the conversation. – Attempt full solutions before having all necessary information, leading to “bloated” or off-target answers. – Over-rely on their previous (possibly incorrect) answers, … https://x.com/omarsar0/status/1922755800843550833
Severe Performance Drop in Multi-Turn Settings: All tested LLMs show significantly worse performance in multi-turn, underspecified conversations than in single-turn, fully-specified instructions. The average performance drop is 39% across six tasks, even for SoTA models. https://x.com/omarsar0/status/1922755768585158785
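The single- vs multi-turn comparison behind that 39% figure can be sketched as a tiny harness: the same instruction is either revealed shard by shard across turns (with the model answering after each one) or concatenated into one fully-specified prompt. This is a hypothetical harness, not the paper's actual code; `model` here is any callable over a chat history, and the mock below just echoes what it has seen.

```python
def evaluate(model, instruction_shards, multi_turn=True):
    """Run one episode of the single- vs multi-turn comparison.
    multi_turn=True: reveal shards one per turn, model answers each time;
    multi_turn=False: model sees the whole instruction at once.
    Returns the model's final answer either way."""
    history = []
    if multi_turn:
        for shard in instruction_shards:
            history.append({"role": "user", "content": shard})
            history.append({"role": "assistant", "content": model(history)})
        return history[-1]["content"]  # answer after the last shard
    history.append({"role": "user", "content": " ".join(instruction_shards)})
    return model(history)

# Trivial mock "model" that concatenates the user turns it has seen so far.
mock = lambda hist: " | ".join(m["content"] for m in hist if m["role"] == "user")
single = evaluate(mock, ["sort a list", "in descending order"], multi_turn=False)
multi = evaluate(mock, ["sort a list", "in descending order"], multi_turn=True)
```

Scoring the final answers from both modes against the same reference and averaging the gap over many tasks is what yields a drop statistic like the 39% reported above; the failure modes in the previous entry (premature assumptions, over-reliance on earlier answers) live inside the multi-turn branch, where each intermediate answer is fed back into the history.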




