Image created with Ideogram v3. Image prompt: Late‑90s boy‑band cover “Score‑Board – Beyond Metrics”: five atop giant bar‑chart columns, pointing upward; satin tracksuits numbered 99; background of glowing line graphs; chrome logo, celebratory confetti lens flares.

“The Future of Presentations Is Here! Introducing Genspark AI Slides, a full agentic tool that makes creating slides fast and simple.” https://x.com/genspark_ai/status/1914650977577394295

“Evaluating LLM research agents on scientific discovery lacks objective measures for assessing proposed methods. This paper introduces MLRC-BENCH, a benchmark using Machine Learning conference competitions to objectively evaluate agent novelty and effectiveness against human…” https://x.com/rohanpaul_ai/status/1916300010959896857

“We also evaluated the preliminary performance of Qwen3-235B-A22B on the open-source coding agent OpenHands. It achieved 34.4% on SWE-bench Verified, achieving competitive results with fewer parameters! Thanks to @allhands_ai for providing an easy-to-use agent. Both open models and…” https://x.com/Alibaba_Qwen/status/1917064282552078480

“Much of the AI industry is caught in a particularly toxic feedback loop right now. Blindly chasing better human preference scores is to LLMs what chasing total watch time is to a social media algo. It’s a recipe for manipulating users instead of providing genuine value to them.” https://x.com/alexalbert__/status/1916878483390869612

“Devastating takedown of Chatbot Arena. It’s one thing for leaderboards to suck because they try to quantify the unquantifiable but quite another thing to actively choose flagrantly unscientific and nontransparent practices that benefit the big dogs.” https://x.com/random_walker/status/1917516403977994378

“If you can’t shell out $2K 😱 to learn about LLM evaluations, take a look at our free/open resources: 1. LLM guidebook: From theory to troubleshooting…” https://x.com/clefourrier/status/1915339216344526896

GenAI LLM Assessment https://go.turing.com/genai-llm-assessment

“Whether to collect preferences (“do you prefer response A or B?”) from the same person who wrote the prompt, or a different person, is important and understudied. Highlighted this question in a recent talk.” https://x.com/johnschulman2/status/1917483351436582953

“Thanks for the authors’ feedback; we’re always looking to improve the platform! If a model does well on LMArena, it means that our community likes it! Yes, pre-release testing helps model providers identify which variant our community likes best. But this doesn’t mean the…” https://x.com/lmarena_ai/status/1917492084359192890

“@willdepue I think technical details of how the model was made aren’t particularly interesting; rather, the question is how it was tested and shipped in such a state. These are standard components in any post-mortem; the problem to solve in the future is organizational, not purely technical.” https://x.com/nearcyan/status/1917475639655018708

United Compute | AI Computers & GPU Benchmark https://www.unitedcompute.ai/gpu-price-tracker

“I appreciate the answer, but it misses the point: → Selective reporting is biased because best-of-N inflates the final scores. → Access to preference data leads to overfitting and better Elo scores. The fact that only a few companies can access this data completely biases the…” https://x.com/maximelabonne/status/1917563456632328508

“It is critical for scientific integrity that we trust our measure of progress. The @lmarena_ai has become the go-to evaluation for AI progress. Our release today demonstrates the difficulty in maintaining fair evaluations on @lmarena_ai, despite best intentions.” https://x.com/sarahookr/status/1917547727715721632

“There is no reasonable scientific justification for this practice. Being able to choose the best score to disclose enables systematic gaming of the Arena score. This advantage increases with the number of variants, and if other providers don’t know they can also privately test…” https://x.com/sarahookr/status/1917547733994594420

“We also observe large differences in Arena data access. @lmarena_ai is an open community resource that provides free feedback, but 61.3% of all data goes to proprietary model providers.” https://x.com/sarahookr/status/1917547738553803018

“Really incredible detective work by @singhshiviii et al. at @Cohere_Labs and elsewhere documenting the ways in which @lmarena_ai works with companies to help them game the leaderboard.” https://x.com/BlancheMinerva/status/1917445722380681651
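
The best-of-N effect this thread describes is easy to see in a toy simulation. The sketch below is illustrative only; the Gaussian-noise model and every number are my assumptions, not figures from the paper. It gives every private variant the same true skill, adds measurement noise from a finite number of votes, and publishes only the top scorer:

```python
# Toy simulation of best-of-N score selection: N variants with identical
# true skill are measured with noise, and only the best result is published.
import numpy as np

rng = np.random.default_rng(0)

TRUE_SCORE = 1200.0   # hypothetical true Arena score shared by every variant
NOISE_SD = 15.0       # hypothetical measurement noise from finite votes
TRIALS = 20_000       # Monte Carlo repetitions

for n_variants in (1, 2, 5, 10, 20):
    measured = TRUE_SCORE + NOISE_SD * rng.standard_normal((TRIALS, n_variants))
    published = measured.max(axis=1)  # provider discloses only the best variant
    print(f"N={n_variants:2d}  mean published score {published.mean():7.1f}  "
          f"inflation +{published.mean() - TRUE_SCORE:.1f}")
```

With Gaussian noise the expected inflation grows roughly like the noise scale times √(2 ln N), so access to more private variants keeps paying off even though no variant is actually better, which is exactly the asymmetry the thread objects to.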

“It’s not only about how long your context is, but how well you use it. Great to see Gemini 2.5 models dominating MRCR and other benchmarks on long context! See 2.5 Pro tackle a complex coding task by reasoning over an entire repo (>500k tokens). Performance and effective use of…” https://x.com/OriolVinyalsML/status/1916917758023139670
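
As a rough sanity check on that “>500k tokens” figure, here is a sketch that estimates a repository’s token count and compares it to a long-context budget. It uses tiktoken’s cl100k_base encoding as a stand-in (Gemini’s actual tokenizer differs) and an arbitrary list of file extensions, so treat the count as an approximation:

```python
# Approximate how many tokens an entire repo would occupy in a context window.
from pathlib import Path
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # stand-in tokenizer, not Gemini's
CONTEXT_BUDGET = 500_000                    # illustrative long-context window

def repo_token_count(root: str, exts=(".py", ".md", ".txt")) -> int:
    """Sum approximate token counts over text files under `root`."""
    total = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in exts:
            text = path.read_text(errors="ignore")
            # disallowed_special=() avoids errors on text that happens to
            # contain special-token strings like "<|endoftext|>".
            total += len(enc.encode(text, disallowed_special=()))
    return total

n = repo_token_count(".")
print(f"~{n:,} tokens; fits in a {CONTEXT_BUDGET:,}-token window: {n <= CONTEXT_BUDGET}")
```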

Benchmarking LLMs for global health https://research.google/blog/benchmarking-llms-for-global-health/

“🖼️ GPT-Image-1 and all the top text-to-image models are now live on LMArena Beta! For now, it only supports single-turn; try it out and explore our new design!” https://x.com/lmarena_ai/status/1917327293116264658

“If you’re picking your AI models based on public generalist leaderboards, you’re doing it wrong. In my opinion, evaluation & model picking is at least 30% of the work of any great AI builder, and it’s a mix of public generalist and specialized leaderboards, social signals (likes…” https://x.com/ClementDelangue/status/1917565202633023505

“Around this time 2 years ago, the community helped us launch the very first Arena leaderboard! Today we’re publishing a blog to celebrate everything we’ve built together on LMArena! 🥳👏 Highlights: ☑️ 3M+ community votes 🤖 400+ models ranked across text, vision…” https://x.com/lmarena_ai/status/1916620122342695363

“There’s a new paper circulating looking in detail at the LMArena leaderboard: ‘The Leaderboard Illusion.’” https://x.com/karpathy/status/1917546757929722115

LLM Arena Pareto Frontier https://winston-bosan.github.io/llm-pareto-frontier/
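
The page above plots models on a cost-versus-score plane, and the organizing idea is the Pareto frontier: keep only the models that no other model beats on both axes at once. A minimal sketch of that computation, with made-up model names and numbers:

```python
# Hypothetical ($ per 1M tokens, arena score) points; all values invented.
models = {
    "model-a": (15.0, 1310),
    "model-b": (3.0, 1290),
    "model-c": (0.5, 1210),
    "model-d": (4.0, 1250),  # dominated: model-b is cheaper AND scores higher
}

def pareto_frontier(points: dict[str, tuple[float, float]]) -> list[str]:
    # Sweep by ascending cost (ties: best score first) and keep a model only
    # if it beats every score achievable at equal or lower cost.
    ordered = sorted(points.items(), key=lambda kv: (kv[1][0], -kv[1][1]))
    frontier, best = [], float("-inf")
    for name, (cost, score) in ordered:
        if score > best:
            frontier.append(name)
            best = score
    return frontier

print(pareto_frontier(models))  # ['model-c', 'model-b', 'model-a']
```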

“BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text: ‘we present BRIDGE, a comprehensive multilingual benchmark comprising 87 tasks sourced from real-world clinical data sources across nine languages. We systematically evaluated 52…’” https://x.com/iScienceLuvr/status/1917139649354666432
