Image created with OpenAI GPT-Image-1. Image prompt: vintage Sly & the Family Stone album-cover style, glowing neon marquee sign reading I’M BACK featuring scoreboard showing AI benchmark scores; grainy retro print texture, vibrant 60s funk color palette, high-resolution
Veo 3 is now the first model to top both the Image to Video and Text to Video leaderboards, outperforming Kling 2.0 and Runway Gen 4 to secure the #1 spot across both modalities! Veo 3 represents a significant leap in Image to Video generation, where Google’s previous Veo 2 had… https://x.com/ArtificialAnlys/status/1928318831761707224
Exciting news: @OpenAI’s GPT-Image-1 takes the #1 spot in the Text-to-Image Arena! 🖼️🏆 ➤ Outperforms Google’s Imagen-3.0 by 50+ points ➤ Major leap over DALL·E 3 Huge congrats to @OpenAI! 👏 https://x.com/lmarena_ai/status/1930296340648735147
DeepSeek’s R1 leaps over xAI, Meta and Anthropic to be tied as the world’s #2 AI Lab and the undisputed open-weights leader DeepSeek R1 0528 has jumped from 60 to 68 in the Artificial Analysis Intelligence Index, our index of 7 leading evaluations that we run independently https://x.com/ArtificialAnlys/status/1928071179115581671
I wrote a history of AI in 32 images of otters using wifi on airplanes, from images to video to code. It shows two big trends: rapid improvements in AI models of all types and the growth of open weights AI models. Link in the comments. https://x.com/emollick/status/1929306757903319089
I’ve been using prompts of otters as a test of AI ability. It has taken less than three years to go from a text prompt producing images of abstract masses of fur to producing realistic videos with sound (including “like the musical Cats but for otters”). https://x.com/emollick/status/1929612980041253132
Anthropic’s progress on SWE-bench Verified really stands out. Naive extrapolation suggests they’ll have it mostly solved in another year. https://x.com/i/web/status/1929568948086800798
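For the curious, “naive extrapolation” here just means fitting a line to the score trend and reading it forward. A minimal sketch with invented placeholder numbers (not Anthropic’s actual scores):

```python
# Illustrative only: naive linear extrapolation of SWE-bench Verified scores.
# The (month, score) points below are made-up placeholders, not real results.
import numpy as np

months = np.array([0, 6, 12, 18])            # months since an arbitrary start
scores = np.array([33.0, 49.0, 62.0, 72.0])  # hypothetical pass rates (%)

slope, intercept = np.polyfit(months, scores, 1)  # least-squares line fit
for m in (24, 30):
    print(f"month {m}: projected {slope * m + intercept:.0f}%")
```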
An AI agent upgraded its own tools and doubled its bug-fix score. Darwin-style search plus Gödel-style self-reference cracked coding tasks. Pass rate jumps from 20% to 50% on SWE-bench Verified. Darwin Gödel Machine (DGM) is a coding agent that rewrites its own code, tests… https://x.com/rohanpaul_ai/status/1929461153182122442
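The loop behind DGM is easy to sketch: keep an archive of agent variants, sample a parent, let it propose an edit to its own code, and archive the child with its benchmark score. A toy Python sketch of that structure only; the string-matching “benchmark” and the one-character mutation stand in for the real LLM self-edits and SWE-bench runs:

```python
# Toy illustration of a Darwin-Gödel-style self-improvement loop.
# Everything below is a stand-in: the real DGM asks an LLM to rewrite its
# own tooling and scores children on SWE-bench Verified tasks.
import random

TARGET = "fix_bugs_quickly"  # toy benchmark: closeness to this string = "pass rate"

def evaluate(agent_code: str) -> float:
    """Stand-in for a benchmark run: fraction of positions matching TARGET."""
    return sum(a == b for a, b in zip(agent_code, TARGET)) / len(TARGET)

def propose_self_edit(agent_code: str) -> str:
    """Stand-in for the agent rewriting its own code: mutate one character."""
    i = random.randrange(len(TARGET))
    code = list(agent_code.ljust(len(TARGET), "_"))
    code[i] = random.choice("abcdefghijklmnopqrstuvwxyz_")
    return "".join(code)

def darwin_godel_loop(seed: str, generations: int = 2000) -> str:
    # Archive of (score, variant) pairs: Darwin-style open-ended search,
    # not pure hill climbing, so weaker ancestors stay reachable.
    archive = [(evaluate(seed), seed)]
    for _ in range(generations):
        # Tournament selection: pick the best of a small random sample.
        _, parent = max(random.sample(archive, k=min(3, len(archive))))
        child = propose_self_edit(parent)          # Gödel-style self-modification
        archive.append((evaluate(child), child))   # judged empirically, not by proof
    return max(archive)[1]

print(darwin_godel_loop("_" * len(TARGET)))
```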
“In 2026 we will no longer have to deal with call centers and phone menus. AI agents will own the space.” https://x.com/braelyn_ai/status/1927888909121507367
HOT: MiMo-VL, new 7B vision LMs by Xiaomi, surpassing GPT-4o (March) and competitive in GUI agentic + reasoning tasks ❤️🔥 not only that, but also MIT license & usable with transformers 🔥 available on @huggingface 🤗 https://x.com/mervenoyann/status/1928475979753619663
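Since the weights are MIT-licensed and transformers-compatible, loading should look roughly like any other Hugging Face vision LM. A sketch only: the hub id and the exact Auto class are assumptions to verify against the model card:

```python
# Sketch: loading MiMo-VL with Hugging Face transformers.
# The hub id below is an assumption; the model card may also specify a
# different Auto class or processor usage than AutoModelForVision2Seq.
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image

model_id = "XiaomiMiMo/MiMo-VL-7B-RL"  # assumed id, check the hub
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

image = Image.open("screenshot.png")  # e.g. a GUI screenshot for an agentic task
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "What button should I click to open settings?"},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(out[0], skip_special_tokens=True))
```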
🎉 After 2 years in production serving millions of requests, we’re open sourcing Chatterbox – our state-of-the-art TTS model that just beat ElevenLabs in blind evaluations. In recent testing, 63.75% of listeners preferred Chatterbox over ElevenLabs. Not only is it free and open… https://x.com/resembleai/status/1927755087620796668
The team at @podonos did a subjective evaluation where they found that Chatterbox outperforms other proprietary models like ElevenLabs. https://x.com/resembleai/status/1927755092507144348
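For reference, generation with the open-sourced model is a few lines. A sketch based on the `chatterbox-tts` README at release time (the API may have moved since):

```python
# Sketch per the chatterbox-tts README at open-sourcing; verify against
# the current repo before relying on these names.
import torchaudio
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")  # or "cpu"
wav = model.generate("Chatterbox just went open source.")
torchaudio.save("chatterbox-demo.wav", wav, model.sr)  # model.sr = sample rate
```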
🚨 Wow. MOSTLY AI launches $100K synthetic data competition: two challenges, $50K each, pushing privacy-safe data sharing. @mostly_ai 🧵1/n The competition awards $50K per track to the teams whose synthetic data best balances realism and privacy. They’re spending large sums to… https://x.com/rohanpaul_ai/status/1929527535445889063
I do like these sorts of tests but wish they made it clearer that they represent the minimum bounds of AI. They are representative of what a naive user would experience (which is important!) but not what you could do with Gemini 2.5 Pro or o3. Backward-, not forward-looking. https://x.com/emollick/status/1930346576125288868
IBM Unveils watsonx AI Labs: The Ultimate Accelerator for AI Builders, Startups and Enterprises in New York City https://newsroom.ibm.com/2025-06-02-ibm-unveils-watsonx-ai-labs-the-ultimate-accelerator-for-ai-builders,-startups-and-enterprises-in-new-york-city
After over a year of saying I need to do an evals conference, we finally have the speakers (and practitioners who lead these evals at work instead of trying to sell you on their evals) to do a dedicated evals track for the first time ever! Every AI engineer serious enough about… https://x.com/i/web/status/1929609793104499152
Large language models are proficient in solving and creating emotional intelligence tests | Communications Psychology https://www.nature.com/articles/s44271-025-00258-x
🚀 DeepSeek-R1-0528 is here! 🔹 Improved benchmark performance 🔹 Enhanced front-end capabilities 🔹 Reduced hallucinations 🔹 Supports JSON output & function calling ✅ Try it now: https://x.com/deepseek_ai/status/1928061589107900779
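The JSON output and function calling land through DeepSeek’s OpenAI-compatible API. A minimal sketch with the `openai` SDK; the base URL and model name follow DeepSeek’s docs at the time and should be verified:

```python
# Sketch: requesting structured JSON from DeepSeek-R1-0528 via the
# OpenAI-compatible endpoint. Base URL and model name are as documented
# by DeepSeek at the time of writing; treat both as assumptions.
from openai import OpenAI

client = OpenAI(api_key="YOUR_DEEPSEEK_KEY", base_url="https://api.deepseek.com")

resp = client.chat.completions.create(
    model="deepseek-reasoner",  # assumed to map to R1-0528; verify current name
    messages=[
        {"role": "system", "content": "Reply in JSON with keys 'answer' and 'confidence'."},
        {"role": "user", "content": "Is 9.11 larger than 9.9?"},
    ],
    response_format={"type": "json_object"},  # the new structured-output mode
)
print(resp.choices[0].message.content)
```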
DeepSeek has released DeepSeek-R1-0528, an updated version of DeepSeek-R1. How does the new model stack up in benchmarks? We ran our own evaluations on a suite of math, science, and coding benchmarks. Full results in thread! https://x.com/EpochAIResearch/status/1928489524616630483
New DeepSeek just dropped. Proud to serve the fastest DeepSeek R1 0528 inference on OpenRouter (#1 on TTFT and TPS, i.e. time to first token and tokens per second) with our Model APIs. https://x.com/basetenco/status/1928195639822700898
The DeepSeek-R1-0528 model card just dropped. Up 17.5 points on the AIME 2025 test. https://x.com/fdaudens/status/1928055679182352461
Today’s open weights frontier is led by DeepSeek (both reasoning and non-reasoning models) https://x.com/ArtificialAnlys/status/1928477951365939328
We made dynamic 1-bit quants for DeepSeek-R1-0528 – 74% smaller (713GB → 185GB). Use the magic incantation -ot ".ffn_.*_exps.=CPU" to offload the MoE layers to RAM, allowing the non-MoE layers to fit in <24GB VRAM at 16K context! The rest sits in RAM & disk. Quants here: https://x.com/danielhanchen/status/1928278088951157116
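Spelled out, that incantation is llama.cpp’s tensor-override flag: a regex that pins the MoE expert tensors to CPU RAM while everything else goes to the GPU. A sketch of a full invocation, with the GGUF filename as a placeholder:

```python
# Sketch: launching llama.cpp with a 1-bit quant and the MoE offload
# pattern from the tweet. The GGUF filename is a placeholder path.
import subprocess

subprocess.run([
    "./llama-cli",
    "-m", "DeepSeek-R1-0528-IQ1_S.gguf",  # placeholder: the 185GB quant file
    "-ot", ".ffn_.*_exps.=CPU",            # regex: keep MoE expert FFN tensors in RAM
    "-ngl", "99",                          # remaining (non-MoE) layers on GPU, <24GB VRAM
    "-c", "16384",                         # 16K context, as in the tweet
    "-p", "Explain mixture-of-experts offloading in one paragraph.",
])
```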
On GPQA Diamond, a set of PhD-level multiple-choice science questions, DeepSeek-R1-0528 scores 76% (±2%), outperforming the previous R1’s 72% (±3%). This is generally competitive with other frontier models, but below Gemini 2.5 Pro’s 84% (±3%). https://x.com/EpochAIResearch/status/1928489527204589680
DeepSeek R1 05-28 LiveBench results: – 8th Overall, ahead of o4-mini, Gemini 2.5 Flash Preview and Qwen3-235B-A22B (its biggest competitors) – 1st on Data Analysis !!! – 3rd on Reasoning !! – 4th on Mathematics ! – 11th on Language – 20th on Instruction Following – 23rd on… https://x.com/scaling01/status/1928173385399308639
sharing negative results >>> 20k Google Scholar citations https://x.com/i/web/status/1929449598524821509
Releasing our Q2 2025 State of AI – China Report 🇨🇳: Chinese AI labs have achieved close to parity with US labs, led by DeepSeek’s leap to world #2 in intelligence and backed by a deep ecosystem of 10+ players Key findings from our analysis: 🇨🇳 The Chinese AI Ecosystem has depth… https://x.com/ArtificialAnlys/status/1928477941715079175
Predicting and explaining AI model performance: A new approach to evaluation – Microsoft Research https://www.microsoft.com/en-us/research/blog/predicting-and-explaining-ai-model-performance-a-new-approach-to-evaluation/
The latest mlx-lm has a new dynamic quantization method (made with @angeloskath). It consistently results in better model quality with no increase in size. Some perplexity results (lower is better) for a few Qwen3 base models: https://x.com/i/web/status/1929633379504493048
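For context, quantizing with mlx-lm’s Python API looks like the sketch below. Standard 4-bit conversion is shown, since I have not verified the exact option that selects the new dynamic method; the hub id is illustrative:

```python
# Sketch: quantizing a model with mlx-lm's convert API. This shows the
# standard 4-bit path; the new dynamic method lands in the same convert
# flow (see the mlx-lm release notes for the exact option, unverified here).
from mlx_lm import convert

convert(
    "Qwen/Qwen3-0.6B",           # hub id is illustrative
    mlx_path="qwen3-0.6b-4bit",  # output directory for the converted weights
    quantize=True,               # 4-bit, group size 64 by default
)
```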
Shisa V2 405B: Japan’s Highest Performing LLM https://simonwillison.net/2025/Jun/3/shisa-v2/
Selecting a Model Based on Stripe Conversion – A Practical Eval for Startups https://cookbook.openai.com/examples/stripe_model_eval/selecting_a_model_based_on_stripe_conversion
The Stripe eval: How @HyperwriteAI A/B tested models and chose GPT-4.1—the one that drove the most customer purchases for them: https://x.com/i/web/status/1929632332837015833
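A toy sketch of the underlying methodology: route traffic across model arms, then compare purchase rates with a two-proportion z-test. All numbers below are invented, not Hyperwrite’s data:

```python
# Toy sketch of a Stripe-conversion model eval: compare purchase rates
# between two model arms with a two-proportion z-test. Counts are invented.
from math import sqrt
from statistics import NormalDist

def conversion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)              # pooled rate under H0
    se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))     # standard error of the difference
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))     # two-sided
    return p_a, p_b, z, p_value

# Hypothetical: arm A (e.g. GPT-4.1) vs arm B (another model), 5000 users each.
p_a, p_b, z, pval = conversion_z_test(312, 5000, 260, 5000)
print(f"arm A {p_a:.1%} vs arm B {p_b:.1%}, z={z:.2f}, p={pval:.3f}")
```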
Introducing LisanBench LisanBench is a simple, scalable, and precise benchmark designed to evaluate large language models on knowledge, forward-planning, constraint adherence, memory and attention, and long context reasoning and “stamina”. “I see possible futures, all at once…” https://x.com/scaling01/status/1928510435164037342
Models can already tell when you are grading them. 😯 Your evaluation prompt has a scent; top LLMs smell it fast. Frontier language models can already sense when they are being tested. A new 1,000-item benchmark shows top systems spot evaluation prompts almost as well as… https://x.com/rohanpaul_ai/status/1930579137581723905
This paper introduces EfficientLLM, the first large-scale benchmark evaluating efficiency techniques across the LLM lifecycle with fine-grained metrics. Methods 🔧: → The benchmark evaluates efficient attention variants, sparse Mixture-of-Experts, and attention-free… https://x.com/rohanpaul_ai/status/1929522638403297582




