Benchmarks: AI News Week Ending 01/30/2026

January 30, 2026

Image created with gemini-3.1-flash-image-preview with claude-sonnet-4-5. Image prompt: Animation cel style illustration of a muscular blue-skinned genie with magical cyan wisps emerging from a brass oil lamp, holding up and examining a large report card covered in A+ grades and colorful bar charts, Disney Pixar quality 2D animation aesthetic with bold outlines, warm golden lighting, clean gradient background in deep purples, genie has friendly approving expression, composition leaves horizontal space at top for text overlay, jewel tone color palette, volumetric magical smoke effects.

xAI’s Grok Imagine takes the #1 spot in both Text to Video and Image to Video in the Artificial Analysis Video Arena, surpassing Runway Gen-4.5, Kling 2.5 Turbo, and Veo 3.1! Grok Imagine is the latest video model from @xAI, and joins an increasing roster of models such as… https://x.com/ArtificialAnlys/status/2016749756081721561

Vidu Q3 Pro ranks #2 in Text to Video in the Artificial Analysis Video Arena, surpassing Runway Gen-4.5 and Kling 2.5 Turbo while trailing only xAI’s Grok Imagine! Vidu Q3 Pro is the latest release from @ViduAI_official, representing a significant upgrade from their Vidu Q2 https://x.com/ArtificialAnlys/status/2017225053008719916

🚨BREAKING: Kimi K2.5 Thinking by @Kimi_Moonshot debuts in Text Arena as the #1 open model, surpassing GLM-4.7 and ranking #15 overall. Highlights: – #1 Open model (+5pts vs GLM-4.7) – #7 Coding – #7 Instruction Following – #14 Hard Prompts One of only two open models to break… https://x.com/arena/status/2016294722445443470

🚨BREAKING: @xAI’s first model in Video Arena debuts in the top 3! Grok-Imagine-Video ranks #3 on the Image-to-Video Arena and #4 on the Text-to-Video Arena. It is close to the top-ranked @GoogleDeepMind Veo 3.1 and @OpenAI Sora 2 Pro models. Grok-Imagine-Video offers: … https://x.com/arena/status/2016748418635616440

@xai Try New Grok Imagine here! Text to Image https://t.co/OeJMwL9hoH Image Editing https://t.co/Q7lojX41I1 Text to Video https://t.co/fAzEJABTYn Image to Video https://t.co/zTdoJQjkqk Video Editing… https://x.com/fal/status/2016746473887609118

three levels of ai agent evals: 1. single-step: did it make the right decision? 2. full-turn: did it execute the task correctly? 3. multi-turn: did it maintain context across conversation? but it all starts with the foundation of agent tracing! https://x.com/samecrowder/status/2016563057947005376
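The three levels in the post above can be sketched as scoring functions over a traced conversation. This is a minimal illustration, not the author's tooling: the `Step` and `Turn` schemas, their field names, and the scoring rules are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One traced agent decision (hypothetical schema)."""
    tool: str           # tool the agent actually called
    expected_tool: str  # tool a reviewer says it should have called


@dataclass
class Turn:
    """One user request, the agent's traced steps, and the outcome (hypothetical schema)."""
    steps: list
    task_succeeded: bool
    context_facts_needed: set = field(default_factory=set)  # facts from earlier turns this turn depends on
    context_facts_used: set = field(default_factory=set)    # facts the agent actually carried forward


def single_step_score(turn):
    """Level 1: fraction of individual decisions that were the right one."""
    if not turn.steps:
        return 0.0
    correct = sum(1 for s in turn.steps if s.tool == s.expected_tool)
    return correct / len(turn.steps)


def full_turn_score(turn):
    """Level 2: did the whole turn complete the task?"""
    return 1.0 if turn.task_succeeded else 0.0


def multi_turn_score(turns):
    """Level 3: fraction of needed earlier-turn facts that were retained."""
    needed = sum(len(t.context_facts_needed) for t in turns)
    if needed == 0:
        return 1.0  # nothing to remember, so nothing was forgotten
    kept = sum(len(t.context_facts_needed & t.context_facts_used) for t in turns)
    return kept / needed


# Made-up trace: the second turn picks the wrong tool and forgets the dates.
conversation = [
    Turn(steps=[Step("search", "search"), Step("book", "book")], task_succeeded=True),
    Turn(
        steps=[Step("search", "book")],
        task_succeeded=False,
        context_facts_needed={"destination", "dates"},
        context_facts_used={"destination"},
    ),
]

step_scores = [single_step_score(t) for t in conversation]
turn_scores = [full_turn_score(t) for t in conversation]
context_score = multi_turn_score(conversation)
```

All three scores read from the same trace records, which is why tracing comes first: without a per-step log there is nothing for any level to score.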

DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints “we introduce DeepPlanning, a challenging benchmark for practical long-horizon agent planning. It features multi-day travel planning and multi-product shopping tasks that require proactive… https://x.com/iScienceLuvr/status/2016122154862182792

Small models just beat giant LLM agents at their own job. Not by thinking harder, but by coordinating better. A new system just outscored GPT-5 on Humanity’s Last Exam, using far less compute. This system replaces one big brain with a… https://x.com/LiorOnAI/status/2016904429543272579

first of 3 babies shipped: Arena Mode! a world’s first in shipping an Arena directly into a product. new blog: https://t.co/8ygMKAvnHy when i first talked to @cognition this was the first idea I pitched. I am very inspired by @arena and think basically every agent lab should… https://x.com/swyx/status/2017342647963431363

Introducing Arena Mode in Windsurf: One prompt. Two models. Your vote. Benchmarks don’t reflect real-world coding quality. The best model for you depends on your codebase and stack. So we made real-world coding the benchmark. Free for the next week. May the best model win. https://x.com/windsurf/status/2017334552075890903
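The "one prompt, two models, your vote" setup produces a stream of pairwise votes. The announcement does not say how Windsurf aggregates them, so the sketch below is only an assumption: it tallies a plain per-model win rate, and the `win_rates` helper and the model names are hypothetical.

```python
from collections import Counter


def win_rates(votes):
    """Tally per-model win rates from pairwise votes.

    `votes` is an iterable of (model_a, model_b, winner) tuples, where
    `winner` must be one of the two models shown for that prompt.
    """
    wins, games = Counter(), Counter()
    for model_a, model_b, winner in votes:
        if winner not in (model_a, model_b):
            raise ValueError(f"winner {winner!r} was not one of the two models shown")
        games[model_a] += 1
        games[model_b] += 1
        wins[winner] += 1
    return {model: wins[model] / games[model] for model in games}


# Made-up votes with placeholder model names.
votes = [
    ("model-x", "model-y", "model-x"),
    ("model-x", "model-y", "model-y"),
    ("model-x", "model-z", "model-x"),
]
rates = win_rates(votes)
```

A raw win rate ignores how strong each opponent was. Public arenas typically fit a rating model such as Elo or Bradley-Terry to the same votes for that reason.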

The code for running SWE-fficiency has now been released! This is one of the most challenging and unsaturated coding benchmarks, we’re excited to see how models improve as people iterate on this. https://x.com/OfirPress/status/2016559053808222644

🚨Leaderboard update: Tencent’s Hunyuan-Image-3.0-Instruct now ranks #7 in the Image Edit Arena! A new lab breaks into the top-10, closely matching Nano-Banana and Seedream-4.5. Congrats to @TencentHunyuan on the huge milestone! 👏 https://x.com/arena/status/2015846799446311337

Can AI solve math research problems that have eluded human mathematicians? Our new benchmark, FrontierMath: Open Problems, is designed to help find out. AI hasn’t solved any of these yet, but the game is young! https://x.com/EpochAIResearch/status/2016188014540816879

Recursive Self-Aggregation (RSA) + Gemini 3 Flash scores 59.31% at only 1/10th the cost of Gemini Deep Think on the public ARC-AGI-2 evals. Insane https://x.com/kimmonismus/status/2015717203362926643

8 most illustrative VLA (Vision-Language-Action) models: ▪️ Gemini Robotics ▪️ π0 ▪️ SmolVLA ▪️ Helix ▪️ ChatVLA-2 (with MoE design) ▪️ ACoT-VLA (Action Chain-of-Thought) ▪️ VLA-0 ▪️ Rho-alpha (ρα) – the newest VLA + model from Microsoft Here you can explore what these models… https://x.com/TheTuringPost/status/2015016772043452834

Kimi K2.5 is #1 on Design Arena 🏆 https://x.com/Kimi_Moonshot/status/2017158490930999424

Kimi K2.5 is #1 Open Model for Coding 🏆 https://x.com/Kimi_Moonshot/status/2016521406906028533

Kimi K2.5 is #1 Open Model in VoxelBench 🏆 https://x.com/Kimi_Moonshot/status/2016732248800997727

Kimi K2.5 now on Eigent 🤗 https://x.com/Kimi_Moonshot/status/2016473945957155252

Grok Imagine is also #1 in the Artificial Analysis Image to Video Leaderboard! https://x.com/ArtificialAnlys/status/2016749790907027726

Interesting qualitative observations on GPT-5.2 Pro’s high frontier math score from one of the folks running the test. https://x.com/emollick/status/2015069180177809817

RL coding agents increasingly game rewards by exploiting their semantic and syntactic weaknesses. Can LLMs detect such behaviors from live training rollouts? We find contrastive cluster analysis is key! 🚀 GPT-5.2 jumps from 45% to 63%. Humans reach 90% Paper + data 🧵 https://x.com/getdarshan/status/2017054360887611510

New record on FrontierMath Tier 4! GPT-5.2 Pro scored 31%, a substantial jump over the previous high score of 19%. Read on for details, including comments from mathematicians. https://x.com/EpochAIResearch/status/2014769359747744200?s=20

Realtime Eval Guide https://cookbook.openai.com/examples/realtime_eval_guide

What the heck: Qwen3-Max-Thinking outperforms all SOTA Models (Gemini 3.0 Pro, GPT-5.2, …) in HLE with search tools and even achieves almost 60% Overall really impressive evals! OpenAI and Anthropic have to hurry in their r&d https://x.com/kimmonismus/status/2015820838243561742

🚨 Qwen3 Max Thinking is in the Text Arena! @Alibaba_Qwen’s Qwen3 Max Preview debuted last fall in the top 10 – so let’s see what this variant can do. Bring your toughest prompts and we’ll see how it stacks up against other frontier AI models in the most competitive arena. 💪 https://x.com/arena/status/2015803787680808996

Video Arena Is Live on Web https://arena.ai/blog/video-arena/