Image created with gemini-2.5-flash-image; prompt written with claude-sonnet-4-5. Image prompt: A vintage wooden sports scoreboard on a suburban lawn at dusk in autumn, displaying AI model names and numerical scores on flip-panel displays, surrounded by Halloween decorations including jack-o-lanterns and orange lights, fallen leaves covering the grass, warm nostalgic 80s suburban atmosphere with houses visible in background, cinematic composition
🎉 Congrats to @Kimi_Moonshot! vLLM Day-0 model support expands! Now supporting Kimi Linear — hybrid linear attention with Kimi Delta Attention (KDA): – RULER 128k context: 84.3 score with a 3.98× speedup – Up to 6× faster decoding & 6.3× faster TPOT (1M tokens) – 75% KV cache reduction 💡 https://x.com/vllm_project/status/1983941708233765149
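(Aside: the 75% KV-cache reduction comes from linear attention replacing the ever-growing key/value cache with a fixed-size recurrent state. Below is a minimal numpy sketch of a gated delta-rule decode step, the family KDA belongs to; it is purely illustrative, with made-up dimensions and gate values, not Moonshot's actual kernel, which uses channel-wise gates and chunked compute per the FLA repo.)

```python
import numpy as np

d_k, d_v = 64, 64          # head dimensions (placeholder values)
S = np.zeros((d_k, d_v))   # recurrent state: fixed size, independent of sequence length

def decode_step(S, k, v, q, alpha, beta):
    """One token of gated delta-rule linear attention (illustrative).

    alpha: decay gate in (0, 1); beta: write strength.
    Full attention would append (k, v) to a cache that grows with every
    token; this state stays d_k x d_v forever, hence the KV-cache savings.
    """
    S = alpha * S + beta * np.outer(k, v - k @ S)  # decay, then delta-rule write
    o = q @ S                                      # read-out for the current query
    return S, o

rng = np.random.default_rng(0)
for _ in range(5):  # decode loop: constant memory per step
    k, v, q = (rng.standard_normal(d) for d in (d_k, d_v, d_k))
    k /= np.linalg.norm(k)
    S, o = decode_step(S, k, v, q, alpha=0.99, beta=0.5)
```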
🔥 Inside Kimi Linear: First-Hand Insights @Kimi_Moonshot just dropped something impressive again. @yzhang_cs from Kimi AI Infra shared an insider’s look at the making of Kimi Linear — an architecture designed around hybrid linear attention and optimized for efficiency. https://x.com/ZhihuFrontier/status/1984321210055082207
Kimi just released another “next-gen” model that reduces memory usage by up to 75%, while achieving up to 6.3× higher decoding throughput and outperforming MLA and GDN baselines https://x.com/scaling01/status/1983926811051384965
UNIVERSAL MUSIC GROUP AND UDIO ANNOUNCE UDIO’S FIRST STRATEGIC AGREEMENTS FOR NEW LICENSED AI MUSIC CREATION PLATFORM – UMG https://www.universalmusic.com/universal-music-group-and-udio-announce-udios-first-strategic-agreements-for-new-licensed-ai-music-creation-platform/
Earlier this month, we updated GPT-5 with the help of 170+ mental health experts to improve how ChatGPT responds in sensitive moments—reducing the cases where it falls short by 65-80%. https://x.com/OpenAI/status/1982858555805118665
From this new post by OpenAI: 0.15% of users (something like 900k people given public numbers) show signs of suicidal intent in their ChatGPT chats each week But there seems to be progress in making ChatGPT respond appropriately to mental health issues. https://x.com/emollick/status/1983034815281500218
Strengthening ChatGPT’s responses in sensitive conversations | OpenAI https://openai.com/index/strengthening-chatgpt-responses-in-sensitive-conversations/
We’re very focused on making GPT-5 safer, and continue to make a lot of progress: https://x.com/fidjissimo/status/1982856666057220330
We used our new capabilities index, the ECI, to measure the gap between open- and closed-weight models. The result? This gap is smaller than previously estimated. On average, it takes 3.5 months for an open-weight model to catch up with closed-source SOTA. https://x.com/EpochAIResearch/status/1983987212183335097
Today we’re releasing SWE-1.5, our fast agent model. It achieves near-SOTA coding performance while setting a new standard for speed. Now available in @windsurf. https://x.com/cognition/status/1983662836896448756
Today, @cognition released SWE-1.5 – the world’s fastest coding agent, powered by Cerebras. SWE-1.5 achieves frontier-level coding ability, comparable to Sonnet 4.5 and surpassing GPT-5. Cerebras and Cognition engineers worked hand in hand over the past few weeks, training a … https://x.com/cerebras/status/1983695672454074794
A continuing issue in studying LLMs for medicine is the fact that everyone is testing different things with different standards. This (interesting) paper is about agentic systems powered by DeepSeek-V3.2. Other papers look at single LLMs. Tons of different benchmarks. Confusion. https://x.com/emollick/status/1982630126065201636
✨ At vLLM, we strive for correctness, reliability, and open collaboration — every detail matters. Together with @Kimi_Moonshot, we verified Kimi K2’s tool-calling accuracy on vLLM using the latest K2-Vendor-Verifier benchmark. Our debugging uncovered 3 key compatibility issues. https://x.com/vllm_project/status/1983115488982122929
Kimi For Coding: Exclusive Add-on to Your VIP Plan! We’ve added Kimi For Coding as a powerful add-on built right on top of your current subscription perks. Extra value, no extra cost. More details 👉 https://x.com/Kimi_Moonshot/status/1984207737673359441
Kimi K2vv updated! We’ve added case-by-case statistics for ToolCall-Trigger Similarity and ToolCall-Schema Accuracy. Feedback is welcome! https://x.com/Kimi_Moonshot/status/1983082003731042637
Kimi K2vv updated! We’ve added case-by-case statistics for ToolCall-Trigger Similarity and ToolCall-Schema Accuracy. The infra team also listed some suggestions for vendors; looks like enforcer is important. https://x.com/crystalsssup/status/1983126339399102756
Kimi Linear Tech Report is dropped! 🚀 https://x.com/Kimi_Moonshot/status/1983937694360322136
My favorite part: > “Scaling Ladder” is a Kimi tradition for scaling models. We start from something small (say, 1B active parameters) and gradually aim to beat the baseline on benchmarks, while also monitoring the corresponding “internals.” Only after clearing each gate at each … https://x.com/eliebakouch/status/1984291165860958614
Thankfully, there’s a really nice glossary in the Kimi Delta Attention paper that covers most of the notable variants https://x.com/nrehiew_/status/1983891931823505518
There is a lot of work behind Kimi Linear. We rethought efficient and expressive linear attention from the infra side: we arrived at the attention-matrix formulation first, and only then the recurrence. Don’t wait to check out the KDA kernel in the FLA repo. We have much more work to open up. https://x.com/uniartisan/status/1983941443283775780
You are also welcome to share your suggestions and feedback for our Kimi CLI on GitHub. https://x.com/Kimi_Moonshot/status/1984207741037252751
Introducing Kimi CLI Technical Preview & Kimi For Coding! Kimi CLI powers your terminal: – Shell-like UI + shell command execution – Seamless Zsh integration – MCP support – Agent Client Protocol (now compatible with @zeddotdev) More features incoming! https://x.com/Kimi_Moonshot/status/1984207733177090274
Claude, GPT-5, Gemini, and Kimi: “write me a horror story done entirely in the dedications to six books (you can give me the title and author of each book as well)” ChatGPT and Claude did well in different ways. Kimi did the usual (sounds good but meaning falls apart). https://x.com/emollick/status/1982279778859151783
Many people are confused by MiniMax’s recent return to full attention – especially since it was the first large-scale pivot toward hybrid linear attention – and by Kimi’s later adoption of hybrid linear variants (as well as earlier attempts by Qwen3-Next or Qwen3.5). I actually … https://x.com/SonglinYang4/status/1984021551914926514
The Metaculus forecast for the arrival of the first AGI has moved out ~3 years since its all-time low set in February this year. Now at May 2033. The ‘weak’ AGI standard, without robotic manipulation, is up from Sep 2026 to Oct 2027. (HT @ben_j_todd) https://x.com/robertwiblin/status/1983110134499860874
The Secrets of Claude Code From the Engineers Who Built It – YouTube https://www.youtube.com/watch?v=IDSAMqip6ms
🤖LangSmith Agent Builder Today, we’re releasing our first no-code agent builder experience to let anyone build agents. It’s essentially “general purpose claude code in the UI”. Comes with built-in memory support, allowing it to learn and adapt with you over time. It is NOT a … https://x.com/hwchase17/status/1983584242241294423
17/ @dani_avila7 demonstrated creating Claude Code skills. It involves running a command and describing desired capabilities for reusable tools. https://x.com/AtomSilverman/status/1981855900165148685
8/ Using Skills with the Claude Agent SDK. Here’s an example demo @trq212 built using one of our premade skills to make an Excel demo agent. https://x.com/AtomSilverman/status/1981855870184271989
Claude Skills, anywhere: making them first-class in Codex CLI https://www.robert-glaser.de/claude-skills-in-codex-cli/
We just added thinking block preservation in the Claude API. You can now control how thinking blocks are managed in your context window, resulting in more cache hits and lower costs. https://x.com/alexalbert__/status/1983597775293177952
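(Context: with extended thinking enabled, responses include thinking blocks alongside text, and passing those blocks back unchanged on later turns is what lets the prefix cache hit. Here is a hedged sketch with the Anthropic Python SDK; the model id and token budgets are placeholder assumptions, and the docs link below has the exact preservation controls.)

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

messages = [{"role": "user", "content": "Plan a schema migration."}]

resp = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model id
    max_tokens=2048,
    thinking={"type": "enabled", "budget_tokens": 1024},
    messages=messages,
)

# Append the assistant turn *including* its thinking blocks instead of
# flattening it to text -- keeping the blocks intact in context is what
# yields the extra cache hits.
messages.append({"role": "assistant", "content": resp.content})
messages.append({"role": "user", "content": "Now write the rollback plan."})

resp2 = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=2048,
    thinking={"type": "enabled", "budget_tokens": 1024},
    messages=messages,
)
```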
We’re open-sourcing MiniMax M2 — Agent & Code Native, at 8% of Claude Sonnet’s price, ~2x faster ⚡ Global FREE for a limited time via MiniMax Agent & API – Advanced Coding Capability: Engineered for end-to-end developer workflows. Strong capability on a wide range of applications https://x.com/MiniMax__AI/status/1982674798649160175
we’ve been challenging ourselves: “does the world need one more code-cli?” “how can we catch up with claude-code?” maybe the answer is NO, but at least we have one place to share our understanding of coding agents, and good changes will gradually happen. it’s just a beginning. https://x.com/bigeagle_xd/status/1984217403023380802
Available in public beta today on the Claude API and on Google Cloud’s Vertex AI, with Amazon Bedrock coming soon. Docs here: https://x.com/alexalbert__/status/1983597787305697745
22/ @alexalbert__ introduced Skills in Claude. It packages knowledge for on-demand loading to handle complex agent tasks efficiently. https://x.com/AtomSilverman/status/1981855917131121028
21/ @alexalbert__ hosted convo with @ErikSchluntz on agents. It covers Claude’s strengths, skill tips, subagents, and future developments. https://x.com/AtomSilverman/status/1981855913977024524
Had a great time at @GitHub Universe announcing Agent HQ. @Claudeai will soon become a native collaborator in GitHub, able to pick up issues, create branches, commit code, and work alongside you—all powered by the Claude Agent SDK and deeply integrated with GitHub’s platform. https://x.com/mikeyk/status/1983213185332326434
MiniMax M2 + Claude Code on KingBench Agentic Evaluations: It now scores #2 on my Agentic Evaluations beating GLM-4.6 by a wide margin. It seems to work much better with Claude Code’s Tools. Really great model and it’s my daily driver now. I haven’t tested GLM with CC yet. https://x.com/aicodeking/status/1983934597353402797
Stress-testing model specs reveals character differences among language models https://alignment.anthropic.com/2025/stress-testing-model-specs/
We reviewed Anthropic’s unredacted report and agreed with its assessment of sabotage risks. We want to highlight the greater access and transparency into its redactions that Anthropic provided, which represents a major improvement in how developers engage with external reviewers. Reflections: 🧵 https://x.com/METR_Evals/status/1983248509752213526
It looks like AI music is following the same path as AI text: 1) It appears to have passed the Turing Test: people are only 50/50 at identifying older Suno songs vs. human songs (but 60/40 when the two songs are the same genre) 2) Development is similarly fast: new models are improving quickly. https://x.com/emollick/status/1981501021320053020
A new eval, Remote Labor Index, measures AI’s ability to automate real-world, economically valuable projects from remote work platforms. Currently entirely unsaturated (max score of 2.5%) A great collaboration between @scale_AI and @ai_risks! https://x.com/alexandr_wang/status/1983651538947162409
Individual evaluation scores, all run independently like-for-like by Artificial Analysis https://x.com/ArtificialAnlys/status/1982714164310315185
Remote Labor Index https://www.remotelabor.ai/
We’ve launched a new tool to track AI progress! The tool addresses one of the field’s biggest challenges: benchmark saturation. It’s called the Epoch Capabilities Index (ECI) — here’s what makes it different: https://x.com/EpochAIResearch/status/1982888284436218275
Why use LLM-as-a-judge when you can get the same performance 15–500x cheaper? Our new research with @RakutenGroup on PII detection finds that SAE probes: – transfer from synthetic to real data better than normal probes – match GPT-5 Mini performance at 1/15 the cost (1/6) https://x.com/GoodfireAI/status/1983549685517492234
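(For intuition on the cost gap: the detector is just a linear classifier over sparse-autoencoder feature activations, so inference is one cheap matvec per example instead of an LLM judge call. A generic sketch with synthetic stand-in activations and sklearn; this illustrates probing in general, not Goodfire's actual pipeline.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Stand-in for SAE feature activations: rows = text spans, cols = SAE
# latents (sparse, non-negative). In the real setup these come from an
# SAE trained on the model's residual stream.
n, d = 2000, 512
X = np.maximum(rng.standard_normal((n, d)), 0) * (rng.random((n, d)) < 0.05)

# Pretend a handful of latents carry the "contains PII" signal.
w_true = np.zeros(d)
w_true[:8] = 2.0
y = (X @ w_true + 0.1 * rng.standard_normal(n)) > 0.5

probe = LogisticRegression(max_iter=1000)   # the entire "model" at inference time
probe.fit(X[:1500], y[:1500])
print("held-out F1:", f1_score(y[1500:], probe.predict(X[1500:])))
```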
Without a measure of jaggedness, it remains hard to measure anything about how “general” the intelligence of AI is. All the major benchmarks are essentially correlated; we need more uncorrelated capability dimensions to measure variance across. Anyone trying to do this? https://x.com/emollick/status/1981187440489746833
🚀We are excited to introduce the Tool Decathlon (Toolathlon), a benchmark for language agents on diverse, complex, and realistic tool use. ⭐️32 applications and 600+ tools based on real-world software environments ⭐️Execution-based, reliable evaluation ⭐️Realistic, covering … https://x.com/junxian_he/status/1983834164727312391
voyage-3-large embedding model just topped the RTEB leaderboard! It’s a big deal because it: – ranks first across 33 eval datasets – outperforms OpenAI and Cohere models – supports quantization to reduce storage costs Here’s another reason that makes this model truly superior: https://x.com/_avichawla/status/1983783708047093838
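(On the quantization bullet: storing embeddings as int8 rather than float32 cuts vector storage 4× with very little retrieval loss. A generic numpy sketch of symmetric per-vector quantization, illustrating the idea rather than Voyage's implementation.)

```python
import numpy as np

def quantize_int8(emb: np.ndarray):
    """Symmetric per-vector int8 quantization."""
    scale = np.abs(emb).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(emb / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

rng = np.random.default_rng(0)
emb = rng.standard_normal((1000, 1024)).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)   # unit-normalize, as for cosine search

q, scale = quantize_int8(emb)
recon = q.astype(np.float32) * scale

# Cosine similarity to the original survives almost unchanged:
cos = (emb * recon).sum(axis=1) / np.linalg.norm(recon, axis=1)
print("mean cosine vs. original:", cos.mean())   # typically > 0.99
print("storage: %d KB float32 -> %d KB int8" % (emb.nbytes // 1024, q.nbytes // 1024))
```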
(this is not valid) DeepSeek is the new king now. It’s gaining 125% in just 9 days, making more than GPT-5 and Gemini 2.5 Pro lost combined. DeepSeek is just a side project of a hedge fund, confirmed. https://x.com/Yuchenj_UW/status/1982658436182712750
Google’s Veo 3.1 Fast ranks #2 in Image to Video with substantial improvements over Veo 3, but we do not see a measurable increase in Text to Video quality, with Veo 3 still ranking higher than Veo 3.1. Veo 3.1 is Google’s latest update to Veo 3, bringing substantial improvements … https://x.com/ArtificialAnlys/status/1983938159839998249
⛵Marin 32B Base (mantis) is done training! It is the best open-source base model (beating OLMo 2 32B Base) and it’s even close to the best comparably-sized open-weight base models, Gemma 3 27B PT and Qwen 2.5 32B Base. Ranking across 19 benchmarks: https://x.com/percyliang/status/1983561556127567911