Image created with gemini-3.1-flash-image-preview, with the prompt drafted by claude-sonnet-4-5. Image prompt: 1980s NORAD war room with massive glowing CRT monitor displaying cascading benchmark test scores with ERROR codes and impossible values, flickering amber and blue wireframe graphics, dark silhouette of operator in foreground, red alarm lights, bold red retro sans-serif text reading BENCHMARKS at top, cinematic lighting, high contrast, foreboding atmosphere

Google’s Nano Banana 2 (Gemini 3.1 Flash Image Preview) takes #1 in Text to Image in the Artificial Analysis Image Arena at half the price of Nano Banana Pro! Nano Banana 2 is the latest Flash-tier image model from @GoogleDeepMind, succeeding the original Nano Banana (Gemini
https://x.com/ArtificialAnlys/status/2027052241019175148

Am currently putting together an article, and yeah, the SWE-Bench Verified numbers are definitely a bit sus across all models — the benchmark suggests the models are more similar than they really are. So, I went down a rabbit hole looking into SWE-Bench Verified issues… And it looks
https://x.com/rasbt/status/2026062254571913522

Devin now has full computer use capabilities and can share screen recordings. You can control desktop apps, build and QA mobile apps, and automate tedious work. Here are some examples that blew our team away: 1. Making a desktop game
https://x.com/cognition/status/1983983151157563762

For years I’ve said that the capability-reliability gap is an under-appreciated limitation of AI agents. Finally, in a new paper led by @steverab, we defined and measured it!
https://x.com/random_walker/status/2026384543700115870

Frontier models have (mostly) stopped making dumb security mistakes. But, when running for a long time, like in agentic coding or OpenClaw, even a single mistake can be fatal. How can we benchmark this? Instead of making larger and larger agentic benchmarks, we made an easier
https://x.com/jonasgeiping/status/2026714911951220888

Lots of important ideas here! “Evaluating 14 models on two complementary benchmarks, we found that nearly two years of rapid capability progress have produced only modest reliability gains… Unfortunately, AI agents are evaluated based on a single number, the average success
https://x.com/JustinBullock14/status/2026693253169336475
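
The underlying point is easy to see with a toy calculation. The sketch below is my own illustration, not the paper's methodology or data: two hypothetical agents share the same average success rate, but an "all k runs must succeed" style metric separates them where the average cannot.

```python
import numpy as np

# Toy illustration (not from the paper): two agents with identical average
# success rate but very different per-task reliability profiles.
agent_a = np.full(100, 0.80)                  # 80% chance on every task
agent_b = np.array([1.0] * 80 + [0.0] * 20)   # perfect on 80 tasks, hopeless on 20

for name, p in [("agent A", agent_a), ("agent B", agent_b)]:
    avg = p.mean()                # the single number most leaderboards report
    all_of_8 = (p ** 8).mean()    # chance a task succeeds on all 8 independent runs
    print(f"{name}: average={avg:.2f}, succeeds-8-of-8={all_of_8:.2f}")
```

Both agents report 0.80 as their headline score, yet their behavior under repeated runs is completely different.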

Many teams treat evals as a last-mile check. monday Service (https://t.co/8pFE1Aw4hH) made them a Day 0 requirement for their AI service agents. Using LangSmith, the monday service team has been able to: 🔷Achieve 8.7x faster evaluation feedback loops (from 162 seconds to 18 seconds).
https://x.com/hwchase17/status/2026095629148258440

New research from Intuit AI Research. Agent performance depends on more than just the agent. It also depends on the quality of the tool descriptions it reads. However, tool interfaces are still written for humans, not LLMs. As the number of candidate tools grows, poor
https://x.com/omarsar0/status/2026676835539628465
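
As a rough illustration of the gap the Intuit paper points at, here is a hypothetical before/after in the common JSON-schema tool-spec style used for LLM function calling. The tool names and fields are made up for the example, not taken from the paper.

```python
# A vague spec, written for a human who already knows the system.
vague_tool = {
    "name": "get_txn",
    "description": "Gets transactions.",
    "parameters": {"type": "object", "properties": {"q": {"type": "string"}}},
}

# A spec written for an LLM choosing among many candidate tools: it says what
# the tool returns, when to use it, when NOT to use it, and what each field means.
clear_tool = {
    "name": "get_transactions",
    "description": (
        "Return a customer's bank transactions. Use when the user asks about "
        "spending, charges, or account activity. Not for invoices or payroll."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "customer_id": {"type": "string", "description": "Internal customer UUID."},
            "start_date": {"type": "string", "description": "ISO 8601 date, inclusive."},
            "end_date": {"type": "string", "description": "ISO 8601 date, inclusive."},
        },
        "required": ["customer_id"],
    },
}
```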

Our new SWE-bench Multilingual leaderboard compares software engineering performance across 9 different languages as evaluated with mini-SWE-agent v2. Model rankings are significantly different between languages. Detailed stats & browsable trajectories in 🧵
https://x.com/KLieret/status/2026322986907652295

Having an agentic VLM model shade & render your 3d scene is the ultimate counterexample to the “pixels is all you need” crowd. Real-time video is powerful – it’s even a new medium. But explicit 3d is still very useful. Also this donut makes me hungry.
https://x.com/bilawalsidhu/status/2026184423004160185

Can coding agents build entire software systems from scratch? ByteDance, M-A-P, 2077AI, and leading Chinese universities present NL2Repo-Bench, a new benchmark that pushes agents to their limits. It tests if an AI can take a simple text description and autonomously design,
https://x.com/jiqizhixin/status/2025823941642621241

🆕 The End of SWE-Bench Verified (2024-2026) https://t.co/HCmogFFG8w Today @OpenAIDevs is announcing the voluntary deprecation of SWE-Bench Verified! We’re releasing a podcast + analysis in today’s post. Saturation of SWE-Bench has been a community hot topic for over a year –
https://x.com/latentspacepod/status/2026027529039990985

BREAKING: Arrow 1 by @QuiverAI ranks #1 on SVG Arena by Design Arena with an Elo of 1583. It’s the first model to ever break 1500+ on one of our leaderboards, establishing the new SOTA frontier for SVG generation. Huge congratulations to the @QuiverAI team for this remarkable
https://x.com/Designarena/status/2027066193946026200?s=20

Document OCR benchmarks are hitting a ceiling – and that’s a problem for real-world AI applications. Our latest analysis reveals why OmniDocBench, the go-to standard for document parsing evaluation, is becoming inadequate as models like GLM-OCR @Zai_org achieve 94.6% accuracy
https://x.com/llama_index/status/2026342120236396844

OmniDocBench is getting saturated. VLMs are getting increasingly better at document understanding, from OSS (DeepSeek-OCR2, GLM-OCR) to frontier (Gemini 3, Kimi 5.2, GPT-5.2). A popular benchmark to measure document understanding progress has been OmniDocBench. But we’re
https://x.com/jerryjliu0/status/2026408921385284001
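
For context on what "saturated" means here, OmniDocBench-style parsing scores are largely edit-distance based. The sketch below is a simplified stand-in for that kind of scoring (the real benchmark also evaluates tables, formulas, and reading order separately), just to show why scores cluster near 1.0 once models get the text mostly right.

```python
# Simplified normalized-edit-distance scoring for document parsing output.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def parse_score(pred_md: str, gold_md: str) -> float:
    dist = levenshtein(pred_md, gold_md)
    return 1 - dist / max(len(pred_md), len(gold_md), 1)   # 1.0 = perfect parse

print(parse_score("# Title\nHello world", "# Title\nHello, world"))
```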

please stop falling for benchmaxxing
https://x.com/scaling01/status/2026698844088549848

The First Fully General Computer Action Model | blog https://si.inc/posts/fdm1/

The rankings on AlgoTune look a bit weird to some people at first; they don’t always correlate with rankings on other coding leaderboards. This is because AlgoTune has a $1 limit per task, so cheap models sometimes do much better than smarter but more expensive models. I think this
https://x.com/OfirPress/status/2026068384589172800
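
The budget effect is easy to illustrate with back-of-the-envelope numbers. The sketch below uses made-up prices and success rates and collapses AlgoTune's actual speedup-based scoring into a simple solved/not-solved model, just to show why many cheap attempts can beat one expensive attempt under a $1 cap.

```python
# Hypothetical prices and per-attempt success rates, not real model data.
budget = 1.00  # AlgoTune's per-task spending limit

models = {
    "cheap-model":     {"cost_per_attempt": 0.05, "p_success": 0.25},
    "expensive-model": {"cost_per_attempt": 0.60, "p_success": 0.55},
}

for name, m in models.items():
    attempts = int(budget // m["cost_per_attempt"])
    p_solved = 1 - (1 - m["p_success"]) ** attempts   # at least one attempt succeeds
    print(f"{name}: {attempts} attempts under budget, P(solved) = {p_solved:.2f}")
```

With these toy numbers the cheap model gets 20 tries and almost always solves the task, while the expensive model gets one shot at 55%.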

tl;dr SWE-bench Verified is heavily contaminated for all frontier models, and many of the problems are also broken. Time to move on to harder, uncontaminated coding evals.
https://x.com/polynoamial/status/2026032321212891550

Today, we’re launching a dedicated Multi-File React leaderboard. When Code Arena first launched, we evaluated models on single-file HTML. Then we raised the bar → multi-file React apps (routing, hooks, components, state management) and now have a leaderboard to match!
https://x.com/arena/status/2027114744847720782

We just launched the SWE-bench Multilingual leaderboard! It’s a set of 300 tasks in 9 programming languages; none of these tasks were in SWE-bench Verified. State-of-the-art is 72% here, so lots of room for growth.
https://x.com/OfirPress/status/2026324248973689068

Can an agent survive as a worker in a real economy? Here is a super interesting economic benchmark for AI agents – ClawWork. It’s like a real-world labor market for LLM-based agents that evaluates them in an economic survival loop. ClawWork turns agents into AI coworkers and
https://x.com/TheTuringPost/status/2024960484378816894
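
A minimal sketch of the survival-loop idea (names and numbers are hypothetical, not ClawWork's actual harness): the agent keeps working only while its revenue covers its inference costs, and the run ends if it goes bankrupt.

```python
# Toy economic survival loop for an agent; all values are illustrative.
balance = 100.0   # starting capital in dollars

def attempt_job(step: int) -> tuple[float, float]:
    """Return (revenue_earned, inference_cost) for one unit of work (stubbed)."""
    return (12.0 if step % 3 else 0.0), 5.0   # some work is rejected and pays nothing

for step in range(1, 31):
    revenue, cost = attempt_job(step)
    balance += revenue - cost
    if balance <= 0:
        print(f"bankrupt at step {step}")
        break
else:
    print(f"survived 30 steps with ${balance:.2f}")
```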

Gemini 3.1 Pro scores 72.1% on WeirdML, up from 69.9% for Gemini 3.0. Gemini 3.1 seems to have both the highest peak performance of any model and some weird weaknesses. It uses almost 3 times the number of output tokens as 3.0; considering this, the increase
https://x.com/htihle/status/2025867003550958018

GPT-5.2-chat-latest, the newest model powering ChatGPT, is now in the Text Arena top 5! Highlights: ▪️Top 5 scoring 1478 on par with Gemini-3-Pro ▪️+40pt improvement over the GPT-5.2 model ▪️Top in key categories: Multi-Turn, Instruction-Following, Hard Prompts, Coding A strong
https://x.com/arena/status/2025966052950315340

📊Noticeable improvements with @OpenAI’s GPT-5.2-Chat-Latest vs GPT-5.2 (#5 vs #29 Overall) Where GPT-5.2-Chat-Latest gains: Text: – Coding (+13: #6 vs #19) – Hard Prompts (+21: #4 vs #25) – Instruction Following (+21: #7 vs #28) – Longer Query (+10: #14 vs #24) – English (+33:
https://x.com/arena/status/2025986008484061391

Big news today if you’re into coding evals: SWE-Bench Verified is dead!! https://t.co/SPApcuM5uW i’m not sure if @HamelHusain is tired of me tagging him but it turns out @OpenAI really did look back at their own 2024 work and then you 1) look at the CoT and 2) look at the
https://x.com/swyx/status/2026029120040137066

Code → design → code Generate design files from code, collaborate in @Figma, and implement updates all within Codex without breaking your flow.
https://x.com/OpenAIDevs/status/2027062351724527723

I experienced a very similar transition in December. However, for higher-complexity tasks (ML-related), we are still not there yet. Two days ago I had GPT-5.2-PRO-ET and DeepThink argue for hours, converge, be happy, yet they missed a very obvious math issue. Still a huge unlock
https://x.com/MParakhin/status/2027027034828902421

Introducing WebSockets in the Responses API. Built for low-latency, long-running agents with heavy tool calls.
https://x.com/OpenAIDevs/status/2026025368650690932

“The Codex app lets you go further, do more in parallel, and go deeper on the problems you care about.” — @gdb
https://x.com/OpenAIDevs/status/2024212279215198396

The standard for frontier coding evals is changing with model maturity. We now recommend reporting SWE-bench Pro and are sharing more detail on why we’re no longer reporting SWE-bench Verified as we work with the industry to establish stronger coding eval standards. SWE-bench
https://x.com/OpenAIDevs/status/2026002219909427270

uhhh WTF?! gpt-5.3-codex gets 86% on IBench, beating out all other models massively. I was NOT expecting this
https://x.com/adonis_singh/status/2026456939224510848

We expanded file input types so you can now pass docx, pptx, csv, xlsx, and more directly to the Responses API. Your agents can now pull context from real-world files and generate more accurate outputs.
https://x.com/OpenAIDevs/status/2026420817568084436
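
A hedged sketch of what passing a spreadsheet might look like, assuming the new file types use the same input_file pattern already documented for PDF inputs in the Responses API; the exact content types, the purpose string, and the model name below are assumptions, so check the official docs.

```python
from openai import OpenAI

client = OpenAI()

# Upload the file first, then reference it by id (assumed to mirror the PDF flow).
uploaded = client.files.create(file=open("q3_budget.xlsx", "rb"), purpose="user_data")

resp = client.responses.create(
    model="gpt-5.2",  # illustrative model name
    input=[{
        "role": "user",
        "content": [
            {"type": "input_file", "file_id": uploaded.id},
            {"type": "input_text", "text": "Summarize spend by department."},
        ],
    }],
)
print(resp.output_text)
```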

We tested @OpenAI’s new WebSocket connection mode for the Responses API in Cline and the early numbers are wild. Instead of resending full context every turn, WebSocket mode keeps a persistent connection and sends only incremental inputs. With 5.2 Codex results vs the standard
https://x.com/cline/status/2026031848791630033
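
To show the shape of the idea, here is a generic persistent-connection sketch using the Python websockets package. The endpoint URL and event names are placeholders of my own, not the actual Responses API WebSocket schema; the point is simply that the big context is sent once and each turn only ships a delta.

```python
import asyncio
import json

import websockets  # pip install websockets

async def incremental_session(url: str) -> None:
    # Persistent connection: send the heavy context once, then only per-turn deltas.
    async with websockets.connect(url) as ws:
        await ws.send(json.dumps({"type": "session.init",
                                  "context": "...repo files, system prompt..."}))
        for turn in ["fix the failing test", "now add a changelog entry"]:
            await ws.send(json.dumps({"type": "input.delta", "text": turn}))
            reply = json.loads(await ws.recv())
            print(reply)

asyncio.run(incremental_session("wss://example.invalid/agent"))  # placeholder endpoint
```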

What did you build with Codex this weekend?
https://x.com/OpenAIDevs/status/2025712197100589353

✨ Run it now with SGLang! Chong!
https://x.com/Alibaba_Qwen/status/2026348924433477775

📊With all the Qwen-3.5 scores out for Text, Code and Vision, let’s compare the evolution of Qwen-3.5 (397B-A17B) vs Qwen-3.0 (235B-A22B). This is a +24 rank jump in Text. Especially where Qwen-3.5 gains the most: Text: – Overall (+24: #19 vs #43) – English (+25: #21 vs #46) –
https://x.com/arena/status/2026404630297719100

🔥 Qwen 3.5 Medium Model Series FP8 weights are now open and ready for deployment! Native support for vLLM and SGLang. Check the model card for example code. ⚡️ Optimize your workflow with FP8 precision. 👇 Get the weights: Hugging Face:
https://x.com/Alibaba_Qwen/status/2026682179305275758
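
A minimal vLLM sketch for loading an FP8 checkpoint offline. The Hugging Face repo id below is a placeholder guess at the Qwen3.5 naming, so check the model card for the real id and the official example code.

```python
from vllm import LLM, SamplingParams

# Placeholder repo id; substitute the FP8 checkpoint named on the model card.
llm = LLM(model="Qwen/Qwen3.5-35B-A3B-FP8")
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Explain FP8 weight quantization in two sentences."], params)
print(outputs[0].outputs[0].text)
```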

🚩Qwen3.5 INT4 model is now available! https://t.co/rY5GrT3b60 @Alibaba_Qwen @JustinLin610
https://x.com/HaihaoShen/status/2026208062009426209

A big jump in intelligence-per-watt today: “Qwen3.5-35B-A3B now surpasses Qwen3-235B-A22B-2507”
https://x.com/awnihannun/status/2026353100144218569

Huge thanks to the @vllm_project for the Day-0 support on the Qwen3.5 Medium Series 🚀
https://x.com/Alibaba_Qwen/status/2026496673179181292

Minimax M2.5 GGUFs (from Q4 down to Q1) perform poorly overall. None of them come close to the original model. That’s very different from my Qwen3.5 GGUF evaluations, where even TQ1_0 held up well enough. Lessons: – Models aren’t equally robust, even under otherwise very good
https://x.com/bnjmn_marie/status/2027043753484021810
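
If you want to reproduce this kind of quant comparison, a quick-and-dirty version with llama-cpp-python looks roughly like the sketch below. File names are placeholders, and a real evaluation would score outputs against the original-precision model on a proper task set rather than eyeballing a few prompts.

```python
from llama_cpp import Llama

prompts = ["Write a haiku about overfitting.", "What is 17 * 24?"]
quants = ["model-Q4_K_M.gguf", "model-Q2_K.gguf", "model-TQ1_0.gguf"]  # placeholder paths

for path in quants:
    llm = Llama(model_path=path, n_ctx=4096, verbose=False)
    for p in prompts:
        out = llm(p, max_tokens=64, temperature=0.0)  # greedy, for comparability
        print(path, "→", out["choices"][0]["text"].strip()[:80])
```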

Qwen 3.5 family is here! > vision built-in, and can outperform previous VL models > designed to be more efficient > expanded support for more languages
35B (fits on 24GB+ systems): ollama run qwen3.5:35b
122B: ollama run qwen3.5:122b
397B (cloud only): ollama run
https://x.com/ollama/status/2026598944177009147

Qwen3.5-35B-A3B is now in Jan 🔥
https://x.com/Alibaba_Qwen/status/2026660582221558190

Qwen3.5-35B-A3B is now live in LM Studio 🚀
https://x.com/Alibaba_Qwen/status/2026496880285462962

Taken at face value, this is… somewhat catastrophic for MoEs, as @YouJiacheng notes. By right, a 397B-A17B ought to have a higher “power level” than a dense 27B. Also a big W for Qwen’s integrity and HLE eval quality, I guess. 397B is certainly better at memorization.
https://x.com/teortaxesTex/status/2026690994029072512

the conclusion should not be about moe vs dense, but that you can “benchmaxx” (not always a bad thing btw) HLE with tools no matter the model size. The difference between Qwen3.5-35B-A3B and Qwen3.5-397B-A17B is only 1 point
https://x.com/eliebakouch/status/2026727151978840105

The new Qwen3.5 Medium models are ready to run 🔥 GGUF support is here! Big thanks to @UnslothAI for making it happen so quickly 🚀
https://x.com/Alibaba_Qwen/status/2026497723944546395

The Qwen3.5 series maintains near-lossless accuracy under 4-bit weight and KV cache quantization. In terms of long-context efficiency: Qwen3.5-27B supports 800K+ context length; Qwen3.5-35B-A3B exceeds 1M context on consumer-grade GPUs with 32GB VRAM; Qwen3.5-122B-A10B supports
https://x.com/Alibaba_Qwen/status/2026502059479179602
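
The claim about 1M context on a 32 GB card is mostly a KV-cache arithmetic story. The sketch below uses assumed layer/head numbers (the real Qwen3.5-35B-A3B architecture may differ) just to show the 4x shrink from fp16 to 4-bit KV at a 1M-token sequence length.

```python
# Rough KV-cache sizing; layer/head/dim values are ASSUMED for illustration,
# not the real Qwen3.5-35B-A3B architecture.
layers, kv_heads, head_dim = 36, 4, 128
seq_len = 1_000_000

def kv_cache_gib(bytes_per_value: float) -> float:
    # 2x for keys and values; one entry per layer, per KV head, per position.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value / 2**30

print(f"fp16 KV cache:  {kv_cache_gib(2.0):.0f} GiB")   # well beyond a 32 GB card on its own
print(f"4-bit KV cache: {kv_cache_gib(0.5):.0f} GiB")   # 4x smaller than fp16
```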

Why do benchmarks like Peter’s “Bullshit Benchmark” or my ShizoBench matter so much, and what do strawberries have to do with it? I was very skeptical of the performance of Qwen3.5-27B on the ArtificialAnalysis leaderboard, so I’m testing the model myself a bit. Naturally I tried the
https://x.com/scaling01/status/2027110908775002312

Qwen3.5-397B-A17B is currently the #1 trending model on Hugging Face. 🏆 This flagship open-weight model is designed for high-performance inference and complex reasoning. 🚀 Try it now on Hugging Face:
https://x.com/Ali_TongyiLab/status/2026211680653611174
