Image created with gemini-3.1-flash-image-preview via claude-sonnet-4-5. Image prompt: Using the provided reference image, preserve exactly the deep midnight navy car hood, shallow depth-of-field sky background, chrome pedestal base, dramatic upward camera angle, and automotive advertisement lighting. Replace only the Mercedes star with a single chrome checkered racing flag ornament mounted on the same pedestal at realistic hood ornament scale; the flag should appear to wave gracefully upward, rendered in polished metal with alternating mirror-bright and brushed-matte squares creating a liquid mercury effect. Add bold white sans-serif text reading BENCHMARKS across the upper portion of the image as a clean headline.

BREAKING 🚨: MiniMax released MiniMax M2.7, a new self-evolving model, achieving a score of 56.22% on SWE-Bench Pro. M2.7 was used for building complex agent harnesses during its own development. Users can now access MiniMax M2.7 via APIs and MiniMax Agent.
https://x.com/testingcatalog/status/2034250919345377604#m

During the iteration process, we also realized that the model’s ability to recursively evolve its harness is equally critical. Our internal harness autonomously collects feedback, builds evaluation sets for internal tasks, and, based on this, continuously iterates on its own.
https://x.com/MiniMax_AI/status/2034315323109953605#m

Introducing MiniMax-M2.7, our first model to participate deeply in its own evolution, with an 88% win-rate vs M2.5. Production-Ready SWE: with SOTA performance on SWE-Pro (56.22%) and Terminal Bench 2 (57.0%), M2.7 reduced intervention-to-recovery time for online incidents…
https://x.com/MiniMax_AI/status/2034315320337522881#m

MiniMax Global Announces Full Year 2025 Financial Results – MiniMax News | MiniMax https://www.minimax.io/news/minimax-global-announces-full-year-2025-financial-results

Minimax M2.7 released! And it’s a big one. Highlights: Self-evolving – first model that helped build itself, running 100+ autonomous optimization loops during its own RL training (30% internal improvement). Strong coder – 56.2% on SWE-Pro (near Opus 4.6), 55.6% on VIBE-Pro, …
https://x.com/kimmonismus/status/2034269026353082422#m

MiniMax M2.7: Early Echoes of Self-Evolution – MiniMax News | MiniMax https://www.minimax.io/news/minimax-m27-en

Ramp AI Index March 2026 update https://ramp.com/velocity/ai-index-march-2026

5.3 to 5.4 is what i would have expected to warrant a jump to GPT-6
https://x.com/yacineMTB/status/2033291560217923803

A knowledge-work platform built around GPT-5.4 Pro level intelligence would be really useful. The gap between other models and what Pro can do on complex intellectual work remains stark. I would love to have access in a Codex-like platform with shared file spaces, subagents, etc
https://x.com/emollick/status/2033959257196966360

GPT-5.4 mini matters for subagents because it changes what feels worth handing off. The parent thread should hold the architecture, plan, and progress narrative. Fast subagents can explore the repo, check hypotheses, and preserve the parent thread’s limited attention.
https://x.com/nickbaumann_/status/2034134875234832540#m
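The handoff pattern that tweet describes can be sketched as a parent object that holds the plan and keeps only compact summaries returned by cheap subagents. A minimal sketch; the `run_subagent` helper and the "fast-mini" model name are hypothetical stand-ins, not a real API:

```python
from dataclasses import dataclass, field

def run_subagent(model: str, prompt: str) -> str:
    # Stub standing in for a real chat-completion API call.
    return f"[{model}] summary of: {prompt}"

@dataclass
class ParentThread:
    """Holds the architecture, plan, and progress narrative;
    delegates bounded exploration so its own context stays small."""
    plan: list[str]
    notes: list[str] = field(default_factory=list)

    def delegate(self, task: str) -> str:
        # Hand the task to a fast, cheap model tier, passing only
        # the task text, not the full parent context.
        return run_subagent(model="fast-mini", prompt=task)

parent = ParentThread(plan=["map the repo layout", "check the auth hypothesis"])
for step in parent.plan:
    # The parent keeps only compact summaries, preserving its attention budget.
    parent.notes.append(parent.delegate(step))

print(parent.notes)
```

The design point is that the subagent never sees the parent's accumulated context, so exploration cost stays bounded regardless of how long the parent thread runs.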

BrowseComp-Plus, perhaps the hardest popular deep research task, is now solved at nearly 90%… and all it took was a 150M model ✨ Thrilled to announce that Reason-ModernColBERT did it again, outperforming all models (including models 54× bigger) on all metrics.
https://x.com/antoine_chaffin/status/2034649565614272925

“a large jump in agentic” – we agree 🙌 M2.7 is a big step forward in agentic workflows, from tool use to real-world, multi-step execution. Now live on @OpenRouter 🚀
https://x.com/MiniMax_AI/status/2034356786413867182#m

🔍 Follow Zhihu contributor toyama nao, a top large model reviewer, to evaluate @MiniMax_AI MiniMax-M2.7’s capabilities in detail! ✨ 📌 Basic Info: MiniMax iterates monthly in the Agent-driven model track. As a minor version upgrade, M2.7 carries its new understanding of the…
https://x.com/ZhihuFrontier/status/2034543142234628318

DEFAULT and FREE M2.7 on @zocomputer
https://x.com/MiniMax_AI/status/2034348503347171625#m

Early testers are saying that M2.7 has big improvements in emotional intelligence and character consistency 👀
https://x.com/MiniMax_AI/status/2034528945962696948

Great to see M2.7 live on @vercel_dev 🙌 We’re seeing a real shift from simple tool use → multi-step agentic workflows running in production. M2.7 is built for exactly that.
https://x.com/MiniMax_AI/status/2034357583797178841#m

Live Stream Alert with @OpenClaw, Thursday 9 PM ET. We will share an in-depth look at MiniMax M2.7, including early developments in self-evolution and efficient solutions designed to support 100,000 OpenClaw running clusters. 🎁 MiniMax vouchers will also be distributed during…
https://x.com/MiniMax_AI/status/2034520321466978488

M2.7 is already up😎 Try it on @kilocode.
https://x.com/MiniMax_AI/status/2034339731660759097#m

M2.7 now live on @yupp_ai 🌸 Feels like a good time to build something new.
https://x.com/MiniMax_AI/status/2034328337527783857#m

M2.7 now on @opencode ⚙️ Give it a plan → it runs with it. Add the loop (check → fix → retry) and things start to feel very agentic.
https://x.com/MiniMax_AI/status/2034361282527461473#m
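The check → fix → retry loop mentioned there is a generic agentic pattern. A minimal sketch, with toy `generate`/`check`/`fix` callables standing in for real model calls:

```python
def run_with_retries(generate, check, fix, max_attempts=3):
    """Agentic loop: produce a candidate, verify it, and feed
    failures back into a fix step until it passes or we give up."""
    candidate = generate()
    for _ in range(max_attempts):
        ok, feedback = check(candidate)
        if ok:
            return candidate
        candidate = fix(candidate, feedback)
    raise RuntimeError("no passing candidate within retry budget")

# Toy example: a "program" passes the check once it contains a return.
attempts = iter(["x = 1", "def f():\n    return 1"])
result = run_with_retries(
    generate=lambda: next(attempts),
    check=lambda c: ("return" in c, "missing return statement"),
    fix=lambda c, feedback: next(attempts),  # real fix would use the feedback
)
print(result)
```

In a real harness, `check` would run tests or a linter and `fix` would re-prompt the model with the failure output; the loop structure is the same.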

Minimax 2.7 incoming!
https://x.com/kimmonismus/status/2033531736647463151

Minimax 2.7 is available in Hermes Agent through the Minimax Provider, try it today!
https://x.com/Teknium/status/2034658808870621274

MiniMax doubles in Hong Kong debut, marking yet another Chinese AI listing https://www.cnbc.com/2026/01/09/minimax-hong-kong-ipo-ai-tigers-zhipu.html

MiniMax has released MiniMax-M2.7, delivering GLM-5-level intelligence for less than one third of the cost. MiniMax-M2.7 from @MiniMax_AI scores 50 on the Artificial Analysis Intelligence Index, an 8-point improvement over MiniMax-M2.5, which was released one month ago. This is…
https://x.com/ArtificialAnlys/status/2034313314420019462#m

MiniMax launches M2.7 model on MiniMax Agent and APIs https://www.testingcatalog.com/minimax-launches-m2-7-model-on-minimax-agent-and-apis/

MiniMax M2.7 now live on @Trae_ai Excited to see what you ship. 🙌
https://x.com/MiniMax_AI/status/2034327432124350924#m

MiniMax M2.7: Early Echoes of Self-Evolution
https://x.com/MiniMax_AI/status/2034335605145182659

MiniMax M2.7 🆚 MiniMax M2.5 – “Website about recently released video games”. The release of M2.7 should be close; MiniMax M2.5 was released two days after it appeared on the Arena.
https://x.com/AiBattle_/status/2033503838284447758

MiniMax-M2.7 is now available on Ollama’s cloud, made for coding and agentic tasks 🖥️ Try it inside Claude Code: ollama launch claude --model minimax-m2.7:cloud 🦞 Use it with OpenClaw: ollama launch openclaw --model minimax-m2.7:cloud If you already have OpenClaw…
https://x.com/ollama/status/2034351916097106424#m

We’re now going to start having these really awesome benchmarks where, as you get better on the benchmark, you’re not just re-solving an exam question: you’re solving something no one has solved before you and making the world a better place.
https://x.com/OfirPress/status/2034298283774877926#m

Help us measure the progress towards AGI (specifically cognitive capabilities) by building benchmarks on @kaggle, with $200K in prizes available! Details in 🧵
https://x.com/OfficialLoganK/status/2033978254344786351

Xiaomi has released MiMo-V2-Pro, which scores 49 on the Artificial Analysis Intelligence Index, placing it between Kimi K2.5 and GLM-5. @Xiaomi’s MiMo-V2-Pro is a new reasoning model and a significant upgrade over their prior open weights release, MiMo-V2-Flash (309B total / 15B…
https://x.com/ArtificialAnlys/status/2034239267052896516#m

BullshitBench update: The new GPT-5.4 mini and nano models score quite low. This screenshot shows OpenAI models only; the full list would put GPT-5.4-mini around 40th place and nano around 70th. Again, thinking didn’t help much at all.
https://x.com/petergostev/status/2033995459522396287

GPT 5.4 is a big step for Codex – by Nathan Lambert https://www.interconnects.ai/p/gpt-54-is-a-big-step-for-codex

gpt-5.4 has ramped faster than any other model we’ve launched in the API: within a week of launch, 5T tokens per day, handling more volume than our entire API one year ago, and reaching an annualized run rate of $1B in net-new revenue. it’s a good model, try it out!
https://x.com/gdb/status/2033605419726483963

GPT-5.4 nano is also available starting today in the API.
https://x.com/OpenAI/status/2033953595637538849

GPT-5.4-mini looks really good for computer-use
https://x.com/scaling01/status/2033954794105127007

Ran a small eval today on an LM using GPT-5.2 as a judge. Model scores 10%, but paper reports it scoring 34%. I see that the paper uses GPT-5.1 as a judge; for the sake of consistency I change it. Switch to GPT-5.1 as a judge. Model now scores 43.5%… bro
https://x.com/a1zhang/status/2034059629072945251#m
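The score swings in that thread (10% → 43.5% from swapping the judge model) illustrate how sensitive LLM-as-judge evals are to judge choice. A minimal sketch with two mock judges of different leniency standing in for different judge models; the judge functions are toys, not real API calls:

```python
def evaluate(responses, judge):
    """Score a set of model responses with a judge callable that
    returns True/False; returns accuracy as a percentage."""
    verdicts = [judge(r) for r in responses]
    return 100.0 * sum(verdicts) / len(verdicts)

responses = ["answer A", "answer B", "answer C", "answer D"]

# Two judges with different leniency, standing in for swapping
# the judge model between eval runs.
strict_judge = lambda r: r.endswith("A")        # accepts 1 of 4
lenient_judge = lambda r: not r.endswith("C")   # accepts 3 of 4

score_strict = evaluate(responses, strict_judge)    # 25.0
score_lenient = evaluate(responses, lenient_judge)  # 75.0
print(score_strict, score_lenient)
```

Same responses, wildly different headline numbers, which is why reported scores are only comparable when the judge model (and judge prompt) are held fixed.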

This, but for real* Here’s METR-style graph of labor displacement from Roman aqueducts, doubling time of CDDII years. Lesson: 1) Displacing terrible work is good 2) All exponentials become s-curves in the end * I had GPT-5.4 Pro do the research, spot checks seemed accurate.
https://x.com/emollick/status/2033636278508425646
