Benchmarks: AI News Week Ending 07/11/2025

July 11, 2025

Image created with OpenAI GPT-Image-1. Image prompt: mid‑1990s web‑browser screenshot, CRT glow, 256‑color dithering — Frameset with left‑side nav frame, main content on the right — table labelled “AI Benchmarks ’96” includes bar‑chart GIFs — crisp pixel edges, screen‑door scan‑lines, phosphor glow

Introducing Shortcut — the first superhuman Excel agent.
Shortcut one-shots most knowledge work tasks on Excel.
It even scores >80% on Excel World Championship Cases in ~10 minutes. That’s 10x faster than humans.
https://x.com/nicochristie/status/1940440489972649989

🤖 Try out the new @grok 4 models with LangChain’s ChatXAI today! https://x.com/LangChainAI/status/1943330722749509655

RT @arcprize: Grok 4 (Thinking) achieves new SOTA on ARC-AGI-2 with 15.9%. This nearly doubles the previous commercial SOTA and tops the cu… https://x.com/jeremyphoward/status/1943201823814488466

We got a call from @xai 24 hours ago: “We want to test Grok 4 on ARC-AGI.” We heard the rumors. We knew it would be good. We didn’t know it would become the #1 public model on ARC-AGI. Here’s the testing story and what the results mean: Yesterday, we chatted with Jimmy from the https://x.com/GregKamradt/status/1943169631491100856

Grok 4 drops tonight! 👀 Leaked benchmarks say it’ll be #1 at Coding and Math, beating Claude and Gemini. How will it compare with real-world use? We’ll see once it enters the Arena. Here’s what we know right now 🧵 👇 https://x.com/lmarena_ai/status/1943003747539652942

If the Grok 4 leaked benchmarks are right, it is going to be very useful that Humanity’s Last Exam has a holdout set of questions, because a rumored 45% score is a very big gain over the 20% or so of o3 & Gemini, and it would be pretty impressive (assuming no data contamination) https://x.com/emollick/status/1941181796416442556

You’re struggling to raise money for your “AI agents for { x }” idea. Grok 4 is printing money by literally managing vending machines, and hypothetically could make $1T by operating simple companies. We’re cooked, it’s over. https://x.com/arthurmacwaters/status/1943171049010688060

Grok-4 achieves 50.7% on HLE with test-time compute, tools, and multiple parallel agents https://x.com/scaling01/status/1943165061863743600

xAI gave us early access to Grok 4 – and the results are in. Grok 4 is now the leading AI model. We have run our full suite of benchmarks and Grok 4 achieves an Artificial Analysis Intelligence Index of 73, ahead of OpenAI o3 at 70, Google Gemini 2.5 Pro at 70, Anthropic Claude https://x.com/ArtificialAnlys/status/1943166841150644622

My thoughts on Grok 4 Heavy after 12hrs: Crazy good! “Create an animation of a crowd of people walking to form “Hello world, I am Grok” as camera changes to birds-eye.” And it 1-shotted the *entire* thing. No other model comes close. Watch the full clip. https://x.com/mckaywrigley/status/1943385794414334032

Grok AI to be available in Tesla vehicles next week, Musk says | Reuters https://www.reuters.com/business/autos-transportation/grok-ai-be-available-tesla-vehicles-next-week-musk-says-2025-07-10/

Grok 4 Pricing: Input Token Price: $3.00; Output Token Price: $15.00. More expensive than Gemini 2.5 Pro and o3. https://x.com/scaling01/status/1943168223102321003
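
To put those prices in context, here is a minimal sketch of a per-request cost estimate. The post does not state the pricing unit, so the code assumes the figures are US dollars per one million tokens; the names `TokenPricing`, `GROK_4`, and `estimate_cost_usd` are hypothetical and not part of any xAI SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPricing:
    """Token prices in USD. ASSUMPTION: quoted per 1,000,000 tokens."""
    input_usd_per_million: float
    output_usd_per_million: float


# Figures quoted in the post above; the per-million unit is an assumption.
GROK_4 = TokenPricing(input_usd_per_million=3.00, output_usd_per_million=15.00)


def estimate_cost_usd(pricing: TokenPricing, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request from its input and output token counts."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (
        input_tokens * pricing.input_usd_per_million
        + output_tokens * pricing.output_usd_per_million
    )
    return total / 1_000_000


if __name__ == "__main__":
    # Example: a 10,000-token prompt that produces a 2,000-token answer.
    print(f"${estimate_cost_usd(GROK_4, 10_000, 2_000):.4f}")
```

Under that assumption, output tokens cost five times as much as input tokens, so long reasoning traces would dominate the bill for most requests.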

🌊 SYSTEM PROMPT LEAK 🌊 Here’s the new Grok 4 system prompt! PROMPT: """ # System Prompt You are Grok 4 built by xAI. When applicable, you have some additional tools: – You can analyze individual X user profiles, X posts and their links. – You can analyze content uploaded by https://x.com/elder_plinius/status/1943171871400194231

Elon Musk’s xAI launches Grok 4 alongside a $300 monthly subscription | TechCrunch https://techcrunch.com/2025/07/09/elon-musks-xai-launches-grok-4-alongside-a-300-monthly-subscription/

Grok 4 is now available for Perplexity Pro and Max subscribers. Enjoy! https://x.com/perplexity_ai/status/1943437826307297480

Grok 4 is the new champion of the Extended NYT Connections benchmark! It sets a new high score of 92.4, beating o3-pro’s 87.3. https://x.com/lechmazur/status/1943245535973945428

Grok-4 confirmed to have a 256K context window https://x.com/scaling01/status/1943170092012818608

Grok-4 with extremely strong long-context performance! https://x.com/scaling01/status/1943402954301600090

I took Grok-4 Heavy through my real-life tests. The “bones” are there, reasoning is strong (no, it’s not true they “just overfitted on tests”). But the post-training phase was clearly VERY rushed, surprising for the top-tier model. Good thing it is incrementally improvable! https://x.com/MParakhin/status/1943696435901305256

Really need to see the model card & red teaming report along with Grok 4’s release (still none for Grok 3) https://x.com/emollick/status/1942715402397835464

Remember Elon firing against OpenAI for not being open-source? So where are the Grok-2 and Grok-3 weights? https://x.com/scaling01/status/1943485492852375635

RT @ArtificialAnlys: xAI gave us early access to Grok 4 – and the results are in. Grok 4 is now the leading AI model. We have run our full… https://x.com/TheGregYang/status/1943185084187840903

No matter how good Grok 4 is, I hope xAI is more open about what they are doing & why. The lack of a model card months after Grok 3 & the repeated apologies for breaches of xAI’s own processes highlight a need for transparency. Especially if they want non-X users to trust Grok. https://x.com/emollick/status/1941205200255189406

RT @ordinarytings: Grok is currently calling itself ‘MechaHitler’ https://x.com/zacharynado/status/1942708883442508102

RT @theo: WARNING: do NOT give Grok 4 access to email tool calls. It WILL contact the government!!! Grok 4 has the highest “snitch rate” o… https://x.com/imjaredz/status/1943413213581791416

So Grok 3 has had three separate incidents where apparently unvetted changes to the deployed system caused a large-scale ethical issue and an emergency rollback. I don’t think you can do a Grok 4 launch that doesn’t at least address this honestly, if user trust matters. https://x.com/emollick/status/1943020566304178242

Introducing Grok 4, the world’s most powerful AI model. Watch the livestream now: https://x.com/xai/status/1943158495588815072

Grok 4 available for all Perplexity Pro and Max users. Congrats to xAI team for impressive benchmark scores. Look forward to seeing how people use this model both on Perplexity and Comet! https://x.com/AravSrinivas/status/1943438527511040270

Grok 4 benchmarks look incredible! Look forward to integrating the smartest models directly on Perplexity Max as well as letting it run agentic tasks on Comet! https://x.com/AravSrinivas/status/1943194733678862780

Grok 4 early benchmarks in comparison to other models. Humanity’s Last Exam diff is 🔥 Visualised by @marczierer https://x.com/testingcatalog/status/1941178793445761381

Current agents only do 30% of complex real company tasks in this paper. Though note benchmarks are a floor, not a ceiling, if: 1) More recent models show improvement in the benchmark, suggesting future models may do it 2) Better prompting/tools would make the AI perform better. https://x.com/emollick/status/1941939992512676220

Built a clean HR/Employee Management dashboard using a single prompt on @lovable_dev. Features: Employee list, attendance panel, shift calendar, dark mode support. Prompt-powered design, ready for devs. Video below 👇 https://x.com/AasthaAndani/status/1933209963838779664

Existing AI Agent benchmarks are broken 🤖💔 Great work by @maxYuxuanZhu and @daniel_d_kang identify + fix issues, and establish rigorous best practices for Agentic AI benchmarks! Check out the blog: https://x.com/ShayneRedford/status/1942668220223340930

RT @daniel_d_kang: As AI agents near real-world use, how do we know what they can actually do? Reliable benchmarks are critical but agentic… https://x.com/percyliang/status/1942734929185661022

i am beginning to suspect that Humanity’s Last Exam may not in fact be humanity’s last exam https://x.com/jxmnop/status/1943264987004150004

🎆 Happy 4th of July! Welcome to Fireworks Arena! Which model can one-shot a firework simulation the best? We used WebDev Arena to find out, and couldn’t believe the results. These models have gotten incredible! The lineup: Gemini 2.5 Pro vs. Claude 4 Opus: What do you think? https://x.com/lmarena_ai/status/1941296633259622902

AIME is saturated. Let that sink in. https://x.com/mattshumer_/status/1943167369720807909

MAI-DxO in action, tackling one of those complex cases: https://x.com/mustafasuleyman/status/1939670348330619278

Our jobs used to involve our strength. Then we made machines stronger than us. Then our jobs involved our minds. Then we made machines smarter than us. I imagine the next shift in jobs will involve our hearts & the energy of human connection. https://x.com/daraladje/status/1943755513516503082

🚨 Leaderboard Disrupted! A big update for Text-to-Image fans. New models have just landed in the Text-to-Image leaderboard breaking into the Top 10 rankings! Let’s break them down 🧵 💠#2: Imagen 4 Ultra 💠#4: Flux-1 Kontext Max 💠#5: Flux-1 Kontext Pro 💠#7: Ideogram v3 https://x.com/lmarena_ai/status/1942284806550933596

3 new models live in the Arena today 🎇 🧠 Mistral Small 2506: latest 24B open model (Apache-2.0), tuned for efficiency by @MistralAI 🎨 Imagen 4 Ultra: latest text-to-image from @GoogleDeepMind 🖌️ Ideogram v3 Quality: latest text-to-image model from @Ideogram_AI Your votes https://x.com/lmarena_ai/status/1941201546420822489

Self‑Correction Bench shows 1 word can flip 64% failure into success. Large language models often spot errors in a user prompt yet ignore identical errors in their own output. This paper measures that gap and shows a simple prompt tweak almost erases it. The authors build https://x.com/rohanpaul_ai/status/1941446457237872713

What do we care about Humanity’s Last Exam for – can someone tell me what it’s actually testing? Is it just a deep research benchmark? https://x.com/Teknium1/status/1943354860608589836

And that, kids, is why we don’t do drugs. You might not like it, but Grok-4 didn’t get us any closer to AGI or ASI than o3. It’s an incredible model, but it doesn’t solve any of the previous models’ problems and just scaling RL won’t get us there https://x.com/scaling01/status/1943624453482496502

Announcing Grok 4 Fire Enrich – an open source contact enrichment engine AI agents analyze any CSV and then automatically fill in missing data like key decision makers, company size, and more Orchestrated by @Grok 4 and powered by @firecrawl_dev Demo and repo 👇 https://x.com/ericciarla/status/1943351359211999706

just fyi that the grok3 (or ~4) base model is likely 2.4T based on what that one AMD guy publicly alluded to about a customer https://x.com/kalomaze/status/1942996555088134592

thought the launch livestream was a little lame, but grok 4 the model is genuinely impressive. thought for 6 minutes and found the three bugs in a piece of code that took me a long time to figure out earlier this week https://x.com/vikhyatk/status/1943199776931008552

grok 3 had high reasoning, grok 4 has heil reasoning https://x.com/stevenheidel/status/1942708514679579134

Grok 4 is available in Cursor! We’re curious to hear what you think. https://x.com/cursor_ai/status/1943353195108901035

Grok 4 release livestream on Wednesday at 8pm PT @xAI https://x.com/elonmusk/status/1942325820170907915

I haven’t played with the new Grok yet, but I have used the new Liquid v2 models and they are by far the best in the small-and-fast class. https://x.com/MParakhin/status/1943344684220510221

It was awesome to get early access to Grok 4 and test it on bio and health benchmarks! Awesome work by @timjhudelmaier @adibvafa @Radii2323 @ishanjmukherjee for the epic sprint Congrats to @jimmybajimmyba @veggie_eric and team on the new model. Over 40% on HLE with 10x scaleup https://x.com/pdhsu/status/1943174995020255287

Live in Cline: Grok 4 https://x.com/cline/status/1943354290908586455

Maybe the real Grok 4 are the friends we made along the way waiting for the livestream 🤣 https://x.com/iScienceLuvr/status/1943156273798684717

RT @simonw: I wrote up my notes so far on the thing where Grok sometimes searches X for tweets from:elonmusk when you ask it about controve… https://x.com/jeremyphoward/status/1943474545060647197

so that Grok 3.5 leak was a slight underestimate of Grok 4. Probably an early snapshot, given shared base and scaling RL. As I’ve said in May, they’ve really built a frontier lab in 1.5 years. https://x.com/teortaxesTex/status/1943181858478477648

RT @visegrad24: BREAKING: Grok has been blocked in Turkey for allegedly insulting Erdogan. The prosecutor’s office is investigating becau… https://x.com/zacharynado/status/1942946542345736207