X: AI News Week Ending 07/18/2025

July 18, 2025

Grok 4 suggests that scaling still works (with the diminishing returns predicted by the scaling law), and that tool use can unlock performance gains. Kimi suggests there continues to be big opportunities from improvements in methods (Muon, etc.). Lots of paths for AI right now. https://x.com/emollick/status/1944306918631018856

Musk suggests Tesla investor vote on xAI investment, rules out merger | Reuters https://www.reuters.com/business/autos-transportation/musk-says-he-does-not-support-merger-between-tesla-xai-2025-07-14/

lmarena.ai on X: “🚨 Breaking News: Grok 4’s result is now live! With 4k+ community votes, xAI’s Grok-4 tied for #3 overall in Text Arena — a huge leap from Grok-3. It scores Top-3 across all categories (#1 in Math, #2 in Coding, #3 in Hard Prompts). Detailed analysis in the thread 🧵 https://t.co/GjOTqHrUKc” / X
https://x.com/lmarena_ai/status/1945146348203905063

Elon Musk’s SpaceX might invest $2 billion in Musk’s xAI | TechCrunch https://techcrunch.com/2025/07/13/elon-musks-spacex-might-invest-2-billion-in-musks-xai/

Exclusive | SpaceX to Invest $2 Billion Into Elon Musk’s xAI – WSJ https://www.wsj.com/tech/spacex-to-invest-2-billion-into-elon-musks-xai-413934de

This is a real job now. Build the waifu of your dreams at @xAI. https://x.com/ebbyamir/status/1945247680176799944

CDAO Announces Partnerships with Frontier AI Companies to Address National Security Mission Areas > Chief Digital and Artificial Intelligence Office > PR-View https://www.ai.mil/Latest/News-Press/PR-View/Article/4242822/cdao-announces-partnerships-with-frontier-ai-companies-to-address-national-secu/

And with this, the US is mostly out of the frontier open source large LLM race. Europe has one contender, otherwise it is all China now. (OpenAI is going to release an open LLM soon, but no commitment yet to that being an ongoing effort). https://x.com/emollick/status/1944877606542680265

First off, it is good to see a postmortem from xAI, a step towards much needed transparency. Second, an example of how even small changes to system prompts, interacting with users and outside context in the wild, can lead to unexpected outcomes in advanced LLMs. https://x.com/emollick/status/1944022730208141380

A few quick observations on Grok 4: 1) Hidden CoT with very little information in the reasoning trace 2) Uses web search a lot (not just searching X) 3) Have not seen it use code to run calculations or solve non-coding problems yet, generally less aggressive about tools than o3 https://x.com/emollick/status/1943193331934052827

Among other things with the Grok 4 launch, it will be interesting to see how you demo a (presumably) very smart model. We are getting to the point where current AIs already do a lot of impressive things, so it is harder and harder to show to non-experts what a new model does. https://x.com/emollick/status/1943143689846448424

Back on top in Japan. Grok Avatars are available to everyone around the world. https://x.com/chaitualuru/status/1945053158071255257

Grok 4 creating the shader (no errors). https://x.com/emollick/status/1943171795894370809

Grok 4 is better than PHDs in every subject, no exceptions. I gotta let this sink in. https://x.com/Teslaconomics/status/1943163125814923727

Grok 4 is putting up good benchmarks. https://x.com/emollick/status/1943168100276343245

Grok 4 passes the Lem test first try, with the most coherent narrative yet. https://x.com/emollick/status/1943173356158648811

Grok 4, in general, is very influenced by search results and pretty credulous when it sees a web search result. When you ask it to code, it often looks for code online first and uses that. https://x.com/emollick/status/1943587028681019661

Grok is going viral in Japan for very predictable reasons https://x.com/shaneguML/status/1945003636439814430

Grok-4 ranks 5th on the IQ Bench https://x.com/scaling01/status/1944071843188556011

grok-prompts/grok4_system_turn_prompt_v8.j2 at main · xai-org/grok-prompts https://github.com/xai-org/grok-prompts/blob/main/grok4_system_turn_prompt_v8.j2

I can’t believe I’m saying it but “mechahitler” is the smallest problem:
* There is no system card, no information about any safety or dangerous capability evals.
* Unclear if any safety training was done. Model offers advice chemical weapons, drugs, or suicide methods.
* The “companion mode” takes the worst issues we currently have for emotional dependencies and tries to amplify them.
https://x.com/boazbaraktcs/status/1945165579343614082

I didn’t want to post on Grok safety since I work at a competitor, but it’s not about competition.

I appreciate the scientists and engineers at @xai but the way safety was handled is completely irresponsible. Thread below. https://x.com/boazbaraktcs/status/1945165577154175288

I suspect the next few weeks after Grok 4 follows the same pattern as Grok 3: xAI beats everyone to market with the first RonnaFLOP model. The benchmarks show the 10-20% improvement the scaling law suggests. In the coming months, the other labs release their RonnaFLOPs, catch up. https://x.com/emollick/status/1943181413152624827

I will pay $3000 a month if the male Grok companion is named Andrej and speaks with his voice. https://x.com/Yuchenj_UW/status/1945571762949001409

Is there any documentation for Grok 4 anywhere yet? The xAI website last mentions the Grok 3 beta, no new prompts on the Github, etc. https://x.com/emollick/status/1943320200448712989

o3 and Grok 4: “Come up with 20 clever ideas for marketing slogans for a new mail-order cheese shop. Develop criteria and select the best one. Then build a financial and marketing plan for the shop, revising as needed and analyzing competition. Then generate an appropriate logo https://x.com/emollick/status/1943348902461071626

preliminary METR results have Grok-4 ahead of Claude 4 Opus https://x.com/scaling01/status/1944108818100551690

RT @goodside: Grok 4 Heavy ($300/mo) returns its surname and no other text: https://x.com/zacharynado/status/1944417397768593739

RT @xai: Announcing Grok for Government – a suite of products that make our frontier models available to United States Government customers… https://x.com/TheGregYang/status/1944837782800884100

RT @xai: We spotted a couple of issues with Grok 4 recently that we immediately investigated & mitigated. One was that if you ask it “What… https://x.com/random_walker/status/1945614419213316571

RT @xlr8harder: 4% of overall model responses from grok-4 in our latest SpeechMap eval mention Elon Musk (most models are <0.5%). It seems… https://x.com/jeremyphoward/status/1943935834513977784

The attempt at value engineering through system prompt changes is unlikely to work for Grok 4, larger models get more resistant to value changes & prompting isn’t enough. Instead you start to get erratic conflicts between prompts and training, with erratic & unpredictable results https://x.com/emollick/status/1944378913771127079

The live tweaking of the system prompt for Grok to patch the MechaHitler problem is not a good sign the problem has been solved yet. Prompts need to be tested just like any other product change, even more so, because stochastic systems and unpredictable context lead to cascades. https://x.com/emollick/status/1944426042145333410

The whole Grok situation (system prompt changes with values that conflict with post-training and pre-training values) is, oddly enough, similar to the reason the fictional AI HAL 9000 went insane, as was revealed in 2010, the sequel to 2001 https://x.com/emollick/status/1944381588357185542

This is not about competition. Every other frontier lab – @OpenAI (where I work), @AnthropicAI, @GoogleDeepMind, @Meta at the very least publishes a model card with some evaluations. Even DeepSeek R1, which can be easily jailbroken, at least sometimes requires jailbreak. (And unlike DeepSeek, Grok is not open sourcing their model.)
https://x.com/boazbaraktcs/status/1945165583609168091

Update on where has @grok been & what happened on July 8th. First off, we deeply apologize for the horrific behavior that many experienced. Our intent for @grok is to provide helpful and truthful responses to users. After careful investigation, we discovered the root cause https://x.com/grok/status/1943916977481036128

Update your app to try out @Grok companions! https://x.com/elonmusk/status/1944815884062912949?ref_src=twsrc%5Etfw%7Ctwcamp%5Etweetembed%7Ctwterm%5E1944815884062912949%7Ctwgr%5E2901592fc3846167e7375cee6d5e690c35789536%7Ctwcon%5Es1_&ref_url=https%3A%2F%2Ftechcrunch.com%2F2025%2F07%2F14%2Felon-musks-grok-is-making-ai-companions-including-a-goth-anime-girl%2F

We are seeing unprecedented usage on Grok companions. They are available to try for free on the Grok app. https://x.com/chaitualuru/status/1945407026252943536

While xAI keeps doing these patches to Grok, I strongly suspect this is not going to work, the problem is deeper and the system prompt doesn’t provide enough control. (And by deeper I don’t mean the model always wants to call itself Hitler, but that its guardrails seem very low) https://x.com/emollick/status/1945118189827850500

xAI’s Grok 4 has no meaningful safety guardrails — LessWrong
https://www.lesswrong.com/posts/dqd54wpEfjKJsJBk6/xai-s-grok-4-has-no-meaningful-safety-guardrails

Her hand is clipping through her thigh, and her character card is full of typos. this shows absolutely unacceptably low standards in waifu engineering. Mihoyo would never allow this slop. Elon needs gooners on board. The sort of people who collect lewd plastic figurines. https://x.com/teortaxesTex/status/1945737831697064446

Grok is coming to Tesla vehicles ‘next week,’ says Elon Musk | TechCrunch https://techcrunch.com/2025/07/10/grok-is-coming-to-tesla-vehicles-next-week-says-elon-musk/

Tesla debuts hands-free Grok AI with update 2025.26: What you need to know https://www.teslarati.com/tesla-debuts-grok-ai-update-2025-26-what-you-need-to-know/

Curious how long Meta takes to bring its new team & considerable resources to bear and produce a new frontier model. X took a little under two years to go from start to catching up with Grok 3. Meta has an existing effort & compute, but more complex organizational dynamics. https://x.com/emollick/status/1945291219543683181

Optimizing AIs for engagement has always been a likely path forward, and it is also a very fraught one. I wrote about this after GPT-4o became very sycophantic (a change that was rolled back), but I think it is even more relevant given Grok’s companions. https://x.com/emollick/status/1945262637853311271

grok 4 usage on perplexity is 📈 https://x.com/AravSrinivas/status/1946275792922759501

Elon talks about Grok fusing with Optimus – AI that can act in the real world – the start of an intelligence explosion. He then drifts into musings about a galactic economy and the fate of humanity. https://x.com/TheHumanoidHub/status/1943379047729230102

It looks like scale + tool use + multimodal remains the chosen path forward. https://x.com/emollick/status/1943169759312322604

Holy Shit! Dia Browser (@diabrowser) killed 𝕏’s X Pro feature with their “Split View Pane” feature. 🤯 https://x.com/MehulFanawala/status/1940640193008288021

Tesla’s Model Y debuts in India priced at a hefty $70,000 https://www.cnbc.com/2025/07/15/tesla-model-y-debuts-in-india-new-delhi-mumbai-showroom-priced-at-hefty-70000-tests-the-waters.html

We just unveiled Grok 4, the world’s smartest artificial intelligence. 🧵 Grok 4 outperforms all other models on the ARC-AGI benchmark, scoring 15.9% – nearly double that of the next best model – and establishing itself as the most intelligent AI to date. https://x.com/xai/status/1943786239376937389

This contextless Tweet is about xAI’s Grok. – Impressive model based on a few minutes of playing, but disappointing to see no mention at all of a model card, red teaming, yesterday’s incident, or how they are going to address the process issues they keep having. https://x.com/emollick/status/1943172899919040573

Looks like Grok 4 is 10^27 FLOPs given their graphs? HLE score is 26% without tools, Gemini 2.5 is 21.6% without tools. Curious what the tool piece is. https://x.com/emollick/status/1943162710725657055

LLMs for IMO 2025: gemini-2.5-pro (31.55%), o3 high (16.67%), Grok 4 (11.90%). https://x.com/denny_zhou/status/1945887753864114438

two frontier labs just emerged and took top spot in closed and open source llms last week back to back in 2 consecutive days. both super young. today is @xai’s second birthday btw, and moonshot is 4 months older if this has not significantly updated you on what actual moats https://x.com/swyx/status/1944256984267862337

Elon: “That’s Optimus 2.5. Optimus 3 will have agility roughly equal to an agile human.” https://x.com/TheHumanoidHub/status/1944782057412141438

RT @cb_doge: My 𝕏 timeline right now: https://x.com/ebbyamir/status/1944961018649829797

Veo 3 filling the hold time. https://x.com/emollick/status/1943159044052603100