About This Week’s Covers

This week’s cover pays homage to the namesake of xAI’s flagship language model, Grok. The big story this week is that Grok-3 was released, and the name Grok comes from the Robert Heinlein novel “Stranger in a Strange Land”. The word means ‘to understand intuitively or by empathy’. This week’s cover is a robot remake of the original novel’s cover. I tried to have Grok make the cover, but it failed miserably. I ended up using Flux Pro. Claude 3.5 Sonnet helped me identify the key descriptors to re-create the atmosphere of the image (below).

The category covers this week were built from the entire Flux prompt, which I asked Claude and Ideogram to adapt for each particular category.

This Week By The Numbers

Total Organized Headlines: 634

This Week’s Executive Summaries

The top story this week was that xAI launched Grok-3, which claimed the top spot in the AI model rankings. This puts the lag between xAI and OpenAI at about nine months. This is similar to the gap between DeepSeek and OpenAI. I’m not an expert, but my impression is that DeepSeek got to where they are through efficiencies and reinforcement learning, whereas xAI was able to get their results by throwing massive amounts of computing power and scale at the problem (see the Colossus supercomputer project). Part of me wonders what the next models will look like when you combine these two techniques: scale and reinforcement learning. The rumor is that Anthropic is going to launch a new version of Claude in the next week or two. OpenAI and Meta also have new updates coming soon. The horse race continues!

The second big story this week is that Microsoft has developed a completely new state of matter that they plan to use for quantum computing. My takeaway here is that while other companies are chipping away—no pun intended—at quantum computing using traditional techniques, Microsoft took a step back, like Tiger Woods relearning his swing, came out with a completely new architecture, and literally created a new type of matter!

Google came out with a new AI research tool called Co-Scientist. A university researcher took 10 years of his own work and asked Co-Scientist what it thought the answer would be—without telling the system any details and using only a simple prompt. In 48 hours, Google’s system was able to recreate 10 years of research.

The New York Times has formally adopted AI in the newsroom, approving a variety of tools, including its own proprietary tool called Echo. This does not conflict with the NYT’s lawsuit against OpenAI, but it does show that AI is becoming widely embraced, even by the most traditional publishers.

Perplexity has reconfigured the Chinese open model DeepSeek and turned it into a new tool called Deep Research, which is able to research and reason at almost the same level as OpenAI’s $200/month research model.

Perplexity also retrained DeepSeek to remove its biases and censorship and has re-released it as R1 1776.

Robotics company Figure has released its own proprietary multi-modal model that allows robots to pick up objects they have never seen before and understand them simply by receiving natural language instructions. This comes only a few weeks after Figure announced that they would no longer partner with OpenAI. The new model updates the robots’ upper-body controls 200 times per second and has been tested on thousands of household items that the robots had never seen before. Impressively, the system runs on regular computer chips.

Another big name from OpenAI announced a new company after several months of silence, presumably due to non-compete agreements. The former CTO of OpenAI, Mira Murati, has founded Thinking Machines Lab.

Microsoft came out with a new system that can let any large language model understand computer interfaces by detecting everything that can be clicked on and mapping their functions back to language commands.

Google announced that Gemini will now be able to remember all chat histories and converse between chats to give continuity and personality to users who want to resume where they left off or connect multiple chats.

OpenAI created a benchmark to test models against actual programming tasks using the website Upwork. OpenAI found about 1,400 tasks that language models struggle to complete and will use this benchmark to track progress over time. The goal is to see when language models are able to achieve $1 million in productivity value across multiple projects.

ChatGPT-4o has been upgraded without being renamed and is now tied for the number one position in six major categories. Of course, this all comes at the same time as Grok and other releases, so it’s almost like following a bouncing ball with the leaderboard.

xAI’s Grok-3 Claims Top Spot in AI Model Rankings
xAI released Grok-3, which ranks first on the Chatbot Arena leaderboard with a record-breaking score of 1400. Grok-3 outperforms leading models like GPT-4o, Gemini 2 Pro, and Claude 3.5 Sonnet on key benchmarks, particularly in math, science, and coding tasks. The model was trained using 200,000 NVIDIA H100 GPUs – double the hardware used for Meta’s Llama 4. The system includes three specialized modes: Think, Big Brain, and DeepSearch, with the latter offering web search capabilities similar to recent offerings from Google and OpenAI. While Grok-3 hit impressive benchmarks, it will take a few weeks to know if Grok-3 prevails in real-world tasks. Audio input and output features are planned for the coming weeks. One takeaway that is consistent with the recent DeepSeek disruption is that the gap between closed models and open models is about 6-9 months, a relentless pace.

“Grok 3 drops tomorrow night—xAI’s billion-dollar bet on scaling. Reminder: xAI built Colossus, the world’s most powerful AI training cluster (100,000+ NVIDIA H100s in just 122 days) to train Grok 3. This comes after DeepSeek-R1 tanked the stock market by delivering a strong https://x.com/rowancheung/status/1891151253951987737

“AI NEWS: Elon Musk’s xAI just unveiled Grok-3 and ranked #1 on the Chatbot Arena. Plus, more news from Mistral’s new regional AI Saba, Ilya’s SSI, Nous Research, and a new open-source Chinese video model. Here’s what you need to know:” / X https://x.com/rowancheung/status/1891773915560583258

“Here are the benchmark numbers: Grok 3 significantly outperforms other models in its category such as Gemini 2 Pro and GPT-4o. Even Grok-3 mini shows to be competitive. https://x.com/omarsar0/status/1891706611023938046

Grok3 Launch Video / X https://x.com/i/broadcasts/1gqGvjeBljOGB

“Grok 3 release with live demo on Monday night at 8pm PT. Smartest AI on Earth.” / X https://x.com/elonmusk/status/1890958798841389499

“Grok 3 reasoning beta achieved 96 on AIME and 85 on GPQA, which is on par with the full o3. https://x.com/arankomatsuzaki/status/1891708250199839167

“Grok 3 is a new best model in the world from the @xai team! Grok 3 ranks #1 on Chatbot Arena w/a big gap, and scores impressively on pretraining and reasoning evals. congrats to @elonmusk @ibab @jimmybajimmyba @Yuhu_ai_ looking forward to more partnership on grok4 & beyond 🚀 https://x.com/alexandr_wang/status/1891714169629524126

“BREAKING: xAI announces Grok 3 Here is everything you need to know: https://x.com/omarsar0/status/1891705029083512934

“Grok 3 involved 10x more training than Grok 2! Grok finished pretraining in early January! The model is still training. https://x.com/omarsar0/status/1891705957220016403

“Elon mentioned that Grok 3 is an order of magnitude more capable than Grok 2. https://x.com/omarsar0/status/1891705031243469270

“BREAKING: @xAI early version of Grok-3 (codename “chocolate”) is now #1 in Arena! 🏆 Grok-3 is: – First-ever model to break 1400 score! – #1 across all categories, a milestone that keeps getting harder to achieve Huge congratulations to @xAI on this milestone! View thread 🧵 https://x.com/lmarena_ai/status/1891706264800936307

“Grok-3 without reasoning actually looks pretty good on these 3 cherry picked benchmarks. It’s also a good sign that they got 1400 Elo n lmsys from the get go. However, I feel like this launch was rather underwhelming. Too few benchmarks, no report and no useful demos. If it’s https://x.com/scaling01/status/1891786871304323280

“This is it: The world’s smartest AI, Grok 3, now available for free (until our servers melt). Try Grok 3 now: https://x.com/xai/status/1892400129719611567

“Grok 3 Reasoning Beta performance on AIME 2025. Grok 3 shows generalization capabilities. It not only does coding and math problem-solving, but it can also do other creative and useful real-world tasks. https://x.com/omarsar0/status/1891711110476111884

Grok 3, xAI’s New Model Family, Improves on its Predecessors, Adds Reasoning https://www.deeplearning.ai/the-batch/grok-3-xais-new-model-family-improves-on-its-predecessors-adds-reasoning/

“Reasoning models like Grok-3 reasoning beta and DeepSeek-R1 are trained using reinforcement learning with verifiable rewards, but what exactly does this mean? Verifiable tasks. One detail that we should immediately notice about reasoning models is that they are primarily used https://x.com/cwolferesearch/status/1891893034956030242

“Grok 3 also has reasoning capabilities too! The Grok team has been testing these capabilities which they have unlocked using RL. The model is good, especially in coding. https://x.com/omarsar0/status/1891707915351859547

“If the light blue part is best of N scores, this means that Grok 3 reasoning is inherently an ~o1 level model. This means the capabilities gap between OpenAI and xAI is ~9 months. Also what is the difference between “think” and “big brain” https://x.com/nrehiew_/status/1891710589115715847

Microsoft Creates New State of Matter for Quantum Computing
Microsoft has developed Majorana 1, a quantum processing unit that uses a new state of matter called topoconductors. The breakthrough enables qubits that are 100 times smaller than current versions and could accelerate the timeline for practical quantum computers from decades to years. The microscopic qubits, measuring 1/100th of a millimeter, create a path to processors containing a million quantum bits. For now, there is not much practical use, but scientifically, and for the future, this is a big deal. Microsoft decided to go the slower route and create an entirely new state of matter, as opposed to banging their heads on the bottlenecks of existing structures.

Satya Nadella on X: “A couple reflections on the quantum computing breakthrough we just announced… Most of us grew up learning there are three main types of matter that matter: solid, liquid, and gas. Today, that changed. After a nearly 20 year pursuit, we’ve created an entirely new state of https://t.co/Vp4sxMHNjc” / X https://x.com/satyanadella/status/1892242895094313420

Google AI Tool Solves Decade-Long Superbug Mystery in 48 Hours
Google’s tool called “co-scientist” identified how superbugs spread between species, matching conclusions that took Imperial College London researchers 10 years to discover. When Professor José Penadés tested the AI with a “short prompt” to match his unpublished research about bacterial resistance, it did its own research and correctly described how superbugs form virus-like tails to move between hosts. The AI also proposed four additional viable hypotheses, including one novel approach the research team is now investigating.

AI cracks superbug problem in two days that took scientists years https://www.bbc.com/news/articles/clyz6e9edy3o

“NEW: Google introduces AI co-scientist. It’s a multi-agent AI system built with Gemini 2.0 to help accelerate scientific breakthroughs. 2025 is truly the year of multi-agents! Let’s break it down: https://x.com/omarsar0/status/1892223515660579219

“Google’s new “co-scientist” solved a complex microbiology problem in 48 hours, a task that took researchers at Imperial College London a decade to complete. Professor José R. Penadés tested the AI with a hypothesis about superbug resistance, and it correctly identified the https://x.com/rohanpaul_ai/status/1892746665225826321

NY Times Embraces AI Tools to Support Newsroom Operations
The New York Times is implementing AI tools to assist journalists with tasks like editing, headline creation, and interview preparation, while maintaining editorial oversight. The company’s new internal tool, Echo, helps staff summarize articles and create social media content. While embracing AI for support functions, humans will continue handling reporting and editorial decisions. The Times is continuing its lawsuit against OpenAI and Microsoft over unauthorized use of its content for AI training.

The New York Times adopts AI tools in the newsroom | The Verge https://www.theverge.com/news/613989/new-york-times-internal-ai-tools-echo

Perplexity Uses DeepSeek to Launch AI Research Agent That Rivals OpenAI
Perplexity has unveiled Deep Research, a tool that creates detailed research reports by combining web search, analysis, and coding capabilities. The service matches OpenAI’s performance on industry benchmarks while operating significantly faster and at lower cost, leveraging DeepSeek’s open-source model. Users can generate reports on topics ranging from business incorporation guidance to investment analysis, with free users receiving 5 queries daily and Pro subscribers getting 500. Most reports are completed within 3 minutes.

“Deep Research as a small business incorporation legal consultant. Usually charged hundreds or even thousands of dollars an hour to offer this. Now free, only on Perplexity. https://x.com/AravSrinivas/status/1891563239240069245

“Perplexity Deep Research can write an investment memo like Bill Ackman. Example: writing a memo for taking a big position in $UBER. https://x.com/AravSrinivas/status/1891233048605184371

“Deep Research on Perplexity scores 21.1% on Humanity’s Last Exam, outperforming Gemini Thinking, o3-mini, o1, DeepSeek-R1, and other top models. We also have optimized Deep Research for speed. https://x.com/perplexity_ai/status/1890452359773405675

“Perplexity just announced Deep Research (PDR)! I’m now testing and comparing it with OpenAI’s Deep Research (ODR). I still think the o3 variant powering ODR is a massive advantage. 20.5% (PDR) vs. 26.6% (ODR) on Humanity’s Last Exam. https://x.com/omarsar0/status/1890525249977872640

“Perplexity Deep Research is quite close to OpenAI o3 on the Humanity Last Exam Benchmark despite being an order of magnitude faster and cheaper. This is possible because DeepSeek is open source and cheap and fast. https://x.com/AravSrinivas/status/1890486069361025040

Perplexity Removes Content Restrictions from DeepSeek AI Model, Names It R1 1776
Perplexity’s R1 1776 adapted Chinese DeepSeek’s R1 model to provide unrestricted responses while maintaining its problem-solving abilities. Testing across 1,000+ examples confirmed the model responds to sensitive topics while preserving its math and reasoning capabilities. If it lives up to the hype, the MIT-licensed model will bring the power of a frontier-model to a staggering number of third-party and personal use cases.

“Perplexity just released POST TRAINED DeepSeek R1 for factual and unbiased information – MIT Licensed 🔥 https://x.com/reach_vb/status/1891922768892989559

“Perplexity just dropped R1 1776 a version of the DeepSeek R1 model that has been post-trained to provide uncensored, unbiased, and factual information https://x.com/_akhaliq/status/1891961543455031429

“🎯 @perplexity_ai drops their FIRST open-weight model on @huggingface: A decensored DeepSeek-R1 with full reasoning capabilities. Tested on 1000+ examples for unbiased responses. https://x.com/fdaudens/status/1891949269470351833

Figure’s AI System Lets Humanoid Robots Handle New Objects Through Simple Commands
Figure has launched Helix, an AI system that enables humanoid robots to pick up unfamiliar objects and work together using natural language instructions. The system controls the robots’ upper body movements – including fingers, wrists, and torso – at 200 times per second. In testing, robots using Helix successfully handled thousands of household items they had never encountered before, from toys to glassware, by responding to basic commands like “pick up the coffee mug.” The system runs on standard computer chips and needs no additional training to perform new tasks, suggesting potential for real-world applications. This is notably just a few weeks after Figure announced that they were no longer going to partner with OpenAI on language integration, but rather build their own systems.

Helix: A Vision-Language-Action Model for Generalist Humanoid Control https://www.figure.ai/news/helix

“In our lifetime you will see more humanoid robots than humans when you’re out and about” / X https://x.com/adcock_brett/status/1889946006558744898

“. @Figure_robot just unveiled Helix, a Vision-Language-Action (VLA) model powered humanoid robots to reason and interact naturally in home environments. 📌 Unlike prior systems, Helix allows robots to pick up any household item without training, collaborate in real-time, and https://x.com/rohanpaul_ai/status/1892662504054259779

Former OpenAI CTO Launches Thinking Machines Lab, Emphasizing Open Science and AI Customization
Mira Murati, previously CTO at OpenAI, has founded Thinking Machines Lab, bringing together talent from prominent projects like ChatGPT, Character.ai, and PyTorch. The company’s broad mission is to keep AI systems open, understandable, and customizable while advancing frontier research. The team plans to build multimodal AI systems that can excel in diverse fields beyond current AI’s strengths in programming and mathematics. It’s pretty vague and mirrors a lot of what we’re hearing from Safe Superintelligence, another start-up by OpenAI co-founder Ilya Sutskever that’s in the news this week, also after several months of silence. I assume these delays correspond to non-competes or something to that effect.

“Career Update: Incredibly fortunate and excited to be part of the founding team at Thinking Machines Lab! https://x.com/dchaplot/status/1891920016339042463

Thinking Machines Lab https://thinkingmachines.ai/

“Today, we are excited to announce Thinking Machines Lab ( https://x.com/thinkymachines/status/1891919141151572094

AI Agents: Three Big AI Automation Stories to Know About
1 – Convergence’s Proxy 1.0 is built for automated web browsing and allows users to schedule and automate routine web tasks, including logging into systems and downloading files.
2 – Microsoft’s OmniParser V2 enables any LLM to understand computer interfaces by detecting all clickable elements and mapping their functions to plain language commands.
3 – Hugging Face launched an Agent Leaderboard that ranks 17 frontier models across 14 real-world benchmarks. Unfortunately, it does not include “wrapper” tools like Convergence.

“Introducing Proxy 1.0 – the world’s most capable web-browsing agent. https://x.com/convergence_ai_/status/1892129466610073931

Introducing Our Agent Leaderboard on Hugging Face – Galileo AI https://www.galileo.ai/blog/agent-leaderboard

“Microsoft just dropped OmniParser V2, looks incredible Turning Any LLM into a Computer Use Agent https://x.com/_akhaliq/status/1890546832784208080

Google Adds Infinite Chat Memory to Gemini AI
Gemini can now reference past conversations when responding to users, allowing seamless continuation of previous discussions and topic summaries. Users maintain control with options to view, edit, or delete their chat history, marking a huge step forward in AI chat understanding the larger context of each user (no pun intended). I’m hoping that OpenAI and Anthropic enable this sort of feature sooner rather than later because I find Gemini to be the worst of all of the models. You’d have to pay me to use it at this point.

“LMAO, so Google just dropped infinite memory for Gemini before OpenAI did for ChatGPT. It can now recall past conversations , you can refer to something discussed a week ago. How long has OpenAI been working on this? 😆 Note: To ask Gemini to reference past chats, you need Gemini https://x.com/ai_for_success/status/1890377941579891003

“Rolling out starting today, you can ask Gemini to consider your past chats to craft its responses. Easily pick up where you left off or have it summarize a previous topic. You can view, edit, or delete any chats you’ve had with Gemini, and see when it’s used. Try it in Gemini https://x.com/GeminiApp/status/1890137961871605863

OpenAI Creates Benchmarks to Test AI Models Against Real Programming Jobs
OpenAI has created SWE-Lancer, a benchmark testing AI models against 1,400 actual software engineering tasks from Upwork worth $1 million in client payments. The benchmark examines both coding tasks (from simple bug fixes to complex features) and project management decisions. When advanced AI models attempt these real-world challenges, they fail most of them, which leaves room for the benchmark to measure improvements over time. OpenAI has made the testing framework public with the goal of measuring AI’s practical impact on software development.

“OpenAI announces SWE-Lancer Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? https://x.com/_akhaliq/status/1891721712296747126

ChatGPT-4o Update Tops Performance Rankings Across Multiple Categories
OpenAI’s latest GPT-4o version shares the #1 position in six major testing categories including coding, creative writing, and handling complex conversations. Math remains an area for development.

“A new version of @OpenAI’s ChatGPT-4o is now live on Arena leaderboard! Currently tied for #1 in categories: 💠Overall 💠Creative Writing 💠Coding 💠Instruction Following 💠Longer Query 💠Multi-Turn This is a jump from #5 since the November update. Math continues to be an area https://x.com/lmarena_ai/status/1890477460380348916

AI Visuals and Charts: Week Ending February 21, 2025

“Here are the benchmark numbers: Grok 3 significantly outperforms other models in its category such as Gemini 2 Pro and GPT-4o. Even Grok-3 mini shows to be competitive. https://x.com/omarsar0/status/1891706611023938046

“Bro is driving around a 3D Gaussian Splat of a city https://x.com/bilawalsidhu/status/1891845261401501765

“Alibaba strikes again. Full-body swap anyone in a video with just a photo reference. What’s wild to me is that this tech completely bypasses the 3d pipeline (i.e. what Wonder Dynamics does to accomplish similar output) and yet looks so damn good. Basically viggle on steroids.” / X https://x.com/bilawalsidhu/status/1890535455600369687

Introducing Helix – YouTube https://www.youtube.com/watch?v=Z3yQHYNXPws

Flavio Adamo on X: “🚨 o3-mini crushed DeepSeek R1 🚨 “write a Python program that shows a ball bouncing inside a spinning hexagon. The ball should be affected by gravity and friction, and it must bounce off the rotating walls realistically” https://t.co/xEvPDzzbVk” / X https://x.com/flavioAd/status/1885449107436679394

“We are claiming SOTA for AI Avatar, but the ultimate test is big face. We don’t use post process or blurring hacks to hide misery. 5 videos. Same script. Generate with Argil 👇 https://x.com/BrivaelLp/status/1890435559127986463

“2. Goku+: Product and Human Interaction https://x.com/minchoi/status/1890074266244395495

https://x.com/_akhaliq/status/1890215479047754194

“I’m moved by this exhibit where a robot dog lunges at you. What is this! It’s insane!!! https://x.com/takahiroanno/status/1890350288554397709

“A Japanese expo displayed a chained robot dog that was programmed to attack anyone who came near While the specifics of the creepy Black Mirror-like robot remain unclear, the message is evident: we need to do more on the AI safety front https://x.com/adcock_brett/status/1891177603312067056

“ByteDance presents Phantom Subject-consistent video generation via cross-modal alignment https://x.com/_akhaliq/status/1892073250974216476