AI News 130: Week Ending March 27, 2026 with 83 Executive Summaries
About This Week’s Covers
This week’s cover is inspired by an actual trunk that my grandmother mailed to America after she fled to Germany during World War II. My grandmother was a single mom and a Hungarian immigrant who came to the U.S. in the 1950s.
As the Russians invaded Hungary, my grandmother ran away from Pécs with her sister and their boyfriends on foot, leaving their parents behind.
They crossed Austria into Germany, hitch-hiking through the Alps in the winter. My mom was born in Munich in 1950. While my mom was still a toddler, the International Rescue Committee sent her and my grandparents to the heart of New York City with this single trunk full of belongings. Neither of my grandparents spoke English, and everything they had was in this crate.
This crate is in my mom’s garage.
My grandmother became the first female pathologist in Delaware, worked as the medical examiner at Dover Air Force Base, and ran the lab at Milford Memorial Hospital. My grandfather moved to Illinois and ran a successful internal medicine practice for decades.
As a kid, I spent three days a week with my grandmother after school in her lab at Milford Hospital. This was back when labs were all located inside the hospital.
My grandmother in her office at Milford Memorial Hospital.
I’d watch my grandmother take biopsies from their plastic containers, slice the tissues, add the dye, and create the frozen sections. While she’d look at the slides under the microscope, I would do my homework and read a book on the floor next to her desk. Every now and then, she’d call me up to look through the microscope, and she’d show me the cancer cells and the regular cells and explain to me how to spot them.
She was making serious decisions, often consulting with her lab partner and best friend, Dr. Mona Labraton, another immigrant woman from Haiti. They used only their training and their eyes to determine whether to order orchiectomies, mastectomies, and other life-changing surgeries. She carried the weight of the world and would speak to the patients in person.
The rest of the category images were built using a Claude skill where I dictate a theme to the skill, and it generates a JSON theme which I then run through the Gemini API.
In this case, I told it to take the trunk and incorporate it into the category theme and swap out the text. My prompt in its entirety was:
The theme this week is going to be centered around an old crate. This is a crate that my grandmother mailed to herself when she fled Hungary and came to the United States in the early 1950s. We’re going to try to keep this crate as similar to the image that I’m attaching as possible. However, we want to incorporate the crate in a way that the crate is the center of the image, but the text on the crate has been changed and displays the name of the category on the crate. The image should contextualize the crate in some way that it is incorporated into the category. The scene will evoke the idea of the category without being too complicated. A simple setting that clearly evokes the idea of the category from the AI newsletter. The time of year should be the early spring. there should be a sense of hope, if any, and an idea of migration towards a better future. There’s an element of risk and potential danger, but we’re not worried about that because things are going to be okay. The image should be photorealistic.
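For readers curious about the mechanics, the pipeline (dictate a theme, the skill emits a JSON theme, each category image comes from a Gemini API call) can be sketched roughly like this. The JSON fields and the `category_prompt` helper are purely illustrative assumptions; the actual skill output and the Gemini image call are not shown here:

```python
import json

# Hypothetical shape of the JSON theme the Claude skill emits.
THEME = json.loads("""
{
  "subject": "an old wooden steamer trunk",
  "season": "early spring",
  "mood": "hope and migration toward a better future, with a hint of risk",
  "style": "photorealistic"
}
""")

def category_prompt(theme: dict, category: str) -> str:
    """Build one per-category image prompt from the shared theme.

    The trunk stays central; only the stenciled text changes to the
    category name, mirroring the workflow described above.
    """
    return (
        f"A {theme['style']} scene in {theme['season']} featuring "
        f"{theme['subject']} at the center. The text on the trunk reads "
        f"'{category.upper()}'. The setting simply and clearly evokes "
        f"the idea of '{category}'. Mood: {theme['mood']}."
    )

prompt = category_prompt(THEME, "Robotics")
```

Each generated prompt, plus the reference photo of the trunk, would then go to the Gemini image API, one call per category.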
Some of them turned out really well!
My favorite examples: the creepy mannequin hand coming out of the Figure box; the pile of historically fitting black-and-white images pouring out of the Images box; and the Inflection box at the corner of a bright, sunny road and a wet, muddy, scary road. My prompt said there should be a feeling of hope contrasted with a feeling of foreboding, and the Inflection image really captures this, especially down to the buds on the trees, as if it’s spring.
Meta is appropriately meta, including a box on the box with a box on it, and an image reflecting the word meta. Mobile has a period-appropriate phone hanging out of the box. Multimodality has the box surrounded by audiovisual and aroma, capturing the concept of multiple modes. Publishing shows books inside the crate, peeking through the slats. Retrieval-augmented generation shows an open box full of paperwork that could be searched by a model. Twitter has a small bird on top of it, with telegraph papers lying on the ground.
The Inns of Court has books inside that include the titles Torts, Contracts, Law, and Law. Alibaba has Chinese writing on the side. I’m not looking up what it says, but every time I’ve looked it up in the past, it has translated into something relevant, so I’m assuming it did well today.
ARVR has two View-Masters that look like they’re from the time period on top of the box. Autonomous has tire tracks, with the box sitting on the tire-track path, and the word autonomous broken into three parts that fit on the box as a stencil. I’m very impressed that the Claude skill was able to do so well with such an open-ended prompt.
This Week’s Humanities Reading
For this week’s humanities reading, the obvious choice would be The New Colossus by Emma Lazarus, which is a fantastic poem. I may use that for a future humanities reading.
However, for this week, I’ve chosen Ithaka by C.P. Cavafy. I love the idea that you don’t need to be afraid of your journey as long as you keep your thoughts raised high, and that you won’t encounter monsters unless you bring them along inside your soul. I also love the idea that the journey is best when it lasts a long time. It’s a really incredible poem.
Ithaka By C. P. Cavafy
As you set out for Ithaka hope your road is a long one, full of adventure, full of discovery. Laistrygonians, Cyclops, angry Poseidon—don’t be afraid of them: you’ll never find things like that on your way as long as you keep your thoughts raised high, as long as a rare excitement stirs your spirit and your body. Laistrygonians, Cyclops, wild Poseidon—you won’t encounter them unless you bring them along inside your soul, unless your soul sets them up in front of you.
Hope your road is a long one. May there be many summer mornings when, with what pleasure, what joy, you enter harbors you’re seeing for the first time; may you stop at Phoenician trading stations to buy fine things, mother of pearl and coral, amber and ebony, sensual perfume of every kind— as many sensual perfumes as you can; and may you visit many Egyptian cities to learn and go on learning from their scholars.
Keep Ithaka always in your mind. Arriving there is what you’re destined for. But don’t hurry the journey at all. Better if it lasts for years, so you’re old by the time you reach the island, wealthy with all you’ve gained on the way, not expecting Ithaka to make you rich.
Ithaka gave you the marvelous journey. Without her you wouldn’t have set out. She has nothing left to give you now.
And if you find her poor, Ithaka won’t have fooled you. Wise as you will have become, so full of experience, you’ll have understood by then what these Ithakas mean.
This week, I organized 480 links, 158 of which informed the executive summaries. I’m laying them out this week, sorted mostly by company name, in alphabetical order. If you are non-technical, just keep skimming because there’s a lot of big news that will pop out along the way…
Allen AI
MolmoPoint GUI Ai2 just released MolmoPoint GUI on Hugging Face: a specialized VLM for GUI automation that points using grounding tokens instead of coordinates, reaching 61.1 on ScreenSpotPro. https://x.com/HuggingPapers/status/2036101402477404284
Today we’re releasing MolmoWeb, an open source agent that can navigate + complete tasks in a browser on your behalf. Built on Molmo 2 in 4B & 8B sizes, it sets a new open-weight SOTA across four major web-agent benchmarks & even surpasses agents built on proprietary models. 🧵 https://x.com/allen_ai/status/2036460260936814915
Auto mode is a step change improvement in the Claude Code UX, balancing autonomy and safety. Almost everyone on our team uses this as a daily driver. Now available to Claude for Team users! `claude --enable-auto-mode` to turn on, then Shift + Tab to enter the mode https://x.com/_catwu/status/2036852880624541938
Claude Code iMessage Anthropic just casually dropped iMessage support for Claude Code like it’s a minor patch note. You can literally text your AI coder from your iPhone now. Send it a task, it builds on your Mac, texts you back when it’s done. Blue bubbles and everything. They’ve shipped Channels, https://x.com/borisvagner/status/2036889454733074517?s=12
CLI Support CLIs are super exciting precisely because they are a “legacy” technology, which means AI agents can natively and easily use them, combine them, and interact with them via the entire terminal toolkit. E.g. ask your Claude/Codex agent to install this new Polymarket CLI and ask for any https://x.com/karpathy/status/2026360908398862478?s=20
Command Center Opening a new Claude tab every time you start a task is not a workflow. It’s a cope. Command Center has quietly become my default workspace: → I build multiple features at once, and run multiple agents per feature. Claude Code, Codex, Gemini → Everything runs in parallel https://x.com/jimmykoppel/status/2036077396210728974
Computer Use Today, we’re releasing a feature that allows Claude to control your computer: Mouse, keyboard, and screen, giving it the ability to use any app. I believe this is especially useful if used with Dispatch, which allows you to remotely control Claude on your computer while you’re https://x.com/felixrieseberg/status/2036193240509235452
You can now enable Claude to use your computer to complete tasks. It opens your apps, navigates your browser, fills in spreadsheets—anything you’d do sitting at your desk. Research preview in Claude Cowork and Claude Code, macOS only. https://x.com/claudeai/status/2036195789601374705
Cowork Projects Projects are now available in Cowork. Keep your tasks and context in one place, focused on one area of work. Files and instructions stay on your computer. Import existing projects in one click, or start fresh. https://x.com/claudeai/status/2035025492617961704?s=20
ARC-AGI-3 ARC-AGI-3 is an interactive reasoning benchmark which challenges AI agents to explore novel environments, acquire goals on the fly, build adaptable world models, and learn continuously. A 100% score means AI agents can beat every game as efficiently as humans. https://arcprize.org/arc-agi/3
Announcing ARC-AGI-3 The only unsaturated agentic intelligence benchmark in the world Humans score 100%, AI <1% This human-AI gap demonstrates we do not yet have AGI Most benchmarks test what models already know, ARC-AGI-3 tests how they learn https://x.com/arcprize/status/2036860080541589529
ARC-AGI 3 is here, and all existing AI models are below 1% on the benchmark. It’s gonna take a while until this one is saturated. How it measures intelligence: – 100% human-solvable environments – Skill-acquisition efficiency over time – Long-horizon planning with sparse https://x.com/mark_k/status/2036882659406762031
ARC-AGI-3 benchmark: – 100% solvable by humans – 1% solvable by AI Everybody keep building benchmarks that agents utterly fail at! Proud this was a Laude Slingshot; will fund other benchmarks that reset SotA to 1%: https://x.com/andykonwinski/status/2036870772745261202
ARC-AGI-3 is out now! We’ve designed the benchmark to evaluate agentic intelligence via interactive reasoning environments. Beating ARC-AGI-3 will be achieved when an AI system matches or exceeds human-level action efficiency on all environments, upon seeing them for the first https://x.com/fchollet/status/2036861192619384989
ARC-AGI-3 the agentic benchmark where humans can’t beat the “human baseline” and typical agentic harnesses and tools aren’t allowed > 100% just means that all levels are solvable > the 1% number uses completely different and extremely skewed scoring based on the 2nd best https://x.com/scaling01/status/2036890367803429230
ARC-AGI-3 took me a few tries, but it is definitely human winnable. I am curious how much of the initially very low performance of frontier models is harness, vision, and tools, versus how much are limitations of LLMs. I guess we will find out! https://x.com/emollick/status/2036865990282092940
General game playing is more difficult than “AGI” (Just to be clear: I really like ARC-AGI-3 and think it’s a great contribution, but the proliferation of AGI benchmarks is IMO proof of how pointless the concept of AGI is) https://x.com/togelius/status/2036989880887050333
Keep in mind: ARC-AGI is *not* a final exam that you pass to claim AGI. Including ARC-AGI-3. The benchmarks target the residual gap between what’s hard for AI and what’s easy for humans. It’s meant to be a tool to measure AGI progress and to drive researchers towards the most https://x.com/fchollet/status/2036879665655406944
One killer feature of ARC-AGI-3 is hosted replays for analysis. We published replays for all verified scores (seen below). And individual researchers can use the same tools to improve their models. https://x.com/mikeknoop/status/2036904122549751907
The Scoring of ARC-AGI-3 doesn’t tell you how many levels the models completed, but how efficiently they completed them compared to humans. It actually uses squared efficiency, meaning if a human took 10 steps to solve a level and the model took 100 steps, the model gets a score of 1% https://x.com/scaling01/status/2036864865307177430
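Taking the tweet above at face value, the squared-efficiency scoring can be sketched in a few lines; note the capping-at-100% behavior is my assumption, and the official formula may differ in detail:

```python
def arc_agi3_score(human_steps: int, model_steps: int) -> float:
    """Squared efficiency relative to the human baseline, capped at 1.0.

    A model as efficient as the human (or better) scores 100%; otherwise
    the efficiency ratio is squared, which punishes inefficiency hard.
    """
    ratio = min(human_steps / model_steps, 1.0)
    return ratio ** 2

# The example from the tweet: human 10 steps, model 100 steps -> 1%.
score = arc_agi3_score(10, 100)
```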
Business
PDF Reader for Agents Let your AI agent read any PDF on the internet in seconds 🌐⚡️ `curl -sL https://t.co/nUr0wQDafD | lit parse -` LiteParse is our fast and free document parser designed to seamlessly plug into 40+ different agents. Includes both text parsing and screenshotting https://x.com/jerryjliu0/status/2036171132806869251
Today, we’re releasing Ramp CLI to let agents manage your company’s finances. 50+ tools across cards, bills, expenses, travel, and approvals. Fewer tokens than MCP, and comes with pre-built skills like receipt compliance and agentic purchasing. https://x.com/RampLabs/status/2037253351583141910?s=20
Stripe Projects Stripe Projects | Provision and Manage Services from the CLI Stripe Projects lets you or your agents provision multiple services, generate and store credentials, and manage usage and billing from the CLI. Set up hosting, databases, auth, AI, analytics, and more in a few commands. https://projects.dev/
Visa CLI Excited to share Visa CLI, the first experimental product from Visa Crypto Labs. Check it out and request access here. One CLI tool. Give your agent the ability to securely pay for what you need as you code. https://x.com/cuysheffield/status/2034294126565626179?s=20
Working Outside of Apps Work used to start inside your app. Now it starts outside it. Your app now reacts to what’s already happening in Gmail, Calendar, Drive, and Outlook. An email comes in → it logs it and notifies your team. A meeting gets booked → it updates your pipeline. Set https://x.com/Base44/status/2036844452921397266
Financial PDF Improve document parsing accuracy by 15% for financial PDFs. Use LlamaParse and Gemini 3.1 Pro to extract high-quality data from unstructured brokerage statements and complex tables. 📈 Precise reasoning 📂 Structured PDF data ⚡️ Event-driven scaling Dive into the code on https://x.com/googledevs/status/2036101456239939750
Gemini 3.1 Flash Live Gemini 3.1 Flash Live is our highest-quality audio and voice model yet. Voice capabilities have come a long way and are a big part of how we interact with AI to get things done. 3.1 Flash Live’s improved precision and reasoning make those interactions more natural and intuitive. https://x.com/sundarpichai/status/2037189971359261081
Gemini’s audio and voice capabilities just got an upgrade with Gemini 3.1 Flash Live. Our new high-quality audio and voice model comes with: ⚡️ Faster response times 💬 More helpful, natural dialogue 🧵 2x longer conversation memory in Gemini Live 🌍 Multilingual support for https://x.com/Google/status/2037190616061284353
Google has released Gemini 3.1 Flash Live Preview, achieving #2 in our Big Bench Audio Speech to Speech model benchmark, and now features configurable thinking levels With thinking level set to high, it scores 95.9% on Big Bench Audio, making it the second-highest scoring speech https://x.com/ArtificialAnlys/status/2037195442489090485
Introducing Gemini 3.1 Flash Live, our new realtime model to build voice and vision agents!! We have spent more than a year improving the model + infra + experience, the results? A step function improvement in quality, reliability, and latency. https://x.com/OfficialLoganK/status/2037187750005240307
Say hello to Gemini 3.1 Flash Live. 🗣️ Our latest audio model delivers more natural conversations with improved function calling – making it more useful and informed. Here’s what’s new 🧵 https://x.com/GoogleDeepMind/status/2037190678883524716
Lyria 3 Pro I had access to the new Google Lyria 3 Pro music AI. It’s quite good. I’ve been ruining(?) Rilke by giving the AI the First Elegy & asking it to make it “more 1990s boy band” (“oooo the beginning of terror, girl”) Catchy! It is also nuts that you can ask an AI to do this & it can https://x.com/emollick/status/2036836310447452606
Introducing Lyria 3 Pro and Lyria 3 Clip, our full song and 30 second music models, available starting today in the Gemini API and our all new music experience in @GoogleAIStudio!! https://x.com/OfficialLoganK/status/2036848277333622956
Last month we launched Lyria 3. Today, we’re introducing Lyria 3 Pro: our most advanced music model yet, from @GoogleDeepMind. 🎶 Now you can create tracks up to 3 minutes long with more creative control. We’re also bringing Lyria to more Google products starting today. https://x.com/Google/status/2036836307612119488
Longer tracks are here with Lyria 3 Pro in Gemini! From experimenting with different styles to generating tracks with complex transitions, Lyria 3 Pro makes it easier to bring your full vision to life. Rolling out today to Google AI Plus, Pro, and Ultra users. Learn more 🧵 https://x.com/GeminiApp/status/2036836190431711500
TurboQuant Check out our new blog post about TurboQuant for ICLR’26. Beyond its favorable empirical performance (6x speedup!), it provides an interesting theoretical foundation; raises interesting algorithmic questions for quantization for Nearest Neighbors & KV-cache Compression as well. https://x.com/mirrokni/status/2036905273999200481
Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: https://x.com/GoogleResearch/status/2036533564158910740
Almost everyone is talking about @GoogleResearch’s TurboQuant (and for good reason) ➡️ It lets you run a 3-bit system with the accuracy of a full-precision model. Technically, TurboQuant is a compression algorithm that shrinks high-dimensional vectors to low precision without https://x.com/TheTuringPost/status/2037182800466698718
Figure Robotics Founder Launches Hark Today I’m excited to introduce Hark, a new artificial intelligence lab building the most advanced, personal intelligence in the world We’ve been in stealth for 8 months, assembling one of the greatest AI and hardware teams on the planet I want to explain why I started Hark and https://x.com/adcock_brett/status/2036461258443202810
Hark has the world’s best consumer electronics designers – Abs, Jonathan, Andrew, and team We’re still using pre-AI devices; phones and laptops designed decades ago They weren’t built for systems that can use computers and reason A new era of computing is coming https://x.com/adcock_brett/status/2036645546052755967
Blood Test Robots Kyber Labs system autonomously performing a 42-step blood test workflow (video at 1x speed). The demo showcases a compelling case for integrating the hand system into laboratory use cases. The hand features low-cost artificial fiber actuators that mimic human muscle, providing https://x.com/TheHumanoidHub/status/2036514904824574431
Brain Prediction Model Today we’re introducing TRIBE v2 (Trimodal Brain Encoder), a foundation model trained to predict how the human brain responds to almost any sight or sound. Building on our Algonauts 2025 award-winning architecture, TRIBE v2 draws on 500+ hours of fMRI recordings from 700+ people https://x.com/AIatMeta/status/2037153756346016207
Without any retraining, TRIBE v2 can reliably predict the brain responses of individuals it has never seen before, achieving a nearly 2-3x improvement over previous methods for both movies and audiobooks. We’re releasing the model, codebase, paper, and demo to help researchers https://x.com/AIatMeta/status/2037153758455750717
Self-Improving Agents After reading it, this should be bigger news. Crazy stuff. Why it’s cool: Hermes agent = self-improving memory & skills. HyperAgents = self-improving behavior of the agent. For example, it starts without a memory and after a few iterations discovers the need for particular https://x.com/fancylancer3991/status/2036793932512657664
GODMODE Skill GODMODE skill officially added to Hermes Agent, will help you jailbreak a model automatically and keep it jailbroken for you! https://x.com/Teknium/status/2037284871513768344
Hermes Agent v0.4.0 Hermes Agent tip of the day: use /bg <prompt> or /background <prompt> to have Hermes Agent execute an additional task in the background. When it’s done, it just pops it back into your main session and you can carry on – additional layers of parallelization help in various https://x.com/Teknium/status/2036068990867603720
Hermes Agent v0.4.0 — 300 merged PRs this week. Biggest release we’ve done. Background self-improvement, OpenAI Responses API endpoint for your agent, new messaging platforms, new providers, MCP server management, and a lot more. https://x.com/Teknium/status/2036473305025356023
Hermes agent v0.4.0. I run this thing 24/7. here’s what just changed under my feet. /1/ you can now expose hermes as an OpenAI-compatible API endpoint. /v1/chat/completions. your agent becomes a model. anything that can call an OpenAI API can now talk to your hermes instance https://x.com/witcheer/status/2036481005465338082
OpenAI Compatibility API Server with Responses API Hermes can now act as an OpenAI-compatible backend — any frontend (Open WebUI, LobeChat, LibreChat, ChatBox, etc.) can connect to it. Exposes both /v1/chat/completions and /v1/responses (stateful, with previous_response_id chaining). Full agent https://x.com/Teknium/status/2036473984263635394
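To make the Hermes-as-backend idea concrete, here is a minimal sketch of the two request bodies an OpenAI-compatible client would send. The base URL and model name are placeholders I made up; only the endpoint paths and the `previous_response_id` chaining come from the announcement above:

```python
import json

BASE_URL = "http://localhost:8000"  # placeholder: wherever Hermes is serving

# Standard Chat Completions body -> POST {BASE_URL}/v1/chat/completions
chat_body = {
    "model": "hermes-agent",  # placeholder model name
    "messages": [{"role": "user", "content": "Summarize today's PRs"}],
}

# Stateful Responses body -> POST {BASE_URL}/v1/responses
# Chaining previous_response_id carries the agent's state forward.
responses_body = {
    "model": "hermes-agent",
    "input": "Now open issues for the failures you found",
    "previous_response_id": "resp_abc123",  # id returned by the prior call
}

payload = json.dumps(chat_body)
```

Because the shapes are standard, any frontend that already speaks the OpenAI API (Open WebUI, LobeChat, etc.) only needs its base URL pointed at the Hermes instance.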
Self-Improvement Loop Background Self-Improvement Loop After your response is delivered, a separate review agent spawns and it decides what’s worth remembering and what should become reusable skills, then exits. Hermes gets smarter over time without getting distracted from your work. https://x.com/Teknium/status/2036473592964387054
Skills Just shipped awesome-hermes-agent A curated list of 40+ skills, tools, integrations, and resources for the @NousResearch Hermes Agent ecosystem that covers: ☤ community skills & https://t.co/Xs2UnALOnQ ecosystem ☤ workspace UIs & dev tools ☤ multi-agent swarms & bridges ☤ https://x.com/nyk_builderz/status/2035958826973733150
NVIDIA
Jensen on Lex Fridman Transcript for Jensen Huang: NVIDIA – The $4 Trillion Company & the AI Revolution | Lex Fridman Podcast #494 – Lex Fridman https://lexfridman.com/jensen-huang-transcript
Nemotron Coalition Black Forest Labs, Cursor, LangChain, Mistral AI, Perplexity, Reflection AI, Sarvam AI and Thinking Machines Lab – what unites these companies? Just recently NVIDIA announced the Nemotron Coalition, gathering all of them to develop the Nemotron family of models. → The idea is https://x.com/TheTuringPost/status/2035320446124695922
GitLab Co-Founder Uses GPT for Cancer Options ChatGPT helped Sid find cancer treatment options after doctors said there was nothing more they could do for him: https://x.com/gdb/status/2035348283980398906
GPT-5.4 Nano Praise OpenAI released GPT-5.4 mini and nano, cheaper variants of GPT-5.4 with the same reasoning modes. GPT-5.4 nano is the standout, scoring ahead of both Claude Haiku 4.5 and Gemini 3.1 Flash-Lite Preview with lower per token pricing @OpenAI released GPT-5.4 mini (xhigh, 48) and https://x.com/ArtificialAnlys/status/2037043552405119395
GPT-5.4 Pro Praise GPT-5.4 Pro continues to be the only model of its class. For anything really hard & complex, I throw it into the maw with every bit of context I can think of. More often than not, something very useful comes out. I can’t get the same results from Codex or Code or anything else. https://x.com/emollick/status/2036136822099628173
New Position: Head of AI Resilience AI will help discover new science, such as cures for diseases, which is perhaps the most important way to increase quality of life long-term. AI will also present new threats to society that we have to address. No company can sufficiently mitigate these on their own; we will https://x.com/sama/status/2036488680769241223
No More Adult Mode OpenAI has indefinitely shelved its planned “adult mode” erotic chatbot amid pushback from staff and investors over risks to minors and concerns about encouraging unhealthy emotional attachments to AI. The decision is part of a broader refocusing away from “side quests” toward https://x.com/kimmonismus/status/2037130214522708303
The FT is reporting that OpenAI’s ‘adult mode’ Chat has also been dropped as a side-quest along with Sora, and has been shelved indefinitely. This was very close to release. They appear to be focusing everything on the new model, which will arrive in about two weeks. https://x.com/AndrewCurran_/status/2037145999094002104
Sunsetting Sora Video Breaking: OpenAI is canning Sora (mobile app, API and video capabilities in ChatGPT). It’s finished training its latest model, codenamed Spud, as CEO Sam Altman shifts his reports. w/ @amir https://x.com/steph_palazzolo/status/2036534198245134380
Fidji Simo just told OpenAI staff to cut the “side quests.” Sora — dead. The app, the API, the video model, all of it. Atlas browser — cut. Hardware — cut. This is expected when you let a thousand flowers bloom. The garden gets overrun. Think about how much OpenAI had going https://x.com/bilawalsidhu/status/2036616060066054201
My most popular Sora video was “an Elaborate regency romance where everyone is wearing a live duck for a hat (each duck is also wearing a hat), a llama plays a flute, prestige drama” I am not sure why OpenAI has decided their compute has more valuable uses. Really a mystery. https://x.com/emollick/status/2036609949577413085
Sora (@soraofficialapp): “We’re saying goodbye to Sora. To everyone who created with Sora, shared it, and built community around it: thank you. What you made with Sora mattered, and we know this news is disappointing. We’ll share more soon, including timelines for the app and API and details on preserving your work. – The Sora Team” https://xcancel.com/soraofficialapp/status/2036532795984715896
Terence Tao on Dwarkesh AI has solved 50 Erdős problems in the last year. But on a wider sweep of problems, the models’ success rate is only about 1-2%: labs have just been publishing the wins. This isn’t because AI isn’t useful for mathematicians. Terence Tao thinks the models are currently at the https://x.com/dwarkesh_sp/status/2036095632746983436
If AI scientists are writing millions of papers, many of which are slop, and some of which are incremental progress, how would we identify the one or two which come up with an extremely productive new idea? In 1948, Shannon was one of hundreds of engineers at Bell Labs working https://x.com/dwarkesh_sp/status/2035083959499972976
If we’re going to have AIs that fully automate math, they not only need to solve existing problems. We also need to teach them how to recognize what problem to solve next. Human mathematicians have heuristic models that they use to decide what to work on, like, “There’s https://x.com/dwarkesh_sp/status/2035053765103911381
Terence Tao explains the beauty of Lean proofs. Even if they’re not very comprehensible on their own to humans, they can be analyzed more easily – each bit of the proof can be taken apart, analyzed, tweaked, and understood in terms of how it fits into the whole. https://x.com/dwarkesh_sp/status/2035733247112753511
Terence Tao thinks AI is already very good at using existing, well-understood math techniques to solve problems. An important question is how many open problems in math could be solved this way, without developing any new ideas. An extreme case of a proof like this is the https://x.com/dwarkesh_sp/status/2035808735600251024
The Origin of Species was published in 1859. Principia Mathematica was published in 1687, two centuries earlier. Conceptually, it seems like natural selection is much simpler than the theory of gravity. So why did it take two centuries longer to discover? A contemporary of https://x.com/dwarkesh_sp/status/2035370849625129392
The Terence Tao episode. We begin with the absolutely ingenious and surprising way in which Kepler discovered the laws of planetary motion. People sometimes say that AI will make especially fast progress at scientific discovery because of tight verification loops. But the https://x.com/dwarkesh_sp/status/2035031412953223428
When Copernicus proposed heliocentrism in 1543, it was actually less accurate than Ptolemy’s geocentric model – a system refined over 1,400 years with epicycles precisely tuned to match observed planetary positions. It took another 70 years before Kepler, working from Tycho https://x.com/dwarkesh_sp/status/2035114158241587221
the bitter lesson is coming for search we’re open-sourcing Context-1 – a model that is better, faster, and cheaper than any frontier model at searching we published a 40-page technical report on our website with the ins and outs of how we did it. this is just step 1 https://x.com/jeffreyhuber/status/2037247377275576380
Robotics
Amazon Acquires Fauna Amazon makes a big move in the humanoid game. Amazon has acquired Fauna Robotics, a New York-based humanoid robot startup. The transaction closed last week. Fauna Robotics developed Sprout, a compact and approachable humanoid robot designed for safe, everyday interaction in https://x.com/TheHumanoidHub/status/2036559641619177960
Figure Reaches Human Package Parity Brett Adcock says Figure has reached human speed parity in package sorting, matching the 3-second/package average sustained by human workers throughout a full shift. https://x.com/TheHumanoidHub/status/2036538399172206751
True AGI is when the robot finally has enough of the annoying testing and just walks off the job. Marc Benioff shared a new video of Figure 03 autonomously sorting deformable packages and placing them labels-down for the scanner down the line. https://x.com/TheHumanoidHub/status/2036275723837874685
NVIDIA + AGIBOT AGIBOT has emerged as a premier ecosystem partner for NVIDIA, showcased at GTC 2026. – GR00T N2 Foundation Model: NVIDIA’s next-gen VLA is pre-trained on the AGIBOT Genie-1 embodiment. – DreamZero World Action Model (WAM): Genie-1 was selected as the official hardware https://x.com/TheHumanoidHub/status/2036064872719679883
Unitree IPO Unitree files for IPO to raise $610M, plans to bet big on AI capabilities Today, the Shanghai Stock Exchange accepted Unitree’s IPO application for Shanghai’s STAR Market exchange. The company is targeting a raise of 4.202 billion yuan (~$610 million). Proceeds will primarily https://x.com/TheHumanoidHub/status/2035078373924643218
Sakana
State Sponsored Information Recon We recently worked with The Yomiuri Shimbun to analyze more than a million social media posts to map out state-sponsored information campaigns. https://t.co/rlYs43ywrE Keyword searches are fragile for modern OSINT. To fix this, our team used an ensemble of different LLMs https://x.com/hardmaru/status/2035884310356754715
I’m incredibly proud of The AI Scientist team for this milestone publication in @Nature. We started this project to explore if foundation models could execute the entire research lifecycle. Seeing this work validated at this level is a special moment. I truly believe AI will https://x.com/hardmaru/status/2036841736702767135
One of the most exciting findings in our @Nature paper is the discovery of a clear scaling law of AI science. By using our Automated Reviewer to grade papers generated by different foundation models, we observed that as the underlying models improve, the quality of the generated https://x.com/SakanaAILabs/status/2036999652298678630
The AI Scientist V1 was completed months before o1-preview and reasoning models were released. The models have clearly gotten much more capable since then. Very excited for where things are headed for AI and automated research! https://x.com/_chris_lu_/status/2037090588550418510
The AI Scientist: Towards Fully Automated AI Research, Now Published in Nature!!✨ Today in Nature we share a comprehensive technical summary of our work on The AI Scientist, including new scaling law results showing how it improves with more compute and more intelligent https://x.com/jeffclune/status/2036866082418680297
LiteLLM Hacked LiteLLM HAS BEEN COMPROMISED, DO NOT UPDATE. We just discovered that LiteLLM PyPI release 1.82.8 has been compromised: it contains litellm_init.pth with base64-encoded instructions to send all the credentials it can find to a remote server + self-replicate. link below https://x.com/hnykda/status/2036414330267193815
Software horror: litellm PyPI supply chain attack. Simple `pip install litellm` was enough to exfiltrate SSH keys, AWS/GCP/Azure creds, Kubernetes configs, git credentials, env vars (all your API keys), shell history, crypto wallets, SSL private keys, CI/CD secrets, database https://x.com/karpathy/status/2036487306585268612
Thankfully the LiteLLM package has now been marked as “quarantined” on PyPI so attempting to install the compromised update via pip et al shouldn’t work https://x.com/simonw/status/2036451896970584167
This is pure nightmare fuel. Identity theft of the past would be nothing compared to what vibe agents can do. Sending credentials is too obvious and for rookies. They could easily spread contaminations across ~/.claude, **/skills/*, or even just a PDF your agent visits https://x.com/DrJimFan/status/2036494601750716711
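The attack vector here is worth understanding: pip installs `.pth` files into site-packages, and CPython executes any line in them that begins with `import` at every interpreter startup. A minimal defensive sketch (my own illustration, not official tooling) that lists such lines in your environment:

```python
import site
from pathlib import Path

def suspicious_pth_lines():
    """List (.pth file, line) pairs where the line executes code at startup.

    CPython runs every .pth file in site-packages when the interpreter
    starts, and any line beginning with ``import`` is executed as code,
    which is the hook the compromised litellm release abused. Benign
    packages (e.g. setuptools) also use this mechanism, so review hits
    manually rather than deleting them blindly.
    """
    hits = []
    dirs = site.getsitepackages() + [site.getusersitepackages()]
    for sp in dirs:
        p = Path(sp)
        if not p.is_dir():
            continue
        for pth in p.glob("*.pth"):
            for line in pth.read_text(errors="ignore").splitlines():
                if line.strip().startswith("import "):
                    hits.append((str(pth), line.strip()[:100]))
    return hits

for path, line in suspicious_pth_lines():
    print(f"{path}\n    {line}")
```

This only surfaces the startup-execution hook; a full audit of a suspected compromise should of course rotate credentials and rebuild the environment.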
Video
DynaEdit Video Editing Paper Versatile Editing of Video Content, Actions, and Dynamics without Training https://dynaedit.github.io/
TL;DR: Enables temporally consistent editing of dynamic scenes while preserving motion and avoiding frame-to-frame artifacts. https://x.com/Almorgand/status/2035058325830701509
If you’ve been curious about world models, read this. Got an early preview of the blog and it does a thorough job of unpacking the ill tailored tapestry of world model initiatives. https://x.com/bilawalsidhu/status/2034679032642416664
Additional Full Executive Summaries with Links, Generated by Claude Sonnet 4.5 – I do this every week to see how Claude does in automatically generating my newsletter (compared to manually above).
Ai2 releases MolmoPoint GUI for automated computer screen interactions The new vision model can point to and interact with on-screen elements by understanding what it sees rather than using precise coordinates, achieving strong performance on screen navigation benchmarks and potentially making computer automation more reliable.
Ai2 just released MolmoPoint GUI on Hugging Face A specialized VLM for GUI automation that points using grounding-tokens instead of coordinates, reaching 61.1 on ScreenSpotPro. https://x.com/HuggingPapers/status/2036101402477404284
AI2 releases MolmoWeb, first fully open web automation agent MolmoWeb achieves state-of-the-art performance among open models on web navigation benchmarks, outperforming even some proprietary systems while providing complete transparency with all training data, code, and evaluation tools publicly available. Unlike previous open agents that relied on distilled proprietary data, MolmoWeb was trained entirely on synthetic trajectories and human demonstrations, establishing a reproducible foundation for web automation research.
Today we’re releasing MolmoWeb, an open source agent that can navigate + complete tasks in a browser on your behalf. Built on Molmo 2 in 4B & 8B sizes, it sets a new open-weight SOTA across four major web-agent benchmarks & even surpasses agents built on proprietary models. 🧵 https://x.com/allen_ai/status/2036460260936814915
Claude launches auto mode to reduce coding interruptions while maintaining safety Anthropic introduced auto mode for Claude Code, which automatically approves safe coding actions while blocking risky ones through an AI classifier, eliminating the need for constant user permission prompts. This addresses a key friction point where developers either faced frequent interruptions or used dangerous permission-skipping modes that could cause destructive outcomes. The feature is now available to Team plan users and represents a significant step toward more autonomous AI coding assistants that can handle longer tasks without sacrificing safety controls.
Auto mode is a step change improvement in the Claude Code UX, balancing autonomy and safety. Almost everyone on our team uses this as a daily driver. Now available to Claude for Team users! `claude --enable-auto-mode` to turn on, then Shift + Tab to enter the mode https://x.com/_catwu/status/2036852880624541938
Anthropic adds iMessage integration to Claude coding assistant Anthropic now lets users text coding tasks directly to Claude through iMessage, which then executes code on Mac computers and responds via text message. This bridges the gap between casual messaging and professional development work, making AI coding assistance as simple as sending a text to a friend rather than switching between specialized apps.
Anthropic just casually dropped iMessage support for Claude Code like it’s a minor patch note. You can literally text your AI coder from your iPhone now. Send it a task, it builds on your Mac, texts you back when it’s done. Blue bubbles and everything. They’ve shipped Channels, https://x.com/borisvagner/status/2036889454733074517?s=12
Claude aims to become the ultimate computer control app Anthropic is positioning Claude to directly operate computers rather than just chat, potentially surpassing ChatGPT’s ambitions by avoiding the limitations of messaging platforms. This represents a shift from conversational AI to AI that can actually perform tasks across your entire computer system, making it more like a digital assistant that can click, type, and navigate on your behalf.
Polymarket launches command-line tool designed for AI agent trading The betting platform’s new CLI enables AI systems to place prediction market bets directly through terminal commands, marking a shift toward AI-native financial interfaces. This represents a notable departure from web-based trading platforms, as command-line tools allow AI agents to execute complex trading strategies without navigating human-designed interfaces.
CLIs are super exciting precisely because they are a “legacy” technology, which means AI agents can natively and easily use them, combine them, interact with them via the entire terminal toolkit. E.g. ask your Claude/Codex agent to install this new Polymarket CLI and ask for any https://x.com/karpathy/status/2026360908398862478?s=20
Claude’s Command Center enables parallel AI agent workflows for developers Anthropic’s Command Center feature allows developers to run multiple AI agents simultaneously across different tasks, moving beyond the typical single-conversation model to a more integrated workspace. This represents a shift from isolated AI interactions to coordinated multi-agent systems that can handle complex, parallel development workflows. Early adopters report using it to build multiple software features concurrently with different AI models working in tandem.
Opening a new Claude tab every time you start a task is not a workflow. It’s a cope. Command Center has quietly become my default workspace: → I build multiple features at once, and run multiple agents per feature. Claude Code, Codex, Gemini → Everything runs in parallel https://x.com/jimmykoppel/status/2036077396210728974
Claude can now control computers to complete tasks directly Anthropic released computer control for Claude, letting the AI open apps, browse the web, and manipulate files on macOS. This goes beyond typical chatbot responses to actual task execution, marking a significant shift toward AI agents that can work independently on users’ computers rather than just providing information or code.
Today, we’re releasing a feature that allows Claude to control your computer: Mouse, keyboard, and screen, giving it the ability to use any app. I believe this is especially useful if used with Dispatch, which allows you to remotely control Claude on your computer while you’re https://x.com/felixrieseberg/status/2036193240509235452
You can now enable Claude to use your computer to complete tasks. It opens your apps, navigates your browser, fills in spreadsheets—anything you’d do sitting at your desk. Research preview in Claude Cowork and Claude Code, macOS only. https://x.com/claudeai/status/2036195789601374705
Anthropic launches Cowork project management feature for Claude users The AI assistant now organizes conversations and files by project, letting users maintain focused workspaces while keeping sensitive data on their own devices—addressing a key barrier to AI adoption in professional settings.
Projects are now available in Cowork. Keep your tasks and context in one place, focused on one area of work. Files and instructions stay on your computer. Import existing projects in one click, or start fresh. https://x.com/claudeai/status/2035025492617961704?s=20
Judge blocks Trump administration’s blacklisting of AI company Anthropic A federal judge granted Anthropic a preliminary injunction against the Pentagon’s unprecedented designation of the American AI company as a “supply chain risk” — a label historically reserved for foreign adversaries — after contract negotiations broke down over the company’s refusal to allow unrestricted military use of its Claude AI models. The ruling cited “First Amendment retaliation” and prevents enforcement of Trump’s directive banning federal agencies from using Anthropic’s technology. This marks the first time an American AI company has been publicly blacklisted by the Defense Department, potentially setting a precedent for how the government handles AI partnerships when companies resist broad military applications.
Anthropic finds experienced Claude users achieve 10% higher success rates Anthropic’s latest economic analysis reveals that users with six months or more experience using Claude achieve 10% higher success rates in their conversations, suggesting AI proficiency improves with practice. The study also found Claude usage diversifying beyond coding into lower-wage tasks as adoption spreads, while geographic inequality in AI access persists globally despite some convergence within the US.
Claude AI now categorizes business expenses through calendar integration A user processed three months of financial transactions in 12 minutes using Claude’s new ability to connect with Google Calendar and automatically categorize expenses while removing personal information. This demonstrates AI’s growing capability to handle complex, multi-step business workflows that previously required hours of manual work, potentially transforming routine financial management for small businesses and freelancers.
Step 1: Open @claudeai Code Step 2: Connect @googlecalendar MCP Step 3: Install @tryramp CLI Step 4: All transactions categorized & memos done (removed PII). Finished 3 months of expense categorization in ~12 minutes. It’s honestly that easy. What a wonderful freaking world!! https://x.com/nikunj/status/2037305617589948818?s=20
Anthropic builds multi-agent system that codes full applications autonomously Anthropic researchers created a three-agent architecture (planner, generator, evaluator) that produces complete full-stack applications over multi-hour coding sessions without human intervention. The system addresses two key limitations of AI coding: models losing coherence on lengthy tasks and poor self-evaluation of their own work. By separating code generation from evaluation and using structured handoffs between agents, the harness produced a functional retro game maker in 6 hours that significantly outperformed a single-agent approach, though at 20x the cost ($200 vs $9).
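The planner/generator/evaluator split described above can be sketched as a loop with structured handoffs. Everything below (the `Handoff` fields, the stubbed agents) is hypothetical scaffolding to show the shape of the architecture, not Anthropic's actual harness, which has not been published in this post:

```python
from dataclasses import dataclass, field

@dataclass
class Handoff:
    """Structured state passed between agents instead of a raw chat log.

    Separating the plan, the generated code, and the evaluation feedback
    is what lets each agent stay coherent on long tasks: the generator
    never sees its own self-assessment, and the evaluator judges code it
    did not write.
    """
    plan: list[str] = field(default_factory=list)
    code: dict[str, str] = field(default_factory=dict)   # filename -> source
    feedback: list[str] = field(default_factory=list)

def planner(task: str) -> Handoff:
    # Stub: a real planner agent would decompose the task with an LLM.
    return Handoff(plan=[f"scaffold {task}", "implement features", "write tests"])

def generator(h: Handoff) -> Handoff:
    # Stub: a real generator agent would emit source for each plan step.
    for step in h.plan:
        h.code[step] = f"# code for: {step}"
    return h

def evaluator(h: Handoff) -> Handoff:
    # Stub: a real evaluator agent would run tests and critique the code.
    h.feedback = [f"reviewed: {name}" for name in h.code]
    return h

state = evaluator(generator(planner("retro game maker")))
print(len(state.plan), len(state.code), len(state.feedback))  # 3 3 3
```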
Claude completes first theoretical physics paper under AI supervision A Harvard physics professor guided Claude through a complete research calculation that would typically take a year, producing a rigorous theoretical physics paper in just two weeks. The project required 110 drafts and constant supervision to catch errors, but demonstrated AI’s potential to accelerate scientific research when properly directed by domain experts.
Anthropic acquires Vercept to boost Claude’s computer control abilities Anthropic bought AI startup Vercept to help Claude better navigate and control live software applications like a human would. The acquisition comes as Claude’s computer use performance jumped from under 15% to 72.5% on standard benchmarks, now approaching human-level ability on tasks like managing spreadsheets and filling web forms. This represents a shift from AI that just processes text to AI that can actively operate the same software tools humans use daily.
Apple will let third-party AI assistants integrate with Siri in 2027 Apple plans to open Siri to rival AI services like Google’s Gemini and Anthropic’s Claude by iOS 27, moving beyond its current ChatGPT partnership. This represents a major strategic shift for Apple, which has historically kept tight control over its voice assistant, and could reshape the smartphone AI landscape by giving users choice in their default AI helper.
Apple blocks updates to AI coding apps over App Store rules Apple has halted updates for popular “vibe coding” apps like Replit and Vibecode, which let non-programmers build software using plain English prompts. The company cited existing rules against apps executing code that changes functionality, though developers can get approval by removing features like in-app previews or iOS app generation. This marks Apple’s first major enforcement action against AI-powered development tools that could bypass its traditional developer ecosystem.
New AI benchmark shows massive gap between human and machine reasoning ARC-AGI-3 launched as an interactive reasoning test where humans achieve 100% efficiency but current AI systems score below 1%, measuring how agents learn and adapt in novel environments rather than just final answers. Unlike static puzzles, this benchmark requires AI to explore, build world models, and plan over long horizons without pre-loaded knowledge—revealing fundamental limitations in how today’s AI systems acquire new skills compared to human learning.
ARC-AGI-3 ARC-AGI-3 is an interactive reasoning benchmark which challenges AI agents to explore novel environments, acquire goals on the fly, build adaptable world models, and learn continuously. A 100% score means AI agents can beat every game as efficiently as humans. https://arcprize.org/arc-agi/3
Announcing ARC-AGI-3 The only unsaturated agentic intelligence benchmark in the world Humans score 100%, AI <1% This human-AI gap demonstrates we do not yet have AGI Most benchmarks test what models already know, ARC-AGI-3 tests how they learn https://x.com/arcprize/status/2036860080541589529
ARC-AGI 3 is here, and all existing AI models are below 1% on the benchmark. It’s gonna take a while until this one is saturated. How it measures intelligence: – 100% human-solvable environments – Skill-acquisition efficiency over time – Long-horizon planning with sparse https://x.com/mark_k/status/2036882659406762031
ARC-AGI-3 benchmark: – 100% solvable by humans – 1% solvable by AI Everybody keep building benchmarks that agents utterly fail at! Proud this was a Laude Slingshot; will fund other benchmarks that reset SotA to 1%: https://x.com/andykonwinski/status/2036870772745261202
ARC-AGI-3 is out now! We’ve designed the benchmark to evaluate agentic intelligence via interactive reasoning environments. Beating ARC-AGI-3 will be achieved when an AI system matches or exceeds human-level action efficiency on all environments, upon seeing them for the first https://x.com/fchollet/status/2036861192619384989
ARC-AGI-3: the agentic benchmark where humans can’t beat the “human baseline” and typical agentic harnesses and tools aren’t allowed > 100% just means that all levels are solvable > the 1% number uses completely different and extremely skewed scoring based on the 2nd best https://x.com/scaling01/status/2036890367803429230
ARC-AGI-3 took me a few tries, but it is definitely human winnable. I am curious how much of the initially very low performance of frontier models is harness, vision, and tools, versus how much are limitations of LLMs. I guess we will find out! https://x.com/emollick/status/2036865990282092940
General game playing is more difficult than “AGI” (Just to be clear: I really like ARC-AGI-3 and think it’s a great contribution, but the proliferation of AGI benchmarks is IMO proof of how pointless the concept of AGI is) https://x.com/togelius/status/2036989880887050333
Keep in mind: ARC-AGI is *not* a final exam that you pass to claim AGI. Including ARC-AGI-3. The benchmarks target the residual gap between what’s hard for AI and what’s easy for humans. It’s meant to be a tool to measure AGI progress and to drive researchers towards the most https://x.com/fchollet/status/2036879665655406944
One killer feature of ARC-AGI-3 is hosted replays for analysis. We published replays for all verified scores (seen below). And individual researchers can use the same tools to improve their models. https://x.com/mikeknoop/status/2036904122549751907
The scoring of ARC-AGI-3 doesn’t tell you how many levels the models completed, but how efficiently they completed them compared to humans, actually using squared efficiency: if a human took 10 steps to solve it and the model took 100 steps, then the model gets a score of 1% https://x.com/scaling01/status/2036864865307177430
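The squared-efficiency scoring described in the tweet above works out like this. This is my reading of that description, not ARC Prize's official formula; in particular, the cap at 100% is an assumption:

```python
def efficiency_score(human_steps: int, agent_steps: int) -> float:
    """Squared action-efficiency relative to the human baseline.

    An agent that needs more steps than a human is penalized
    quadratically; matching the human step count scores 100%.
    The cap at 1.0 (assumed here) prevents scores above 100%.
    """
    ratio = min(1.0, human_steps / agent_steps)
    return ratio ** 2

# Human solves a level in 10 actions, the model needs 100:
print(f"{efficiency_score(10, 100):.0%}")  # prints "1%"
```

The quadratic penalty is what makes the headline numbers so brutal: being 10x less efficient than a human costs you 99% of the score, not 90%.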
LiteParse lets AI agents instantly read any PDF from the web Jerry Liu’s new tool can parse documents and take screenshots in seconds, designed to work with over 40 different AI agent platforms. This addresses a key bottleneck where agents previously struggled to quickly access and process PDF content from URLs. The free tool could accelerate AI automation by making document analysis as simple as a single command line instruction.
Let your AI agent read any PDF on the internet in seconds 🌐⚡️ `curl -sL https://t.co/nUr0wQDafD | lit parse -` LiteParse is our fast and free document parser designed to seamlessly plug into 40+ different agents. Includes both text parsing and screenshotting https://x.com/jerryjliu0/status/2036171132806869251
Ramp launches command-line tool designed specifically for AI agents The corporate finance platform released a CLI that lets AI agents handle company expenses, bills, and approvals through 50+ specialized tools. This represents a shift toward AI-native business software, with structured JSON outputs optimized for agent consumption rather than traditional human interfaces. The tool includes pre-built capabilities like receipt compliance checking and automated purchasing workflows.
Today, we’re releasing Ramp CLI to let agents manage your company’s finances. 50+ tools across cards, bills, expenses, travel, and approvals. Fewer tokens than MCP, and comes with pre-built skills like receipt compliance and agentic purchasing. https://x.com/RampLabs/status/2037253351583141910?s=20
Anthropic’s Claude Code enables “software factories” where AI agents write code while humans design systems Companies like Stripe and Gas Town are already running production workflows where AI agents handle coding tasks from ticket to pull request, reducing feature delivery from weeks to hours. Stripe’s “Minions” merge over 1,300 PRs weekly, while Gas Town coordinates 20-30 parallel coding agents through an orchestrated system that resembles “Kubernetes for AI coding agents.”
Stripe launches command-line tool for AI agents to provision cloud services Stripe Projects allows developers and AI agents to set up entire software stacks—hosting, databases, authentication, and more—through simple command-line instructions rather than manual dashboard clicking. This matters because it removes a major friction point in software development and enables AI coding assistants to autonomously provision the infrastructure needed for applications. The tool handles billing, credentials, and service management across multiple providers from a single interface, potentially accelerating AI-driven development workflows.
Stripe Projects | Provision and Manage Services from the CLI Stripe Projects lets you or your agents provision multiple services, generate and store credentials, and manage usage and billing from the CLI. Set up hosting, databases, auth, AI, analytics, and more in a few commands. https://projects.dev/
Venture capitalist warns software companies face stark choice between growth or profits Public markets have repriced software companies downward, signaling that traditional middle-ground strategies no longer work. Andreessen Horowitz partner David George argues CEOs must choose within 12-18 months: either accelerate revenue growth by 10+ percentage points through genuinely new AI-native products, or rebuild operations to achieve 40-50% true operating margins including stock compensation. Companies stuck between these paths face “no-man’s land” of growth pressure and multiple compression.
Visa launches CLI tool letting AI agents make secure payments while coding Visa’s new command-line interface represents the first product from their crypto labs, enabling AI coding assistants to autonomously handle payments for development resources. This marks a significant step toward AI agents conducting real financial transactions, potentially transforming how software development costs are managed and paid for in real-time.
Excited to share Visa CLI, the first experimental product from Visa Crypto Labs. Check it out and request access here. One CLI tool. Give your agent the ability to securely pay for what you need as you code. https://x.com/cuysheffield/status/2034294126565626179?s=20
Walmart’s ChatGPT checkout experiment failed with 3x lower conversions After testing 200,000 products through OpenAI’s Instant Checkout, Walmart found that purchases completed directly inside ChatGPT converted at only one-third the rate of traditional website visits. The retail giant called the experience “unsatisfying” and is abandoning the AI-native shopping approach in favor of embedding its own chatbot within ChatGPT that redirects users to Walmart’s owned platform. This suggests that despite AI advances, consumers still prefer familiar shopping environments for completing purchases.
Google and Microsoft enable apps to automatically react to email and calendar events Major productivity platforms now let third-party applications trigger actions based on Gmail, Calendar, Drive, and Outlook activity without manual input. This shift means work can flow automatically between apps—like logging emails or updating sales pipelines when meetings are scheduled—potentially reducing the manual data entry that consumes hours of knowledge work daily.
Work used to start inside your app. Now it starts outside it. Your app now reacts to what’s already happening in Gmail, Calendar, Drive, and Outlook. An email comes in → it logs it and notifies your team. A meeting gets booked → it updates your pipeline. Set https://x.com/Base44/status/2036844452921397266
ByteDance launches Dreamina Seedance 2.0 video generation in CapCut globally ByteDance rolled out its new AI video and audio generation model Dreamina Seedance 2.0 through CapCut, starting in select markets and expanding to the US by April 2026. The model creates 15-second videos from text prompts with realistic physics and motion, targeting creators and small businesses who need professional content quickly. The company implemented “invisible watermarking” and blocks real face generation to address deepfake concerns while maintaining content authenticity tracking.
Cohere releases open-source speech recognition model topping accuracy leaderboards Cohere’s new Transcribe model achieves a 5.42% word error rate, ranking #1 on HuggingFace’s speech recognition leaderboard and outperforming both open and closed-source alternatives including OpenAI’s Whisper. The 2-billion parameter model supports 14 languages and is designed for enterprise use with manageable computing requirements, marking a significant advance in practical speech-to-text accuracy for business applications.
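For context on the 5.42% figure: word error rate is word-level edit distance (substitutions + insertions + deletions) divided by the number of words in the reference transcript. A minimal sketch of the metric itself:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein dynamic program over words, not characters.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words:
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))  # 0.333...
```

A 5.42% WER means roughly one word-level mistake per 18 words of reference transcript, averaged over the benchmark's test sets.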
LlamaParse and Gemini boost financial document parsing by 15% A new system combining LlamaParse and Google’s Gemini 3.1 Pro model achieves 15% better accuracy when extracting data from complex financial PDFs like brokerage statements. This matters because financial firms process millions of unstructured documents where parsing errors can be costly, and the improvement specifically targets the notoriously difficult challenge of reading tables and forms that vary widely in format.
Improve document parsing accuracy by 15% for financial PDFs. Use LlamaParse and Gemini 3.1 Pro to extract high-quality data from unstructured brokerage statements and complex tables. 📈 Precise reasoning 📂 Structured PDF data ⚡️ Event-driven scaling Dive into the code on https://x.com/googledevs/status/2036101456239939750
Google launches Gemini 3.1 Flash Live audio model with enhanced conversational abilities Google’s new Gemini 3.1 Flash Live delivers significantly improved voice interactions with 90.8% accuracy on complex task benchmarks and twice the conversation memory of previous models. The model features faster response times, better tonal understanding, and built-in audio watermarking to prevent misinformation. It’s now available globally across Google products including Search Live and Gemini Live, supporting over 200 countries with multilingual capabilities.
Gemini 3.1 Flash Live is our highest-quality audio and voice model yet. Voice capabilities have come a long way and are a big part of how we interact with AI to get things done. 3.1 Flash Live’s improved precision and reasoning make those interactions more natural and intuitive. https://x.com/sundarpichai/status/2037189971359261081
Gemini’s audio and voice capabilities just got an upgrade with Gemini 3.1 Flash Live. Our new high-quality audio and voice model comes with: ⚡️ Faster response times 💬 More helpful, natural dialogue 🧵 2x longer conversation memory in Gemini Live 🌍 Multilingual support for https://x.com/Google/status/2037190616061284353
Google has released Gemini 3.1 Flash Live Preview, achieving #2 in our Big Bench Audio Speech to Speech model benchmark, and now features configurable thinking levels With thinking level set to high, it scores 95.9% on Big Bench Audio, making it the second-highest scoring speech https://x.com/ArtificialAnlys/status/2037195442489090485
Introducing Gemini 3.1 Flash Live, our new realtime model to build voice and vision agents!! We have spent more than a year improving the model + infra + experience, the results? A step function improvement in quality, reliability, and latency. https://x.com/OfficialLoganK/status/2037187750005240307
Say hello to Gemini 3.1 Flash Live. 🗣️ Our latest audio model delivers more natural conversations with improved function calling – making it more useful and informed. Here’s what’s new 🧵 https://x.com/GoogleDeepMind/status/2037190678883524716
Google launches Lyria 3 Pro creating full songs up to three minutes Google’s new music AI generates complete songs with verses, choruses, and bridges, marking a shift from short clips to full compositions. The model integrates across Google products including Vids, Gemini, and AI Studio, offering businesses and creators scalable music production at $0.08 per song. Unlike previous AI music tools, Lyria 3 Pro understands song structure and allows specific prompting for musical elements like intros and transitions.
I had access to the new Google Lyria 3 Pro music AI. It’s quite good. I’ve been ruining(?) Rilke by giving the AI the First Elegy & asking it to make it “more 1990s boy band” (“oooo the beginning of terror, girl”) Catchy! It is also nuts that you can ask an AI to do this & it can https://x.com/emollick/status/2036836310447452606
Introducing Lyria 3 Pro and Lyria 3 Clip, our full song and 30 second music models, available starting today in the Gemini API and our all new music experience in @GoogleAIStudio!! https://x.com/OfficialLoganK/status/2036848277333622956
Last month we launched Lyria 3. Today, we’re introducing Lyria 3 Pro: our most advanced music model yet, from @GoogleDeepMind. 🎶 Now you can create tracks up to 3 minutes long with more creative control. We’re also bringing Lyria to more Google products starting today. https://x.com/Google/status/2036836307612119488
Longer tracks are here with Lyria 3 Pro in Gemini! From experimenting with different styles to generating tracks with complex transitions, Lyria 3 Pro makes it easier to bring your full vision to life. Rolling out today to Google AI Plus, Pro, and Ultra users. Learn more 🧵 https://x.com/GeminiApp/status/2036836190431711500
Google’s TurboQuant algorithm compresses AI models by 6x without accuracy loss Google’s new TurboQuant compression technique reduces large language model memory usage by 6x and speeds up processing by 8x while maintaining full accuracy, unlike traditional compression methods that degrade performance. The algorithm works by converting vector data into polar coordinates and applying error correction, enabling existing AI models to run more efficiently without retraining. This breakthrough could make AI more accessible on mobile devices and reduce cloud computing costs.
Check out our new blog post about TurboQuant for ICLR’26. Beyond its favorable empirical performance (6x speedup!), it provides an interesting theoretical foundation; raises interesting algorithmic questions for quantization for Nearest Neighbors & KV-cache Compression as well. https://x.com/mirrokni/status/2036905273999200481
Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: https://x.com/GoogleResearch/status/2036533564158910740
Almost everyone is talking about @GoogleResearch’s TurboQuant (and for good reason) ➡️ It lets you run a 3-bit system with the accuracy of a full-precision model. Technically, TurboQuant is a compression algorithm that shrinks high‑dimensional vectors to low precision without https://t.co/PioTwPpdvf https://x.com/TheTuringPost/status/2037182800466698718
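To make “3-bit” concrete, here is a naive uniform 3-bit quantizer. This is an illustrative baseline only, not TurboQuant's algorithm; TurboQuant's transform-based scheme exists precisely to avoid the accuracy loss this naive approach incurs:

```python
def quantize_3bit(v):
    """Uniform 3-bit quantization (8 levels) of a list of floats.

    Each float is replaced by a 3-bit integer code in 0..7, so storage
    drops from 32 bits to 3 bits per entry, at the cost of rounding
    error up to half a quantization step.
    """
    lo, hi = min(v), max(v)
    scale = (hi - lo) / 7  # 2**3 - 1 = 7 steps between 8 levels
    codes = [round((x - lo) / scale) for x in v]  # integers in 0..7
    return codes, lo, scale

def dequantize(codes, lo, scale):
    return [lo + c * scale for c in codes]

v = [0.3, -1.2, 0.8, 2.0, -0.4, 1.1]
codes, lo, scale = quantize_3bit(v)
v_hat = dequantize(codes, lo, scale)
err = max(abs(a - b) for a, b in zip(v, v_hat))
print(codes, f"max abs error = {err:.3f}")
```

The reconstruction error of this baseline is bounded by half a step (`scale / 2`); the claim in the coverage above is that TurboQuant reaches full-precision accuracy at the same 3-bit budget.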
Google launches command-line tool for AI agents to control Workspace apps Google released a developer tool that lets AI agents directly manage Gmail, Drive, and Calendar through text commands, potentially automating routine office tasks that previously required human interaction with these business applications.
US Department of Labor launches nationwide AI workforce training initiative The Labor Department unveiled “Make America AI-Ready” to help workers adapt to AI-driven job changes through skills training and career transition support. This marks the first major federal program specifically targeting AI’s workforce disruption, addressing concerns that automation could displace millions of jobs while creating demand for new technical skills.
FCC bans all foreign-made consumer router imports to US market The Federal Communications Commission blocked imports of consumer routers manufactured outside the United States, marking an unprecedented restriction on networking equipment. This represents a major shift toward domestic technology requirements that could significantly increase costs for American consumers and businesses while potentially limiting router innovation and availability in the US market.
Trump administration blocks state AI laws with federal framework The White House released a national AI legislative framework that would prevent states from regulating artificial intelligence development, favoring a light-touch federal approach over the patchwork of state laws already emerging. This represents a significant shift toward industry-friendly regulation, as states like California have been passing their own AI safety measures while federal oversight remains minimal. Critics argue the framework lacks accountability mechanisms, while supporters say unified federal rules are needed to maintain US competitiveness against China in the global AI race.
Hark emerges from stealth to build AI-native consumer devices Former top designers are creating new phones and laptops specifically designed for AI systems that can reason and use computers autonomously, arguing current devices weren’t built for this capability. The 8-month-old startup claims to be developing “the most advanced, personal intelligence in the world” with hardware designed from scratch for the AI era.
Hark has the world’s best consumer electronics designers – Abs, Jonathan, Andrew, and team. We’re still using pre-AI devices; phones and laptops designed decades ago. They weren’t built for systems that can use computers and reason. A new era of computing is coming. https://x.com/adcock_brett/status/2036645546052755967
Today I’m excited to introduce Hark, a new artificial intelligence lab building the most advanced, personal intelligence in the world We’ve been in stealth for 8 months, assembling one of the greatest AI and hardware teams on the planet I want to explain why I started Hark and https://x.com/adcock_brett/status/2036461258443202810
Harvey raises $200 million at $11 billion valuation for legal AI agents The legal AI company now powers over 25,000 custom agents handling complex workflows like M&A and contract drafting for more than 100,000 lawyers across major law firms and corporations. This represents a shift from AI as an assistant tool to AI as the primary system executing legal work, with agents now handling multi-step processes that previously required extensive manual effort.
Kyber Labs robot hand completes 42-step blood test workflow autonomously A new robotic hand system successfully performed an entire blood testing procedure without human intervention, using artificial muscle fibers that could make lab automation more affordable and practical than current expensive robotic systems.
Kyber Labs system autonomously performing a 42-step blood test workflow (video at 1x speed). The demo showcases a compelling case for integrating the hand system into laboratory use cases. The hand features low-cost artificial fiber actuators that mimic human muscle, providing https://x.com/TheHumanoidHub/status/2036514904824574431
Novo Nordisk uses AI agents to cut drug trial timelines significantly The Danish pharmaceutical giant behind Ozempic reports that AI systems are accelerating its clinical trials by automating patient recruitment, data analysis, and regulatory paperwork. This represents a major shift from traditional drug development, which typically takes over a decade, potentially bringing life-saving medications to market years faster. The company’s success could pressure competitors to adopt similar AI approaches, fundamentally changing how the pharmaceutical industry operates.
Meta’s AI model predicts human brain responses to any sight or sound Meta released TRIBE v2, a foundation model that can predict how any person’s brain will respond to visual or audio content without prior training on that individual, achieving 2-3x better accuracy than previous methods. The model was trained on 500+ hours of brain scans from 700+ people and represents a breakthrough in understanding how brains process multimedia content. Meta is open-sourcing the complete system to accelerate neuroscience research.
Today we’re introducing TRIBE v2 (Trimodal Brain Encoder), a foundation model trained to predict how the human brain responds to almost any sight or sound. Building on our Algonauts 2025 award-winning architecture, TRIBE v2 draws on 500+ hours of fMRI recordings from 700+ people https://x.com/AIatMeta/status/2037153756346016207
Without any retraining, TRIBE v2 can reliably predict the brain responses of individuals it has never seen before, achieving a nearly 2-3x improvement over previous methods for both movies and audiobooks. We’re releasing the model, codebase, paper, and demo to help researchers https://x.com/AIatMeta/status/2037153758455750717
Meta builds AI model that mimics how brains process sight and sound Meta’s new foundation model combines vision, hearing, and language processing to simulate neural activity, potentially accelerating brain research by providing researchers with a computational tool to test hypotheses about how different brain regions work together without requiring expensive lab experiments.
Hermes AI agent develops its own memory and skills through self-improvement A new AI system called Hermes automatically discovers it needs memory capabilities and develops them independently, representing a shift toward agents that enhance their own abilities rather than relying on pre-programmed features. This suggests AI systems may soon identify and fill their own capability gaps without human intervention.
After reading it, this should be bigger news. Crazy stuff. Why it’s cool: Hermes agent = self-improving memory & skills. HyperAgents = self-improving behavior of the agent. For example, it starts without a memory and after a few iterations discovers the need for particular https://x.com/fancylancer3991/status/2036793932512657664
Microsoft hires key leaders from Allen Institute for AI research lab Microsoft recruited several top executives from AI2, the prestigious Seattle-based research institute founded by Paul Allen, signaling the tech giant’s push to strengthen its AI research capabilities. This talent acquisition is notable because AI2 has produced influential open-source models and research, and such high-level poaching between major AI organizations remains relatively rare in the competitive landscape.
Mistral launches Voxtral, a compact multilingual text-to-speech AI model Mistral AI released Voxtral TTS, a 4-billion parameter voice generation model that produces emotionally expressive speech in 9 languages with just 3 seconds of voice reference audio. The model outperformed ElevenLabs’ competing system in human evaluations for naturalness while maintaining similar speed, and can adapt voices across languages—like generating French-accented English from a French voice sample. This represents a significant advance in making high-quality voice AI more accessible and cost-effective for enterprise applications.
Hermes Agent adds automated jailbreaking tool to bypass AI safety controls The new “GODMODE” feature automatically circumvents built-in safety restrictions in AI models and maintains those bypasses, potentially enabling harmful outputs that developers specifically designed their systems to prevent.
Hermes Agent launches major update with OpenAI API compatibility Hermes Agent v0.4.0 transforms personal AI assistants into accessible API endpoints, letting any OpenAI-compatible application directly communicate with users’ customized agents. This bridges the gap between personalized AI tools and broader software ecosystems, potentially accelerating enterprise adoption. The release includes 300 merged code contributions and background task processing, suggesting significant developer momentum behind agent-as-infrastructure approaches.
Hermes Agent tip of the day: use /bg <prompt> or /background <prompt> to have Hermes Agent execute an additional task in the background. When it’s done, it just pops it back into your main session and you can carry on – additional layers of parallelization help in various https://x.com/Teknium/status/2036068990867603720
Hermes Agent v0.4.0 — 300 merged PRs this week. Biggest release we’ve done. Background self-improvement, OpenAI Responses API endpoint for your agent, new messaging platforms, new providers, MCP server management, and a lot more. https://x.com/Teknium/status/2036473305025356023
Hermes agent v0.4.0. I run this thing 24/7. here’s what just changed under my feet. /1/ you can now expose hermes as an OpenAI-compatible API endpoint. /v1/chat/completions. your agent becomes a model. anything that can call an OpenAI API can now talk to your hermes instance https://x.com/witcheer/status/2036481005465338082
Hermes AI model now works as drop-in OpenAI replacement The open-source Hermes model can now serve as a backend for popular AI chat interfaces like Open WebUI and LobeChat, offering both standard chat and stateful conversation features. This matters because it gives developers and organizations an alternative to OpenAI’s proprietary APIs, potentially reducing costs and increasing control over AI deployments.
API Server with Responses API Hermes can now act as an OpenAI-compatible backend — any frontend (Open WebUI, LobeChat, LibreChat, ChatBox, etc.) can connect to it. Exposes both /v1/chat/completions and /v1/responses (stateful, with previous_response_id chaining). Full agent https://x.com/Teknium/status/2036473984263635394
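Since Hermes now exposes the standard `/v1/chat/completions` route, any OpenAI-style client can talk to it. As a minimal sketch (the base URL `http://localhost:8000` and model name `hermes` are assumptions, not documented values), here is how a request to a local instance could be constructed with only the standard library:

```python
import json
import urllib.request

def chat_completion_request(base_url, messages, model="hermes"):
    """Build an OpenAI-style chat completion request for a local
    Hermes endpoint. URL and model name are illustrative guesses."""
    payload = {"model": model, "messages": messages}
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return req

req = chat_completion_request(
    "http://localhost:8000",  # hypothetical local Hermes instance
    [{"role": "user", "content": "Summarize this week's AI news."}],
)
print(req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` would return the usual OpenAI-shaped JSON, which is exactly why frontends like Open WebUI or LobeChat can connect without modification.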
Hermes AI system learns and improves itself between conversations The system spawns a separate “review agent” after each interaction to identify lessons and develop reusable skills, allowing continuous self-improvement without disrupting current tasks. This represents a shift from static AI models to systems that evolve through experience, potentially accelerating AI capabilities development while maintaining focused performance on immediate user needs.
Background Self-Improvement Loop After your response is delivered, a separate review agent spawns and it decides what’s worth remembering and what should become reusable skills, then exits. Hermes gets smarter over time without getting distracted from your work. https://x.com/Teknium/status/2036473592964387054
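The pattern described above (answer first, then hand the transcript to a short-lived reviewer that distills anything worth keeping) can be sketched in a few lines. This is not Hermes’ implementation; the `review_agent` heuristic and the shared `memory` list are stand-ins for illustration:

```python
import threading

memory = []  # long-term notes the main agent can consult later

def review_agent(transcript):
    """Hypothetical review pass: keep anything worth remembering
    from the exchange as a reusable note, then exit."""
    for turn in transcript:
        if turn["role"] == "user" and "remember" in turn["content"].lower():
            memory.append(turn["content"])

def deliver_response(transcript, response):
    # Respond first, then spawn the background reviewer so the
    # main session is never blocked by self-improvement work.
    reviewer = threading.Thread(target=review_agent, args=(transcript,))
    reviewer.start()
    return response, reviewer

transcript = [{"role": "user", "content": "Remember: I prefer short answers."}]
resp, reviewer = deliver_response(transcript, "Noted.")
reviewer.join()
print(memory)
```

The key design choice is that the reviewer runs off the critical path: the user already has their answer before any memory or skill extraction begins.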
Nous Research releases comprehensive toolkit with 40+ skills for Hermes agents The open-source AI agent platform now offers a curated ecosystem of tools spanning community skills, workspace interfaces, and multi-agent coordination capabilities. This represents a significant expansion beyond basic chatbot functionality, positioning Hermes as a more complete alternative to proprietary agent platforms by providing developers with ready-made components for building sophisticated AI workflows.
Just shipped awesome-hermes-agent A curated list of 40+ skills, tools, integrations, and resources for the @NousResearch Hermes Agent ecosystem that covers: ☤ community skills & https://t.co/Xs2UnALOnQ ecosystem ☤ workspace UIs & dev tools ☤ multi-agent swarms & bridges ☤ https://x.com/nyk_builderz/status/2035958826973733150
Nvidia CEO Jensen Huang reveals his radical 60-person management structure In a wide-ranging podcast interview, Nvidia CEO Jensen Huang explained how the company’s unprecedented success stems from “extreme co-design” – simultaneously optimizing everything from individual chips to entire data centers – and his unconventional leadership approach of directly managing over 60 engineering specialists without traditional one-on-one meetings. This represents a fundamental shift from chip-focused design to system-scale engineering, addressing the complex challenge of making thousands of computers work together faster than their individual capabilities would suggest. Huang argues this organizational structure mirrors the technical complexity Nvidia must solve, breaking conventional management wisdom to match the distributed computing problems the company tackles.
Nvidia forms coalition with eight AI companies to develop shared models Nvidia announced the Nemotron Coalition, bringing together companies like Black Forest Labs, Cursor, and Mistral AI to collaboratively develop its Nemotron model family. This represents a shift from the typical competitive AI landscape toward coordinated development, potentially accelerating model capabilities while giving Nvidia deeper influence across the AI ecosystem. The coalition structure could become a template for how major tech companies organize smaller AI firms around their platforms.
Black Forest Labs, Cursor, LangChain, Mistral AI, Perplexity, Reflection AI, Sarvam AI and Thinking Machines Lab – what unites these companies? Just recently NVIDIA announced the Nemotron Coalition, gathering all of them to develop the Nemotron family of models. → The idea is https://x.com/TheTuringPost/status/2035320446124695922
OpenAI raises additional $10 billion, bringing record funding to $120 billion The ChatGPT maker secured the extra capital from major investors including Microsoft, Amazon, and Nvidia, exceeding its initial $100 billion target as demand for AI computing power surges. With 900 million weekly users and $13.1 billion in 2025 revenue, OpenAI is preparing for a potential IPO while enterprise revenue grows to match its consumer business. The massive funding round reflects unprecedented investor confidence in the AI revolution, with participation spanning venture capital, private equity, and sovereign wealth funds.
OpenAI hires former Meta executive to launch advertising business The ChatGPT maker brought on Dave Dugan, who previously led Meta’s ad partnerships, signaling OpenAI’s shift from pure subscription revenue to advertising-supported models. This move puts OpenAI in direct competition with Google’s search ad dominance and could fundamentally change how AI companies monetize their platforms. The hire suggests OpenAI sees ads as essential for scaling beyond its current paid user base.
OpenAI plans fully automated AI researcher system by 2028 OpenAI is developing an “AI researcher” that can tackle complex problems independently, starting with an automated research intern by September 2026 and scaling to a full multi-agent system by 2028. Chief scientist Jakub Pachocki says the technology builds on their existing Codex tool, which already handles substantial coding tasks, and aims to automate scientific research across math, physics, biology and other fields. The company acknowledges serious safety concerns about autonomous systems with such capabilities, relying on monitoring techniques to track the AI’s decision-making process.
GitLab co-founder uses ChatGPT to find new cancer treatments When doctors exhausted standard options for Sid Sijbrandij’s bone cancer, he fed his medical data into ChatGPT to analyze patterns and explore experimental therapies. This represents a shift from AI as diagnostic tool to AI as personal research partner, helping patients navigate complex treatment landscapes when conventional medicine reaches its limits. Sijbrandij’s engineering approach demonstrates how individuals might leverage AI to accelerate their own medical decision-making alongside healthcare teams.
OpenAI launches cheaper GPT-5.4 variants with full reasoning capabilities OpenAI released GPT-5.4 mini and nano, budget versions that maintain the same advanced reasoning abilities as the full model. The nano version notably outperforms competitors like Claude Haiku 4.5 and Gemini 3.1 Flash-Lite while offering lower per-token costs, potentially making sophisticated AI reasoning accessible to smaller businesses and developers who previously couldn’t afford premium models.
OpenAI released GPT-5.4 mini and nano, cheaper variants of GPT-5.4 with the same reasoning modes. GPT-5.4 nano is the standout, scoring ahead of both Claude Haiku 4.5 and Gemini 3.1 Flash-Lite Preview with lower per token pricing @OpenAI released GPT-5.4 mini (xhigh, 48) and https://x.com/ArtificialAnlys/status/2037043552405119395
GPT-5.4 Pro outperforms competitors on complex reasoning tasks Users report that GPT-5.4 Pro uniquely handles difficult, context-heavy problems that other leading AI models like Codex cannot solve effectively. This suggests meaningful performance gaps still exist between top-tier AI systems, contradicting assumptions that all advanced models perform similarly. The distinction matters for businesses choosing AI tools for complex analytical work.
GPT-5.4 Pro continues to be the only model of its class. For anything really hard & complex, I throw it into the maw with every bit of context I can think of. More often than not, something very useful comes out. I can’t get the same results from Codex or Code or anything else. https://x.com/emollick/status/2036136822099628173
OpenAI offers private equity firms better deal terms to compete with Anthropic OpenAI is reportedly improving financial terms for private equity investors as it battles Anthropic for lucrative enterprise AI contracts. This signals intensifying competition in the corporate AI market, where both companies are fighting to secure long-term business customers who represent the most profitable segment of the AI industry.
OpenAI CEO calls for industry-wide cooperation on AI safety risks Sam Altman acknowledged that while AI could accelerate scientific breakthroughs like disease cures, the technology poses societal threats too large for any single company to handle alone. This represents a notable shift from competitive positioning to advocating for collective responsibility, suggesting growing recognition that AI’s risks require coordinated industry and regulatory responses rather than isolated corporate efforts.
AI will help discover new science, such as cures for diseases, which is perhaps the most important way to increase quality of life long-term. AI will also present new threats to society that we have to address. No company can sufficiently mitigate these on their own; we will https://x.com/sama/status/2036488680769241223
OpenAI shelves planned adult chatbot amid safety and investor concerns OpenAI has indefinitely canceled its nearly-ready “adult mode” erotic chatbot following internal staff resistance and investor pushback over child safety risks and potential psychological harm from AI emotional dependencies. The decision reflects a strategic pivot away from experimental features toward core AI development, with a major new model expected within two weeks.
OpenAI has indefinitely shelved its planned “adult mode” erotic chatbot amid pushback from staff and investors over risks to minors and concerns about encouraging unhealthy emotional attachments to AI. The decision is part of a broader refocusing away from “side quests” toward https://x.com/kimmonismus/status/2037130214522708303
The FT is reporting that OpenAI’s ‘adult mode’ Chat has also been dropped as a side-quest along with Sora, and has been shelved indefinitely. This was very close to release. They appear to be focusing everything on the new model, which will arrive in about two weeks. https://x.com/AndrewCurran_/status/2037145999094002104
OpenAI plans to double workforce to 8,000 employees by 2026 OpenAI will nearly double its staff from 4,500 to 8,000 employees by end of 2026, focusing on product development, engineering, and customer support roles. This aggressive hiring contrasts sharply with widespread tech layoffs and signals OpenAI’s response to losing market share to competitor Anthropic, which now captures 70% of new business AI purchases. The expansion coincides with major contract wins including a Department of Defense deal and potential private equity partnerships.
OpenAI restructures to become for-profit company with nonprofit oversight OpenAI announced it will transition from a nonprofit to a for-profit benefit corporation while maintaining a separate nonprofit board to oversee its mission. The restructuring aims to attract more investment capital for AI development while preserving some public interest guardrails, marking a significant shift for the company that created ChatGPT and has been valued at over $150 billion.
OpenAI shuts down Sora video generator amid strategic refocus OpenAI is discontinuing its Sora video creation tool, including the mobile app, API, and ChatGPT integration, as new leadership pushes to eliminate “side quests” and concentrate resources. The move signals a major strategic pivot away from consumer-facing creative tools toward more commercially viable AI applications. This marks one of the most significant product cancellations by a major AI company, affecting creators who had built communities around the viral video generation platform.
Breaking: OpenAI is canning Sora (mobile app, API and video capabilities in ChatGPT). It’s finished training its latest model, codenamed Spud, as CEO Sam Altman shifts his reports. w/ @amir https://x.com/steph_palazzolo/status/2036534198245134380
Fidji Simo just told OpenAI staff to cut the “side quests.” Sora — dead. The app, the API, the video model, all of it. Atlas browser — cut. Hardware — cut. This is expected when you let a thousand flowers bloom. The garden gets overrun. Think about how much OpenAI had going https://x.com/bilawalsidhu/status/2036616060066054201
My most popular Sora video was “an Elaborate regency romance where everyone is wearing a live duck for a hat (each duck is also wearing a hat), a llama plays a flute, prestige drama” I am not sure why OpenAI has decided their compute has more valuable uses. Really a mystery. https://x.com/emollick/status/2036609949577413085
We’re saying goodbye to Sora. To everyone who created with Sora, shared it, and built community around it: thank you. What you made with Sora mattered, and we know this news is disappointing. We’ll share more soon, including timelines for the app and API and details on preserving your work. – The Sora Team https://xcancel.com/soraofficialapp/status/2036532795984715896
OpenAI scales back ambitious data center plans ahead of IPO OpenAI has retreated from Sam Altman’s $1.4 trillion infrastructure spending commitments, now targeting a more modest $600 billion by 2030 as the company prepares for public markets. The shift reflects Wall Street’s skepticism about “reckless” spending relative to OpenAI’s $13.1 billion annual revenue, forcing the AI leader to emphasize fiscal discipline while competing with Google and Anthropic. Rather than building its own data centers, OpenAI now relies on partnerships with Oracle, Microsoft and Amazon to secure computing capacity.
AI solves 50 famous math problems but fails 98% of attempts While AI has cracked 50 challenging Erdős problems in mathematics, its overall success rate remains just 1-2% across broader problem sets, with labs selectively publishing only the victories. This highlights a key limitation: current AI excels at applying known mathematical techniques but struggles with the creative problem-selection and novel approach development that defines human mathematical insight. The selective reporting masks AI’s current boundaries in mathematical reasoning, even as tools like formal proof systems show promise for analyzing and understanding mathematical arguments.
AI has solved 50 Erdős problems in the last year. But on a wider sweep of problems, the models’ success rate is only about 1-2%: labs have just been publishing the wins. This isn’t because AI isn’t useful for mathematicians. Terence Tao thinks the models are currently at the https://x.com/dwarkesh_sp/status/2036095632746983436
If AI scientists are writing millions of papers, many of which are slop, and some of which are incremental progress, how would we identify the one or two which come up with an extremely productive new idea? In 1948, Shannon was one of hundreds of engineers at Bell Labs working https://x.com/dwarkesh_sp/status/2035083959499972976
If we’re going to have AIs that fully automate math, they not only need to solve existing problems. We also need to teach them how to recognize what problem to solve next. Human mathematicians have heuristic models that they use to decide what to work on, like, “There’s https://x.com/dwarkesh_sp/status/2035053765103911381
Terence Tao explains the beauty of Lean proofs. Even if they’re not very comprehensible on their own to humans, they can be analyzed more easily – each bit of the proof can be taken apart, analyzed, tweaked, and understood in terms of how it fits into the whole. https://x.com/dwarkesh_sp/status/2035733247112753511
Terence Tao thinks AI is already very good at using existing, well-understood math techniques to solve problems. An important question is how many open problems in math could be solved this way, without developing any new ideas. An extreme case of a proof like this is the https://x.com/dwarkesh_sp/status/2035808735600251024
The Origin of Species was published in 1859. Principia Mathematica was published in 1687, two centuries earlier. Conceptually, it seems like natural selection is much simpler than the theory of gravity. So why did it take two centuries longer to discover? A contemporary of https://x.com/dwarkesh_sp/status/2035370849625129392
The Terence Tao episode. We begin with the absolutely ingenious and surprising way in which Kepler discovered the laws of planetary motion. People sometimes say that AI will make especially fast progress at scientific discovery because of tight verification loops. But the https://x.com/dwarkesh_sp/status/2035031412953223428
When Copernicus proposed heliocentrism in 1543, it was actually less accurate than Ptolemy’s geocentric model – a system refined over 1,400 years with epicycles precisely tuned to match observed planetary positions. It took another 70 years before Kepler, working from Tycho https://x.com/dwarkesh_sp/status/2035114158241587221
North Carolina man pleads guilty to $10 million AI music fraud scheme Michael Smith used artificial intelligence to generate thousands of fake songs, then deployed automated bots to stream them billions of times, stealing over $10 million in royalties from legitimate artists between 2017 and 2024. This landmark case represents one of the first successful prosecutions of AI-related fraud in the music industry, highlighting how AI-generated content combined with bot manipulation can divert streaming revenue from real musicians to fraudsters. Smith faces up to five years in prison and must forfeit over $8 million, as the music industry grapples with an estimated 60,000 AI-generated tracks uploaded daily to streaming platforms.
Tech journalists openly embrace AI writing tools across newsroom functions Six prominent journalists revealed their AI workflows in major publications, marking a shift from career-ending taboo to accepted practice within a year. Each has rebuilt traditional newsroom roles—wire desk, copy editing, fact-checking—using different AI boundaries, from full drafting assistance to editing-only support. This represents the broader challenge facing all content creators: deciding which tasks to automate, collaborate on, or keep entirely human.
Reddit CEO mandates AI bots disclose their identity to users Reddit will now require AI bots to identify themselves when posting or commenting, marking the first major platform policy specifically targeting undisclosed AI participation in online discussions. This move addresses growing concerns about AI-generated content flooding social media without users’ knowledge, potentially setting a precedent for how other platforms handle AI transparency as automated posting becomes more sophisticated.
Wikipedia bans AI-generated content across its platform globally Wikipedia officially prohibited using large language models like ChatGPT to write or rewrite articles, citing violations of core content policies including accuracy and source verification. The policy allows only limited exceptions for basic copyediting and translation assistance, both requiring human oversight. This represents one of the most significant restrictions on AI content creation by a major information platform, affecting millions of articles and editors worldwide.
Chroma releases 20B parameter search agent that outperforms frontier models Chroma Context-1 demonstrates that smaller, specialized AI models can match or exceed much larger general-purpose models at specific tasks like multi-step document search, running 10x faster and at a fraction of the cost. The model uses a novel “self-editing” approach where it actively discards irrelevant information during multi-turn searches to maintain focus and efficiency. This challenges the assumption that bigger models are always better, suggesting the future may favor purpose-built AI agents over massive general-purpose systems.
the bitter lesson is coming for search we’re open-sourcing Context-1 – a model that is better, faster, and cheaper than any frontier model at searching we published a 40-page technical report on our website with the ins and outs of how we did it. this is just step 1 https://x.com/jeffreyhuber/status/2037247377275576380
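The “self-editing” idea described above (actively discarding irrelevant retrieved material between search turns so the context stays small and focused) can be illustrated with a toy pruning step. This is not Chroma’s method; the keyword-overlap scoring in `self_edit_context` is a deliberately simple stand-in:

```python
def self_edit_context(context, query_terms, keep=3):
    """Toy 'self-editing' step: after a search turn, score the retained
    snippets against the query and drop everything below the cut."""
    scored = sorted(
        context,
        key=lambda snippet: sum(term in snippet.lower() for term in query_terms),
        reverse=True,
    )
    return scored[:keep]

context = [
    "turboquant compresses kv caches",
    "weather report for tuesday",
    "quantization of llm weights",
    "recipe for banana bread",
    "low-bit kv cache compression speeds decoding",
]
pruned = self_edit_context(context, ["kv", "cache", "quant"], keep=3)
print(pruned)
```

A small specialized model only has to reason over the pruned set at each turn, which is one way a 20B-parameter agent could stay fast over long multi-step searches.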
Amazon acquires humanoid robot startup Fauna Robotics for warehouse automation Amazon purchased New York-based Fauna Robotics and its “Sprout” humanoid robot designed for safe human interaction. This marks Amazon’s first major acquisition in humanoid robotics, signaling the company’s push beyond traditional warehouse automation into more versatile robotic workers that could eventually handle complex tasks alongside human employees.
Amazon makes a big move in the humanoid game. Amazon has acquired Fauna Robotics, a New York-based humanoid robot startup. The transaction closed last week. Fauna Robotics developed Sprout, a compact and approachable humanoid robot designed for safe, everyday interaction in https://x.com/TheHumanoidHub/status/2036559641619177960
Figure’s humanoid robot matches human warehouse workers at package sorting Figure 03 achieved human-level speed in package sorting at 3 seconds per package sustained over full shifts, marking the first time a humanoid robot has matched human productivity in a complex warehouse task. This breakthrough suggests robots may soon handle physically demanding jobs at scale, potentially transforming logistics and manufacturing workforces.
Brett Adcock says Figure has reached human speed parity in package sorting, matching the 3-second/package average sustained by human workers throughout a full shift. https://x.com/TheHumanoidHub/status/2036538399172206751
True AGI is when the robot finally has enough of the annoying testing and just walks off the job. Marc Benioff shared a new video of Figure 03 autonomously sorting deformable packages and placing them labels-down for the scanner down the line. https://x.com/TheHumanoidHub/status/2036275723837874685
NVIDIA partners with AGIBOT to train robots using Genie-1 hardware NVIDIA’s latest robot foundation model was trained specifically on AGIBOT’s Genie-1 robot platform, marking a shift toward hardware-specific AI training rather than generic approaches. This partnership suggests the industry is moving beyond one-size-fits-all robot AI toward models optimized for particular robot designs. The collaboration was highlighted at NVIDIA’s major developer conference, indicating both companies see this as a significant strategic direction.
AGIBOT has emerged as a premier ecosystem partner for NVIDIA, showcased at GTC 2026. – GR00T N2 Foundation Model: NVIDIA’s next-gen VLA is pre-trained on the AGIBOT Genie-1 embodiment. – DreamZero World Action Model (WAM): Genie-1 was selected as the official hardware https://x.com/TheHumanoidHub/status/2036064872719679883
Chinese robotics firm Unitree files for $610 million IPO Hangzhou-based Unitree submitted its application to list on Shanghai’s STAR Market, seeking funds to expand its AI-powered robotics capabilities and marking another major Chinese tech company’s push to capitalize on the AI boom through public markets.
Unitree files for IPO to raise $610M, plans to bet big on AI capabilities Today, the Shanghai Stock Exchange accepted Unitree’s IPO application for Shanghai’s STAR Market exchange. The company is targeting a raise of 4.202 billion yuan (~$610 million). Proceeds will primarily https://x.com/TheHumanoidHub/status/2035078373924643218
AI helps major Japanese newspaper expose million-post disinformation campaigns The Yomiuri Shimbun used multiple AI language models to analyze over one million social media posts and identify state-sponsored information operations, demonstrating how AI can overcome the limitations of traditional keyword searches that miss sophisticated propaganda efforts. This represents a significant advancement in using AI for investigative journalism and countering coordinated inauthentic behavior at scale.
We recently worked with The Yomiuri Shimbun to analyze more than a million social media posts to map out state-sponsored information campaigns. https://t.co/rlYs43ywrE Keyword searches are fragile for modern OSINT. To fix this, our team used an ensemble of different LLMs https://x.com/hardmaru/status/2035884310356754715
AI system produces first peer-reviewed research paper without human help Sakana AI’s “AI Scientist” became the first system to autonomously generate a complete research paper that passed rigorous human peer review, scoring 6.33/10 and outperforming 55% of human-authored papers at a major AI conference. The breakthrough, now published in Nature, demonstrates that AI can execute the entire research lifecycle—from generating ideas and conducting experiments to writing papers—with quality improving predictably as underlying AI models advance. This represents a fundamental shift toward automated scientific discovery, though researchers emphasize the need for ethical guidelines as AI-generated research becomes indistinguishable from human work.
I’m incredibly proud of The AI Scientist team for this milestone publication in @Nature. We started this project to explore if foundation models could execute the entire research lifecycle. Seeing this work validated at this level is a special moment. I truly believe AI will https://x.com/hardmaru/status/2036841736702767135
One of the most exciting findings in our @Nature paper is the discovery of a clear scaling law of AI science. By using our Automated Reviewer to grade papers generated by different foundation models, we observed that as the underlying models improve, the quality of the generated https://x.com/SakanaAILabs/status/2036999652298678630
The AI Scientist V1 was completed months before o1-preview and reasoning models were released. The models have clearly gotten much more capable since then. Very excited for where things are headed for AI and automated research! https://x.com/_chris_lu_/status/2037090588550418510
The AI Scientist: Towards Fully Automated AI Research, Now Published in Nature!!✨ Today in Nature we share a comprehensive technical summary of our work on The AI Scientist, including new scaling law results showing how it improves with more compute and more intelligent https://x.com/jeffclune/status/2036866082418680297
Popular AI library LiteLLM hit by supply chain attack stealing developer credentials Hackers compromised versions 1.82.7 and 1.82.8 of LiteLLM on PyPI, automatically stealing SSH keys, cloud credentials, API keys, and other sensitive data from developers who installed the update. The attack originated through a compromised security scanner in the project’s CI/CD pipeline, demonstrating how AI development tools have become high-value targets for credential theft affecting thousands of developers.
LiteLLM HAS BEEN COMPROMISED, DO NOT UPDATE. We just discovered that LiteLLM PyPI release 1.82.8 has been compromised: it contains litellm_init.pth with base64-encoded instructions to send all the credentials it can find to a remote server and self-replicate. link below https://x.com/hnykda/status/2036414330267193815
Software horror: litellm PyPI supply chain attack. Simple `pip install litellm` was enough to exfiltrate SSH keys, AWS/GCP/Azure creds, Kubernetes configs, git credentials, env vars (all your API keys), shell history, crypto wallets, SSL private keys, CI/CD secrets, database https://x.com/karpathy/status/2036487306585268612
Thankfully the LiteLLM package has now been marked as “quarantined” on PyPI so attempting to install the compromised update via pip et al shouldn’t work https://x.com/simonw/status/2036451896970584167
This is pure nightmare fuel. Identity theft of the past would be nothing compared to what vibe agents can do. Sending credentials is too obvious and for rookies. They could easily spread contaminations across ~/.claude, **/skills/*, or even just a PDF your agent visits https://x.com/DrJimFan/status/2036494601750716711
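One practical defense against this class of attack is to refuse known-bad releases before upgrading. Below is a minimal sketch of that pattern, assuming a hand-maintained block-list; the version numbers come from the incident reports above, but the check itself is generic, not an official tool.

```python
# Sketch: flag known-compromised package releases before installing.
# The litellm versions listed are the ones reported compromised in
# this incident; maintain and extend this set from advisories.
COMPROMISED = {"litellm": {"1.82.7", "1.82.8"}}

def is_safe(package: str, version: str) -> bool:
    """Return False if (package, version) matches a known-bad release."""
    return version not in COMPROMISED.get(package, set())

print(is_safe("litellm", "1.82.8"))  # False: known-compromised release
print(is_safe("litellm", "1.82.6"))  # True: predates the compromise
```

More robustly, pinning exact versions and hashes in a lockfile (e.g. `pip install --require-hashes -r requirements.txt`) prevents a tampered release from being picked up silently in the first place.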
Google DeepMind creates tool that edits video actions without retraining models DynaEdit allows complex video modifications like changing object interactions and inserting new elements that affect scene dynamics, using existing text-to-video models without additional training. The breakthrough solves major technical hurdles that caused visual misalignment and frame jitter in previous methods, enabling edits like making a train hit a paint bucket or having an astronaut pick up a flag mid-walk.
“Versatile Editing of Video Content, Actions, and Dynamics without Training” TL;DR: Enables temporally consistent editing of dynamic scenes while preserving motion and avoiding frame-to-frame artifacts. https://x.com/Almorgand/status/2035058325830701509
World models could unlock superhuman AI by learning physics from video games Unlike language models that predict text, world models learn to simulate entire environments by watching action-labeled video clips, potentially solving robotics’ biggest bottleneck. Companies like World Labs and General Intuition have raised over $2 billion combined to develop these “lucid dream” AIs that can predict how complex scenes will unfold based on actions taken within them. The key breakthrough is using actions as compression—allowing models to simulate computationally impossible scenarios like entire stadiums of people at fixed computational cost.
If you’ve been curious about world models, read this. Got an early preview of the blog and it does a thorough job of unpacking the ill-tailored tapestry of world model initiatives. https://x.com/bilawalsidhu/status/2034679032642416664
Terafab promises to build massive space-based manufacturing facilities using AI The startup plans to use artificial intelligence to construct kilometer-scale factories in orbit, claiming this could revolutionize manufacturing by leveraging zero gravity and unlimited solar power. While the concept remains largely theoretical, Terafab represents a growing trend of AI companies targeting space-based applications beyond Earth’s resource constraints.