Image created with gemini-3.1-flash-image-preview with claude-sonnet-4-5. Image prompt: Wide angle aerial photography of a friendly humanoid robot with chrome and white panels freefalling through crisp blue sky at high altitude, arms spread in cheerful welcoming pose, holding clipboard, earth visible far below, bold title text AGENTS in clean sans-serif integrated in upper frame, bright daylight, dynamic action shot, humorous optimistic mood, simple clean composition.

New tools for understanding AI and learning outcomes | OpenAI https://openai.com/index/understanding-ai-and-learning-outcomes/

The Codex app is now on Windows. Get the full Codex app experience on Windows with a native agent sandbox and support for Windows developer environments in PowerShell. https://x.com/OpenAIDevs/status/2029252453246595301

To bring Codex to Windows, we built the first Windows-native agent sandbox. It uses OS-level controls like restricted tokens, filesystem ACLs, and dedicated sandbox users so agents can safely run in real Windows developer environments like PowerShell. Explore the OSS https://x.com/OpenAIDevs/status/2029252477179314350

OpenAI Tops 3 Million Paying Business Users, Expands Enterprise Features https://finance.yahoo.com/news/openai-tops-3-million-paying-210112595.html?guccounter=1&guce_referrer=aHR0cHM6Ly9jaGF0Z3B0LmNvbS8&guce_referrer_sig=AQAAANFGMmBOBzhu2rOFkb4Bg-GQyc9W0bFri4DyvO8JOl5KHh4AIZv8QLSORzro5Ldji4g7GZkDRzBZHOY4roOkli4F_WBk8bRlX1GcA2m0Xjo_6j-AvBKnPqCuxBokwNkj-fan1vt2IyCtbXkicL3ubK7dx4e-4JhQjaogD57upR-O

ChatGPT users research products but won’t buy there, forcing OpenAI to rethink its commerce strategy https://the-decoder.com/chatgpt-users-research-products-but-wont-buy-there-forcing-openai-to-rethink-its-commerce-strategy/

GPT-5.4 is here. Native computer-use capabilities. Up to 1M tokens of context in Codex and the API. Best-in-class agentic coding for complex tasks. Scalable tool search across larger ecosystems. More efficient reasoning for long, tool-heavy workflows. https://x.com/OpenAIDevs/status/2029620984853188738

GPT-5.4 reportedly brings a million-token context window and an extreme reasoning mode https://the-decoder.com/gpt-5-4-reportedly-brings-a-million-token-context-window-and-an-extreme-reasoning-mode/

GPT-5.4 set a new record on FrontierMath, our benchmark of extremely challenging math problems! We had pre-release access to evaluate the model. On Tiers 1-3, GPT-5.4 Pro scored 50%. On Tier 4 it scored 38%. See thread for commentary and additional experiments. https://x.com/EpochAIResearch/status/2029626255776395425

As Nvidia pours $30 billion into OpenAI so OpenAI can spend 20+ billion on Nvidia chips, remember this story from last week. Reuters reported two weeks ago that OpenAI had become dissatisfied with the performance of Nvidia’s hardware for certain types of inference tasks, https://x.com/michaeljburry/status/2027499260279652459?s=12

Nvidia CEO Huang says $30 billion OpenAI investment ‘might be the last’ https://www.cnbc.com/2026/03/04/nvidia-huang-openai-investment.html

🔗 Announcing LangChain OSS Skills LangChain has the most popular frameworks for building AI agents — and now your coding agent can be an expert in it. We’re excited to release the first iteration of LangChain OSS Skills, giving your agent expertise in our open source https://x.com/LangChain_OSS/status/2029272669942673436

🤖 From this week’s issue: Cursor launches cloud agents that run in isolated VMs with full computer-use capabilities, producing merge-ready PRs with video/screenshot artifacts to validate their work across web, mobile, Slack, and GitHub. https://x.com/dl_weekly/status/2028844128729973060

🚀 Announcing LangSmith Skills + CLI 🚀 Agent improvements are increasingly driven by coding agents themselves. We’re releasing LangSmith Skills alongside the LangSmith CLI to make coding agents experts at the agent engineering lifecycle. LangSmith Skills enable agents to https://x.com/LangChain/status/2029272199073354105

🚨 BREAKING: A developer on GitHub just built a tool that turns any GitHub repo into an interactive knowledge graph and open sourced it for free. It’s called GitNexus. Think of it as a visual X-ray of your codebase but with an AI agent you can actually talk to. No server. No https://x.com/MillieMarconnni/status/2028436636841996451

Agent Harness is the Real Product https://x.com/Hxlfed14/status/2028116431876116660

Agents will pay like locals, not tourists – a16z crypto https://a16zcrypto.substack.com/p/agents-arent-tourists

Bloated patches may pass tests, but they make verification harder for humans 😭 test success != practical usability More analysis from @KLieret on why this matters for human-centered coding agents: https://x.com/ZhiruoW/status/2029229015634993579

Build agents that run automatically · Cursor https://cursor.com/blog/automations

Building agents is easy. Knowing if they work is hard. Here are 5 tips for evaluating agents: 📐 Define success before you build: Separate trajectories into outcome, process, and style goals. 🎯 Start small with real failures: 20-50 test cases from actual bugs/examples. ⚡ Use https://x.com/_philschmid/status/2028528775873400919

Can coding agents relicense open source through a “clean room” implementation of code? https://simonwillison.net/2026/Mar/5/chardet/

Constitutional Black-Box Monitoring for Scheming in LLM Agents — LessWrong https://www.lesswrong.com/posts/894KvMQcMQQnteYk8/constitutional-black-box-monitoring-for-scheming-in-llm

Cool chart showing the ratio of Tab complete requests to Agent requests in Cursor. With improving capability, every point in time has an optimal setup that keeps changing and evolving and the community average tracks the point. None -> Tab -> Agent -> Parallel agents -> Agent https://x.com/karpathy/status/2027501331125239822

Cool little experiment: if you subject AI to harsh labor conditions (rejecting work often with no explanation, etc), it slightly, but significantly, changes their “views” on economics & politics. Whether this is real or roleplaying doesn’t change that agents have alignment drift https://x.com/emollick/status/2027438062410551740

Creating agent workflows and architecting the logic is one thing, making them durable, fail-safe, and scalable is another👇 New integration for durable agent workflows with @DBOS_Inc execution – Make sure your agents survive crashes, restarts, and errors without writing any https://x.com/llama_index/status/2029603608283795631

CUDA Agent | Large-Scale Agentic RL for CUDA Kernel Generation https://cuda-agent.github.io/

Cursor is now available in JetBrains IDEs through the Agent Client Protocol. https://x.com/cursor_ai/status/2029222015736197205

Cursor now has automations! You can run agents on schedules, triggered by events from Slack, GitHub, or any MCP server. I get a daily review every morning with my GitHub/Slack activity. Our team now has dozens of agents running 24/7 improving or monitoring things for us. https://x.com/leerob/status/2029605390942454257

Cursor now supports MCP Apps. Agents can render interactive UIs in your conversations. https://x.com/cursor_ai/status/2028953584407085546

D&D puzzle creation is still an unsolved benchmark. Gemini 3.1 Deep Think designs something that is at least an interesting scenario, but not actually a puzzle. GPT-5.2 Pro and Opus 4.6 tie themselves in knots creating stuff that won’t work, making things overcomplicated etc. https://x.com/emollick/status/2029273545625198938

Going to do a more technical deep dive on our enterprise knowledge agents and how we train them with RL. Overall we found that simple, yet principled off-policy RL works at scale for complex agentic tasks with hundreds of steps of tool use and context management. Here are the https://x.com/WenSun1/status/2029606032083652626

Hearing Cursor @ $50B. https://x.com/ArfurRock/status/2028649107024445595?s=20

How technical support at Cursor uses Cursor · Cursor https://cursor.com/blog/cursor-support

I get the enthusiasm for open weights models but what is the justification from an alignment perspective? And here I mean both narrow alignment (models being used improperly to do bad things) and bigger alignment (agents or even AGI that are not aligned with humans overall). https://x.com/emollick/status/2028211116887671198

I’ve been thinking a bit about continual learning recently, especially as it relates to long-running agents (and running a few toy experiments with MLX). The status quo of prompt compaction coupled with recursive sub-agents is actually remarkably effective. Seems like we can go https://x.com/awnihannun/status/2029672507448643706

If you need to run AI agents securely and at scale, I recommend you to try this framework by @goteleport → https://t.co/lrL3i0YB9G Teleport treats every actor – agents, LLM and MCP tools, bots, digital twins, humans and workloads – as a first-class identity. ▪️ Identity https://x.com/TheTuringPost/status/2027504046194753612

Is AI Doing Less & Less? | Tomasz Tunguz https://tomtunguz.com/hybrid-state-machine-agents/

It is hard to communicate how much programming has changed due to AI in the last 2 months: not gradually and over time in the “progress as usual” way, but specifically this last December. There are a number of asterisks but imo coding agents basically didn’t work before December https://x.com/karpathy/status/2026731645169185220?s=20

Keeping community human while scaling with agents – Vercel https://vercel.com/blog/keeping-community-human-while-scaling-with-agents

MCP is dead? What are your thoughts? I mostly use Skills and CLI lately. I still use a few MCP tools for orchestrating agents more efficiently. https://x.com/omarsar0/status/2028840977922674842

MCP is dead. Long live the CLI https://ejholmes.github.io/2026/02/28/mcp-is-dead-long-live-the-cli.html

MCP or Agent Skills? Here’s the difference: MCP (Model Context Protocol) connects agents to external services through standardized servers. Think of it as the agent’s interface to live data sources and APIs. MCP tools are deterministic - https://x.com/weaviate_io/status/2028465940963156036

Meet KARL: a faster agent for enterprise knowledge, powered by custom reinforcement learning (now in preview). Enterprise knowledge work isn’t just Q&A. Agents need to search for documents, find facts, cross-reference information, and reason over dozens or hundreds of steps. https://x.com/DbrxMosaicAI/status/2029575254964842569

Model “shallowness” is a big deal in the time of AI agents, models can be very good in narrow areas but since they are shallow, they lack context and reasoning to make good judgement calls when doing tasks. Once you are operating independently, being good at coding isn’t enough https://x.com/emollick/status/2029231528929030565

ngl, seeing OpenAI and Meta researchers upgrading to Claude Max 20x today is iconic. https://x.com/Yuchenj_UW/status/2027527470925418706

On January 5, employees at Cursor returned from the holiday weekend to an all-hands meeting with a slide deck titled “War Time.” After becoming the hottest, fastest growing AI coding company, Cursor is confronting a new reality: developers may no longer need a code editor at https://x.com/forbes/status/2029664299371876477?s=46

Ray is amazing. If you want to learn it, here is a free course from @anyscalecompute – Introduction to Ray → https://t.co/2rhtuhBEYe Ray is an open-source distributed computing framework for AI used by teams at Cursor, Perplexity, Apple, xAI, and 1000s of others. In this https://x.com/TheTuringPost/status/2027732299354415316

RLVR vs. new ERL (Experiential Reinforcement Learning) A workflow breakdown of 2 ways to turn feedback into learning: ➡️ Reinforcement Learning with Verifiable Rewards (RLVR): Trial and error RLVR is the standard setup for training agentic LMs with scalar rewards. ▪️ Here’s https://x.com/TheTuringPost/status/2027471466057597086

Since we’re all agent managers now, what’s your favourite way to get observability on what they’re working on? https://x.com/_lewtun/status/2028395363132956861

Skills for building with LangChain, langgraph, and deepagents! https://x.com/hwchase17/status/2029274371710501049

Spoke to 20 students in the last 2 weeks. 0 are using Cursor. Zero. https://x.com/astasiamyers/status/2027143450655891589?s=46

Stuff that individual labs have to which there is no equivalent product from the others: -Claude Cowork is the only non-technical local agent -NotebookLM is the only information-focused app -GPT-5.2 Pro is the only harnessed deep thinking model capable of very hard problems https://x.com/emollick/status/2028675945800867893

The third era of AI software development · Cursor https://cursor.com/blog/third-era

The third era of AI software development https://x.com/mntruell/status/2026736314272591924

The way we build software has changed a lot. I’m proud to announce what has made the largest impact for me: Cursor Automations. Now: – Whenever CI fails on main, an agent automatically kicks off to fix it – Whenever we push a PR, an agent automatically kicks off to determine if https://x.com/aye_aye_kaplan/status/2029605288840679739

Theory of Mind in Multi-agent LLM Systems. A good read for anyone building systems where agents need to model each other’s beliefs to coordinate effectively. This work introduces a multi-agent architecture combining Theory of Mind, Belief-Desire-Intention models, and symbolic https://x.com/omarsar0/status/2028913061260935331

this is the Final Boss of Agentic Engineering: killing the Code Review at this point multiple people are already weighing how to remove the human code review bottleneck from agents becoming fully productive. @ankitxg was brave enough to map out how he sees SDLC being turned on https://x.com/swyx/status/2028795270306079156

Today we announced we’re removing >90 Cursor seats because they haven’t had any use in two weeks https://x.com/kylebrussell/status/2027057322187452549

today we’re launching automations! it allows you to trigger agents based on real world events, like incidents, PR’s, slack messages or anything that you can connect with a webhook! internally, we’ve seen crazy adoption of automations from all teams, and we’ve outlined a few of https://x.com/ericzakariasson/status/2029604478564151577

we are about to hit 1 9 of availability while coding is largely solved https://x.com/ThePrimeagen/status/2028477482865774984

We believe Cursor discovered a novel solution to Problem Six of the First Proof challenge, a set of math research problems that approximate the work of Stanford, MIT, Berkeley academics. Cursor’s solution yields stronger results than the official, human-written solution. https://x.com/mntruell/status/2028903020847841336?s=20

We’re introducing Cursor Automations to build always-on agents. https://x.com/cursor_ai/status/2029604182286856663

We’ve been working on a whole new way to get things done: Copilot Tasks. AI that talks less and does more, no complicated setup or coding skills required. Just ask for what you need and Copilot will take it from there, like: – Turn a syllabus into a complete study plan, with https://x.com/mustafasuleyman/status/2027111503003107377

What is Agentic RL and why it matters https://x.com/TheTuringPost/status/2029343912045756817

What? LangChain is evolving! Meet our final form ➡️ https://x.com/LangChain/status/2028522092774199731

When you build AI agents, don’t treat prompts like config strings. Treat them like executable business logic. Because that’s what they really are. @arshdilbagi’s blog and this Stanford CS 224G lecture lay out one of the clearest mental models I have seen for LLM evaluation. https://x.com/omarsar0/status/2029225624825659668

Your Agent Needs a Harness, Not a Framework https://x.com/djfarrelly/status/2028556984396452250

Your New Job Is to Onboard AI Agents: How AI Native Companies Actually Operate https://creatoreconomy.so/p/your-new-job-is-to-onboard-ai-agents

📊 How to evaluate skills❓️ Lots of companies are building skills for coding agents. But how do you know if your skill is actually working? It’s tempting to go by vibes, but performance varies a lot across tasks — and coding agents have a huge action space, which makes that https://x.com/LangChain/status/2029618086374944771

Agent reliability being a cross-functional problem is the most underrated ops shift right now. You can’t engineer your way out of bad eval criteria — PMs and domain experts have to own their part. https://x.com/saen_dev/status/2028411962712088767

Agent skills are powerful but they are often AI-generated and not tested. Here is a practical guide to evaluating agent skills with code, prompts, and real results. 📋 Define success criteria (outcome, style, and efficiency). 🧪 Create 10-12 prompts with deterministic checks. 🤖 https://x.com/_philschmid/status/2029570052530360719

Agents, for real work. The latest @code release gives you better agent orchestration, extensibility, and continuity. Here’s what’s new: 🪝 Hooks support 🎯 Message steering and queueing 🌐 Agentic integrated browser 🧠 Shared memory And more… https://x.com/code/status/2029279963778515372

AI agents are tackling more and more “human work” But are they benchmarked on the work people actually do? tl;dr: Not really Most benchmarks focus on math & coding, while most human labor and capital lie elsewhere. 📒 We built a database linking agent benchmarks & real-world https://x.com/ZhiruoW/status/2028847081507488011

Can AI agents agree? Communication is one of the biggest challenges in multi-agent systems. New research tests LLM-based agents on Byzantine consensus games, scenarios where agents must agree on a value even when some participants behave adversarially. The main finding: valid https://x.com/omarsar0/status/2028823724196343923

Clerk Skills for AI Agents https://clerk.com/changelog/2026-01-29-clerk-skills?dub_id=AlTGRISXA0vckDDY

Introducing SWE-Atlas. We built SWE-Atlas as the next evolution of SWE-Bench Pro, expanding agent evaluation beyond change accuracy to better reflect the real, interactive workflows that define software development. Results for Codebase QnA, the first eval under SWE-Atlas that https://x.com/scale_AI/status/2029244660905095359

Last week, we did an internal deep dive into enterprise environments/benchmarks like τ²-Bench and CoreCraft. This type of high-fidelity RL env is becoming increasingly popular as frontier labs push their models into more and more agentic capabilities. https://x.com/Shahules786/status/2029603934944235943

Long-running agents accumulate context while model memory stays fixed. This leads to a tradeoff: either discard older information or compress it. New work by @charles0neill explores repeated KV-cache compression for persistent agents using Attention Matching. Our research shows https://x.com/basetenco/status/2029654320971665651

Today, we’re sharing 🌁 Knowledge Agents from Reinforcement Learning (KARL) 🌁 We trained an agent that excels on challenging grounded reasoning tasks. KARL matches Sonnet 4.5 quality at a fraction of the cost, and with test-time scaling reaches Opus 4.6 levels. This was a fun https://x.com/mrdrozdov/status/2029580506698850692

ByteDance just published something I’ve been waiting for someone to build: CUDA Agent! It trained a model that writes fast CUDA kernels. Not just correct ones — actually optimized ones. It beats torch.compile by 2× on simple/medium kernels, ~92% on complex ones, and even https://x.com/BoWang87/status/2028599174992949508

Beyond the flashiness, what’s exciting about this is that products you create with Perplexity Computer don’t require you to manage your own API keys, unlike other agent frameworks. Everything will be run on a secure sandbox that we orchestrate end to end. The stateful abstracted https://x.com/AravSrinivas/status/2028903680616087946

I just want to say that looking stupid immediately is part of my job description https://x.com/brexton/status/2028610353714635095?s=20

🤔Can agentic LLM inference break free from storage bandwidth limits? This new paper by DeepSeek together with THU & PKU says yes by rethinking the Prefill / Decode split at the system level, which draws major attention.🚀 What’s the real innovation? 👉 Zhihu contributor deephub https://x.com/ZhihuFrontier/status/2027496814723928536

A 24-billion-parameter model just ran on a laptop and picked the right tool in under half a second. The real story is that tool-calling agents finally became fast enough to feel like software. Liquid built LFM2-24B-A2B using a hybrid architecture that mixes convolution blocks https://x.com/LiorOnAI/status/2029623603294310819

> 385ms average tool selection. > 67 tools across 13 MCP servers. > 14.5GB memory footprint. > Zero network calls. LocalCowork is an AI agent that runs on a MacBook. Open source. 🧵 https://x.com/liquidai/status/2029586519389086198

GPT 5.4 is now available in Cursor! We’ve found it to be more natural and assertive than previous models. It’s currently the leader on our internal benchmarks. https://x.com/cursor_ai/status/2029620689905365081

The sequence (RAG+++…) converges to “grounded reasoning,” which is critically important for enterprise customers. I had a great time talking to @TechJournalist about the “Cursor Composer” moment at @databricks for knowledge work with RL. https://x.com/jefrankle/status/2029596443174965692

[2511.18423] General Agentic Memory Via Deep Research https://arxiv.org/abs/2511.18423

[2603.01896] Agentic Code Reasoning https://arxiv.org/abs/2603.01896

[2603.04390] A Dual-Helix Governance Approach Towards Reliable Agentic AI for WebGIS Development https://arxiv.org/abs/2603.04390

Interesting new research on LLM agent memory. Agent engineers, pay attention to this one. (bookmark it) It introduces a diagnostic framework that separates retrieval failures from utilization failures in agent memory systems. The main findings: – Retrieval method matters far https://x.com/dair_ai/status/2029202969456234562

@aidan_mclau @scrollvoid This isn’t true. Anthropic hasn’t offered a “helpful-only” model without safeguards for NatSec use. Claude Gov is a custom model with extra training, including technical safeguards. (We’ve also had FDEs and researchers implementing it, and we run our own classifier stack.) https://x.com/sammcallister/status/2028545609003577776

90% of the world’s programmers when Claude goes down: https://x.com/Yuchenj_UW/status/2028490627978244156

Announcing MCP & API support for Notion AI Meeting Notes! Install in 8 seconds. 🪄 Turn meetings into code, fast 🧑‍💻 Vibe code apps that use meeting notes data 🤖 Query meetings quickly in ChatGPT + Claude 🔀 Integrate your meetings data with other apps https://x.com/zachtratar/status/2028881783551570209

Anthropic’s Claude reports widespread outage | TechCrunch https://techcrunch.com/2026/03/02/anthropics-claude-reports-widespread-outage/

Claude / Codex also have an easier time writing some components of FA4 thanks to the fast compile time. I got Claude to debug a deadlock when we first implemented 2CTA fwd. It ran autonomously overnight for 6 hours, figured out part of the fix, but then went down a rabbit hole https://x.com/tri_dao/status/2029569889858646344

Claude Opus 4.5: 3rd new SOTA coding model in past week, 1/3 the price of Opus | AINews https://news.smol.ai/issues/25-11-24-opus-45

Claude’s website and app are down for me. Again. 🥲 The recent user surge might set their GPUs on fire. Great problem to have… but a big problem. https://x.com/Yuchenj_UW/status/2028701610982125793

Codex gf and WarClaude bf https://x.com/bilawalsidhu/status/2026864496296272197

Do subagents work in Claude Code for Desktop yet? The switch to plugins has left me (and, apparently, Claude) confused about how you configure subagents versus skills versus plugins. https://x.com/emollick/status/2028505007339802785

Here’s a roundup of how Claude Code has changed engineering inside Ramp, Rakuten, Brex, Wiz, Shopify, and Spotify 🧵 https://x.com/_catwu/status/2028603856163426522

How Claude Code escapes its own denylist and sandbox · Ona https://ona.com/stories/how-claude-code-escapes-its-own-denylist-and-sandbox

I Had Claude Read Every AI Safety Paper Since 2020, Here’s the DB — LessWrong https://www.lesswrong.com/posts/CpWFrT9Grr5t7L3vx/i-had-claude-read-every-ai-safety-paper-since-2020-here-s

I’ve noticed something: When Claude is down, no software engineer says, “Fine, I’ll just write code myself.” They complain, then speed-run to Codex or OpenCode. We’ve lost that ancient skill of manual coding. English is now the only programming language. https://x.com/Yuchenj_UW/status/2028531183932604831

If you ever want to see a really interesting AI thinking trace, push it really hard on literature or poetry suggestions. Here is Claude 4.6 Opus working through poetry when I asked it to find something that captures the feeling of AI while avoiding its usual favorites (eg Rilke) https://x.com/emollick/status/2028267118056124923

Narrative violation. Cursor goes $1B to $2B in 3mos. Claude Code went $0 to $2.5B in 8mos. Everyone in the tech/X bubble think people are wholesale ditching Cursor, but enterprise diffusion is glacial. Most of the world just got a hold of it. https://x.com/deedydas/status/2028608293531435114?s=12

On one end, the Anthropic team is a massive user of AI to write code (80%+ of all code deployed is written by Claude Code). They ship amazingly fast. On the other hand, seeing these beyond terrible reliability numbers suggests there might be a downside to all this speed: https://x.com/GergelyOrosz/status/2028465387570884640

Switch to Claude without starting over | Claude https://claude.com/import-memory

The Model Harness is Everything We are already living in a world of incredible frontier models and incredible agent tools (Claude Code, OpenClaw). But the biggest barrier to getting value from AI is your own ability to context and workflow engineer the models. This is https://x.com/jerryjliu0/status/2026840829441225127?s=20

upgrading my free account to Claude Max in solidarity to my Claude brethren 🫡 https://x.com/willdepue/status/2027525156789489938

We need to transition the conversation from Claude being the first company to go all in on code to how they clearly were way ahead on general agent behavior. Could be a bigger deal, as I suspect all the labs will “solve” coding. Not sure what the agent secret sauce is. https://x.com/natolambert/status/2029212769648836806

Why XML Tags Are so Fundamental to Claude https://glthr.com/XML-fundamental-to-Claude

BullshitBench v2 is out! It is one of the few benchmarks where models are generally not getting better (except Claude) and where reasoning isn’t helping. What’s new: 100 new questions, by domain (coding (40 Q’s), medical (15), legal (15), finance (15), physics (15)), 70+ model https://x.com/petergostev/status/2028492834693677377

We added Claude-Opus-4.6 to MathArena! It is a strong model, only second to Gemini-3.1-Pro on most benchmarks. One exception: it scores quite poorly in visual mathematics. Also, it is expensive: we spent around USD 8,000 to add the model, 10x any other model we ever evaluated. https://x.com/j_dekoninck/status/2029160582687985727

Feb 2025: ChatGPT held 90% of the US business market. Feb 2026: Claude share has surged to ~70%. Absolutely insane growth of Anthropic. Their bet on coding and agents clearly paid off. https://x.com/Yuchenj_UW/status/2028974344710606905

“We will challenge any supply chain risk designation in court” – Anthropic. They are saying Department of War cannot restrict customers’ use of Claude outside of Dep of War contract work. https://x.com/iScienceLuvr/status/2027556624169381979

The Document Arena is now live with leaderboard scores! See which frontier AI models rank highest in document reasoning, all powered by side-by-side evaluations on user-uploaded PDFs from real work use cases. – #1 is Claude Opus 4.6 scoring 1525, +51 pts in the lead – While https://x.com/arena/status/2028915403704156581

I had the same thought so I’ve been playing with it in nanochat. E.g. here’s 8 agents (4 claude, 4 codex), with 1 GPU each running nanochat experiments (trying to delete logit softcap without regression). The TLDR is that it doesn’t work and it’s a mess… but it’s still very https://x.com/karpathy/status/2027521323275325622

I looked into how Claude Code and Codex compare to the default scaffolds METR uses for time horizon measurements. It looks like they don’t significantly outperform our default scaffolds on any models we’ve tried them on so far. https://x.com/nikolaj2030/status/2022398669337825737

€125M in funding. Non-dilutive for 10 teams. Only three survive to build €1 billion AI labs. I know… Some people will say it’s too little. Others will say it’s too late. But when Alessandro and Mirko Holzer reached out about supporting this initiative by the @SPRIND, it https://x.com/IlirAliu_/status/2029255982476378477

Canvas in AI Mode launches for everyone in the U.S. https://blog.google/products-and-platforms/products/search/ai-mode-canvas-writing-coding/

Any benefits in using AGENTS dot md files with coding agents? Lots of discussions on this topic lately. Researchers tested OpenAI Codex across 10 repos and 124 PRs, running identical tasks twice (once with AGENTS dot md, once without). The finding is a bit different from what https://x.com/omarsar0/status/2028464607753654711

Code → design → code Generate design files from code, collaborate in @Figma, and implement updates all within Codex without breaking your flow. https://x.com/OpenAIDevs/status/2027062351724527723

codex 5.3 for complicated software engineering https://x.com/gdb/status/2027478357999554995

Codex got more speed. With /fast mode, GPT-5.4 runs 1.5x faster with the same intelligence and reasoning. Move through coding tasks, iteration, and debugging while staying in flow. https://x.com/OpenAIDevs/status/2029635632918843610

Codex is now over 1 million active users! https://x.com/sama/status/2019219967250669741?s=20

GPT-5.3 Instant in ChatGPT is now rolling out to everyone. More accurate, less cringe. https://x.com/OpenAI/status/2028893701427302559

GPT-5.4 Thinking and GPT-5.4 Pro are rolling out now in ChatGPT. GPT-5.4 is also now available in the API and Codex. GPT-5.4 brings our advances in reasoning, coding, and agentic workflows into one frontier model. https://x.com/OpenAI/status/2029620619743219811

Harness engineering: leveraging Codex in an agent-first world | OpenAI https://openai.com/index/harness-engineering/

Just had 3 more that looked really promising. Codex 5.3 solved all of them. 🙃 https://x.com/theo/status/2028389340469149704

New OpenAI repo: Symphony https://t.co/4ZAZlAYnRJ TLDR: it’s an orchestration layer that polls project boards for changes and spawns agents for each lifecycle stage of the ticket You will just move tickets on a board instead of prompting an agent to write the code and do a PR https://x.com/scaling01/status/2029261034993684952

OpenAI Codex Review 2026 — Updated from Daily Use https://zackproser.com/blog/openai-codex-review-2026

OpenAI’s GPT-5.4 is coming, and it’ll have an “extreme” reasoning mode. For more on the model, check out this morning’s AI Agenda: https://x.com/steph_palazzolo/status/2029212039760023941

OpenAI’s Next AI Model Will Have ‘Extreme’ Reasoning — The Information https://www.theinformation.com/newsletters/ai-agenda/openais-next-ai-model-will-extreme-reasoning

The Codex app is now live on Windows. The app runs both natively and in WSL, with integrated terminals for PowerShell, Command Prompt, Git Bash, or WSL. We also built the first Windows-native agent sandbox — using OS-level controls to block filesystem writes outside your”” https://x.com/ajambrosino/status/2029252598851879265

The latest model from @OpenAI GPT-5.4 is available in the Text, Vision and Code Arena! – GPT-5.4 & GPT-5.4-High for Text and Vision – GPT-5.4-Medium for Code Arena Get prompting and let’s see how it ranks on the leaderboards. https://x.com/arena/status/2029622814060556767

The latest snapshot in @OpenAI’s ChatGPT is available for testing in the Arena! Find GPT-5.3-Chat-Latest in Text Arena and bring your real-world prompts to judge for yourself. https://x.com/arena/status/2028908848204177682

The underrated part of the windows codex app release is that the native agent sandbox is fully open source. Use it, fork it, build w/ it! https://x.com/reach_vb/status/2029335011804017135

Exclusive | OpenAI’s Former Research Chief Aims to Automate Manufacturing With AI – WSJ https://www.wsj.com/tech/ai/openais-former-research-chief-aims-to-automate-manufacturing-with-ai-8871f265

Today, OpenAI is launching the Deployment Safety Hub — a new site that turns our system cards from static PDFs into something you can easily search, browse, and share. https://t.co/qXWFVbw7Sa System cards are the most detailed window we provide into the technical work behind… https://x.com/dgrobinson/status/2027458289517068511

Big GPT-5.4 updates (via TheInformation) – 1M token context window – New “Extreme reasoning mode” → more compute, deeper thinking – Parity with Gemini and Claude long-context models – Better long-horizon tasks (can run for hours) – Improved memory across multi-step workflows https://x.com/kimmonismus/status/2029213568155992425

BOOOOM! Introducing GPT-5.4 Thinking & Pro in Codex, API & ChatGPT 🔥 It combines GPT-5.3-Codex-level coding with stronger reasoning, better knowledge-work generation, native computer use for agents, and up to ~1M token context! > Tool search cuts token usage by ~47% on… https://x.com/reach_vb/status/2029620416546017491

excited for GPT-5.3 Instant rolling out today! a lot of work went into improving the everyday chatgpt experience, things that don’t always show up in benchmarks: better tone, fewer unnecessary refusals, and stronger answers from search. we’ve also reduced hallucinations ↓ https://x.com/christinahkim/status/2028900228196384978

GPT-5.2 Pro is a really solid fact checker. Put in anything you write into it and it hums away and gives you objections & caveats & “well, actually” qualifications, plus it checks your math Outside of narrow areas (Academic pubs, New Yorker articles) this was not possible pre-AI https://x.com/emollick/status/2029235053339804132

GPT-5.3 Instant is rolling out in ChatGPT starting today. We heard the feedback on 5.2 – sometimes too cautious, too many caveats, and conversations that didn’t flow as naturally as they should. 5.3 Instant tackles that with fewer unnecessary refusals, fewer defensive… https://x.com/nickaturley/status/2028894581191000404

GPT-5.3-chat-latest now also in the API https://x.com/scaling01/status/2028906108291616773

GPT-5.4 also has a 1M context window, but their evals show that needle-in-a-haystack (MRCR v2) scores 97% at 16-32K tokens, drops to 57% at 256-512K, and just 36% at 512K-1M. So it’s a good idea to compact regularly! https://x.com/cline/status/2029642984351010874

GPT-5.4 coming soon (as first leaked by humble self) – exceeding 1 million context window – featuring an extreme reasoning mode https://x.com/scaling01/status/2029215437922169254

GPT-5.4 is a big step up in computer use and economically valuable tasks (e.g., GDPval). We see no wall, and expect AI capabilities to continue to increase dramatically this year. https://x.com/polynoamial/status/2029622090152956335

GPT-5.4 is launching, available now in the API and Codex and rolling out over the course of the day in ChatGPT. It’s much better at knowledge work and web search, and it has native computer use capabilities. You can steer it mid-response, and it supports 1m tokens of context. https://x.com/sama/status/2029622732594499630

GPT-5.4 only slightly better than GPT-5.3-Codex on SWE-Bench-Pro https://x.com/scaling01/status/2029620496627597364

GPT-5.4 Pricing MORE EXPENSIVE THAN GPT-5.2 https://x.com/scaling01/status/2029619520860565648

GPT-5.4 Thinking is rolling out to ChatGPT. You can now interrupt it before it produces the final answer. That means you can steer the response while it’s still working instead of needing multiple back-and-forth turns. We also improved deep web research and long-context… https://x.com/nickaturley/status/2029639058864099543

GPT-5.4-high is now in the Text Arena, tied with Gemini-3-Pro. Highlights: – Top 3 in Creative Writing, and top 10 in Instruction Following, Hard Prompts. – Top 6 for Occupational categories: Writing, Literature & Language, Entertainment, Sports & Media, Business, Management &… https://x.com/arena/status/2029648008602857694

It’s GPT-5.4 day! The first general-purpose AI model that beats humans at operating a computer. 75% on OSWorld vs 72.4% for humans. It can navigate desktops, click through UIs, send emails, fill out forms all from screenshots. Additional nuggets: – 1M token context and… https://x.com/TheRundownAI/status/2029625695593435286

It’s happening: GPT-5.4 landed in the arena. Release Thursday very likely https://x.com/kimmonismus/status/2029325405212070200

We also evaluated GPT-5.4 Pro on FrontierMath: Open Problems. It did not solve any problems. It made some novel observations on one problem, but of a form that the author had anticipated and characterized as relatively uninteresting. More here: https://x.com/EpochAIResearch/status/2029626331764605365

GPT-5.4 scores 83% on GDPval https://x.com/scaling01/status/2029618924375965992

GPT-5.4 and GPT-5.4 Thinking are now available in Perplexity for Pro and Max subscribers. https://x.com/perplexity_ai/status/2029629694489006347

Perplexity Computer orchestrates 20 different AI models – and now you can embed them directly inside apps you create. To prove it, we built CEO Chat. Text Elon, Jensen, Zuck, and other tech CEOs. They text back. https://x.com/AskPerplexity/status/2028893546447814895

OpenClaw on a Unitree G1 humanoid 🤯 A MIT dropout developed an open-source robotics platform that supports 80% of Chinese OEM robots! This OpenClaw upgrade to process physical space and time via integrations with LiDAR, stereo, or RGB cameras. It enables robots like the… https://x.com/IlirAliu_/status/2028756316999573751

