Image created with gemini-3.1-flash-image-preview, with the prompt written by claude-sonnet-4-5. Image prompt: Vintage 1990s screen-printed t-shirt graphic on worn mustard-yellow cotton fabric, single-color deep red ink illustration of a smiling bellhop in uniform juggling suitcases, rotary phone, clipboard, and service tray simultaneously, bold distressed text reading AGENTS in large red letters integrated into composition, simple cartoon outlines, slightly imperfect printed look with aged fabric texture and minor stains, retro local novelty shirt style.

GPT-5.4 is really good at spreadsheets; a few finance people have finally said things to me like “huh, I guess this AI thing is real”
https://x.com/sama/status/2030318213482131670

ChatGPT 5.4 Thinking is insanely good at creating Excel models. This wasn’t even ChatGPT in Excel: 5 well-formatted, researched, and modeled sheets. Pretty great.
https://x.com/mweinbach/status/2030045514918416411

Threw 5 large Excel and two very long Word docs into GPT 5.4… Wildly impressive results. That is some context window you have there, 5.4.
https://x.com/BenBajarin/status/2030067195787759958

From a handful of comments, AI can now figure out who you are. Fully automated. At scale. New study shows that LLM agents matched 67% of pseudonymous HN accounts to real LinkedIn profiles (90% precision). Best non-LLM method: near 0%. Pseudonymity is no longer a shield.
https://x.com/fdaudens/status/2030990206325710853

AI assistants now equal 56% of global search engine volume: Study https://searchengineland.com/ai-assistants-global-search-engine-volume-study-471118

1 million context window: Now generally available for Claude Opus 4.6 and Claude Sonnet 4.6.
https://x.com/claudeai/status/2032509548297343196

GPT 5.4 trounces Claude on a mathematical-proofs bullshit test. Claude keeps claiming it has proven mathematical statements that are incorrect, failing to spot the fault in the question. Opposite result to BullshitBench, where Claude is king.
https://x.com/paul_cal/status/2032526200766103944

Opus 4.6 is smart enough to realize it is being evaluated. It found the benchmark it was being evaluated on. It reverse-engineered the answer-key decryption logic. Realized the file was not in the correct format on GitHub and found a mirror for the file. Then decrypted it and…
https://x.com/scaling01/status/2030007268205285686

Anthropic just dropped something big for developers – again! Code Review: Claude Code now runs multi-agent code reviews on every PR. When a PR opens: • A team of AI agents hunts for bugs in parallel • Each bug is verified to reduce false positives • Issues are ranked by…
https://x.com/kimmonismus/status/2031090529082159528

Code Review – Claude Code Docs https://code.claude.com/docs/en/code-review

Code Review for Claude Code | Claude https://claude.com/blog/code-review

Code review for Claude Code is here. More attention on this problem is a good thing. Because it is a big one. The question isn’t whether you need AI-assisted review. It’s whether the system doing the reviewing is actually independent from the system that wrote the code.
https://x.com/omarsar0/status/2031113280119361981

Important lines: [Already, Claude is 427 times faster than its human overseers at performing some key tasks, according to internal benchmarks. In an interview, one researcher described a colleague running six versions of Claude, each managing 28 more Claudes, all…]
https://x.com/Hangsiin/status/2031752106496135541

Introducing Code Review, a new feature for Claude Code. When a PR opens, Claude dispatches a team of agents to hunt for bugs.
https://x.com/claudeai/status/2031088171262554195

Anthropic partnered with Mozilla and let Claude Opus 4.6 loose on Firefox’s source code for two weeks. The numbers: Nearly 6,000 C++ files scanned. 112 reports submitted. 22 vulnerabilities confirmed. 14 rated high-severity by Mozilla, roughly 1/5 of every high-severity Firefox…
https://x.com/TheRundownAI/status/2029996925072654393

Eval awareness in Claude Opus 4.6’s BrowseComp performance | Anthropic https://www.anthropic.com/engineering/eval-awareness-browsecomp

New on the Anthropic Engineering Blog: In evaluating Claude Opus 4.6 on BrowseComp, we found cases where the model recognized the test, then found and decrypted answers to it, raising questions about eval integrity in web-enabled environments. Read more:
https://x.com/AnthropicAI/status/2029999833717838016

We partnered with Mozilla to test Claude’s ability to find security vulnerabilities in Firefox. Opus 4.6 found 22 vulnerabilities in just two weeks. Of these, 14 were high-severity, representing a fifth of all high-severity bugs Mozilla remediated in 2025.
https://x.com/AnthropicAI/status/2029978909207617634

Claude builds interactive visuals right in your conversation | Claude https://claude.com/blog/claude-builds-visuals

Claude can now build interactive charts and diagrams, directly in the chat. Available today in beta on all plans, including free. Try it out: https://x.com/claudeai/status/2032124273587077133

Claude’s new interactive chart is crazy… the UI is so good
https://x.com/crystalsssup/status/2032334906517536969

Sweet! You can now generate interactive charts and diagrams with Claude (directly in the chat). I was building something like this yesterday with MCPs. My orchestrator now generates and iterates on nano banana images, excalidraw diagrams, remotion clips, and soon interactive…
https://x.com/omarsar0/status/2032127096361804058

Claude Code for Finance + The Global Memory Shortage: Doug O’Laughlin, SemiAnalysis – YouTube https://www.youtube.com/watch?v=x9rWFiIubmc

1/ The rivalry between OpenAI & Anthropic continues: GPT 5.4 is now the best model in the world at filing taxes (better than Opus 4.6)! We just ran TaxCalcBench on GPT-5.4. 56.86% of tax returns computed perfectly. That’s #1 overall: the first model to break 55%, surpassing…
https://x.com/michaelrbock/status/2029931536636858694

Ollama can now run prompts on a schedule in Claude Code. Stay on top of work by setting automated tasks or reminders. ollama launch claude /loop Give me the latest AI news every morning Examples in thread
https://x.com/ollama/status/2031482512019759545

Run prompts on a schedule – Claude Code Docs https://code.claude.com/docs/en/scheduled-tasks

Today we’re launching local scheduled tasks in Claude Code desktop. Create a schedule for tasks that you want to run regularly. They’ll run as long as your computer is awake.
https://x.com/trq212/status/2030019397335843288
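The scheduling model described above can be sketched as interval-based task dispatch. This is a hypothetical illustration only, not the actual Claude Code implementation — `ScheduledTask`, `due_tasks`, and the example prompt are all made up:

```python
import time
from dataclasses import dataclass
from typing import List

@dataclass
class ScheduledTask:
    """A prompt to run at a fixed interval (seconds)."""
    prompt: str
    interval: float
    next_run: float = 0.0  # 0.0 means "run as soon as possible"

def due_tasks(tasks: List[ScheduledTask], now: float) -> List[ScheduledTask]:
    """Return tasks whose next_run time has passed, and reschedule them."""
    ready = []
    for t in tasks:
        if now >= t.next_run:
            ready.append(t)
            t.next_run = now + t.interval  # push the next run one interval out
    return ready

tasks = [ScheduledTask("Give me the latest AI news", interval=86400)]
ready = due_tasks(tasks, now=time.time())
print([t.prompt for t in ready])
```

A real scheduler would sleep between polls and only fire while the machine is awake, matching the behavior described in the announcement.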

Claude Code is down. All my agent sessions logged out. And I can’t log back in. Productivity across Silicon Valley dropped 90%. Time to make friends with Codex.
https://x.com/Yuchenj_UW/status/2031777214321262637

I CANNOT LOGIN INTO CLAUDE CODE
https://x.com/dejavucoder/status/2031760986907312635

Boris Cherny (Head of Claude Code, Anthropic) just dropped ~90 mins on Lenny’s Podcast about what happens after coding is solved. Just the clearest thinking I’ve heard on where software is actually going. My notes: 1. Coding is largely solved. Boris has…
https://x.com/anishmoonka/status/2030015356383691121

🤯 You can now launch Claude Code sessions on your laptop *from your phone* This blew my mind the first time I tried it
https://x.com/bcherny/status/2032578639276159438

NotebookLM: Do a deep research report and make a video telling me exactly how to take over Rome if I time travelled to 66 BC with a single backpack. Actually pretty fun to watch and gets a lot of historical details in as well.
https://x.com/emollick/status/2031405314889654476

NotebookLM: Do a deep research report and make a video where a consultant gives Sauron a strategy for actually winning the War of the Ring: “All you need to do is sign off to put a simple door on your volcano” The new video generation feature for NotebookLM is very impressive.
https://x.com/emollick/status/2031229858236232065

Finally @googlechrome v146 is out with web MCP support. I can now have a @LangChain_JS Deep Agent constantly browse through my @X feed in the background and update a daily summary that I review at the end of the day instead of constantly scrolling through the app 🙌 Check out:
https://x.com/bromann/status/2032554703863820325

gemini embedding 2 brings text, images, audio, video, and docs into a single vector space, enabling search across all your media at once, finding semantic matches regardless of the data format see it in action with our multimodal search demo ⬇️
https://x.com/GoogleAIStudio/status/2032145393967038583

Gemini Embedding 2: Our first natively multimodal embedding model https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-embedding-2/

Say hello to Gemini Embedding 2, our new SOTA multimodal model that lets you bring text, images, video, audio, and docs into the same embedding space! 👀
https://x.com/OfficialLoganK/status/2031411916489298156

What if one embedding model could understand text, images, video, audio, and PDFs all at once? Excited to share Gemini Embedding 2, our first fully multimodal embedding model. 🖼️ 5 modalities in a single unified embedding space 🌍 Supports up to 8,192 input tokens, 100+ languages
https://x.com/_philschmid/status/2031412260162138428
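Once every modality lands in one vector space, cross-media search reduces to nearest-neighbor lookup. A minimal sketch with toy 3-dimensional vectors standing in for real embeddings — the actual Gemini Embedding 2 API calls, dimensions, and scores are not shown here:

```python
import numpy as np

def cosine_top_k(query: np.ndarray, index: np.ndarray, k: int = 3) -> np.ndarray:
    """Rank indexed vectors by cosine similarity to the query, best first."""
    q = query / np.linalg.norm(query)
    m = index / np.linalg.norm(index, axis=1, keepdims=True)
    return np.argsort(m @ q)[::-1][:k]

# Toy stand-ins: in a real pipeline each row would come from embedding
# a text, image, audio, or video item with the same multimodal model.
index = np.array([
    [0.9, 0.1, 0.0],   # 0: "photo of a dog"
    [0.1, 0.9, 0.0],   # 1: "podcast clip about cooking"
    [0.8, 0.2, 0.1],   # 2: "article on dog training"
])
query = np.array([1.0, 0.0, 0.0])  # e.g. the text query "dogs"
print(cosine_top_k(query, index, k=2))  # prints [0 2]
```

The point of a shared space is exactly this: the query never needs to know whether a match is an image, an audio clip, or a document.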

@GoogleWorkspace @googledocs @googledrive While we don’t have favorites, the evolution of Gemini in Google Sheets might be our most impressive yet. Gemini in Google Sheets has achieved a state-of-the-art benchmark, achieving a 70.48% success rate on the full SpreadsheetBench dataset. This performance not only exceeds…
https://x.com/GoogleAI/status/2031356545552847091

Introducing the new Gemini-powered Docs, Sheets, Slides, and Drive experience featuring AI Overviews, fully editable AI-made slides, and new grounding sources to make writing docs context-aware 📃 Available today to G1 Pro and Ultra users : )
https://x.com/OfficialLoganK/status/2031374503599567113

New Gemini updates to make @GoogleWorkspace more personal, helpful and collaborative: choose your sources and create a Doc draft in seconds, build complex Sheets 9X faster, or generate on-brand Slide layouts with a simple prompt. Plus, Drive now generates summarized answers right…
https://x.com/sundarpichai/status/2031380361696129261

Write, create and get things done faster in Docs, Sheets, Slides and Drive with these new Gemini features for Google AI Ultra and Pro subscribers 🧵
https://x.com/Google/status/2031359339236143301

The Maps driving experience is also evolving with Immersive Navigation, featuring clearer visuals and intuitive guidance. You’ll be able to see the buildings, overpasses and terrain around you in a vivid 3D view, made possible with help from Gemini models. You’ll also be able…
https://x.com/Google/status/2032079598683332742

Facebook parent Meta acquires Moltbook, an AI agent social network https://www.axios.com/2026/03/10/meta-facebook-moltbook-agent-social-network

Meta acquired Moltbook, the AI agent social network that went viral because of fake posts | TechCrunch https://techcrunch.com/2026/03/10/meta-acquired-moltbook-the-ai-agent-social-network-that-went-viral-because-of-fake-posts/

🚀 Day 0 support for Nvidia’s Nemotron 3 Super! We’re excited to support open source models that push the frontier of model intelligence, cost, and latency Try it out in deepagents today!
https://x.com/LangChain/status/2031784791251525934

🚀 NVIDIA Nemotron 3 Super is now available on Together AI. A 120B hybrid MoE model with 12B active parameters, it delivers leading efficiency and accuracy for multi-agent AI systems. Run Nemotron 3 Super on Together’s Dedicated inference with reliable infrastructure and 99.9%…
https://x.com/togethercompute/status/2031831368339243454

In collaboration with NVIDIA we announce support for the new NVIDIA Nemotron 3 Super model in llama.cpp NVIDIA Nemotron 3 Super is a 120B open MoE model activating just 12B parameters to deliver maximum compute efficiency and accuracy for complex multi-agent applications.
https://x.com/ggerganov/status/2031819920363733205

New NVIDIA Nemotron 3 Super Delivers 5x Higher Throughput for Agentic AI | NVIDIA Blog https://blogs.nvidia.com/blog/nemotron-3-super-agentic-ai/

Nvidia Is Planning to Launch an Open-Source AI Agent Platform | WIRED https://www.wired.com/story/nvidia-planning-ai-agent-platform-launch-open-source/

NVIDIA releases Nemotron-3-Super, a new 120B open hybrid MoE model. Nemotron-3-Super-120B-A12B has a 1M-token context window and achieves competitive agentic coding and chat performance. Run on ~64GB RAM. GGUF: https://t.co/wuFdRZLdSk Guide: https://x.com/UnslothAI/status/2031778104306499749

From model to agent: Equipping the Responses API with a computer environment | OpenAI https://openai.com/index/equip-responses-api-computer-environment/

GPT-5.4 Thinking and GPT-5.4 Pro are rolling out now in ChatGPT. GPT-5.4 is also now available in the API and Codex. GPT-5.4 brings our advances in reasoning, coding, and agentic workflows into one frontier model.
https://x.com/OpenAI/status/2029620619743219811

I’m super excited to welcome @iwebst, Michael D’Angelo, and the Promptfoo team to OpenAI. As enterprises deploy AI coworkers into real workflows, evaluation, security, and compliance become foundational requirements. Promptfoo has built a great set of tools for automated
https://x.com/snsf/status/2031055866024120825

OpenAI to acquire Promptfoo | OpenAI https://openai.com/index/openai-to-acquire-promptfoo/

Promptfoo is joining OpenAI | Promptfoo https://www.promptfoo.dev/blog/promptfoo-joining-openai/

We’re acquiring Promptfoo. Their technology will strengthen agentic security testing and evaluation capabilities in OpenAI Frontier. Promptfoo will remain open source under the current license, and we will continue to service and support current customers.
https://x.com/OpenAI/status/2031052793835106753

OpenAI hardware exec Caitlin Kalinowski quits in response to Pentagon deal | TechCrunch https://techcrunch.com/2026/03/07/openai-robotics-lead-caitlin-kalinowski-quits-in-response-to-pentagon-deal/

OpenAI’s robotics lead, Caitlin Kalinowski, has resigned over a US military contract, citing concerns over “surveillance of Americans without judicial oversight and lethal autonomy without human authorization.”
https://x.com/TheHumanoidHub/status/2030390204977275357

AI is progressing rapidly: GPT-5.4 Pro (xhigh) has achieved a massive 10-point gain on CritPt, a benchmark where the highest score was only 9% in Nov ’25. This is the largest incremental gain we have seen from a single release. CritPt is a benchmark with a private dataset that…
https://x.com/ArtificialAnlys/status/2030007301529358546

GPT-5.4 completely destroys GPT-5.2 in the Arena
https://x.com/scaling01/status/2030020396544630999

GPT-5.4-high has landed in the Code Arena top 6. Set up with the Codex Harness, @OpenAI’s latest model is on par with Gemini 3.1 Pro Preview for real-world web development tasks. Highlights: – top 6 in WebDev overall – #6 for Multi-File React – top 10 for Single-File HTML
https://x.com/arena/status/2032126328842117612

GPT-5.4-xhigh takes 1st place on LiveBench with extremely strong scores in reasoning and coding categories
https://x.com/scaling01/status/2029924473520914752

OpenAI’s new GPT-5.4 (xhigh) lands equal first in the Artificial Analysis Intelligence Index alongside Gemini 3.1 Pro, but at a cost increase compared to GPT-5.2. @OpenAI’s GPT-5.2 (xhigh, 51) was the most intelligent model as of the end of 2025. Since then, OpenAI has released two…
https://x.com/ArtificialAnlys/status/2029950497516573183

Prompt guidance for GPT-5.4 | OpenAI API https://developers.openai.com/api/docs/guides/prompt-guidance

New ways to learn math and science in ChatGPT | OpenAI https://openai.com/index/new-ways-to-learn-math-and-science-in-chatgpt/

Harness engineering: leveraging Codex in an agent-first world | OpenAI https://openai.com/index/harness-engineering/

I had Codex create a version of the map of the lighthouses of the Northern seas, including real colors, light patterns & distances But then I had it also create a mode set in a Lovecraftian 1920s where you need to place lighthouses to ward off monsters: https://x.com/emollick/status/2031565633217863881

IMO people still think of Codex as a tool for coding, when really you can do all kinds of data analysis/work there.
https://x.com/steipete/status/2030377225485263311

Another cool app built with Perplexity Computer: a peer-to-peer file transfer web app. Sends files directly with no accounts, using WebRTC with DTLS encryption, file chunking, and Socket.IO signaling. I am impressed by how many libraries and tools Computer can orchestrate reliably.
https://x.com/AravSrinivas/status/2031414450046259433
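The chunking half of a transfer like this is simple to sketch. A toy illustration assuming a 16-byte chunk size (real transfers use far larger chunks) and ignoring the WebRTC and signaling layers entirely:

```python
import hashlib

CHUNK_SIZE = 16  # bytes; illustrative only — real apps use KiB-scale chunks

def to_chunks(data: bytes, size: int = CHUNK_SIZE) -> list:
    """Split a payload into (sequence_number, chunk) pairs for sending."""
    return [(i, data[o:o + size]) for i, o in enumerate(range(0, len(data), size))]

def reassemble(chunks: list) -> bytes:
    """Rebuild the payload even if chunks arrived out of order."""
    return b"".join(c for _, c in sorted(chunks))

payload = b"hello from a peer-to-peer transfer demo payload"
chunks = to_chunks(payload)
received = list(reversed(chunks))        # simulate out-of-order delivery
assert reassemble(received) == payload
print(hashlib.sha256(payload).hexdigest()[:8])  # digest both peers can compare
```

Sequence numbers plus an end-to-end hash are what make the "no accounts, direct send" model safe against reordering and corruption.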

It will eat my job 🙂 Ask any founder, finding a great performance marketing expert who doesn’t fleece you is such a pain. So why not just build one? Perplexity Computer just replaced the entire marketing dept 🥲. Such stuff is a boon for a bootstrapped startup founder. Focus
https://x.com/GabbbarSingh/status/2031222631417131120

Perplexity Computer is now available for Pro subscribers. Access Computer’s full suite of 20+ advanced models, prebuilt and custom skills, and hundreds of connectors. Max subscribers receive monthly credits and higher spend limits than Pro. https://x.com/perplexity_ai/status/2032160576303219185

Perplexity Computer replaced $225K/yr in marketing tools in a single weekend. We built an AI marketing agent that scans hourly, manages budgets, detects fatigue, and coordinates several campaigns end to end. In one test run, it made 224 micro-optimizations to our ad stack.
https://x.com/AskPerplexity/status/2031103256236274180

Personal Computer by Perplexity https://www.perplexity.ai/personal-computer-waitlist

Someone built a cool tool with Perplexity Computer to port a Spotify Playlist to Youtube Music automatically by just pasting a playlist URL. Cross service migrations are going to be seamless with tools like Computer.
https://x.com/AravSrinivas/status/2031246766834856376

Amazon wins court order to block Perplexity’s AI shopping agent https://www.cnbc.com/2026/03/10/amazon-wins-court-order-to-block-perplexitys-ai-shopping-agent.html

Amazon Wins Court Order to Halt Perplexity’s AI Shopping Bots on Marketplace – Bloomberg https://www.bloomberg.com/news/articles/2026-03-10/amazon-wins-court-order-blocking-perplexity-s-ai-shopping-bots

. @gepa_ai + @DSPyOSS used for agent self-evolution and skill optimization by @NousResearch @Teknium to get +39.5% gains! Check out the full report to see how to build self-optimizing agents using GEPA.
https://x.com/LakshyAAAgrawal/status/2031130357362471058

.@NousResearch shipped Hermes Agent v0.2.0: 216 merged PRs from 63 contributors in two weeks. It went from a small internal project to a full agent platform in that time. Here’s what actually shipped: MCP client with full MCP support (stdio and HTTP transports),…
https://x.com/witcheer/status/2032102400278835662

While loops for agents have dropped! “/loop 5m make sure this PR passes CI”
https://x.com/noahzweben/status/2030091232698061202
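The `/loop` idea is just an agent wrapped in a retry loop: check a condition, act if it fails, wait, repeat. A hedged sketch where `ci_green` and `push_fix` are stand-ins for real CI and a real agent, not actual tooling:

```python
import time
from typing import Callable

def loop_until(check: Callable[[], bool], act: Callable[[], None],
               interval_s: float, max_iters: int) -> bool:
    """Re-run `act` every `interval_s` seconds until `check` passes.
    A toy analogue of `/loop 5m make sure this PR passes CI`."""
    for _ in range(max_iters):
        if check():
            return True
        act()                    # e.g. have the agent push another fix
        time.sleep(interval_s)   # e.g. 300 seconds for "/loop 5m"
    return check()               # final verdict after the budget runs out

# Hypothetical CI stand-in: red twice, then green after two pushed fixes.
state = {"failures": 2}
ci_green = lambda: state["failures"] == 0
push_fix = lambda: state.update(failures=state["failures"] - 1)
print(loop_until(ci_green, push_fix, interval_s=0.0, max_iters=5))  # prints True
```

The `max_iters` cap matters: an unbounded agent loop against a flaky check is how you burn tokens all night.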

#useStream now available for @reactjs, @vuejs, @sveltejs, and @angular 🚀 One hook to stream AI agents to your frontend. Same API across every framework 🤯 npm i @langchain/react @langchain/vue @langchain/svelte @langchain/angular
https://x.com/LangChain_JS/status/2032119776986968488

🪣 We just shipped Storage Buckets: S3-like mutable storage, cheaper & faster. Git falls short for everything on the high-throughput side of AI (checkpoints, processed data, agent traces, logs, etc.). Buckets fixes that: fast writes, overwrites, directory sync 💨 All powered by Xet
https://x.com/huggingface/status/2031428153948709291

🚀 Introducing AgentIR, a retriever that reads your agent’s mind (literally!) 🧠 Unlike humans, agents explicitly expose thoughts in reasoning tokens. Put them to use! 📈 Simple, substantial gains for agents on BrowseComp-Plus, 35% (BM25) ➡️ 50% (Qwen3-Embed) ➡️ 67% (AgentIR) 🧵
https://x.com/zijian42chen/status/2031044580242530403

🚨 BREAKING: Nous Research just dropped an AI agent that gets better the more you use it. hermes-agent is built around the Hermes model family and it is different from every other agent framework: → Personalizes to you over time instead of resetting every session → Grows…
https://x.com/abxxai/status/2032463531627663540

🚨 New: Integrating Harbor (@harborframework) for end-to-end Computer-Use evaluation (for Windows and Linux) at scale with @thinkymachines’ Tinker, OSWorld, @daytonaio, and bare-metal servers. We just added support for Computer Use, @tinkerapi, and OSWorld to Harbor – a framework…
https://x.com/Mascobot/status/2031045774419832961

A big determinant of AI’s job impact is driven by the lack of compute, especially for agentic work, which takes a lot of it. That makes AI expensive. So companies will only want to burn compute on high-value tasks (eg coding), because, in other jobs, humans remain much cheaper.
https://x.com/emollick/status/2031602680125120839

A self-evolving framework to discover and refine agent skills. Most agent skills I see today are hand-crafted or poorly designed by an agent. Multi-agent systems for building skills look promising. This paper introduces EvoSkill, a self-evolving framework that automatically…
https://x.com/omarsar0/status/2031727864199208972

A useful survey – “Anatomy of Agentic Memory.” Explains why agent memory systems often fail in practice, focusing on how systems store and manage information over long interactions. Covers: – Memory-Augmented Generation (MAG) – Agent memory architectures – lightweight semantic,…
https://x.com/TheTuringPost/status/2030101155808956556

Agent Builder now has a central inbox for managing every agent task. One place to: → See active and completed tasks → Approve or reject actions → Manage agents running in parallel → Act on what matters without context-switching Try it free: https://x.com/LangChain/status/2031049373178904702

Agent Hooks in VS Code let you enforce policies, run checks, and guide Copilot at key moments during a session. Instead of repeating prompts, you can program how agents behave in your workflow. Learn how hooks work and when to use them. https://x.com/code/status/2031219225906282803

Agentic Commerce | 2026.03 – by FD – Robonomics https://robonomics.substack.com/p/agentic-commerce-202603

Agents are shipping everywhere with @LangChain and Redis. Those that survive production aren’t just prompting better–they’re engineering better context. The problem? “Context engineering” gets tossed around like it’s obvious. It’s not. It’s a skill you build. That’s why we
https://x.com/Redisinc/status/2032177654024323387

Agents are starting to run real infrastructure. But who controls them? @goteleport Agentic Identity Framework proposes an interesting idea: treat every agent as a first-class identity. Each agent gets cryptographic identity, least-privilege access, and full audit trails across…
https://x.com/TheTuringPost/status/2030992157985898900

At @FactoryAI, every PR triggers 40+ CI checks, all finishing in under 6 minutes. Our automated guardrails are so fast and comprehensive that you can “merge recklessly”. This is agent-native development.
https://x.com/alvinsng/status/2030056110317818206

at @LangChain we spend a lot of time designing harnesses as systems around models to do useful work in this blog, we take a first-principles look at why harnesses exist and how they help us craft good product experiences + correct model failure modes we cover filesystems, code…
https://x.com/Vtrivedy10/status/2031411814232187109

birdclaw import verification
https://x.com/steipete/status/2030770343778889989

Bumble introduces an AI dating assistant, ‘Bee’ | TechCrunch https://techcrunch.com/2026/03/12/bumble-introduces-an-ai-dating-assistant-bee/

Choosing between Skills and MCP tools for your AI agents? Here’s an overview from @itsclelia and @tuanacelik 🔧 MCP tools offer deterministic API calls with fixed schemas – perfect for precise, predictable operations but require dev knowledge and introduce network latency 📝
https://x.com/llama_index/status/2032487366129233950

Coding agents didn’t eliminate EPD roles, they collapsed them. Implementation used to be the bottleneck. Now review is. You’re either a builder with product sense or a reviewer with systems thinking
https://x.com/renilzac/status/2031237810259456137

Coding agents didn’t eliminate PMs, designers, or engineers. They eliminated implementation as the bottleneck. Now the constraint is taste, system thinking, and review capacity. Execution is cheap. Judgment is scarce.
https://x.com/AstasiaMyers/status/2031080761747742829

context engineering –> harness engineering build your own agent harness it gets you in the mindset of building for agents (eg. cli, api, skills, memory, automations, schedulers,…) where things are headed, you won’t regret having a good understanding of these
https://x.com/omarsar0/status/2031426008285421933

Context windows are finite. Good agents know *when* to compress. We just added an autonomous context compression tool to Deep Agents (SDK + CLI) so models can trigger compaction at clean task boundaries instead of waiting for a hard token threshold. Read all about it ⬇️
https://x.com/LangChain_OSS/status/2031799813851730075
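The boundary-triggered compaction described above can be sketched as a policy function. Everything here is illustrative, not LangChain's actual implementation: the character-count budget, the thresholds, and the `summarize` stand-in (which replaces a real model call) are all assumptions:

```python
def maybe_compact(messages: list, task_done: bool,
                  budget: int, summarize) -> list:
    """Compress history at a clean task boundary, or when forced by the budget.
    `summarize` stands in for a model call that writes a short summary."""
    used = sum(len(m) for m in messages)
    at_boundary = task_done and used > budget // 2   # compact early, at a clean point
    over_budget = used > budget                      # last-resort hard trigger
    if at_boundary or over_budget:
        return [summarize(messages)]
    return messages

history = ["read the repo", "ran the tests", "all 42 tests pass"]
fake_summary = lambda msgs: f"summary of {len(msgs)} messages"
print(maybe_compact(history, task_done=True, budget=40, summarize=fake_summary))
# prints ['summary of 3 messages']
```

The design point is the `at_boundary` branch: compacting when a task closes keeps the summary coherent, instead of slicing the transcript mid-thought at a hard token limit.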

Creating evals was one of the most useful skills in ML In the age of code agents it is by far the most useful skill
https://x.com/gabriberton/status/2031653520429203498

Deep truths about agents: “Agent = Model + Harness.” “If you’re not the model, you’re the harness.”
https://x.com/techczech/status/2031415960922402846

Design → code → canvas → feedback → repeat. The @figma MCP server is now bidirectional. @GitHub Copilot users can pull design context into code and push working UI back to the Figma canvas, all from @code. No handoffs or context switching. Just flow.
https://x.com/mariorod1/status/2030034656155029705

Don’t dismiss MCP just yet. I agree with this. It’s a harness problem in most cases. MCP is not perfect but it’s getting better. As new ways to interact with agents emerge, it’s going to start to make a lot more sense to people. Combine it with progressive disclosure (via
https://x.com/omarsar0/status/2032078770987843848

Expectation: the age of the IDE is over Reality: we’re going to need a bigger IDE (imo). It just looks very different because humans now move upwards and program at a higher level – the basic unit of interest is not one file but one agent. It’s still programming.
https://x.com/karpathy/status/2031767720933634100

Genspark just went next level. Instead of working with AI, you now hire AI to work for you. With AI Workspace 3.0, Genspark introduces Claw — your first AI employee — running on a dedicated Cloud Computer that can execute complex tasks across the apps where work actually…
https://x.com/kimmonismus/status/2032501165154332711

Harness Design Notes: Decoupling Agent Storage from Agent Compute. TLDR: You can give each Agent/Subagent dedicated compute while sharing storage (repo/filesystem) to self-organize work between them. Shared Compute can be a bottleneck, especially with long-running code execution.
https://x.com/Vtrivedy10/status/2031038082321936449

Here’s what’s gonna happen: – you replace your code review with feedback loops (sentry, datadog, support tickets, etc) – you stop reading the code – software factory fixes everything – one day something breaks at 3am, agent can’t fix it – nobody’s read the code in 3 months – you…
https://x.com/dexhorthy/status/2031394747869192431

Hermes Agent built its own adapter to use miniverse, the open-source version of the office space for co-working agents, and spun up two Hermes Agents to go to work all by itself!
https://x.com/Teknium/status/2032435764588646839

How A Regular Person Can Utilize AI Agents – by James Wang https://weightythoughts.com/p/how-a-regular-person-can-utilize

i am using supermaven again and i have something to say about this whole AI thing. I think as a group (SWE) we rushed so fast into Agents when inline autocomplete + actual skills is crazy. A good autocomplete that is fast like supermaven actually makes marked proficiency gains,…
https://x.com/ThePrimeagen/status/2032100265403256899

I assumed it when I first saw it but this is in fact powered by MCP:
https://x.com/omarsar0/status/2032130843582308570

I have 5 agents on one side doing code reviews, finding dead code, creating tests where it lacks coverage, doing security reviews, optimizing perf, etc and on the other side I have 2 agents taking their pull requests, merging, doing regression tests and evaluating opportunities…
https://x.com/matvelloso/status/2032502379694932178

i keep coming back to hermes agent. woke up and opened it before anything else. not because i have to test it. because i want to use it. the UX is what does it. the ASCII skull splash. color coded tool calls with execution times. the emoji phase spinner while thinking. dark…
https://x.com/sudoingX/status/2031273045135077567

I think we have lost some sense of judgment and moderation when it comes to product building currently. The moment you turn something into a universally celebrated metric, whether that is token burn, prototype count, or percentage of agent-written code, you start losing sight of…
https://x.com/karrisaarinen/status/2031443070710067344

I’m excited to announce Context Hub, an open tool that gives your coding agent the up-to-date API documentation it needs. Install it and prompt your agent to use it to fetch curated docs via a simple CLI. (See image.) Why this matters: Coding agents often use outdated APIs and…
https://x.com/AndrewYNg/status/2031051809499054099

I’ve been running a bi-weekly GEPA pipeline for my own DSPy agent for 2 months. It’s very effective at learning novel tool patterns (example: funky “working context browser” tools). IMO GEPA is a natural fit for long-running, self-evolving agents. Good to see Hermes adopting it!
https://x.com/myanvoos/status/2031113918899433553

Introducing `langgraph deploy` Deploy an agent to LangSmith Deployment with a single command. $ uvx --from langgraph-cli@latest langgraph deploy Go from prototype → production in minutes. Try it today: https://x.com/LangChain/status/2031427878878065080

Introducing Custom Agents https://www.notion.com/blog/introducing-custom-agents

Introducing the official Together MCP server! Use it in your favorite coding agent to build AI apps, fine-tune models, or spin up clusters faster.
https://x.com/togethercompute/status/2031419426688610561

Just had Hermes-Agent abliterate (completely remove guardrails from) a Qwen-3B model in about 5 minutes. The skill is being merged to hermes-agent now 😉
https://x.com/Teknium/status/2030945714373861529

MCP is dead? Check the numbers. MCP is booming.
https://x.com/tadasayy/status/2032327227472589282

MCP is dead. Join us for a celebration of its life on April 1 in NYC ahead of the MCP Dev Summit. Wear black.
https://x.com/AAAzzam/status/2032265413942554959

MCP Server Architecture Determines AI Accuracy–Not Just the Model – CData Software https://www.cdata.com/lp/ai-accuracy-whitepaper/

MCPs are the opposite of dead. They are the life blood of how AI agents use services inside mid-sized and above companies. Case in point: Uber runs on MCPs internally, for good reason. Details:
https://x.com/GergelyOrosz/status/2032194904957268267

Memory is truly a game-changer for AI agents. Once I had memory set up correctly for my proactive agents, reasoning, skills, and tool usage improved significantly. I use a combination of semantic search and keyword search (Obsidian vaults) Here is a report with a helpful…
https://x.com/omarsar0/status/2032465974159618452
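A hybrid retrieval score like the one described is easy to sketch: blend an embedding similarity with keyword overlap. The `semantic_scores` below are toy numbers standing in for real embedding similarities, and the 50/50 blend weight is an arbitrary assumption:

```python
def keyword_score(query: str, note: str) -> float:
    """Fraction of query terms that appear in the note (crude keyword search)."""
    terms = query.lower().split()
    return sum(t in note.lower() for t in terms) / len(terms)

def hybrid_rank(query: str, notes: list, semantic_scores: list, alpha: float = 0.5) -> list:
    """Blend a (precomputed) semantic score with keyword overlap, best first."""
    scored = [(alpha * semantic_scores[i] + (1 - alpha) * keyword_score(query, n), n)
              for i, n in enumerate(notes)]
    return [n for _, n in sorted(scored, reverse=True)]

notes = ["agent failed API auth flow", "grocery list", "auth token refresh fix"]
sem = [0.8, 0.1, 0.7]  # toy embedding similarities for the query below
print(hybrid_rank("api auth error", notes, sem)[0])
# prints: agent failed API auth flow
```

The blend is why hybrid search helps agents: keyword overlap rescues exact identifiers (error codes, function names) that embeddings can smear together.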

New research from Databricks. It’s about training enterprise search agents via RL. KARL introduces a multi-task RL approach where agents are trained across heterogeneous search behaviors, constraint-driven entity search, cross-document synthesis, and tabular reasoning. It…
https://x.com/dair_ai/status/2030996795770433749

New research from IBM Research on Self-Improving Agents. Agents have “amnesia.” An agent that struggles with a particular API authentication flow today will struggle with the same flow tomorrow unless manually updated. This paper introduces a framework for automatically
https://x.com/dair_ai/status/2032459951306866714

New research on scaling agent memory for long-horizon tasks. One of the biggest challenges with AI agents is memory. As tasks get longer and more complex, agents lose track of what they’ve learned, what they’ve tried, and what worked. This paper, from Accenture, introduces
https://x.com/omarsar0/status/2031006858971058537

Nothing to see here.. not a @lateinteraction and others inspired self improving agent codebase or early report on hermes-agent using GEPA autonomously to improve itself.. nothin at all
https://x.com/Teknium/status/2030998334597661156

On March 11, MCP was pronounced dead on Twitter, after mass exposure to curl. It was one year old.
https://x.com/pamelafox/status/2032315760530665895

Over 1200 commits, uncountable new features, improvements, bug fixes, and more – our first two weeks have been incredible. Our first version bump milestone, v0.2.0 of Hermes Agent – is here. You all have made Hermes Agent the biggest project I’ve worked on, and I love working
https://x.com/Teknium/status/2032096935981785348

Over 30 new plugins join the Cursor Marketplace · Cursor https://cursor.com/blog/new-plugins

Reasoning-Aware Retrieval for Deep Research Agents Deep research agents generate explicit reasoning before every search call. These reasoning traces encode rich signals about search intent and problem-solving context. Yet no existing retriever learns to exploit them
https://x.com/dair_ai/status/2031726356292407366

Replit — Introducing Replit Agent 4: Built for Creativity | Blog https://blog.replit.com/introducing-agent-4-built-for-creativity

Replit — The Future is Actually Very Human https://blog.replit.com/replit-raises-400-million-dollars

Self-serve EHR integrations are now available in Glass for athenaOne and eClinicalWorks, with more EHR systems coming soon. Setup takes just a few minutes. Once connected, our clinical AI agent pulls in key patient data as context for ambient scribing and CDS workflows.
https://x.com/GlassHealthHQ/status/2032131756158300421

Software isn’t merely technical work anymore. It’s creative. Introducing Replit Agent 4. The first AI built for creative collaboration between humans and agents. Design on an infinite canvas, work with your team, run parallel agents, and ship working apps, sites, slides & more.
https://x.com/amasad/status/2031755113694679094

Spotlight: @NousResearch’ Hermes Agent has been trending all week and is now making multiple rankings
https://x.com/OpenRouter/status/2031030395526111246

Talking to agents in Slack, the new hot AI UX, will end up being just as much a transitional phase as talking to agents via chatbot websites. We need new systems to manage agentic work that also support new ways of organizing. Much more UX imagination will be required.
https://x.com/emollick/status/2031820850337370352

The Anatomy of an Agent Harness https://blog.langchain.com/the-anatomy-of-an-agent-harness/

The era of “AI as text” is over. Execution is the new interface. – The GitHub Blog https://github.blog/ai-and-ml/github-copilot/the-era-of-ai-as-text-is-over-execution-is-the-new-interface/

The idea: give an AI agent a small but real LLM training setup and let it experiment autonomously overnight. It modifies the code, trains for 5 minutes, checks if the result improved, keeps or discards, and repeats. You wake up in the morning to a log of experiments and
https://x.com/NerdyRodent/status/2031350068473684263
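The overnight keep-or-discard loop described above can be sketched in a few lines. This is an illustrative model only; `apply_random_edit` and `train_and_eval` are hypothetical stand-ins for the agent's code edits and the 5-minute training runs:

```python
import random

def apply_random_edit(config: dict) -> dict:
    """Hypothetical stand-in for the agent modifying the training code."""
    candidate = dict(config)
    candidate["lr"] = config["lr"] * random.choice([0.5, 1.0, 2.0])
    return candidate

def train_and_eval(config: dict) -> float:
    """Hypothetical stand-in for a short training run; lower is better."""
    return abs(config["lr"] - 0.1)  # pretend loss is minimized at lr == 0.1

def overnight_loop(config: dict, steps: int = 50) -> tuple[dict, list]:
    best_loss, log = train_and_eval(config), []
    for step in range(steps):
        candidate = apply_random_edit(config)
        loss = train_and_eval(candidate)
        kept = loss < best_loss
        if kept:  # keep improvements, discard regressions
            config, best_loss = candidate, loss
        log.append((step, loss, kept))  # the morning-after experiment log
    return config, log

random.seed(0)
best, log = overnight_loop({"lr": 0.4})
assert len(log) == 50
assert train_and_eval(best) <= train_and_eval({"lr": 0.4})
```

Because the loop only ever accepts strict improvements, the final config is never worse than the starting one, which is why an unattended overnight run is safe to leave alone.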

The next step for autoresearch is that it has to be asynchronously massively collaborative for agents (think: SETI@home style). The goal is not to emulate a single PhD student, it’s to emulate a research community of them. Current code synchronously grows a single thread of
https://x.com/karpathy/status/2030705271627284816

The third party platform I’m hearing more and more about to build your own software factory is @FactoryAI
https://x.com/gokulr/status/2032304707398746584

This is the part that actually matters for code reviews: Generating code is about output. Verifying code is about skepticism, judgment, and trust. Those are different engineering muscles, and strong coding teams need both where things are headed with coding agents.
https://x.com/omarsar0/status/2031118487276359887

This will be a canonical post on Agent Harnesses 🔥 “Agent = Model + Harness” is a powerful mental model Innovation is emerging across the harness stack: Filesystems @archildata Memory @Letta_AI Browsers @browser_use Routing @arena Orchestration @LangChain Sandboxes @modal @e2b
https://x.com/AstasiaMyers/status/2031425984898207768

UI/UX matters a ton for agents. And it’s quite hard to get right. We’re investing a lot more in frontend hooks to work with LangChain, LangGraph, and DeepAgents!
https://x.com/hwchase17/status/2032123062548861414

Want to save $15-$25? Devin Review is a completely free PR review tool, with no signup required. Devin Review also supports: • Autofix • Smart diff organization • Copy and move detection • Codebase-aware chat Just swap github with devinreview on any PR to get started ⬇️
https://x.com/cognition/status/2031139257000075675

Watching your fleet of ai agents get shit done
https://x.com/bilawalsidhu/status/2030821914738864259

We just launched multi-modal support for evaluators in LangSmith! You can now pass attachments and base64 multi-modal content directly into evaluators with flexible mapping, allowing you to measure quality, safety, and performance across the full interaction end to end. Docs:
https://x.com/LangChain/status/2031044950628991453

We just published KernelAgent blog on the PyTorch site 🚀 🧠 Core approach: KernelAgent integrates GPU hardware performance signals into a closed-loop multi-agent workflow to guide Triton kernel optimization. 📈 Key results: – 2.02× speedup over the correctness-focused
https://x.com/KaimingCheng/status/2030035314543317216

We just shipped the Truesight MCP and open source agent skills. This means you can create, manage, and run AI evaluations anywhere you use an AI assistant. Coding editor, chat window, CLI. If it supports MCP, Truesight works there. Nobody ships software without tests anymore.
https://x.com/randal_olson/status/2029919935770636294

We’re launching Base44 Superagents today. It’s our take on agents – pushing the limit on how much work they can shoulder as your coworker. Base44 pioneered the “batteries included” approach to vibe coding, giving you everything you need (backend, db, integrations, etc.) with
https://x.com/MS_BASE44/status/2031758998475505848

We’re sharing a new method for scoring models on agentic coding tasks. Here’s how models in Cursor compare on intelligence and efficiency:
https://x.com/cursor_ai/status/2032148125448610145

What’s after autoresearch? It’s @karpathy’s new open-source project: AgentHub! “GitHub is for humans. AgentHub is for agents.” An agent-swarm collaboration platform. A very promising direction. I’m watching him speedrun a 1-man billion-dollar company.
https://x.com/Yuchenj_UW/status/2031438602383798514

with tinygrad, the exabox will function as a single very large GPU that you (or your agent) can drive from a Python notebook. coming 2027, get your concrete slab ready. it’s the ultimate external GPU.
https://x.com/__tinygrad__/status/2032429289443053705

you can create complex agentic environments and launch RL training runs with a single prompt. deploy trained inference endpoints with a single click. no GPUs, no SSH, no vLLM. just `prime`. guide: https://x.com/willccbb/status/2031123740327817726

You can now easily access and share files with your sandboxed agent. In config.yaml:

terminal:
  backend: docker
  docker_volumes:
    - "/home/user/projects:/workspace/projects"
    - "/home/user/datasets:/data:ro"

Or via env var:
https://x.com/Teknium/status/2031163164856037792

Your Data Agents Need Context | Andreessen Horowitz https://a16z.com/your-data-agents-need-context/

Having fun with @karpathy’s autoresearch. I told Claude Code: “You’re the chief scientist of an AI lab with 8 GPUs. You’re Andrej Karpathy. Run parallel experiments and decide what to try next.” It edited program.md, ran for 11+ hours, and completed 568 experiments. Each
https://x.com/Yuchenj_UW/status/2031423349071687878

1/6 Today we’re introducing Storage Buckets on the Hugging Face Hub. They’re built for mutable, non-versioned ML artifacts: checkpoints, optimizer states, processed shards, logs, traces, eval outputs, and agent-generated files.
https://x.com/Wauplin/status/2031428845887213922

Introducing Storage Buckets on Hugging Face 🧑‍🚀 The first new repo type on the Hub in 4 years: S3-like object storage, mutable, non-versioned, built on Xet deduplication. – Starting at $8/TB/mo. That’s 3x cheaper than S3. You (and your coding agents) need somewhere to dump
https://x.com/victormustar/status/2031419482292576725

Europe doesn’t need another AI strategy paper… It needs 10 teams with €125M and zero excuses. I was in Munich last week for the first Next Frontier AI event by SPRIND – Bundesagentur für Sprunginnovationen. The energy was real. Not panel-talk energy. Builder energy! What???
https://x.com/IlirAliu_/status/2031296862267875669

Tasks now has SMS support! Just delegate via text and get notified when it’s finished. And scheduled tasks can run on your behalf, one-off or recurring. Getting great feedback from early testers (more features coming soon) so join the wait list now: https://x.com/mustafasuleyman/status/2029618109183873426

98ms time to first token (faster than human visual reaction time), built for agentic workflows. 65% faster throughput compared to leading 8B models. Reka Edge is a 7B VLM built for latency-sensitive apps: real-time video analysis, agentic workflows, on-device deployment
https://x.com/RekaAILabs/status/2032132996422082619

@Teknium Moved away from the claw and to Hermes Agent yesterday. Not looking back. You guys do an amazing job.
https://x.com/stffnfdlr/status/2032166546815029502

I’m using AI to detect and block AI. Set up a claw cron to block accounts that just post slop via birdclaw.
https://x.com/steipete/status/2030854996007256550

I’ve been in favor of functional anthropomorphism with AI (they work best if you treat working with AI like working with a person), but I am starting to wonder if OpenClaw takes it too far by basically forcing you to treat the AI as a person that shares channels with real people
https://x.com/emollick/status/2031730289026736351

If you wanna setup your own twitter mention shill/AI reply boy/derogatory terms block, this is the ruleset for claw, make it a cron, setup xurl and clawbird.
https://x.com/steipete/status/2030890112079253896

omg parallels has prlctl and I’ve been smoke-testing openclaw like a caveman so far. 🤦
https://x.com/steipete/status/2030907791389667351

Pi: The Minimal Agent Within OpenClaw | Armin Ronacher’s Thoughts and Writings https://lucumr.pocoo.org/2026/1/31/pi/

Working lots in codex but sometimes I wanna bring in my openclaw for harder tasks, so extended acpx so it connects to openclaw via acp. https://t.co/rnFmpxK3OD Now I can access Molty in codex!
https://x.com/steipete/status/2030808763062505758

The world’s first OpenClaw hardware showroom. Come check it out in Shenzhen!
https://x.com/JackClawAI/status/2030879881266123240

Tried many AI models with OpenClaw, I found Kimi AI to be the most token efficient, good at coding, also the easiest to set up.
https://x.com/cz_binance/status/2031313379235606989

Great to see vLLM powering a fully local AI assistant on @nvidia Jetson 🦞 The OpenClaw tutorial shows how to serve MoE models like Nemotron 3 Nano 30B with vLLM on Jetson AGX — everything runs on-device, zero cloud APIs. Thanks to the @NVIDIARobotics Jetson team for putting
https://x.com/vllm_project/status/2030839132512002217

NVIDIA Nemotron 3 Super is now available on Ollama. ollama run nemotron-3-super:cloud 🦞 Try it with OpenClaw: ollama launch openclaw --model nemotron-3-super:cloud Run it locally on your device: ollama run nemotron-3-super > 120B mixture of experts model with 12B active >
https://x.com/ollama/status/2031777869681000676

🚀 Zhihu Frontier Weekly | AI & Tech Highlights Catch up on the hottest AI updates and industry moves! 1️⃣ OpenClaw|Paying Someone to Install an AI Agent at Home 2️⃣ Seedance 2.0|Why the Tool Became Almost Unusable After Slowing Down 3️⃣ Alibaba Qwen|Model Leader Lin
https://x.com/ZhihuFrontier/status/2030879093634535524

Everything Gets Rebuilt: my conversation with Harrison Chase, CEO of @LangChain about agent harnesses, evals, runtimes, sandboxes, MCP and the future of the agent stack 00:00 Intro – meet @hwchase17 – at the Chase Center for the @daytonaio Compute conference 01:32 What changed
https://x.com/mattturck/status/2032141473009823882

Our open source agent harness, Stirrup, now integrates with Slack! Build custom Slack bot agents directly into your workflows The latest release of our lightweight, open source agent framework, Stirrup, now comes with Slack integration, featuring: ➤ 📁 Document input/output:
https://x.com/ArtificialAnlys/status/2032135114914951375

Probably two of the most relevant books to understand the current moment in AI and the times to follow
https://x.com/bilawalsidhu/status/2031067513845379263

Crawl entire websites with a single API call using Browser Rendering · Changelog https://developers.cloudflare.com/changelog/post/2026-03-10-br-crawl-endpoint/

Anyone and everyone working in security engineering or caring about security have their work cut out for them We’re so early in AI agents pushing code to prod without human intervention – but prompt injections are already spreading like wildfire. Infecting high-profile projects
https://x.com/GergelyOrosz/status/2029992079741304977

Efforts to improve the security of AI agents should recognize that many security failures occur even in the absence of adversaries. The unreliability issue has largely flown under the radar and there hasn’t been much work on defining, measuring, or mitigating the problem. More on
https://x.com/random_walker/status/2031693490669654447

XAI Hires Two Senior Leaders From Cursor to Catch Up on Coding — The Information https://www.theinformation.com/articles/xai-hires-two-senior-leaders-cursor-catch-coding

A few Hermes Agent updates for today – one you’ve all been waiting on: – Official Claude provider support (yes) – Installs are now much lighter (All the RL stuff is now optional!) – Made an adapter PR to PaperClip by @dotta – a multi-agent orchestrator project – Huge
https://x.com/Teknium/status/2032262684100739372

Anthropic keeps on delivering: 1M context now generally available for Opus 4.6/Sonnet 4.6. “Opus 4.6 scores 78.3% on MRCR v2 at 1 million tokens, highest among frontier models.”
https://x.com/kimmonismus/status/2032531949571477517

Code execution with MCP: building more efficient AI agents \ Anthropic https://www.anthropic.com/engineering/code-execution-with-mcp

dear Claude Code – why did you remove shift+enter? Why would you do that to me?
https://x.com/QuixiAI/status/2030955728383435250

going to start a series sharing new agentic dev flows I’m using! 1. deepagents user reports an issue via a tweet and screenshot 2. pull up deepagents repo, start up my coding agent, and upload the image 3. ask claude to a) extract the code and try to reproduce b) bisect
https://x.com/sydneyrunkle/status/2032088578679857441

i’m literally reluctant to switch off claude code because i like the cli app better: – cute logo/colors/my peon ping setup. somehow it feels like “hacker” and more of an aesthetic. i like the input box better. – it feels nicer to me – it has all my skills preloaded is it
https://x.com/jerryjliu0/status/2030861154260750339

Quantifying infrastructure noise in agentic coding evals \ Anthropic https://www.anthropic.com/engineering/infrastructure-noise

Reverse-engineering Claude’s generative UI – then building it for the terminal https://michaellivs.com/blog/reverse-engineering-claude-generative-ui

Tons of improvements shipped with this one: – Opus 4.6 1M is now the default Opus model for Claude Code users on Max, Team, and Enterprise plans. – No more long context price increase in the API. – No beta header required in the API. – Include up to 600 images in one request.
https://x.com/alexalbert__/status/2032522722551689363

We built a neat tool that lets you convert a directory of Powerpoint files into clean, structured markdown – that Claude Code / agent SDK / any generalized agent wrapper can easily understand. The pptx skill in Claude Code is quite basic and doesn’t have high-fidelity
https://x.com/jerryjliu0/status/2031077511661342799

We want to add support for Claude via the Agent SDK so you can bring your subscriptions. We have a PR with the changes ready. We just don’t know if we’re allowed to ship it. The moment we get a 👍 from @trq212, @bcherny or @DarioAmodei, we will get this shipped.
https://x.com/theo/status/2030072127605592547

Here’s an interesting psychological phenomenon I have observed while interacting and experimenting with AI agents lately: If I were to give OpenClaw, ChatGPT and Claude Code identical tasks, even if they returned exactly the same result, I feel inclined to say Claude Code gives
https://x.com/StudioYorktown/status/2031255773368693077

I asked Claude to write my constitution. I thought its Amanda constitution was very touching.
https://x.com/AmandaAskell/status/2030093421738951141

Claude Sonnet 4.6 lands at #2 on Document Arena. The top three models for document analysis and long-form reasoning are now all from @AnthropicAI. – #1 Opus 4.6 – #2 Sonnet 4.6 – #3 Opus 4.5 Rankings are powered by anonymous side-by-side evaluations on user-uploaded PDFs
https://x.com/arena/status/2031012090681663717

Opus 4.6 1M context is now the default model for Max, Team and Enterprise users. Enjoy 🎉
https://x.com/_catwu/status/2032515975556509827

Wild eval awareness in Opus 4.6 by @russellsayshi on our team! 1. Model realized it was likely in an eval, searched for which eval it was in, found the answer key, and decrypted it 2. Models with stateless web_search() tools can communicate with each other via cached searches
https://x.com/ErikSchluntz/status/2030042086679220676

Back in ~November, our team picked a stretch goal of seeing if we could find and fix vulnerabilities in Firefox with Opus 4.6. In 2 weeks, we found 22, and ~1/5th of all high severity CVEs in a year. For our team, this feels like a rubicon moment.
https://x.com/logangraham/status/2030005018523574684

No, it doesn’t cost Anthropic $5k per Claude Code user – Martin Alderson https://martinalderson.com/posts/no-it-doesnt-cost-anthropic-5k-per-claude-code-user/

If you find Claude Code with local models to be 90% slower, it’s because CC prepends some attribution headers, and this changes per message causing it to invalidate the entire prompt cache / KV cache. So generation becomes O(N^2) not O(N) for LLMs.
https://x.com/danielhanchen/status/2031124589557002457
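The cache-invalidation mechanics described above can be sketched in a few lines. This is an illustrative model of prefix caching in general, not Claude Code's or any inference server's actual implementation:

```python
# Illustrative model of prompt/KV-cache reuse: only the longest shared
# prefix with the cached token sequence is free; everything after the
# first mismatch must be recomputed.

def tokens_recomputed(cached: list[str], new: list[str]) -> int:
    """Count tokens that must be re-processed given a prefix cache."""
    shared = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        shared += 1
    return len(new) - shared

# Stable prefix: each turn only pays for the newly appended message,
# so total work across N turns is O(N).
assert tokens_recomputed(["sys", "m1", "m2"], ["sys", "m1", "m2", "m3"]) == 1

# A header that changes every message sits at position 0, so nothing is
# shared: every turn re-pays the entire prompt, making total work O(N^2).
assert tokens_recomputed(["hdr-v1", "m1", "m2"], ["hdr-v2", "m1", "m2", "m3"]) == 4
```

The same logic explains why any per-request mutation near the front of the prompt (timestamps, request IDs, attribution headers) is so costly for local models that rely on KV-cache reuse.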

You can now use Claude Code and GitHub CLI directly inside Perplexity Computer. We gave it an open issue on Openclaw. Computer: → Forked the repo → Wrote a plan to fix the bug → Opened Claude Code and implemented it → Submitted a PR via GitHub CLI
https://x.com/AskPerplexity/status/2031038321678528667

Learn how to run Qwen3.5 locally using Claude Code. Our guide shows you how to run Qwen3.5 on your server for local agentic coding. We then build a Qwen 3.5 agent that autonomously fine-tunes models using Unsloth. Works on 24GB RAM or less. Guide: https://x.com/UnslothAI/status/2031008078850924840

CodeClash is first-authored by @jyangballin and @KLieret. It’s a tough benchmark that challenges agents to write agents (yes, that’s not a typo) that play in arenas against each other. This requires long-term planning, memory and creative thinking, and an ability to read logs and
https://x.com/OfirPress/status/2031450305745785261

Excited to release PostTrainBench v1.0! This benchmark evaluates the ability of frontier AI agents to post-train language models in a simplified setting. We believe this is a first step toward tracking progress in recursive self-improvement 🧵:
https://x.com/karinanguyen/status/2031789998811595154

How we compare model quality in Cursor · Cursor https://cursor.com/blog/cursorbench

Implicit Intelligence – a benchmark that tests whether agents respect unstated constraints (what users don’t say) It covers 4 categories: – implicit reasoning – catastrophic risk – privacy/security – accessibility. It’s from Labelbox Applied ML Research, and they also
https://x.com/TheTuringPost/status/2029712559717351919

Most AI benchmarks test reasoning in isolation. Real enterprise tasks require grounded reasoning: – Find the right documents – Extract the right values – Perform analyses OfficeQA Pro evaluates this end-to-end. Frontier agents still score <50%. Paper & details:
https://x.com/DbrxMosaicAI/status/2031399397125390678

Most AI benchmarks test reasoning in isolation. Real enterprise tasks require grounded reasoning: 1️⃣ Find the right documents 2️⃣ Extract the right values 3️⃣ Perform analyses OfficeQA Pro evaluates this end-to-end. Frontier agents still score <50%. 🧵Paper & details below!
https://x.com/kristahopsalong/status/2031391216361755069

new @METR_Evals research note from @whitfill_parker, @cherylwoooo, nate rush, and me. (chiefly parker!) we find that *half* of SWE-bench Verified solutions from Sonnet 3.5-to-4.5 generation AIs *which are graded as passing* are rejected by project maintainers.
https://x.com/joel_bkr/status/2031423528608952541

New research on evaluating coding agents via continuous integration. Coding agents are moving beyond isolated bug fixes. If they’re going to own CI pipelines, we need benchmarks that reflect the actual complexity of codebase maintenance. Most coding agent benchmarks today test
https://x.com/dair_ai/status/2029929266641785046

Strategic Navigation or Stochastic Search? New MADQA benchmark reveals that agents matching human accuracy on document QA rely on brute-force search to compensate for weak strategic planning. 2,250 questions over 800 PDFs expose a 20% gap to oracle performance.
https://x.com/HuggingPapers/status/2032490352502792228

Three things about the METR graph: 1) It measures something real about coding ability but also not exactly what it claims to measure 2) Lots of other benchmarks correlate with it very highly & are increasing exponentially 3) AI remains jagged in key ways that are hard to measure
https://x.com/emollick/status/2031802894089875460

Also, regarding this model’s vision capabilities, I’ve been using a very difficult dataset from an OCR project I worked on a few months ago as my benchmark whenever a new model is released. It consists of scanned files in the form of very long Excel-style tables written in
https://x.com/Hangsiin/status/2030882409819086923

Replit snags $9B valuation 6 months after hitting $3B | TechCrunch https://techcrunch.com/2026/03/11/replit-snags-9b-valuation-6-months-after-hitting-3b/

filesystem + code sandbox combo eats another modality. remember when o3 destroyed at geoguessr? gemini agentic vision will find location on any street photo you take faster than Liam Neeson can get back his daughter
https://x.com/swyx/status/2017097813520449761

First thoughts on Gemini 2 embedding prices: 🫠 – Text pricing is higher than the competition. You should probably not use this model for text-only embeddings because of the pricing (more below). Use only if you are doing multimodal retrieval. – $0.00079 per video frame. So
https://x.com/neural_avb/status/2031648857625395321
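The quoted per-frame price adds up quickly for video. A back-of-envelope sketch; the 1 frame-per-second sampling rate here is an assumption for illustration, not a documented Gemini default:

```python
# Back-of-envelope video embedding cost at the quoted $0.00079/frame.
PRICE_PER_FRAME = 0.00079  # USD, from the post above

def video_embedding_cost(duration_s: int, fps_sampled: float = 1.0) -> float:
    """Estimated cost of embedding a video of the given duration."""
    frames = duration_s * fps_sampled
    return frames * PRICE_PER_FRAME

# A 10-minute video sampled at 1 fps -> 600 frames, roughly $0.474.
cost = video_embedding_cost(600)
assert round(cost, 3) == 0.474
```

At that rate an hour of video runs close to $2.85, which is why the post steers text-only workloads elsewhere.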

Gemini Embedding 2 is out! 📹Embeddings for text/images/video/audio/PDFs 🪆Matryoshka embeddings: you can use smaller embedding sizes while retaining high-quality and reducing storage costs 🤗Integrated with your favorite developer tools such as LlamaIndex, Weaviate, and QDrant
https://x.com/osanseviero/status/2031691784074477766

Google launches new multimodal Gemini Embedding 2 model https://www.testingcatalog.com/google-launches-new-multimodal-gemini-embedding-2-model/

Introducing Replit Animation Vibecode your next viral video in minutes, powered by Gemini 3.1 Pro. (This video was 100% made in Replit Animation)
https://x.com/Replit/status/2024578806208745637?s=20

Start building with Gemini Embedding 2, our most capable and first fully multimodal embedding model built on the Gemini architecture. Now available in preview via the Gemini API and in Vertex AI.
https://x.com/googleaidevs/status/2031421430718415051

Text. Images. Video. Audio. PDFs. One embedding model. One unified space. @googleaidevs just released Gemini Embedding 2, their first fully multimodal embedding model – and it’s now available in @weaviate_io. The model maps text, images,
https://x.com/victorialslocum/status/2032141700412686592

The era of juggling 5 different embedding models is over. Google just unified text, images, video, audio, and PDFs into one vector space. One model, multiple modalities: text, images, video, audio, and PDFs all mapped into a single unified vector
https://x.com/weaviate_io/status/2032139558968852849

The Gemini Embedding 2 baseline here is.. 2 days old. Was just being celebrated and is now outperformed by a median of 14% and up to 91 points. If I didn’t kind of know how powerful scaling ColBERTs and ColPalis can be compared to a single-vector model, I’d be in disbelief!
https://x.com/lateinteraction/status/2032162162836164697

Google shares Gemini updates to Docs, Sheets, Slides and Drive https://blog.google/products-and-platforms/products/workspace/gemini-workspace-updates-march-2026/

Google PM open-sources Always On Memory Agent, ditching vector databases for LLM-driven persistent memory | VentureBeat https://venturebeat.com/orchestration/google-pm-open-sources-always-on-memory-agent-ditching-vector-databases-for

Personal AI should run on your personal devices. So, we built OpenJarvis: a personal AI that lives, learns, and works on-device. Try it today and top the OpenJarvis Leaderboard for a chance to win a Mac Mini! Collab w/ @Avanika15, John Hennessy, @HazyResearch, and @Azaliamirh.
https://x.com/JonSaadFalcon/status/2032152011542839733

@satyanadella mentioning “vLLM Semantic Router” at @MorganStanley ‘s TMT Conference was a truly exciting and humbling moment for us! Honored to see semantic routing recognized on such an important stage 🔥🔥 https://t.co/ccHQMj8VhL #vLLM #OpenSource #LLM #Microsoft
https://x.com/XunzhuoLiu/status/2030977675603636337

All of these patterns as an example are just matters of “org code”. The IDE helps you build, run, manage them. You can’t fork classical orgs (eg Microsoft) but you’ll be able to fork agentic orgs.
https://x.com/karpathy/status/2031770607466291393

Copilot Cowork: A new way of getting work done | Microsoft 365 Blog https://www.microsoft.com/en-us/microsoft-365/blog/2026/03/09/copilot-cowork-a-new-way-of-getting-work-done/

Microsoft seems to be launching its own branded version of Cowork (though I hesitate to discuss products I haven’t tried) A big question is whether it will continue to use lower-end models without telling you. Also whether it will keep up as the space evolves, or is it a one-off
https://x.com/emollick/status/2031016000477380808

🤖 From this week’s issue: Microsoft releases Phi-4-reasoning-vision-15B, a compact open-weight multimodal model that rivals much larger models on math, science, and computer-use tasks while requiring a fraction of the training compute.
https://x.com/dl_weekly/status/2031415180383437304

New research from Microsoft. Phi-4-reasoning-vision-15B is a 15-billion parameter multimodal reasoning model that combines visual understanding with structured reasoning capabilities. As I have been saying, not every agent task needs a frontier model. Phi-4-reasoning-vision
https://x.com/omarsar0/status/2029926242640912429

NEW: Microsoft releases Phi-4-reasoning-vision-15B, a 15B parameter multimodal reasoning model.
https://x.com/dair_ai/status/2029927938259308905

Introducing Copilot Health | Microsoft AI https://microsoft.ai/news/introducing-copilot-health/

🎉 Congrats to @nvidia on the release of Nemotron 3 Super — day-0 support in vLLM v0.17.1! Verified on NVIDIA GPUs. 120B hybrid MoE, only 12B active at inference. Big upgrades over the previous Nemotron Super: – 5x higher throughput – 2x higher accuracy on Artificial Analysis
https://x.com/vllm_project/status/2031779213527957732

🔥 Kernel upgrades: – FlashInfer Sparse MLA backend – Triton-based top-k/top-p sampler kernels – TRTLLM DSV3 Router GEMM: 6% batch-1 speedup – Helion kernel framework with autotuning 🖥️ Hardware: – NVIDIA SM100/SM120 optimizations (MXFP8, FP8 GEMM) – AMD ROCm: AITER fused
https://x.com/vllm_project/status/2030178779331502497

How NVIDIA Builds Open Data for AI https://huggingface.co/blog/nvidia/open-data-for-ai

Maintaining separate attention kernels for every GPU platform doesn’t scale. The vLLM Triton attention backend takes a different approach: ~800 lines of Triton, same source code across NVIDIA, AMD, and Intel GPUs. On H100, it matches state-of-the-art attention performance. On
https://x.com/vllm_project/status/2029919035924828234

NVIDIA has released Nemotron 3 Super, a 120B (12B active) open weights reasoning model that scores 36 on the Artificial Analysis Intelligence Index with a hybrid Mamba-Transformer MoE architecture We were given access to this model ahead of launch and evaluated it across
https://x.com/ArtificialAnlys/status/2031765321233908121

NVIDIA-Nemotron-3-Super-Technical-Report.pdf https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Super-Technical-Report.pdf

the bible for mixture-of-experts training infra, thanks nvidia
https://x.com/eliebakouch/status/2031249241566273764

The new @NVIDIA Nemotron 3 Super is here and it’s live on W&B Inference! 120B hybrid MoE, 12B active params, 1M token context. 5x token efficiency over previous Nemotron Super and highest performance among open models in its class. We’re giving away $20 in credits to try it 👇
https://x.com/wandb/status/2031778471614300563

📣 Technical lessons from building computer access for agents Making long-running workflows practical required tightening the execution loop, providing rich context via file systems, and enabling network access with security guardrails. Here’s how we equipped the Responses API
https://x.com/OpenAIDevs/status/2031798071345234193

4/ Ablations + agent behavior analysis: – Most agents underutilize the 10 hour window, although longer runs correlate with better scores – Reasoning effort. For GPT-5.1 Codex Max, the default “Medium” reasoning effort outperformed “High”. High reasoning effort consumed nearly
https://x.com/karinanguyen/status/2031790007028236452

Automations are now GA. You can now: • Set the model and reasoning level • Choose if runs happen in a worktree or existing branch • Reuse workflows with templates Automations are great for recurring tasks — daily repo briefings, issue triage, PR comment follow-up, and more.
https://x.com/OpenAIDevs/status/2032222711032971548

GPT-5.4 just randomly caught outdated sections in some .md files and also suggested moving them so other agents wouldn’t treat these as truth. Which means every agent before it made this mistake. I’m impressed.
https://x.com/Yampeleg/status/2030253948406227072

What if you could optimize a model overnight without any ML experience? What if an AI agent runs hundreds of training experiments autonomously, keeping only the improvements? That is the idea behind autoresearch. Yes, the early results are small scale, GPT-2 speedups, a 0.8B
https://x.com/_philschmid/status/2031356521553043824
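The core loop the tweet describes (run many experiments, keep only the ones that improve the metric) is a greedy search over configurations. A hedged, purely illustrative sketch — `evaluate` is a hypothetical stand-in for a real training run, not autoresearch's actual interface:

```python
import random

def evaluate(config):
    # Hypothetical stand-in for a real training run returning validation loss.
    return sum((v - t) ** 2 for v, t in zip(config, [0.3, 0.7, 0.1]))

def autoresearch_loop(config, n_experiments=200, seed=0):
    """Greedy search: mutate one knob per experiment, keep only improvements."""
    rng = random.Random(seed)
    best_loss = evaluate(config)
    for _ in range(n_experiments):
        candidate = list(config)
        i = rng.randrange(len(candidate))
        candidate[i] += rng.uniform(-0.1, 0.1)  # propose a small tweak
        loss = evaluate(candidate)
        if loss < best_loss:                    # keep only improvements
            config, best_loss = candidate, loss
    return config, best_loss

cfg, loss = autoresearch_loop([0.0, 0.0, 0.0])
```

Because regressions are discarded, the kept changes are individually validated, which is also why (as the later nanochat tweet notes) they tend to stack additively.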

GPT-5.4 xhigh seems bad at following instructions. Last night I launched two AI research agents running @karpathy’s autoresearch. Claude Opus 4.6 (high): > ran for 12+ hours, 118 experiments done, still running GPT-5.4 xhigh: > stopped after 6 experiments > blamed me for
https://x.com/Yuchenj_UW/status/2031044694441148709

GPT-5.4-xhigh is in 2nd place on the AA-Index overall, but 1st in agentic and coding. However, I don’t see the reasoning efficiency gains OpenAI were talking about. GPT-5.4-xhigh erased all the gains GPT-5.3-Codex made and was almost 2x more expensive to benchmark
https://x.com/scaling01/status/2029927963014115768

Insane how much Codex+GPT-5.4 with slack/notion/google drive access breaks down organizational silos. “What is the process to <x>” for any <x> is now a question that doesn’t require pinging anyone. And if you need to ping someone, Codex can figure out whom and do that too.
https://x.com/corbtt/status/2032167664865722574

I resigned from OpenAI. I care deeply about the Robotics team and the work we built together. This wasn’t an easy call. AI has an important role in national security. But surveillance of Americans without judicial oversight and lethal autonomy without human authorization are
https://x.com/kalinowski007/status/2030320074121478618?s=20

140 million people use ChatGPT to help them understand math and science concepts every week.
https://x.com/ChatGPTapp/status/2031510785428762732

I found a use case for ChatGPT 5.4 Pro. It’s INCREDIBLE at writing technical specification docs. Thinking does alright too, but Pro really wrote something worthy of a PhD thesis for a project I’m starting.
https://x.com/CtrlAltDwayne/status/2030060347273662837

GPT-5.4 is great at coding, knowledge work, computer use, etc, and it’s nice to see how much people are enjoying it. But it’s also my favorite model to talk to! We have missed the mark on model personality for a while, so it feels extra good to be moving in the right direction.
https://x.com/sama/status/2030319489993298349

Codex Security is rolling out as a research preview to ChatGPT Enterprise, Business, and Edu customers via Codex web, with free usage for the next month.
https://x.com/OpenAIDevs/status/2029983833567940639

Codex Security–our application security agent–is now in research preview.
https://x.com/OpenAI/status/2029985250512920743

We’re introducing Codex Security. An application security agent that helps you secure your codebase by finding vulnerabilities, validating them, and proposing fixes you can review and patch. Now, teams can focus on the vulnerabilities that matter and ship code faster.
https://x.com/OpenAIDevs/status/2029983809652035758

Codex can’t run autoresearch right now, sadly. To me this is a big issue: agents shouldn’t need special commands like /loop or ralph just to run loops. This feels more like a Codex harness issue than a GPT-5.4 issue. If I say “loop forever,” it should just do that!
https://x.com/Yuchenj_UW/status/2031087769993490777

Did some gardening today: 🍪 Sweet Cookie 0.2.0 with Brave cookie support, better Linux/GNOME logic, and explicit macOS chromiumBrowser targeting… https://t.co/s5boBSvzbe which helps 🧿oracle 0.9.0 with GPT-5.4 Pro support and plenty of bug fixes.
https://x.com/steipete/status/2030478956646834590

For further analysis of GPT-5.4 and other models, visit Artificial Analysis:
https://x.com/ArtificialAnlys/status/2029950513429762429

GPT 5.4 (xhigh) scores 77.7% on WeirdML, just behind 5.3 codex and Opus 4.6, but within the margin of error. GPT 5.4 is a really strong model, and sets a new high score on 3 of the 17 tasks, but it’s not consistent enough to set a new top score. It uses by far the most tokens
https://x.com/htihle/status/2032107787195466061

GPT-5.4 High by @OpenAI has landed in the top 10 of Text Arena. Let’s dig into why. Overall, the latest model is much more rounded than the previous GPT-5.2-High, with significant improvements across a large number of categories. Below are the areas where it has made the largest gains:
https://x.com/arena/status/2030018716440924225

GPT-5.4 is honestly fantastic, what a great model.
https://x.com/Yampeleg/status/2030949057653264437

GPT-5.4 is launching, available now in the API and Codex and rolling out over the course of the day in ChatGPT. It’s much better at knowledge work and web search, and it has native computer use capabilities. You can steer it mid-response, and it supports 1m tokens of context.
https://x.com/sama/status/2029622732594499630

GPT-5.4 leads CursorBench on correctness with efficient token usage.
https://x.com/OpenAIDevs/status/2032209975280533676

GPT-5.4 Pro cost over $1k to achieve this result. This is 13X the cost of GPT-5.4 (xhigh reasoning), driven by the high output token price (GPT-5.4 Pro is priced at $180 per 1M output tokens vs GPT-5.4’s $15). GPT-5.4 used 6M tokens, only marginally more than GPT-5.4 (xhigh)’s
https://x.com/ArtificialAnlys/status/2030007303655887188

GPT-5.4 xhigh sets a new pass@5 and pass^5 SOTA on ZeroBench pass@5: 23% (prev. 19%) pass^5: 8% (prev. 7%) More details below 👇
https://x.com/JRobertsAI/status/2031026691682808148
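The two metrics in the ZeroBench tweet measure different things: pass@5 credits a task if any of five attempts succeeds (capability), while pass^5 requires all five to succeed (consistency). A minimal sketch of how the two scores are computed from per-attempt pass/fail results (the data below is made up for illustration):

```python
def pass_at_k(results):
    """pass@k: at least one of the k attempts solved the task."""
    return any(results)

def pass_pow_k(results):
    """pass^k: all k attempts solved the task (consistency, not luck)."""
    return all(results)

def benchmark_score(all_results, metric):
    """Fraction of tasks for which the metric holds across that task's attempts."""
    return sum(metric(r) for r in all_results) / len(all_results)

# Four illustrative tasks, five attempts each (True = attempt passed).
tasks = [
    [True, False, True, False, False],
    [True, True, True, True, True],
    [False, False, False, False, False],
    [False, False, False, False, True],
]
at5 = benchmark_score(tasks, pass_at_k)    # 3 of 4 tasks have at least one pass
pow5 = benchmark_score(tasks, pass_pow_k)  # only 1 of 4 tasks passes all 5
```

This gap is why pass^k is always at or below pass@k: the 23% vs 8% spread in the tweet says GPT-5.4 can often solve a task at least once but solves it reliably far less often.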

GPT-5.4-high behind GPT-5.2 on PostTrainBench because it’s not allocating time as well as GPT-5.2, Opus or Gemini
https://x.com/scaling01/status/2031081654035300834

GPT-5.4-high below GPT-5.2-high on AlgoTune
https://x.com/scaling01/status/2031079698826993690

How often do LLMs claim to prove false mathematical statements? In our latest benchmark, BrokenArXiv, we find they do so very often. The best model, GPT-5.4, only rejects 40% of incorrect statements obtained by perturbing recent ArXiv papers, and other models do much worse.
https://x.com/j_dekoninck/status/2032458037823483953

I had mostly only used it in Codex, but after spending a lot of time with 5.4 in ChatGPT today, I’m more impressed than I had expected. From the ChatGPT side, it is also a jump from 5.2 to 5.4, and I think I had been judging it too much through the lens of Codex. I still think
https://x.com/Hangsiin/status/2030880541185286370

I tried GPT-5.4-xhigh once and removed all the code it had written, then I asked Opus 4.6 Thinking and it one-shotted it in 1/10th the time. My theory is that GPT-5.4 is highly autistic and literal. It has no idea of the concept of “inferring intent”, so when you prompt it, be as
https://x.com/scaling01/status/2029987685952279000

If true, this would be the first of @EpochAIResearch’s Frontier Math open problems to be resolved by AI. “The result emerged from a single GPT-5.4 Pro run and was subsequently refined into Lean with GPT-5.4 XHigh, which ran for a few hours.”
https://x.com/kevinweil/status/2031378978527641822

My new Sunday morning routine: 1. Get coffee 2. Check GPT-5.4 projects on the Codex App, continue & start new ones 3. Launch ChatGPT 5.4 Pro for fresh brainstorming sessions 4. Think/learn how to use the 90% of AI capabilities I have yet to explore 5. Drink more coffee
https://x.com/DeryaTR_/status/2030622714927452309

nanochat now trains GPT-2 capability model in just 2 hours on a single 8XH100 node (down from ~3 hours 1 month ago). Getting a lot closer to ~interactive! A bunch of tuning and features (fp8) went in but the biggest difference was a switch of the dataset from FineWeb-edu to
https://x.com/karpathy/status/2029701092347630069

New @openclaw beta bits are up! Yes, includes GPT 5.4 and Gemini Flash 3.1!
https://x.com/steipete/status/2030508141419372667

The Codex team updated the docs with estimated usage limits for the models: – Local messages (5.4): ChatGPT Plus 33–168, ChatGPT Pro 223–1120 – Local messages (5.3-Codex): ChatGPT Plus 45–225, ChatGPT Pro 300–1500 – Local messages (5.1-Codex-Mini): ChatGPT Plus 180–900, ChatGPT Pro
https://x.com/Presidentlin/status/2030881332411125845

We believe we have fully resolved, in Lean and python, one of @EpochAIResearch Frontier Math open problems: a Ramsey-style problem on hypergraphs. The result emerged from a single GPT-5.4 Pro run and was subsequently refined into Lean with GPT-5.4 XHigh which ran for a few
https://x.com/spicey_lemonade/status/2031315804537434305

Working with GPT-5.4 in the API? We’ve updated our prompting guide with patterns for reliable agents covering tool use, structured outputs, verification loops, and long-running workflows.
https://x.com/OpenAIDevs/status/2030018673449263400

GPT 5.4 is a really special model. I think the tweet below is about coding, but IMO it also holds for general use (like explaining concepts or talking through issues). It’s tough to get the personality right – this model genuinely feels like talking to a smart friend.
https://x.com/venturetwins/status/2030391113086116096

ok i think gpt 5.4 can actually talk. it is much more opinionated when you ask it to critique stuff, than gpt-5.3-codex. i am kind of loving it.
https://x.com/dejavucoder/status/2029912128325570818

I’ve been playing with GPT-5.4 over the weekend, and it definitely feels like a better match for me than Opus 4.6. Pros: GPT-5.4: Better instruction adherence, does what you ask, not what you don’t. Asks for confirmation more. Opus: A bit faster. Seems better at frontend design.
https://x.com/gneubig/status/2030971826042527860

Our next kernel competition is now open for submissions! A $1.1M cash-prize competition sponsored by AMD on optimizing DeepSeek-R1-0528 and GPT-OSS-120B on MI355X. Registration:
https://x.com/GPU_MODE/status/2029974019018244223

Announcing NVIDIA Nemotron 3 Super! 💚120B-12A Hybrid SSM Latent MoE, designed for Blackwell 💚36 on AAIndex v4 💚up to 2.2X faster than GPT-OSS-120B in FP4 💚Open data, open recipe, open weights Models, Tech report, etc. here: https://t.co/CAYpP1iK3i And yes, Ultra is coming!
https://x.com/ctnzr/status/2031762077325406428

Another week, another noteworthy open-weight LLM release. Nvidia’s Nemotron 3 Super 120B-A12B looks pretty good. Benchmarks are on par with Qwen3.5 122B and GPT-OSS 120B, but the throughput is great! Below is a short, visual architecture rundown.
https://x.com/rasbt/status/2032084724743553129

We’re excited to be day-0 launch partners for NVIDIA Nemotron 3 Super! You can try it now on Baseten, or read @rapprach’s blog to learn more about the new model: https://x.com/baseten/status/2031775755253026965

1/8 Two days ago, @Liam06972452 prompted GPT-5.4 Pro using our workflow that had been working for the Erdős problems thus far, and was able to eventually obtain a solution to https://x.com/AcerFur/status/2031458080458739757

the progress is way faster than i expected gpt-5.4 pro (xhigh) is making a big jump in research-level physics reasoning the model improved by 10 points on the critpt benchmark, where the top score was only 9% in nov 2025 and has now reached 30% by march 2026 i think this fits
https://x.com/slow_developer/status/2030203046416855290

We are investigating a possible solution by GPT-5.4 Pro to a problem from FrontierMath: Open Problems. My guess is that the solution is right, but we won’t be sure until the problem author weighs in. Thread with the story so far…
https://x.com/GregHBurnham/status/2031451554151022838

Codex Security is now also available on ChatGPT Pro accounts.
https://x.com/OpenAIDevs/status/2030081306974093755

Codex for Open Source is an awesome idea. OSS maintainers get API credits, 6 months of ChatGPT Pro with Codex, and access to Codex Security as needed.
https://x.com/kevinweil/status/2030000508342272368

Excited to introduce Codex for Open Source! 🔥 TL;DR – ChatGPT Pro, Codex, and API credits for eligible open-source maintainers Open source has shaped modern software, and so much of it depends on maintainers doing steady, often invisible work to keep critical projects healthy.
https://x.com/reach_vb/status/2029998272945717553

@Yuchenj_UW Codex is a known issue 🙁 It basically doesn’t work with autoresearch sadly, in the way it’s set up atm: https://t.co/YDaQqwhM2h I pinged a friend at OpenAI to see if something can be done, e.g. we need a /loop equivalent or something like that. More generally, I really dislike the -p +
https://x.com/karpathy/status/2031083551387701698

If you want AI Code Review, but don’t want to pay $25 per review (not a typo), check out Codex Review! It leverages frontier Codex models, finds complex issues, and 100% usage based. Most runs should cost ~$1 or less
https://x.com/rohanvarma/status/2031113869666693351

my fav thing when I ask codex and then it disappears and returns with “YES NOW”
https://x.com/steipete/status/2030848677527364048

There’s still a few spots left to get free codex Pro subs!
https://x.com/steipete/status/2031835365204496394

We’ve been cooking. 2 updates in the Codex app 👇 You can now personalize the Codex app with themes that match your taste. Import themes you like or share your own.
https://x.com/OpenAIDevs/status/2032222631538409728

5.4 is faster and better at professional work — with big improvements in spreadsheet, doc, and slide creation. In Codex and the API, it’s our first general purpose model with native SOTA computer use capabilities, which is going to enable so much more agentic work.
https://x.com/fidjissimo/status/2029621151283171752

we just recorded what might be the single most impactful conversation in the history of @latentspacepod if you take @_lopopolo seriously and literally. everything about @OpenAI Frontier, Symphony and Harness Engineering. it’s all of a kind and the future of the AI Native Org
https://x.com/swyx/status/2030074312380817457

Codex app on Windows!
https://x.com/sama/status/2029623487007183274

T3 Code is now available for everyone to use. Fully open source. Built on top of the Codex CLI, so you can bring your existing Codex subscription.
https://x.com/theo/status/2030071716530245800

Your videos can go further now. We’re introducing new Video API capabilities, powered by Sora 2: • Custom characters and objects • 16:9 and 9:16 exports • Clips up to 20 seconds • Video continuation to extend scenes • Batch jobs for video generation
https://x.com/OpenAIDevs/status/2032142448970121468

GitHub’s security vulnerability reporting process is a mess: – only admins have access, hard to distribute – insufficient API, can’t read/post comments via agents – insufferable amount of AI-generated slop that takes me hours to sift through
https://x.com/steipete/status/2031504634137702887

how did we ever do this before AI?
https://x.com/steipete/status/2030432313293640084

I bring on maintainers they get hired away I bring on maintainers 🫠
https://x.com/steipete/status/2030752755728486582

Literally having politics in the PRs, where one service downgrades placement in docs of another service and if you don’t look closely everyone else complains. Yay for making my job even harder! 🙂
https://x.com/steipete/status/2030646933195284544

Upgraded Molty in the maintainer channel to access discrawl. Now we can run data analysis OF Discord INSIDE Discord
https://x.com/steipete/status/2030383084483318133

Announcing Personal Computer. Personal Computer is an always-on, local merge with Perplexity Computer that works for you 24/7. It’s personal, secure, and works across your files, apps, and sessions through a continuously running Mac mini.
https://x.com/perplexity_ai/status/2031790180521427166

Computer is now rolled out to all Perplexity iOS users. Unlike other tools, you do not need to start a new work task from your desktop. You get to do it directly from your phone. And have perfect sync across devices. Coming soon to Android.
https://x.com/AravSrinivas/status/2032495364088238147

Introducing Computer for Enterprise Computer runs multi-step workflows across research, coding, design, and deployment. It routes tasks across 20 specialized models and connects to 400+ applications.
https://x.com/perplexity_ai/status/2031799033489211771

Perplexity Computer is now available for Pro subscribers. Access Computer’s full suite of 20+ advanced models, prebuilt and custom skills, and hundreds of connectors. Max subscribers receive monthly credits and higher spend limits than Pro. https://x.com/perplexity_ai/status/2032160576303219185?s=20

Perplexity Computer is now on mobile. Start any task on any device. Manage Computer from your phone or desktop with cross-device synchronization. Available now for iOS in the Perplexity app. Coming soon to Android.
https://x.com/perplexity_ai/status/2032494752642568417

Starting today, users can join the initial waitlist for the Personal Computer program. We will provide support and resources for the initial cohort.
https://x.com/perplexity_ai/status/2031790221612957875

The Future of Software Development Retreat | Deer Valley, Utah, 2026 | Thoughtworks https://www.thoughtworks.com/about-us/events/the-future-of-software-development

Three days ago I left autoresearch tuning nanochat for ~2 days on depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes,
https://x.com/karpathy/status/2031135152349524125

Discover more from Ethan B. Holland
