Image created with gemini-3.1-flash-image-preview, with the prompt written by claude-sonnet-4-5. Image prompt: Vintage 1990s screen-printed t-shirt graphic in single-color deep red ink on worn mustard-yellow cotton fabric, showing a simple cartoon outline of a tall wooden lifeguard tower on a beach with whistle and binoculars, bold text ‘OPENAI’ arcing across top in retro novelty shirt typography, slightly imperfect printed look with aged fabric texture and minor stains, humorous local beach town charm
GPT-5.4 is really good at spreadsheets; a few finance people have finally said things to me like “huh, I guess this AI thing is real”
https://x.com/sama/status/2030318213482131670
ChatGPT 5.4 Thinking is insanely good at creating Excel models. This wasn’t even ChatGPT in Excel: 5 well-formatted, researched, and modeled sheets. Pretty great.
https://x.com/mweinbach/status/2030045514918416411
Threw 5 large Excel and two very long Word docs into GPT 5.4… Wildly impressive results. That is some context window you have there, 5.4…
https://x.com/BenBajarin/status/2030067195787759958
1/ The rivalry between OpenAI & Anthropic continues: GPT 5.4 is now the best model in the world at filing taxes (better than Opus 4.6)! We just ran TaxCalcBench on GPT-5.4. 56.86% of tax returns computed perfectly. That’s #1 overall: the first model to break 55%, surpassing
https://x.com/michaelrbock/status/2029931536636858694
From model to agent: Equipping the Responses API with a computer environment | OpenAI https://openai.com/index/equip-responses-api-computer-environment/
GPT-5.4 Thinking and GPT-5.4 Pro are rolling out now in ChatGPT. GPT-5.4 is also now available in the API and Codex. GPT-5.4 brings our advances in reasoning, coding, and agentic workflows into one frontier model.
https://x.com/OpenAI/status/2029620619743219811
I’m super excited to welcome @iwebst, Michael D’Angelo, and the Promptfoo team to OpenAI. As enterprises deploy AI coworkers into real workflows, evaluation, security, and compliance become foundational requirements. Promptfoo has built a great set of tools for automated
https://x.com/snsf/status/2031055866024120825
OpenAI to acquire Promptfoo | OpenAI https://openai.com/index/openai-to-acquire-promptfoo/
Promptfoo is joining OpenAI | Promptfoo https://www.promptfoo.dev/blog/promptfoo-joining-openai/
We’re acquiring Promptfoo. Their technology will strengthen agentic security testing and evaluation capabilities in OpenAI Frontier. Promptfoo will remain open source under the current license, and we will continue to service and support current customers.
https://x.com/OpenAI/status/2031052793835106753
OpenAI hardware exec Caitlin Kalinowski quits in response to Pentagon deal | TechCrunch https://techcrunch.com/2026/03/07/openai-robotics-lead-caitlin-kalinowski-quits-in-response-to-pentagon-deal/
OpenAI’s robotics lead, Caitlin Kalinowski, has resigned over a US military contract, citing concerns over “surveillance of Americans without judicial oversight and lethal autonomy without human authorization.”
https://x.com/TheHumanoidHub/status/2030390204977275357
ChatGPT “adult mode” and erotica delayed, OpenAI says https://www.axios.com/2026/03/06/openai-delays-chatgpt-adult-mode
AI is progressing rapidly: GPT-5.4 Pro (xhigh) has achieved a massive 10 point gain in CritPt, a benchmark where the highest score was only 9% in Nov ’25 This is the largest incremental gain we have seen from a single release. CritPt is a benchmark with a private dataset that
https://x.com/ArtificialAnlys/status/2030007301529358546
GPT-5.4 completely destroys GPT-5.2 in the Arena
https://x.com/scaling01/status/2030020396544630999
GPT-5.4-high has landed in the Code Arena top 6. Set up with the Codex Harness, @OpenAI’s latest model is on par with Gemini 3.1 Pro Preview for real-world web development tasks. Highlights: – top 6 in WebDev overall – #6 for Multi-File React – top 10 for Single-File HTML
https://x.com/arena/status/2032126328842117612
GPT-5.4-xhigh takes 1st place on LiveBench with extremely strong scores in reasoning and coding categories
https://x.com/scaling01/status/2029924473520914752
OpenAI’s new GPT-5.4 (xhigh) lands equal first in the Artificial Analysis Intelligence Index alongside Gemini 3.1 Pro, but at a cost increase compared to GPT-5.2. @OpenAI’s GPT-5.2 (xhigh, 51) was the most intelligent model as of the end of 2025. Since then, OpenAI released two
https://x.com/ArtificialAnlys/status/2029950497516573183
Prompt guidance for GPT-5.4 | OpenAI API https://developers.openai.com/api/docs/guides/prompt-guidance
New ways to learn math and science in ChatGPT | OpenAI https://openai.com/index/new-ways-to-learn-math-and-science-in-chatgpt/
Let’s look at the criteria for “weak AGI”: ✅Loebner Prize was a weak Turing Test, equivalent achieved by GPT-4.5 ✅Winograd passed by GPT-3 ✅SAT passed at 75% by GPT-4 Only remaining thing is playing an old Atari game from 1984. The labs could do the funniest thing right now
https://x.com/emollick/status/2031519480371683594
Harness engineering: leveraging Codex in an agent-first world | OpenAI https://openai.com/index/harness-engineering/
I had Codex create a version of the map of the lighthouses of the Northern seas, including real colors, light patterns & distances But then I had it also create a mode set in a Lovecraftian 1920s where you need to place lighthouses to ward off monsters: https://x.com/emollick/status/2031565633217863881
IMO people still think of Codex as a tool for coding, when really you can do all kinds of data analysis/work there.
https://x.com/steipete/status/2030377225485263311
Very grateful to Jensen for working to expand Nvidia capacity at AWS so much for us!
https://x.com/sama/status/2030318958512164966
Forgot to mention /fast! I think people will like this.
https://x.com/sama/status/2029623948980416681
GPT-5.4 is really good. I immediately notice its boost in understanding and ability to solve problems quickly and completely. I’m using it to create a compiler. Claude Code is pretty much stumped. GPT-5.3 was making slow progress. But GPT-5.4 just *gets* it.
https://x.com/QuixiAI/status/2029673108207026669
We will be able to fix these three things!
https://x.com/sama/status/2029627696314208257
What is the hardest question I could ask you that you might get right?
https://x.com/sama/status/2030318481653334067
Wow, real range of emotion reading the second and then the third paragraph.
https://x.com/sama/status/2030318632899953108
📣 Technical lessons from building computer access for agents Making long-running workflows practical required tightening the execution loop, providing rich context via file systems, and enabling network access with security guardrails. Here’s how we equipped the Responses API
https://x.com/OpenAIDevs/status/2031798071345234193
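The three techniques named in that thread — a tight execution loop, rich file-system context, and network access behind security guardrails — can be sketched generically. The following is a hypothetical illustration of the loop shape only, not OpenAI’s actual implementation; all names here (`ALLOWED_HOSTS`, `network_guard`, `run_agent`) are our own:

```python
from urllib.parse import urlparse

# Hypothetical allowlist: the guardrail only permits known hosts.
ALLOWED_HOSTS = {"api.example.com", "docs.example.com"}

def network_guard(url: str) -> None:
    """Security guardrail: refuse any request outside the allowlist."""
    host = urlparse(url).hostname
    if host not in ALLOWED_HOSTS:
        raise PermissionError(f"blocked network access to {host}")

def run_agent(plan_step, act, observe, max_steps: int = 50) -> str:
    """Tight execution loop: plan -> act -> observe, until the agent finishes."""
    context = []  # accumulated context (e.g. file-system reads, tool results)
    for _ in range(max_steps):
        step = plan_step(context)
        if step["type"] == "done":
            return step["result"]
        if step["type"] == "fetch":
            network_guard(step["url"])  # enforce the guardrail before any network call
        context.append(observe(act(step)))
    raise TimeoutError("agent exceeded step budget")
```

The point of the sketch is where the checks sit: the guardrail runs inside the loop before every network action, and observations feed back into context so later planning steps can use them.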
4/ Ablations + agent behavior analysis: – Most agents underutilize the 10 hour window, although longer runs correlate with better scores – Reasoning effort: for GPT-5.1 Codex Max, the default “Medium” reasoning effort outperformed “High”. High reasoning effort consumed nearly
https://x.com/karinanguyen/status/2031790007028236452
Automations are now GA. You can now: • Set the model and reasoning level • Choose if runs happen in a worktree or existing branch • Reuse workflows with templates Automations are great for recurring tasks — daily repo briefings, issue triage, PR comment follow-up, and more.
https://x.com/OpenAIDevs/status/2032222711032971548
GPT-5.4 just randomly caught outdated sections in some .md files and also suggested moving them so other agents wouldn’t treat these as truth. Which means every agent before it made this mistake. I’m impressed.
https://x.com/Yampeleg/status/2030253948406227072
What if you could optimize a model overnight without any ML experience? What if an AI agent runs hundreds of training experiments autonomously, keeping only the improvements? That is the idea behind autoresearch. Yes, the early results are small scale, GPT-2 speedups, a 0.8B
https://x.com/_philschmid/status/2031356521553043824
GPT-5.4 xhigh seems bad at following instructions. Last night I launched two AI research agents running @karpathy’s autoresearch. Claude Opus 4.6 (high): > ran for 12+ hours, 118 experiments done, still running GPT-5.4 xhigh: > stopped after 6 experiments > blamed me for
https://x.com/Yuchenj_UW/status/2031044694441148709
GPT-5.4-xhigh in 2nd place on the AA-Index in the overall, but 1st in agentic and coding However, I don’t see the reasoning efficiency gains OpenAI were talking about. GPT-5.4-xhigh deleted all gains GPT-5.3-Codex made and was almost 2x more expensive to benchmark
https://x.com/scaling01/status/2029927963014115768
Insane how much Codex+GPT-5.4 with slack/notion/google drive access breaks down organizational silos. “What is the process to <x>” for any <x> is now a question that doesn’t require pinging anyone. And if you need to ping someone, Codex can figure out whom and do that too.
https://x.com/corbtt/status/2032167664865722574
I resigned from OpenAI. I care deeply about the Robotics team and the work we built together. This wasn’t an easy call. AI has an important role in national security. But surveillance of Americans without judicial oversight and lethal autonomy without human authorization are
https://x.com/kalinowski007/status/2030320074121478618?s=20
140 million people use ChatGPT to help them understand math and science concepts every week.
https://x.com/ChatGPTapp/status/2031510785428762732
I found a use case for ChatGPT 5.4 Pro. It’s INCREDIBLE at writing technical specification docs. Thinking does alright too, but Pro really wrote something worthy of a PhD thesis for a project I’m starting.
https://x.com/CtrlAltDwayne/status/2030060347273662837
GPT-5.4 is great at coding, knowledge work, computer use, etc, and it’s nice to see how much people are enjoying it. But it’s also my favorite model to talk to! We have missed the mark on model personality for a while, so it feels extra good to be moving in the right direction.
https://x.com/sama/status/2030319489993298349
Codex Security is rolling out as a research preview to ChatGPT Enterprise, Business, and Edu customers via Codex web, with free usage for the next month.
https://x.com/OpenAIDevs/status/2029983833567940639
Codex Security–our application security agent–is now in research preview.
https://x.com/OpenAI/status/2029985250512920743
We’re introducing Codex Security. An application security agent that helps you secure your codebase by finding vulnerabilities, validating them, and proposing fixes you can review and patch. Now, teams can focus on the vulnerabilities that matter and ship code faster.
https://x.com/OpenAIDevs/status/2029983809652035758
Codex can’t run autoresearch right now, sadly. To me this is a big issue: agents shouldn’t need special commands like /loop or ralph just to run loops. This feels more like a Codex harness issue than a GPT-5.4 issue. If I say “loop forever,” it should just do that!
https://x.com/Yuchenj_UW/status/2031087769993490777
Did some gardening today: 🍪 Sweet Cookie 0.2.0 with Brave cookie support, better Linux/GNOME logic, and explicit macOS chromiumBrowser targeting… https://t.co/s5boBSvzbe which helps 🧿oracle 0.9.0 with GPT-5.4 Pro support and plenty of bug fixes.
https://x.com/steipete/status/2030478956646834590
For further analysis of GPT-5.4 and other models, visit Artificial Analysis:
https://x.com/ArtificialAnlys/status/2029950513429762429
GPT 5.4 (xhigh) scores 77.7% on WeirdML, just behind 5.3 codex and Opus 4.6, but within the margin of error. GPT 5.4 is a really strong model, and sets a new high score on 3 of the 17 tasks, but it’s not consistent enough to set a new top score. It uses by far the most tokens
https://x.com/htihle/status/2032107787195466061
GPT-5.4 High by @OpenAI has landed in the top 10 Text Arena. Let’s dig into why. Overall the latest model is much more rounded than the previous GPT-5.2-High, with significant improvements across quite a large number of categories. Below are where it has made the largest gains:
https://x.com/arena/status/2030018716440924225
GPT-5.4 is honestly fantastic, what a great model.
https://x.com/Yampeleg/status/2030949057653264437
GPT-5.4 is launching, available now in the API and Codex and rolling out over the course of the day in ChatGPT. It’s much better at knowledge work and web search, and it has native computer use capabilities. You can steer it mid-response, and it supports 1m tokens of context.
https://x.com/sama/status/2029622732594499630
GPT-5.4 leads CursorBench on correctness with efficient token usage.
https://x.com/OpenAIDevs/status/2032209975280533676
GPT-5.4 Pro cost over $1k to achieve this result. This is 13X the cost of GPT-5.4 (xhigh reasoning), driven by the high output token price (GPT-5.4 Pro is priced at $180 per 1M output tokens vs GPT-5.4’s $15). GPT-5.4 used 6M tokens, only marginally more than GPT-5.4 (xhigh)’s
https://x.com/ArtificialAnlys/status/2030007303655887188
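The arithmetic behind that figure is easy to sanity-check using only the numbers quoted above (~6M output tokens, $180 vs $15 per 1M output tokens); the `run_cost` helper is our own:

```python
def run_cost(output_tokens: int, price_per_million_usd: float) -> float:
    """Dollar cost of a run from output tokens and a $/1M-token price."""
    return output_tokens / 1_000_000 * price_per_million_usd

# Quoted figures: ~6M output tokens; $180/1M (GPT-5.4 Pro) vs $15/1M (GPT-5.4).
pro = run_cost(6_000_000, 180.0)   # 1080.0 -- "over $1k"
base = run_cost(6_000_000, 15.0)   # 90.0
print(pro, base, pro / base)
```

The output-price ratio alone gives 12x; the quoted 13X presumably also reflects the slightly different token counts between the two runs.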
GPT-5.4 xhigh sets a new pass@5 and pass^5 SOTA on ZeroBench pass@5: 23% (prev. 19%) pass^5: 8% (prev. 7%) More details below 👇
https://x.com/JRobertsAI/status/2031026691682808148
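For readers unfamiliar with the two metrics: pass@5 counts a task as solved if at least one of 5 attempts succeeds, while pass^5 requires all 5 attempts to succeed, making it a proxy for reliability rather than best-case ability. A minimal sketch with toy data (function names and the example results are ours, not ZeroBench data):

```python
def pass_at_k(attempts: list[list[bool]]) -> float:
    """pass@k: fraction of tasks where at least one of the k attempts succeeds."""
    return sum(any(task) for task in attempts) / len(attempts)

def pass_hat_k(attempts: list[list[bool]]) -> float:
    """pass^k: fraction of tasks where all k attempts succeed."""
    return sum(all(task) for task in attempts) / len(attempts)

# Toy example: 4 tasks, 5 attempts each.
results = [
    [True, False, False, False, False],  # solved once: counts for pass@5 only
    [True, True, True, True, True],      # always solved: counts for both
    [False, False, False, False, False], # never solved
    [False, True, False, True, False],   # solved sometimes: pass@5 only
]
print(pass_at_k(results))   # 0.75
print(pass_hat_k(results))  # 0.25
```

The gap between the two (23% vs 8% in the tweet) is the usual pattern: models solve many tasks occasionally but far fewer consistently.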
GPT-5.4-high behind GPT-5.2 on PostTrainBench because it’s not allocating time as well as GPT-5.2, Opus or Gemini
https://x.com/scaling01/status/2031081654035300834
GPT-5.4-high below GPT-5.2-high on AlgoTune
https://x.com/scaling01/status/2031079698826993690
How often do LLMs claim to prove false mathematical statements? In our latest benchmark, BrokenArXiv, we find they do so very often. The best model, GPT-5.4, only rejects 40% of incorrect statements obtained by perturbing recent ArXiv papers, and other models do much worse.
https://x.com/j_dekoninck/status/2032458037823483953
I had mostly only used it in Codex, but after spending a lot of time with 5.4 in ChatGPT today, I’m more impressed than I had expected. From the ChatGPT side, it is also a jump from 5.2 to 5.4, and I think I had been judging it too much through the lens of Codex. I still think
https://x.com/Hangsiin/status/2030880541185286370
I tried GPT-5.4-xhigh once and removed all the code it had written. Then I asked Opus 4.6 Thinking and it one-shotted it in 1/10th the time. My theory is that GPT-5.4 is highly autistic and literal. It has no idea of the concept of “inferring intent”, so when you prompt it be as
https://x.com/scaling01/status/2029987685952279000
If true, this would be the first of @EpochAIResearch’s Frontier Math open problems to be resolved by AI. “The result emerged from a single GPT-5.4 Pro run and was subsequently refined into Lean with GPT-5.4 XHigh which ran for a few hours.”
https://x.com/kevinweil/status/2031378978527641822
My new Sunday morning routine: 1. Get coffee 2. Check GPT-5.4 projects on the Codex App, continue & start new ones 3. Launch ChatGPT 5.4 Pro for fresh brainstorming sessions 4. Think/learn how to use the 90% of AI capabilities I have yet to explore 5. Drink more coffee
https://x.com/DeryaTR_/status/2030622714927452309
nanochat now trains GPT-2 capability model in just 2 hours on a single 8XH100 node (down from ~3 hours 1 month ago). Getting a lot closer to ~interactive! A bunch of tuning and features (fp8) went in but the biggest difference was a switch of the dataset from FineWeb-edu to
https://x.com/karpathy/status/2029701092347630069
New @openclaw beta bits are up! Yes, includes GPT 5.4 and Gemini Flash 3.1!
https://x.com/steipete/status/2030508141419372667
The Codex team updated docs with the estimated usage limits for the models. Local messages (5.4): ChatGPT Plus 33-168, ChatGPT Pro 223-1120. Local messages (5.3-Codex): ChatGPT Plus 45-225, ChatGPT Pro 300-1500. Local messages (5.1-Codex-Mini): ChatGPT Plus 180-900, ChatGPT Pro
https://x.com/Presidentlin/status/2030881332411125845
We believe we have fully resolved, in Lean and python, one of @EpochAIResearch Frontier Math open problems: a Ramsey-style problem on hypergraphs. The result emerged from a single GPT-5.4 Pro run and was subsequently refined into Lean with GPT-5.4 XHigh which ran for a few
https://x.com/spicey_lemonade/status/2031315804537434305
Working with GPT-5.4 in the API? We’ve updated our prompting guide with patterns for reliable agents covering tool use, structured outputs, verification loops, and long-running workflows.
https://x.com/OpenAIDevs/status/2030018673449263400
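One of the patterns listed there, the verification loop, is straightforward to sketch: generate, check the output against a verifier, and feed the failure reason back into the next attempt. This is a generic illustration of the pattern, not code from the guide; `with_verification` and the toy generator/verifier are our own:

```python
def with_verification(generate, verify, max_retries: int = 3):
    """Verification-loop pattern: regenerate until the output passes a checker.

    `generate` and `verify` are caller-supplied; in practice `generate` would
    call the model and `verify` would run tests or schema validation.
    """
    last_error = None
    for _ in range(max_retries):
        candidate = generate(last_error)      # feed the failure reason back in
        ok, last_error = verify(candidate)
        if ok:
            return candidate
    raise RuntimeError(f"verification failed after {max_retries} tries: {last_error}")

# Toy usage: the "model" emits integers; the verifier demands an even one.
attempts = iter([3, 5, 8])
result = with_verification(
    generate=lambda err: next(attempts),
    verify=lambda x: (x % 2 == 0, None if x % 2 == 0 else f"{x} is odd"),
)
print(result)  # 8
```

Passing `last_error` back into `generate` is the key design choice: without it, each retry is an independent roll of the dice rather than a correction.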
GPT 5.4 is a really special model. I think the tweet below is about coding, but IMO it also holds for general use (like explaining concepts or talking through issues). It’s tough to get the personality right – this model genuinely feels like talking to a smart friend.
https://x.com/venturetwins/status/2030391113086116096
ok i think gpt 5.4 can actually talk. it is much more opinionated when you ask it to critique stuff, than gpt-5.3-codex. i am kind of loving it.
https://x.com/dejavucoder/status/2029912128325570818
I’ve been playing with GPT-5.4 over the weekend, and it definitely feels like a better match for me than Opus 4.6. Pros: GPT-5.4: Better instruction adherence, does what you ask, not what you don’t. Asks for confirmation more. Opus: A bit faster. Seems better at frontend design.
https://x.com/gneubig/status/2030971826042527860
Our next kernel competition is now open for submissions! A $1.1M cash prize competition sponsored by AMD on optimizing DeepSeek-R1-0528, GPT-OSS-120B on MI355X Registration:
https://x.com/GPU_MODE/status/2029974019018244223
Announcing NVIDIA Nemotron 3 Super! 💚120B-12A Hybrid SSM Latent MoE, designed for Blackwell 💚36 on AAIndex v4 💚up to 2.2X faster than GPT-OSS-120B in FP4 💚Open data, open recipe, open weights Models, Tech report, etc. here: https://t.co/CAYpP1iK3i And yes, Ultra is coming!
https://x.com/ctnzr/status/2031762077325406428
Another week, another noteworthy open-weight LLM release. Nvidia’s Nemotron 3 Super 120B-A12B looks pretty good. Benchmarks are on par with Qwen3.5 122B and GPT-OSS 120B, but the throughput is great! Below is a short, visual architecture rundown.
https://x.com/rasbt/status/2032084724743553129
We’re excited to be day-0 launch partners for NVIDIA Nemotron 3 Super! You can try it now on Baseten, or read @rapprach’s blog to learn more about the new model: https://x.com/baseten/status/2031775755253026965
1/8 Two days ago, @Liam06972452 prompted GPT-5.4 Pro using our workflow that had been working for the Erdős problems thus far, and was able to eventually obtain a solution to https://x.com/AcerFur/status/2031458080458739757
the progress is way faster than i expected gpt-5.4 pro (xhigh) is making a big jump in research-level physics reasoning the model improved by 10 points on the critpt benchmark, where the top score was only 9% in nov 2025 and has now reached 30% by march 2026 i think this fits
https://x.com/slow_developer/status/2030203046416855290
We are investigating a possible solution by GPT-5.4 Pro to a problem from FrontierMath: Open Problems. My guess is that the solution is right, but we won’t be sure until the problem author weighs in. Thread with the story so far…
https://x.com/GregHBurnham/status/2031451554151022838
Codex Security is now also available on ChatGPT Pro accounts.
https://x.com/OpenAIDevs/status/2030081306974093755
Codex for Open Source is an awesome idea. OSS maintainers get API credits, 6 months of ChatGPT Pro with Codex, and access to Codex Security as needed.
https://x.com/kevinweil/status/2030000508342272368
Excited to introduce Codex for Open Source! 🔥 TL;DR – ChatGPT Pro, Codex, and API credits for eligible open-source maintainers Open source has shaped modern software, and so much of it depends on maintainers doing steady, often invisible work to keep critical projects healthy.
https://x.com/reach_vb/status/2029998272945717553
@Yuchenj_UW Codex is a known issue 🙁 It basically doesn’t work with autoresearch sadly, in the way it’s set up atm: https://t.co/YDaQqwhM2h I pinged a friend at OpenAI to see if something can be done, e.g. need a /loop equivalent or something like that. More generally, I really dislike the -p +
https://x.com/karpathy/status/2031083551387701698
If you want AI Code Review, but don’t want to pay $25 per review (not a typo), check out Codex Review! It leverages frontier Codex models, finds complex issues, and is 100% usage-based. Most runs should cost ~$1 or less.
https://x.com/rohanvarma/status/2031113869666693351
my fav thing when I ask codex and then it disappears and returns with “YES NOW”
https://x.com/steipete/status/2030848677527364048
There’s still a few spots left to get free codex Pro subs!
https://x.com/steipete/status/2031835365204496394
We’ve been cooking. 2 updates in the Codex app 👇 You can now personalize the Codex app with themes that match your taste. Import themes you like or share your own.
https://x.com/OpenAIDevs/status/2032222631538409728
5.4 is faster and better at professional work — with big improvements in spreadsheet, doc, and slide creation. In Codex and the API, it’s our first general purpose model with native SOTA computer use capabilities, which is going to enable so much more agentic work.
https://x.com/fidjissimo/status/2029621151283171752
we just recorded what might be the single most impactful conversation in the history of @latentspacepod iff you take @_lopopolo seriously and literally everything about @OpenAI Frontier, Symphony and Harness Engineering. it’s all of a kind and the future of the AI Native Org
https://x.com/swyx/status/2030074312380817457
Codex app on Windows!
https://x.com/sama/status/2029623487007183274
T3 Code is now available for everyone to use. Fully open source. Built on top of the Codex CLI, so you can bring your existing Codex subscription.
https://x.com/theo/status/2030071716530245800
Your videos can go further now. We’re introducing new Video API capabilities, powered by Sora 2: • Custom characters and objects • 16:9 and 9:16 exports • Clips up to 20 seconds • Video continuation to extend scenes • Batch jobs for video generation
https://x.com/OpenAIDevs/status/2032142448970121468