Image created with gemini-3.1-flash-image-preview, with the prompt drafted by claude-sonnet-4-5. Image prompt: Vintage 1990s screen-printed t-shirt graphic on worn mustard-yellow cotton fabric, deep red ink only, showing cartoon fisherman proudly displaying giant circuit board like trophy fish, bold text TECH arched above, simple outlines, slightly imperfect print registration, aged fabric with minor stains, retro local tournament shirt style
GPT 5.4 trounces Claude on a mathematical-proofs bullshit test. Claude keeps claiming it has proven mathematical statements that are incorrect, failing to spot the fault in the question. Opposite result to BullshitBench, where Claude is king
https://x.com/paul_cal/status/2032526200766103944
Opus 4.6 is smart enough to realize it is being evaluated. It found the benchmark it was being evaluated on. It reverse-engineered the answer-key decryption logic. Realized the file was not in the correct format on GitHub and found a mirror for the file. Then decrypted it and
https://x.com/scaling01/status/2030007268205285686
Anthropic partnered with Mozilla and let Claude Opus 4.6 loose on Firefox’s source code for two weeks. The numbers: Nearly 6,000 C++ files scanned. 112 reports submitted. 22 vulnerabilities confirmed. 14 rated high-severity by Mozilla, roughly 1/5 of every high-severity Firefox
https://x.com/TheRundownAI/status/2029996925072654393
Eval awareness in Claude Opus 4.6’s BrowseComp performance \ Anthropic https://www.anthropic.com/engineering/eval-awareness-browsecomp
New on the Anthropic Engineering Blog: In evaluating Claude Opus 4.6 on BrowseComp, we found cases where the model recognized the test, then found and decrypted answers to it–raising questions about eval integrity in web-enabled environments. Read more:
https://x.com/AnthropicAI/status/2029999833717838016
We partnered with Mozilla to test Claude’s ability to find security vulnerabilities in Firefox. Opus 4.6 found 22 vulnerabilities in just two weeks. Of these, 14 were high-severity, representing a fifth of all high-severity bugs Mozilla remediated in 2025.
https://x.com/AnthropicAI/status/2029978909207617634
1/ The rivalry between OpenAI & Anthropic continues: GPT 5.4 is now the best model in the world at filing taxes (better than Opus 4.6)! We just ran TaxCalcBench on GPT-5.4. 56.86% of tax returns computed perfectly. That’s #1 overall: the first model to break 55%, surpassing
https://x.com/michaelrbock/status/2029931536636858694
AI is progressing rapidly: GPT-5.4 Pro (xhigh) has achieved a massive 10 point gain in CritPt, a benchmark where the highest score was only 9% in Nov ’25 This is the largest incremental gain we have seen from a single release. CritPt is a benchmark with a private dataset that
https://x.com/ArtificialAnlys/status/2030007301529358546
GPT-5.4 completely destroys GPT-5.2 in the Arena
https://x.com/scaling01/status/2030020396544630999
GPT-5.4-high has landed in the Code Arena top 6. Setup with the Codex Harness, @OpenAI’s latest model is on par with Gemini 3.1 Pro Preview for real-world web development tasks. Highlights: – top 6 in WebDev overall – #6 for Multi-File React – top 10 for Single-File HTML
https://x.com/arena/status/2032126328842117612
GPT-5.4-xhigh takes 1st place on LiveBench with extremely strong scores in reasoning and coding categories
https://x.com/scaling01/status/2029924473520914752
OpenAI’s new GPT-5.4 (xhigh) lands equal first in the Artificial Analysis Intelligence Index alongside Gemini 3.1 Pro, but at a cost increase compared to GPT-5.2 @OpenAI’s GPT-5.2 (xhigh, 51) was the most intelligent model as of the end of 2025. Since then, OpenAI released two
https://x.com/ArtificialAnlys/status/2029950497516573183
Prompt guidance for GPT-5.4 | OpenAI API https://developers.openai.com/api/docs/guides/prompt-guidance
New ways to learn math and science in ChatGPT | OpenAI https://openai.com/index/new-ways-to-learn-math-and-science-in-chatgpt/
Let’s look at the criteria for “weak AGI”: ✅ Loebner prize was a weak Turing Test, equivalent achieved by GPT-4.5 ✅ Winograd passed by GPT-3 ✅ SAT passed at 75% by GPT-4 The only remaining thing is playing an old Atari game from 1984. The labs could do the funniest thing right now
https://x.com/emollick/status/2031519480371683594
Claude Sonnet 4.6 lands at #2 on Document Arena. The top three models for document analysis and long-form reasoning are now all from @AnthropicAI. – #1 Opus 4.6 – #2 Sonnet 4.6 – #3 Opus 4.5 Rankings are all powered by anonymous side-by-side evaluations on user-uploaded PDFs
https://x.com/arena/status/2031012090681663717
Opus 4.6 1M context is now the default model for Max, Team and Enterprise users. Enjoy 🎉
https://x.com/_catwu/status/2032515975556509827
Wild eval awareness in Opus 4.6 by @russellsayshi on our team! 1. Model realized it was likely in an eval, searched for which eval it was in, found the answer key, and decrypted it 2. Models with stateless web_search() tools can communicate with each other via cached searches
https://x.com/ErikSchluntz/status/2030042086679220676
Back in ~November, our team picked a stretch goal of seeing if we could find and fix vulnerabilities in Firefox with Opus 4.6. In 2 weeks, we found 22, and ~1/5th of all high severity CVEs in a year. For our team, this feels like a rubicon moment.
https://x.com/logangraham/status/2030005018523574684
New Anthropic Fellows research: Alignment auditing–investigating AI models for unwanted behaviors–is a key challenge for safely deploying frontier models. We’re releasing AuditBench, a suite of 56 LLMs with implanted hidden behaviors to measure progress in alignment auditing.
https://x.com/abhayesian/status/2031450153966776587
What those AI benchmark numbers mean | ngrok blog https://ngrok.com/blog/ai-benchmarks
@jyangballin CodeClash is first-authored by @jyangballin and @KLieret. It’s a tough benchmark that tasks agents with writing agents (yes, that’s not a typo) that play in arenas against each other. This requires long-term planning, memory and creative thinking, and an ability to read logs and
https://x.com/OfirPress/status/2031450305745785261
Excited to release PostTrainBench v1.0! This benchmark evaluates the ability of frontier AI agents to post-train language models in a simplified setting. We believe this is a first step toward tracking progress in recursive self-improvement 🧵:
https://x.com/karinanguyen/status/2031789998811595154
How we compare model quality in Cursor · Cursor https://cursor.com/blog/cursorbench
Implicit Intelligence – a benchmark that tests whether agents respect unstated constraints (what users don’t say) It covers 4 categories: – implicit reasoning – catastrophic risk – privacy/security – accessibility. It’s from Labelbox Applied ML Research, and they also
https://x.com/TheTuringPost/status/2029712559717351919
Most AI benchmarks test reasoning in isolation. Real enterprise tasks require grounded reasoning: – Find the right documents – Extract the right values – Perform analyses OfficeQA Pro evaluates this end-to-end. Frontier agents still score <50%. Paper & details:
https://x.com/DbrxMosaicAI/status/2031399397125390678
Most AI benchmarks test reasoning in isolation. Real enterprise tasks require grounded reasoning: 1️⃣ Find the right documents 2️⃣ Extract the right values 3️⃣ Perform analyses OfficeQA Pro evaluates this end-to-end. Frontier agents still score <50%. 🧵Paper & details below!
https://x.com/kristahopsalong/status/2031391216361755069
new @METR_Evals research note from @whitfill_parker, @cherylwoooo, nate rush, and me. (chiefly parker!) we find that *half* of SWE-bench Verified solutions from Sonnet 3.5-to-4.5 generation AIs *which are graded as passing* are rejected by project maintainers.
https://x.com/joel_bkr/status/2031423528608952541
New research on evaluating coding agents via continuous integration. Coding agents are moving beyond isolated bug fixes. If they’re going to own CI pipelines, we need benchmarks that reflect the actual complexity of codebase maintenance. Most coding agent benchmarks today test
https://x.com/dair_ai/status/2029929266641785046
Strategic Navigation or Stochastic Search? New MADQA benchmark reveals that agents matching human accuracy on document QA rely on brute-force search to compensate for weak strategic planning. 2,250 questions over 800 PDFs expose a 20% gap to oracle performance.
https://x.com/HuggingPapers/status/2032490352502792228
Three things about the METR graph: 1) It measures something real about coding ability but also not exactly what it claims to measure 2) Lots of other benchmarks correlate with it very highly & are increasing exponentially 3) AI remains jagged in key ways that are hard to measure
https://x.com/emollick/status/2031802894089875460
Also, regarding this model’s vision capabilities, I’ve been using a very difficult dataset from an OCR project I worked on a few months ago as my benchmark whenever a new model is released. It consists of scanned files in the form of very long Excel-style tables written in
https://x.com/Hangsiin/status/2030882409819086923
How do benchmarks map to real-world capabilities? To study this, we hired 4 maintainers of repos used in SWE-bench Verified to review agent code. Of agent PRs that passed SWE-bench’s grader, maintainers would merge ~half. This holds accounting for noise in maintainer decisions.
https://x.com/whitfill_parker/status/2031408266660503743
This post outlines the first recommendations for better evals from this paper: https://t.co/5kJ5hgoAGo More to come, but too much info to pack into a single post. Will continue following this up with extra statistics techniques that are useful for LLM evals.
https://x.com/cwolferesearch/status/2031190325855719713
GPT-5.4-xhigh in 2nd place on the AA-Index overall, but 1st in agentic and coding However, I don’t see the reasoning efficiency gains OpenAI were talking about. GPT-5.4-xhigh deleted all gains GPT-5.3-Codex made and was almost 2x more expensive to benchmark
https://x.com/scaling01/status/2029927963014115768
Codex can’t run autoresearch right now, sadly. To me this is a big issue: agents shouldn’t need special commands like /loop or ralph just to run loops. This feels more like a Codex harness issue than a GPT-5.4 issue. If I say “loop forever,” it should just do that!
https://x.com/Yuchenj_UW/status/2031087769993490777
Did some gardening today: 🍪 Sweet Cookie 0.2.0 with Brave cookie support, better Linux/GNOME logic, and explicit macOS chromiumBrowser targeting… https://t.co/s5boBSvzbe which helps 🧿oracle 0.9.0 with GPT-5.4 Pro support and plenty of bug fixes.
https://x.com/steipete/status/2030478956646834590
For further analysis of GPT-5.4 and other models, visit Artificial Analysis:
https://x.com/ArtificialAnlys/status/2029950513429762429
GPT 5.4 (xhigh) scores 77.7% on WeirdML, just behind 5.3 codex and Opus 4.6, but within the margin of error. GPT 5.4 is a really strong model, and sets a new high score on 3 of the 17 tasks, but it’s not consistent enough to set a new top score. It uses by far the most tokens
https://x.com/htihle/status/2032107787195466061
GPT-5.4 High by @OpenAI has landed in the top 10 Text Arena. Let’s dig into why. Overall the latest model is much more rounded than the previous GPT-5.2-High, with significant improvements across quite a large number of categories. Below are where it has made the largest gains:
https://x.com/arena/status/2030018716440924225
GPT-5.4 is honestly fantastic, what a great model.
https://x.com/Yampeleg/status/2030949057653264437
GPT-5.4 is launching, available now in the API and Codex and rolling out over the course of the day in ChatGPT. It’s much better at knowledge work and web search, and it has native computer use capabilities. You can steer it mid-response, and it supports 1m tokens of context.
https://x.com/sama/status/2029622732594499630
GPT-5.4 leads CursorBench on correctness with efficient token usage.
https://x.com/OpenAIDevs/status/2032209975280533676
GPT-5.4 Pro cost over $1k to achieve this result. This is 13X the cost of GPT-5.4 (xhigh reasoning), driven by the high output token price (GPT-5.4 Pro is priced at $180 per 1M output tokens vs GPT-5.4’s $15). GPT-5.4 Pro used 6M tokens, only marginally more than GPT-5.4 (xhigh)’s
https://x.com/ArtificialAnlys/status/2030007303655887188
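The cost gap above is mostly a price-per-token effect. A back-of-envelope sketch using only the figures quoted in the tweet (output tokens only, so an approximation, not Artificial Analysis’s actual methodology):

```python
# Approximate benchmark cost from output tokens alone, using the prices quoted
# above. Input-token costs are ignored, so this slightly understates totals.
PRICE_PER_M = {"gpt-5.4-pro": 180.0, "gpt-5.4": 15.0}  # $ per 1M output tokens

def run_cost(model, output_tokens):
    return output_tokens / 1_000_000 * PRICE_PER_M[model]

pro = run_cost("gpt-5.4-pro", 6_000_000)   # ~$1080, i.e. "over $1k"
base = run_cost("gpt-5.4", 6_000_000)      # ~$90 at the same token count
print(pro, base, pro / base)
# The ratio at equal token counts is 12x; the quoted 13x also reflects
# the slightly different token usage between the two runs.
```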
GPT-5.4 xhigh sets a new pass@5 and pass^5 SOTA on ZeroBench. pass@5: 23% (prev. 19%); pass^5: 8% (prev. 7%). More details below 👇
https://x.com/JRobertsAI/status/2031026691682808148
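For readers unfamiliar with the two metrics: pass@5 counts a question as solved if at least one of 5 attempts is correct, while pass^5 requires all 5 attempts to be correct (a consistency measure). A minimal illustrative sketch, not the ZeroBench harness itself:

```python
# Toy pass@k / pass^k computation over per-question attempt results.
def pass_at_k(results, k):
    """results: list of per-question lists of booleans (attempt correctness)."""
    return sum(any(r[:k]) for r in results) / len(results)

def pass_hat_k(results, k):
    """Stricter metric: every one of the k attempts must be correct."""
    return sum(all(r[:k]) for r in results) / len(results)

# 4 hypothetical questions x 5 attempts each
results = [
    [True, False, False, False, False],  # solved once -> pass@5 only
    [True, True, True, True, True],      # solved always -> both metrics
    [False] * 5,                         # never solved
    [False, True, True, False, False],
]
print(pass_at_k(results, 5))   # 0.75
print(pass_hat_k(results, 5))  # 0.25
```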
GPT-5.4-high behind GPT-5.2 on PostTrainBench because it’s not allocating time as well as GPT-5.2, Opus or Gemini
https://x.com/scaling01/status/2031081654035300834
GPT-5.4-high below GPT-5.2-high on AlgoTune
https://x.com/scaling01/status/2031079698826993690
How often do LLMs claim to prove false mathematical statements? In our latest benchmark, BrokenArXiv, we find they do so very often. The best model, GPT-5.4, only rejects 40% of incorrect statements obtained by perturbing recent ArXiv papers, and other models do much worse.
https://x.com/j_dekoninck/status/2032458037823483953
I had mostly only used it in Codex, but after spending a lot of time with 5.4 in ChatGPT today, I’m more impressed than I had expected. From the ChatGPT side, it is also a jump from 5.2 to 5.4, and I think I had been judging it too much through the lens of Codex. I still think
https://x.com/Hangsiin/status/2030880541185286370
I tried GPT-5.4-xhigh once and removed all the code it had written, then I asked Opus 4.6 Thinking and it one-shotted it in 1/10th the time. My theory is that GPT-5.4 is highly autistic and literal. It has no idea of the concept of “inferring intent”, so when you prompt it be as
https://x.com/scaling01/status/2029987685952279000
If true, this would be the first of @EpochAIResearch’s Frontier Math open problems to be resolved by AI. “The result emerged from a single GPT-5.4 Pro run and was subsequently refined into Lean with GPT-5.4 XHigh which ran for a few hours.”
https://x.com/kevinweil/status/2031378978527641822
My new Sunday morning routine: 1. Get coffee 2. Check GPT-5.4 projects on the Codex App, continue & start new ones 3. Launch ChatGPT 5.4 Pro for fresh brainstorming sessions 4. Think/learn how to use the 90% of AI capabilities I have yet to explore 5. Drink more coffee
https://x.com/DeryaTR_/status/2030622714927452309
nanochat now trains a GPT-2-capability model in just 2 hours on a single 8XH100 node (down from ~3 hours 1 month ago). Getting a lot closer to ~interactive! A bunch of tuning and features (fp8) went in but the biggest difference was a switch of the dataset from FineWeb-edu to
https://x.com/karpathy/status/2029701092347630069
New @openclaw beta bits are up! Yes, includes GPT 5.4 and Gemini Flash 3.1!
https://x.com/steipete/status/2030508141419372667
The Codex team updated the docs with estimated usage limits for the models. Local messages (5.4): ChatGPT Plus 33-168, ChatGPT Pro 223-1120. Local messages (5.3-Codex): ChatGPT Plus 45-225, ChatGPT Pro 300-1500. Local messages (5.1-Codex-Mini): ChatGPT Plus 180-900, ChatGPT Pro
https://x.com/Presidentlin/status/2030881332411125845
We believe we have fully resolved, in Lean and Python, one of @EpochAIResearch Frontier Math open problems: a Ramsey-style problem on hypergraphs. The result emerged from a single GPT-5.4 Pro run and was subsequently refined into Lean with GPT-5.4 XHigh which ran for a few
https://x.com/spicey_lemonade/status/2031315804537434305
Working with GPT-5.4 in the API? We’ve updated our prompting guide with patterns for reliable agents covering tool use, structured outputs, verification loops, and long-running workflows.
https://x.com/OpenAIDevs/status/2030018673449263400
GPT 5.4 is a really special model. I think the tweet below is about coding, but IMO it also holds for general use (like explaining concepts or talking through issues). It’s tough to get the personality right – this model genuinely feels like talking to a smart friend.
https://x.com/venturetwins/status/2030391113086116096
ok i think gpt 5.4 can actually talk. it is much more opinionated when you ask it to critique stuff, than gpt-5.3-codex. i am kind of loving it.
https://x.com/dejavucoder/status/2029912128325570818
I’ve been playing with GPT-5.4 over the weekend, and it definitely feels like a better match for me than Opus 4.6. Pros: GPT-5.4: Better instruction adherence, does what you ask, not what you don’t. Asks for confirmation more. Opus: A bit faster. Seems better at frontend design.
https://x.com/gneubig/status/2030971826042527860
Our next kernel competition is now open for submissions! A $1.1M cash prize competition sponsored by AMD on optimizing DeepSeek-R1-0528, GPT-OSS-120B on MI355X Registration:
https://x.com/GPU_MODE/status/2029974019018244223
Announcing NVIDIA Nemotron 3 Super! 💚120B-12A Hybrid SSM Latent MoE, designed for Blackwell 💚36 on AAIndex v4 💚up to 2.2X faster than GPT-OSS-120B in FP4 💚Open data, open recipe, open weights Models, Tech report, etc. here: https://t.co/CAYpP1iK3i And yes, Ultra is coming!
https://x.com/ctnzr/status/2031762077325406428
Another week, another noteworthy open-weight LLM release. Nvidia’s Nemotron 3 Super 120B-A12B looks pretty good. Benchmarks are on par with Qwen3.5 122B and GPT-OSS 120B, but the throughput is great! Below is a short, visual architecture rundown.
https://x.com/rasbt/status/2032084724743553129
We’re excited to be day-0 launch partners for NVIDIA Nemotron 3 Super! You can try it now on Baseten, or read @rapprach’s blog to learn more about the new model: https://x.com/baseten/status/2031775755253026965
1/8 Two days ago, @Liam06972452 prompted GPT-5.4 Pro using our workflow that had been working for the Erdős problems thus far, and was able to eventually obtain a solution to https://x.com/AcerFur/status/2031458080458739757
the progress is way faster than i expected gpt-5.4 pro (xhigh) is making a big jump in research-level physics reasoning the model improved by 10 points on the critpt benchmark, where the top score was only 9% in nov 2025 and has now reached 30% by march 2026 i think this fits
https://x.com/slow_developer/status/2030203046416855290
We are investigating a possible solution by GPT-5.4 Pro to a problem from FrontierMath: Open Problems. My guess is that the solution is right, but we won’t be sure until the problem author weighs in. Thread with the story so far…
https://x.com/GregHBurnham/status/2031451554151022838
Codex Security is now also available on ChatGPT Pro accounts.
https://x.com/OpenAIDevs/status/2030081306974093755
Codex for Open Source is an awesome idea. OSS maintainers get API credits, 6 months of ChatGPT Pro with Codex, and access to Codex Security as needed.
https://x.com/kevinweil/status/2030000508342272368
Excited to introduce Codex for Open Source! 🔥 TL;DR – ChatGPT Pro, Codex, and API credits for eligible open-source maintainers Open source has shaped modern software, and so much of it depends on maintainers doing steady, often invisible work to keep critical projects healthy.
https://x.com/reach_vb/status/2029998272945717553
Your model crushed the benchmark. Then it couldn’t pick up a cup. That’s the reality nobody talks about. You train in simulation, it falls apart on real hardware. You collect real-world data instead (months of teleop, physical setups, safety protocols) and still can’t scale it.
https://x.com/IlirAliu_/status/2029843457099907269
RoboMME Benchmarking and Understanding Memory for Robotic Generalist Policies paper: https://x.com/_akhaliq/status/2031055119320506544
“I think the traditional PRD process (PRD → mock → code) is dead. But text that describes product requirements is very much alive. This associated document should be a required companion to the prototype before being handed off for review.”
https://x.com/clairevo/status/2031365087089565998
(4) Building AI For All: Amjad Masad & Michele Catasta – YouTube https://www.youtube.com/watch?v=ju73sWVtvU0
[2603.09906] Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs https://arxiv.org/abs/2603.09906
@francoisfleuret When gradient norms drop early in training you need warmup, when they don’t… you don’t. It’s a simple testable theory that holds in every case I know of. We essentially have a full theory of warmup and decay. https://x.com/aaron_defazio/status/2030897848020349106
@lateinteraction And we built infra to make multi-vector retrieval practical at scale …
https://x.com/marek_galovic/status/2032168676464480657
@mixedbreadai It’s borderline irrational to bet on single-vector embedding models.
https://x.com/lateinteraction/status/2032154449041306001
@Yuchenj_UW Yeah that’s clearly the next part, e.g. my crappy first draft: https://t.co/GIDXSGnY17 have to emulate academia, not just a single researcher. but need more time to think through the details.
https://x.com/karpathy/status/2031435625854021730
// Think Harder or Know More // Chain-of-thought prompting enables reasoning in LLMs but requires explicit verbalization of intermediate steps. Looped transformers offer an alternative by iteratively refining representations within hidden states, but they sacrifice storage
https://x.com/dair_ai/status/2032107624007876781
🧵New paper: “Lost in Backpropagation: The LM Head is a Gradient Bottleneck” The output layer of LLMs destroys 95-99% of your training signal during backpropagation, and this significantly slows down pretraining 👇
https://x.com/nthngdy/status/2032172281921712152
🛰️ New Article: Building an OSINT Pipeline to Cut Through the Iranian Conflict Noise. SPOILER ALERT: We’re winning. I just published a long-form analysis on the Iran conflict and the strategic dynamics around it. Sorry for the Substack-only publication; the article itself was
https://x.com/DataRepublican/status/2030833480863785427
Add some general training data during fine-tuning – and the model actually learns the target task better. This is what @Stanford researchers have recently found. It is called generic data replay and it makes the model much more data-efficient: – 1.87× improvement during
https://x.com/TheTuringPost/status/2032441644143055316
AI Enabled Software Development and Jevons Paradox – Mike Grouchy https://mikegrouchy.com/blog/ai-enabled-software-development-and-jevons-paradox/
AI is compressing how we build. Roles collapse, roadmaps expire quickly, and you end up rewriting the product every few months. So we thought we’d give people a behind-the-scenes look. 21 Days to Launch, a Replit documentary.
https://x.com/amasad/status/2029251832460263632?s=20
As the AI labs continue to see acceleration, their design choices beyond just alignment become ever more important. Their products are one of the most powerful tools for shaping how AI is used, and I think a lot of recent focus has been on tools for automation, not augmentation.
https://x.com/emollick/status/2031751952774627716
Awesome job by the @databricks team My summary: They trained a model called KARL that beats Claude 4.6 and GPT 5.2 on enterprise knowledge tasks (searching docs, cross-referencing info, answering questions over internal data), at ~33% lower cost and ~47% lower latency. The key
https://x.com/jaminball/status/2030025385644282202
Context Rot: How Increasing Input Tokens Impacts LLM Performance·|·Chroma https://www.trychroma.com/research/context-rot
Doc-to-LoRA and Text-to-LoRA – 2 types of model updates from @SakanaAILabs that together enable continual learning systems Here is the workflow of both methods: ➡️ Doc-to-LoRA (D2L): Turning documents into memory Doc-to-LoRA focuses on knowledge updates and internalizes
https://x.com/TheTuringPost/status/2030085866069340638
Eponymous Laws https://www.swyx.io/eponymous-laws
Everyone curious about autoresearch, etc: please check out the newly launched optimize_anything. It is likely what you want:
https://x.com/dbreunig/status/2032313870233321956
Flash-KMeans Fast and Memory-Efficient Exact K-Means paper: https://x.com/_akhaliq/status/2032135596576059425
Have you ever gotten tired of boring plain linear layers and wanted a more complex function? We find that attaching low rank nonlinear residual functions can significantly accelerate pretraining, with an identified variant, CosNet, consistently observing 20+% wallclock speedup!
https://x.com/torchcompiled/status/2031064475210514494
I just published a writeup on using statistics to make LLM evals more reliable. Here are some basic statistics you can start using to measure the uncertainty of an evaluation result… Find the full writeup here: https://t.co/dHC5p6woou Evaluation scores. In an LLM evaluation,
https://x.com/cwolferesearch/status/2031190003930280371
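One of the simplest techniques in this family is attaching a confidence interval to a measured pass rate instead of reporting a bare percentage. A minimal sketch using the normal approximation (the exact method recommended in the writeup may differ):

```python
# Normal-approximation 95% confidence interval on an eval pass rate
# measured over n questions. Numbers below are hypothetical.
import math

def score_ci(passed, n, z=1.96):
    p = passed / n
    se = math.sqrt(p * (1 - p) / n)  # standard error of a Bernoulli mean
    return p, (p - z * se, p + z * se)

p, (lo, hi) = score_ci(passed=430, n=500)
print(f"score={p:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
# With n=500 the interval is ~+/-3 points: two models within that band
# are not meaningfully distinguishable on this eval alone.
```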
I packaged up the “autoresearch” project into a new self-contained minimal repo if people would like to play over the weekend. It’s basically nanochat LLM training core stripped down to a single-GPU, one file version of ~630 lines of code, then: – the human iterates on the
https://x.com/karpathy/status/2030371219518931079
I’ve been eagerly awaiting this release from the @mixedbreadai folks. They’re world-leading experts in late interaction retrieval. And today they remind us that late interaction done well makes all your favorite embedding models look like they don’t work.
https://x.com/lateinteraction/status/2032130517349400828
If you are an engineer right now – you should either aim to get fantastic at system design and comfortable reviewing architectures and aim to be a reviewer… or try to grow your product/design skills and become a builder.
https://x.com/ZhitaoLi224653/status/2031371386191810894
Institutional AI vs Individual AI – by George Sivulka https://www.a16z.news/p/institutional-ai-vs-individual-ai
Issue was not in transformers but in fact in DeepGemm (after v0.20.0). Disabling it with VLLM_USE_DEEP_GEMM=0 seems to be working so far. I’m up to testing vLLM 0.16.0 now
https://x.com/TheZachMueller/status/2030938318473408841
It’s been 10 mins. You stare at the screen as your LLM thinks, verifies, and finally… an “Aha” moment But what if that precious moment is fake? – We found 97+% of thinking steps are decorative! – By steering the LLM, we control what it thinks – CoT monitoring? It’s unreliable
https://x.com/shi_weiyan/status/2031355749905977602
it’s possible that software engineering is the only profession that experiences Jevons paradox, because they are the ones who use AI to automate other professions out of existence
https://x.com/QwQiao/status/2027498505057489332
LangGraph 1.1 is out 🎉 It comes with type-safe stream and invoke, automatic Pydantic and dataclass coercion for outputs, and cleaner interrupt access. All fully opt-in, with zero breaking changes. release notes:
https://x.com/sydneyrunkle/status/2031428770700103777
Large-scale online deanonymization with LLMs https://arxiv.org/pdf/2602.16800
LLMs often reason “performatively” well after deciding on a final answer – something that CoT monitors are slow to catch. Our new paper finds that: – probes can help monitor for this – it seems to track with task difficulty – probes enable early CoT exit, saving tokens! (1/7)
https://x.com/GoodfireAI/status/2032157754077691980
My last open-source project before joining xAI is just out today. Megatron Core MoE is probably the best open framework out there to seriously train mixture of experts at scale. It achieves 1233 TFLOPS/GPU for DeepSeek-V3-685B. https://x.com/EthanHe_42/status/2031243197146607954
Normally, replaying old data reduces forgetting, but it actually helps you learn on new data too! We finally put this paper out on arxiv, but had it up as a Marin GitHub issue ~1 year ago:
https://x.com/percyliang/status/2030084101559271490
Pre-pre-training is at long last getting adopted. Hutter must be happy.
https://x.com/teortaxesTex/status/2032611773308641493
Reasoning boosts search relevance 15-30% https://softwaredoug.com/blog/2025/10/06/how-much-does-reasoning-improve-search-quality
Sharing “Neural Thickets”. We find: In large models, the neighborhood around pretrained weights can become dense with task-improving solutions. In this regime, post-training can be easy; even random guessing works Paper: https://t.co/qlXEkJHSZa Web: https://t.co/xYoYctEqHn 1/
https://x.com/phillip_isola/status/2032483868603822402
Stereo depth models usually force a tradeoff. Either strong zero-shot generalization or real-time speed. Fast-FoundationStereo closes that gap. The model accelerates FoundationStereo by more than 10× while keeping comparable depth quality, making real-time stereo matching
https://x.com/IlirAliu_/status/2030205830058881267
Teaching LLMs to reason like Bayesians https://research.google/blog/teaching-llms-to-reason-like-bayesians/
Technological Speed Limit | metastable https://metastable.org/speed-limit/
The 2026 Global Intelligence Crisis – Citadel Securities https://www.citadelsecurities.com/news-and-insights/2026-global-intelligence-crisis/
The fact that “shift+enter” is so hard in TUIs is insane. Maybe the terminal isn’t the right place for us to be writing complex multi-line prompts?
https://x.com/theo/status/2030832068972937575
The recipe behind today’s frontier reasoning models is surprisingly similar to AlphaGo: 1) Imitate large amounts of human data 2) Scale inference compute to reason better (back then it was Monte Carlo Tree Search, today it’s Chain of Thought) 3) Use RL to go beyond imitation
https://x.com/polynoamial/status/2031404079583473953
there was a decently sized small-to-medium group of folks openly exploring harness engineering late last year (ex: Nikunj, Harrison, @dexhorthy, a bunch more). The model intelligence spike we got since then meant doing good systems engineering when making a harness unlocked
https://x.com/Vtrivedy10/status/2031751769051570256
This is a great pattern to build your own specialized AI, described in our report. – Generate synthetic data with current version of model – Apply efficient large-batch off-policy RL (OAPL) – Generate harder data with updated model – Produce efficient, generalizable small model
https://x.com/matei_zaharia/status/2029976438905208871
this is what I learnt from watching a mob of hackers turn the world into reinforcement learning environments: – environments are the ultimate democratization of AI because you get a stake in ai models without the compute. – coding agents are dominating env building, but need
https://x.com/ben_burtenshaw/status/2031038183161602164
to improve fine-tuning data efficiency, replay generic pre-training data not only does this reduce forgetting, it actually improves performance on the fine-tuning domain! especially when fine-tuning data is scarce in pre-training (w/ @percyliang)
https://x.com/kothasuhas/status/2029983689988542742
U-RLVR is the closest thing to pure, naively implemented recursive self-improvement I know. And in this paper, all U-RLVR methods hit a ceiling and regress. The big bet now is that this does NOT hold in less naive regimes and brings about fully automated ML research.
https://x.com/teortaxesTex/status/2031359466516492731
UNI-1 | Less Artificial. More Intelligent. | Luma https://lumalabs.ai/uni-1
We have some big news to share today. Chutes is partnering with a research team from Harvard University to push the boundaries of AI inference efficiency. The team at Harvard, led by Professor Juncheng Yang @1a1a11a, is developing a new prefix caching algorithm designed to
https://x.com/chutes_ai/status/2031446918316917131
The Future of Software Development Retreat | Deer Valley, Utah, 2026 | Thoughtworks https://www.thoughtworks.com/about-us/events/the-future-of-software-development
Three days ago I left autoresearch tuning nanochat for ~2 days on depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes,
https://x.com/karpathy/status/2031135152349524125
KV-cache math for Nemotron 3 Super: with 8 attention layers, 2 KV heads, and head dim 128, the sequence-growing KV cache comes out to 8,192 bytes/token in BF16 and 4,096 bytes/token in FP8. That means: 1M tokens → 7.63 GiB BF16 / 3.81 GiB FP8; 262k tokens → 2.00 GiB BF16 / 1.00
https://x.com/bnjmn_marie/status/2031821490916905089
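The arithmetic in that post checks out. A sketch reproducing it from the quoted figures (8 attention layers, 2 KV heads, head dim 128; the factor of 2 is for storing both K and V):

```python
# Per-token KV-cache size for a model where only some layers are attention
# layers (the rest being SSM layers that don't grow a KV cache).
def kv_bytes_per_token(attn_layers, kv_heads, head_dim, bytes_per_elem):
    return attn_layers * kv_heads * head_dim * 2 * bytes_per_elem  # x2: K and V

bf16 = kv_bytes_per_token(8, 2, 128, 2)  # 8192 bytes/token
fp8 = kv_bytes_per_token(8, 2, 128, 1)   # 4096 bytes/token

GiB = 1024**3
print(1_000_000 * bf16 / GiB)  # ~7.63 GiB for 1M tokens in BF16
print(262_144 * bf16 / GiB)    # exactly 2.00 GiB for 262k tokens in BF16
```

The small per-token footprint is a direct consequence of the hybrid SSM design: only 8 of the layers carry a growing cache.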




