Image created with gemini-3.1-flash-image-preview and claude-sonnet-4-5. Image prompt: A modern smartphone tumbling through bright blue sky in freefall, screen glowing with colorful app notifications, distant earth far below, shot from wide angle aerial perspective in crisp daylight, with bold clean title typography reading TECH integrated prominently across the upper portion of the image like a magazine cover, simple uncluttered composition, vibrant cheerful mood despite the perilous situation

GPT-5.4 set a new record on FrontierMath, our benchmark of extremely challenging math problems! We had pre-release access to evaluate the model. On Tiers 1-3, GPT-5.4 Pro scored 50%. On Tier 4 it scored 38%. See thread for commentary and additional experiments. https://x.com/EpochAIResearch/status/2029626255776395425

📊 How to evaluate skills❓️ Lots of companies are building skills for coding agents. But how do you know if your skill is actually working? It’s tempting to go by vibes, but performance varies a lot across tasks — and coding agents have a huge action space, which makes that… https://x.com/LangChain/status/2029618086374944771

Agent reliability being a cross-functional problem is the most underrated ops shift right now. You can’t engineer your way out of bad eval criteria — PMs and domain experts have to own their part. https://x.com/saen_dev/status/2028411962712088767

Agent skills are powerful but they are often AI-generated and not tested. Here is a practical guide to evaluating agent skills with code, prompts, and real results. 📋 Define success criteria (outcome, style, and efficiency). 🧪 Create 10-12 prompts with deterministic checks. 🤖… https://x.com/_philschmid/status/2029570052530360719
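The full guide is behind the link, but the flavor of “deterministic checks” is easy to sketch. Everything below is illustrative: the check functions, sample outputs, and scoring are my own stand-ins, not taken from the guide.

```python
# Hypothetical deterministic pass/fail checks over agent outputs.
# Each check inspects an output string; a run is scored by the
# fraction of (output, check) pairs that pass.

def check_has_tests(output: str) -> bool:
    # Outcome check: did the agent produce at least one test function?
    return "def test_" in output

def check_no_todo(output: str) -> bool:
    # Style check: no leftover TODO markers in the final output.
    return "TODO" not in output

CHECKS = [check_has_tests, check_no_todo]

def score_run(outputs: list[str]) -> float:
    """Fraction of (output, check) pairs that pass."""
    total = len(outputs) * len(CHECKS)
    passed = sum(chk(o) for o in outputs for chk in CHECKS)
    return passed / total if total else 0.0

# Compare a baseline run against a run with the skill enabled.
baseline = ["def solve(): pass  # TODO", "def test_x(): assert True"]
with_skill = ["def test_y(): assert solve() == 1", "def test_z(): assert True"]
print(score_run(baseline), score_run(with_skill))  # 0.5 vs 1.0
```

The point of deterministic checks is that the same outputs always get the same score, so a skill-on vs. skill-off comparison measures the skill rather than grader noise.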

Agents, for real work. The latest @code release gives you better agent orchestration, extensibility, and continuity. Here’s what’s new: 🪝 Hooks support 🎯 Message steering and queueing 🌐 Agentic integrated browser 🧠 Shared memory And more… https://x.com/code/status/2029279963778515372

AI agents are tackling more and more “human work.” But are they benchmarked on the work people actually do? tl;dr: Not really. Most benchmarks focus on math & coding, while most human labor and capital lie elsewhere. 📒 We built a database linking agent benchmarks & real-world… https://x.com/ZhiruoW/status/2028847081507488011

Can AI agents agree? Communication is one of the biggest challenges in multi-agent systems. New research tests LLM-based agents on Byzantine consensus games, scenarios where agents must agree on a value even when some participants behave adversarially. The main finding: valid… https://x.com/omarsar0/status/2028823724196343923

Clerk Skills for AI Agents https://clerk.com/changelog/2026-01-29-clerk-skills?dub_id=AlTGRISXA0vckDDY

Introducing SWE-Atlas. We built SWE-Atlas as the next evolution of SWE-Bench Pro, expanding agent evaluation beyond change accuracy to better reflect the real, interactive workflows that define software development. Results for Codebase QnA, the first eval under SWE-Atlas that… https://x.com/scale_AI/status/2029244660905095359

Last week, we did an internal deep dive into enterprise environments/benchmarks like τ²-Bench and CoreCraft. This type of high-fidelity RL env is becoming increasingly popular as frontier labs push their models into more and more agentic capabilities. https://x.com/Shahules786/status/2029603934944235943

Long-running agents accumulate context while model memory stays fixed. This leads to a tradeoff: either discard older information or compress it. New work by @charles0neill explores repeated KV-cache compression for persistent agents using Attention Matching. Our research shows… https://x.com/basetenco/status/2029654320971665651

Today, we’re sharing 🌁 Knowledge Agents from Reinforcement Learning (KARL) 🌁 We trained an agent that excels on challenging grounded reasoning tasks. KARL matches Sonnet 4.5 quality at a fraction of the cost, and with test-time scaling reaches Opus 4.6 levels. This was a fun… https://x.com/mrdrozdov/status/2029580506698850692

ByteDance just published something I’ve been waiting for someone to build: CUDA Agent! It trained a model that writes fast CUDA kernels. Not just correct ones — actually optimized ones. It beats torch.compile by 2× on simple/medium kernels, ~92% on complex ones, and even… https://x.com/BoWang87/status/2028599174992949508

Beyond the flashiness, what’s exciting about this is that products you create with Perplexity Computer don’t require you to manage your own API keys, unlike other agent frameworks. Everything will be run on a secure sandbox that we orchestrate end to end. The stateful abstracted… https://x.com/AravSrinivas/status/2028903680616087946

🤔Can agentic LLM inference break free from storage bandwidth limits? This new paper by DeepSeek together with THU & PKU says yes by rethinking the Prefill / Decode split at the system level, which draws major attention.🚀 What’s the real innovation? 👉 Zhihu contributor deephub… https://x.com/ZhihuFrontier/status/2027496814723928536

[2511.18423] General Agentic Memory Via Deep Research https://arxiv.org/abs/2511.18423

[2603.01896] Agentic Code Reasoning https://arxiv.org/abs/2603.01896

[2603.04390] A Dual-Helix Governance Approach Towards Reliable Agentic AI for WebGIS Development https://arxiv.org/abs/2603.04390

Interesting new research on LLM agent memory. Agent engineers, pay attention to this one. (bookmark it) It introduces a diagnostic framework that separates retrieval failures from utilization failures in agent memory systems. The main findings: – Retrieval method matters far… https://x.com/dair_ai/status/2029202969456234562

BullshitBench v2 is out! It is one of the few benchmarks where models are generally not getting better (except Claude) and where reasoning isn’t helping. What’s new: 100 new questions, by domain (coding (40 Q’s), medical (15), legal (15), finance (15), physics (15)), 70+ model… https://x.com/petergostev/status/2028492834693677377

We added Claude-Opus-4.6 to MathArena! It is a strong model, only second to Gemini-3.1-Pro on most benchmarks. One exception: it scores quite poorly in visual mathematics. Also, it is expensive: we spent around USD 8,000 to add the model, 10x any other model we ever evaluated. https://x.com/j_dekoninck/status/2029160582687985727

The Document Arena is now live with leaderboard scores! See which frontier AI models rank highest in document reasoning, all powered by side-by-side evaluations on user-uploaded PDFs from real work use cases. – #1 is Claude Opus 4.6 scoring 1525, +51 pts in the lead – While… https://x.com/arena/status/2028915403704156581

I looked into how Claude Code and Codex compare to the default scaffolds METR uses for time horizon measurements. It looks like they don’t significantly outperform our default scaffolds on any models we’ve tried them on so far. https://x.com/nikolaj2030/status/2022398669337825737

Building community trust through open science is core to Arena. That’s why the Arena leaderboard runs on Arena-Rank, our open-source Python package for transparent ranking. With it, anyone can construct statistically grounded, reproducible leaderboards using pairwise comparison… https://x.com/arena/status/2027528061508587728
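Arena-Rank itself is at the link; leaderboards built from pairwise comparisons typically rest on a Bradley-Terry-style model, where each model gets a latent strength and the win probability depends on the strength gap. A minimal self-contained sketch of that idea (my own code, not the Arena-Rank API; model names and win counts are invented):

```python
# Minimal Bradley-Terry fit via gradient ascent on pairwise win counts.
import math

def bradley_terry(wins, models, steps=2000, lr=0.1):
    """wins[(a, b)] = number of times model a beat model b."""
    score = {m: 0.0 for m in models}  # latent log-strengths
    for _ in range(steps):
        grad = {m: 0.0 for m in models}
        for (a, b), n in wins.items():
            # P(a beats b) under the current scores (logistic in the gap).
            p = 1.0 / (1.0 + math.exp(score[b] - score[a]))
            grad[a] += n * (1.0 - p)
            grad[b] -= n * (1.0 - p)
        for m in models:
            score[m] += lr * grad[m]
        # Scores are only identified up to a shift; anchor the mean at 0.
        mean = sum(score.values()) / len(score)
        for m in models:
            score[m] -= mean
    return score

wins = {("A", "B"): 8, ("B", "A"): 2, ("B", "C"): 7, ("C", "B"): 3}
s = bradley_terry(wins, ["A", "B", "C"])
print(sorted(s, key=s.get, reverse=True))  # A above B above C
```

Note that A and C never face each other directly; the model still ranks them because both are connected through B, which is what makes pairwise leaderboards work on sparse comparison data.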

He’s back with an improved “BullshitBench V2.” Anthropic models are still dominating everything. https://x.com/scaling01/status/2028494129710133725

Honestly, there should be a standard that any model release with benchmark scores should also release the prompts/trajectory. It’s easier for people to build on top of these models since we won’t have to keep worrying if the eval harness is the problem or not. https://x.com/nrehiew_/status/2029558608393109769

I wish more research teams did this. I remember some time ago we couldn’t repro the Llama 3 scores on MATH because the 1B model was terrible at producing \boxed{} with a vanilla CoT prompt. It turned out you need a detailed system prompt that was not present in any tech report… https://x.com/_lewtun/status/2029571193624306016
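The \boxed{} failure mode is easy to reproduce in miniature. This is an illustrative harness fragment of my own, not the actual Llama 3 eval code: a strict extractor that only accepts \boxed{...} scores a correct but unformatted answer as wrong.

```python
# A strict answer extractor: accepts only answers wrapped in \boxed{...}.
# Any model that answers correctly in plain text gets marked wrong.
import re

def extract_boxed(text: str):
    """Return the contents of the first \\boxed{...}, or None."""
    m = re.search(r"\\boxed\{([^}]*)\}", text)
    return m.group(1) if m else None

plain = extract_boxed("The answer is 42.")          # None -> scored wrong
boxed = extract_boxed(r"So we get \boxed{42}.")     # "42" -> scored right
print(plain, boxed)
```

This is why the tweet above argues for releasing prompts and trajectories: without them, a low score is indistinguishable from an extraction mismatch in the harness.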

Must-read AI research of the week: ▪️ Doc-to-LoRA ▪️ Does Your Reasoning Model Implicitly Know When to Stop Thinking? ▪️ ARLArena ▪️ Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization ▪️ On Data Engineering for Scaling LLM Terminal Capabilities ▪️ … https://x.com/TheTuringPost/status/2028777919057949106

We’re close to saturating WeirdML v2; wouldn’t surprise me to see 5.4 going beyond 86%. https://x.com/teortaxesTex/status/2028444160517144683

What a great illustration of the central problem of AI benchmarking for real work All of the effort is going into benchmarking for coding, but that is a small part of the actual jobs people do, which leaves the true trajectory of AI progress less clear. https://x.com/emollick/status/2028870529906622677

Top 10 Open Models: February 2026 in Code Arena. In the Code Arena, currently 46 different agentic coding models are on the leaderboard, and only 18 are open source, produced by 7 different labs. Here’s how the labs stack up this month: – GLM-5 scoring 1451, ranking… https://x.com/arena/status/2027540296276607105

You can complain about Europe. Or you can apply for €125M to build the next frontier AI lab. They’ll come from 10 teams funded with €125M. Non-dilutive. 24 months. Zero equity taken. On March 19, 2026, Next Frontier AI comes to Paris. SPRIND is launching a €125M Challenge… https://x.com/IlirAliu_/status/2027097090619220083

GPT-5.4 scores 83% on GDPval. https://x.com/scaling01/status/2029618924375965992

GPT 5.3 Codex (xhigh) scores 79.3% and takes the lead on WeirdML, just ahead of Opus 4.6 (77.9%) at less than half the price. It is very solid across the board, but I still feel the peak performance of Gemini 3.1 is stronger. https://x.com/htihle/status/2028441018865955244

BullshitBench v2, created by Peter Gostev, is a benchmark that does something refreshingly different: it tests whether AI models can detect and reject nonsensical prompts instead of confidently rolling with them. Only Anthropic’s Claude models and Alibaba’s Qwen 3.5 score… https://x.com/kimmonismus/status/2029230388028358726

Top 10 Open Models: February 2026 in Text Arena. The top 3 labs have not changed since January, but the scores have gotten tighter between them: – @Zai_org’s GLM-5, scoring 1455 – @Alibaba_Qwen’s Qwen-3.5 397B A17B, scoring 1454 – @Kimi_Moonshot’s Kimi-K2.5 Thinking, 1452 The… https://x.com/arena/status/2027511779417592173

Why Emma Chamberlain Quit YouTube, Again. – YouTube https://www.youtube.com/watch?v=XuVR_elE1Pw

[[Topic of discussion]] is not [[analogy]]. [[Dramatic fact given own line]]. [[Dramatic fact given own line]]. [[Dramatic fact given own line]]. [[Dramatic summary sentence.]] [[Topic of discussion]] is [[different analogy]]. [[Implications delivered with certainty]]. https://x.com/emollick/status/2028532794335342592

[1/9] What happens when you treat vision as a first-class citizen during multimodal pretraining? To find out, we studied the design space of training Transfusion-style models that input and output all modalities, from scratch. Here is what we learned about visual representations… https://x.com/DavidJFan/status/2029239760301035549

[2603.03276] Beyond Language Modeling: An Exploration of Multimodal Pretraining https://arxiv.org/abs/2603.03276

@hardmaru Good to see Hypernetworks again after ~10 years. But I wonder why an attention with an extremely long KV cache doesn’t cut it? Is it efficiency? https://x.com/hyhieu226/status/2027488699810766851

> an example of this is that in hybrid models, sometimes “stronger” linear layers can lead to overall weaker models because it incentivizes the global attention to be “lazy” some people asked about this. i think this is a somewhat folklore result that I don’t have a reference… https://x.com/_albertgu/status/2027457215196778634

🚀 Today we’re releasing FlashOptim: better implementations of Adam, SGD, etc, that compute the same updates but save tons of memory. You can use it right now via `pip install flashoptim`. 🚀 https://t.co/nRrLSpjnwV A bunch of cool ideas make this possible: [1/n] https://x.com/davisblalock/status/2028943987349045610

🚀Trillion parameters. Zero compromises. 100% open source. 🔥Introducing Yuan 3.0 Ultra — our flagship multimodal MoE foundation model, built for stronger intelligence and unrivaled efficiency. ✅️Efficiency Redefined: 1010B total / 68.8B activated params. Our groundbreaking… https://x.com/YuanAI_Lab/status/2029204213180580229

A must-read to refresh basic theory and practice – “Effective Theory of Wide and Deep Transformers” by @AIatMeta It’s a 60+ page analysis of how signal propagation behaves in Transformers and what that means for their scaling Covers: – Forward & backward signal propagation in… https://x.com/TheTuringPost/status/2028394922576121946

A useful piece of context is that the government does not have access to better AI models than you (actually they are worse, because they usually don’t get the latest models), though they may have different guardrails. You should view government AI capabilities through that lens. https://x.com/emollick/status/2028172648975401201

Also keep an eye on @anemll – they’re building an open-source library to run LLMs directly on Apple Neural Engine. Same direction, different approach. The ecosystem is growing fast. https://x.com/AmbsdOP/status/2028507402903986566

Beyond MuP: 3. Special Cases, Special Treatment https://t.co/E1VA0iYmua Derived stability metrics and steepest descent directions for Embedding, LM Head, and RMS Norm layers — explaining why Embedding and LM Head don’t play well with Muon. https://x.com/Jianlin_S/status/2028434454486950280

Bloated patches: LM generated solutions of SWE-bench tasks are consistently longer than human-written gold solutions (and it’s not just comments) 🧵 https://x.com/KLieret/status/2029219763423986030

Excited to announce our new LLM inference algorithm, speculative speculative decoding (SSD)! It is fast 🚀 — up to 2x faster than state-of-the-art inference engines (vLLM, SGLang). Working on this with @tanishqkumar07 and @tri_dao was a blast. Details in thread: https://x.com/avnermay/status/2029251985934041232
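SSD’s details are in the linked thread; for context, classic speculative decoding (which the name plays on) has a cheap draft model propose tokens that the target model accepts with probability min(1, p_target/p_draft), which provably preserves the target distribution. A toy single-position sketch of that standard accept/reject rule, not of SSD itself, with made-up two-token distributions:

```python
# Classic speculative-decoding accept/reject rule for one position.
# Accept a drafted token with prob min(1, p_target/p_draft); on rejection,
# resample from the normalized residual max(0, p_target - p_draft).
import random

def speculative_step(draft_probs, target_probs, rng):
    """draft_probs/target_probs: dicts token -> prob for one position."""
    token = rng.choices(list(draft_probs), weights=draft_probs.values())[0]
    accept = min(1.0, target_probs.get(token, 0.0) / draft_probs[token])
    if rng.random() < accept:
        return token
    residual = {t: max(0.0, target_probs.get(t, 0.0) - draft_probs.get(t, 0.0))
                for t in target_probs}
    return rng.choices(list(residual), weights=residual.values())[0]

rng = random.Random(0)
draft = {"a": 0.7, "b": 0.3}       # cheap model overconfident in "a"
target = {"a": 0.5, "b": 0.5}      # true distribution is uniform
samples = [speculative_step(draft, target, rng) for _ in range(20000)]
print(samples.count("a") / len(samples))  # close to 0.5, matching target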

FA4 now available in lm-engine: https://t.co/2gMqX3AUUH 13.4% end-to-end speedup for Llama 8B training on 4x GB200s (1 node) 🚀🚀🚀 1005.55 TFLOPs for SDPA vs 1140.73 for FA4 (BF16 precision) @tedzadouri @ultraproduct @__tensorcore__ @tri_dao cooked Thanks to @bharatrunwal2 for… https://x.com/MayankMish98/status/2029652583179317378

Finiteness Problem for Diophantine Equations | Epoch AI https://epoch.ai/frontiermath/open-problems/small-diophantine

FlexAttention now has a FlashAttention-4 backend. FlexAttention has enabled researchers to rapidly prototype custom attention variants, with 1000+ repos adopting it and dozens of papers citing it. But users consistently hit a performance ceiling. Until now. We’ve added a… https://x.com/PyTorch/status/2029617988899381376

From logistic regression to AI https://www.johndcook.com/blog/2026/03/04/from-logistic-regression-to-ai/

get in babe, we’re automating the SDLC! something I want to highlight about these that’s super powerful is that they can kick off on any event or webhook, and run in the cloud. so they are not tied to having one person’s laptop open – they are owned by your entire team. https://x.com/jediahkatz/status/2029609785050513576

How AI Will Reshape Public Opinion – by Dan Williams https://www.conspicuouscognition.com/p/how-ai-will-reshape-public-opinion

Hypernetworks!!! https://x.com/willdepue/status/2027310794766176505

I Improved 15 LLMs at Coding in One Afternoon. Only the Harness Changed. | Can.ac https://blog.can.ac/2026/02/12/the-harness-problem/

I really like this new work that we did on building a good recipe for improving SKILL.md files in production. We basically set up a “log -> evaluate -> monitor -> improve” pipeline, and have a full open source example from our PR review bot. https://x.com/gneubig/status/2028576331877822506

I’m pretty annoyed that Hypersteer (a work by some of my friends applying hypernetworks to produce very effective steering vectors from text descriptions) has not received the appropriate amount of credit in later work pursuing basically the same idea. https://x.com/aryaman2020/status/2027327108826173471

I’m unreasonably excited about the fact that we wrote everything in Cute-DSL, embedded in Python. Installing / “compiling” now takes seconds instead of minutes / hours (looking at you, C++ templates). Try pip install fa4! https://x.com/tri_dao/status/2029569885395894742

I’ve been working on a new LLM inference algorithm. It’s called Speculative Speculative Decoding (SSD) and it’s up to 2x faster than the strongest inference engines in the world. Collab w/ @tri_dao @avnermay. Details in thread. https://x.com/tanishqkumar07/status/2029251146196631872

Instead of forcing models to hold everything in an active context window, we can use hypernetworks to instantly compile documents and tasks directly into the model’s weights. A step towards giving language models durable memory and fast adaptation. Blog: https://x.com/hardmaru/status/2027240562898976770

Introducing Modular Diffusers – Composable Building Blocks for Diffusion Pipelines https://huggingface.co/blog/modular-diffusers

Just a small note: the small models have reasoning disabled by default – to enable it, use llama-server with --chat-template-kwargs '{"enable_thinking":true}' or see… https://x.com/danielhanchen/status/2028478490069352448

LLM-based Evolution as a Universal Optimizer – imbue https://imbue.com/research/2026-02-27-darwinian-evolver/

Looking for user feedback about the upcoming ggml official Debian and Ubuntu packages. https://x.com/ggerganov/status/2028505638452531340

LTX-2.3 and LTX Desktop: Production-Ready Engine. Designed to Be Built On. | LTX Blog https://ltx.io/model/model-blog/ltx-2-3-release

Meet KARL, an RL’d model for document-centric tasks at frontier quality and open source cost/speed. Great for @databricks customers and scientists (77-page tech report!) As usual, this isn’t just one model – it’s an RL assembly line to churn out models for us and our customers 🧵 https://x.com/jefrankle/status/2029574154396078213

Most imagination-based world models learn representations by reconstructing pixels. But reconstruction may not be the right objective for control. In our new paper we explore a different idea: 👉 predict the next embedding instead of reconstructing observations. Introducing… https://x.com/BredisGeorge/status/2029190420790411671

Most LLM observability tools show you a wall of diffs when comparing traces. Cool, the trace ID changed. Very helpful. 😓 We shipped a smarter trace comparison view. Chat transcript summaries, score diffs, usage breakdowns, and a new calls drilldown to inspect every call. https://x.com/weave_wb/status/2029624031201386655

New research from @databricks: LLMs Can Learn to Reason via Off-Policy RL Optimal Advantage-based Policy Optimization with Lagged Inference policy (OAPL) shows you don’t need strict on-policy training to improve reasoning. It matches or beats Group Relative Policy Optimization… https://x.com/DbrxMosaicAI/status/2027472333208682858

New research from Databricks AI Research: FlashOptim cuts training memory by over 50% with no measurable loss in model quality. Training a model with AdamW typically requires 16 bytes per parameter just for weights, gradients, and optimizer state. FlashOptim brings that down to… https://x.com/DbrxMosaicAI/status/2028977216940589383
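The 16-bytes figure is standard fp32 AdamW accounting: 4 bytes each for the weights, the gradients, and Adam’s two per-parameter moments. A quick back-of-envelope check (the 8B-parameter example is my own, not from the thread):

```python
# Back-of-envelope check of the "16 bytes per parameter" AdamW figure:
# four fp32 tensors the size of the model (weights, grads, m, v).
def adamw_bytes_per_param(dtype_bytes=4):
    weights = dtype_bytes
    grads = dtype_bytes
    exp_avg = dtype_bytes      # Adam first moment (m)
    exp_avg_sq = dtype_bytes   # Adam second moment (v)
    return weights + grads + exp_avg + exp_avg_sq

def training_state_gb(n_params, bytes_per_param):
    """Memory for weights + grads + optimizer state, in GB (1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

print(adamw_bytes_per_param())        # 16
print(training_state_gb(8e9, 16))     # 128.0 GB for an 8B-param model
```

At 128 GB of training state for an 8B model before activations, it is clear why halving optimizer memory is worth a dedicated library.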

Not enough people have read this quack writeup by Tri. Packed with info on squeezing bandwidth out of every level of the memory hierarchy. https://x.com/fleetwood___/status/2027481778538135966

Not Prompts, Blueprints | Tomasz Tunguz https://tomtunguz.com/filling-the-queue-for-ai/

okay this plot and discussion has blown up more than expected so let me try to leave some candid thoughts 1. i don’t believe that the intent of Mayank’s tweet was to claim “Mamba-2 > GDN”. the primary intent was to convey that the initialization for Mamba-2 makes a huge… https://x.com/_albertgu/status/2027440531232722949

One of the biggest promises of Diffusion LLMs is parallel generation: predicting multiple tokens at once to bypass the sequential bottleneck of autoregressive models. However, parallel generation comes with a price. For example: Should the sentence “He is from [MASK] [MASK]” be… https://x.com/IanLi1118/status/2029074519223353062
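The [MASK] example cuts off, but the underlying issue is that filling each masked position independently from its per-position marginal ignores dependencies between positions. A toy illustration of my own (token pairs and probabilities invented for the example):

```python
# Why independent parallel decoding of two [MASK] slots can go wrong:
# the product of marginals puts mass on pairs the true joint forbids.
from itertools import product

# Suppose the true joint over the two masked tokens is:
joint = {("New", "York"): 0.5, ("Los", "Angeles"): 0.5}

# Per-position marginals induced by that joint:
m1, m2 = {}, {}
for (t1, t2), p in joint.items():
    m1[t1] = m1.get(t1, 0.0) + p
    m2[t2] = m2.get(t2, 0.0) + p

# Independent sampling implicitly uses the product of marginals:
independent = {(t1, t2): m1[t1] * m2[t2] for t1, t2 in product(m1, m2)}
print(independent[("New", "Angeles")])  # 0.25 mass on an impossible pair
```

A quarter of the independent samples land on “New Angeles” or “Los York”, pairs with zero probability under the joint; managing exactly this dependence is the price of parallel generation the tweet refers to.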

Online RL? 🥱 Off-policy RL 🥳 OAPL is what we’re using internally at @databricks for some really cool RL results that you’ll hear about soon… https://x.com/jefrankle/status/2027477902531432623

Sacred values of future AIs — LessWrong https://www.lesswrong.com/posts/sjeqDKhDHgu3sxrSq/sacred-values-of-future-ais

Sometimes an AI product starts “underperforming” and the team spends days fixing the prompt. But the prompt was never broken. The real problem is often the eval rubric — the rules that decide whether an output passes or fails. Most rubrics get written early on, based on a small… https://x.com/kimmonismus/status/2029227463805378571

That 256k soft-ceiling is persistent! Context rot is stubborn. https://x.com/dbreunig/status/2029643546232594809

The author, @Ada_Palmer, is a historian at the University of Chicago who writes “hard social science fiction” — a category that would be useful to expand. She also had some very early thoughtful takes on AI, including this interesting piece from 2023: https://x.com/emollick/status/2028192344520933870

The Great Transition | Daniel Miessler https://danielmiessler.com/blog/the-great-transition

The Shift from Models to Compound AI Systems – The Berkeley Artificial Intelligence Research Blog https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/

There was a nice time where researchers talked about various ideas quite openly on twitter. (before they disappeared into the gold mines :)). My guess is that you can get quite far even in the current paradigm by introducing a number of memory ops as “tools” and throwing them… https://x.com/karpathy/status/2029696850366971921

This paper is one of the first to test AI skills and the results seem to suggest that yes, they have high practical value. They use pretty mediocre skills (6.2/12 quality rating) harvested mostly from places like Github, and still get large boosts, especially outside software. https://x.com/emollick/status/2027966367551361507

We’re excited to introduce Doc-to-LoRA and Text-to-LoRA, two related research projects exploring how to make LLM customization faster and more accessible. https://t.co/wGKDNhBcJX By training a Hypernetwork to generate LoRA adapters on the fly, these methods allow models to instantly… https://x.com/SakanaAILabs/status/2027240298666209535

we’re making @blocks smaller today. here’s my note to the company. #### today we’re making one of the hardest decisions in the history of our company: we’re reducing our organization by nearly half, from over 10,000 people to just under 6,000. that means over 4,000 of you are… https://x.com/jack/status/2027129697092731343?s=20

we’ve never done code review but damn if your team is producing this much code you’re using LLMs entirely incorrectly no one struggles with large amounts of code more than an LLM, if you don’t keep that in check you have a self defeating codebase. https://x.com/thdxr/status/2028827251534352764

welcome 5.3 instant! was proud to help reduce hallucinations for questions where factuality matters most, it’s 26.8% better (when searching) and 19.7% better (when not searching) https://x.com/aidan_mclau/status/2028894122959159434

New ByteDance paper shows how an AI learned to write CUDA hardware code so well it beats standard compilers at their own game. This system creates custom software components that run up to 100% faster than traditional automated tools. Writing instructions for AI hardware is… https://x.com/rohanpaul_ai/status/2029161433519567175

The FA4 paper is finally out after a year of work. On Blackwell GPUs, attention now goes about as fast as matmul even though the bottlenecks are so different! Tensor cores are now so fast that attn fwd is bottlenecked by exponential, and attn bwd is bottlenecked by shared… https://x.com/tri_dao/status/2029569881151263082

Here’s some code for an experiment that doesn’t work so well: https://t.co/GcQ3gQ6uML – Basically you chat with a model running locally – Every now and then you /sleep the model to transition short-term memory to long-term memory – The /sleep command runs the same model to… https://x.com/awnihannun/status/2029693579006988531

A vision system can be a universal communication port between AI models. ▪️ Vision Wormhole is a new framework that lets VLMs exchange compact continuous “thought messages” through a shared visual channel instead of slow text. The sender model: – Converts its internal… https://x.com/TheTuringPost/status/2027901044538413504

Humans communicate through language and interact with the world through vision, yet most multimodal models are language-first. What happens when we go beyond language? 🤔 Beyond Language Modeling: a deep dive into the design space of truly native multimodal models Paper: https://x.com/__JohnNguyen__/status/2029236083914096756

New paper out! We present a training method for multimodal generative models, called Self-Flow, which combines classic flow matching and representation learning. Why? Unlike most representation alignment methods, our new approach does not require external, pretrained models and… https://x.com/robrombach/status/2029272803099226425

Train Beyond Language. We bet on the visual world as the critical next step alongside and beyond language modeling. So, we studied building foundation models from scratch with vision. We share our exploration: visual representations, data, world modeling, architecture, and… https://x.com/TongPetersb/status/2029237530160169286


Discover more from Ethan B. Holland
