Technical and Dev: AI News Week Ending 03/27/2026

Image created with gemini-3.1-flash-image-preview with claude-sonnet-4-5. Image prompt: Using the provided reference image of the wooden shipping crate, preserve the exact construction, weathered reddish-brown paint, iron hardware, three-panel layout, and hand-painted black stencil lettering style. Replace the address text with ‘TECH’ in the same loose brushstroke style. Place the crate on a weathered wooden dock at dawn with soft raking light. Through visible gaps between slats, show gleams of precision-machined metal components, circuit boards, and ribbon cables with grease pencil labels, suggesting prototypes in transit. Background: out-of-focus water and pale spring sky.

Announcing ARC-AGI-3: the only unsaturated agentic intelligence benchmark in the world. Humans score 100%, AI <1%. This human-AI gap demonstrates we do not yet have AGI. Most benchmarks test what models already know; ARC-AGI-3 tests how they learn.
https://x.com/arcprize/status/2036860080541589529

ARC-AGI-3 benchmark: – 100% solvable by humans – 1% solvable by AI Everybody keep building benchmarks that agents utterly fail at! Proud this was a Laude Slingshot; will fund other benchmarks that reset SotA to 1%:
https://x.com/andykonwinski/status/2036870772745261202

ARC-AGI-3 is out now! We’ve designed the benchmark to evaluate agentic intelligence via interactive reasoning environments. Beating ARC-AGI-3 will be achieved when an AI system matches or exceeds human-level action efficiency on all environments, upon seeing them for the first
https://x.com/fchollet/status/2036861192619384989

ARC-AGI-3 the agentic benchmark where humans can’t beat the “human baseline” and typical agentic harnesses and tools aren’t allowed > 100% just means that all levels are solvable > the 1% number uses completely different and extremely skewed scoring based on the 2nd best
https://x.com/scaling01/status/2036890367803429230

ARC-AGI 3 is here, and all existing AI models are below 1% on the benchmark. It’s gonna take a while until this one is saturated. How it measures intelligence: – 100% human-solvable environments – Skill-acquisition efficiency over time – Long-horizon planning with sparse
https://x.com/mark_k/status/2036882659406762031

ARC-AGI-3 https://arcprize.org/arc-agi/3

ARC-AGI-3 took me a few tries, but it is definitely human winnable. I am curious how much of the initially very low performance of frontier models is harness, vision, and tools, versus how much are limitations of LLMs. I guess we will find out!
https://x.com/emollick/status/2036865990282092940

General game playing is more difficult than “AGI” (Just to be clear: I really like ARC-AGI-3 and think it’s a great contribution, but the proliferation of AGI benchmarks is IMO proof of how pointless the concept of AGI is)
https://x.com/togelius/status/2036989880887050333

Keep in mind: ARC-AGI is *not* a final exam that you pass to claim AGI. Including ARC-AGI-3. The benchmarks target the residual gap between what’s hard for AI and what’s easy for humans. It’s meant to be a tool to measure AGI progress and to drive researchers towards the most
https://x.com/fchollet/status/2036879665655406944

One killer feature of ARC-AGI-3 is hosted replays for analysis. We published replays for all verified scores (seen below). And individual researchers can use the same tools to improve their models.
https://x.com/mikeknoop/status/2036904122549751907

The scoring of ARC-AGI-3 doesn’t tell you how many levels the models completed, but how efficiently they completed them compared to humans. It uses squared efficiency: if a human took 10 steps to solve a level and the model took 100 steps, the model gets a score of 1%.
https://x.com/scaling01/status/2036864865307177430
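
A minimal sketch of the squared-efficiency arithmetic described in the post above, assuming the per-level score is (human steps / model steps) squared. The function name and the cap at 100% for a model that needs fewer steps than the human are assumptions for illustration, not the official ARC-AGI-3 scoring code.

```python
def squared_efficiency(human_steps: int, ai_steps: int) -> float:
    """Score one solved level as (human_steps / ai_steps) squared."""
    if human_steps <= 0 or ai_steps <= 0:
        raise ValueError("step counts must be positive")
    ratio = human_steps / ai_steps
    # Assumption: a model that is more efficient than the human is capped at 1.0.
    return min(1.0, ratio) ** 2


# The example from the post: human needs 10 steps, model needs 100.
score = squared_efficiency(human_steps=10, ai_steps=100)
print(f"{score:.2%}")  # 1.00%
```

Because the ratio is squared, a model that takes 10x as many steps as the human scores 1% on that level rather than 10%.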

Check out our new blog post about TurboQuant for ICLR’26. Beyond its favorable empirical performance (6x speedup!), it provides an interesting theoretical foundation; raises interesting algorithmic questions for quantization for Nearest Neighbors & KV-cache Compression as well.
https://x.com/mirrokni/status/2036905273999200481

Google’s TurboQuant AI-compression algorithm can reduce LLM memory usage by 6x – Ars Technica https://arstechnica.com/ai/2026/03/google-says-new-turboquant-compression-can-lower-ai-memory-usage-without-sacrificing-quality/

Great work @iotcoi 🔥 Google’s TurboQuant in vLLM. 4M+ KV-cache tokens on a USB-charger-sized box.
https://x.com/vllm_project/status/2036989821156270501

Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: https://x.com/GoogleResearch/status/2036533564158910740

Almost everyone is talking about @GoogleResearch’s TurboQuant (and for good reason) ➡️ It lets you run a 3-bit system with the accuracy of a full-precision model. Technically, TurboQuant is a compression algorithm that shrinks high‑dimensional vectors to low precision without https://t.co/PioTwPpdvf
https://x.com/TheTuringPost/status/2037182800466698718
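
To make "shrinks high-dimensional vectors to low precision" concrete, here is a plain uniform 3-bit scalar quantizer. This is explicitly not TurboQuant, whose actual algorithm is described in the Google Research blog post linked below. It is a generic baseline sketch, and every name in it is hypothetical.

```python
from typing import List, Tuple


def quantize(vec: List[float], bits: int = 3) -> Tuple[List[int], float, float]:
    """Map each float to an integer code in [0, 2**bits - 1]."""
    levels = 2 ** bits - 1
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / levels
    if scale == 0.0:  # constant vector: any positive scale works
        scale = 1.0
    codes = [round((x - lo) / scale) for x in vec]
    return codes, lo, scale


def dequantize(codes: List[int], lo: float, scale: float) -> List[float]:
    """Reconstruct approximate floats from the integer codes."""
    return [lo + c * scale for c in codes]


vec = [0.12, -0.80, 0.45, 0.99, -0.33, 0.07]
codes, lo, scale = quantize(vec, bits=3)
approx = dequantize(codes, lo, scale)
# Rounding to the nearest level bounds the per-element error by scale / 2.
max_err = max(abs(a - b) for a, b in zip(vec, approx))
```

Each element is stored as a 3-bit code plus one shared offset and scale per vector. A naive scheme like this loses accuracy as the bit width drops, which is the problem the posts above say TurboQuant addresses.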

TurboQuant: Redefining AI efficiency with extreme compression https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/

OpenAI released GPT-5.4 mini and nano, cheaper variants of GPT-5.4 with the same reasoning modes. GPT-5.4 nano is the standout, scoring ahead of both Claude Haiku 4.5 and Gemini 3.1 Flash-Lite Preview with lower per-token pricing. @OpenAI released GPT-5.4 mini (xhigh, 48) and
https://x.com/ArtificialAnlys/status/2037043552405119395

I’m incredibly proud of The AI Scientist team for this milestone publication in @Nature. We started this project to explore if foundation models could execute the entire research lifecycle. Seeing this work validated at this level is a special moment. I truly believe AI will
https://x.com/hardmaru/status/2036841736702767135

One of the most exciting findings in our @Nature paper is the discovery of a clear scaling law of AI science. By using our Automated Reviewer to grade papers generated by different foundation models, we observed that as the underlying models improve, the quality of the generated
https://x.com/SakanaAILabs/status/2036999652298678630

The AI Scientist V1 was completed months before o1-preview and reasoning models were released. The models have clearly gotten much more capable since then. Very excited for where things are headed for AI and automated research!
https://x.com/_chris_lu_/status/2037090588550418510

The AI Scientist: Towards Fully Automated AI Research, Now Published in Nature!!✨ Today in Nature we share a comprehensive technical summary of our work on The AI Scientist, including new scaling law results showing how it improves with more compute and more intelligent
https://x.com/jeffclune/status/2036866082418680297

The full Nature paper is open access. For those interested in more details, you can read the PDF directly here: https://t.co/KHbsarYspN We also released the code for both AI Scientist versions for the community to explore. V1: https://t.co/LUD6p2mR76 V2: https://x.com/SakanaAILabs/status/2037205439109095712

Thank you Sarah, my pleasure to come on the pod! And happy to do some more Q&A in the replies.
https://x.com/karpathy/status/2035158351357911527

A New Framework for Evaluating Voice Agents (EVA) https://huggingface.co/blog/ServiceNow-AI/eva

This is one of the most interesting papers on self-improving agents for this year. (bookmark this one) Most self-improving AI systems hit the same wall: the mechanism that generates improvements is fixed and can’t improve itself. This new work from Meta and collaborators
https://x.com/omarsar0/status/2036828723878793335

@stalkermustang It is trivial to solve all public ARC-AGI-3 tasks if you have a human looking at them and designing a system to beat them (we have released a harness that uses human replay to score 100%). But our leaderboard is not about measuring how well human intelligence does on ARC-AGI-3,
https://x.com/fchollet/status/2036870715392352751

> be me > build “AGI” benchmark > actually version 3 already > we don’t talk about 1 and 2 > (they saturated in a year) > invent new scoring method > if human scores above AI, use squared efficiency > example: human took 10 steps to solve level > AI took 100 steps to solve a
https://x.com/scaling01/status/2036866103884775654

An alien species with zero knowledge of human language could ace ARC-AGI-3 on day 1, and I think that’s beautiful. At a time when AI is dominated by language models, it’s refreshing to have a frontier benchmark (the only one that I’m aware of) that requires zero language
https://x.com/bradenjhancock/status/2036879154772402636

ARC-AGI-3 scores for GPT-5.4, Gemini 3.1 Pro and Opus 4.6: – Gemini 3.1 Pro: 0.37% – GPT-5.4: 0.26% – Opus 4.6: 0.25% – Grok 4.2: 0%
https://x.com/scaling01/status/2036853669065306534

Maybe we should retroactively all just agree with @tylercowen that o3 was AGI so we can stop arguing about it. (Also, doing so will drive home the lesson that AGI alone is not enough for transformation)
https://x.com/emollick/status/2036480810677662006

The G in AGI stands for “general”. General intelligence does not mean that you have been specifically trained for a large range of tasks. It means you can approach any NEW task and figure it out, just like humans do. If regular people can do it on their own (no guidance, no
https://x.com/fchollet/status/2036866189587271797

We Tested MiniMax M2.7 Against Claude Opus 4.6 – by Darko https://blog.kilo.ai/p/we-tested-minimax-m27-against-claude

AI has solved one of the problems in FrontierMath: Open Problems, our benchmark of real research problems that mathematicians have tried and failed to solve. See thread for more.
https://x.com/EpochAIResearch/status/2036114281985724906

What do frontier AI companies’ job postings reveal about their plans? https://epochai.substack.com/p/what-do-frontier-ai-companies-job

“Goetterdaemmerung’s corpus hemorrhaged through cryptographic hash, eschaton pooling in existential void beneath fluorescent hum. photons whispering prayers” is a garbage sentence that GPT-5 loves. You shouldn’t be using LLMs as a judge of good writing. They are easily fooled.
https://x.com/emollick/status/2035817176758673492

Zhilin at GTC: Introducing Attention Residuals. Learning selective memory, rather than mechanically accumulating everything, is the beauty of attention. Many of you have probably read Attention Is All You Need, the 2017 Transformer paper that brought “human-like” attention into https://t.co/1coOW90s0n
https://x.com/Kimi_Moonshot/status/2037010118957817988

Deep transformers used to accumulate layer history. Now they are starting to retrieve from it. → @Kimi_Moonshot proposed Attention Residuals (AttnRes), driving this shift. They turn the residual stream into an attention problem. Why do we need it? Depth in Transformers mostly https://t.co/L4pMwyiRY2
https://x.com/TheTuringPost/status/2037107923109953788

[2603.19220] Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation https://arxiv.org/abs/2603.19220

🌐Unified Post-Training via On-Policy-Trained LM-as-RM🔧 RLLM = RL + LM-as-RM: – post-training framework that unifies RL across easy-, hard-to-verify, and non-verifiable tasks. – trains the LM-as-RM reward model on-policy from the policy’s own outputs, then uses those
https://x.com/jaseweston/status/2036119252214620513

16 Reinforcement Learning approaches you should know about (classic + modern) ▪️ RLHF – RL from Human Feedback ▪️ RLAIF – RL from AI Feedback ▪️ RLVR – RL with Verifiable Rewards ▪️ RLCF – RL from Community Feedback (2 different variants) ▪️ RLCF – RL from Checklist Feedback ▪️
https://x.com/TheTuringPost/status/2035857987705954760

2) for example here’s the search-term-methodology skill that contains instructions similar to what a paid search marketer would follow. eg filtering by add/excluded = none, sorting by spend desc to find the heavy hitters, and cross-referencing the search term with the keyword to
https://x.com/helloitsaustin/status/2036553585878769708

A new paper from @ylecun and others – V-JEPA 2.1 It changes the recipe of V-JEPA so the model learns both: • Global semantics – what is happening in the scene • Dense spatio-temporal structure – where things are and how they move The idea is to supervise not just masked
https://x.com/TheTuringPost/status/2034795966931640533

A Ramsey-style Problem on Hypergraphs | Epoch AI https://epoch.ai/frontiermath/open-problems/ramsey-hypergraphs

A reminder that exponentials eventually turn into s-curves. (Though this one is likely to stop sooner than AI ability gains)
https://x.com/emollick/status/2036504304329122287

AI can help us learn hard-to-teach skills, like empathy. Preregistered study of 968 people found almost no correlation between feeling empathic & communicating empathy. But a single practice session with an AI coach made people measurably better at it https://x.com/emollick/status/2035726331854356485

AI’s Bundling Moment | Tomasz Tunguz https://tomtunguz.com/2026-03-24-saas-unbundled-ai-rebundled/

Evidence that AI models can, indeed, learn “taste” in this paper where a small model, trained on citations, is able to predict which papers will be hits. Citations, upvotes & shares are signals that can teach AI judgment about quality, not just execution. https://x.com/emollick/status/2035387292311769278

Final training runs account for a minority of R&D compute spending https://epochai.substack.com/p/final-training-runs-account-for-a

Hparams scaling rules without experiments? What is this black magic??
https://x.com/giffmana/status/2036156010272849950

I don’t think AIs should be auto-adding themselves as credited on projects on GitHub or elsewhere. It primarily serves as a marketing tool to promote the product, but undermines the much more critical aspect that humans should be able to choose their relationship with AI work.
https://x.com/emollick/status/2035398018019508403

If DSPy is So Great, Why Isn’t Anyone Using It? • Skylar Payne https://skylarbpayne.com/posts/dspy-engineering-patterns/

Interesting finding in this paper showing that, for product development ideas, AIs consistently rank above humans (well, humans on Prolific) & larger and more recent models are more creative than previous ones. (It also tries a creativity intervention that doesn’t work on LLMs)
https://x.com/emollick/status/2036104905568452967

Introducing OpenReward. 🌍 330+ RL environments through one API ⚡ Autoscaled sandbox compute 🍒 4.5M+ unique RL tasks 🚂 Works like magic with Tinker, Miles, Slime Link and thread below.
https://x.com/GenReasoning/status/2036412836742590950

JEPA are finally easy to train end-to-end without any tricks! Excited to introduce LeWorldModel: a stable, end-to-end JEPA that learns world models directly from pixels, no heuristics. 15M params, 1 GPU, and full planning <1 second. 📑: https://x.com/lucasmaes_/status/2036080584569618741

Just switched from S3 to HF Buckets for storing datasets and checkpoints in my training runs. Same workflow, just `hf sync` instead of `s5cmd sync`. Quick benchmark on 410GB of tokenized data:
https://x.com/LoubnaBenAllal1/status/2036778058439385568

JWT format support for M2M tokens https://clerk.com/changelog/2026-02-24-m2m-jwt-tokens?dub_id=Wj99fH2I6Pb7J9JU

Must-read AI research of the week: ▪️ Complementary Reinforcement ▪️ Efficient Exploration at Scale ▪️ MetaClaw ▪️ Online Experiential Learning for LMs ▪️ A Subgoal-driven Framework for Improving Long-Horizon LLM Agents ▪️ When AI Navigates the Fog of War ▪️ Attention Residuals
https://x.com/TheTuringPost/status/2036365689196519916

Optimization theory for adaptive methods actually predicts most of what we know about hyperparameter scaling in LLM pretraining, and suggests new strategies as well. We did a deep dive here.
https://x.com/orvieto_antonio/status/2036129786205008188

Quantization from the ground up | ngrok blog https://ngrok.com/blog/quantization

Ray Data LLM enables 2x throughput over vLLM’s synchronous LLM engine at production-scale https://www.anyscale.com/blog/ray-data-llm-2x-throughput-vs-vllm

Search teams can learn from this late interaction win of @LightOnIO and @antoine_chaffin Late interaction uses fewer parameters more efficiently than a direct cross encoder. It lets you remember a document / query’s representation for reuse for future searches. It’s easy to
https://x.com/softwaredoug/status/2036082251734138904
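
A small sketch of why late interaction allows representation reuse, assuming ColBERT-style MaxSim scoring: each query token vector is matched against its best document token vector and the maxima are summed. The toy 2-d vectors and all names are hypothetical, and a real system would get the vectors from an encoder model.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Sum, over query tokens, of the best-matching document token similarity."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Document token vectors are computed once and stored for every future query.
doc_index: Dict[str, List[Vector]] = {
    "doc_a": [[1.0, 0.0], [0.0, 1.0]],
    "doc_b": [[0.5, 0.5], [0.5, 0.5]],
}

# Only the query is encoded at search time.
query: List[Vector] = [[1.0, 0.0], [0.0, 1.0]]
scores = {name: maxsim_score(query, vecs) for name, vecs in doc_index.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

A cross encoder has to run the model on every (query, document) pair, whereas here the stored document vectors are only compared against the query vectors with cheap dot products.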

Synthetic data generation is now native in transformers 🔥 Last week, transformers continuous batching (CB) hit 84% of vLLM throughput. This week, we tuned torch.compile: now we are at 95% for 8K generation length 🦾 The gap isn’t closing anymore. It’s gone.💀
https://x.com/remi_or_/status/2036466918618509391

This is a cool, practical technique for increasing AI idea diversity by adding random priming phrases & bits of end words. Similar prompts produce similar ideas, but since LLMs attend more to the start & end of inputs, this approach pushes towards novelty. https://x.com/emollick/status/2035046505262833846
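
One way the idea above could be sketched, assuming the technique amounts to wrapping the same base prompt with a random phrase at the start and random word fragments at the end. The phrase lists, fragment choices, and function name are all made up for illustration and are not taken from the paper.

```python
import random
from typing import List

# Hypothetical priming material; any unrelated phrases or fragments would do.
PRIMING_PHRASES = [
    "A lighthouse keeper's notebook.",
    "Notes from a night market.",
    "Instructions found inside a broken compass.",
]
END_FRAGMENTS = ["lum", "tide", "quar", "vel", "ost"]


def diversify(base_prompt: str, n: int, seed: int = 0) -> List[str]:
    """Return n variants of base_prompt with random text at the start and end."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    variants = []
    for _ in range(n):
        start = rng.choice(PRIMING_PHRASES)
        end = " ".join(rng.sample(END_FRAGMENTS, k=2))
        variants.append(f"{start}\n{base_prompt}\n{end}")
    return variants


prompts = diversify("Suggest one new product idea for commuters.", n=4)
```

Each variant would be sent to the model as a separate request. The base task stays identical, and only the positions the post says LLMs attend to most are perturbed.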

This is a great blog post. But note that hyperparameter transfer rules are also often optimizer dependent. Take for example, using Muon for the hidden weights: learning rates already transfer across width, so we can drop the m_N^{-1} factor. It’s also often better to
https://x.com/leloykun/status/2036178508809118067

Token Myth – by FD – Robonomics https://robonomics.substack.com/p/token-myth

Uni-1 is here! A new kind of model that thinks and generates pixels simultaneously. Less artificial. More intelligent.
https://x.com/LumaLabsAI/status/2036107826498544110

We rebuilt vLLM’s execution core from the ground up — more efficient, more modular. Introducing Model Runner V2! 🔧 Modular design with cleaner abstractions ⚡️GPU-native input preparation 🔄 Async-first with zero CPU-GPU sync 🔋 New Triton-native sampler Already seeing
https://x.com/vllm_project/status/2036540976144253235

We released 🤗 Kernels 0.12.3 to support Flash-Attention 4! This means we now support kernels written in `cutlass.cute`.
```
from kernels import get_kernel

# Load the Flash-Attention 4 kernel from the Hub
fa4 = get_kernel("kernels-community/flash-attn4", version=0)
```
and you’re ready to go! Diffusers and
https://x.com/RisingSayak/status/2036038782793994541

When I built menugen ~1 year ago, I observed that the hardest part by far was not the code itself, it was the plethora of services you have to assemble like IKEA furniture to make it real, the DevOps: services, payments, auth, database, security, domain names, etc… I am really
https://x.com/karpathy/status/2037200624450936940

Why aren’t we fine-tuning more? | Nate Meyvis https://www.natemeyvis.com/why-arent-we-fine-tuning-more/

Working on automating our whole release pipeline (gotta protect myself from mistakes) and ran into some limits of GitHub’s free tier. From asking to “yes ofc we sponsor you”: 5 min. Kudos, @github !
https://x.com/steipete/status/2036216692750295442

In 2023, WebArena took 7 grad students more than 6 months to build just 5 environments with 812 variable browser-use tasks. Now, it takes under 10 hours and less than $100 per environment, with easy support for parallel generation. Excited to introduce WebArena-Infinity: a
https://x.com/shuyanzh36/status/2036098118023049630

One interesting dynamic in AI is infra+application co-dependence. An older version is hardware (infra) + LLM (app): – Architectures that work well with current-gen GPUs+TPUs work better because they scale – Hardware makers optimize for the current architectures because that’s
https://x.com/gneubig/status/2036949907311915378

We’re Optimal Intellect, a research lab from the team behind CVXPY. Today we’re introducing Moreau: a GPU-native solver that’s orders of magnitude faster than the best existing tools.
https://x.com/opt_intellect/status/2036485190646735291

Thank you so much for having us @CShorten30! It was a pleasure getting to chat about ColBERT-Zero and multi-vector search 🥰
https://x.com/AmelieTabatta/status/2036082256482062606

SOC II is in the news right now for being security theater. You know what SOC II is *actually* good for? Subprocessor lists. I scraped 417 companies’ subprocessors to investigate what AI-native companies are using for their infrastructure. Introducing DeployGraph dot com 🥞
https://x.com/nikunj/status/2036572222081606065?s=12

UNI-1 | Less Artificial. More Intelligent. | Luma https://lumalabs.ai/uni-1