Technical and Dev: AI News Week Ending 05/08/2026

Image created with gemini-3.1-flash-image-preview with claude-opus-4.7. Image prompt: Using the bell-pass-trail-vista reference for environment and the windgate-pass-trail-sign reference for sign style, plant a weathered brown-post trail sign on a rocky Sonoran slope with the bold all-caps header ‘TECH’ and fictional trail entries like ‘Silicon Saddle 2.4 mi →’, ‘Bandwidth Basin 3.7 mi →’, ‘← Firmware Falls 1.1 mi’, with a small circuit-board emblem replacing the medallion; beside the post, arrange a small cairn-like stack of sun-bleached consumer tech — a dusty smartphone, a circuit board, a coiled cable — nestled naturally among volcanic rocks, with saguaros and the hazy Scottsdale valley behind, photorealistic warm midday light.

Artificial Analysis is partnering with Harvey on their new Legal Agent Benchmark! Harvey’s Legal Agent Benchmark (LAB) is an agent-native take on how AI should be contributing to legal work in 2026 – made up of 1200 agentic tasks across 24 practice areas. It’s highly aligned with
https://x.com/ArtificialAnlys/status/2052145762650431840

Introducing Harvey’s Legal Agent Benchmark
https://www.harvey.ai/blog/introducing-harveys-legal-agent-benchmark

LAB is the first long-horizon, open-source legal agent benchmark, from @harvey. it will help legal teams answer “what can legal agents do today?”, plan deployment, and design human-agent cooperation. autonomous legal is a deep domain, and a good benchmark can accelerate progress
https://x.com/saranormous/status/2052061665596948894

Meta prepares Hatch AI Agent with waitlist and social skills
https://www.testingcatalog.com/meta-prepares-hatch-agent-under-waitlist-and-social-media-skills/

Meta is planning to power its AI data centers with solar energy beamed from space. If it works, solar farms could produce power 24/7 without batteries or backup generators. The company behind it all is Overview Energy — they want to launch 1,000 satellites into orbit, 22,000
https://x.com/rowancheung/status/2051320518905930208

Today we’re releasing Refactoring, the final leaderboard of our SWE Atlas suite. This new leaderboard is the ultimate test of an agent’s ability to restructure code without breaking the system. Claude Opus 4.7 with Claude Code takes the top spot🥇
https://x.com/ScaleAILabs/status/2052434456510878021

Gemma-4 lands in Code Arena: Frontend Webdev and shifts the Pareto Frontier! Among open models, Gemma-4-31b ranks #13 and Gemma-4-26b-a4b ranks #17. Congrats to @GoogleDeepMind on shifting the frontier!
https://x.com/arena/status/2052061349312921686

Congrats to @OpenAI for taking the top spot on our Audio MultiChallenge S2S leaderboard with the release of GPT‑Realtime‑2 🥇 GPT-Realtime-2 more than doubles GPT-Realtime-1.5 on instruction retention, rising from 36.7% to 70.8% APR, and also stands out on voice editing,
https://x.com/ScaleAILabs/status/2052451341071683732

All benchmarks are flawed, but GPQA has been fairly consistent & highly correlated with other measured benchmarks. I think it’s a good way to see how far we’ve come that the free model from OpenAI, GPT 5.5 Instant, is at a level that even paid models did not reach until late 2025
https://x.com/emollick/status/2051801703209742734

MRC is already deployed across all of OpenAI’s largest supercomputers that we use to train frontier models, including our site with @Oracle Cloud Infrastructure (OCI) in Abilene, Texas, and in @Microsoft’s Fairwater supercomputers. MRC is now available through the
https://x.com/OpenAI/status/2052025533937103102

NVIDIA just open-sourced a transport protocol that powers OpenAI’s Blackwell clusters. It opened MRC, a new RDMA transport protocol for massive AI training clusters. Instead of pushing GPU traffic through one fragile path, MRC spreads a single connection across multiple network
https://x.com/kimmonismus/status/2052011784023028060

Supercomputer networking to accelerate large scale AI training | OpenAI
https://openai.com/index/mrc-supercomputer-networking/

We’ve partnered with @AMD, @Broadcom, @Intel, @Microsoft, and @NVIDIA, to release Multipath Reliable Connection (MRC), a new open networking protocol that helps large AI training clusters run faster and more reliably, with less wasted GPU time.
https://x.com/OpenAI/status/2052025532485902368

Medicine | The 2026 AI Index Report | Stanford HAI
https://hai.stanford.edu/ai-index/2026-ai-index-report/medicine

New paper (on an old AI) tests o1 against doctors on medical benchmarks & real ER cases: “across a variety of scenarios and applications, the large language model outperformed both human physicians and older models” The potential suggests an “urgent need for prospective trials.”
https://x.com/emollick/status/2050197369250033813

// Recursive Multi-Agent Systems // Great read for the weekend. (bookmark it) Multi-agent systems often pass full text messages between agents at every step. This leads to token bloat, latency, and context dilution which all grow with the number of agents. RecursiveMAS asks a
https://x.com/omarsar0/status/2050261229315477988

🎉Introducing PyFlue: The Python-Native Agent Harness Framework.🧰 💡Flue for Python: Fred K. Schott @FredKSchott CEO of HTML has launched Flue: The Agent Harness Framework for TypeScript. It brings programmable harness right into your agents rather than DIY plumbing. Python
https://x.com/Shashikant86/status/2050999432569651221

🚀 New on the @vllm_project blog: Serving Agentic Workloads at Scale with vLLM x Mooncake. Agentic traces grow to 80K+ tokens with 94%+ reusable prefixes, but local KV caches evict them and cross-instance routing misses them. By integrating Mooncake Store as a distributed KV
https://x.com/vllm_project/status/2052113331927060840

9 New approaches to Multi-Agent Systems ▪️ RecursiveMAS ▪️ OneManCompany (OMC) ▪️ OrgAgent ▪️ CORAL ▪️ LLMA-Mem ▪️ Agentic Federated Learning ▪️ CASCADE ▪️ GRASP ▪️ Reinforced Agent These methods express truly interesting various ideas! Learn more about them here:
https://x.com/TheTuringPost/status/2050957812432580956

Agentic Business Orchestration & Automation Platform | UiPath
https://www.uipath.com/

Agents are already moving at machine speed, but security is still stuck on static, outdated rules. We can close the gap. On May 5, @rubrikInc is hosting a technical webinar on AI security at scale – Building AI Resilience: Managing Agent Risk with Trust Infrastructure →
https://x.com/TheTuringPost/status/2049985228421361929

Agents can now add durable execution to their plans with Dynamic Workflows.
https://x.com/celso/status/2050211184129786084

Agents for Everything Else — swyx – YouTube

Coming back to drafting a set of “10 commandments” for coding with agents. Here are the current candidates: – Implement to learn – Rebuild often – E2E tests are gold – Document intent – Maintain your spec – Find what’s hard (that’s the value) Thoughts? What am I missing?
https://x.com/dbreunig/status/2051081626139210202

create_agent – how we build Deep Agents on the simplest harness primitive. underlying all of the harness engineering, research, and API design in Deep Agents is a very simple primitive in LangChain called create_agent. the entire design of deepagents comes from optionally
https://x.com/Vtrivedy10/status/2050239109038232005

Cursor can now automatically fix CI failures. Set up always-on agents that monitor GitHub, investigate root causes, and open PRs with fixes.
https://x.com/cursor_ai/status/2051739625958584659

deepagents-cli is quietly becoming the best place to start coding with open weight models. we’ve been investing heavily in making it a harness that’s truly model-agnostic, without compromising performance! different models perform best with different harnesses — prompts,
https://x.com/masondrxy/status/2051359502918648319

Domain-Specific Agent Applications Workshop | AWS Marketplace
https://pages.awscloud.com/awsmp-gro-pude-webinar-mss-module-5-use-case-specific-agents-workshop.html?trk=e7545cd0-d7fb-4544-85cc-1d4223437479&sc_channel=el

everyone would have a deeper appreciation for Agent Products that rock because of great Context/Harness Engineering if they… talked to: – LLM Base models – Post-Trained models with no harness (no tools, no built in prompts, nothing) helps internalize how much stuff needs to
https://x.com/Vtrivedy10/status/2051674478648742002

For individual AI use, the jagged frontier is increasingly well understood. In multi-agent workflows in organizations, AI is jagged in ways that have not been well identified yet. In fact, we don’t even have a vocabulary around multi-agent systems & the ways they fail or succeed.
https://x.com/emollick/status/2051479583585616023

How AI Agent Memory Works
https://memory.cobanov.dev/

I detected a bad Agent action, what do I do about it? this is pretty much the main question that will power the future’s Human+Agent driven improvement loops. Gather data -> Mine Errors -> Find out which piece(s) of the agent contribute to this behavior -> Apply Fix -> Test
https://x.com/Vtrivedy10/status/2051727418134593632

I was quoted a couple times in this Atlantic article, but that isn’t (the only) reason I think it is good. It lays out the reasons why we whipsawed from “AI is a bubble” to “there are not enough data centers” in less than six months. Spoiler: it’s agents.
https://x.com/emollick/status/2050396928798535990

if you haven’t read this one by @Vtrivedy10, it’s a must read! great overview of what components a harness needs to support an agent for long running, long context tasks
https://x.com/sydneyrunkle/status/2051637638239567953

Improving token efficiency in GitHub Agentic Workflows – The GitHub Blog

Introducing /orchestrate, a skill that recursively spawns agents to tackle your most ambitious tasks with the Cursor SDK. We’ve used it to: – Autoresearch our internal skills, cutting token use by 20% while improving evals – Cut cold start times on our internal backend by 80%
https://x.com/cursor_ai/status/2052432778743210127

Introducing Zyphra Cloud: A full stack AI platform on AMD. Launching today with Zyphra Inference: serverless inference for frontier open-weight models focused on long horizon agentic workloads. Powered by @AMD MI355X GPUs on @TensorWave. Learn more at
https://x.com/ZyphraAI/status/2051384562870329444

It’s getting hard to benchmark frontier agent performance on longer tasks. Repeated measurement is very expensive and there are differences between using models in harnesses versus via APIs. I suspect benchmarks understate progress, since they are built for models, not harnessed agents
https://x.com/emollick/status/2050892355331354850

langgraph is the runtime that powers langchain and deepagents! we’ve been cooking on some new features: 1. node level error handlers 2. static + dynamic node timeouts 3. delta (diff based) channels for optimized storage 4. tons of new streaming primitives our 1.2 alpha release
https://x.com/sydneyrunkle/status/2051382622517887479

most of the time, you want an agent loop to run uninterrupted. that’s where the utility comes from! but some decisions shouldn’t be delegated to the agent. two situations come up consistently: 1/ before a consequential action, like sending an email, executing a transaction, or
https://x.com/sydneyrunkle/status/2050195081995407429
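
The first situation in the excerpt above (pausing before a consequential action) can be sketched as a simple approval gate. This is a minimal illustration only, not LangGraph's or any library's API: the action names, `run_action`, and the `approve` callback are all hypothetical.

```python
from typing import Callable

# Hypothetical action names, taken from the examples in the excerpt above.
CONSEQUENTIAL_ACTIONS = {"send_email", "execute_transaction"}


def run_action(
    name: str,
    payload: dict,
    execute: Callable[[str, dict], str],
    approve: Callable[[str, dict], bool],
) -> str:
    """Run an agent action, asking a human first if it is consequential.

    `execute` performs the action; `approve` is a stand-in for whatever
    human review step is available (a prompt, a ticket, a UI button).
    """
    if name in CONSEQUENTIAL_ACTIONS and not approve(name, payload):
        # The human declined, so the action is never executed.
        return f"skipped {name}: not approved"
    return execute(name, payload)


def fake_execute(name: str, payload: dict) -> str:
    """Stand-in executor that only reports what it would have done."""
    return f"ran {name}"
```

With this shape, routine actions pass straight through and keep the loop uninterrupted, while only the listed actions wait on a person.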

Must-read research of the week ▪️ The Last Harness You’ll Ever Build ▪️ From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company ▪️ Recursive Multi-Agent Systems ▪️ Synthetic Computers at Scale for Long-Horizon Productivity Simulation ▪️ Co-Evolving Policy
https://x.com/TheTuringPost/status/2051707579785752909

No longer are Fleet agents constrained to a single model. With multi-model support, you can build more efficient agents at scale
https://x.com/LangChain/status/2051367244060598312

now, your agent can fix itself. introducing raindrop triage. an agent for finding and investigating agent issues.
https://x.com/benhylak/status/2051727888639250450

Observability helps power the agent improvement loop But it’s not just observability! It’s also feedback! You should be trying to get as much feedback (direct, indirect, generated) into your agent observability platform as possible
https://x.com/hwchase17/status/2051708980435853513

one of the features i’m most excited about in our upcoming langgraph release is delta channels! the langgraph runtime lets you “checkpoint” agent progress at every step (model call, tool call, hooks). the problem, though, is that checkpoints bloat quickly when context is long!
https://x.com/sydneyrunkle/status/2052344141963555312
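
To make the checkpoint-bloat problem concrete, here is a minimal sketch of diff-based checkpointing over an append-only message history. It is not LangGraph's delta channel implementation; `message_delta` and `rebuild` are hypothetical helpers, and the append-only assumption is ours.

```python
def message_delta(prev: list[str], curr: list[str]) -> list[str]:
    """Return only the messages added since the previous checkpoint.

    Assumes the history is append-only, so `curr` must start with `prev`.
    """
    if curr[: len(prev)] != prev:
        raise ValueError("history is not append-only")
    return curr[len(prev):]


def rebuild(deltas: list[list[str]]) -> list[str]:
    """Reconstruct the full history by replaying the stored deltas in order."""
    history: list[str] = []
    for delta in deltas:
        history.extend(delta)
    return history


# Three steps of a toy agent run, each step holding the full history so far.
steps = [
    ["user: hi"],
    ["user: hi", "tool: lookup"],
    ["user: hi", "tool: lookup", "assistant: done"],
]

# Store one delta per step instead of one full copy per step.
deltas = [message_delta(prev, curr) for prev, curr in zip([[]] + steps[:-1], steps)]

full_copy_cost = sum(len(step) for step in steps)  # messages stored with full snapshots
delta_cost = sum(len(delta) for delta in deltas)   # messages stored with deltas
```

Full snapshots store a number of messages that grows quadratically with the number of steps, while deltas grow linearly; the trade-off is that restoring a checkpoint requires replaying deltas.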

open-weight LLMs have come a long way on agent tasks! but the harness you wrap them in matters just as much as the model itself, and arguably the interface you use to drive that harness matters even more. dev workflows are deeply personal. what works well for one developer may
https://x.com/masondrxy/status/2051714091924828480

Sakana Fugu: A Multi-Agent Orchestration System as a Foundation Model
https://x.com/SakanaAILabs/status/2050998826190667795

serving multiple users from a single agent deployment introduces three distinct problems. luckily, langsmith’s agent server has a solution for each! 1. data isolation: your @auth.authenticate handler tags every resource with ownership on write, filters on read. 2. delegated
https://x.com/sydneyrunkle/status/2049956826670911809

The next wave of AI will not be won by better prompts. It will be won by systems that learn from experience. Today, Prime Intellect Lab is out of beta, open for you to start training your own models. The era of self-improving agents is here.
https://x.com/PrimeIntellect/status/2052225145725698102

There’s a bunch of conflicting stances I don’t fully understand in the debate of Proprietary RL’d vs Open Harness, Model intelligence, and Agent Labs building harnesses for bespoke tasks Not all of the below can be true: 1. Model is post-trained with a harness in the loop so it
https://x.com/Vtrivedy10/status/2051451869017584112

this is the part of the deep agents production series i’ve been most excited to get to: sandboxes. without an execution environment, a production agent is only as capable as its fixed toolset. give an agent an execution environment where it can write and run code, and you give
https://x.com/sydneyrunkle/status/2052459962169966752

To get the most out of agent observability, store feedback with your traces. That is what turns agent traces from logs into a learning system.
https://x.com/LangChain/status/2051709642716135729

TokenSpeed: A Speed-of-Light LLM Inference Engine for Agentic Workloads | LightSeek Foundation
https://lightseek.org/blog/lightseek-tokenspeed.html

we’re on an Open Model mission to help builders create world class agents >20x cheaper than what they have today a couple things have become evident recently: 1. The age of the token subsidy is being pulled back 2. Open Models have crossed an intelligence threshold making them
https://x.com/Vtrivedy10/status/2051148084567052690

What AI agent harness are you daily driving these days?
https://x.com/bilawalsidhu/status/2051859826083336326

You can now see a breakdown of your agent’s context usage in Cursor 3.3. Use these stats to diagnose context issues and improve your setup across rules, skills, MCPs, and subagents.
https://x.com/cursor_ai/status/2052059748544249918

NEW paper from Microsoft Research. (bookmark it) The entire interpretability literature is built around human readers. As more analysis gets delegated to agents, the right target of interpretability shifts. This paper is a recipe for designing tools that agents can actually
https://x.com/dair_ai/status/2052125514266190286

“Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation” TL;DR: combines LLM planning with vision-guided refinement to generate physically plausible and coherent 3D scenes from text
https://x.com/Almorgand/status/2051320217674870795

Very excited to release Terminal-Bench 2.1! Coding agents are among the most economically consequential deployments of LLMs to date. As agents improve, benchmark reliability matters more. We audited TB2.0 and found and corrected issues in 28/89 tasks. 30% of the benchmark!
https://x.com/ekellbuch/status/2052165464655298866

We recently built HiL-Bench, the first benchmark to test a critical question: do AI agents know what they’re missing and when to ask? Frontier models perform well with perfect specs. But remove a few key details, and they confidently guess and ship plausible wrong answers. We
https://x.com/ScaleAILabs/status/2051333688798097567

I have been testing DeepSeek-V4-Pro with the Pi coding agent. I am mindblown by how well it works out of the box. A few notes: I spent a few hours building an LLM wiki with an agent powered entirely by DeepSeek-V4-Pro on @FireworksAI_HQ inference. This is the first time I
https://x.com/omarsar0/status/2050009901234282649

🧠 Introducing NeuralBench: a unified, open-source framework to benchmark NeuroAI models. v1.0: 36 EEG tasks, 94 datasets, task-specific + foundation models. MEG/fMRI ready. MIT-licensed, FAIR’s Brain & AI @AIatMeta. Code: https://t.co/WdROYdjjNY Paper:
https://x.com/hubertjbanville/status/2052029372282888234

1) Our team at Meta has a tough new coding benchmark challenging models to code entire programs including ffmpeg and the PHP compiler from scratch. 2) Top accuracy is 0% 3) We will be making the benchmark harder.
https://x.com/OfirPress/status/2051678633035809159

Banger paper from Meta FAIR. They introduce Autodata, an agentic data scientist that builds high-quality training and evaluation data autonomously. The headline result: on a CS research QA task, an Agentic Self-Instruct loop produces a 34-point gap between weak and strong
https://x.com/dair_ai/status/2051311905353142328

Cool paper from Meta FAIR. It’s on self-improving LLMs but on the pretraining side. (bookmark it) Most LLM safety, factuality, and reasoning fixes get bolted on at post-training. By then, the patterns have already set. This work moves those behaviors into pretraining itself.
https://x.com/omarsar0/status/2050213732970848664

AI Agents and the Future of Digital Work with Microsoft
https://www.cdata.com/resources/ai-agents-future-digital-work-microsoft/

NEW paper from Microsoft Research. If you care about training computer-use agents, this is one to keep. (bookmark it) The team builds 1,000 synthetic computers (each with realistic directory structures, documents, and artifacts) then runs long-horizon simulations on top of
https://x.com/dair_ai/status/2050263752147456238

NEW paper from Microsoft Research. Nice study on long-horizon agent generalization. (bookmark it) The team runs a study where the only variable is task horizon length. They use the same decision rules, reasoning structure but different sequence length to the goal. The main
https://x.com/dair_ai/status/2051679862788878354

📝 Agentic RL Infra Notes Insights from Zhihu Contributor 低级炼丹师 📝 🔍 Core Difference: Agentic RL vs Traditional RL • Traditional RL (RLVR): Single-time generation (answer → reward → update) — trains a “response-generating” model, no dynamic interaction. • Agentic RL:
https://x.com/ZhihuFrontier/status/2051691071634301064

Natural Language Autoencoders \ Anthropic
https://www.anthropic.com/research/natural-language-autoencoders

PostTrainBench results for GPT-5.5 are in: it doesn’t beat Opus 4.7 in the Claude Code harness even with almost 2 more hours of working time via reprompting
https://x.com/scaling01/status/2050289320699818417

A challenge with AI regulation and vetting is how bad our benchmarks of AI model performance and risks are. There is no benchmark for risks and red-teaming requires experiments from dedicated specialist organizations & is not easy to put metrics around. No clear objective numbers
https://x.com/emollick/status/2051431009766289734

A new very hard search benchmark that exposes bottlenecks in modern neural retrievers!
https://x.com/nlp_mit/status/2052069072607547892

Are AI benchmarks doomed? @GregHBurnham and @tmkadamcz join @ansonwhho to push back on benchmark pessimism and dig into what the next generation of AI benchmarks could look like. (0:00:00) – Preview (0:00:36) – Intro: Are AI benchmarks doomed? (0:03:13) – The costs and benefits
https://x.com/EpochAIResearch/status/2051330509989368211

Benchmarks aren’t just about showing where we are now, benchmarks are a treasure map that shows us how to get to the future the benchmark specifies. SWE-bench was about automating bugfixing and small feature reqs, this is about automating entire repo development.
https://x.com/OfirPress/status/2052106927908200957

How much of SQLite, FFmpeg, PHP compiler can LMs code from scratch? Given just an executable and no starter code or internet access. Introducing ProgramBench: 200 rigorous, whole-repo generation tasks where models design, build, and ship a working program end to end. 🧵
https://x.com/jyangballin/status/2051677497562210552

I’ve never been this excited about search. 6-7 years ago, IR got an influx of the paradigms we still use, all enabled by the big headroom MS MARCO and then BEIR created. Then progress slowed. Today, Diane releases perhaps the most ambitious IR benchmark to date: OBLIQ-Bench.
https://x.com/lateinteraction/status/2052055143038713875

Looking at average pass rate is *very* misleading- every task has a big chunk of tests that are very easy to pass and sometimes a minority of tests that are much harder to pass- so you can implement 10% of the program and get a 60% pass rate.
https://x.com/OfirPress/status/2051757679283143089

New research from @AISecurityInst and Goodfire: Models sometimes recognize they’re being evaluated, occasionally even identifying the benchmark. We show this verbalized eval awareness inflates safety scores, meaning safety benchmarks may not reflect real-world behavior. (1/7)
https://x.com/GoodfireAI/status/2051382876483231968

ProgramBench uses a not so useful / weird metric like ARC-AGI > headline score of all models -> 0% > looks inside > Opus 4.6 and 4.7 pass on average >50% of tests per task > why? > they only count a task as passed if 100% of tests are successful and as we all know software
https://x.com/scaling01/status/2051733949877985349
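
The disagreement above comes down to two ways of scoring the same test results. The sketch below uses made-up pass/fail data (not ProgramBench results) to show how an average per-task pass rate and a strict "all tests must pass" rate can diverge; both function names are hypothetical.

```python
def average_pass_rate(results: list[list[bool]]) -> float:
    """Mean over tasks of the fraction of tests passed within each task."""
    return sum(sum(task) / len(task) for task in results) / len(results)


def strict_resolved_rate(results: list[list[bool]]) -> float:
    """Fraction of tasks in which every single test passed."""
    return sum(all(task) for task in results) / len(results)


# Illustrative data only: three tasks with ten tests each.
example = [
    [True] * 6 + [False] * 4,  # 60% of tests pass
    [True] * 9 + [False] * 1,  # 90% of tests pass
    [True] * 5 + [False] * 5,  # 50% of tests pass
]

avg = average_pass_rate(example)
strict = strict_resolved_rate(example)
```

Here the average pass rate is about two thirds while the strict rate is zero, which is the pattern both posts describe from opposite directions: the average can flatter partial implementations, and the strict score hides partial progress.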

very impressive release with lots of care at every stage of training: custom arch with bigger experts, more expressive router, compressed attention, residual scaling, and much more on the post training side including test time compute etc.. benchmark scores are very competitive
https://x.com/eliebakouch/status/2052126118891729148

NEW paper from Sakana AI (ICLR 2026). A 7B Conductor model just hit SOTA on GPQA-Diamond and LiveCodeBench by orchestrating other LLMs instead of solving problems itself. (great paper! bookmark it!) The Conductor is trained with RL to do two things at once: design
https://x.com/omarsar0/status/2051306659021242635

Introducing a new paper! Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs Static benchmarks are no longer enough. Models improve too quickly and numbers become stale quickly. Instead, we argue for continuously maintained evaluation platforms.
https://x.com/j_dekoninck/status/2051268263150276872

ProgramBench
https://programbench.com/

ProgramBench: Can Language Models Rebuild Programs From Scratch? John Yang, Kilian Lieret, Jeffrey Ma, Parth Thakkar, Dmitrii Pedchenko, Sten Sootla, Emily McMilin, Pengcheng Yin, Rui Hou, Gabriel Synnaeve, Diyi Yang, Ofir Press
https://t.co/MRk3XwhsYA [cs.SE cs.AI]
https://x.com/ComputerPapers/status/2051895799043215415

The artificial analysis index is a normalized score of several benchmarks (and has changed over time) it is fine for roughly comparing models, it is not useful for trend analysis and it is unclear what individual point differences in the scores mean.
https://x.com/emollick/status/2051061792667754507

The recipe for “classic” reasoning benchmarks is simple: text-only, several-hour time horizons, easy to grade, with expert human baselines. What next? In this week’s Gradient Update, @GregHBurnham argues it’s as easy as dropping one of these four ingredients.
https://x.com/EpochAIResearch/status/2051760424891392204

We are launching domain-specific capability scores, tracking the capabilities of models across SWE and Math benchmarks, using the same scale as the general ECI. We also support customization for users who want to create their own variants of the ECI. Link below!
https://x.com/EpochAIResearch/status/2052069897530933438

We set out to build a better retriever, so we looked for the hardest IR benchmarks. For each, we asked how much headroom remained by running oracle reranking with a frontier LLM. Most had little room left! So we built OBLIQ-Bench to study much harder search queries than before.
https://x.com/dianetc_/status/2052053806121140254

We’re releasing Terminal-Bench 2.1 to patch 28 of the 89 tasks in Terminal-Bench 2.0 TB2.1 includes • recalibrated limits • fixed solutions • realigned verifiers Per-task breakdowns in 🧵 We’ll continue to support TB2 and TB2.1 leaderboards (new submission process 🔜)
https://x.com/terminalbench/status/2052119174500220964

🤖: Google DeepMind releases Decoupled DiLoCo, a fault-tolerant distributed training architecture that achieves 88% goodput vs. 27% for standard data-parallel at scale, using ~240x less inter-datacenter bandwidth with no measurable ML performance loss.
https://x.com/dl_weekly/status/2051693914868871205

GPT-5.5 & Opus 4.7 on ARC-AGI-3 – GPT-5.5: 0.43% – Opus 4.7: 0.18% We found 3 failure modes: – True local effect, false world model – Wrong level of abstraction from training data – Solved the level, didn’t reinforce the reward See our full analysis 🧵
https://x.com/arcprize/status/2050261221165989969

💫Very happy to release NeuralBench, to benchmark Neuro AI models and datasets in the open! 🧵Thread, 💻Code, 📝White Paper below:
https://x.com/JeanRemiKing/status/2052034314120896582

Today we’re releasing ZAYA1-8B, a reasoning MoE trained on @AMD and optimized for intelligence density. With <1B active params, it outperforms open-weight models many times its size on math and reasoning, closing in on DeepSeek-V3.2 and GPT-5-High with test-time compute. 🧵
https://x.com/ZyphraAI/status/2052103618145501459

We’ve developed our own inference engine Runtime-Optimized Serving Engine (ROSE) to serve models ranging from embeddings to trillion-parameter LLMs. With CuTeDSL integrated into our inference engine, Perplexity can build the specialized GPU kernels faster to bring models up to
https://x.com/perplexity_ai/status/2052041903970148647

[2604.27505] Leveraging Verifier-Based Reinforcement Learning in Image Editing
https://arxiv.org/abs/2604.27505

[2604.28181] Synthetic Computers at Scale for Long-Horizon Productivity Simulation
https://arxiv.org/abs/2604.28181

[2605.00503] End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer
https://arxiv.org/abs/2605.00503

[2605.01428] Hallucinations Undermine Trust; Metacognition is a Way Forward
https://arxiv.org/abs/2605.01428

@_architected no, the usage limits of the 5-hour window actually doubled. Correct that weekly limits are the same. Only a very small % hit weekly limits, while a much larger portion of users hit the 5-hour limits, so we fixed that first. As the compute comes online, will look at weekly.
https://x.com/TheAmolAvasare/status/2052066157176426653

@francoisfleuret Why diffusion-like reasoning? Doesn’t diffusion lack real test-time extrapolation that we’d want from reasoning? Agree on reasoning in latent space, of course.
https://x.com/willdepue/status/2052033422915477580

@kimmonismus it’s only a very small % of people who hit the weekly limit, versus a much larger portion who hit 5-hour so that’s what we wanted to alleviate first. As the compute comes online, we’ll look at weekly limits. Stay tuned.
https://x.com/TheAmolAvasare/status/2052064611692904639

@willdepue @francoisfleuret Diffusion seems like a much more natural way to get good reasoning to me. And you can directly scale as many more steps in diffusion as you want – no problem!
https://x.com/jeremyphoward/status/2052149483740545400

Amdahl’s law – Wikipedia
https://en.wikipedia.org/wiki/Amdahl%27s_law
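
For readers who want the formula behind this link, Amdahl's law gives the overall speedup as 1 / ((1 - p) + p / n), where p is the fraction of the work that can be parallelized and n is the number of workers. A minimal sketch follows; the function and parameter names are our own.

```python
def amdahl_speedup(parallel_fraction: float, workers: float) -> float:
    """Overall speedup predicted by Amdahl's law.

    parallel_fraction: share of the original runtime that can be parallelized (0..1).
    workers: speedup factor applied to the parallel part (at least 1).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)
```

The serial fraction caps the gain: with 90% of the work parallelizable, the speedup approaches but never reaches 10x, no matter how many workers are added.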

ANN v3: 200ms p99 query latency over 100 billion vectors
https://turbopuffer.com/blog/ann-v3

Calling multiple LLM providers in production shouldn’t mean juggling separate accounts, bills, and rate limits–and one provider outage taking your whole product down with it. Our LLM Gateway just got a significant upgrade so you can: 🔹 Route across providers with automatic
https://x.com/AssemblyAI/status/2052043337751056733

Computer use is 45x More Expensive Than Structured APIs
https://reflex.dev/blog/computer-use-is-45x-more-expensive-than-structured-apis/

Excited to release the Ultimate guide to RL environments! Definitions of RL environments differ wildly in the LLM era, so we spent the last month building several RL environments across 6 different frameworks, domains and complexities to map out which are easiest to build with
https://x.com/adithya_s_k/status/2051660068471603352

from x-risk to x-rent
https://x.com/paularambles/status/2052087138670596289?s=46

Give LLMs 1. A latent space diffusion-like reasoning. 2. A real recurrent state. 3. A world-model pre-pre-training. And we are done.
https://x.com/francoisfleuret/status/2051928896027693479

Good QC for RL Data
https://www.seancai.com/philosophy/good_qc_rl_data

How did ‘large’ language models get that way? The role of Transformers and Pretraining in GPT – LessWrong 2.0 viewer
https://www.greaterwrong.com/posts/gcKhnqysxj9bBvbWD/how-did-large-language-models-get-that-way-the-role-of

I’m inclined to believe this. For whatever reason, V4-Pro is scarcely better than V4-Flash or V3.2-Speciale on WeirdML. It is not an ML engineer AI. Fireworks cost has no discount, so on DS it’d be ≈$0.08. Still more expensive for same quality as gpt-oss-120B.
https://x.com/teortaxesTex/status/2052043753892761882

I’ve spent the past few weeks reading 100s of public data sources about AI development. I now believe that recursive self-improvement has a 60% chance of happening by the end of 2028. In other words, AI systems might soon be capable of building themselves.
https://x.com/jackclarkSF/status/2051312759594471886

In addition to the CAISI evaluation, it would be useful if NIST conducted public tests of AI abilities as an independent evaluator – though those obviously should not be pre-release tests & can be done when models are public. Independent testing is important & getting expensive.
https://x.com/emollick/status/2051681014762676723

In search of wasted bits: how much information do LLM weights carry?
https://fergusfinn.com/blog/weight-entropy/

In-Kernel Broadcast Optimization: Co-Designing Kernels for RecSys Inference – PyTorch
https://pytorch.org/blog/in-kernel-broadcast-optimization-co-designing-kernels-for-recsys-inference/

Interpreting language models can feel like stumbling through a dark forest – sometimes you just wish you had a flashlight! In our new post, we introduce HeadVis, our latest flashlight for studying attention heads.
https://x.com/kamath_harish/status/2052046203030827088

Interpreting model activations is important to understand why a model is doing what it’s doing. Traditionally, we’ve done this with supervised methods (probing for a specific context), or unsupervised sparse decompositions (dictionary learning). But probing requires you to know
https://x.com/mlpowered/status/2052446867037020402

Introducing a new sequence model Raven which pushes the boundary of fixed-state-size sequence models! Raven bridges popular linear-time models with constant state capacity, like SSMs and sliding window attention (SWA). Like SWA, its state is a finite set of slots; unlike SWA,
https://x.com/_albertgu/status/2052442144879862003

Introducing deepsec: The security harness for finding vulnerabilities in your codebase – Vercel
https://vercel.com/blog/introducing-deepsec-find-and-fix-vulnerabilities-in-your-code-base

Introducing Recommended Connectors! Manus now helps set up what your task needs, when it needs it: • Recommends relevant connectors in context • Helps enable them with your approval
https://x.com/ManusAI/status/2051681463389610209

It is somewhat comforting that now, whenever I see a post about “here’s the thing that keeps me up at night” I know that there is absolutely no chance that this is being written by a human who is staying up all night.
https://x.com/emollick/status/2051358855246836110

It’s very interesting that cryptographic protocols and neural networks have the same high-level architecture (where they jumble information as it moves sequentially across many layers). This is the result of a convergent evolution – cryptographic protocols need every output bit
https://x.com/dwarkesh_sp/status/2051335468231360825

Just added a delay selector to allow control of the latency/accuracy tradeoff.
https://x.com/juberti/status/2052504986391879788

Live translation that actually works incredibly well! I’ll be using it from now on regularly
https://x.com/BorisMPower/status/2052472038967890022

Long AI Short AGI – by Ramy Adeeb – 1984 Newsletter
https://1984.substack.com/p/long-ai-short-agi

MiniMax-M2.7 is now available across six inference providers on Artificial Analysis, with significant differentiation in speed and price. @SambaNovaAI leads on speed at 435 output tokens/s, >3x faster than any other provider. @FireworksAI_HQ, @novita_labs, @togethercompute, and
https://x.com/ArtificialAnlys/status/2051735255044997215

Model labs should spend their time pushing the frontier, not thinking about API keys, rate limits, metering, and billing. Today, we’re launching Baseten Frontier Gateway: the fastest path from trained weights to a production, white-labeled API.
https://x.com/tuhinone/status/2052082677432390130

Multilingual AI | Welo Data
https://welodata.ai/multilingual-ai/

My first blog post in over a year is a deep dive on flow maps🗺️, or how to learn the integral of a diffusion model to enable faster sampling and several other cool tricks. It’s the longest one yet👀 Let me know what you think!
https://x.com/sedielem/status/2051957402556104799

Neural geometry -> scientific discovery! We reverse-engineered a scientific foundation model, uncovering a novel class of biomarkers in a curved manifold
https://x.com/GoodfireAI/status/2052468622103085107

Powering the Inference Era: Inside the DigitalOcean AI-Native Cloud | DigitalOcean
https://www.digitalocean.com/blog/powering-the-inference-era

PSGD is indeed kino. I fear not the man who has implemented 10,000 optimizers on a single problem, but I fear the man who relentlessly improved the same optimizer 10,000 times. Xilin Li & Omead P
https://x.com/_arohan_/status/2051012103025410410

Read the first 2 posts in the series: https://t.co/94XmrlPmoA Forthcoming posts will go into more detail on: – an example mechanism that operates on manifolds – unsupervised discovery of manifolds + the connection to SAE features – in-context geometry
https://x.com/GoodfireAI/status/2052420594193650167

Recursive Self-Learning: Why It Matters Now
https://x.com/TheTuringPost/status/2051451337427030477

sglang is the best inference framework out there. RadixArk was formed to make it even better and to democratize more of the frontier AI stack. Very happy to support the team in their seed round.
https://x.com/ibab/status/2051690211873308892

SSMs fail on recall tasks they have the capacity to solve. The two dominant approaches today, SSMs and sliding-window attention, both lack persistence: memory either decays over time or gets evicted. We built Raven to fix this, surpassing all prior linear models even at 16×
https://x.com/avivbick/status/2052438903924396377

Subquadratic — Efficiency is Intelligence
https://subq.ai/how-ssa-makes-long-context-practical

The “magic” of embedding lies in total generalization. Embeddings are the step where pieces of language stop being just tokens and turn into a form a model can actually work with. -> Token IDs move from integers into geometry, where distance represents meaning and connection
https://x.com/TheTuringPost/status/2050520132854964706

The context window has been shattered: Subquadratic debuts a 12-million-token window – The New Stack
https://thenewstack.io/subquadratic-12-million-context-window/

The Problem with “Mathematically Proven” Claims About LLMs – Web Directions

The Truman Mythos
https://x.com/swyx/status/2051025206228218103

Tony Xu: AI Is Helping Engineers, but That Isn’t the Only Priority – Business Insider
https://www.businessinsider.com/doordash-ceo-tony-xu-ai-helping-engineers-workforce-priority-customers-2026-5

We have greatly expanded the surface area for plugins over the last few weeks, and now you can extend LLM Inference Providers and Gateway Channels with plugins. Want a custom implementation of your favorite gateway platform? Maybe we haven’t been quick enough to add inference
https://x.com/Teknium/status/2052046335583625629

We partnered with @PrimeIntellect to build Fast Ask, a small RL-trained subagent that helps our Sheets agent find answers in spreadsheets. It scores +4% over Opus on exact match accuracy at Haiku latency.
https://x.com/RampLabs/status/2052448843099254956

We worked with @RampLabs to train Fast Ask using Lab. A small RL-trained subagent that helps the Ramp Sheets agent find answers in spreadsheets. The resulting FastAsk model outperformed Opus 4.6, while obtaining Haiku-level speeds at even lower costs.
https://x.com/PrimeIntellect/status/2052465182014840987

You Are Not Immune To Mode Collapse — LessWrong
https://www.lesswrong.com/posts/vKtuRbo4e3ffixmee/you-are-not-immune-to-mode-collapse

You.com | Download the Guide: Why API Latency Is a Misleading Metric
https://you.com/resources/why-api-latency-alone-is-a-misleading-metric-download

Ha, love to see this port of flue to Python. Harness engineering is a fun time! Need more people exploring!
https://x.com/hwchase17/status/2051004516674457965

Do people have some good evals that require context compaction to be achieved?
https://x.com/_philschmid/status/2051002064826724724

Don’t just scale AI. Scale ROI. AMD Instinct MI350P PCIe cards deliver 144 GB of HBM3E memory and up to 2299 teraFLOPS (at MXFP4) in a drop-in, air-cooled card built for standard servers. That’s how you scale AI at maximum ROI without redesigning your data center. Interested
https://x.com/AMD/status/2052373018400219648

My surprise here seems warranted: this paper was retracted. (There are other peer-reviewed meta-analyses of the impact of AI on education finding positive effects, like: https://t.co/bLHHelTLCs though the best evidence of AI helping is from RCTs of interventions with AI tutors)
https://x.com/emollick/status/2051304153389932643

Today we are rolling out edit mode in AI Studio Vibe Coding ✏️, select components to quickly edit them, annotate right on the UI with a pen, and select image assets to change them with Nano Banana + upload content!
https://x.com/OfficialLoganK/status/2051698665652412919

@_philschmid The common way is to use datasets with triplets of context/question/answer and concatenate multiple contexts to create long contexts. LOFT is an example of a dataset like that.
https://x.com/gabriberton/status/2051050627942568319
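
The concatenation approach described in the reply above can be sketched in a few lines. This is a minimal illustration, not LOFT's actual pipeline: the function name `build_long_context_example`, the dictionary keys, and the toy triplets are all assumptions made for the example.

```python
import random


def build_long_context_example(triplets, target_index, n_distractors, seed=0):
    """Hide one gold context among distractor contexts (hypothetical helper)."""
    rng = random.Random(seed)
    gold = triplets[target_index]
    others = [t for i, t in enumerate(triplets) if i != target_index]
    # Pick distractor contexts from the other triplets.
    distractors = rng.sample(others, k=min(n_distractors, len(others)))
    contexts = [t["context"] for t in distractors] + [gold["context"]]
    # Shuffle so the gold context does not always sit at the end.
    rng.shuffle(contexts)
    return {
        "context": "\n\n".join(contexts),
        "question": gold["question"],
        "answer": gold["answer"],
    }


# Toy triplets, invented purely for illustration.
triplets = [
    {"context": "The blue key opens the garden shed.",
     "question": "What does the blue key open?",
     "answer": "the garden shed"},
    {"context": "The ferry leaves at noon on Sundays.",
     "question": "When does the ferry leave on Sundays?",
     "answer": "at noon"},
    {"context": "Mira keeps her notes in a green binder.",
     "question": "Where does Mira keep her notes?",
     "answer": "in a green binder"},
]

example = build_long_context_example(triplets, target_index=0, n_distractors=2)
print(len(example["context"].split("\n\n")), "contexts concatenated")
```

Raising `n_distractors` (with a larger pool of triplets) lengthens the context while the question and answer stay fixed, which is what lets the same item probe progressively longer windows.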

Accelerating and automating science and research is one of the noblest pursuits right now. We need to jointly train not just single meaning units like word vectors, not just embed all sentences, not only train one model to be prompted by any question, but ideally the entire
https://x.com/RichardSocher/status/2051121805482676323

AI Decoder Could Cut Quantum Errors by Up to 17×, Study Finds

AI inference just plays by different rules
https://www.theregister.com/software/2026/05/04/ai-inference-just-plays-by-different-rules/5223647

AI Outperforms Doctors in Emergency Room Tasks, New Harvard Study Shows | Harvard Magazine
https://www.harvardmagazine.com/ai/ai-outperforms-doctors-diagnosis-harvard-study

AI training faces a very different set of trade-offs compared to human evolution. Because you can directly copy a trained model basically for free, it makes sense to amortize in the pre-trained weights the learning that humans spread across a lifetime.
https://x.com/dwarkesh_sp/status/2050697033632330047

Computer science basically got started in the 1930s when Turing and Church just laid down what the theory of everything was. They just said, here’s how computation works. And then we’ve spent 90 years since then just exploring consequences of that and gradually building up more
https://x.com/dwarkesh_sp/status/2049941366348890169

DORA: A Scalable Asynchronous Reinforcement Learning System for Language Model Training Overview: DORA is an asynchronous RL training system designed to remove the rollout bottleneck in LLM post-training. The key issue is skewed generation, where a few extremely long reasoning
https://x.com/TheAITimeline/status/2051401348726317146

How To Scale Your Model
https://jax-ml.github.io/scaling-book/

How well does this work? One quick independent test is to see if it can recover an “internal CoT” in cases where AIs can solve math problems in a single forward pass. TLDR: it doesn’t. (TBC, this might require the NLA to see activations at multiple positions/location to work.)
https://x.com/RyanPGreenblatt/status/2052458229624672549

Humans are systematically undertrained. Given the computing power of the human brain, optimal training would require seeing orders of magnitude more data during childhood – maybe millions of years’ worth. Obviously we’d die long before this. Massive undertraining is necessary
https://x.com/dwarkesh_sp/status/2049972212942401698

Import AI 455: AI systems are about to start building themselves.
https://importai.substack.com/p/import-ai-455-automating-ai-research

Neural networks might speak English, but they think in shapes. Understanding their rich *neural geometry* is key to understanding how they work – and to debugging and controlling them with precision. Starting today, we’re releasing a series of posts on this research agenda. 🧵
https://x.com/GoodfireAI/status/2052420446910644616

Note Clark’s definition of RSI here, from his newsletter, is “a frontier model is able to autonomously train a successor version of itself.” This is a weaker claim than what I assumed he meant, which was that human researchers would no longer be useful vs. AI ones.
https://x.com/goodside/status/2051388803047158175

separating infra and science for long context doesn’t make sense, most long context science is about making computation and memory (capacity and bandwidth) feasible at scale. today’s infra wouldn’t support MHA on a 1T model at 1M context
https://x.com/eliebakouch/status/2051374295620665713
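
A rough back-of-the-envelope calculation shows why full multi-head attention at 1M context is a memory problem. The post gives no model dimensions, so the layer count and hidden size below are hypothetical placeholders rather than figures for any real 1T-parameter model; the sketch also assumes bf16 storage (2 bytes per value) and a single sequence.

```python
def mha_kv_cache_bytes(n_layers, d_model, context_tokens, bytes_per_value=2):
    """KV cache size for full MHA: one key and one value vector of width
    d_model per token, per layer."""
    return 2 * n_layers * d_model * context_tokens * bytes_per_value


# Hypothetical dimensions, chosen only to illustrate the scaling.
N_LAYERS = 80
D_MODEL = 16384
CONTEXT_TOKENS = 1_000_000

per_token = mha_kv_cache_bytes(N_LAYERS, D_MODEL, 1)
total = mha_kv_cache_bytes(N_LAYERS, D_MODEL, CONTEXT_TOKENS)

print(f"{per_token / 1e6:.2f} MB per token")
print(f"{total / 1e12:.2f} TB for one 1M-token sequence")
```

Under these assumed dimensions the cache alone is several terabytes for a single sequence. The cache grows linearly with context length, and schemes that shrink the per-token footprint (for example, sharing key/value heads across query heads) reduce it proportionally.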

Some shilling, but I really mean it: Yesterday I had a mathematical disagreement with Alex @__kolesnikov__ and we bet a beer. After going back and forth on pen and paper, we resolved where the disagreement was. (As always, he was more right than me. I concede 2/3 of the beer
https://x.com/giffmana/status/2051925008457273527

Surprisingly little effect from a well-designed study.
https://x.com/emollick/status/2051438513476747759

the lock-in isn’t the harness — it’s the context pipeline feeding it. whoever owns how repo state gets pulled, ranked, and compressed into the attention window owns the developer, regardless of model or framework choice
https://x.com/AnthonyMaio/status/2050976650943213964

This is a very interesting paper. It argues that a real scientific theory of deep learning is starting to form. Researchers call it “learning mechanics.” It’s like physics, but for how neural networks learn. Now there are 5 active research areas that together look like pieces of
https://x.com/TheTuringPost/status/2050007859115733078

To really understand embeddings, you need a few core ideas: – vectors and dimensions – dense vs sparse representations – vector and embedding spaces – what latent space means – semantic similarity importance – and how embeddings are formed These concepts completely change and
https://x.com/TheTuringPost/status/2051255782197637393

vLLM Real-World Lab Report
https://avkcode.github.io/blog/how-vllm-works.html

We need more work on AI inequality, but this study is not about GenAI, the survey was fielded in 2022. “In this study, we selected items from Wave 119 (N = 10,087), which were collected from December 12 to December 18, 2022.”
https://x.com/emollick/status/2050231392374501701

We use previous generations of Composer to train future ones. Our autoinstall system has earlier Composer models set up dev environments for RL training. That way, the next generation can focus on learning to solve harder problems.
https://x.com/cursor_ai/status/2052116064474161556

Webinar | Beyond the Model: A Practitioner’s Guide to Harness Engineering
https://pages.temporal.io/webinar-grid-dynamics-harness-engineering.html

Why @michael_nielsen disagrees with the view that science will keep getting harder and harder as low-hanging fruit is picked:
https://x.com/dwarkesh_sp/status/2049640034878550031

Folding the TP and SP parallelism schemes onto a single axis enables us to shard both the weight tensors and the activation tensors along the same GPUs. This changes the communication volume scaling (see paper). This volume is across fewer GPUs, so we have more flexibility to
https://x.com/QuentinAnthon15/status/2051362275483963709