Technical and Dev: AI News Week Ending 05/15/2026

Cool paper from PwC. “Earlier is always better” is the default intuition for agent clarification. New paper claims that’s mostly wrong. Goal clarification loses nearly all of its value after just 10% of execution. The team built a forced-injection framework that drops
https://x.com/dair_ai/status/2053866106151182419

A lot of people have been wondering about Mythos, Glasswing, and the vulns we / our partners are fixing. Today, I’m excited for us to start sharing more. (For context, I lead Glasswing @AnthropicAI.) Two independent evaluations this week–from XBOW and the UK AISI–confirm what
https://x.com/logangraham/status/2054613618168082935

How fast is autonomous AI cyber capability advancing?
https://www.aisi.gov.uk/blog/how-fast-is-autonomous-ai-cyber-capability-advancing

Mythos for Offensive Security: XBOW’s Evaluation
https://xbow.com/blog/mythos-offensive-security-xbow-evaluation

The new version completely smashes GPT-5.5 and the previous Mythos version. Previously, Mythos Preview completed the cyber range 3 out of 10 times. The new version completed it 6 out of 10 times and is much more efficient!
https://x.com/scaling01/status/2054594892903436553

Google DeepMind is pushing medical AI into “co-clinician” research. They shared an AI co-clinician research initiative that tests evidence-grounded clinical reasoning and real-time multimodal telemedicine simulations. The careful wording matters: supportive tool under physician
https://x.com/TheTuringPost/status/2052188488553079156

Google’s AI Drug Startup Isomorphic Labs Nears $2 Billion Capital Raise – Bloomberg
https://www.bloomberg.com/news/articles/2026-05-08/google-s-isomorphic-labs-to-raise-over-2-billion-in-new-funding

Meet physics-intern🧑‍🎓, our agentic framework for theoretical physics. It takes Gemini 3.1 Pro from 17.7% to 31.4% on CritPt, a new SOTA on one of the hardest benchmarks for LLMs. Theoretical physics is hard for humans and LLMs alike. But physics-intern decomposes problems and
https://x.com/dlouapre/status/2054217281895309480

NEW paper from Google DeepMind. (bookmark it) AI Co-Mathematician is an agentic research workbench for mathematicians, and it just hit 48% on FrontierMath Tier 4, a new high score among AI systems evaluated. The system is an asynchronous, stateful environment that supports
https://x.com/dair_ai/status/2054224343551639958

Kimi K2.6 is now open-weight #1 on Finance Agent Benchmark V2.
https://x.com/Kimi_Moonshot/status/2054803169994272819

Interaction Models: A Scalable Approach to Human-AI Collaboration – Thinking Machines Lab
https://thinkingmachines.ai/blog/interaction-models/

People talk, listen, watch, think, and collaborate at the same time, in real time. We’ve designed an AI that works with people the same way. We share our approach, early results, and a quick look at our model in action.
https://x.com/thinkymachines/status/2053938892152435174

Sharing our work on full-duplex multimodal models — real-time interaction that’s natural and intuitive without compromising on intelligence. We started Thinky in part to differentially advance capabilities for human-AI collaboration, which are underemphasized relative to
https://x.com/johnschulman2/status/2053940452789981426

thinking machines is using SGLang btw
https://x.com/eliebakouch/status/2053982248253190180

Thinking Machines know how to surprise. Those simultaneous abilities (not only translation but also creating a graph while replying to a question) are pretty remarkable. Can’t wait to try it out and also learn how much it costs to use
https://x.com/TheTuringPost/status/2053975565179253010

Thinky’s secret plan: 1: Increase Human<->AI bandwidth 2: Raise ceiling of human+AI intelligence 3: Help humans continue as main-characters in the new world We are at Step 1. Interaction Models are great real-time collaborative tools for humans. Here’s a preview:
https://x.com/soumithchintala/status/2053940215505645938

Very cool announcement from Thinky! The model looks nice (they go into some reasonable amount of detail), and reading some parts of the blog you can definitely see that the infra guys had a lot of fun there!
https://x.com/giffmana/status/2053953584300003405

/goal + GPT 5.5 is amazing. I can now plan really extensive refactors with e2e tests and it just works.
https://x.com/steipete/status/2052514752245481675

GPT-5.5 is both very capable and very succinct
https://x.com/gdb/status/2052783746009440658

Hello GPT-4o | OpenAI
https://openai.com/index/hello-gpt-4o/

I promise this will be the best 20 min you spend today! Robotics: Endgame, the sequel to my last year’s Sequoia AI Ascent talk, “Physical Turing Test”. I laid out the roadmap for solving Physical AGI as a simple parallel to the LLM success story. Be a good scientist, copy
https://x.com/DrJimFan/status/2052758642781487237

Agent observability is a means to an end: making your agent better. But observability and evals tools have traditionally failed to connect traces to meaningful actions. Agent engineering teams are left combing through traces, guessing at root causes, and writing evals manually.
https://x.com/bentannyhill/status/2054949581679653326

Your customer support needs a voice agent built for the real world. Grok Voice Think Fast 1.0 handles complex workflows with speed and accuracy, even in hard-to-hear environments. From multi-step troubleshooting to high-volume tool calls, it keeps up.
https://x.com/xai/status/2052529102280880234

// The Memory Curse in LLM Agents // (bookmark it) Long histories apparently degrade agents as they become increasingly history-following and risk-minimizing. Across 7 LLMs and 4 social dilemma games over 500 rounds, expanding accessible history degraded cooperation in 18 of
https://x.com/omarsar0/status/2053863994499408214

Agentic Vector Databases – What Is That?
https://x.com/TheTuringPost/status/2052523789619953775

Agentic Vector Databases are becoming a new infrastructure layer for AI agents. Why? Because agents use retrieval fundamentally differently from humans. What changes in the agent era: • Agentic Search – retrieval stops being a one-shot search and becomes part of the iterative
https://x.com/TheTuringPost/status/2053083074355933251

External content is scanned in parallel by ML classifiers and the BrowseSafe model before agents act on it. File connector data is encrypted in transit and at rest, uploaded files automatically delete after 7 days, and more. Read more on the blog:
https://x.com/perplexity_ai/status/2054608978680873457

I have one big problem with agentic engineering: I want agents to operate autonomously, but I also want granular, reversible control over every change they make. I could solve this by committing every intermediate step to Git, but that would completely pollute my repo history.
https://x.com/itsclelia/status/2053716807748567329

Introducing Renderers. RL trainers work in tokens. Environments work in messages. Going back and forth corrupts sampled tokens, wasting compute on every agentic turn. With Renderers, we fix this mismatch. This unlocks >3x throughput on popular open models.
https://x.com/PrimeIntellect/status/2054347134821154841
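
The token/message mismatch described above can be illustrated with a toy tokenizer. This is a minimal sketch under stated assumptions: the three-piece vocabulary and the greedy longest-match `encode` are invented for illustration and are not the Renderers implementation or any real tokenizer. It shows that decoding sampled tokens to text and re-encoding them can yield a different token sequence for the same string.

```python
# Toy vocabulary (an assumption for illustration, not a real tokenizer).
VOCAB = ["ab", "a", "b"]


def decode(tokens):
    """Turn a token sequence back into message text."""
    return "".join(tokens)


def encode(text):
    """Greedy longest-match tokenization over VOCAB."""
    tokens, i = [], 0
    while i < len(text):
        for piece in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(piece, i):
                tokens.append(piece)
                i += len(piece)
                break
        else:
            raise ValueError(f"cannot tokenize {text[i:]!r}")
    return tokens


# The policy sampled two separate tokens during a rollout...
sampled = ["a", "b"]
# ...but a message-based environment hands back text, which re-encodes differently.
roundtrip = encode(decode(sampled))
print("sampled:  ", sampled)
print("roundtrip:", roundtrip)
```

The text is identical in both cases, but a trainer that re-tokenizes the message would score tokens the policy never actually sampled.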

Introducing SWE-ZERO-12M-trajectories: the largest agentic trace dataset in the open, 5.7x larger than the previous largest. 112B tokens · 12M trajectories · 122K PRs · 3K repos · 16 languages
https://x.com/kevin_x_li/status/2054600962137100493

INTRODUCING: Duet Agent A new type of harness we’re building at @duetchat Perfect for jobs that don’t fit in one chat: – Work for weeks/months at a time – Relays work between agents via a state machine – Memory that replaces compaction – Stateless runner built for sandboxes
https://x.com/dzhng/status/2054619807715348779

It’s alive!!! What if @github and @GitHubCopilot had a love child? Say hello to GitHub App!! From the repo itself: The GitHub Copilot app is a desktop application for agent-driven development that brings parallel workstreams, GitHub integration, and PR lifecycle management
https://x.com/OrenMe/status/2054959549413503308

Just announced at Interrupt! SmithDB. Agent traces have outgrown the databases built to hold them. That’s why we built SmithDB, a purpose-built distributed database for agent observability. Read the announcement from Co-Founder @ankush_gola11 →
https://x.com/LangChain/status/2054658661776244936

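At its Interrupt conference, LangChain released the underlying database SmithDB and the automated troubleshooting engine LangSmith Engine. Agent runs generate massive volumes of traces (execution trajectories), which pushed the old database to its limits. The new foundation, SmithDB, abandons local disk and moves entirely to object storage, making core queries 15x faster. With the new foundation in place, LangSmith Engine then took over bug investigation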
https://x.com/0xLogicrw/status/2054852978243404008

LangSmith Engine is a phase shift because traces are no longer just records to be manually inspected, they’re now the catalyst for recursive agent self-improvement. Engine looks at your traces, finds what broke, and suggests code changes and evals, informed by what we’ve learned
https://x.com/caspar_br/status/2054726851659248068

Must-read research of the week ▪️ Generate, Filter, Control, Replay: A comprehensive survey of rollout strategies for LLM reinforcement learning ▪️ Hallucinations Undermine Trust; Metacognition is a way forward ▪️ ARIS: Autonomous Research via Adversarial Multi-Agent
https://x.com/TheTuringPost/status/2054181240946004212

NEW: CoreWeave Sandboxes is here! We all know rm -rf / wipes a filesystem. So we ran it 1,000 times in parallel. 1,000 sandboxes died so the cluster didn’t have to. Isolated execution for RL, agent tool use, and evals on clusters or serverless.
https://x.com/wandb/status/2054958004118724672

starting to think now that every agent should have just 2 tools. search and execute. we _want_ agents to have access to 100s, if not 1000s of capabilities, that can contextually change during their lifetimes, even per message. saying stuff like “just use bash” doesn’t encompass
https://x.com/threepointone/status/2053751241977594102
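
One way to picture the "search and execute" idea above is a registry that exposes exactly two entry points while the set of capabilities behind it changes freely. This is a hypothetical sketch: `CapabilityRegistry`, its methods, and the substring-based search are all assumptions made for illustration, not an API from the linked post or any framework.

```python
from typing import Any, Callable, Dict, List, Tuple


class CapabilityRegistry:
    """Hypothetical registry exposing only two tools: search and execute."""

    def __init__(self) -> None:
        self._caps: Dict[str, Tuple[str, Callable[..., Any]]] = {}

    def register(self, name: str, description: str, fn: Callable[..., Any]) -> None:
        # Capabilities can be added (or replaced) at any time, even per message.
        self._caps[name] = (description, fn)

    def search(self, query: str) -> List[str]:
        """Tool 1: return names whose name or description contains every query word."""
        words = query.lower().split()
        hits = []
        for name, (description, _) in self._caps.items():
            haystack = f"{name} {description}".lower()
            if all(word in haystack for word in words):
                hits.append(name)
        return sorted(hits)

    def execute(self, name: str, **kwargs: Any) -> Any:
        """Tool 2: run a capability previously found via search."""
        _, fn = self._caps[name]
        return fn(**kwargs)


registry = CapabilityRegistry()
registry.register("add", "add two numbers", lambda a, b: a + b)
registry.register("shout", "uppercase a string", lambda text: text.upper())

found = registry.search("numbers")
result = registry.execute(found[0], a=2, b=3)
print(found, result)
```

The agent's tool schema stays fixed at two entries regardless of how many capabilities are registered; a real system would presumably replace the substring match with something stronger, such as embedding search.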

suuuuper excited to be collaborating with the excellent LangChain Labs team on this effort prod agent tracing is the seed that lets you close the loop for continual learning. too much data gets collected but not used for learning. time to change that 🙂
https://x.com/willccbb/status/2054983266046996839

This is harder to build than it looks. Preserving full conversational context while swapping underlying model providers mid-flight is a surprisingly deep systems problem. Most tools drop state or force you to start over. deepagents-cli does this natively: swap models
https://x.com/masondrxy/status/2053717333433340034

This seems like a critical reason to open up about AI use in academia. Scholars are using old AI models, badly, and not talking about it. New models hallucinate very few citations, and good agentic harnesses drop that further. Being open about use would help us make new norms.
https://x.com/emollick/status/2053891532466348541

Turning context into reusable skills for AI agents. THU, DeepLang AI and others introduced Ctx2Skill – a system that does this automatically and evolves skills in a self-improving loop. Instead of reading 200 pages again and again, the model can extract procedures, rules and
https://x.com/TheTuringPost/status/2053062433141616803

VS Code was already used by millions of developers for agentic coding. However, the editor layout has traditionally been optimized for single-task and single-workspace workflows. Today, we’re introducing a new window to enable our users (and ourselves!) to work with multiple
https://x.com/pierceboggan/status/2054775908586934440

We are excited to be partnering with @LangChain for deploying self-improving agents. Continual learning in your production environment unlocks compounding capability gains for model-product optimization. Your data. Your advantage.
https://x.com/PrimeIntellect/status/2054986817779425579

we just shipped delta channels in langgraph 1.2. as agents run longer and use more context, full-state checkpointing doesn’t scale, but delta channel snapshots do. this new algorithm is now powering message histories and file storage in deepagents v0.6!
https://x.com/sydneyrunkle/status/2054278551244099706
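
The scaling argument above (full-state checkpoints versus deltas) can be shown with a small comparison. This is a sketch under assumptions, not the LangGraph algorithm: `FullSnapshots` and `DeltaSnapshots` are hypothetical classes, and the history is assumed to be append-only.

```python
from typing import List


class FullSnapshots:
    """Stores a complete copy of the history at every checkpoint."""

    def __init__(self) -> None:
        self.snapshots: List[List[str]] = []

    def checkpoint(self, history: List[str]) -> None:
        self.snapshots.append(list(history))

    def restore(self, step: int) -> List[str]:
        return list(self.snapshots[step])

    def stored_items(self) -> int:
        return sum(len(s) for s in self.snapshots)


class DeltaSnapshots:
    """Stores only the items appended since the previous checkpoint."""

    def __init__(self) -> None:
        self.deltas: List[List[str]] = []
        self._seen = 0

    def checkpoint(self, history: List[str]) -> None:
        # Assumes an append-only history: everything past _seen is new.
        self.deltas.append(list(history[self._seen:]))
        self._seen = len(history)

    def restore(self, step: int) -> List[str]:
        # Replay deltas up to and including the requested step.
        return [item for delta in self.deltas[: step + 1] for item in delta]

    def stored_items(self) -> int:
        return sum(len(d) for d in self.deltas)


history: List[str] = []
full, delta = FullSnapshots(), DeltaSnapshots()
for i in range(100):
    history.append(f"message {i}")
    full.checkpoint(history)
    delta.checkpoint(history)

print(full.stored_items(), delta.stored_items())
```

With one message per step, full snapshots store a number of items that grows quadratically with the number of steps, while deltas grow linearly; the trade-off is that restoring from deltas requires replaying them.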

We just shipped tons of new products to accelerate the full agent development lifecycle: https://t.co/lt2o5ILg1F TLDR: ✅ LangSmith Engine ✅ SmithDB ✅ Sandboxes ✅ Managed Deep Agents ✅ LLM Gateway ✅ Context Hub ✅ Deep Agents 0.6
https://x.com/LangChain/status/2054617687238865013

which was your favorite launch? SmithDB (database purpose built for agent trace data): https://t.co/xdo2Mn7Amf LangSmith Engine (agent for improving your agents based on trace data):
https://x.com/hwchase17/status/2054754206926700914

Why do AI agents need an identity complex? Here’s a live webinar from @1Password VP of AI Engineering Jeff Malnick and @fiddler_ai CEO @krishnagade – on the hidden identity problem behind AI agents → https://t.co/rY7doFhaFJ You’ll learn how to: – Separate agent identity from
https://x.com/TheTuringPost/status/2054336838928896369

Working with agents for the past months has me convinced that outcome-only evaluation is a flawed approach to benchmarking. You need to look at the logs to understand if the agent really did its job! In our paper Log analysis is necessary for credible evaluation of AI agents, we
https://x.com/steverab/status/2054564579573698921

Does a lexical retriever suffice for agentic search when agents can keep refining their queries? As LLMs become more capable in agentic loops, agents can continuously refine their actions based on environmental feedback. We couldn’t help but ask the question above.
https://x.com/xuzihuan4/status/2054220800073642161

Give our early preview of Computer Use (with ANY model) a try today! Built into the latest Hermes Agent and powered by @trycua – opens the door to any model, not just the frontier models in special modes – to control your actual computer. Best part, it doesn’t take over your PC
https://x.com/Teknium/status/2053961675985113404

OpenSquilla launches open-source AI agent to cut token costs
https://www.testingcatalog.com/opensquilla-launches-open-source-ai-agent-to-cut-token-costs/

Perplexity is building one of the most secure scalable agent runtime sandboxes in the market right now. A blog post on how we: 1. Handle proxy API keys for agents securely 2. Run safety detection for all content accessed by agents 3. Encrypt data passed via connectors to
https://x.com/AravSrinivas/status/2054619058650411174

We’re training models wrong and it’s due to chatGPT. Even the modern coding agents used daily still use message-based exchanges: They send messages to users, to themselves (CoT) and to tools, and receive messages in turn. This bottlenecks even very intelligent agents to a single
https://x.com/jonasgeiping/status/2054600427128201688

✅ Harness profiles: Per-model tuning + support for open models (@Kimi_Moonshot, @Alibaba_Qwen + @deepseek_ai) ✅ Code interpreter: A programmable runtime inside the agent loop ✅ Streaming-typed projections for messages, tool calls, + subagent events ✅ DeltaChannel:
https://x.com/LangChain_OSS/status/2054641656222388700

🚀 Excited to share our new preprint: Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs. To study research-level mathematical reasoning, we introduce Soohak, a benchmark of 439 research-level math problems created from scratch by
https://x.com/gson_AI/status/2054036114483392997

AI Gateway production index – Vercel
https://vercel.com/blog/ai-gateway-production-index

Announcing the Artificial Analysis Coding Agent Index! Our new coding agent benchmarks measure how combinations of agent harnesses and models perform on 3 leading benchmarks, token usage, cost and more When developers use AI to code they’re choosing a model, but also pairing it
https://x.com/ArtificialAnlys/status/2053865095076438427

Great read from the @RedHat_AI team — a comprehensive investigation into TurboQuant in vLLM, with FP8 and BF16 as reference baselines: 4 models (30B to 200B+, decoder-only and MoE) and 5 benchmarks covering long-context retrieval and reasoning, all on the stable vLLM 0.20.2
https://x.com/vllm_project/status/2053852636093239555

I love seeing a new eval with such low scores. When we announced GPT-5.5, almost every benchmark had a score above 50%. It’s time to retire evals like GPQA and bring in a new set.
https://x.com/polynoamial/status/2054255862441812099

Log analysis is not a “one and done” technique, it requires constant effort in validating benchmark results. One reason it’s hard to uncover evaluation bugs is that they become apparent only after models get good enough to solve tasks (or circumvent constraints in evaluations,
https://x.com/sayashk/status/2054569643080077576

The benchmarks show the gap. NVLS all-reduce latency drops from 586.1µs on H200 to 313.3µs on GB200. In MoE prefill at EP=4, combine falls from 730.1µs to 438.5µs. For decode, GB200 sustains much higher throughput at high token speeds.
https://x.com/perplexity_ai/status/2054204425833726353

The first ProgramBench task was just solved by GPT 5.5 high/xhigh. Interestingly, high/xhigh picked two different languages for the task (C vs Python). GPT 5.5 xhigh was significantly better than Opus 4.7 xhigh in all metrics. 🧵
https://x.com/KLieret/status/2054215545663144217

The most extensive independent benchmark of LLMs for software engineering just got a big update! – How does GPT-5.5 compare to Opus 4.7? – Are open models catching up, and in what areas? – How do cost and performance stack up?
https://x.com/OpenHandsDev/status/2053839810343620980

We’re excited to release Medmarks v1.0 + a technical report! This is an update to our Medmarks benchmark suite, the largest open-source automated suite for evaluating the medical capabilities of LLMs. We added 10 benchmarks (20→30) and 15 models (46→61) to the leaderboard!
https://x.com/SophontAI/status/2054270239387627927

We Tested DeepSeek V4 Pro and Flash Against Claude Opus 4.7 and Kimi K2.6
https://blog.kilo.ai/p/we-tested-deepseek-v4-pro-and-flash

.@NVIDIA explored how speculative decoding can speed up RL without changing the model’s behavior. The result is up to ~2.5× faster end-to-end RL at 235B scale. Speculative decoding is when a smaller “draft” model predicts several next tokens at once, and the main model
https://x.com/TheTuringPost/status/2052180472206381268
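
The draft-then-verify loop described above can be sketched in a few lines. This is a simplified greedy variant with toy stand-in models, written under assumptions: it is not the method from the linked work, real systems verify all drafted positions in one batched forward pass, and sampling-based variants use a probabilistic acceptance rule instead of exact matching. The point it illustrates is that the output matches what the main model would have produced on its own.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: the main model generates one token at a time."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(
    target: Model, draft: Model, prompt: List[int], n_new: int, k: int = 4
) -> List[int]:
    """Greedy speculative decoding: the draft proposes, the target verifies."""
    out = list(prompt)
    goal = len(prompt) + n_new
    while len(out) < goal:
        # 1. The cheap draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target checks each proposed position in order.
        for tok in proposal:
            expected = target(out)
            out.append(expected)  # always keep the target's own choice
            if expected != tok:
                break  # first mismatch: discard the rest of the proposal
    return out


def toy_target(prefix: List[int]) -> int:
    return (prefix[-1] * 3 + 1) % 11


def toy_draft(prefix: List[int]) -> int:
    # Agrees with the target except at every third position.
    guess = toy_target(prefix)
    return guess if len(prefix) % 3 else (guess + 1) % 11


baseline = greedy_decode(toy_target, [5], 20)
speculative = speculative_decode(toy_target, toy_draft, [5], 20)
print(baseline == speculative)
```

Because every kept token is the target's own greedy choice, the draft model affects only how many target checks are spent per accepted token, not the final sequence.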

As much as the state of benchmarks in AI is flawed, it is so much easier to track AI progress than robotics. Not sure what you can make of all the videos of robots running races or doing laundry – are there any equivalents to independent AI benchmarks for robots? ARC-AGI-BOT?
https://x.com/emollick/status/2053104629282378061

(1/n) Today, we’re publishing a deeper writeup on Tau2-Infinity: our work on autonomously mining hard tool-use tasks for RL post-training. Finding tasks that are at a precise level of difficulty (within the model’s pass@k window) is still a major challenge for humans manually
https://x.com/Shahules786/status/2054241505506648161

[1604.01753] Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding
https://arxiv.org/abs/1604.01753

[2204.01018] TransRAC: Encoding Multi-scale Temporal Correlation with Transformers for Repetitive Action Counting
https://arxiv.org/abs/2204.01018

[2405.09818] Chameleon: Mixed-Modal Early-Fusion Foundation Models
https://arxiv.org/abs/2405.09818

[2605.08078] Normalizing Trajectory Models
https://arxiv.org/abs/2605.08078

*** Presenting Fast BLT @ ICML ’26 *** BLT showed that compute-efficient byte-level pre-training was possible. Inference is still one-byte-at-a-time. We address this in FastBLT! 1. Using Block Byte-diffusion i.e. auto-regressively predict latent byte-patches (dynamically
https://x.com/sriniiyer88/status/2053882384211419375

// δ-mem: Efficient Online Memory for LLMs // One of the more elegant memory mechanisms I’ve seen this month. Most long-term memory work either inflates context or retrains the model. This paper shows a tiny external state, coupled directly into the attention computation, can
https://x.com/dair_ai/status/2054600147020222630

🚨 New paper: Introducing MIND (Monge Inception Distance) Everyone agrees that FID is broken, requires too many samples, slowing down evals. MIND requires 10x fewer samples, is more robust, faster to compute. Our new drop-in replacement for evaluating generative models. 🧵👇
https://x.com/qberthet/status/2053795951228371311

🚨Typical RL algorithms and on-policy distillation methods are blind samplers: they use privileged info to score rollouts, but not to *find* them. We ask: can we use privileged info to *actively sample* the rollouts RL wishes it can stumble upon with compute? ⤵️ Pedagogical RL
https://x.com/SOURADIPCHAKR18/status/2055057138070733176

1/ The “20 tokens per parameter” Chinchilla scaling law is flawed. It is an artifact of your tokenizer. Scaling shouldn’t be measured in tokens at all. It should be measured in bytes. 🧵
https://x.com/che_shr_cat/status/2054178651856339276
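
A little arithmetic shows why a tokens-per-parameter rule depends on the tokenizer. In this sketch the bytes-per-token figures and tokenizer names are assumed values chosen for illustration, not measurements from the thread; only the "20 tokens per parameter" figure comes from the post above.

```python
# Rule of thumb quoted in the post above.
TOKENS_PER_PARAM = 20

# Assumed average bytes per token for two hypothetical tokenizers.
BYTES_PER_TOKEN = {"tokenizer_a": 3.0, "tokenizer_b": 4.5}


def bytes_per_param(tokens_per_param: float, bytes_per_token: float) -> float:
    """Convert a token budget per parameter into a raw-text byte budget."""
    return tokens_per_param * bytes_per_token


budgets = {
    name: bytes_per_param(TOKENS_PER_PARAM, bpt)
    for name, bpt in BYTES_PER_TOKEN.items()
}
for name, budget in budgets.items():
    print(f"{name}: {budget} bytes of text per parameter")
```

Under these assumed compression rates, the same token budget corresponds to 50% more raw text with the second tokenizer, which is the sense in which a token-based rule is tied to the tokenizer it was measured with.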

1/?) As promised to Sander Dieleman (@sedielem), we’re finally excited to share: Towards Closing the Autoregressive Gap in Language Modeling via Entropy-Gated Continuous Bitstream Diffusion We show that continuous diffusion can achieve very strong language modeling performance
https://x.com/LucaAmb/status/2053867347023466850

5. The most important result of EMO appears when experts are removed. – keeping only 25% of experts causes just a 1% performance drop – keeping only 12.5% of experts causes around a 3% drop In a standard MoE performance drops by ~10-15% under the same setup. So EMO can
https://x.com/TheTuringPost/status/2053795410490339720

A new Mixture-of-Experts from @allen_ai – EMO. Finally, it brings real modularity to MoE architectures, and small groups of experts can work independently. ➡️ Tokens from the same document (which usually belong to the same domain) are routed through a shared pool of experts.
https://x.com/TheTuringPost/status/2053795343658303860

AI approach uncovers dozens of hidden planets in TESS data
https://warwick.ac.uk/news/pressreleases/ai-approach-uncovers-dozens-of-hidden-planets/

AI Workflow Patterns: The Real Unit of AI Adoption in 2026
https://x.com/TheTuringPost/status/2053874295714324722

All inference running on @modal . Mk1 introduced new inference requirements for us — native video at 2 FPS increases prompt length, structured outputs and hybrid thinking increase decode length. Modal was the right partner to ship fast: GPU snapshotting for cold start,
https://x.com/AkshatS07/status/2054275262289002664

An interesting piece of research → Generate, Filter, Control, Replay: A comprehensive survey of rollout strategies for LLM reinforcement learning. It reframed LLM reinforcement learning as a full rollout-engineering problem, introducing the GFCR lifecycle: Generate, Filter, Control, and
https://x.com/TheTuringPost/status/2054713822343266365

Compute Optimal Tokenization – ArXivIQ
https://arxiviq.substack.com/p/compute-optimal-tokenization

David Reich is back. He and collaborator Ali Akbari just published a paper that overturns a long-standing consensus about human evolution — that natural selection has been dormant in our species since the agricultural revolution. By scaling ancient DNA sequencing and developing
https://x.com/dwarkesh_sp/status/2052798237828960334

Diffusion models differ from LLMs in that sampling is differentiable. So if the reward model is too, i.e., ∂R(x0)/(∂x0) can be computed, gradients can flow straight back to model params. Technically, post-training a diffusion model can be as simple as training a classifier.
https://x.com/LiangZheng_06/status/2053806963839168619

Distribution-Guided Policy Optimization (DGPO) – a new PO method that shifts focus to rewarding useful exploration. DGPO improves on the well-known GRPO. It identifies which reasoning steps really mattered: 1. Like GRPO, DGPO samples several Chain-of-Thought reasoning paths for comparison.
https://x.com/TheTuringPost/status/2052539247320858975

End the tyranny of on-policy algorithms in LLM post-training! Maybe the key thing isn’t whether your rollouts are purely “on-policy” or not, but the extent to which they’re pedagogically useful. Early explorations into newer paradigms for RL by @SOURADIPCHAKR18* @NoahZiems*:
https://x.com/lateinteraction/status/2055065846389649436

Everyone is tweeting out “use pnpm & set a minimumReleaseAge of 7 days” but don’t forget blockExoticSubdeps – which would also prevent the usage of a remote github reference here!
https://x.com/ramimacisabird/status/2054178771180093858

Fast Byte Latent Transformer is accepted to ICML 2026! ⚡🥪 Byte-level LMs promise to free us from subword tokenizers, but decoding one byte at a time is super slow. We make BLT generation more efficient with BLT-D: text diffusion for parallel byte decoding. 1/
https://x.com/JulieKallini/status/2053853543552217478

For the next Marin model, we are putting together a new data mix. Currently we have 18T tokens, but could use more. So if you are sitting on some secret stash of high quality tokens, please let us know! Pre-training, mid-training, SFT data all welcome.
https://x.com/percyliang/status/2054550981527146942

Haven’t tried this but it seems very neat… Yet all of the demos (except maybe one) are the model being fun and/or annoying by correcting or reminding in real time. There are obvious uses for this sort of model in meetings, education, training, etc. Why not demo valuable cases?
https://x.com/emollick/status/2053985134227935510

I think this is bigger than it sounds at first glance. Thinking Machines hasn’t just unveiled “ChatGPT, but better.” Instead, they’ve introduced something that addresses a much deeper issue: the very way we interact with AI. So far, AI often feels like email with very clever
https://x.com/kimmonismus/status/2053952846064767384

I’m a simple man, I see a Kaiming He paper, I click. ELF: Embedded Language Flows This is very interesting, getting continuous diffusion models working for text! “Unlike existing DLMs, ELF predominantly stays within the continuous embedding space until the final time step,
https://x.com/iScienceLuvr/status/2054118255778763184

If you liked tangent step + Stiefel manifold retraction for fast Muon, you’ll love the same method applied for SOAP basis updates.
https://x.com/torchcompiled/status/2054036715589771542

In modern ML accelerators, FLOPS have absolutely exploded. Often though, the bottleneck is not FLOPS but memory bandwidth. Similarly, model intelligence has exploded, causing the bottleneck to be human<->AI bandwidth. At Thinky, we think that it’s important to solve this. 1/4
https://x.com/cHHillee/status/2053940218747842619

Inference isn’t everything, but it does require a new stack: not Kubernetes, not SLURM. At @modal, we dove deep to build that stack. In this blog post we explain how, from compute management & cloud-native caching to CRIU & GPU checkpointing.
https://x.com/charles_irl/status/2054233051140690023

Our first VLM release is here: ➡️ +11.7% on 20 evals ➡️ near-frontier 2B performance ➡️ at a fraction of training compute ➡️ big gains in reliability + efficiency
https://x.com/pratyushmaini/status/2054607891202777192

Pareidolia, but for text. Apophenia, but for latent spaces. It’s no wonder that our relationship to LLMs is so confusing.
https://x.com/emollick/status/2052198224308379845

pull_request_target has been one of the sharpest, most dangerous event types on GitHub. it’s “necessary” to build/test fork PRs, but you’re forever at risk of a malicious PR on a fork repo getting you. GitHub has warned against it for a long time.
https://x.com/elithrar/status/2054162732195197283

PyTorch 2.12 Release Blog – PyTorch
https://pytorch.org/blog/pytorch-2-12-release-blog/

Qdrant 1.18 is out, featuring TurboQuant, a new quantization method developed by Google Research. Our TurboQuant implementation is an extended version of the algorithm with borrowed RaBitQ ideas. It offers: 1. Similar recall to Scalar Quantization (SQ), using 2x less memory. 2.
https://x.com/qdrant_engine/status/2054166055417938266

Quite excited about llama-eval, a proposed eval tool for llama.cpp. Could be a nice step toward more comparable community evals 🎉
https://x.com/victormustar/status/2054495700822478943

Reinforcing Recursive Language Models | alphaXiv
https://www.alphaxiv.org/blog/reinforcement-learning-for-rlms

SecureForge: Finding and Preventing Vulnerabilities in LLM-Generated Code via Prompt Optimization. Houjun Liu, Lisa Einstein, John Yang, Joachim Baumann, Duncan Eddy, Christopher D. Manning, Mykel Kochenderfer, Diyi Yang https://t.co/189DhQRcVU [cs.CR cs.CL cs.CY]
https://x.com/FSFG/status/2054196048621367422

Semis Memo: Supply Chain Inheritance – Citrini Research
https://www.citriniresearch.com/p/semis-memo-supply-chain-inheritance

SmithDB is a feat of engineering. A new database for a new data shape. Huge work from @ankush_gola11 and team
https://x.com/caspar_br/status/2054773536603144458

SmithDB is the perfect example of how far performance can be pushed by having full control over the storage layer. The DataFusion + @vortexdotdev stack seems to be emerging as THE way to build next generation databases. #ParquetIsForFloors
https://x.com/ngates_/status/2054859033488580721

So it turns out there is a reason for the limited adoption/extension of TurboQuant in the academic community. TL;DR: It doesn’t really work well. The original eval didn’t tell the full story. 🤯
https://x.com/jbhuang0604/status/2053882357833208262

Superstar AI researchers are paid >10× more than their frontier lab colleagues, and >100× more than most postdocs. Why? The naive explanation is that this is just due to differences in researcher quality. But in a new essay, @ansonwhho argues that this is very incomplete.
https://x.com/EpochAIResearch/status/2054698539566121138

The inability of AI models to produce creative variation is a huge gap. The fact that they generate similar ideas limits their ability to do science & the same-y writing limits their usefulness in many other applications. This paper showed you can optimize models for creativity
https://x.com/emollick/status/2053820720241615023

The Inference Shift – Stratechery by Ben Thompson
https://stratechery.com/2026/the-inference-shift/

The Main Path to Truly Creative AI | Daniel Miessler
https://danielmiessler.com/blog/the-main-path-to-truly-creative-ai

This is a glimpse of how neural geometry can lead us to discover mechanisms we’d otherwise miss – in this case, neural computation. Understanding this machinery paves the way for better debugging, control, and design of AI. Read the full post: https://t.co/z0RFImtHFj (6/6)
https://x.com/GoodfireAI/status/2054962356162363599

This paper could not have been written without the help of my amazing Tübingen co-authors, @guinansu , Yanwu Yang and Xueyan Li! Finally, the link to the paper is: https://t.co/SZZ8OIKMaG where we also link to code, data and models.
https://x.com/jonasgeiping/status/2054600457746579816

thoughts after doing a bunch of synthetic data gen for eval + environment building – LLMs are incredible projections of the world bundled into a set of weights – but doing targeted extraction of certain distributions from those weight is incredibly difficult to do at scale.
https://x.com/Vtrivedy10/status/2054054238226170361

To train better open models, we need predictable scaling. Delphi is Marin’s first step: we pretrained many small models with one recipe, then extrapolated 300× to predict a 25B-param / 600B-token run with just 0.2% error. Getting there took some work 🧵
https://x.com/WilliamBarrHeld/status/2053919463880462453

Today we release Token Superposition Training (TST), a modification to the standard LLM pretraining loop that produces a 2-3× wall-clock speedup at matched FLOPs without changing the model architecture, optimizer, tokenizer, or training data. During the first third of training,
https://x.com/NousResearch/status/2054610062836892054

Today we’re releasing Toto 2.0: a family of open-weights time series foundation models spanning 4M to 2.5B parameters. The question we set out to answer was simple (yet previously open): Do time series foundation models get reliably better as they scale? Our answer: yes! 🧵
https://x.com/atalwalkar/status/2054941930497142826

Toto 2.0 is here: Datadog AI’s 5 open-weights forecasting models (4M-2.5B params) finally make scaling work for time series forecasting! #1 on BOOM, GIFT-Eval, and TIME. Weights/code Apache 2.0. 🔗 Read the blog post for more details:
https://x.com/datadoghq/status/2054929795385893108

Toto 2.0: Time series forecasting enters the scaling era | Datadog
https://www.datadoghq.com/blog/ai/toto-2/

TurboQuant has drawn a lot of attention recently, but the accompanying evals didn’t tell the full story. So we ran what I believe is the first comprehensive study of TurboQuant: where it helps, where it falls short, and how it impacts accuracy, latency, and throughput.
https://x.com/_EldarKurtic/status/2053809592061030546

We are conducting an AI-assisted review of FrontierMath: Tiers 1-4. This has flagged fatal errors in about a third of problems, and we believe most of these flags to be valid. We will release updated scores on a corrected dataset after completing a thorough human review.
https://x.com/EpochAIResearch/status/2053995435870892048

We are releasing Star Elastic – turn ONE reasoning LLM into MANY sizes with a single post-training run. 360× cheaper than pretraining a family of models. 7× better than SOTA compression. Split reasoning capability. Plus elastic budget control that beats the accuracy-latency
https://x.com/PavloMolchanov/status/2054607257166553292

We present ZAYA1-8B-Diffusion-Preview, the first diffusion language model trained on @AMD. Autoregressive LLMs generate one token at a time; diffusion generates a block in parallel, speeding up inference. We show a 4.6-7.7x decoding speedup with minimal quality degradation 🧵
https://x.com/ZyphraAI/status/2055038845809480113

We released Diffusers 0.38.0, and it’s packed with new pipelines and several library-related improvements 🔥 A bunch of new pipelines, including audio 🎼 * Ace-Step 1.5 * LongCat-AudioDiT * Ernie-Image And more! Next up, we added support for: * Flash Attention 4 *
https://x.com/RisingSayak/status/2054110949469196748

You can see the latest data mix using this token viewer that @WilliamBarrHeld built: https://t.co/DaSgZa3Q2y Thanks to @nvidia @huggingface @allen_ai @togethercompute BigCode, CommonPile, and many others who have been releasing high quality data, which helps the entire community!
https://x.com/percyliang/status/2054550984597328101

You.com | Download the Guide: Why API Latency Is a Misleading Metric
https://you.com/resources/why-api-latency-alone-is-a-misleading-metric-download

AI co-mathematician: Accelerating mathematicians with agentic AI
https://arxiv.org/pdf/2605.06651

Cool idea from Nous Research. What if you could speed up long-context pretraining with a subquadratic wrapper that you remove before deployment? That is the idea behind Lighthouse Attention. The method wraps ordinary SDPA with a hierarchical, gradient-free selection layer that
https://x.com/omarsar0/status/2054224130103554359