Image created with gemini-3.1-flash-image-preview and claude-opus-4.7. Image prompt: High-end product photo of an upside-down Dairy Queen-style Blizzard cup filled with vanilla soft-serve loaded with chocolate-wafer circuit-board pieces, caramel solder traces, and silver dragée microchips, the red cup band displaying bold custom retro-tech lettering reading ‘TECH’ with a small ‘Est. 1951 — Milford, DE’ tag, soft directional studio light, shallow depth of field, glossy macro detail, landscape composition.

How successfully — and efficiently! — can agents carry out long-horizon tasks on the web? We built a benchmark of ~200 multi-site tasks, based on people’s real browsing history. Many of them take hours to solve. Paper:
https://t.co/yNGw8Fgvbj Led by @JangLawrenceK and
https://x.com/dan_fried/status/2049530695739932876

Thoughts after reading the DeepSeek V4 paper: – NVIDIA really is something else. Remember how back in 2024 people were bashing Blackwell as overspec’d and dismissing FP4 as just marketing? Turns out it was all groundwork for the next generation of models. Maybe NVIDIA’s moat is
https://x.com/jukan05/status/2047861732702662741

ParseBench: A benchmark for document parsing agents @llama_index just shipped a benchmark with 2k verified pages for real enterprise documents. Benchmarks are the major underrated component in the ML ecosystem, so I’m excited to see more entities doing open work in the space
https://x.com/osanseviero/status/2048777802015535189

Good agent memory paper, with great insights on the benefits of structured memory for long-horizon behavior in LLMs. Why it matters: it treats memory less like search and more like a system that needs maintenance (which it often does). Flat memories are cheap to write.
https://x.com/dair_ai/status/2047740873027543228

Insightful article on the roles of CPUs and GPUs in the AI era. Agentic workloads add orchestration and control-plane logic best suited to CPUs, shifting GPU:CPU ratios from 7-8:1 in training toward 3-4:1 or lower in the inference and agentic eras.
https://x.com/SVTrivo/status/2049205332329795730

@teortaxesTex @zephyr_z9 FYI:
https://t.co/j5ICoH2Ffe Part 1: Model Analysis. Part 2: Analysis of the Network Optimization Scheme Based on 950PR/DT and A3 Clusters (Ascend 950DT benchmarks: DS-V4-Flash 284B, DS-V4-Pro 1.6T; Atlas-A3 benchmark: DS-V4-Flash 284B). Part 3: Future Plan
https://x.com/ogawa_tter/status/2047631993702363509

🚀 Better prompts. Smarter agents. More control. The latest @code release introduces the Chat Customizations Evaluation extension, designed to help analyze and refine your prompts, agents, instructions, and skills.
https://x.com/code/status/2049556204930695278

AI evals are becoming the new compute bottleneck
https://huggingface.co/blog/evaleval/eval-costs-bottleneck

Forward and backward benchmark results across common configurations.
https://x.com/Alibaba_Qwen/status/2049462776247247310

I’m not sure what explains Pro’s relative underperformance but people are underrating the Flash. I think even benchmarks are underweighting it. We don’t really have a lot of benchmarks for “legit 1M context for pennies”. And it can likely be improved far beyond this.
https://x.com/teortaxesTex/status/2047864952862458009

in the day after Whalefall, you’ll see lots of regarded takes from the wypipo and xiaoren. “x months behind, heh”. “distillation”. “benchmax”. “blackwells”. “underwhaleming”. All noise. They’ve completed their quest: Solid Ultra-Long Context. Why was it so important? “Cheap”? no.
https://x.com/teortaxesTex/status/2047623905754448043

KernelBench-Hard coming soon.
https://x.com/elliotarledge/status/2048502965200372132

Let’s talk document formatting. Bold. Italics. Superscripts. Strikethroughs. The visual cues humans rely on every time we read a doc, and ones existing OCR benchmarks completely ignore. 😱 “$199” struck through next to “$149” isn’t decoration. It’s the meaning. 😱 A superscript
https://x.com/llama_index/status/2049139409316946011

On evals: V4 Flash @ max ~= V4 Pro @ high on reasoning tasks. Pro focuses more on knowledge (SimpleQA)
https://x.com/TheZachMueller/status/2047719857869791352

Public free-use eval vs. proprietary STD-free eval: on one, V4-Pro is “top three open” (with a giant gap to #2); on another it’s the only open model even close to the frontier. Which one looks closer to the truth, anon?
https://x.com/teortaxesTex/status/2047616662879248828

We’re announcing: VibeBench, a new benchmark for what actually matters — how models feel when used on real work by experienced software engineers. But, we need your help. Here’s how it works: 1. An initial cohort of 1000 qualified software engineers (join:
https://x.com/jpschroeder/status/2049139723776495800

You think it’s funny, I think it’s the biggest problem these labs have. They are serving at 20-30 tokens/s and don’t accept more than a couple requests. LisanBench literally takes a day to run.
https://x.com/scaling01/status/2047643015859118167

Don’t try to build a self-improving AI agent without evals. You are just wasting time and compute. An agent can’t improve from traces it can’t evaluate. This is why it’s exciting to see @FutureAGI_ going fully open source with their platform. It combines the best of all the
https://x.com/omarsar0/status/2048759865007591615

Most agentic benchmarks center on tasks that are automatically verifiable. But any task that is verifiable is also easy to optimize for. This work instead describes the future of critical open-world evaluations. Led by @sayashk, our current draft is now live.
https://x.com/sarahookr/status/2048731841759428935

Organizational design for agents is hard, benchmarking agents working in concert is hard. Together, this is the next critical frontier for making AI matter in economically valuable tasks, and we really don’t know very much about it.
https://x.com/emollick/status/2047828327856030047

Every company building on top of AI should be making their own benchmarks. This is the way if you want model progress to disproportionately benefit your company.
https://x.com/OfficialLoganK/status/2048554074107470305

Scores I would like to see from DeepSeek-V4 to confirm it being less than 6 months behind frontier models ARC-AGI-1: ~75% ARC-AGI-2: ~35% GSO: ~26% METR: 4.5-5 hours WeirdML: ~63% basically Opus 4.5 / GPT-5.2 scores
https://x.com/scaling01/status/2047686712051048598

GPT 5.5 (no thinking) scores 67.1% on WeirdML, well ahead of GPT 5.4 (no thinking) at 57.4%, but well behind Opus 4.7 (no thinking) at 76.4%. It’s at the frontier for accuracy/tokens, as it uses fewer tokens than Opus.
https://x.com/htihle/status/2048717753394090274

GPT-5.5 by @OpenAI is now live in the Arena, landing across multiple leaderboards. Here’s how it ranks by modality: – Code Arena (agentic web dev): #9, a strong +50pt jump over GPT-5.4 – Document Arena (analysis & long-content reasoning): #6, on par with Sonnet 4.6 – Text
https://x.com/arena/status/2048794479646388732

GPT-5.5 is on par with Claude Mythos – GPT-5.5 average pass rate of 71.4% (±8.0%) – Mythos Preview 68.6% (±8.7%) – GPT-5.5 solved a task that takes a human expert ~12 hours in under 11 minutes at a cost of $1.73
https://x.com/scaling01/status/2049870801998864606

GPT-5.5 Pro achieves a new high score of 159 on the Epoch Capabilities Index! ECI is our statistical tool that combines multiple benchmarks into a unified scale.
https://x.com/EpochAIResearch/status/2049186851844771888

GPT-5.5 xHigh is in Battle Mode in the Code Arena. Evaluate models on agentic coding tasks for front-end websites and apps. Scores coming soon!
https://x.com/arena/status/2048846896744247468

Modded-NanoGPT Optimization Benchmark Hundreds of neural network optimizers have been proposed in the literature, recently including dozens citing Muon: MARS, SWAN, REG, ADANA, Newton-Muon, TrasMuon, AdaMuon, HTMuon, COSMOS, Conda, ASGO, SAGE, and Magma, to name a few. The
https://x.com/kellerjordan0/status/2049193527440187494

To clarify, the Arena community evaluated GPT-5.5 with reasoning effort medium (default) and high. The best of GPT-5.5 with xHigh is still incoming! Stay tuned.
https://x.com/arena/status/2048820224938631492

BullshitBench: GPT-5.5 and 5.5-Pro update! They did NOT do well – 5.5 about the same level as GPT-5.4 (around 30-35 rank, 45% pushback). GPT-5.5-Pro did WORSE – only about 35% pushback. I must say the Pro result kind of shocked me. This is actually interesting, what this tells
https://x.com/petergostev/status/2047773402090426548

GPT 5.5 is much smarter than I thought Yesterday, I did one-shots, coding, benchmarks, and was disappointed. Today, I did it all again, except via the API, which is now available. Results changed completely: → one-shot prompts went from bad to very good → excellent coding
https://x.com/VictorTaelin/status/2047818978664268071

GPT-5.5 is now available in Cursor! It’s currently the top model on CursorBench at 72.8%. We’ve partnered with OpenAI to offer it for 50% off through May 2.
https://x.com/cursor_ai/status/2047744579127185843

GPT-5.5 Pro achieves a small bump on GPT-5.4 Pro with 60% lower cost and token use in our frontier science eval, CritPt CritPt tests models on graduate-level physics research problems contributed by 60+ researchers from 30+ institutions globally. When CritPt was released in
https://x.com/ArtificialAnlys/status/2049926072595280030

LisanBench results for GPT-5.5 – it’s good. GPT-5.5 is now the strongest model without Thinking on both metrics! GPT-5.5-medium uses on average ~45.6% fewer tokens than GPT-5.4-medium while scoring 1.77x higher! (1.14x higher score on the difficulty weighted metric) Running
https://x.com/scaling01/status/2047818395970904229

Opus 4.7 and GPT-5.5 scores on GSO are live! #1 Opus 4.7 @ 42.2% #2 Opus 4.6 @ 37.3% #3 GPT-5.5 @ 37.3%
https://x.com/scaling01/status/2048853227211251891

Summarize 📝 0.14.0 is out. GPT-5.5 Fast mode via `--fast`, Reddit thread extraction in the browser extension, local PDF `--extract`, and fixes for auto model config + Meta site compatibility.
https://x.com/steipete/status/2048275589224628677

The new GPT-5.5 is #1 on Terminal-Bench at 82.7. This beats Anthropic’s Mythos Preview scoring 82.0, which they have not released to the public due to cybersecurity and safety concerns. Available in Cline now!
https://x.com/cline/status/2047769312514257148

ARC-AGI-3 testing is done for gpt-5.5 and opus 4.7 Now we’re in analysis mode going through the logs It’s pretty clear where the failure modes are for each model
https://x.com/GregKamradt/status/2049121093307547654

In Expert Arena, GPT-5.5-High ranks #5 – trailing only Claude Opus 4.6 and 4.7. Expert Arena evaluates models on advanced expert-level prompts in the Text Arena, with a focus on real-world professional use cases. This demonstrates GPT-5.5’s strong performance on complex,
https://x.com/arena/status/2048808366810800259

A few more notes on DeepSeek-V4: – it seems to be a ~GPT-5.2/Opus 4.5+ tier model, so they are still ~4-5 months behind the frontier, but ahead of other chinese labs, with Kimi K2.6 being closest – at 1.6T params they now have a model that’s in the same weight class as GPT-5.4
https://x.com/scaling01/status/2047618271310926151

DeepSeek-V4 is definitely better than GLM-5.1 but not quite Opus 4.7, GPT-5.4 or Gemini 3.1 Pro level unfortunately this video had no comparison to Kimi-K2.6
https://x.com/scaling01/status/2047733998714052819

Mistral Medium 3.5 is interesting less for the benchmarks and more for the positioning. Look at who they’re comparing against: Kimi, Qwen, GLM, Claude (Sonnet). Not GPT, not Gemini. And I don’t mean that in a negative way! With Aleph Alpha being acquired by Cohere last week,
https://x.com/kimmonismus/status/2049545016784413005

New open-weight model dropped: Ling-2.6-flash by @ant_oss • AIME 2026: 73.85 • HMMT Feb 2026: 49.29 • SWE-bench Verified: 61.2 ~107B MoE, MIT license, on the @huggingface Hub now. Open models hitting 60%+ on SWE-bench Verified is wild. A year ago that was closed-frontier
https://x.com/nathanhabib1011/status/2049466639171690820

[2604.22119] Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework
https://arxiv.org/abs/2604.22119

[2604.20329] Image Generators are Generalist Vision Learners
https://arxiv.org/abs/2604.20329

MathNet – a new interesting global multimodal benchmark from @MIT for mathematical reasoning and retrieval It’s a dataset of 30,676 Olympiad-level problems from 47 countries, 17 languages, and 143 competitions over 4 decades, with expert solutions. It defines 3 tasks: – problem
https://x.com/TheTuringPost/status/2049155956135841862

.@MIT researchers introduced Hyperloop Transformers – a mix of ideas from both looped and normal Transformers. The model follows this structure: – Begin and end blocks for input and output are normal layers – Middle block is looped Even though there is one middle block, during
https://x.com/TheTuringPost/status/2047720038342476187

[2604.16529] Scaling Test-Time Compute for Agentic Coding
https://arxiv.org/abs/2604.16529

[2604.22776] Epicure: Multidimensional Flavor Structure in Food Ingredient Embeddings
https://arxiv.org/abs/2604.22776

[2604.24198] Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis
https://arxiv.org/abs/2604.24198

[2604.26779] Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding
https://arxiv.org/abs/2604.26779

🚀 vLLM v0.20.0 is here! I’m excited about TurboQuant! • 752 commits from 320 contributors (123 new) 🎉 • TurboQuant 2-bit KV cache → 4× capacity + FA3/FA4 prefill 🗜️⚡ • FA4 re-enabled as default MLA prefill (SM90+ GPUs) • vLLM IR foundation + rms_norm (future kernel base)
https://x.com/TeksEdge/status/2048983564801450315

1/ Deep learning is going to have a scientific theory. We can see the pieces starting to come together, and it’s looking a lot like physics! We’re releasing a paper pulling together these emerging threads and giving them a name: learning mechanics. 🔨
https://t.co/92nSIHameW 🔧
https://x.com/learning_mech/status/2047723849874330047

1/ Single-block transformers can solve Extreme Sudoku, but only if you give them an explicit scratchpad and invert a classic routing initialization. Without it, performance is literally zero. 🧵
https://x.com/che_shr_cat/status/2049081240762876261

15+ LoRA (Low-Rank Adaptation) variants you should know ▪️ Original LoRA ▪️ QLoRA ▪️ DoRA ▪️ QDoRA ▪️ rsLoRA (Rank-Stabilized) ▪️ VeRA (Vector-based Random Adaptation) ▪️ SingLoRA (Single-Matrix LoRA) ▪️ Sensitivity-LoRA ▪️ ARD-LoRA (Adaptive Rank Dynamic) ▪️
https://x.com/TheTuringPost/status/2048417999636599175
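
For reference, the original LoRA recipe that all of these variants build on is tiny: freeze the pretrained weight W and learn a low-rank update A·B scaled by alpha/r. A minimal NumPy sketch (dimensions, init scale, and hyperparameters are illustrative, not any paper’s config):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 64, 8, 16       # in-dim, out-dim, rank, LoRA scaling

W = rng.normal(size=(d, k))          # frozen pretrained weight
A = rng.normal(size=(d, r)) * 0.01   # trainable down-projection
B = np.zeros((r, k))                 # trainable up-projection, zero-init
                                     # so the delta starts at exactly 0
x = rng.normal(size=(1, d))

# LoRA forward pass: frozen path plus low-rank update scaled by alpha/r
y = x @ W + (x @ A @ B) * (alpha / r)

# at init, B = 0, so the adapted output equals the frozen model's output
assert np.allclose(y, x @ W)
```

Every variant in the list above tweaks some piece of this: the scaling (rsLoRA), the parameterization (DoRA, VeRA), the precision of W (QLoRA), or how r is chosen (ARD-LoRA).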

A deep dive from @awscloud × @RedHat_AI into FP8 KV-cache + attention in vLLM. The headline fix: two-level accumulation in FA3 takes 128k needle-in-a-haystack from 13% → 89%, while keeping the FP8 decode speedup. Plus a new `--kv-cache-dtype-skip-layers` flag for hybrid-attention
https://x.com/vllm_project/status/2048796304508330462
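
The two-level trick generalizes: keep short low-precision partial sums and carry them in a higher-precision accumulator. This is not the FA3 kernel, just a NumPy toy (sizes and values are mine) showing why naive low-precision accumulation loses mass over long contexts:

```python
import numpy as np

vals = np.full(50_000, 0.01, dtype=np.float16)  # 50k small terms, true sum = 500

# naive: running sum kept entirely in fp16 -- stalls once the increment (0.01)
# falls below half of fp16's spacing around the running total (0.03125 at 32.0)
naive = np.float16(0.0)
for v in vals:
    naive = np.float16(naive + v)

# two-level: short fp16 partial sums, accumulated in an fp32 carry
two_level = np.float32(0.0)
for chunk in vals.reshape(-1, 100):              # 100-element inner blocks
    two_level += np.float32(chunk.sum(dtype=np.float16))

print(float(naive), float(two_level))  # naive lands far below the true 500
```

The inner blocks stay small enough that fp16 rounding barely bites, and the fp32 outer accumulator never stalls, which is the same structure as the FA3 fix described above.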

After 100 million tokens, performance was still going up. What we’re seeing here is not the capability ceiling. From the report: “Performance on TLO continues to scale with the amount of inference compute spent, and we have not yet observed a plateau with the best models.”
https://x.com/polynoamial/status/2049883449327243413

Ant Group has just released Ling 2.6 1T, an open weights, non-reasoning model with high cost efficiency and a reasonable intelligence tradeoff. Ling 2.6 1T scores 34 on the Artificial Analysis Intelligence Index, a 15-point jump from Ling-1T Ling 2.6 1T is the latest model from
https://x.com/ArtificialAnlys/status/2049923495602303438

big theme of 2026 – cost of closed models is too high! really excited to make deepagents work exceptionally well with OSS models
https://x.com/hwchase17/status/2049552801890771220

Darwinian Specialization in AI | Tomasz Tunguz
https://tomtunguz.com/inference-market-segmentation/

Didn’t expect to be hit with the most significant AI paper of the year over breakfast. Longer read coming, but what’s obvious already: turning a sovereignty play (using Huawei Ascend) into an opportunity to reshape hardware. Interconnects, memory, power: wishlist everywhere.
https://x.com/Dorialexander/status/2047632551326413109

Exactly what needs to be done. Biological data is the missing link. It may not be sexy or make for shiny announcements but building biological infrastructure is where the impact is. Huge props to CZI.
https://x.com/fidjissimo/status/2049588175555977422

Extremely frothy paper. The one representation to rule them all?
https://x.com/bilawalsidhu/status/2049490131858440453

foid was correct btw. There is a somewhat cruel aspect to V4. Unlike before, it is not exactly a democratizing technology. GRPO was a godsend. DSMoE and MLA were finicky but worth the effort. DSA is almost free gain if you master MLA. But this damn thing… who can even adopt it?
https://x.com/teortaxesTex/status/2047840426371977467

For the past few years, humans have been doing “prompt engineering” to coax the best performance out of different LLMs. In this work, we explored what happens if we train an AI to do that job instead. By training a Conductor model with RL, we found that it naturally learns to
https://x.com/hardmaru/status/2048778095935795338

Ghosts in the Distillation Pipeline
https://x.com/TheTuringPost/status/2048901122979446991

GPU library performance can be very notchy — runtime of batched torch.linalg.solve_ex() went up by over 10x going from 511×511 matrices to 512×512.
https://x.com/ID_AA_Carmack/status/2049467648900018281

Granite 4.1 LLMs: How They’re Built
https://huggingface.co/blog/ibm-granite/granite-4-1

Great example of why private deployments are important not just for large enterprises but everyone. If you control the model, you control the costs.
https://x.com/aidangomez/status/2049083965407969690

Here are the transcripts. So ~150T tokens and ~100B active parameters come out to 9e25 pretraining FLOPs. With some back-of-the-envelope calcs, using the OpenAI 100K GB200 cluster and a conservative MFU of say 15%, this takes ~14 days
https://x.com/nrehiew_/status/2049848830292856970
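
The arithmetic checks out under the standard 6·N·D dense-training estimate. A quick sanity check — every input below is the thread’s estimate, and the ~5e15 FLOP/s per-GPU figure is my assumption for the implied GB200-class FP8 dense peak, not a confirmed spec:

```python
# back-of-envelope check of the thread's numbers (all inputs are estimates)
active_params = 100e9    # ~100B active parameters
tokens        = 150e12   # ~150T pretraining tokens

flops = 6 * active_params * tokens          # standard 6*N*D dense-training rule
print(f"{flops:.1e}")                       # ~9e25 FLOPs, as stated above

# hypothetical cluster: 100K GPUs at ~5e15 FLOP/s each, 15% MFU
gpus, peak, mfu = 100_000, 5e15, 0.15
days = flops / (gpus * peak * mfu) / 86_400
print(f"{days:.1f} days")                   # ~13.9, matching the ~14-day figure
```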

Hybrid CSA + HCA Cont Example of how it goes per layer of the 61 layers: [ HCA, HCA, CSA, HCA, CSA, HCA, … MTP ] First two layers are pure HCA, then alternate. V4 flash is the same but with two sliding window layers rather than HCA
https://x.com/TheZachMueller/status/2047702996524405175

Hybrid CSA + HCA Cont The partner to this is HCA, which is a single non-overlapping KV stream at 128 tokens. It’s a dense attention mechanism, but shares the MQA/grouped output projections with CSA
https://x.com/TheZachMueller/status/2047702488418030066
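
Putting the two posts together, the described layout is easy to write down. This is purely an illustrative reconstruction from the thread (layer count and ordering as stated there, not verified against the model):

```python
# sketch of the per-layer layout described above: 61 layers, first two pure
# HCA, then alternating CSA/HCA, with an MTP head appended at the end
n_layers = 61

layers = []
for i in range(n_layers):
    if i < 2:
        layers.append("HCA")                                 # first two layers: HCA only
    else:
        layers.append("CSA" if (i - 2) % 2 == 0 else "HCA")  # then alternate
layers.append("MTP")                                         # multi-token-prediction head

print(layers[:6], layers[-1])  # ['HCA', 'HCA', 'CSA', 'HCA', 'CSA', 'HCA'] MTP
```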

I actually really enjoyed this article on continual learning. Finally had a chance to read through over the weekend. Thanks to @a16z for the spotlight on our work at @adaption_ai I think what the deep dive gets right is it correctly pinpoints one of the core questions
https://x.com/sarahookr/status/2048759884125233453

IBM has released three new non-reasoning Granite 4.1 models (30B, 8B, 3B) as open weights under Apache 2.0. All three are notably token-efficient relative to peer non-reasoning models, with the 8B standing out for its token efficiency relative to intelligence.
https://x.com/ArtificialAnlys/status/2049505499377193156

If you’ve ever wondered what Poolside was up to… They’ve been cooking! They just released Laguna XS and M, their first ever public models, on Hugging Face: – Two coder models: one is 225B params, of which 23B active, the other is 33B-3A. – hybrid attention: global v. sliding
https://x.com/AymericRoucher/status/2049156715304935451

In 2026 a token doesn’t mean just one thing We measure, price and argue about AI in tokens There are at least 7 distinct types, and if you don’t know the differences, you’re misunderstanding both the tech and what you’re paying for Let’s set it straight
https://x.com/TheTuringPost/status/2047780196363731103

in the Think post, I proposed this idea of an execution Ladder. Where the LLM (or even the user) _chooses_ to escalate the execution environment based on what it’s trying to do at the moment. I’ll make it so this isn’t cloudflare specific, you can bring your own
https://x.com/threepointone/status/2049463167298777310

In this work, we examined the question “what makes a good clarifying question?” in the context of software engineering. We trained a model that was specifically tasked with asking clarifying questions, allowing for improved results with fewer questions.
https://x.com/gneubig/status/2047623214583492797

Ineffable Intelligence
https://www.ineffable.ai/

Introducing AutoSP – PyTorch
https://pytorch.org/blog/introducing-autosp/

Introducing talkie: a 13B vintage language model from 1930
https://talkie-lm.com/introducing-talkie

KV Cache Locality: The Hidden Variable in Your LLM Serving Cost | Ranvier
https://ranvier.systems/2026/04/30/kv-cache-locality-the-hidden-variable-in-your-llm-serving-cost.html

Laguna XS.2 and M.1: A Deeper Dive — Poolside
https://poolside.ai/blog/laguna-a-deeper-dive

Listen, people. If you have multiple GPUs in your computer, and you want them to work together to serve an AI model – that’s called *tensor parallel* @ollama, @lmstudio, llama.cpp *do not support tensor parallel.* Full stop. Don’t use them. @vllm_project is what you want.
https://x.com/QuixiAI/status/2047765475937890474
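
What tensor parallel actually does is mechanically simple: shard each weight matrix across devices so every GPU computes part of every matmul. A toy single-process NumPy sketch of a column-parallel linear layer (vLLM’s real implementation shards across processes with collective communication; this only shows the math):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 512))      # a batch of activations
W = rng.normal(size=(512, 1024))   # one linear layer's full weight matrix

# column-parallel split: each "GPU" holds half the output columns and
# computes its shard independently; results are concatenated, not summed
W0, W1 = W[:, :512], W[:, 512:]
y_parallel = np.concatenate([x @ W0, x @ W1], axis=1)

# identical to the single-device matmul
assert np.allclose(y_parallel, x @ W)
```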

mHC: Manifold-Constrained Hyper-Connections
https://arxiv.org/pdf/2512.24880

Models gain a lot from long reasoning but maybe they don’t need to write reasoning in words at all? @IBM introduced Abstract Chain-of-Thought that replaces text reasoning with abstract tokens. The model produces a short sequence with these special tokens which are: – much
https://x.com/TheTuringPost/status/2049637933754531860

Monitoring LLM behavior: Drift, retries, and refusal patterns | VentureBeat
https://venturebeat.com/infrastructure/monitoring-llm-behavior-drift-retries-and-refusal-patterns

Must-read research of the week ▪️ Image generators are generalist vision learners ▪️ There Will Be a Scientific Theory of Deep Learning ▪️ Learning Evidence Highlighting for Frozen LLMs ▪️ Contexts are Never Long Enough ▪️ Memanto ▪️ Knowing When to STOP, RECOVER, and SEARCH ▪️
https://x.com/TheTuringPost/status/2048929202263757142

New Frontier Models Are Faster, Not More Reliable, at Spatial Biology
https://blog.latch.bio/p/new-frontier-models-are-faster-not?triedRedirect=true

🆕 Today, we’re releasing the public preview of Workflows, the orchestration layer for enterprise AI. 🌎 Enterprise teams have capable models. What they don’t have is a way to run them reliably in production. That’s the gap Workflows fills. It takes AI-powered business processes
https://x.com/MistralAI/status/2049128071874179091

New work with @AlecRad and @DavidDuvenaud: Have you ever dreamed of talking to someone from the past? Introducing talkie, a 13B model trained only on pre-1931 text. Vintage models should help us to understand how LMs generalize (e.g., can we teach talkie to code?). Thread:
https://x.com/status_effects/status/2048878495539843211

Noam Brown at ICLR: scaffolding LLMs is the norm; inference compute is a strategic resource, currently undervalued; safety on long-horizon tasks can be projected.
https://x.com/hxiao/status/2048458363889938547

Not all tokens are worth learning from in on-policy distillation – shows this new interesting paper It’s a typical story about “some tokens carry much stronger learning signal than others” but with non-trivial findings: ▪️ There are 2 types of useful tokens: 1.
https://x.com/TheTuringPost/status/2047617791709282405

Oh yeah this is amazing and it’s also 9 years old! This is “DSO: Direct Sparse Odometry” from @TU_Muenchen The direct approaches were revolutionary and a true inspiration in my SLAM journey
https://x.com/Almorgand/status/2047717560121020925

Open Models & a potential looming Haiku-apocalypse? 🚀 chart comparison UI courtesy of @OpenRouter there’s a large class of problems where you don’t need frontier intelligence. you want cheap, fast, and smart enough to get the job done. these are often classification tasks
https://x.com/Vtrivedy10/status/2049201138310721616

Payload: 60kg @duatic_ag
https://x.com/IlirAliu_/status/2047381044588896515

per-tensor overhead for dynamic activation quantization is so brutal, which is why I prefer static quantization for inference speed, even if we sacrifice some time for calibration.
https://x.com/maharshii/status/2049058891389108640
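
The tradeoff in one toy int8 sketch: dynamic quantization pays an amax reduction over the activation tensor on every forward call to derive its scale, while static quantization fixes the scale ahead of time from a calibration pass. Purely illustrative NumPy, not any framework’s actual kernel:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_int8(x, scale):
    # symmetric per-tensor int8 quantization with a given scale
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

x = rng.normal(size=(8, 256)).astype(np.float32)

# dynamic: scale from THIS tensor's amax at inference time --
# the per-tensor runtime reduction the post calls out as overhead
dyn_scale = np.abs(x).max() / 127.0
x_dyn = quantize_int8(x, dyn_scale)

# static: scale fixed from a calibration set, no runtime reduction needed
calib = rng.normal(size=(1024, 256)).astype(np.float32)
static_scale = np.abs(calib).max() / 127.0
x_stat = quantize_int8(x, static_scale)

# dynamic tracks each tensor's range exactly; static trades a bit of
# round-trip error for a cheaper forward pass
err_dyn  = np.abs(x_dyn.astype(np.float32) * dyn_scale - x).max()
err_stat = np.abs(x_stat.astype(np.float32) * static_scale - x).max()
print(err_dyn, err_stat)
```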

Recently, I have been diving deeper into torch compile internals especially for inference related graph optimizations with custom kernels and below are my findings/learnings: Note that this is still a very high-level overview with lots of moving parts hidden behind the scenes.
https://x.com/maharshii/status/2049402475476861044

“the routing mechanism itself appears to exacerbate the emergence of these outliers.” The routing mechanism: hashing 🤦🏻‍♀️
https://x.com/suchenzang/status/2047772636881842629

SMG: The Case for Disaggregating CPU from GPU in LLM Serving – PyTorch
https://pytorch.org/blog/lightseek-smg/

The Recurrent Transformer: Greater Effective Depth and Efficient Decoding | alphaXiv
https://www.alphaxiv.org/abs/2604.21215

The World Can’t Keep Up With AI Labs – LessWrong 2.0 viewer
https://www.greaterwrong.com/posts/fewDbvpKMZLgGuWT2/the-world-can-t-keep-up-with-ai-labs

There are no AI-native enterprises
https://x.com/TheTuringPost/status/2048211358886314468

There were bugs in DeepSpeed and OpenRLHF that reduce SFT performance, and this affected several studies.
https://x.com/rosinality/status/2049024030749970699

There’s a quadrillion-dollar question at the heart of AI: why are humans so much more sample efficient than LLMs? There are three possible answers: 1. Architecture and hyperparameters (aka transformer vs whatever ‘algo’ cortical columns are implementing) 2. Learning rule
https://x.com/dwarkesh_sp/status/2049232356998094998

Today I reiterate: I hate MoEs and we are wasting time on them… Let’s unite and call for a global ban on MoEs, please. Please, 1M+ salary researchers: do better… credits to @IlysMoutawwakil for the graph:
https://x.com/art_zucker/status/2047619111082172548

Token taxonomy (or what you’re actually paying for): 1. Input tokens – what you send in 2. Output tokens – what comes back 3. Reasoning tokens – the thinking tax 4. Speculative tokens – generated just to be discarded 5. Cached tokens – reused work (can be ~90% cheaper) 6.
https://x.com/TheTuringPost/status/2048015703350067422
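
A toy cost calculator makes item 5 concrete. All per-million-token prices below are made up; the ~90%-off cached rate mirrors the taxonomy above. With a long system prompt served mostly from cache, the same request costs nearly 5× less:

```python
# hypothetical $/M-token prices, chosen only to illustrate the taxonomy
PRICES = {
    "input":   3.00,   # uncached input tokens
    "cached":  0.30,   # cached input tokens (~90% cheaper)
    "output": 15.00,   # output tokens (reasoning tokens usually bill here)
}

def request_cost(tokens: dict) -> float:
    # sum over each token type: count / 1M * price-per-million
    return sum(tokens[k] / 1e6 * PRICES[k] for k in tokens)

# same 52k-token prompt: warm cache vs cold cache
warm = request_cost({"input": 2_000,  "cached": 50_000, "output": 1_000})
cold = request_cost({"input": 52_000, "cached": 0,      "output": 1_000})
print(f"${warm:.4f} vs ${cold:.4f}")   # $0.0360 vs $0.1710
```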

TurboQuant: A First-Principles Walkthrough
https://arkaung.github.io/interactive-turboquant/

V4 discount extended to the end of May btw
https://x.com/teortaxesTex/status/2049101287161991332

We currently train AIs by having them predict human text outputs. Could we train it by having them predict human neural patterns? w. @AdamMarblestone
https://x.com/dwarkesh_sp/status/2049285160668217756

We’re excited to introduce KAME: Tandem Architecture for Enhancing Knowledge in Real-Time Speech-to-Speech Conversational AI, accepted at #ICASSP2026! 🐢 Blog
https://t.co/eyU3yECBK8 Paper
https://t.co/PVYPIcHyyM Can a speech AI think deeply without pausing to process? In real
https://x.com/SakanaAILabs/status/2049544945233764755

What you’re actually writing when you write a SKILL.md
https://internals.laxmena.com/p/what-youre-actually-writing-when

What’s So Magical About Embeddings?
https://x.com/TheTuringPost/status/2049631019234713749

Wrote up some flashcards and practice problems to help myself retain what @reinerpope taught. Hope it’s helpful to you too! Suggest more below and I’ll add them.
https://x.com/dwarkesh_sp/status/2049570394110390305

Microsoft presents “TRELLIS.2”: an open-source, 4B-parameter image-to-3D model producing up to 1536³ PBR-textured assets, built on native 3D VAEs with 16× spatial compression for efficient, scalable, high-fidelity asset generation. Ngl, pretty cool!
https://x.com/kimmonismus/status/2049099376476459372

First open-weight model from @poolsideai! Apache license, and available on Ollama to try. 👇👇👇 model page
https://x.com/ollama/status/2049184817603031463

Discover more from Ethan B. Holland
