Breaking: @AIatMeta just released Muse Spark — now live across @ScaleAILabs leaderboards. Here’s how it stacks up: Tied for 🥇 on SWE-Bench Pro. Tied for 🥇 on HLE. Tied for 🥇 on MCP Atlas. Tied for 🥇 on PR Bench – Legal. Tied for 🥈 on SWE Atlas Test Writing. 🥈 on PR Bench – Finance.
https://x.com/scale_AI/status/2041934840879358223
Introducing Muse Spark, the first in the Muse family of models developed by Meta Superintelligence Labs. Muse Spark is a natively multimodal reasoning model with support for tool-use, visual chain of thought, and multi-agent orchestration. Muse Spark is available today at
https://x.com/AIatMeta/status/2041910285653737975
NEW: Meta announces Muse Spark. All you need to know: * It’s their new multi-modal reasoning model. * Strong at multi-agent orchestration and multi-modal reasoning. * Contemplating mode orchestrates multiple agents that reason in parallel. Helps to compete with models such
https://x.com/omarsar0/status/2041919769536770247
To spend more test-time reasoning without drastically increasing latency, we can scale the number of parallel agents that collaborate to solve hard problems. While standard test-time scaling has a single agent think for longer, scaling Muse Spark with multi-agent thinking enables
https://x.com/AIatMeta/status/2041926297216282639
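The parallel test-time scaling described above can be sketched with a stubbed solver plus a majority vote. Everything here is an illustrative stand-in (the deterministic `solve` stub, the vote rule), not Meta's actual orchestration:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def solve(problem: str, seed: int) -> str:
    # Stand-in for one agent's reasoning pass: most seeds agree on the
    # right answer, a minority goes astray.
    return "42" if seed % 3 != 0 else "7"

def parallel_solve(problem: str, n_agents: int = 8) -> str:
    # Agents run concurrently, so wall-clock latency stays close to a
    # single pass while the vote over n_agents answers improves reliability.
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        answers = list(pool.map(lambda s: solve(problem, s), range(n_agents)))
    return Counter(answers).most_common(1)[0][0]

print(parallel_solve("What is 6 * 7?"))  # "42": 5 of 8 stub agents agree
```

This is the contrast with standard test-time scaling: instead of one agent thinking longer (latency grows with thinking), more agents think at once (latency roughly constant).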
Meta is back! Muse Spark scores 52 on the Artificial Analysis Intelligence Index, behind only Gemini 3.1 Pro, GPT-5.4, and Claude Opus 4.6. Muse Spark is the first new release since Llama 4 in April 2025, and also Meta’s first release that is not open weights. Muse Spark is a new
https://x.com/ArtificialAnlys/status/2041913043379220801
try muse spark via the Meta AI app or https://t.co/DipeeIuXm2! check out this simulation i made:
https://x.com/alexandr_wang/status/2041953243895623913
1/ today we’re releasing muse spark, the first model from MSL. nine months ago we rebuilt our ai stack from scratch. new infrastructure, new architecture, new data pipelines. muse spark is the result of that work, and now it powers meta ai. 🧵
https://x.com/alexandr_wang/status/2041909376508985381
The new model from Meta, Muse Spark, is pretty good at converting images to code!
https://x.com/skirano/status/2041920891072700631
Excited to share what we’ve been building at Meta Superintelligence Labs! We just released Muse Spark, our first AI model. It’s a natively multimodal reasoning model and the first step on our path to personal superintelligence. We’ve overhauled our entire stack to support
https://x.com/shengjia_zhao/status/2041909050728931581
Introducing Muse Spark: Scaling Towards Personal Superintelligence
https://ai.meta.com/blog/introducing-muse-spark-msl/
Meta is back in the game! It’s been fun to test out Muse Spark. Beyond benchmarks, it’s actually a good day to day model… surprisingly good at technical problems and making arcade games. Never bet against @alexandr_wang @natfriedman @danielgross
https://x.com/matthuang/status/2041911766586945770
Meta just released a frontier model, Muse Spark – it takes the #3 spot on our Vals Index.
https://x.com/ValsAI/status/2041922037745381389
try muse spark yourself! download the Meta AI app or go to
https://x.com/alexandr_wang/status/2042024651610861657
We had pre-release access to Meta’s new Muse Spark model and evaluated it on FrontierMath. It scored 39% on Tiers 1-3 and 15% on Tier 4. This is competitive with several recent frontier models, though behind GPT-5.4.
https://x.com/EpochAIResearch/status/2041947954202988757
To build personal superintelligence, our model’s capabilities should scale predictably and efficiently. Below, we share how we study and track Muse Spark’s scaling properties along three axes: pretraining, reinforcement learning, and test-time reasoning. 🧵👇 Let’s start with
https://x.com/AIatMeta/status/2041926291142930899
New research from Databricks: AI agents get measurably better as they accumulate more memory — not bigger models, not longer contexts, just better retrieval from past experience. Uncurated user logs beat hand-crafted domain instructions after just 62 records. We call it memory
https://x.com/DbrxMosaicAI/status/2042666277328609763
[2604.04872] Synthetic Sandbox for Training Machine Learning Engineering Agents
https://arxiv.org/abs/2604.04872
A Taxonomy of RL Environments for LLM Agents
https://leehanchung.github.io/blogs/2026/03/21/rl-environments-for-llm-agents/
🫱 Introducing 𝐍𝐞𝐮𝐫𝐚𝐥 𝐂𝐨𝐦𝐩𝐮𝐭𝐞𝐫s: 𝐰𝐡𝐚𝐭 𝐢𝐟 𝐀𝐈 𝐝𝐨𝐞𝐬 𝐧𝐨𝐭 𝐣𝐮𝐬𝐭 𝐮𝐬𝐞 𝐜𝐨𝐦𝐩𝐮𝐭𝐞𝐫𝐬 𝐛𝐞𝐭𝐭𝐞𝐫, 𝐛𝐮𝐭 𝐛𝐞𝐠𝐢𝐧𝐬 𝐭𝐨 𝐛𝐞𝐜𝐨𝐦𝐞 𝐭𝐡𝐞 𝐫𝐮𝐧𝐧𝐢𝐧𝐠 𝐜𝐨𝐦𝐩𝐮𝐭𝐞𝐫 𝐢𝐭𝐬𝐞𝐥𝐟? Beyond today’s conventional computers, agents, and
https://x.com/MingchenZhuge/status/2042607353175097660
AI-assisted Deployment | Spacelift Intelligence
https://spacelift.io/platform/intelligence?refid=Rundownd+Intelligence+Landing+Page
New on the Engineering Blog: Building Managed Agents, our hosted service for long-running agents, meant solving an old problem in computing: how to design a system for “programs as yet unthought of.” Read more:
https://x.com/AnthropicAI/status/2041929199976640948
JRM: Joint Reconstruction Model for Multiple Objects without Alignment. TL;DR: jointly reconstructs objects from unaligned observations using a 3D flow-matching model, removing the need for explicit alignment.
https://x.com/Almorgand/status/2040048419993985103
Know3D: Prompting 3D Generation with Knowledge from Vision-Language Models. TL;DR: injects vision-language knowledge into diffusion-based 3D generation to make unseen regions controllable and semantically consistent.
https://x.com/Almorgand/status/2040420958532514067
Day 93/365 of GPU Programming. Studying parallelism today and stumbled upon this incredible blog post/book, The Ultra-Scale Playbook: Training LLMs on GPU Clusters by Hugging Face, which dives deep into data parallelism, expert parallelism, tensor parallelism, pipeline parallelism
https://x.com/levidiamode/status/2041229052804280811
My picture of the present in AI – LessWrong 2.0 viewer
https://www.greaterwrong.com/posts/WjaGAA4xCAXeFpyWm/my-picture-of-the-present-in-ai
[2603.28052] Meta-Harness: End-to-End Optimization of Model Harnesses
https://arxiv.org/abs/2603.28052
Must-read research of the week ▪️ Meta-Harness: End-to-End Optimization of Model Harnesses ▪️ A Survey of On-Policy Distillation for Large Language Models ▪️ The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook ▪️ Marco DeepResearch ▪️ FIPO: Eliciting Deep
https://x.com/TheTuringPost/status/2042026647063556414
Seems like a good model from Meta that is still trailing the current series of releases. The most important thing to note is that it is not open weights. That was the main reason that Meta’s models were so important. Without that, it is a lot harder to predict the value of Spark
https://x.com/emollick/status/2041924282964394085
try for yourself! https://t.co/DipeeIuXm2 or download Meta AI app
https://x.com/alexandr_wang/status/2041985846950424760
Our first model from MSL, Muse Spark, is now available on https://t.co/qBMQ6BPVgP! This is an efficient all-rounder model. It supports fast responses, deeper thinking, visual chain of thought, and a higher-inference “Contemplating” mode. Plus, it’s natively multimodal. 1/
https://x.com/jack_w_rae/status/2041925332631183421
1/ It’s been so fun working with @shengjia_zhao, @alexandr_wang and the team to build muse spark from scratch. It is early and has rough edges, but excited to continue our research velocity. I especially love that we’re doubling down on the fundamental science. We’re focused on
https://x.com/ananyaku/status/2041913147842556390
1/ Muse Spark is live, and alongside it, our new Advanced AI Scaling Framework which details how we evaluate and prepare for advanced AI. We tested across bio, chem, cyber, and loss of control risks before and after mitigations. Muse Spark achieves a 98% bioweapons refusal rate
https://x.com/summeryue0/status/2041956901769113948
Check out Muse Spark, our first milestone in the quest for personal superintelligence! Scaling this with the team has been a total blast. Give it a spin and let us know what you think! 🥑
https://x.com/ren_hongyu/status/2041922484040298796
try muse spark on
https://x.com/alexandr_wang/status/2041956770864885870
We spent weeks testing text vs. image retrieval for RAG. The winner? 𝗡𝗲𝗶𝘁𝗵𝗲𝗿. Our recent publication, IRPAPERS, compares 𝘁𝗲𝘅𝘁-𝗯𝗮𝘀𝗲𝗱 𝗿𝗲𝘁𝗿𝗶𝗲𝘃𝗮𝗹 (OCR + vector, keyword, and hybrid search) and 𝗶𝗺𝗮𝗴𝗲-𝗯𝗮𝘀𝗲𝗱 𝗿𝗲𝘁𝗿𝗶𝗲𝘃𝗮𝗹 (multimodal late
https://x.com/weaviate_io/status/2041897318367060054
Multimodal Embedding & Reranker Models with Sentence Transformers
https://huggingface.co/blog/multimodal-sentence-transformers
For Olmo 3, we moved from a synchronous RL setup to an asynchronous one. This made our code 4x faster in terms of throughput (tokens/second). I wrote about the changes in the paper, but I finally found the time to go deeper on what was involved:
https://x.com/finbarrtimbers/status/2041176604961878271
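The sync-to-async shift above can be illustrated with a toy producer/consumer: rollout generation and gradient steps overlap instead of alternating, so the learner never idles. Names and timings are illustrative, not Olmo 3's actual code:

```python
import queue
import threading
import time

rollouts = queue.Queue(maxsize=4)   # bounded buffer between actor and learner

def actor():
    # Stand-in for rollout generation: produces trajectories continuously
    # instead of waiting for the learner to consume each batch.
    for step in range(8):
        time.sleep(0.01)            # pretend this is model generation
        rollouts.put(f"traj-{step}")
    rollouts.put(None)              # sentinel: generation finished

def learner():
    trained = []
    while (item := rollouts.get()) is not None:
        time.sleep(0.01)            # pretend this is a gradient step
        trained.append(item)
    return trained

t = threading.Thread(target=actor)
t.start()                           # generation and training now overlap
result = learner()
t.join()
print(len(result))                  # 8
```

In a synchronous setup these two loops would alternate, and total time would be the sum of both; here it approaches the max of the two, which is where the throughput gain comes from.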
Common Failure Modes Break VLM-Powered OCR in Production. 🔁 Repetition Loops — model spirals into infinite whitespace, exhausts resources, cascades latency across your system. 🛑 Recitation Errors — safety filters hard-stop legitimate extractions as “copyright violations”
https://x.com/llama_index/status/2041923086719631780
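A cheap guard for the repetition-loop failure mode is to check whether a single n-gram dominates the tail of the decoded text and abort generation when it does. The window and threshold below are illustrative defaults, not llama_index's implementation:

```python
from collections import Counter

def is_repetition_loop(text: str, n: int = 8, window: int = 400,
                       threshold: float = 0.5) -> bool:
    # Inspect only the most recent characters of the stream.
    tail = text[-window:]
    if len(tail) < n * 2:
        return False
    grams = [tail[i:i + n] for i in range(len(tail) - n + 1)]
    _, count = Counter(grams).most_common(1)[0]
    # If one n-gram covers most of the tail, the model is looping.
    return count * n / len(tail) >= threshold

print(is_repetition_loop("Total: $1,234.56 due on receipt."))  # False
print(is_repetition_loop("Invoice:" + " " * 600))              # True: whitespace spiral
```

Running this check on each streamed chunk lets you cut off a runaway decode before it exhausts your token budget or cascades latency downstream.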
Excited to launch The ATOM Report with @natolambert! For over 9 months, we scraped publicly available data to measure the open ecosystem. Some insights, some of them surprising, others less so:
https://x.com/xeophon/status/2041889677343343014
.@Alibaba_Qwen Pilot Team introduced a new Policy Optimization strategy – Future-KL Influenced Policy Optimization (FIPO). It gives better “credit assignment” during training: tokens that strongly affect future steps get more credit. Learning which steps are more important
https://x.com/TheTuringPost/status/2040389184234651815
New report from us: Can you prompt inject your way to an “A”? As LLMs increasingly are used as judges, people are inserting AI prompts into letters, CVs & papers. We tested whether it works. It does on older & smaller models, but not on most frontier AI:
https://x.com/emollick/status/2039789473324544102
Just launched at @aiDotEngineer : our official AGI Pills! prescribe one (1) if your colleague is saying we are hitting a wall and/or trying to add inductive bias instead of Trusting The Model
https://x.com/swyx/status/2042538904574681355?s=20
Big update to #Monarch, our distributed programming framework for #PyTorch! Since its launch at the #PyTorchCon NA in October, the team has shipped Kubernetes support, RDMA on AWS EFA and AMD ROCm, distributed SQL-based telemetry, a terminal UI, and dashboards for live job
https://x.com/PyTorch/status/2041773098324603208
I Still Prefer MCP Over Skills | David Mohl
https://david.coffee/i-still-prefer-mcp-over-skills/
We are excited to be the day 0 launch partner for Rime Mist v3! Mist v3 is a significant step forward for production voice AI. While the model weights remain unchanged, @rimelabs has changed how requests are ingested, handled, and processed. This enables true concurrent request
https://x.com/baseten/status/2041552265153274163
It’s hard to briefly summarize what risks we think this model does and doesn’t pose, and how confident we are in that assessment. We spend much of the 244-page system card and the 60-page risk assessment supplement trying to lay that out.
https://x.com/sleepinyourhat/status/2041584816513335778
Unappreciated fact: the second scaling law does not seem to completely plateau on many tasks. Throw more tokens at a reasoning AI model and get better answers, especially with a simple harness. Benchmark performance is actually limited by token usage.
https://x.com/emollick/status/2040911007392903231
[2604.04746] Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning
https://arxiv.org/abs/2604.04746
New paper! Want to precisely optimize synthetic training data to do practical or even wacky things? Dataset Policy Gradients get you there, letting you target any differentiable training or post-training metric. We embedded a QR code in GPT-2’s weights using only training data!
https://x.com/TristanThrush/status/2042619274637025514
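The core trick, differentiating a training outcome with respect to the training data itself, can be shown on a toy problem: optimize the labels of a tiny dataset so that a linear model trained on it with one inner gradient step lands on a chosen weight. This is a minimal sketch of the idea, not the paper's algorithm, and all the numbers are illustrative:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.zeros(3)                      # the synthetic labels we optimize
w_target, inner_lr, outer_lr = 2.0, 0.05, 5.0
n = len(x)

def train_one_step(y):
    # Inner loop: one gradient step on MSE, starting from w = 0.
    w = 0.0
    grad_w = -2.0 / n * np.sum(x * (y - w * x))
    return w - inner_lr * grad_w

for _ in range(200):
    w1 = train_one_step(y)
    # Outer loop: chain rule through the inner step,
    # d w1 / d y_i = inner_lr * 2 * x_i / n, then descend on (w1 - w_target)^2.
    dy = 2.0 * (w1 - w_target) * (inner_lr * 2.0 * x / n)
    y -= outer_lr * dy

print(round(train_one_step(y), 3))   # 2.0: the data now "teaches" the target weight
```

The same outer-gradient pattern scales to any differentiable training or post-training metric, which is what makes stunts like engraving a QR code in the weights possible.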
@shengjia_zhao @alexandr_wang 3/ Large scale RL can be prone to instability, but after a lot of hard work the RL runs look smooth, as shown in the first figure. During RL training, the model can alternate between phases of getting smarter and phases of thought compression, which was pretty neat
https://x.com/ananyaku/status/2041914049160679922
A must-read survey: The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook. Shows how models are moving beyond tokens into continuous internal representations, covering: – What latent space is (vs. text and visual spaces) – Architecture and mechanisms – Why it
https://x.com/TheTuringPost/status/2040415344326823965
LLM Knowledge Bases Something I’m finding very useful recently: using LLMs to build personal knowledge bases for various topics of research interest. In this way, a large fraction of my recent token throughput is going less into manipulating code, and more into manipulating
https://x.com/karpathy/status/2039805659525644595
Path-Constrained Mixture-of-Experts Overview: Path-Constrained Mixture-of-Experts restricts the vast expert path space in sparse Mixture-of-Experts by sharing router parameters across consecutive layers. This constraint addresses the statistical inefficiency of independent
https://x.com/TheAITimeline/status/2040953557961080843
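The constraint can be seen in a toy routing setup: with one router shared across layers, the per-token expert path collapses, while independent per-layer routers explore a far larger path space. Dimensions are toy-sized, experts apply no transform between layers (so the effect appears in its most extreme form), and the paper's actual parameterization may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, n_layers, n_tokens = 16, 4, 6, 32

# Independent routing: one (d, E) router matrix per layer.
independent = [rng.normal(size=(d, n_experts)) for _ in range(n_layers)]
# Path-constrained routing: consecutive layers share a single router.
shared = rng.normal(size=(d, n_experts))

def top1_paths(tokens, routers):
    # A token's "path" is its tuple of top-1 expert choices across layers.
    return {tuple(int(np.argmax(t @ W)) for W in routers) for t in tokens}

tokens = rng.normal(size=(n_tokens, d))
paths_ind = top1_paths(tokens, independent)
paths_shared = top1_paths(tokens, [shared] * n_layers)

# Shared routers also cut router parameters from L*d*E down to d*E.
print(len(paths_ind), len(paths_shared))
```

In this degenerate sketch the shared-router path set can never exceed `n_experts` distinct paths, versus up to `n_experts ** n_layers` with independent routers; that shrinkage of the path space is the statistical-efficiency argument in the overview above.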
[2604.01193] Embarrassingly Simple Self-Distillation Improves Code Generation
https://arxiv.org/abs/2604.01193
[2604.01202] Therefore I am. I Think
https://arxiv.org/abs/2604.01202
@tmkadamcz and I started working on MirrorCode, a new long-horizon software engineering benchmark, last September. I think it’s the best benchmark for measuring AI’s ability to complete very hard (but precisely specified) software tasks–but it’s likely already saturated.
https://x.com/idavidrein/status/2042626691881930971
🏋️Thinking Mid-training: RL of Interleaved Reasoning🎗️ We address the gap between pretraining (no explicit reasoning) and post-training (reasoning-heavy) with an intermediate SFT+RL mid-training phase to teach models how to think. – Annotate pretraining data with interleaved
https://x.com/jaseweston/status/2041864833214095484
🤯 big update to our flow map language models paper! we believe this is the future of non-autoregressive text generation. read about it in the blog: https://t.co/R6lGIK4D5A full details in the paper: https://t.co/ZsediaetYg we introduce a new class of continuous flow-based
https://x.com/nmboffi/status/2041546540737859915
A synthetic data generation method that, when a model is trained on the generated data, it maximizes a certain differentiable objective. e.g. it is possible to make data that engraves a QR code in the weights of an LM head. (Or, more conventional things like translating documents
https://x.com/rosinality/status/2042499462065520946
ActionParty
https://action-party.github.io/
Big deal paper here: a field experiment on 515 startups, half of which were shown case studies of how startups are successfully using AI. Those firms used AI 44% more, had 1.9x higher revenue, and needed 39% less capital. 1) AI accelerates businesses 2) The challenge is understanding how to use it
https://x.com/emollick/status/2040436307176898897
Execution speed is the only metric that matters. In the race to the next AI frontier, a “perfect plan” on paper is worth nothing if you can’t ship. At https://t.co/n1mVZ8KnAF, we aren’t looking for teams that spend months in deliberation. We are looking for teams that move at
https://x.com/IlirAliu_/status/2040051697498624088
Ha! GitHub is not prepared for massive programming shift
https://x.com/TheTuringPost/status/2040104673483321396
I’m releasing the 34 slides on how we design and train best-in-class edge models at @liquidai. I presented these slides yesterday at @aiDotEngineer. They cover model architecture, pre-training, scaling laws, post-training, and even a solution to fix doom loops. Special thanks to
https://x.com/maximelabonne/status/2042537534031343633?s=20
Self-Distilled RLVR paper:
https://x.com/_akhaliq/status/2041183818317509028
Sol-RL: FP4 Explore, BF16 Train
https://nvlabs.github.io/Sana/Sol-RL/
This article is a case study of why measuring AI performance is so hard. AI Overviews make mistakes. But the same mistakes are in Wikipedia. But the sources are harder to find when using AI. But the AI answers may be better than most people would find. Unclear what it all means.
https://x.com/emollick/status/2041542535802159120
This new Nature paper (using old models) illustrates the point of my latest Substack post on AI interfaces. AI did a good job diagnosing medical issues, but when users had to interact with chatbots the interface led to confusion & worse answers My post:
https://x.com/emollick/status/2040122884371140787
Three weeks ago there were rumors that one of the labs had completed its largest ever successful training run, and that the model that emerged from it performed far above both internal expectations and what people assumed the scaling laws would predict. At the time these were
https://x.com/AndrewCurran_/status/2037967531630367218
What if training LLMs didn’t require rebuilding your entire product as a sandbox? RL training forces companies to repackage their backend into clean, sandboxed APIs. We’ve seen this cause months of engineering overhead before training can even begin. To fix this, we flipped the
https://x.com/baseten/status/2041194606512279617
You can now train and run 500+ models in our free notebook!✨ GitHub repo: https://t.co/aZWYAtakBP Colab Notebook:
https://x.com/UnslothAI/status/2041177756848083266
A new survey that helps you better understand tool use in AI. Shows how models move from single tool calls to full multi-step orchestration, covering: – Single calls vs. long-horizon workflows – Sequential, graph-based, re-planning, feedback loops – Trajectory synthesis and
https://x.com/TheTuringPost/status/2041124796361236608
Added an RSS feed to the LLM Architecture Gallery so it is a bit easier to keep up with new additions over time:
https://x.com/rasbt/status/2041140643959885999
Don’t skip your determinism and numerics days. Interleave kernel days.
https://x.com/_arohan_/status/2042440378956337574
Everyone should read “On the Folly of Rewarding A, While Hoping for B” at least once.
https://x.com/emollick/status/2041360670474580069
Falcon Perception by @TIIuae on MLX-VLM 🚀
https://x.com/Prince_Canuma/status/2040861768138789012
Farzapedia, personal wikipedia of Farza, good example following my Wiki LLM tweet. I really like this approach to personalization in a number of ways, compared to the “status quo” of an AI that allegedly gets better the more you use it or something: 1. Explicit. The memory artifact
https://x.com/karpathy/status/2040572272944324650
Fast muon optimizer coming to consumer cards. All the code was written as matmul + epilogue so once the mainloop was implemented for Blackwell consumer cards, all the fancy symmetric matmul just works and get speed-of-light
https://x.com/tri_dao/status/2041191260682150048
I was going crazy because I could not replicate TurboQuant. Turns out the community also had issues. The community quickly made adjustments to “make it work”, but what they did not realize is that they reimplemented (most of) HIGGS in the process (full HIGGS would be even better)
https://x.com/Tim_Dettmers/status/2041496879238611455
Keynes acquired Newton’s private papers and was shocked at what he found. @michael_nielsen reads the key passage in the essay Keynes published afterwards: “Newton was not the first of the age of reason. He was the last of the magicians, the last great mind which looked out on
https://x.com/dwarkesh_sp/status/2041984432597307858
Making a scatter plot of 400_000 data points, some of the plots had odd gaps in coverage. It took me a little while to realize that it was only when the data was farther from the origin — it was the raw bfloat16 precision. Everything looks great from -1 to 1, but as you go past
https://x.com/ID_AA_Carmack/status/2042377293008707653
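The effect is easy to reproduce: bfloat16 keeps only 8 mantissa bits, so the gap between adjacent representable values scales with magnitude, roughly magnitude/128. The sketch below simulates bfloat16 by truncating a float32's low 16 bits (round-toward-zero, whereas hardware usually rounds to nearest, but the spacing between representable values is identical):

```python
import numpy as np

def to_bfloat16(x):
    # Reinterpret the float32 bits and zero the low 16: that is exactly
    # the bfloat16-representable value at or below x.
    bits = np.array([x], dtype=np.float32).view(np.uint32)
    return float((bits & 0xFFFF0000).view(np.float32)[0])

def spacing(x):
    # Distance from x to the next representable bfloat16 above it:
    # increment the bfloat16 mantissa by one unit in the last place.
    bits = np.array([x], dtype=np.float32).view(np.uint32)
    nxt = ((bits & 0xFFFF0000) + 0x00010000).astype(np.uint32)
    return float(nxt.view(np.float32)[0]) - to_bfloat16(x)

print(spacing(1.0))     # 0.0078125: fine-grained near the origin
print(spacing(300.0))   # 2.0: whole-number gaps once you pass 256
```

So in a 400k-point scatter plot, coordinates in [-1, 1] look smooth while coordinates in the hundreds snap to a coarse grid, which is exactly the banding described above.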
One of the most impressive models that nobody talks about – Mercury 2 from @_inception_ai Blazingly fast. Not a regular Transformer. Gives surprisingly elegant answers, feeling more like an editor Why does it behave so differently? Mercury 2 explained↓
https://x.com/TheTuringPost/status/2041645803601584321
Open Harness, separated from model providers, is a critical architectural pattern.
https://x.com/avoguru/status/2042450832126591251
Our parallel reasoning project ThreadWeaver is now open-sourced 🎉! Check out our Data Gen/SFT/RL recipe at https://t.co/R14RiSupnz In case you don’t know, ThreadWeaver 🧵⚡️ is the first parallel reasoning method to achieve comparable reasoning performance to widely-used
https://x.com/LongTonyLian/status/2041912704584331616
QJL hurts performance: https://t.co/ZtYLG3b6sI
https://x.com/Tim_Dettmers/status/2041496886012424233
RLSD: RLVR with Self-Distillation Unifying on-policy self-distillation with verifiable rewards to fix information leakage and instability–using token-level policy differences for fine-grained updates while leveraging environmental feedback for reliable directions.
https://x.com/HuggingPapers/status/2041188981195391447
The Unsexy Truth of AI Adoption
https://x.com/TheTuringPost/status/2040850276735709684
The vibe is GOOD
https://x.com/isnit0/status/2042316879855772107?s=20
Very, very strange release is brewing. – unlimited fast model, as I’ve said, is probably what we have now, a V4-lite that’s smaller than V3.2, they want it to be Pareto-stronger + 1M context – Expert not having file uploads because of token costs makes little sense, prefill must
https://x.com/teortaxesTex/status/2041474854294098193
We in the quantization community could quickly see this and were flabbergasted by the response to TurboQuant. Whenever I saw TurboQuant on my timeline, I found it hurtful, because the work of other academics who worked so hard was discounted.
https://x.com/Tim_Dettmers/status/2041497412989071707
We rebuilt how MoE models generate tokens on Blackwell GPUs, resulting in 1.84x faster inference and more accurate outputs. These improvements directly contribute to how we train Composer, allowing us to ship improved versions of the model more often.
https://x.com/cursor_ai/status/2041260649267986643
wow the token efficiency here is impressive, they also mention they don’t use the 1M context but do context compaction at 200k
https://x.com/eliebakouch/status/2041631671787590099
Avatar V: Scaling Video-Reference Avatar Generation
https://www.heygen.com/research/avatar-v-model
In addition, we quantified unverbalized evaluation awareness on our automated behavioral audits (primarily using Activation Verbalizers). On 7.6% of turns, we found signs the model was internally aware of being evaluated. In most of these cases, it did not verbalize this
https://x.com/Jack_W_Lindsey/status/2041588522558353649?s=20
We’re actually running out of benchmarks to upper bound AI capabilities — LessWrong
https://www.lesswrong.com/posts/gfkJp8Mr9sBm83Rcz/we-re-actually-running-out-of-benchmarks-to-upper-bound-ai
Monarch: an API to your supercomputer – PyTorch
https://pytorch.org/blog/monarch-an-api-to-your-supercomputer/
Hi friends – went on TBPN to discuss the Hark launch, full session here!
https://x.com/adcock_brett/status/2039505345614401990
VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward. TL;DR: latent geometry-guided RL aligns video diffusion models for consistent 4D scene structure, improving camera stability and cross-view coherence.
https://x.com/Almorgand/status/2039772881505149093
We’ve been studying what it takes to get NVFP4 & MXFP8 to deliver good speedups on modern flow models for image & video gen on B200 🕵️♂️ Today, I’m excited to share those findings! Bringing some cool recipes through Diffusers and TorchAO with `torch.compile` 🔥 Hop in ⬇️
https://x.com/RisingSayak/status/2042597708402430290
OmniRoam: World Wandering via Long-Horizon Panoramic Video Generation. TL;DR: panoramic video generation framework enabling long-horizon, consistent scene exploration with trajectory control and refinement.
https://x.com/Almorgand/status/2041919499079725085
Researchers at Netflix just released a new AI model. It erases objects from video, then rewrites the physics of the entire scene as if that object never existed. It’s called VOID (Video Object and Interaction Deletion). Current inpainting tools simply paint over the gap left by
https://x.com/rowancheung/status/2041507881858826404
VOID: Video Object and Interaction Deletion
https://void-model.github.io/
But we don’t call it a World Model nor a VLA. Doesn’t really matter what we call it. It just needs to hit our goals of having useful embodied intelligence.
https://x.com/E0M/status/2041539828555321784