Image created with gemini-3.1-flash-image-preview; prompt written with claude-sonnet-4-5. Image prompt: Using the provided reference image of the Mercedes hood ornament, preserve the exact composition, navy car hood, chrome pedestal base, shallow depth-of-field sky background, dramatic upward camera angle, and automotive advertisement lighting. Replace only the Mercedes star with a single chrome rocket ship hood ornament — sleek retro-futuristic design, nose angled upward at 45 degrees, small stabilizer fins at the base, polished metallic finish, mounted on the same pedestal at realistic hood ornament scale. Add bold white sans-serif display text reading ‘MOONSHOT’ across the upper portion of the image as a clean headline.
@_avichawla Impressive work from Kimi
https://x.com/elonmusk/status/2033528245464047805
🔥 @Kimi_Moonshot’s new Attention Residual paper is sparking discussions. Zhihu contributor OpenLLMAI shares a deep dive: “From Kimi’s Attention Residual to ‘Vertical Attention’ — an idea I’ve been thinking about for half a year.” Some interesting thoughts on attention mechanisms.
https://x.com/ZhihuFrontier/status/2033751367198949865
Big release from Kimi! They just released a new way to handle residual connections in Transformers. In a standard Transformer, every sub-layer (attention or MLP) computes an output and adds it back to the input via a residual connection. If you consider this across 40+ layers, https://t.co/5i5AN9tzIm
https://x.com/_avichawla/status/2033472650836914495
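For reference, the standard pattern Avi describes looks like this in PyTorch — a minimal pre-norm sketch I wrote for illustration, not anyone's actual model code:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Standard pre-norm block: each sub-layer adds its output back to the stream."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual add #1
        x = x + self.mlp(self.norm2(x))                    # residual add #2
        return x
```

Across 40+ layers, every one of those `x = x + ...` adds contributes to the same stream with the same fixed weight of 1 — that uniform accumulation is what the Kimi paper rethinks.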
https://chatgpt.com/share/69cda240-9324-832a-89b6-a43d4a22f437
https://claude.ai/share/7239e73e-9e9d-469a-bbdb-e5c7da75a4e9
Introducing 𝑨𝒕𝒕𝒆𝒏𝒕𝒊𝒐𝒏 𝑹𝒆𝒔𝒊𝒅𝒖𝒂𝒍𝒔: Rethinking depth-wise aggregation. Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with
https://x.com/Kimi_Moonshot/status/2033378587878072424
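The announcement is truncated, but the core idea it gestures at — attention over the depth axis instead of a fixed sum — can be sketched roughly like this. This is my reading, not the paper's exact formulation; the module name and projection layout are illustrative:

```python
import torch
import torch.nn as nn

class DepthAttentionResidual(nn.Module):
    """Sketch: replace the uniform residual sum with learned attention over
    the per-layer states accumulated so far (one query per token)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model, bias=False)
        self.k = nn.Linear(d_model, d_model, bias=False)
        self.scale = d_model ** -0.5

    def forward(self, history: list[torch.Tensor]) -> torch.Tensor:
        # history: L tensors of shape (batch, seq, d) -> stacked (batch, seq, L, d)
        h = torch.stack(history, dim=2)
        q = self.q(h[:, :, -1])                            # query from the newest state
        scores = torch.einsum("bsd,bsld->bsl", q, self.k(h)) * self.scale
        w = scores.softmax(dim=-1)                         # depth-wise attention weights
        return torch.einsum("bsl,bsld->bsd", w, h)         # weighted depth aggregation
```

In the layer loop you would append each block's output to `history` and aggregate through this module instead of writing `x = x + f(x)`: the fixed, uniform accumulation becomes a learned, token-wise weighting over depth.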
visual summary of attention residuals by kimi, beautiful paper
https://x.com/eliebakouch/status/2033488233854620007
Analysis of training dynamics demonstrates how AttnRes naturally mitigates hidden-state magnitude growth and yields a more uniform gradient distribution across depth.
https://x.com/Kimi_Moonshot/status/2033378596438556853
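The magnitude-growth point is easy to see in isolation: if each layer adds a roughly independent contribution, the squared norm of the stream grows about linearly with depth. A toy simulation (mine, not the paper's analysis):

```python
import torch

torch.manual_seed(0)
d, layers = 1024, 48
x = torch.randn(d)
for i in range(1, layers + 1):
    x = x + torch.randn(d)  # stand-in for one sub-layer's residual contribution
    if i % 12 == 0:
        # After i adds, x is a sum of i+1 independent vectors, so ||x|| ~ sqrt((i+1)*d)
        print(f"layer {i:2d}: ||x|| = {x.norm():.1f}  (expected ≈ {((i + 1) * d) ** 0.5:.1f})")
```

Reweighting the contributions (as attention over depth does) can keep that sum from growing unchecked, which is presumably the mechanism behind the mitigation the tweet describes.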
I wrote something on Moonshot’s latest research release – Attention Residuals. Intuition, notes and how you can understand standard residuals vs mHC vs attention residuals.
https://x.com/tokenbender/status/2033437211371454915
Moonshot AI targets $1b raise, eyes $18b valuation https://www.techinasia.com/news/moonshot-ai-targets-1b-raise-eyes-18b-valuation
This is so damn cool! Transformers do attention across tokens, now imagine doing attention across layers too. This delivers a 1.25x compute efficiency, <4% training overhead on the 48B Kimi model, +7.5 on GPQA-Diamond. Kimi is quietly becoming the new DeepSeek for the coolest
https://x.com/Yuchenj_UW/status/2033404695880896804
Oh wow, Mamba-3 is here! For me, the most interesting use case of Mamba and Mamba-likes are the recent transformer attention hybrid architectures (Qwen3.5, Kimi Linear, etc.) Would be interesting to swap Gated DeltaNet with Mamba-3 (which now also has RoPE) in next gen hybrids.
https://x.com/rasbt/status/2034088726997893168#m
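The hybrid pattern Sebastian is referring to is simple at the block level: interleave linear-time sequence mixers with occasional full attention. A schematic sketch — the 1-in-4 ratio and the `LinearMixer` placeholder are my assumptions; a real Mamba-3 or Gated DeltaNet layer would slot in where the placeholder sits:

```python
import torch
import torch.nn as nn

class LinearMixer(nn.Module):
    """Placeholder for a linear sequence model (Mamba-3, Gated DeltaNet, ...)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal cumulative mean as a cheap stand-in for a real recurrent scan.
        t = torch.arange(1, x.size(1) + 1, device=x.device).view(1, -1, 1)
        return self.proj(x.cumsum(dim=1) / t)

class HybridStack(nn.Module):
    """Mostly-linear stack with full attention every `ratio`-th layer."""
    def __init__(self, d_model: int, n_heads: int, n_layers: int, ratio: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            if (i + 1) % ratio == 0 else LinearMixer(d_model)
            for i in range(n_layers)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            if isinstance(layer, nn.MultiheadAttention):
                x = x + layer(x, x, x, need_weights=False)[0]
            else:
                x = x + layer(x)
        return x
```

The appeal of the design: the linear layers keep per-token cost constant in sequence length, while the sparse full-attention layers preserve the global token mixing that pure linear models struggle with.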
📎We’ve uploaded it to arXiv, enjoy! https://x.com/Kimi_Moonshot/status/2033796781327454686
🔥 An insider take on @Kimi_Moonshot’s Attention Residual — From Kimi AI infra team member & Zhihu contributor Reku A rare look at how attention ideas collide with real-world training systems 👇 🧠 Attention Residual isn’t just modeling — it’s an infra challenge I mainly worked
https://x.com/ZhihuFrontier/status/2034269774281400798#m
As a member of the Kimi team, I wrote the linked blog to share how our team tackles truly innovative work together–not just as individuals, but as a coordinated group. 💎I fully agree: “you can always trust the Kimi solidness.” For us, solidness means making ideas actually work
https://x.com/YyWangCS17122/status/2034273847164473820#m
For more details, check out our paper here:
https://x.com/Kimi_Moonshot/status/2033378599450079581
Thread by @Kimi_Moonshot on Thread Reader App: https://threadreaderapp.com/thread/2033378587878072424.html
Xiaomi has released MiMo-V2-Pro, which scores 49 on the Artificial Analysis Intelligence Index, placing it between Kimi K2.5 and GLM-5 @Xiaomi’s MiMo-V2-Pro is a new reasoning model and a significant upgrade over their prior open weights release, MiMo-V2-Flash (309B total / 15B
https://x.com/ArtificialAnlys/status/2034239267052896516#m
The frontier has increasingly shifted to hybrid models – from Qwen to Kimi-Linear and now with NVIDIA’s Nemotron-3 Super – that rely on a strong linear sequence model. Today we release Mamba-3, the most powerful linear model to date.
https://x.com/tri_dao/status/2033948569502413245