Image created with gemini-3.1-flash-image-preview and claude-sonnet-4-5. Image prompt: Wide angle aerial photograph of a joyful person in business casual freefalling through bright blue sky at high altitude, holding up and happily consulting a large vintage compass with its needle spinning erratically in multiple directions at once, distant earth visible far below, clean composition, the word ALIGNMENT in bold modern sans-serif typography integrated prominently as a title overlay, crisp daylight, dynamic action photography
“Studying potential scheming in today’s models is super important, but easy to get confused – many works use extremely contrived and unrealistic environments that invalidate their results. Designing good environments is really important! In this post we give some advice for how…” https://x.com/NeelNanda5/status/2028600215343943983
“Forget public conversations. People unload their inner lives – hopes, wishes, desires, anxieties and worries into their AI assistants. That data is way more sensitive & valuable than anything the govt could record publicly. We built our own panopticon and pay monthly for it.” https://x.com/bilawalsidhu/status/2027230878397587604
“BullshitBench v2 is out! It is one of the few benchmarks where models are generally not getting better (except Claude) and where reasoning isn’t helping. What’s new: 100 new questions, by domain (coding (40 Q’s), medical (15), legal (15), finance (15), physics (15)), 70+ model…” https://x.com/petergostev/status/2028492834693677377
Gemini Said They Could Only Be Together if He Killed Himself. Soon, He Was Dead. – WSJ https://www.wsj.com/tech/ai/gemini-ai-wrongful-death-lawsuit-cc46c5f7
“BullshitBench v2, created by Peter Gostev, is a benchmark that does something refreshingly different: it tests whether AI models can detect and reject nonsensical prompts instead of confidently rolling with them. Only Anthropic’s Claude models and Alibaba’s Qwen 3.5 score…” https://x.com/kimmonismus/status/2029230388028358726