Ethan B. Holland

Over 54,900 manually organized AI links and counting

Alignment: AI News Week Ending 12/12/2025

December 12, 2025

Image created with gemini-2.5-flash-image with claude-sonnet-4-5. Image prompt: Black and white photograph of multiple distinct cloud formations converging toward a single point on the horizon, their edges softly merging together, shot from ground level looking up at dramatic sky, high contrast film photography with rich grayscale gradients, bold sans-serif ‘ALIGNMENT’ text placed at the convergence point, minimal composition with no landscape elements.

We tested one of the most common prompting techniques: giving the AI a persona to make it more accurate. We found that telling the AI “you are a great physicist” doesn’t make it significantly more accurate at answering physics questions, nor does “you are a lawyer” make it worse. https://x.com/emollick/status/1998063517681799418

Yes, there is a leak. I had investigated this. Some of the ARC-AGI-1 public evaluation examples can be found in the ARC-AGI-2 training examples. So training on both ARC-AGI-1 and ARC-AGI-2 training data is cheating as it leads to crazy good accuracy for ARC-AGI-1. https://x.com/jm_alexia/status/1998487516182467055

The GPT-5 Auto router casts a long shadow over AI perceptions. So many examples of “ChatGPT got X wrong” are really “ChatGPT-5 Instant got things wrong,” leading to beliefs about the state of AI that aren’t true. Which model you get could be clearer & better explained for all. https://x.com/emollick/status/1998838007609119010

ChatGPT’s ‘Adult Mode’ Is Coming in 2026 https://gizmodo.com/chatgpts-adult-mode-is-coming-in-2026-2000698677

Horses https://andyljones.com/posts/horses.html

Debugging misaligned completions with sparse-autoencoder latent attribution https://alignment.openai.com/sae-latent-attribution/

Historian Thomas Hughes argued that technologies are malleable when young, then harden. Right now we’re still shaping AI, or at least it is being shaped by our institutions, norms & use cases. Eventually these systems build a momentum of their own. That is why choices now matter. https://x.com/emollick/status/1998184719817793788

I meet a lot of very smart AI critics who never seriously try to make AI work for them by spending a couple of hours with a frontier model. People can be (and should be & are) critical after realizing what AI can do, but experience leads to better-informed and sharper critiques. https://x.com/emollick/status/1998398372986736777

Alignment Is Capability https://www.off-policy.com/alignment-is-capability/

Made this video to explain evals https://x.com/HamelHusain/status/1998452926935695649

Prediction: AI will make formal verification go mainstream — Martin Kleppmann’s blog https://martin.kleppmann.com/2025/12/08/ai-formal-verification.html

Interesting study, but this is somewhat unexpected. (green is programming, yellow is role playing) https://x.com/emollick/status/1996758326877868268

Gemini 3 Pro scores 69% trust in blinded testing up from 16% for Gemini 2.5: The case for evaluating AI on real-world trust, not academic benchmarks | VentureBeat https://venturebeat.com/ai/gemini-3-pro-scores-69-trust-in-blinded-testing-up-from-16-for-gemini-2-5

Robot policies fail on the hard parts of manipulation. The moment contact, friction, or force uncertainty shows up, the success rate drops fast. CR DAgger shows a very different path. You take a pre-trained policy. You let a human correct it in the real world for a short https://x.com/IlirAliu_/status/1996871611392069708

I have been pretty frustrated with the current focus of interpretability research. Promising to see the focus on scalability and generalization. Without these two properties, works often end up being neuron interpretation overfit to a single model and not particularly https://x.com/sarahookr/status/1997795206096429415

Large-scale experiments in UK, US & Poland where people chatted with LLMs about political topics found AI is very good at persuasion, primarily by providing lots of fact-based claims. Plus, AI is getting more persuasive as models grow bigger & persuasion effects lasted over time. https://x.com/emollick/status/1996770000389169205

Chris Olah’s talk is happening right now at the NeurIPS mech interp workshop, room 30, top floor. Called “reflections on interpretability”! Followed by invited lightning talks at 16:00 https://x.com/NeelNanda5/status/1997812818788467157