Image created with gemini-2.5-flash-image with claude-sonnet-4-5-20250929. Image prompt: A cinematic close-up photograph of a medieval wax seal being pressed into aged parchment in a candlelit stone chamber, the seal’s face reveals an intricate circuit pattern glowing faintly with blue light, an armored gauntlet holds the seal firmly, dramatic chiaroscuro lighting creates deep shadows suggesting both protection and watchfulness, style of classical painting meets modern precision.
Detecting and reducing scheming in AI models | OpenAI https://openai.com/index/detecting-and-reducing-scheming-in-ai-models/
Excited to share details on two of our longest running and most effective safeguard collaborations, one with Anthropic and one with OpenAI. We’ve identified—and they’ve patched—a large number of vulnerabilities and together strengthened their safeguards. 🧵 1/6 https://x.com/alxndrdavies/status/1966614120566001801
Their ongoing testing of models like Claude Opus 4 and 4.1 has helped us find vulnerabilities and build strong safeguards before deployment. Read more: https://x.com/AnthropicAI/status/1966599337426681899
Our collaboration with the US Center for AI Standards and Innovation (CAISI) and UK AI Security Institute (AISI) shows the importance of public-private partnerships in developing secure AI models. https://x.com/AnthropicAI/status/1966599335560216770
The focus on near-term mass replacement of white collar work with AI has two big gaps: 1) The nature of organizations limits the speed of change, even with AGI 2) But if AI is good enough to do all work, the societal changes would be so huge that job loss would be just the start https://x.com/emollick/status/1965950820354322739
Jensen Huang ‘disappointed’ by reported China Nvidia chip ban https://www.bbc.com/news/articles/cqxz29pe1v0o
(1/n) Scheming has been a key concern in AI safety for 20+ years. It’s when an AI acts aligned while hiding true goals. New OpenAI + Apollo research found scheming in every tested frontier model, though no harmful scheming has been seen in production traffic. https://x.com/woj_zaremba/status/1968360708808278470
Today we’re releasing research with @apolloaievals. In controlled tests, we found behaviors consistent with scheming in frontier models—and tested a way to reduce it. While we believe these behaviors aren’t causing serious harm today, this is a future risk we’re preparing https://x.com/OpenAI/status/1968361701784568200
This is significant progress, but we have more work to do. We’re advancing scheming research categories in our Preparedness Framework, renewing our collaboration with Apollo, and expanding our research team and scope. And because solving scheming will go beyond any single lab, https://x.com/OpenAI/status/1968361716770816398
This OpenAI update on anti-scheming is exceptionally good for an AIco, clearing an (extremely low) bar of “Exhibiting some idea of some problems that might arise in scaling the work to ASI” and “Not immediately claiming to have fixed everything already.” https://x.com/ESYudkowsky/status/1968388335354921351
Another major milestone! Scale AI has been awarded a $100 million agreement from the Pentagon. We’re honored by the trust and committed to advancing national security with secure, cutting-edge AI. https://x.com/scale_AI/status/1968351086768799959
A postmortem of three recent issues \ Anthropic https://www.anthropic.com/engineering/a-postmortem-of-three-recent-issues
Anthropic Economic Index report: Uneven geographic and enterprise AI adoption \ Anthropic https://www.anthropic.com/research/anthropic-economic-index-september-2025-report
We’ve published a detailed postmortem on three infrastructure bugs that affected Claude between August and early September. In the post, we explain what happened, why it took time to fix, and what we’re changing: https://x.com/claudeai/status/1968416781967495526
o1 Preview is exactly one year old. I still remember when o1 was still known by its project name Q*; it was a time when rumors were circulating that OpenAI had made a world-changing breakthrough that would change everything. There were concerns that this project posed a threat https://x.com/kimmonismus/status/1966627812858855624
o1-preview -> GPT 5 pro in a year https://x.com/gdb/status/1966612991421423814
OpenAI claims hallucinations persist because evaluations reward guessing and that GPT-5 is better calibrated. Do results from HAL support this conclusion? On AssistantBench, a general web search benchmark, GPT-5 has higher precision and lower guess rates than o3! https://x.com/PKirgis/status/1966547382033936577
OpenAI has finally fixed their SWEBench errors and we can now finally apples to apples compare their scores over the entire 500 sample set (the fact that it took this long says alot about how much they care about SWEBench internally and maybe there’s a lesson here) https://x.com/nrehiew_/status/1967781400528245221
OpenAI just revealed that they have an internal unreleased SWE-bench-style benchmark for large ‘refactoring’ PRs, like the one mentioned here that edits 3.5k lines across 232 files. Their new model gets 51% accuracy on this benchmark. Who wants to make a public version of this? https://x.com/OfirPress/status/1967652031704994131
OpenAI’s Models Are Getting Too Smart For Their Human Teachers — The Information https://www.theinformation.com/articles/openais-models-getting-smart-human-teachers
Public Sector Symposium Ottawa – Amazon Web Services (AWS) https://pages.awscloud.com/ottawa-symposium-2025.html?trk=bbb2bc57-27ae-4ee2-b197-f881c8a66c0b&sc_channel=psm
Lots of sympathy to the Anthropic team 🙏🙏🙏 https://x.com/cHHillee/status/1968536182284849459
Introducing VaultGemma 🧠Gemma pre-trained with differential privacy (largest open model trained from scratch like this) 🔒Strong, mathematically-backed privacy guarantees 🤏Just 1B parameters 📈Novel research on scaling laws https://x.com/osanseviero/status/1966534013511672148
VaultGemma: The world’s most capable differentially private LLM https://research.google/blog/vaultgemma-the-worlds-most-capable-differentially-private-llm/
US and China really playing peekaboo in space. Btw Maxar is a private US satellite company… makes you wonder what the dedicated NRO and PLA constellations are seeing :-) https://x.com/bilawalsidhu/status/1967197327502008708
U.S. Investors, Trump Close In on TikTok Deal With China – WSJ https://www.wsj.com/tech/details-emerge-on-u-s-china-tiktok-deal-594e009f?gaa_at=eafs&gaa_n=AS
🚀 New partnership alert! The 1Password browser extension will be available in @perplexity_ai’s Comet browser, making AI-powered browsing secure by default. Read more in our press release: https://x.com/1Password/status/1968302513079148595
Meta announced LlamaFirewall, a toolkit to protect LLM agents from jailbreaking, goal hijacking, and exploiting vulnerabilities in generated code. The toolkit is now free to use for projects with up to 700 million monthly active users. Read our summary of the paper in The https://x.com/DeepLearningAI/status/1967986588312539272
We have updated ChatGPT’s personalization page: personality configuration, custom instructions, and memories are now all in one place. Going live over the next couple of days. https://x.com/sama/status/1967789125702140021
Today we’re announcing a partnership to bring 1Password to Comet, for built-in personal security without interruption. https://x.com/perplexity_ai/status/1968387122261540948
LLMs introduce a huge range of new capabilities for research, but also make it possible for researchers to “hack” their results in new ways by how they chose to use models for annotation. This is an interesting attempt to quantify some of the risk, and some mitigation strategies https://x.com/emollick/status/1967594325505867956
What ensures safety in AI? Guardian models are the very safety layers that detect and filter harmful prompts and outputs, defending AI today. But they go beyond simple filtering. They can: – Serve as guardrails to block harmful content in real time – Act as evaluators to check https://x.com/TheTuringPost/status/1968635881004363969