[Header image: seamless wrapping-paper pattern of Victorian butler and maid uniforms in navy and antique gold, woven with circuit pathways and an "AGENTS" monogram. Generated with gemini-2.5-flash-image; prompt written with claude-sonnet-4-5.]
Bloom – an open-source agentic tool from @AnthropicAI that auto-generates behavioral evaluations for AI models. It turns what was once painstaking alignment work into a matter of configuration: Bloom crafts and judges hundreds of scenarios targeting specific traits… https://x.com/TheTuringPost/status/2003629256522498061
Andrej Karpathy on X: I’ve never felt this much behind as a programmer. The profession is being dramatically refactored as the bits contributed by the programmer are increasingly sparse and between. I have a sense that I could be 10X more powerful if I just properly string together what has become available over the last ~year and a failure to claim the boost feels decidedly like skill issue. There’s a new programmable layer of abstraction to master (in addition to the usual layers below) involving agents, subagents, their prompts, contexts, memory, modes, permissions, tools, plugins, skills, hooks, MCP, LSP, slash commands, workflows, IDE integrations, and a need to build an all-encompassing mental model for strengths and pitfalls of fundamentally stochastic, fallible, unintelligible and changing entities suddenly intermingled with what used to be good old fashioned engineering. Clearly some powerful alien tool was handed around except it comes with no manual and everyone has to figure out how to hold it and operate it, while the resulting magnitude 9 earthquake is rocking the profession. Roll up your sleeves to not fall behind. https://x.com/karpathy/status/2004607146781278521
@YashGouravKar1 Correct. In the last thirty days, 100% of my contributions to Claude Code were written by Claude Code. https://x.com/bcherny/status/2004897269674639461?s=20
Codex vs. Claude Code (Today) https://build.ms/2025/12/22/codex-vs-claude-code-today/
Introducing Bloom: an open source tool for automated behavioral evaluations \ Anthropic https://www.anthropic.com/research/bloom
We estimate that, on our tasks, Claude Opus 4.5 has a 50%-time horizon of around 4 hrs 49 mins (95% confidence interval of 1 hr 49 mins to 20 hrs 25 mins). While we’re still working through evaluations for other recent models, this is our highest published time horizon to date. https://x.com/METR_Evals/status/2002203627377574113?s=20
Supercharge Claude Code with better Excel understanding 📊 Coding agents are general enough to do any type of knowledge work, including reading/creating docs. There are some pre-built skills for Claude Code to read Excel sheets, but they kind of suck 🚫 – it requires the agent… https://x.com/jerryjliu0/status/2005709989558775919
I spent all of Christmas reverse engineering Claude Chrome so it would work with remote browsers. Here’s how Anthropic taught Claude how to browse the web (1/7) https://x.com/pk_iv/status/2005694082627297735
I’m hearing from many folks across the finance industry that Claude for Excel is blowing their minds. The agentic coding takeoff, but for other fields, is coming in 2026. https://x.com/alexalbert__/status/2005670179045523595
Google tests 30-minute Lecture Audio Overviews on NotebookLM https://www.testingcatalog.com/exclusive-google-tests-30-minute-audio-lectures-on-notebooklm/
Introducing Manus: the first general AI agent. Try Manus today and see the future of human-machine collaboration: https://x.com/ManusAI/status/1897294098945728752
Meta just bought the fastest-growing AI agent company in history for what’s probably $1-2B. The math tells you exactly why Zuckerberg did this deal today. Manus hit $100M ARR in eight months. That’s faster than ChatGPT, faster than Midjourney, faster than any AI product ever. https://x.com/aakashgupta/status/2005815184976417117
Manus Joins Meta: Accelerating AI Innovation for Businesses | Meta for Business https://www.facebook.com/business/news/manus-joins-meta-accelerating-ai-innovation-for-businesses
Meta just acquired Manus AI. Ramp Sheets modeled it out: Estimated price: $4-6B based on AI M&A comps Fastest to $100M ARR in history (8 months) Benchmark likely 8-12x’d in under a year https://x.com/RampLabs/status/2005807066351325470
Also, MSL is hiring in Singapore! We already have some amazing researchers and engineers there, buoyed now by the 100-strong Manus team, and we’re growing fast. Feel free to DM with a résumé if interested! https://x.com/alexandr_wang/status/2005766471516053736
Fun fact: Manus is currently SOTA on the Remote Labor Index (RLI) benchmark that @scale_AI and @ai_risks released earlier this year. https://x.com/alexandr_wang/status/2005766785237410107
Introducing Manus Design View https://manus.im/blog/manus-design-view
Manus Update: $100M ARR, $125M revenue run-rate https://manus.im/blog/manus-100m-arr
Meta just bought Manus for >$1B and it makes sense. ~8 consumer AI apps have hit $100M+ ARR that aren’t big labs: Perplexity: $20B, ElevenLabs: $6.6B, Lovable: $6.6B, Replit: $3B+, Suno: $2.5B, Gamma: $2.1B, Character: $1B+, Manus: $500M. Meta AI has ~no product. This was the cheapest, and… https://x.com/deedydas/status/2005798365733478490
OpenEnv: Meta × Hugging Face’s new open standard for agentic environments. Why it matters: One environment spec and it works everywhere. Train and inference. – Train with TRL, TorchForge, verl, SkyRL, Unsloth – Deploy with the same env you trained on – MCP tool support baked in https://x.com/ben_burtenshaw/status/2005655406522085482
Bloody hell, I’ll say this: GPT-5.2 Codex Extra High is a methodical beast. It’s updating the OpenCode OpenAI Codex OAuth plugin, literally not leaving any stone unturned. This is the first model that feels like it’s building for itself, i.e. leaving the door open for future work. https://x.com/nummanali/status/2002116277666803917
A lot of people underestimate AI due to the confluence of 4 OpenAI choices: 1) GPT-5.x instant is not a very smart model 2) Most users are free users & the ChatGPT router sends them to instant often 3) The router calls everything GPT-5.2 4) Most people don’t know Reasoners exist https://x.com/emollick/status/2001840267155153362
Today @OpenAI updated the Model Spec, laying out how models are ‘intended to behave.’ Not marketing. Just explicit rules, priorities, and tradeoffs. Great reading if you’re wondering why models respond the way they do. Changelog + teen protections in 🧵👇 https://x.com/shaunralston/status/2001744269128954350
We finally had a moment to run our system with GPT-5.2 X-High on ARC-AGI-2! Using the same Poetiq harness as before, we saw results as high as 75% at under $8 / problem using GPT-5.2 X-High on the full PUBLIC-EVAL dataset. This beats the previous SOTA by ~15 percentage points. https://x.com/poetiq_ai/status/2003546910427361402
I would judge this a win by Gemini and a close second from Claude. ChatGPT-5.2 missed the reference (though, to be fair, it did write a surprising amount of successful code to actually enhance the image) and Grok wasn’t in the ballpark. https://x.com/emollick/status/2002961280534303206
So, Claude 4.5 came in far above trend in the much-watched METR measure of the task duration that AI can accomplish autonomously at 4 hours 49 minutes. Interestingly, at the harder 80% success threshold, it is GPT-5.1 Codex Max that breaks the trend. In 2023, GPT-4 was a minute. https://x.com/emollick/status/2002208335991337467
OpenAI built the Sora Android app (which hit #1 app in the world) in just 18 days with the help of Codex https://x.com/lennysan/status/2001074732293300301
A Redditor fed his MRI into ChatGPT and it appears to have correctly identified the cause of his sciatic leg pain. This could be a watershed moment for AI. https://x.com/reddit_lies/status/2003512194672025826
Background Coding Agents: Context Engineering (Part 2) | Spotify Engineering https://engineering.atspotify.com/2025/11/context-engineering-background-coding-agents-part-2
My talk from the @aiDotEngineer Code Summit is out! 🚨 "How Claude Code Works" and what we can learn about frontier agent architectures. Coding agents are suddenly really really good, and I’m trying to understand why. In short: better models, simple loop design, and bash tools https://x.com/imjaredz/status/2005731826699063657
Evaluating Context Compression for AI Agents | Factory.ai https://factory.ai/news/evaluating-compression
Everyone seems to have a different definition of what an AI agent is. Let’s define it in the context of building with LLMs. An AI agent is a system that can: 1. Make dynamic decisions about information flow 2. Maintain state across multiple interactions 3. Use tools adaptively https://x.com/weaviate_io/status/2003824281231220902
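Weaviate's three criteria map onto a very small loop. A minimal sketch in Python, assuming a hypothetical `llm` callable that returns either a tool request or a final answer; the `calculator` tool and the message format are illustrative, not any particular SDK's API:

```python
def calculator(expression: str) -> str:
    """A trivial tool the agent can choose to call."""
    return str(eval(expression))  # illustration only; never eval untrusted input

TOOLS = {"calculator": calculator}

def run_agent(task: str, llm, max_steps: int = 5) -> str:
    state = [{"role": "user", "content": task}]   # 2. state across interactions
    for _ in range(max_steps):
        action = llm(state)                       # 1. dynamic decision each step
        if action["type"] == "tool_call":         # 3. adaptive tool use
            result = TOOLS[action["name"]](action["input"])
            state.append({"role": "tool", "content": result})
        else:
            return action["content"]
    return "step budget exhausted"
```

The point of the sketch is that the "agent" is mostly the model: the loop only routes tool calls and accumulates state, which is consistent with the "simple loop design" observation elsewhere in this roundup.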
Fast and Close to Right: How Accurate Should AI Agents Be? https://www.honeycomb.io/blog/fast-and-close-to-right-how-accurate-should-ai-agents-be
Great read for AI devs (bookmark it). LLM agents are slow. The bottleneck in complex agentic systems today is the planning part. Plan generation alone can take 25+ seconds for task requests. This compounds fast at scale. Real-world dataset analysis shows about 30% of… https://x.com/omarsar0/status/2005799762252136537
Here is how @SpotifyEng uses background coding agents for thousands of code migrations, and what they learned: – Define desired verifiable end states explicitly, not strict todo steps. – Code examples improved output reliability. – The agent has access to 3 tools: verify, git, and… https://x.com/_philschmid/status/2005537262390349899
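The "verifiable end state" idea above is easy to sketch: rather than handing the agent a todo list, the harness keeps re-checking machine-verifiable predicates and only accepts the result when all of them pass. A hedged sketch — the commands are stand-ins, and Spotify's actual `verify` tool is not public:

```python
import subprocess
import sys

def passes(cmd: list[str]) -> bool:
    """Run one verification command; success means exit code 0."""
    return subprocess.run(cmd, capture_output=True).returncode == 0

# The desired end state expressed as checks, not steps. In a real migration
# these would be things like ["pytest", "-q"] or ["ruff", "check", "."];
# here a trivial interpreter invocation stands in.
END_STATE = [
    [sys.executable, "-c", "import sys; sys.exit(0)"],
]

def end_state_reached(checks=END_STATE) -> bool:
    return all(passes(c) for c in checks)
```

In this framing the agent is free to reach the end state however it likes; the harness only loops "run agent turn, re-run checks" until `end_state_reached()` is true or a budget runs out.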
Managing engineers in 2025 doesn’t mean managing people anymore. It means managing people who manage AI agents. The IC role just mutated into something we don’t have a name for yet. Three things that changed: Deep focus is overrated now. The new superpower is aggressive… https://x.com/brivael/status/2003871914104688867
Multiplexing MCP Servers For Agentic Specialization https://www.cloudnativedeepdive.com/multiplexing-mcp-servers-for-agentic-specialization/
Purpose-built, use-case-specific agents that are editable, which come with template repositories that include coding agent files to help with development 🤝 We’ve made it so that every time you pull an agent template with llamactl, the relevant docs are added as context into an… https://x.com/tuanacelik/status/2005690735543140678
same model, different provider, different output quality and user experience! very nice blog on the state of inference providers and all the issues that come with benchmarking them (agent harness, prompting, deployment, sampling params, etc.) https://x.com/eliebakouch/status/2003604370534072445
Shoutout @ollama for the smooth rollout 🤝 We leveled up the coding & agentic model. They made it stupid simple to run. Show us what you make with MiniMax M2.1 on ollama 👇 https://x.com/MiniMax_AI/status/2003715959719362584
UIUC, @Stanford, @Harvard researchers and others outlined the key strategies for agentic AI adaptation. There are 2 things you can adapt: – The agent itself (the reasoning model) – The tools it uses (search systems, retrievers, memory, APIs) From this, the researchers define 4… https://x.com/TheTuringPost/status/2002878724598022649
We removed 80% of our agent’s tools – Vercel https://vercel.com/blog/we-removed-80-percent-of-our-agents-tools
After writing several agents with Claude Code SDK, OpenAI SDKs and DeepAgent SDK I think a majority of agents can be built by just prompts and tools! You don’t need to implement patterns like plan, reflect and tool usage manually anymore. https://x.com/diptanu/status/2003674481144004667
Good news for the developer community! Try M2.1 on Cline. Whether you’re prototyping your next idea or scaling to production, you’ve got options. https://x.com/MiniMax_AI/status/2003599117503852680
This paper is a big deal! It’s well known that RL works great for math and code. But RL for training agents is a different story. The default approach to training LLM agents today is based on methods like ReAct-style reasoning loops, human-designed workflows, and fixed… https://x.com/omarsar0/status/2003862504490086596
Async Coding Agents "From Scratch" https://benanderson.work/blog/async-coding-agents/
GLM-4.7: Advancing the Coding Capability https://z.ai/blog/glm-4.7
GLM-4.7: new SOTA open-source model release. Key improvements: increased reliability by combining Interleaved Thinking (thinks before every response), Preserved Thinking (reuses reasoning across turns), and Turn-level Thinking (control when reasoning is on)… https://x.com/askalphaxiv/status/2005622173214335476
I went offline for a couple of days to be with family and it seems like all we talked about on this platform has been coding agents. A completely new way of thinking about documentation, OSS projects, and the developer tools we provide has actually been figuring out how to structure… https://x.com/tuanacelik/status/2005635491081900161
I’m very frustrated with the divergence in API standards across all providers. https://x.com/Teknium/status/2005603815618470320
Memory: How Agents Learn https://www.ashpreetbedi.com/articles/memory
These are really good takeaways, I resonate with many of them ⬇️ – from Lovable’s growth lead. 1. The shelf life of PMF is only three months, because the model update cycle is three months. Every time the model updates, you have to re-win PMF. 2. MVP is dead, if MLP doesn’t… https://x.com/crystalsssup/status/2003704941962285463
Toward Training Superintelligent Software Agents through Self-Play SWE-RL https://arxiv.org/pdf/2512.18552
Will we still write code in 2026, or just manage agents? Big changes are coming to coding! My two favorite points come from @Steve_Yegge: #1: "If you’re still using an IDE by next summer, you’re not a good engineer anymore." #2: "Next year will be the year of technical people… https://x.com/TheTuringPost/status/2003225483526435126
Computer-use agents will be a major story in 2026: they allow AI companies to capture a lot of white-collar work. https://x.com/scaling01/status/2005641253682098196
Graphite is joining Cursor · Cursor https://cursor.com/blog/graphite
openenv (OpenEnv: Agentic Execution Environments) https://huggingface.co/openenv
MAI-UI: Real-World Centric Foundation GUI Agents https://tongyi-mai.github.io/MAI-UI-blog/
Get started with pre-built document agent templates that solve real-world problems out of the box. We’ve created a collection of LlamaAgent templates through llamactl that cover the most common AI use cases, from simple document Q&A to complex invoice processing workflows: 🚀 https://x.com/llama_index/status/2005686055253729587
2026 Predictions: Will We Still Write Code, or Just Manage Agents? – YouTube https://www.youtube.com/watch?v=qqvODjOezX4
Agent Lightning – an open-source framework from @Microsoft that lets developers plug RL into any AI agent, with no need to rewrite a single line of core code. – Agent Lightning separates execution from training, turning any agent workflow into RL-ready data. – It plays nicely with… https://x.com/TheTuringPost/status/2001458217319313572
AI-powered scientists are starting to take off! This paper introduces PHYSMASTER, an LLM-based agent designed to operate as an autonomous theoretical and computational physicist. The goal is to go from an AI co-scientist to an autonomous AI scientist in fundamental physics. https://x.com/dair_ai/status/2005648022680526873
Hooks for security and platform teams · Cursor https://cursor.com/blog/hooks-partners
Fun article on the failure of a Claude-run vending machine in the WSJ newsroom. Reporters are amazing red teamers, creating fake policies & convincing Claude to order (and give away) Playstations & live fish. And yet… there are some hints of very viable paths forward from here https://x.com/emollick/status/2001755082510012750
I guess this (from a thinking trace of Claude 4.5 Opus) suggests @tylercowen’s strategy of writing for AI is paying off. https://x.com/emollick/status/2002546946721112173
Tired of leaving your IDE to explore GitHub repos? With Zread MCP in GLM coding plan, you can now stay in your flow: dive right into repos, explore their structure, search docs, and read files. Code smarter, not harder. https://x.com/Zai_org/status/2003872419791229285
We Let Anthropic’s Claude AI Run Our Office Vending Machine. It Lost Hundreds of Dollars. – WSJ https://www.wsj.com/tech/ai/anthropic-claude-ai-vending-machine-agent-b7e84e34
this is a brilliant read for anyone building with code agents like Codex / Claude Code – quick notes: > default to building CLIs first (easier for agents to verify) and progressively add other surfaces (UI) > for macOS/iOS apps default to using Swift build tooling + codex https://x.com/reach_vb/status/2005554360307065023
I think what has become clear over the past year is that the AGI label was never very useful. It is both true that (1) 2025-era Reasoners would meet many prior definitions of AGI & (2) the idea of a single-dimensional "intelligence" factor does not help us understand AI impacts https://x.com/emollick/status/2001530531666628703
LLM Leaderboard for Code Quality & Security | Sonar https://www.sonarsource.com/the-coding-personalities-of-leading-llms/leaderboard/
Cursor continues acquisition spree with Graphite deal | TechCrunch https://techcrunch.com/2025/12/19/cursor-continues-acquisition-spree-with-graphite-deal/
New Enhanced Tool Governance in Vertex AI Agent Builder | Google Cloud Blog https://cloud.google.com/blog/products/ai-machine-learning/new-enhanced-tool-governance-in-vertex-ai-agent-builder/
Powered by Gemini 3, the @YouTube Playables Builder web app can help creators develop fun, bite-sized games with text, video or image prompts. 🕹️ Find out more → https://x.com/GoogleDeepMind/status/2003137379268256006
We just published a post on how we continuously harden ChatGPT Atlas (and other agents) against novel prompt-injection attacks. This is an ongoing security problem (and a frontier research problem!) and we’re investing heavily in automated red teaming, reinforcement learning… https://x.com/cryps1s/status/2003182649662140620
Xiaomi MiMo https://mimo.xiaomi.com/blog/mimo-v2-flash
Excited to announce that @ManusAI has joined Meta to help us build amazing AI products! The Manus team in Singapore are world class at exploring the capability overhang of today’s models to scaffold powerful agents. Looking forward to working with you, @Red_Xiao_! https://x.com/alexandr_wang/status/2005766469771223106
Meta acquired Manus 👀 https://x.com/scaling01/status/2005768491740360722
18 months ago, I decided to join with @Red_Xiao_ and @peakji on my sofa. No one knew where it would lead. We just kept building, pivoting, and shipping – again and again – until now. Grateful to our team and every user who believed early. Day 1 isn’t over. We’ll keep shipping. https://x.com/hidecloud/status/2005766533910602183
GPT-5.2-Codex launches today. It is trained specifically for agentic coding and terminal use, and people at OpenAI have been having great success with it. https://x.com/sama/status/2001724019188408352
Just launched GPT-5.2-Codex! The best model for long-horizon agentic coding, including strong performance on refactors and migrations. Codex becoming very magical. https://x.com/gdb/status/2001758275998785743
"Having a *feel the AGI* moment with @OpenAI’s GPT 5.2 on extra high reasoning in Codex… It actually feels like a great junior engineer. Honestly, mind blown. 🤯" https://x.com/AjaySohmshetty/status/2003223257655443840
🚨BREAKING: You can now build real apps inside ChatGPT. No setup. No switching tabs. Just describe what you want — and watch it come to life. Meet Replit in ChatGPT / @Replit💈 https://x.com/details_with_ai/status/2003393465208754334
Codex has (finally) a new /experimental setting that enables background terminals. (useful for long running processes) Especially when you run a dev server or logs, you won’t be blocked and you can resume working in Codex. https://x.com/kevinkern/status/2003118604808786086
gpt-5.2 codex has been rock-solid, even on big, messy codebases. it can run forever without going off track, and i rarely have to throw away what it produces. the only downside is that it takes long enough that i start doing other stuff while it runs and burn through my quota. https://x.com/slow_developer/status/2002250108348379605
🆕 Codex now officially supports skills Skills are reusable bundles of instructions, scripts, and resources that help Codex complete specific tasks. You can call a skill directly with $.skill-name, or let Codex choose the right one based on your prompt. https://x.com/OpenAIDevs/status/2002099762536010235
Prompting GPT-5.2 Codex for continuity: it excels at long-running tasks, but without explicit guidance it can lose track of outcomes. Put this at the top of your AGENTS.md file; it will let Codex work on even larger-scale tasks. It’s how I let it run for 3 hours coherently. https://x.com/nummanali/status/2002724188436738459
.@OpenAI introduced a rigorous framework for evaluating “chain-of-thought monitorability.” It’s a fancy way of asking: can we understand what our AIs are thinking before they act? The answer: yes, but not without nuance. – Longer reasoning helps – Bigger models muddle things… https://x.com/TheTuringPost/status/2003636642767384639
I think there is likely too much emphasis on the METR long-task measurement as a sign of AI progress… … but it doesn’t matter. With a little help from GPT-5.2 Pro, I calculated the correlations between log(METR) & other key benchmarks, and they basically all correlate highly https://x.com/emollick/status/2002861706658398211
I’ll work to make ChatGPT a better tool for accelerating scientific and mathematical discoveries. If you come across failure cases to improve upon (or exciting success stories) please send them my way! https://x.com/ErnestRyu/status/2003542931568025676
New Data on Code Quality: GPT-5.2-high, Opus 4.5, Gemini 3, and More | Sonar https://www.sonarsource.com/blog/new-data-on-code-quality-gpt-5-2-high-opus-4-5-gemini-3-and-more/