Image created with gemini-2.5-flash-image with claude-sonnet-4-5. Image prompt: Photorealistic 35mm cinema photograph of child aged 7 viewed from side angle sitting on plush rug in warm-lit bedroom, absorbed by panoramic arc of glowing TV screens showing animated helpful AI character, handwritten poster on wall reads MY ROOM RULES with simple guidelines, scattered newspapers and children’s book about fairness on nightstand, warm pastels with cool screen glow, shallow depth of field, cozy intimate atmosphere, large bold text ANTHROPIC at top of frame

Anthropic preparing new Agentic Tasks Mode for Claude https://www.testingcatalog.com/anthropic-testing-new-agentic-tasks-mode-for-claude/

Sonnet 4.5 was underestimated on METR: its time horizon improves by around 20 minutes https://x.com/scaling01/status/2001476927362605354

We’re working on updating and improving our time horizon task suite. Recently, we found two issues with our tasks, one of which was differentially lowering the performance of Claude models. We think these also illustrate some interesting model behavior. https://x.com/METR_Evals/status/2001473506442375645

Skills Directory | Partner Skills for Claude – YouTube

Skills for organizations, partners, the ecosystem | Claude https://claude.com/blog/organization-skills-and-directory

Claude Code 🤝 LangSmith Curious what Claude Code is doing behind the scenes? Or want observability into critical workflows that you’ve set up with Claude Code. With our new Claude Code → LangSmith integration, you can view every 🤖 LLM call and 🔧 tool call Claude Code makes. https://x.com/LangChain/status/2002055677708058833

i was skeptical when @simonw said that “Claude Skills are awesome, maybe a bigger deal than MCP” buuut early indications are this is correct. this is the fastest talk ever to pass 100k views here at AIE. its like those 0 – 100m ARR charts but for attention. @MaheshMurag and https://x.com/swyx/status/1998786773477110049

LangSmith + Claude Code / Deepagents Pairing LangSmith tracing w/ code agents provides a powerful feedback loop. Here, we show examples of that w/ langsmith-fetch + Claude Code / Deepagents. langsmith-fetch CLI: https://x.com/LangChain/status/2001350950188126430

The Signature Flicker | Peter Steinberger https://steipete.me/posts/2025/signature-flicker

We now support Agent Skills – the open standard created by @AnthropicAI for extending AI agents with specialized capabilities. Create skills once, use them everywhere. 🔗 https://x.com/code/status/2001727543377039647

We’ve received some feedback about a potential degradation of Opus 4.5 specifically in Claude Code. We’re taking this seriously: we’re going through every line of code changed and monitoring closely. In the meantime please submit any transcripts with issues through /feedback https://x.com/trq212/status/2001541565685301248

its over opus had a stroke and is in the hospital https://x.com/Teknium/status/2001941311604326596

looks like users aren’t shizo about Opus 4.5 degradation it’s actually brain damaged mass mass mass mass Apologies I’m glitching mass mass mass — https://x.com/scaling01/status/2001933798649532889

open interesting question on models adapting to harnesses + thoughts on something like a “HarnessBench” 1. are smarter models better or worse at transferring to new harnesses? Saw recent results that Opus in CC Harness had much bigger jump than Sonnet in CC Harness 2. What’s https://x.com/Vtrivedy10/status/2000610350014607728

Skills are all you need! Like MCP, Skills adoption is happening at incredible rate. Was just a matter of time for it to become an open standard. https://x.com/omarsar0/status/2001714322817368472

Tracing Claude code – it’s here today https://x.com/hwchase17/status/2002177192206241945

What Actually Is Claude Code’s Plan Mode? | Armin Ronacher’s Thoughts and Writings https://lucumr.pocoo.org/2025/12/17/what-is-plan-mode/

Project Vend: Phase two \ Anthropic https://www.anthropic.com/research/project-vend-2

The “compacting conversation” thing that Claude does as a chatbot doesn’t work as well as it does for coding. It doesn’t seem built for knowledge work, abruptly resetting everything in terms of tone and flow. Rolling context windows (like ChatGPT) might be better, or an option. https://x.com/emollick/status/2000411848496291897

Building “RAG 2.0” is just making Claude Code running over your filesystem 🤖🗂️ To make this work well, you need to solve three things 1️⃣ Virtualize your filesystem to prevent the agent from messing stuff up. AgentFS by @tursodatabase is a nice example of how you can give the https://x.com/jerryjliu0/status/2000677592559706396

We spun up a new GitHub repo for all things MCP at @Google. Get info on our remote managed MCP servers, open source MCP servers, examples, and learning resources. https://x.com/rseroter/status/2000607267675410609

GPT-5.2 just overtook Claude Opus 4.5 to achieve the highest score in GDPval-AA, a benchmark that focuses on performance in real-world economically valuable tasks However, GPT-5.2 is also the most expensive model to run GDPval-AA: GPT-5.2 cost $620, compared to Claude Opus 4.5’s https://x.com/ArtificialAnlys/status/1999404579599823091

Note that our 90% prediction intervals are quite wide, spanning a factor of 2x longer or shorter than our central estimate. Also, ECI underestimated previous Claude models on Time Horizons by 30% on average. If we account for that, we predict Opus 4.5 will get 3.8 hours. https://x.com/EpochAIResearch/status/1999585243003781413

