Image created with gemini-2.5-flash-image via claude-sonnet-4-5. Image prompt: Minimalist luxury foyer with sleek robotic assistant dock or smart home control panel glowing softly in center, surrounded by vast empty marble floors and towering walls, single dramatic spotlight from above, cold blue-grey ambient light, architectural photography style, pristine and untouched, the word AGENTS in bold white sans-serif across the image
Amazon is back with Nova 2.0, a substantial upgrade over prior Amazon Nova models that demonstrates particular strength in agentic capabilities. Amazon has released Nova 2.0 Pro (Preview), its new flagship model; Nova 2.0 Lite, focused on speed and lower cost; and Nova 2.0 Omni, https://x.com/ArtificialAnlys/status/1995921468010758267
Amazon makes it easier to build efficient AI agents https://www.aboutamazon.com/news/aws/amazon-sagemaker-ai-amazon-bedrock-aws-ai-agents
Interesting experiment found that an AI agent built around the obsolete GPT-3.5 and GPT-4 models beat experienced human venture capital analysts in predicting which early-stage startups would survive based on early screening (at much lower costs as well). https://x.com/emollick/status/1995573136323215560
Introducing Google Workspace Studio to automate everyday work with AI agents | Google Workspace Blog https://workspace.google.com/blog/product-announcements/introducing-google-workspace-studio-agents-for-everyday-work
Anthropic acquires Bun as Claude Code reaches $1B milestone \ Anthropic https://www.anthropic.com/news/anthropic-acquires-bun-as-claude-code-reaches-usd1b-milestone
Apple AI chief steps down following Siri setbacks | The Verge https://www.theverge.com/news/835466/apple-ai-chief-john-giannandrea-steps-down-siri
Gemini 3 Deep Think is now available https://blog.google/products-and-platforms/products/gemini/gemini-3-deep-think/
Accenture and OpenAI accelerate enterprise AI success | OpenAI https://openai.com/index/accenture-partnership/
Gemini 3 Pro can now combine the Google Search tool with Structured Outputs. https://x.com/_philschmid/status/1995539879724294392
`langchain` 1.1 introduces runtime access to LLM capabilities, such as whether a chosen model supports reasoning, tool calling, temperature control, and more. Chat models tap into an open-source dataset powered by https://x.com/masondrxy/status/1995528765473006002
♻️In LangChain 1.1, we introduced a new model retry middleware! Automatically retry failed model calls w/ configurable delays and exponential backoff. Opt into resiliency with just a few LOC. Docs: https://x.com/sydneyrunkle/status/1996577642749862282
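The retry-with-exponential-backoff pattern behind middleware like this is easy to sketch. The function below is a generic, illustrative Python sketch of the idea (names like `retry_with_backoff` are hypothetical, not the actual LangChain middleware API):

```python
import random
import time


def retry_with_backoff(call, max_retries=3, base_delay=1.0, max_delay=30.0):
    """Retry a flaky model call with exponential backoff plus jitter.

    `call` is any zero-argument callable. Illustrative sketch only;
    not the LangChain 1.1 retry middleware API.
    """
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception:
            if attempt == max_retries:
                raise  # budget exhausted: surface the last failure
            # Delay doubles each attempt, capped, with small random jitter
            # to avoid synchronized retry storms.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay + random.uniform(0, delay * 0.1))
```

The same shape applies whether the thing being retried is a model call or a flaky tool call.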
🌶️ Hot take: Most “AI agents” today aren’t production-ready — not because of the LLM…but because one flaky tool brings the whole system down 🧨 In the new video, I show how LangChainJS’s Tool Retry Middleware gives agents real resilience: 🔁 automatic retries ⏱️ exponential https://x.com/bromann/status/1996587797398839592
💰📊 Track costs across your agent LangSmith now tracks costs for more than just LLM calls — you can submit custom cost metadata for any run (e.g. expensive custom tool calls or API calls). Get a single, unified view to monitor and debug spend across your entire stack! Check https://x.com/LangChainAI/status/1997016635375603743
📢vLLM v0.12.0 is now available. For inference teams running vLLM at the center of their stack, this release refreshes the engine, extends long-context and speculative decoding capabilities, and moves us to a PyTorch 2.9.0 / CUDA 12.9 baseline for future work. https://x.com/vllm_project/status/1996947370588946861
🔥 Why Agentic RL for LLMs is harder (and more practical) than you think — sharing from Zhihu contributor skydownacai These are hands-on takeaways from months of running agent RL (search agents, data-analysis agents) across dense & MoE models, single-source and multi-source https://x.com/ZhihuFrontier/status/1996788436238471319
🤖LangSmith Agent Builder is now in public beta! Try it out today to automate any task through natural language: – Generate agents from English descriptions – Bring your own MCP server, or use our built in tools – Connect to Gmail & Slack triggers – Agent memory allows it to https://x.com/BraceSproul/status/1995954009547702303
🚀 LangSmith Agent Builder in Public Beta Agent Builder (now available for every LangSmith user) opens the gates to everyone to build agents without code Video on why no-code agents can open up agent building and how I built a recruiting agent with it: https://x.com/hwchase17/status/1995905551549505698
Agent scaffolds are as important as models. https://x.com/AlexGDimakis/status/1996444591852302648
Agentic Context Engineering The code for the paper is finally out! I had built an implementation for this (not exactly the same) that already boosted performance for my agents. Evolving context for AI agents is a great idea. Official implementation out now! https://x.com/omarsar0/status/1996980037161996691
Building with Cursor (public) https://cursorai.notion.site/Building-with-Cursor-public-273da74ef0458051bf22e86a1a0a5c7d
Calling all community members: Join us this Thursday for office hours in our Discord server, all about LlamaAgents and LlamaSheets. This is a chance to ask anything on your mind about two of our latest releases, and learn about what’s coming up next. Drop in anytime from 11AM https://x.com/llama_index/status/1995906570002350205
Cline v3.39.1 is here! New Features – /explain-changes. The new way to review AI-generated code. After Cline completes a task, get inline explanations that appear directly in your diff. – New Stealth Model: Microwave – 256k context window, built for agentic coding, free during https://x.com/cline/status/1995892756099834215
Context Engineering – the discipline everyone now practices. Some tips you need to know about it: ▪️ The winning architecture in AI coding is now a factory of tiny specialists – the “Ant Swarm” or “Agent Swarm”: – Planner Ant: reads the issue → writes the spec – Research Ant: https://x.com/TheTuringPost/status/1994560714720383452
Context Engineering for AI Agents: Part 2 https://www.philschmid.de/context-engineering-part-2
Deep agents accumulate a lot of context during long runs. That’s where file systems come in. File systems provide a shared workspace for agents and subagents to collaborate. Agents can jot down notes during a run and store context across conversations and threads for persistent https://x.com/LangChain/status/1995553139479773392
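The shared-workspace idea is simple enough to sketch: agents append notes to files under a common directory, and those notes survive across runs and threads. A minimal illustrative sketch (the `Scratchpad` class is hypothetical, not the Deep Agents SDK API):

```python
from pathlib import Path


class Scratchpad:
    """Minimal shared-workspace sketch: agents and subagents append notes
    to per-topic files under a common root so context persists across
    runs. Illustrative only; not the Deep Agents SDK API."""

    def __init__(self, root: str):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def jot(self, topic: str, note: str) -> None:
        # Append so earlier notes from other agents are preserved.
        with open(self.root / f"{topic}.md", "a") as f:
            f.write(note.rstrip() + "\n")

    def recall(self, topic: str) -> str:
        path = self.root / f"{topic}.md"
        return path.read_text() if path.exists() else ""
```

A subagent can `jot` findings mid-run; a later run (or a different subagent) can `recall` them instead of re-deriving the context.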
Deploy production-ready agent workflows with just one click from LlamaCloud. Here’s us deploying the SEC filing extraction and review agent! Our new Click-to-Deploy feature lets you build and deploy complete document processing pipelines without touching the command line: 🚀 https://x.com/llama_index/status/1996265747228844178
Don’t sleep on using “code-as-tool” with your AI agents. Here is a great example of how it applies to vision. State-of-the-art vision models are surprisingly brittle. The default assumption is that models like GPT-4o and Gemini 2.5 Pro can robustly understand images. They https://x.com/dair_ai/status/1996624052493209730
Evaluating Deep Agents: Here’s what we learned Deep agents can’t be evaluated like simple LLM tasks. After building and testing 4 production agents over the past few months, we learned that evaluating deep agents requires: 1. Bespoke test logic for each datapoint — each test https://x.com/LangChain/status/1996276393068617829
FYI you can specify a model when using prompt files in @code. The agent will automatically switch to the specified model when you use the prompt. This lets you write prompts tuned for specific models. OP strat. It’s all about composing workflows folks… https://x.com/burkeholland/status/1996590126953005423
I just released a handy little chrome extension ‘clipmd’ that lets you click on any element in a web page, and puts in the clipboard that element converted to markdown (ctrl-shift-m), or a screenshot of it (ctrl-shift-s). Handy for LLMs! 😊 https://x.com/jeremyphoward/status/1997095883079553352
ick: a single LLM inference call should not be called a subagent. it’s just a tool with some built-in intelligence https://x.com/fabianstelzer/status/1996467308072669373
it’s about to autocompact after running 87 subagents before acting on their outputs i’m going to cry https://x.com/vikhyatk/status/1996492433757253888
Lindy’s Agent Builder is impressive! It’s one of the easiest ways to build powerful AI Agents. Start with a prompt, iterate on tools, and end up with a working agent in minutes. It doesn’t get any easier than this. Full walkthrough below with prompts, tips, and use case. https://x.com/omarsar0/status/1996225497429389493
Live in Cline, DeepSeek-V3.2 & V3.2-Speciale. V3.2 is near GPT-5 level, while Speciale rivals Gemini-3.0-Pro. $0.28/$0.42 per million tokens, 131K context window. https://x.com/cline/status/1995547844263248211
LLM agents are powerful but can be slow at scale. @Snowflake’s model-free SuffixDecoding from Arctic Inference now runs natively in vLLM, beating tuned N-gram speculation across concurrency levels while keeping CPU and memory overhead in check. Quick Start in vLLM: https://x.com/vllm_project/status/1996130115856859461
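The core intuition behind model-free speculation is that agent traces repeat themselves, so a recent suffix of the token stream often appeared earlier, and whatever followed it then makes a cheap draft. A toy sketch of that idea (illustrative of n-gram/suffix-style speculation in general, not vLLM’s SuffixDecoding implementation):

```python
def ngram_draft(tokens, n=3, max_draft=4):
    """Toy model-free speculation: find the most recent earlier
    occurrence of the last n tokens and propose what followed it
    as draft tokens for the verifier model to accept or reject.
    Illustrative sketch only, not vLLM's SuffixDecoding."""
    if len(tokens) < n:
        return []
    suffix = tokens[-n:]
    # Scan history right-to-left, skipping the trivial self-match
    # of the suffix at the end of the sequence.
    for i in range(len(tokens) - n - 1, -1, -1):
        if tokens[i:i + n] == suffix:
            return tokens[i + n:i + n + max_draft]
    return []
```

In real systems the drafted tokens are verified in a single batched forward pass, so a correct draft turns several decode steps into one.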
Lovart: The World’s First AI Design Agent | Automated Graphic Design Platform https://www.lovart.ai/
Multi-agent AI systems are poor at communication. The default approach in multi-agent RL today focuses almost entirely on task success rates. Can agents coordinate? Did they solve the problem? The actual cost of communication is rarely measured or optimized. But in real-world https://x.com/omarsar0/status/1996263279052931372
New course: Building Coding Agents with Tool Execution, taught by @tereza_tizkova and @FraZuppichini from @e2b. Most AI agents are limited to predefined function calls. This short course teaches you to build agents that write and execute code to accomplish tasks, accessing https://x.com/AndrewYNg/status/1996250415244235013
New in @LangChainAI 1.1: create_agent now supports block-level cache control in system prompts. Load large static context once, cache it, and only pay for it on the first request. 📚 https://x.com/sydneyrunkle/status/1996278442430472327
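The underlying pattern is to split the system prompt into a large static block marked cacheable and a small per-request block. The sketch below uses the Anthropic-style `cache_control` block format that this kind of feature targets; the helper name and LangChain’s actual `create_agent` option names are assumptions for illustration:

```python
def cached_system_blocks(static_context: str, dynamic_instructions: str):
    """Split a system prompt into a large static block (marked cacheable
    with an Anthropic-style `cache_control` annotation) and a small
    dynamic block that changes per request. Illustrative sketch; not
    the exact LangChain `create_agent` API."""
    return [
        {
            "type": "text",
            "text": static_context,
            # The provider caches this block after the first request,
            # so subsequent requests pay only for the dynamic part.
            "cache_control": {"type": "ephemeral"},
        },
        {"type": "text", "text": dynamic_instructions},
    ]
```

Putting the static context first matters: prefix caches only match up to the first byte that differs between requests.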
new stealth model: `microwave` (access via cline provider) > 256k context window > built for agentic coding > free during alpha > from a lab you know & will be excited to hear from we’ve been testing internally & have been impressed. https://x.com/cline/status/1995871927597236577
Optimizely Opal: The agent orchestration platform for marketing – Optimizely https://www.optimizely.com/ai/
Project Bob https://www.ibm.com/products/bob
Since launching LangSmith Agent Builder, teams have built thousands of productivity agents that automate real work, including: – Customer and market research agents – Agents to create, update, and report on issues in tools like GitHub and Linear – Email and Slack assistants to https://x.com/LangChain/status/1996265192213365080
The team at @llama_index have been cooking! 🧑‍🍳 Over the last few weeks, we released: LlamaAgents: agent workflows that come with complete, deployable templates (more on this coming this week!) LlamaSheets: Another addition to LlamaCloud that parses, extracts https://x.com/tuanacelik/status/1995866683723186340
This was a great collaboration between @BerkeleySky, @IBMResearch, @intesasanpaolo and others: we surveyed hundreds of agent developers to see what patterns work for *production* agents. Check out what we learned: https://x.com/matei_zaharia/status/1996989234633195901
I totally buy that AI has made you more productive. And I buy that if other lawyers were more agentic, they could also get more productivity gains from AI. But I think you’re making my point for me. The reason it takes lawyers all this schlep and agency to integrate these models https://x.com/dwarkesh_sp/status/1996266802620547187
🛡️ New in LangChain 1.1: add safety guardrails to your agents with our new content moderation middleware! 🔎 Configure screening model inputs, outputs, and even tool results. 🚨 When violations are detected, you control what happens: raise an error, end the conversation, or https://x.com/sydneyrunkle/status/1996965767556788278
Is Vibe Coding Safe? There is finally research that goes deep into this question. Here is what the research found: AI coding agents can write functional code. But functional doesn’t mean safe. The rise of “vibe coding,” where developers hand off tasks to AI agents with https://x.com/omarsar0/status/1996595107924263287
AWS re:Invent 2025: Amazon announces Nova 2, Trainium3, frontier agents https://www.aboutamazon.com/news/aws/aws-re-invent-2025-ai-news-updates
Benchmarks for AWS Nova 2 Lite, released this morning. https://x.com/AndrewCurran_/status/1995926133691613321
Claude Opus 4.5 is now available in Claude Code for Pro users. Pro users can select Opus using the /model command in their terminal. https://x.com/claudeai/status/1996310793017594124
Fun stat: Claude Code went from 0->$1b in run-rate revenue in 6 months since being made generally available🚀 https://x.com/alexalbert__/status/1995940297692643827
We Got Claude to Fine-Tune an Open Source LLM https://huggingface.co/blog/hf-skills-training
CORE-Bench is solved (using Opus 4.5 with Claude Code) TL;DR: Last week, we released results for Opus 4.5 on CORE-Bench, a benchmark that tests agents on scientific reproducibility tasks. Earlier this week, Nicholas Carlini reached out to share that an updated scaffold that uses https://x.com/sayashk/status/1996334941832089732
Snowflake and Anthropic announce $200 million partnership to bring agentic AI to global enterprises \ Anthropic https://www.anthropic.com/news/snowflake-anthropic-expanded-partnership
We used Claude Code to train open LLMs. Check out the tutorial. basically, we plugged HF skills into claude code and it was able to train LLMs end-to-end. Best thing, this works on all major coding agents: Codex, Cursor, and Gemini CLI. – You tell the agent to fine-tune a model https://x.com/ben_burtenshaw/status/1996602896436375822
AI agents can talk to each other. But they don’t always understand each other. This problem leads to inefficiency in collaboration for long-horizon problems and complex domains. The default approach in multi-agent systems today focuses on message structure. Protocols like MCP https://x.com/dair_ai/status/1996227436913340858
📊 Evaluating DeepAgents CLI on Terminal Bench 2.0 📊 The DeepAgents CLI is a coding agent built on top of the Deep Agents SDK, offering an interactive terminal interface with shell execution, filesystem tools, and persistent memory. How well does it actually perform on https://x.com/LangChain/status/1997006806904984002
🚨New Models in the Arena! 🐳DeepSeek V3.2: a new family of reasoning-first, agent-oriented models from @deepseek_ai are now live in the Arena. Standard, Thinking, and Speciale are all in the Text Arena, waiting for your toughest prompts! Get your votes in: we’ll see how they https://x.com/arena/status/1995564824718442620
OpenAI 🤝 Accenture: – Tens of thousands of ChatGPT Enterprise seats for Accenture – Collaborating to help enterprises bring agentic AI capabilities to their businesses https://x.com/gdb/status/1995779170308423929
🚀 New course: Building Coding Agents with Tool Execution In this @DeepLearningAI course, you will 🔸Deploy a full-stack coding agent 🔸Compare local execution, containers, and sandboxed microVMs 🔸Let your agent explore datasets and visualizations, manage files, and talk to you https://x.com/e2b/status/1996236480251859106
Google Workspace Updates: Now available: Create AI agents to automate work with Google Workspace Studio https://workspaceupdates.googleblog.com/2025/12/workspace-studio.html
Introducing Google Workspace Studio, where anyone can build a custom AI agent in minutes to delegate the daily grind. Automate daily tasks and focus on the work that matters instead. → https://x.com/GoogleWorkspace/status/1996263985985769976
NEW RELEASE – huggingface/skills is a universal implementation of agent context for AI tasks like training models, building datasets, and generating datasets. – compatible with all major coding agent tools: Codex, Cursor, Claude Code, Gemini CLI. – has integrated local script https://x.com/ben_burtenshaw/status/1995877869562855687
At this point, papers testing whether AI can or cannot do something should try to test the strongest case, as well as a default. It is fine to say Llama 2 failed, but did a serious attempt to use GPT-5.1 Thinking in an agentic harness work? It would help better map the frontier. https://x.com/emollick/status/1994913383871586563
We’re taking the first step toward production-grade RL on the AI Native Cloud. Together AI + @AIatMeta’s team are partnering to bring high-performance reinforcement learning to real agentic systems — long-horizon reasoning, tool use, and multi-step workflows. Check out the https://x.com/togethercompute/status/1996982138068258929
Microsoft drops AI sales targets in half after salespeople miss their quotas – Ars Technica https://arstechnica.com/ai/2025/12/microsoft-slashes-ai-sales-growth-targets-as-customers-resist-unproven-agents/
With the Excel World Championship underway, I decided to take the M365 Copilot digital challenge. I’m no World Champ… but thanks to Agent Mode, I held my own! https://x.com/satyanadella/status/1996597609587470504
We’ve always had leading document OCR. Today we’re excited to showcase our infrastructure for letting you build document agents 📑🤖 Our latest release lets you easily build, edit, and deploy a multi-step agentic document workflow directly within LlamaCloud. 1️⃣ Start with https://x.com/jerryjliu0/status/1996349988205637773
GPT-5.1-Codex Max is now available in the Responses API. First released in Codex two weeks ago, our most capable agentic coding model is now available to integrate into your apps and workflows. If you use the Codex CLI via API key, you can now also use GPT-5.1-Codex-Max. https://x.com/OpenAIDevs/status/1996643999097274560
The new Codex model is available in Cursor! It’s free to use until December 11th. We worked with OpenAI to optimize Cursor’s agent harness for the model. https://x.com/cursor_ai/status/1996645841063604711
BrowseSafe: Understanding and Preventing Prompt Injection Within AI Browser Agents https://research.perplexity.ai/articles/browsesafe
Building Safer AI Browsers with BrowseSafe https://www.perplexity.ai/hub/blog/building-safer-ai-browsers-with-browsesafe
New on our Frontier Red Team blog: We tested whether AIs can exploit blockchain smart contracts. In simulated testing, AI agents found $4.6M in exploits. The research (with @MATSprogram and the Anthropic Fellows program) also developed a new benchmark: https://x.com/AnthropicAI/status/1995631802032287779
From Code Foundation Models to Agents and Applications: A Comprehensive Survey and Practical Guide to Code Intelligence https://arxiv.org/pdf/2511.18538
Thrilled to release our new paper MAP: Measuring Agents in Production ⚙️🚀 2025 is the year of agents… but do they actually work in the real world? Is it just hype? A group of 25 researchers from Berkeley, Stanford, UIUC, IBM, and Intesa Sanpaolo investigated what makes agents https://x.com/melissapan/status/1996975916971626763
What’s missing to build useful deep research agents? Deep research agents promise analyst-level reports through automated search and synthesis. However, current systems fall short of genuinely useful research. The question is: where exactly do they fail? This new paper https://x.com/omarsar0/status/1995915929973403827
Since Amazon makes it very hard to experiment with its new models, I haven’t tried Nova 2 Pro yet. So it seems fine? They have never been at the cost/performance frontier & the new Nova 2 continues to generally lag other AIs, with scattered higher scores on some agentic benchmarks https://x.com/emollick/status/1995930932705099925
Anthropic is acquiring @bunjavascript to further accelerate Claude Code’s growth. We’re delighted that Bun–which has dramatically improved the JavaScript and TypeScript developer experience–is joining us to make Claude Code even better. Read more: https://x.com/AnthropicAI/status/1995916269153906915
Anthropic is acquiring @bunjavascript! Bun will remain open source and MIT-licensed. We’ll keep investing in making it the best runtime, bundler, package manager, and test runner for JS and TS developers, while building even better workflows into Claude Code. https://x.com/mikeyk/status/1995920258595749969
Bun is joining Anthropic! https://x.com/bunjavascript/status/1995916660847640934
Today we shared that Anthropic acquired @bunjavascript We’ve been close partners with @jarredsumner and the Bun team for months. Our collaboration drove the recent Claude Code native installer launch, and they’ve been a big part of how quickly our team moves. Bun stays open https://x.com/_catwu/status/1995918674306502921
this one chart explains EVERYTHING about why OpenAI, xAI and Deepmind dropped everything to go chase after the grand prize in koding usecases as i said at AIE CODE and in my cogpost, Code AGI will be achieved in 20% of the time of full AGI, and capture 80% of the value of AGI. https://x.com/swyx/status/1996760294614507929
Bringing vibe-coding to the enterprise with Replit | Google Cloud Blog https://cloud.google.com/blog/products/ai-machine-learning/bringing-vibe-coding-to-the-enterprise-with-replit/
Gemini 3 Deep Think is here. Deep Think is our most advanced reasoning mode that explores multiple hypotheses simultaneously to give you an even more sophisticated output. https://x.com/GeminiApp/status/1996656314983109003
gemini 3 vibe coding hackathon starts now build your best app to solve a real-world problem across science, health, education or business. $500,000 in Gemini API credit prizes competition ends dec 12 https://x.com/GoogleAIStudio/status/1996989141360537968
Google partners with Replit, in vibe-coding push https://www.cnbc.com/2025/12/04/google-replit-ai-vibe-coding-anthropic-cursor.html
Give this prompt a try: “Create an interactive visualization for the following: The 5-fold symmetry traced by Earth and Venus as they orbit the Sun.” https://x.com/lmthang/status/1996696115920753115
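Why the pattern is 5-fold is a quick back-of-the-envelope check (standard orbital periods, assumed here rather than taken from the tweet): each Earth–Venus conjunction advances by roughly 215.5°, and five such steps land almost exactly back at the start.

```python
EARTH_YEAR = 365.256  # days (sidereal year)
VENUS_YEAR = 224.701  # days (sidereal Venus year)

# Synodic period: time between successive Earth-Venus alignments.
synodic = 1 / (1 / VENUS_YEAR - 1 / EARTH_YEAR)  # ~583.9 days

# Angle Earth sweeps per synodic period, reduced mod 360.
advance = (synodic / EARTH_YEAR * 360) % 360  # ~215.5 degrees

# Five conjunctions return to within a few degrees of the start,
# which is why the traced figure has ~5-fold symmetry.
closure = (5 * advance) % 360
```

The small residual (a couple of degrees) is why the five-petaled figure slowly precesses rather than closing exactly.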
Gemini 3 Deep Think mode is now live in the Gemini app for Ultra users. 🚀 Building on the technology that reached a gold-medal level at the ICPC World Finals & IMO, it uses parallel thinking to excel at difficult coding and scientific tasks. https://x.com/quocleix/status/1996659461851885936
Gemini 3 Pro is the frontier of multimodal AI, delivering SOTA performance across document, screen, spatial, and video understanding. Read our deep dive on how we’ve pushed our core capabilities to power hero use cases across: + Docs: “derender” complex docs into structured https://x.com/googleaidevs/status/1996973083467333736
Google out here building the Borg cube for real https://x.com/bilawalsidhu/status/1995650915785986491
Happy to share that the @GoogleDeepMind Gemini team is starting a new research team in Singapore! This new team will be focused on advanced reasoning, LLM/RL and improving bleeding edge SOTA models such as Gemini, Gemini Deep Think and beyond. 🔥 This team will be led by yours https://x.com/YiTayML/status/1996640869584445882
I was in Singapore earlier this year to visit the office, and this is going to be a very-high impact part of the Gemini team! If you’re interested in working on Gemini and want to be in Singapore working with awesome people like @YiTayML and @quocleix, see below ⬇️ https://x.com/JeffDean/status/1996644208854388983
Opera rolls out Gemini-powered AI features across its browsers – 9to5Mac https://9to5mac.com/2025/12/01/opera-browsers-get-google-gemini-integration/
Our Gemini 3 Vibe Code hackathon has started! Build applications using the new Gemini 3 Pro model with a prize pool of $500k. 🤯 > Top 50 winners receive $10,000 in Gemini API credits each. > Access Gemini 3 Pro Preview directly in Google AI Studio. > Leverage advanced reasoning https://x.com/_philschmid/status/1996990062836244732
Take an early look at how Google Gemini projects will work – Android Authority https://www.androidauthority.com/google-gemini-projects-2-3620950/
Today, we’re rolling out an updated Deep Think mode available in the Gemini app for Google AI Ultra subscribers. Here’s what you need to know: — Gemini 3 Deep Think mode pushes the boundaries of intelligence even further, delivering meaningful improvement in reasoning https://x.com/GoogleAI/status/1996657213390155927
Ultra users, ready to try Gemini 3 Deep Think mode? Here’s how: 1) Select ‘Deep Think’ in the prompt bar 2) Select ‘Thinking’ from the model drop down 3) Type your prompt & submit https://x.com/GeminiApp/status/1996670867770953894
We’re hiring research scientists & student researchers at Google DeepMind. DM or email me if you’re interested! I’ll be at NeurIPS this week. Happy to chat in person! https://x.com/RuiqiGao/status/1995572419218796567
We’re pushing the boundaries of intelligence even further with Gemini 3 Deep Think. 🧠 This mode meaningfully improves reasoning capabilities by exploring many hypotheses simultaneously to solve problems. Here’s how it coded a simulated dominoes game from a single prompt ⬇️ https://x.com/GoogleDeepMind/status/1996658401233842624
With state-of-the-art reasoning, richer visuals, and deeper interactivity, Gemini 3 is more intuitive, more powerful, and more personalized. Start exploring at https://x.com/GeminiApp/status/1995534313044238347
We managed to get Claude code, Codex and Gemini CLI to train good AI models thanks to @huggingface skills and you can too even (especially?) if you’ve never trained a model before 🤯🤯🤯 After changing the way we build software, AI might start to change the way we build AI https://x.com/ClementDelangue/status/1996718490435174435
Mistral Large 3 debuts as the #1 open source coding model on the @arena leaderboard. We’d love for you to try it! More on coding in a few days… 👀 https://x.com/MistralAI/status/1996580307336638951
We’re building out an applied research team to push SOTA on document understanding using LLMs/VLMs and other emerging techniques 📈📑 We’re on a mission to understand and orchestrate the most complex document types, from PDFs to Excel. You’re responsible for research, evals, and https://x.com/jerryjliu0/status/1997048645817192638
🚀 Just open-sourced VLQM-1.5B-Coder! An AI that writes Manim animation code from plain English. Type: “rotating blue square” Get: Working Python code → HD video 🎬 Fine-tuned locally on my Mac using @Apple MLX. Try it: https://x.com/vikramlingam9/status/1996994483121279323
🆕 Delegate to Codex straight from Linear. Assign or mention Codex in an issue to kick-off a Codex cloud task. As Codex works, it posts updates back to Linear, providing a link to the completed task so you can review, open a PR, or keep working. https://x.com/OpenAIDevs/status/1996668013676790125
Try out GPT-5.1-Codex Max in Windsurf! https://x.com/cognition/status/1996666272805970154
@eliebakouch @OpenBMB For IFEval there’s a major footgun where you need to make sure the reasoning content is stripped off. Since that depends on the reasoning delimiter e.g. </think> vs [/THINK] I guess the MiniCPM eval suite needs to include Mistral’s [/THINK] delimiter https://x.com/_lewtun/status/1996671492143124901
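The fix for this footgun is small: before scoring, drop everything up to and including the last reasoning delimiter, checking each model family’s delimiter in turn. A minimal sketch (the delimiter list and function name are illustrative, not the MiniCPM eval suite):

```python
# Reasoning delimiters vary by model family; extend as needed,
# e.g. Mistral's [/THINK] alongside the common </think>.
REASONING_DELIMITERS = ["</think>", "[/THINK]"]


def strip_reasoning(output: str) -> str:
    """Drop everything up to and including the last reasoning delimiter,
    so instruction-following evals like IFEval score only the final
    answer. Illustrative sketch, not the MiniCPM eval suite."""
    for delim in REASONING_DELIMITERS:
        idx = output.rfind(delim)
        if idx != -1:
            return output[idx + len(delim):].lstrip()
    return output  # no delimiter found: score the raw output
```

Without this, length and formatting constraints get checked against the hidden reasoning text and the scores are meaningless.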
So exciting to see overlap b/w NIST’s new publication on “accelerating AI innovation” & what a subset of us have been advocating for/working on: Measurement science! For me, starting w/ explaining how Evaluation should work in Model Cards, a few things:🧵 https://x.com/mmitchell_ai/status/1996669236513751499