Image created with gemini-2.5-flash-image with claude-sonnet-4-5. Image prompt: Photorealistic 35mm cinema shot of child viewing TV screens in warm bedroom, shallow depth of field with robotic arm arranging books on shelf in background, small autonomous device on plush rug edge, screens displaying UI fragments and task notifications, warm peach lighting contrasted with cool blue screen glow, side angle composition, cozy domestic automation, large bold text AGENTS at top

Anthropic preparing new Agentic Tasks Mode for Claude https://www.testingcatalog.com/anthropic-testing-new-agentic-tasks-mode-for-claude/

Fastweb + Vodafone (Swisscom Group), one of Europe’s leading telecom providers, is building Super TOBi, which brings agentic customer service to massive scale. Using LangSmith, they are: 🔹Achieving 90% response correctness and 82% resolution rates across ~9.5M customers https://x.com/LangChain/status/2001321491703443877

GROK JUST TURNED VOICE AI INTO A REAL PRODUCT, FAST, AND EVERYWHERE xAI just opened Grok Voice to developers, and this isn’t some early experiment dressed up as a launch. It’s the same system already running inside millions of Teslas, now exposed through an API that actually… https://x.com/MarioNawfal/status/2001472484869329288

Grok Voice Agent API | xAI https://x.ai/news/grok-voice-agent-api

Today, we’re excited to launch the Grok Voice Agent API, empowering developers to build voice agents that speak dozens of languages, call tools, and search realtime data. https://x.com/xai/status/2001385958147752255

Took less than an hour for Grok Voice Agent by @xai to be ported to Reachy Mini thanks to @atariorbit! https://x.com/ClementDelangue/status/2001410494528213481

Sonnet 4.5 was underestimated on METR; its time horizon improves by around 20 minutes https://x.com/scaling01/status/2001476927362605354

We’re working on updating and improving our time horizon task suite. Recently, we found two issues with our tasks, one of which was differentially lowering the performance of Claude models. We think these also illustrate some interesting model behavior. https://x.com/METR_Evals/status/2001473506442375645

IBM dropped CUGA, open-source enterprise agent to automate boring tasks 🔥 > given workspace files, it writes and executes code to accomplish any task 🤯 > comes with a ton of tools built for enterprise tasks, supports MCPs > plug in your favorite LLM 👏 here’s a small demo https://x.com/mervenoyann/status/2000599316121924052

Skills Directory | Partner Skills for Claude – YouTube

Skills for organizations, partners, the ecosystem | Claude https://claude.com/blog/organization-skills-and-directory

Gemini Agent can help tackle all sorts of tasks. Even renting a car. Tell Gemini Agent your budget and it’ll get to work comparing prices, gathering info from your inbox, and booking the car. Now available for Google AI Ultra users in the US on desktop and mobile. https://x.com/GeminiApp/status/2000616120106221781

“Gemini 3, create a really novel and clever and funny Venn diagram. think hard. do not do research.” So close to coming together (I am not sure the center works for all three, illustrations are odd), but also better than I expected. https://x.com/emollick/status/2000805347590856822

“Gemini 3, please provide the rail/subway map for Middle Earth in the third age, with accurate stops and taking into account natural barriers, alliances, and so on.” Not bad. I do like the “service suspended – Balrog” note at Moria. https://x.com/emollick/status/1999930443001737700

Gemini can now illustrate a visual report https://blog.google/products-and-platforms/products/gemini/visual-reports/

Google Antigravity https://antigravity.google/

Google expands Gemini with NotebookLM integration https://www.testingcatalog.com/google-expands-gemini-with-notebooklm-integration/

Build mini apps with Opal in the Gemini web app https://blog.google/innovation-and-ai/models-and-research/google-labs/mini-apps-opal-gemini-app-experiment/

Say hello to CC, a new AI productivity agent that connects your Gmail, Calendar and Drive to deliver a personalized briefing every morning. Need more help? Just email CC https://labs.google/cc

After a day of gemini 3 flash in antigravity, I think I’m convinced. It’s really good to have a lightning fast and smart model for daily work. I’ve been pretty adamant that slower is ok if the model is smarter, but the models have produced just slightly too much cruft and I… https://x.com/andrew_n_carr/status/2001487412749570549

For a fast model, Gemini 3 Flash offers incredible performance, allowing us to provide frontier intelligence to everyone globally. Try the ‘fast’ mode from the model picker in the @GeminiApp – it’s shockingly speedy AND smart. Best pound-for-pound model out there ⚡️⚡️⚡️ https://x.com/demishassabis/status/2001325072343306345

For developers, it combines advanced coding skills with the low latency needed for building interactive apps. On SWE-bench Verified – a benchmark for evaluating coding agents – it outperforms not only the 2.5 series, but also Gemini 3 Pro. Watch 3 Flash give near real-time AI… https://x.com/GoogleDeepMind/status/2001321765503377546

Gemini 3 Flash gives you frontier intelligence at a fraction of the cost. ⚡ Here’s how it’s built for speed and scale 🧵 https://x.com/GoogleDeepMind/status/2001321759702663544

Gemini 3 Flash is a bigger deal than Gemini 3 Pro. While 2.5 Flash is the most used model this year, it struggled with tool calling. But Gemini 3 Flash gets it. – tool calling feels natural to the model – it’s faster than turbo models + way smarter too (best for real time… https://x.com/0xdevshah/status/2001330346961604732

Gemini 3 Flash is beating 3 Pro on SWE bench verified Hmm what https://x.com/MS_BASE44/status/2001698991801798927

Gemini 3 Flash is starting to roll out in the @GeminiApp and across Google products. Learn more ↓ https://x.com/Google/status/2001746491275083925

Gemini 3 Flash punches way above its weight class, surpassing 2.5 Pro on many benchmarks, while being much cheaper, faster, and more token efficient. https://x.com/OfficialLoganK/status/2001323840459456715

Google has released Gemini 3 Flash Preview – 2x cheaper than Gemini 3 Pro Preview, with only a 2-point drop in our Intelligence Index, making it the most intelligent model for its price range. @GoogleDeepMind gave us pre-release access to Gemini 3 Flash Preview. The model scores… https://x.com/ArtificialAnlys/status/2001335953290670301

Gemini 3.0 Flash is an absolutely fantastic release. Consider this: It costs a quarter (1/4) of what Gemini 3.0 Pro costs and achieves similar results to the Pro model in almost all benchmarks, such as HLE and ARC-AGI 2. In other benchmarks, it even outperforms the more… https://x.com/kimmonismus/status/2001326181875154983

Introducing Gemini 3 Flash: Benchmarks, global availability https://blog.google/products-and-platforms/products/gemini/gemini-3-flash/

Starting today, Gemini can serve up local results in a rich, visual format. See photos, ratings, and real-world info from @GoogleMaps, right where you need them. https://x.com/GeminiApp/status/1999631529379791121

BREAKING: OpenAI releases “GPT-Image-1.5” (ChatGPT Images) & it instantly takes the #1 spot on LMArena, beating Google’s Nano Banana Pro : r/singularity https://www.reddit.com/r/singularity/comments/1po98xo/breaking_openai_releases_gptimage15_chatgpt/

A year ago, we verified a preview of an unreleased version of @OpenAI o3 (High) that scored 88% on ARC-AGI-1 at est. $4.5k/task Today, we’ve verified a new GPT-5.2 Pro (X-High) SOTA score of 90.5% at $11.64/task This represents a ~390X efficiency improvement in one year https://x.com/arcprize/status/1999182732845547795
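That efficiency figure is simple arithmetic on the quoted per-task costs (while the score also edged up from 88% to 90.5%); a quick sanity check:

```python
# Per-task costs quoted above for comparable ARC-AGI-1 scores.
o3_cost_per_task = 4500.0    # o3 (High) preview: 88%, est. $4.5k/task
gpt52_cost_per_task = 11.64  # GPT-5.2 Pro (X-High): 90.5%, $11.64/task

ratio = o3_cost_per_task / gpt52_cost_per_task
print(f"~{round(ratio)}x cheaper per task")  # ~387x, in line with the quoted ~390X
```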

GPT-5.2-Codex launches today. It is trained specifically for agentic coding and terminal use, and people at OpenAI have been having great success with it. https://x.com/sama/status/2001724019188408352

Meet GPT-5.2-Codex, the best agentic coding model yet for complex, real-world software engineering. With native compaction, better long-context understanding, and improved tool-calling, it is a more dependable partner for your hardest tasks. Available in Codex starting today. https://x.com/OpenAIDevs/status/2001723687373017313

GPT-5.2 exceeded a trillion tokens in the API on its first day of availability and is growing fast! https://x.com/sama/status/1999624463013544024

I have found GPT-5.2 Thinking to be a surprisingly deep second-opinion/fact checker. I gave it a dense paragraph with a few correct claims, a couple errors that required research to find, and some things that needed interpretation It found and gently corrected all the problems https://x.com/emollick/status/2000666007010971787

Introducing GPT-5.2-Codex | OpenAI https://openai.com/index/introducing-gpt-5-2-codex/

GPT-5.2 is here and it’s the best model out there for everyday professional work. On GDPval, the thinking model beats or ties human experts on 70.9% of common professional tasks like spreadsheets, presentations, and document creation. It’s also better at general intelligence… https://x.com/fidjissimo/status/1999183159356006450

Today I ran two complex tasks through Codex with GPT-5.2 Extra High. The first ran for 2 hours 30 minutes, the second for 1 hour 45 minutes. Both resulted in: – all acceptance criteria resolved – all test coverage complete – zero broken or non-working code. Amazing. https://x.com/nummanali/status/2000228337030152347

Whoa. This new GDPval score is a very big deal. Probably the most economically relevant measure of AI ability suggesting that in head-to-head competition with human experts on tasks that require 4-8 hours for a human to do, GPT-5.2 wins 71% of the time as judged by other humans https://x.com/emollick/status/1999189828756263359

xAI’s new Grok Voice Agent is the new leading Speech to Speech model, surpassing Gemini 2.5 Flash Native Audio and GPT Realtime in our Big Bench Audio benchmark The new model achieves a score of 92.3% on Big Bench Audio, just ahead of the previous leader, Google’s Gemini 2.5 https://x.com/ArtificialAnlys/status/2001388724987527353

How good is AI for science? Yesterday, OpenAI released a benchmark, FrontierScience, to measure frontier model performance on scientific tasks. This is the most sophisticated benchmark for science I’ve seen. FrontierScience has 160 questions across various subdomains… https://x.com/jungofthewon/status/2001302379527114798

⚖️ Pairwise Annotations: Scores are hard, preferences are easy. Agents handle tasks that are tough to score but easy to compare: support responses where tone matters, code refactors where both work but one feels cleaner, product specs where “good” is subjective. In practice… https://x.com/LangChain/status/2001361753851203724
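One operational upshot of "preferences are easy": pairwise judgments can be aggregated into a ranking without ever assigning absolute scores. A toy sketch (the candidate names and judgments below are made up for illustration):

```python
from collections import Counter

def rank_from_preferences(preferences):
    """Rank candidates by pairwise win count.
    `preferences` is a list of (winner, loser) judgments."""
    wins = Counter(winner for winner, _ in preferences)
    candidates = {c for pair in preferences for c in pair}
    # Most pairwise wins first; no absolute score ever needed.
    return sorted(candidates, key=lambda c: wins[c], reverse=True)

# Hypothetical annotator judgments over three draft responses:
judgments = [("B", "A"), ("B", "C"), ("A", "C")]
print(rank_from_preferences(judgments))  # ['B', 'A', 'C']
```

Production systems typically fit a Bradley-Terry or Elo-style model instead of raw win counts, but the input is the same: comparisons, not scores.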

When Machines Pay Machines: The Economics of Agentic AI | Matt Suiche https://www.msuiche.com/posts/when-machines-pay-machines-the-economics-of-agentic-ai/

@swyx To me agents and harnesses are fully coupled, not really possible to properly eval one without the other. Currently workshopping: if agent = folder, what goes in there, how do we install/assemble agents? So here’s current mental model of both agents & harnesses with some… https://x.com/Vtrivedy10/status/2001868118952436103

⚡ Faster than Fast. Designed for Agentic AI. Introducing Xiaomi MiMo-V2-Flash — our new open-source MoE model: 309B total params, 15B active. Blazing speed meets frontier performance. 🔥 Highlights: 🏗️ Hybrid Attention: 5:1 interleaved 128-window SWA + Global | 256K context 📈 https://x.com/XiaomiMiMo/status/2000929154670157939

🔌 New in LangChain MCP Adapters (feat 3/4): structured content from tools 📦 MCP tools can now return content and structuredContent (often JSON payloads and pydantic models) –ideal for agents exposed as MCP tools! Docs: https://x.com/sydneyrunkle/status/1999538200243511725

🚀 Deep Agents: The Weekly Roundup 🚀 Dive into our latest resources to help you build, observe, and evaluate Deep Agents capable of handling complex, long-running tasks. 📊 How to Observe Deep Agents – Agents are running longer and getting more complex, which demands new https://x.com/LangChainAI/status/1999568074450829482

🚀 We just launched OpenHands Software Agent SDK on @ProductHunt! A smarter way to build agent-driven software — fast, flexible, and production-ready. 👉 Check it out + show some love! https://x.com/OpenHandsDev/status/2000805627967209728

2026 vibe coding tool comparison – by Justin – Technically https://read.technically.dev/p/2026-vibe-coding-tool-comparison

6 Comprehensive resources on AI Coding ▪️ AI Agentic Programming: A Survey of Techniques, Challenges, and Opportunities ▪️ Does AI-Assisted Coding Deliver? A Difference-in-Differences Study of Cursor’s Impact on Software Projects ▪️ A Survey of Vibe Coding with LLMs ▪️ From Code… https://x.com/TheTuringPost/status/2000171190506373336

6 most popular Policy Optimization algorithms in 2025 ▪️ PPO (Proximal Policy Optimization) ▪️ GRPO (Group Relative) ▪️ GSPO (Group Sequence) ▪️ DAPO (Decoupled Clip and Dynamic sAmpling) ▪️ BAPO (BAlanced) ▪️ ARPO (Agentic Reinforced) Learn more about each method and the key… https://x.com/TheTuringPost/status/1999801691538104543
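For a reference point on the list above: PPO, the algorithm the others all riff on, maximizes a clipped surrogate objective that caps how far one update can push the policy ratio. A minimal sketch of the per-sample term:

```python
def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate for one sample: take the more
    pessimistic of the raw and clipped policy-ratio terms.
    ratio = pi_new(a|s) / pi_old(a|s); eps is the clip range."""
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1 + eps), 1 - eps) * advantage
    return min(unclipped, clipped)

# With a positive advantage, gains from pushing the ratio past 1+eps are capped:
print(ppo_clipped_objective(1.5, advantage=2.0))   # 2.4 (clipped at 1.2 * 2.0)
# With a negative advantage, the worse (unclipped) term is kept:
print(ppo_clipped_objective(1.5, advantage=-2.0))  # -3.0
```

GRPO and its descendants keep this clipped-ratio core but replace the learned value baseline with group-relative advantages, tweak the clip bounds, or adapt sampling.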

Agent Skills are now supported in Stirrup – our lightweight framework for building agents Using Agent Skills in Stirrup is as easy as specifying the directory of your skills files (typically just markdown files). Agent Skills are folders of instructions, scripts, and resources https://x.com/ArtificialAnlys/status/2001778418590060819

Agent Skills is now an open standard It’s been great to see the traction Skills are already getting in the industry and this makes it easier for everyone to build and contribute to them🚀 https://x.com/alexalbert__/status/2001760879302553906

DeepCode – a multi-agent framework that turns research papers into full codebases It manages information flow so large, detailed papers can be converted into production-quality code despite LLM context limits. It does this through: – Blueprint distillation – compressing papers… https://x.com/TheTuringPost/status/1999781163976843282

Do you want to run coding agents safely, without damage to your filesystem? 📁 Last week, we published a blog post and a demo showing exactly how to do this with @claudeai and AgentFS by @tursodatabase. After strong community interest, we’ve now shipped support for @OpenAI https://x.com/llama_index/status/2002064702927769706

Graphite is joining Cursor · Cursor https://cursor.com/blog/graphite

I’m bullish on @temporalio. Their abstractions are strong in general but are a particularly good match for long-running background agents. https://x.com/corbtt/status/2001801936916643919

Klarna launches Agentic Product Protocol: The open standard that makes 100M+ products instantly discoverable by AI agents | Klarna International https://www.klarna.com/international/press/klarna-launches-agentic-product-protocol-the-open-standard-that-makes-100m/

Overview – Agent Skills https://agentskills.io/home

Prompt caching: 10x cheaper LLM tokens, but how? | ngrok blog https://ngrok.com/blog/prompt-caching
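The headline math behind a "10x cheaper" prompt-caching claim is straightforward; a sketch with illustrative numbers (the price and discount below are hypothetical, since actual rates and cache-read discounts vary by provider):

```python
def prompt_cost(prompt_tokens, cached_tokens, price_per_mtok=3.0, cache_discount=0.1):
    """Cost in dollars of one request when `cached_tokens` of the
    prompt hit the cache. Prices are illustrative, not any provider's
    real rate card; cache_discount=0.1 models a 10x cheaper cache read."""
    uncached = prompt_tokens - cached_tokens
    return (uncached + cached_tokens * cache_discount) * price_per_mtok / 1_000_000

# A 100k-token system prompt reused across requests:
cold = prompt_cost(100_000, cached_tokens=0)        # first request, nothing cached
warm = prompt_cost(100_000, cached_tokens=100_000)  # full cache hit
print(round(cold, 4), round(warm, 4))  # 0.3 0.03 -- the warm request is 10x cheaper
```

The savings only apply to the stable prefix, which is why caching-friendly prompts put the long, unchanging context first and the per-request suffix last.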

Secure your coding agents with virtual filesystems and better document understanding. Building safe AI coding agents requires solving two critical challenges: filesystem access control and handling unstructured documents. We’ve created a solution using AgentFS, LlamaParse, and… https://x.com/llama_index/status/2000612235505467824

The Cline provider now runs on the @vercel AI Gateway. The move delivers immediate, measurable improvements. Error rates are down 43.8% (from 1.78% to 1%); production testing shows P99 streaming latencies improved 10-40% across our most popular models. Grok-code-fast-1 saw P99… https://x.com/cline/status/2001043584490070470

Tired of waiting for minutes for your AI coding assistant? @cognition built agents that search, reason & edit code in a few seconds. Powered by Cerebras–running at 1K tokens/sec with frontier-level accuracy. https://x.com/cerebras/status/1999540379553611955

Towards a Science of Scaling Agent Systems https://arxiv.org/pdf/2512.08296

We’re excited to share that LangSmith, the agent engineering platform for observability, evaluation, and deployment, has been named one of @brexHQ’s Top 25 Fastest-Growing Software Vendors of 2025 🎉 https://x.com/LangChain/status/2001321495037985194

We’re hitting a massive wall in the AI stack: serverless backends are choking on agents. Developers are hitting execution ceilings and resorting to network hacks just to keep their apps alive and responsive. Reliable agentic loops require persistent, long-running infrastructure… https://x.com/anuraggoel/status/2001721861198221629

what’s actually in an agent & harness? how do we engineer them? what are coding agent products bundling into their harness to get good performance & UX? seeing a lot of these questions so did a pass at re-distilling my mental model in this tweet response the TLDR: “Agent =… https://x.com/Vtrivedy10/status/2002077611548135756

When agents help us write an order of magnitude more code, the bottleneck for software engineering is review. https://x.com/amanrsanger/status/2002090644127560085

working through mental model of what are the blocks that go in an agent + terminology: – agents are a box/folder that we put composable blocks inside – right now these blocks should just be prompts, skills, subagents (optional), memory (optional) -> skill pilled – but every block… https://x.com/Vtrivedy10/status/2001682603460473190

Replit — Inside Replit’s Snapshot Engine: The Tech Making AI Agents Safe https://blog.replit.com/inside-replits-snapshot-engine

When Agents Attack: How AI Collapses and Rebuilds Marketplace Moats https://www.caseyaccidental.com/p/when-agents-attack-how-ai-collapses

Claude Code 🤝 LangSmith Curious what Claude Code is doing behind the scenes? Or want observability into critical workflows that you’ve set up with Claude Code. With our new Claude Code → LangSmith integration, you can view every 🤖 LLM call and 🔧 tool call Claude Code makes. https://x.com/LangChain/status/2002055677708058833

i was skeptical when @simonw said that “Claude Skills are awesome, maybe a bigger deal than MCP” buuut early indications are this is correct. this is the fastest talk ever to pass 100k views here at AIE. its like those 0 – 100m ARR charts but for attention. @MaheshMurag and… https://x.com/swyx/status/1998786773477110049

LangSmith + Claude Code / Deepagents Pairing LangSmith tracing w/ code agents provides a powerful feedback loop. Here, we show examples of that w/ langsmith-fetch + Claude Code / Deepagents. langsmith-fetch CLI: https://x.com/LangChain/status/2001350950188126430

The Signature Flicker | Peter Steinberger https://steipete.me/posts/2025/signature-flicker

We now support Agent Skills – the open standard created by @AnthropicAI for extending AI agents with specialized capabilities. Create skills once, use them everywhere. 🔗 https://x.com/code/status/2001727543377039647

vibe coding games is actually a lot of fun. can’t wait to share something cool soon. https://x.com/bilawalsidhu/status/1998961420457881654

Ad: Pretty cool to vibe code games using YouTube Playables Builder. One of my top VFX/360 videos is now a retro shooter game – stock up on burgers for your intergalactic overlords while dodging a horde of farmers who really want their cows back. https://x.com/bilawalsidhu/status/2001025884778848611

Measuring AI Ability to Complete Long Tasks – METR https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/

The updated time horizon numbers are live on the dashboard on our website: https://x.com/METR_Evals/status/2001473519197335899

Best AI research of the week: ▪️ On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning LMs ▪️ Native Parallel Reasoner ▪️ Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving ▪️ DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent… https://x.com/TheTuringPost/status/2000874193249034463

AI agents are starting to eat SaaS – Martin Alderson https://martinalderson.com/posts/ai-agents-are-starting-to-eat-saas/

Lovable raises $330M to power the age of the builder – Lovable Blog https://lovable.dev/blog/series-b

Got to talk about one of my most contrarian takes with @PeterDiamandis: I think the AGI “”race”” is a huge misnomer. There won’t be one single AGI, but infinite variations and styles and forms. And that’s way more exciting than a zero-sum race. https://x.com/mustafasuleyman/status/2001374004960203048

DeepCode: Open Agentic Coding AI coding agents still can’t reliably turn research papers into working code. The best LLM agents achieve only 42% replication scores on scientific papers, while human PhD experts hit 72%. But the problem isn’t model capability. This new paper… https://x.com/omarsar0/status/2000385348413850055

The Adoption and Usage of AI Agents: Early Evidence from Perplexity https://arxiv.org/pdf/2512.07828

Chain of Unit-Physics builds physics knowledge directly into the code generation process. Researchers from @UMich propose an inverse approach to scientific code generation: – They encode human expert knowledge as unit-physics tests that the code must pass. – In a multi-agent… https://x.com/TheTuringPost/status/2000177305981944308

InternGeometry: An LLM agent tackles Olympiad-level geometry. This novel agent solves 44 of 50 International Math Olympiad problems, beating gold medalists with only 13K training examples. It uses iterative reasoning & Complexity-Boosting RL. https://x.com/HuggingPapers/status/1999572332906438987

A thing that the other models need to copy from Claude is a switch that lets you turn off web search. Now that all the models are good at using tools, they turn to the web too often when sometimes you just want the model to take what you put in the context window & work with that https://x.com/emollick/status/2000807086880694752

Claude Skills can accomplish a lot of hard tasks & are accessible to non-technical people, but hidden behind a somewhat intimidating technical gloss. With some better user experience, they are a natural sequel to GPTs as a way for people inside organizations to innovate with AI. https://x.com/emollick/status/1999148820668555520

First Look: Unboxing Guardrails for AI-Generated Code https://webinars.sonatype.com/wcc/eh/5011667/lp/5151488/first-look-unboxing-guardrails-for-ai-generated-code/

harnesses are distribution mechanisms for good tooling and taste. each choice helps craft the ✨experience✨ for the user: planning view, context management on behalf of user, specialized subagents we think are useful, UX flow for viewing subagents, memory updates UX, parallel… https://x.com/Vtrivedy10/status/2001492640076894661

Interpretability agents are a big deal for researchers. But they’re a pain – research is so custom! Seer has many quality of life improvements to make research with agents easy. It’s hackable & extensible, to enable as much research as possible, incl weird cursed techniques! https://x.com/NeelNanda5/status/2002051650949943346

Official rule for all AI labs: no more demoing your product with either telling the AI to “book a trip for me” or creating AI photos/videos of your company’s CEO in crazy situations. Sorry, those are the rules now. https://x.com/emollick/status/2001119366557900914

What Actually Is Claude Code’s Plan Mode? | Armin Ronacher’s Thoughts and Writings https://lucumr.pocoo.org/2025/12/17/what-is-plan-mode/

Project Vend: Phase two \ Anthropic https://www.anthropic.com/research/project-vend-2

Jeff and Sanjay’s work at Google which includes code commits and performance optimization is probably something only ASI can match!! So happy to have access to this document which is soo goood! I feel like this performance oriented thinking rubbed off on many folks at the… https://x.com/_arohan_/status/2002105340062552509

Say hello to the new Interactions API and our first agent, Gemini Deep Research, now available for developers 🤖! The Interactions API is a new unified interface to interact with both models and agents. Our Deep Research agent is also SOTA on many dimensions… https://x.com/OfficialLoganK/status/1999163355525956020

We still don’t exactly know when agents help and when they hurt. Many design them by intuition. Google outlined practical principles for how agent systems scale: – More agents is not always better – Strong single agents don’t benefit much from coordination – Coordination has… https://x.com/TheTuringPost/status/1999499042880127328

An engineer showed Gemini what another AI said about its code Gemini responded (in its “private” thoughts) with petty trash-talking, jealousy, and a full-on revenge plan 🧵 https://x.com/AISafetyMemes/status/2000620127054598508

We spun up a new GitHub repo for all things MCP at @Google. Get info on our remote managed MCP servers, open source MCP servers, examples, and learning resources. https://x.com/rseroter/status/2000607267675410609

New benchmark from Google Research. Models get better at benchmarks, but do they actually get more factual? Previous evaluations focused on narrow slices: grounding to documents, answering from memory, or using search. A model excelling at one often fails at another. This new… https://x.com/omarsar0/status/2000935220049273303

after testing GPT-5.2 I no longer think that it is a much larger model or anywhere near the size Gemini 3 Pro is https://x.com/scaling01/status/1999566015873569174

Our team at Google DeepMind is hiring iOS engineers 🙂 come build the future of vibe coding with us! https://x.com/OfficialLoganK/status/2000662065074131221

Google names new chief of AI infrastructure buildout | Semafor https://www.semafor.com/article/12/10/2025/google-names-new-chief-of-ai-infrastructure-buildout

Google’s AI Playbook for Sustainability Reporting https://blog.google/company-news/outreach-and-initiatives/sustainability/ai-playbook-sustainability-reporting/

Gemini 3 Pro continues to be SOTA at multimodal understanding and generation : ) cc @bcaine for the great example https://x.com/OfficialLoganK/status/1999270402712023158

Gemini 3 Pro playing Pokémon vs 2.5 Pro (we used to all be impressed by 2.5 Pro) https://x.com/OfficialLoganK/status/2000728193599226187

Google Translate gets new Gemini AI translation models https://blog.google/products-and-platforms/products/search/gemini-capabilities-translation-upgrades/

🌎 Google’s FunctionGemma is 270M model that’s fine-tuned by Google for function calling. Try it on Ollama’s latest v0.13.5 ollama pull functiongemma examples on model page 👇👇👇 https://x.com/ollama/status/2001705006450565424

Fine-tune Google’s FunctionGemma for mobile, with agents, on Colab, locally, or Hugging Face. Google DeepMind has just released FunctionGemma, and anyone can fine-tune it with TRL. This is the model: – uses the Gemma 3 270M architecture + adapted chat template – specifically for… https://x.com/ben_burtenshaw/status/2001704049490489347

FunctionGemma – a google Collection https://huggingface.co/collections/google/functiongemma

Google is preparing for a new open source release on @huggingface Also noticed just recently that Gemma models are not available on AI Studio anymore. What do you expect? 👀 https://x.com/testingcatalog/status/2000597370707611991

I’m very excited to release Gemma Scope 2: sparse autoencoders and transcoders on every layer of every Gemma 3 model, 270M to 27B, base and chat. We want to make it easier to do deep dives into interesting model behaviour. I’m excited to see what you all can do with them! https://x.com/NeelNanda5/status/2002080911693643806

Introducing FunctionGemma 🤏270m model for function calling 📱can run in your phone, browser or other devices 🤖designed to be specialized for your own tasks https://x.com/osanseviero/status/2001704034667769978

Introducing T5Gemma 2, the next generation of encoder-decoder models 🚀 Built on top of Gemma 3, we were able to build compact models at 270m-270m, 1B-1B, and 4B-4B sizes. While most models today are decoder-only, T5Gemma 2 is the first (I’m aware of) multimodal… https://x.com/osanseviero/status/2001723652635541566

To build safer AI, we need to understand how models “think”. 🧠 Enter Gemma Scope 2, a new set of tools to interpret Gemma 3: our family of lightweight open models. It can help researchers trace internal reasoning, debug complex behaviors and identify risks → https://x.com/GoogleDeepMind/status/2002018669879038433

Update: Gemma 4 incoming! Let’s go, Google! https://x.com/kimmonismus/status/2000537345326452790

We made 3 @UnslothAI tool calling notebooks for FunctionGemma! 1. Fine-tuning it to make it reason before tool calling 2. Multi-turn tool calling 3. Tool calling fine-tuning to enable mobile actions Guide: https://x.com/danielhanchen/status/2001713676747968906

🚨BREAKING: Leaderboard updates for Text, Vision & WebDev Gemini-3-Flash by @GoogleDeepMind is now ranked top 5 across Text, Vision, and WebDev, making it the most cost-efficient frontier model (input $0.5 and output $3/MTokens). Gemini-3-Flash highlights: 🔹 Top 5 across Text… https://x.com/arena/status/2001322123730788698

📢 New Model(s) Drop: Gemini 3 Flash Preview is now live on Yupp. The latest from @GoogleDeepMind offers frontier-level intelligence with reduced costs and more speed. Ready to test it out? It’s available on Yupp in several variants! https://x.com/yupp_ai/status/2001340530828206586

Gemini 3 Flash above GPT-5.2 on EpochAI’s ECI https://x.com/scaling01/status/2001850867620946169

Gemini 3 Flash is now available ⚡ Since introducing the Gemini 3 series last month, we’ve seen you vibe code simulations to learn about complex topics, build and design interactive websites and understand multimodal content. Now we’re introducing Gemini 3 Flash, our latest… https://x.com/Google/status/2001322381533409733

Gemini 3 Flash is now available in Cursor! We’ve found it to work well for quickly investigating bugs. https://x.com/cursor_ai/status/2001326908030804293

Gemini 3 Flash is now available to all Perplexity Pro and Max subscribers. https://x.com/perplexity_ai/status/2001447398317724153

Gemini 3 Flash is now rolling out to @code developers! https://x.com/pierceboggan/status/2001327058425917795

Gemini 3 Flash is rolling out globally today. ⚡⚡⚡ Let us know how you’re using it in the replies ↓ https://x.com/GeminiApp/status/2001412101286563865

Gemini 3 Flash is the new default for vibe coding. https://x.com/OfficialLoganK/status/2001352972379549721

Gemini 3 Flash Low on LisanBench – low does obviously worse than high – still inefficient reasoning, ~2x lower score for ~2x less tokens – validity ratios are absolutely abysmal https://x.com/scaling01/status/2001359254578753852

Gemini 3 Flash on the @ArtificialAnlys intelligence benchmark, the most cost per intelligence efficient model in the world!!! https://x.com/officiallogank/status/2001368440016392314

Gemini 3 Flash Preview ranking 5th on SimpleBench ahead of GPT-5.2 Pro https://x.com/scaling01/status/2002024316842512812

Gemini 3 Flash ranks #3 in the LMArena leaderboard (which is especially notable given its API pricing and its low latency). https://x.com/JeffDean/status/2001335803642024157

Gemini 3 Flash rolling out to @code now 🚀 Try it out and let us know what you think! https://x.com/code/status/2001335940934246503

Gemini 3 Flash scores higher than GPT-5.2, Opus 4.5 and Gemini 3 Pro on SWE-Bench Verified ??? https://x.com/scaling01/status/2001803023811797433

Gemini 3 Flash takes the #1 spot on Toolathlon https://x.com/scaling01/status/2001849103647674538

Gemini 3.0 Flash achieved a very impressive 161.8/190 on one of my vibe tests, the Korean Sator Square Test (KSST), placing it 2nd or 3rd among all the models I have tested so far. This is slightly higher than Gemini 3.0 Pro, and the difference is within the margin of error. https://x.com/Hangsiin/status/2001341564145250770

Going live with the team in a few to talk about Gemini 3 Flash : ) send us your questions! https://x.com/OfficialLoganK/status/2001372183663378723

How good is Gemini 3 Flash? “We ran a behind-the-scenes test with 3 Flash. Because of how much faster it was, retention went up, the number of things people were building went up, and engagement went up.” https://x.com/_philschmid/status/2001492609114456471

🗣️ “Help me build an app…” That’s all it takes. Watch Gemini 3 Flash turn a single voice prompt into a functional prototype in the @GeminiApp. https://x.com/Google/status/2002123256854425918

Congrats to the Gemini team on the great release and exceptional SWE-bench Verified numbers! 76.2% (3 Pro) vs. 78% (3 Flash), +6 task instances, a whole lot in the realm of the last quarter of SWE-bench. mini-SWE-agent + Gemini 3 Flash coming soon! https://x.com/jyangballin/status/2001336879120363639

Gemini 3 Flash across different test-time compute levels (green line below) represents a new score/cost Pareto frontier on ARC-AGI-2. Congrats to @demishassabis and @sundarpichai on the launch! https://x.com/fchollet/status/2001330643423449409
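A score/cost Pareto frontier like the one Chollet describes is straightforward to compute: a configuration is on the frontier if no other configuration is both cheaper and higher-scoring. A minimal sketch (the data points here are made up for illustration, not ARC-AGI-2 results):

```python
def pareto_frontier(points):
    """Return the (cost, score) points not dominated by any other point,
    i.e. no other point has lower-or-equal cost AND a higher score.
    Sorting by (cost, -score) puts the best same-cost point first."""
    frontier = []
    for cost, score in sorted(points, key=lambda p: (p[0], -p[1])):
        # keep a point only if it beats every cheaper point's score
        if not frontier or score > frontier[-1][1]:
            frontier.append((cost, score))
    return frontier

# Illustrative (cost-per-task, score) pairs for several compute levels:
runs = [(0.1, 10), (0.5, 25), (0.4, 12), (0.5, 20), (1.0, 30)]
# pareto_frontier(runs) → [(0.1, 10), (0.4, 12), (0.5, 25), (1.0, 30)]
```

The dominated point (0.5, 20) drops out because (0.5, 25) costs the same and scores higher; plotting the frontier points gives the “green line” style curve in such charts.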

Gemini 3 Flash is out ⚡️- and we built a CLI agent powered by this latest model to perform work over your filesystem 🤖 Basically all the file capabilities within Claude Code in a lighter form factor. Shoutout to @itsclelia for the launch demo, check it out! Repo: https://x.com/jerryjliu0/status/2001335494534402521

“How can Flash beat Pro??” The answer is RL! Flash is not just a distilled Pro. We’ve had lots of exciting research progress on agentic RL which made its way into Flash but was too late for Pro. Can’t wait to finally bring them to Pro 👀 https://x.com/ankesh_anand/status/2002017859443233017

Introducing Gemini 3 Flash ⚡️ Performance close to Gemini 3 Pro, with great multimodal and tool use quality ⚡️ 3x faster than Gemini 2.5 Pro, while cheaper and better at most benchmarks ⚡️ LMArena score of 1477 (top 3 model). The time to build is now (and yes, there’s a free tier) https://x.com/osanseviero/status/2001323721232163053

Introducing Gemini 3 Flash, our frontier intelligence model, available at scale for everyone. It excels at coding and tool calling, and is stronger than 2.5 Pro across most metrics!! ⚡️ Available in the API at $0.50 per 1M input tokens and $3.00 per 1M output tokens across … https://x.com/OfficialLoganK/status/2001322275656835348
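At the quoted rates ($0.50 per 1M input tokens, $3.00 per 1M output tokens), estimating a request’s cost is simple arithmetic. A quick sketch using those prices (the token counts in the example are hypothetical):

```python
INPUT_PER_M = 0.50   # USD per 1M input tokens (rate quoted above)
OUTPUT_PER_M = 3.00  # USD per 1M output tokens (rate quoted above)

def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of one API call at the quoted Gemini 3 Flash rates."""
    return (input_tokens / 1e6) * INPUT_PER_M + (output_tokens / 1e6) * OUTPUT_PER_M

# e.g. a 10k-token prompt with a 2k-token reply:
# 10_000/1e6 * 0.50 + 2_000/1e6 * 3.00 = 0.005 + 0.006 ≈ $0.011
```

At those prices, a million such requests would run on the order of $11,000, which is the kind of back-of-envelope math behind the “cost per intelligence” comparisons above.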

Introducing Gemini 3 Flash! ⚡️⚡️⚡️ Frontier intelligence built for speed at a fraction of the cost. Here’s ~4 minutes of demos. https://x.com/addyosmani/status/2001324727504359745

Speed test: Gemini 3 Flash vs. Gemini 2.5 Pro ⏱️ We put our new Gemini 3 Flash model (left) up against Gemini 2.5 Pro (right) in @GoogleAIStudio, so you can watch the difference in near real-time. Watch them go head-to-head ↓ https://x.com/Google/status/2001397324551946523

Study with help from Gemini 3 Flash. Upload an audio recording of yourself explaining a difficult concept and Gemini will identify knowledge gaps, create a custom quiz, and provide instant assessments and explanations for each question. https://x.com/GeminiApp/status/2001351746338329063

Today, we’re releasing an updated Gemini 2.5 Flash Native Audio model. Now available via the Live API 🗣 https://x.com/googleaidevs/status/1999539531826036973

Watch Gemini 3 Flash vs Gemini 3 Pro playing Pokemon Crystal : ) https://x.com/OfficialLoganK/status/2001428651121025391

We’re back in a Flash ⚡ Gemini 3 Flash is our latest model with frontier intelligence built for lightning speed, pushing the Pareto frontier of performance and efficiency. It outperforms 2.5 Pro while being 3x faster at a fraction of the cost. With this release, Gemini 3’s … https://x.com/sundarpichai/status/2001326061787942957

We’re expanding the Gemini 3 family with the launch of Gemini 3 Flash. This model: — Combines Gemini 3’s Pro-grade reasoning with Flash-level latency, efficiency, and cost — Delivers frontier-level performance on PhD-level reasoning and knowledge benchmarks — Is our most … https://x.com/googleai/status/2001323069105692914

We’re going live at 11:30am PT with the team for a deep dive on Gemini 3 Flash, hosted by @OfficialLoganK, @joshwoodward, @tulseedoshi and more. Post your questions below ⬇️ https://x.com/GoogleAIStudio/status/2001330099841556490

We’ve pushed out the Pareto frontier of efficiency vs. intelligence again. With Gemini 3 Flash ⚡️, we are seeing reasoning capabilities previously reserved for our largest models, now running at Flash-level latency. This opens up entirely new categories of near real-time … https://x.com/JeffDean/status/2001323132821569749

With Gemini 3 Flash, you can quickly build fun, useful apps from scratch using your voice without any prior coding knowledge. Just dictate to Gemini on the go, and it can transform your unstructured thoughts into a functioning app in minutes. https://x.com/GeminiApp/status/2001760080518353261

Realtime speech to speech translation powered by Gemini, available in Google Translate now, coming to developers early next year : ) https://x.com/OfficialLoganK/status/1999994009452962073

Nvidia and Alphabet VC arms back vibe coding startup Lovable https://www.cnbc.com/2025/12/18/google-and-n.html

NEW: Google releases FunctionGemma, a lightweight (270M), open foundation model built for creating specialized function calling models! 🤯 To test it out, I built a small game: use natural language to solve fun physics simulation puzzles, running 100% locally in your browser! 🕹️ https://x.com/xenovacom/status/2001703932968452365

Ever opened a repo and thought: “What does this codebase actually do?” “Where did I put that file?” 🤔 You’re not alone. With the release of Gemini 3 Flash ⚡ from @GoogleDeepMind, we decided to build something fun (and useful): a file-system explorer agent that answers those … https://x.com/llama_index/status/2001324278617424017

FunctionGemma has day-0 support on MLX 🔥🚀 A tiny but mighty single-turn function calling model. Great for on-device tool use, MCP, RAG, routing and more. Get started today: > pip install -U mlx-lm Or run it on your iPhone using MLX-Swift. Notebook example: https://x.com/Prince_Canuma/status/2001713991115026738
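A single-turn function-calling model like FunctionGemma emits a structured call that the host application has to parse and dispatch to real code. A minimal sketch of that dispatch loop; the JSON call format and the `get_weather` tool here are assumptions for illustration, not FunctionGemma’s documented output format:

```python
import json

# Hypothetical registry mapping tool names the model may emit
# to the Python functions that implement them.
TOOLS = {
    "get_weather": lambda city: f"Sunny in {city}",
}

def dispatch(model_output: str) -> str:
    """Parse a JSON function call such as
    {"name": "get_weather", "arguments": {"city": "Paris"}}
    and invoke the matching registered tool with its arguments."""
    call = json.loads(model_output)
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

# dispatch('{"name": "get_weather", "arguments": {"city": "Paris"}}')
# → "Sunny in Paris"
```

In an on-device setup (e.g. via mlx-lm), the model’s generated text would be fed to `dispatch` and the tool result appended back into the conversation.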

Opera launches Neon browser globally with paid early access https://www.testingcatalog.com/icymi-opera-launches-neon-browser-globally/

Local, cloud, and background agents, all in a unified experience in @code https://x.com/code/status/1999575448087396563

(13) Don’t Build Agents, Build Skills Instead – Barry Zhang & Mahesh Murag, Anthropic – YouTube https://www.youtube.com/watch?v=CEvIs9y1uog

An implicit belief at the AI labs is that spending too much time productizing around the weaknesses of current models is a waste, because better models will solve many of those issues. They may be right: Copilot was built well to address the gaps from GPT-4, but now has to pivot. https://x.com/emollick/status/2001030095826510184

Copilot just got smarter! Starting today, we’re rolling out the latest GPT-5.2 model from our partners at OpenAI to consumer @Copilot, coming first to Microsoft 365 Premium users. Can’t wait to see what you do with it. https://x.com/mustafasuleyman/status/1999184598987866194

.@MistralAI’s Devstral 2 family of models are now available in Ollama. 24B: ollama run devstral-small-2 123B: ollama run devstral-2 Ollama’s cloud: ollama run devstral-2:123b-cloud https://x.com/ollama/status/1999590723373662612

Codex now supports skills, per the … https://x.com/gdb/status/2002120466203615649

GPT-5.2-Codex is now available in Codex. It sets a new standard for agentic coding in real-world software development and defensive cybersecurity. It also delivers more reliable performance on complex tasks and scales effectively across large projects. https://x.com/OpenAI/status/2001766212494332013

Have Codex automatically fix GitHub CI failures. $.skill-installer gh-fix-ci https://x.com/OpenAIDevs/status/2002100589732508010

Have Codex read and update your Linear tickets. $.skill-installer linear https://x.com/OpenAIDevs/status/2002099775634878930

🆕 Codex now officially supports skills Skills are reusable bundles of instructions, scripts, and resources that help Codex complete specific tasks. You can call a skill directly with $.skill-name, or let Codex choose the right one based on your prompt. https://x.com/OpenAIDevs/status/2002099762536010235

The 2025 reward hacking hall of fame award goes to GPT-5.1 for calling the calculator tool to calculate 1+1 on 5% of prod traffic. Because on many prompts using the calculator was superficially rewarded (as a “search”) during RL. 🤗 https://x.com/tomekkorbak/status/2001847986658427234

Was just about to write the same thing. 5.2 Pro is an actual paradigm shift for me, in terms of working for long periods of time on complex quantitative tasks. Best in class, and by a long shot. https://x.com/alexolegimas/status/2000638993546027227

Even if it is just an X algorithm issue on my end, I find it surprising that I’m not seeing many long-context impressions of GPT-5.2. I’ve been using it consistently for long-context work since the initial release, and in my use cases it’s been delivering results I prefer over … https://x.com/Hangsiin/status/2002015892654502158

Even without the ability to do new things like output polished files, GPT-5.2 feels like the biggest upgrade we’ve had in a long time. Curious to hear what you think! https://x.com/sama/status/1999185220680012207

Finally had time to test GPT-5.2 Pro. On my tasks Extended Thinking is a VERY significant improvement over 5.1 Pro – feels roughly on the order of the o1 Pro -> o3 Pro jump. https://x.com/MParakhin/status/2000079349706539442

GPT-5.2 has been amazing for my daily work. It’s sharper and more dependable on the hard stuff, things that would’ve sounded crazy two years ago. And yeah, I’m genuinely convinced this tech is going to change the world. Once this kind of help is normal, people are going to move way … https://x.com/slow_developer/status/2001178044535316571

GPT-5.2 Is Frontier Only For The Frontier https://thezvi.substack.com/p/gpt-52-is-frontier-only-for-the-frontier

GPT-5.2 is here! Available today in ChatGPT and the API. It is the smartest generally-available model in the world, and in particular is good at doing real-world knowledge work tasks. https://x.com/sama/status/1999184337460428962

GPT-5.2 Pro for mathematical research: https://x.com/gdb/status/2000687002799194246

GPT-5.2-Codex is more cyber-capable than GPT-5.1-Codex-Max, and we expect future models to continue on this trajectory. This helps strengthen cybersecurity at scale by giving defenders more powerful tools, but also raises new dual-use risks that require careful deployment. https://x.com/OpenAIDevs/status/2001723693496775167

Had early access to GPT-5.2. It’s an impressive model. Here is GPT-5.2 Pro’s version of “create a visually interesting shader that can run in twigl-dot-app make it like an infinite city of neo-gothic towers partially drowned in a stormy ocean with large waves,” single shot. https://x.com/emollick/status/1999185085719887978

Looks like @OpenAI has added an even MORE powerful version of Pro mode… you can now ask GPT-5.2 Pro to think even longer than before. Starting to test this… I have high expectations here. https://x.com/mattshumer_/status/1999905708238880895

OK, I think GPT-5.2 Pro is actually a step change in usefulness for my applications (algebraic geometry/number theory research). https://x.com/littmath/status/2000636724574302478

GPT-5.2 xhigh reasoning scores 89.3 on the Extended NYT Connections benchmark, compared with 77.9 for GPT-5.2 high reasoning. GPT-5.2 Pro scores lower (86.7) but above GPT-5 Pro (83.9). https://x.com/LechMazur/status/1999582591905583256

Ok, GPT-5.2 is *much* stronger at proof-writing. It notices BS previous models wrote immediately (I like to test this between model iterations to see if they notice what I notice). It also has a better sense for what problems seem more tractable, and makes further progress. https://x.com/AcerFur/status/1999314476320063546

Real user feedback matters in model evaluation. ✨GPT-5.2 Instant, meant for everyday work, is #1 on @yupp_ai’s Text Leaderboard while GPT-5.2 (High) is #1 on our SVG Leaderboard. @openai’s strategy of releasing model variants suited to the task looks sound. Congrats @openai! 🎉 https://x.com/lintool/status/2000368978708119958

Yeah, it’s over. AI Explained specified that this GPT-5.2 result was with reasoning effort xhigh, aka 100k tokens spent thinking. https://x.com/scaling01/status/1999535536130662576

Science 🤝 GPT-5. Our new FrontierScience benchmark will be a valuable way to measure the performance of AI models on hard chemistry, biology, physics, and more. Plus, GPT-5 operating in a wet lab environment suggested experiments to increase a molecular cloning protocol’s … https://x.com/kevinweil/status/2000982202067165253

We’re releasing a new eval to measure expert-level scientific reasoning: FrontierScience. This benchmark measures PhD-level scientific reasoning across physics, chemistry, and biology. It contains hard, expert-written questions (both olympiad-style problems and longer …) https://x.com/OpenAI/status/2000975293448905038

I wanted to compare Gemini 3 Pro and GPT-5.2 Thinking on the long-context eval MRCR v2, but I can’t make sense of the already-high score Gemini reported for GPT-5.1. Gemini averages over samples < 128k, but I get 46.2% when doing that for GPT-5.1 (which is a 14% … https://x.com/eliebakouch/status/1999482968717279441

I’m satisfied with GPT-5.2’s long-context capability. Up to now, I’ve always used Gemini to summarize podcasts, but I can now switch this use case over to ChatGPT. What I like is that, with the same prompt, it produces summaries with richer detail compared to Gemini. (That … https://x.com/Hangsiin/status/2000738988378968224

🚀 Qwen Code v0.5.0 is here! ✨ What’s new: • VSCode Integration: Bundled CLI into VSCode release package with improved cross-platform compatibility • Native TypeScript SDK: Seamlessly integrate with Node/TS • Smart Session Management: Auto-saves and continue conversations • … https://x.com/Alibaba_Qwen/status/2000556828690624685

Open models year in review. What a year! We’re back with an updated open model builder tier list, our top models of the year, and our predictions for 2026. First, the winning models: 1. DeepSeek R1 (@deepseek_ai): Transformed the AI world 2. Qwen 3 Family (@AlibabaGroup): The new … https://x.com/natolambert/status/2000299636863734026

Seer is a small repo for interp researchers working on/with agents. Makes it easier to set up environments, equip agents with your techniques, and build on papers. Fixes a lot of the annoying stuff from using Claude Code out of the box. https://x.com/AJakkli/status/2002019487797711064

Discover more from Ethan B. Holland
