Image created with gemini-3.1-flash-image-preview and claude-sonnet-4-5. Image prompt: Wide-angle observational realism shot of a Chinese delivery worker on electric scooter stopped at dusty intersection of old hutongs and demolition site, checking phone screen in flat overcast light, rust-colored horse standing calmly in background near rubble pile, muted desaturated palette of concrete gray and faded teal, patient documentary framing, large white Chinese poster title text overlay reading AGENTS in top third of frame, Jia Zhangke aesthetic with decelerated stillness and one quiet surreal note
ElevenLabs secures first-of-its-kind AI Agent insurance https://elevenlabs.io/blog/aiuc-announcement
This is a new, separate estimate for LLM time-horizon doubling times, and it mostly agrees with METR. In this case: ~4.8-5.7 months. https://x.com/scaling01/status/2023350946139435357
Spotify’s Top Developers Haven’t Written Code Since December, CEO Says – Business Insider https://www.businessinsider.com/spotify-developers-not-writing-code-ai-2026-2
Introducing Manus in Your Chat: Your Personal Agent, Everywhere You Are https://manus.im/blog/manus-agents-telegram
Excited to work with Peter Steinberger to build the future of agents for everyone and to continue to improve Codex in leaps and bounds. We are committed to OSS, continuing to make OpenClaw flourish and bringing agents to life in a way that is fun, safe and highly productive. https://x.com/thsottiaux/status/2023147973421785386
There have been fair questions on whether LLM contributions to STEM are overhyped, but I’ve spoken with physicists about this result and they’ve told me it is a truly significant research contribution, roughly at the level of a solid journal paper, and GPT-5.2 played a key role. https://x.com/polynoamial/status/2022413904757035167
Introducing the WordPress AI Assistant — Now Built Into WordPress.com https://wordpress.com/blog/2026/02/17/wordpress-ai-assistant/
Introducing Sonnet 4.6 \ Anthropic https://www.anthropic.com/news/claude-sonnet-4-6
NEW: Anthropic releases Claude Sonnet 4.6. Nears Opus-level performance across coding and reasoning at Sonnet pricing ($3/$15 per million tokens). Computer use scores have gone from single digits last year to 72.5% now 📈, plus a 1M token context window. https://x.com/TheRundownAI/status/2023821446380978238
Sonnet 4.6 is the best model on GDPval. https://x.com/scaling01/status/2023819793212813604
Users preferred Sonnet 4.6 over Opus 4.5 59% of the time. https://x.com/scaling01/status/2023819403230671232
Pentagon threatens to cut off Anthropic in AI safeguards dispute https://www.axios.com/2026/02/15/claude-pentagon-anthropic-contract-maduro
141 days for Sonnet to go from 13.6% to 60.4% on ARC-AGI-2. https://x.com/scaling01/status/2023850250662969587
Sonnet 4.6 benchmarks: 79.6% SWE-Bench Verified, 58.3% ARC-AGI-2. https://x.com/scaling01/status/2023818940112327101
We estimate that Claude Opus 4.6 has a 50%-time-horizon of around 14.5 hours (95% CI of 6 hrs to 98 hrs) on software tasks. While this is the highest point estimate we’ve reported, this measurement is extremely noisy because our current task suite is nearly saturated. https://x.com/METR_Evals/status/2024923422867030027
Claude Sonnet 4.6 is the new leader in GDPval-AA, slightly ahead of Anthropic’s Opus 4.6 on agentic performance of real-world knowledge work tasks, less than two weeks after its launch. In our pre-release testing with @AnthropicAI, Sonnet 4.6 reached an ELO of 1633 using the… https://x.com/ArtificialAnlys/status/2023821893846135212
To get an idea of the near-term future of work with AI, take a look at the official Claude Cowork plugins, which give the AI specialized knowledge for various hard tasks. A natural successor for GPTs, but built for agents (& therefore much more scalable & customizable for firms). https://x.com/emollick/status/2023113346162336137
How AI assistance impacts the formation of coding skills \ Anthropic https://www.anthropic.com/research/AI-assistance-coding-skills
Exclusive | Pentagon Used Anthropic’s Claude in Maduro Venezuela Raid – WSJ https://www.wsj.com/politics/national-security/pentagon-used-anthropics-claude-in-maduro-venezuela-raid-583aff17
Anthropic is prepared to loosen its current terms of use, but wants to ensure its tools aren’t used to spy on Americans en masse, or to develop weapons that fire with no human involvement. The Pentagon has said that Anthropic will “pay a price” for that behavior. Within this… https://x.com/kimmonismus/status/2023419652378955809
For Claude in Excel users, our add-in now supports MCP connectors, letting Claude work with tools like S&P Global, LSEG, Daloopa, PitchBook, Moody’s and FactSet. Pull in context from outside your spreadsheet without ever leaving Excel. https://x.com/claudeai/status/2023817143096406246
The browser agent in Comet now runs on Claude Sonnet 4.6 for all Perplexity Pro users. Max users can choose between Sonnet 4.6 and Opus 4.6. https://x.com/comet/status/2023889197556441464
From Claude Code to Figma: Turning Production Code into Editable Figma Designs | Figma Blog https://www.figma.com/blog/introducing-claude-code-to-figma/
Improved Web Search with Dynamic Filtering | Claude https://claude.com/blog/improved-web-search-with-dynamic-filtering
You can now run Qwen3.5 locally! 💜 Qwen3.5-397B-A17B is an open MoE vision reasoning LLM for agentic coding & chat. It performs on par with Gemini 3 Pro, Claude Opus 4.5 & GPT-5.2. Run 4-bit on 256GB Mac / RAM. Guide: https://t.co/wjS1lMnbNp GGUF: … https://x.com/UnslothAI/status/2023338222601064463
Measuring AI agent autonomy in practice \ Anthropic https://www.anthropic.com/research/measuring-agent-autonomy
Most agent actions on our API are low risk. 73% of tool calls appear to have a human in the loop, and only 0.8% are irreversible. But at the frontier, we see agents acting on security systems, financial transactions, and production deployments (though some may be evals). https://x.com/AnthropicAI/status/2024210050718585017
New Anthropic research: Measuring AI agent autonomy in practice. We analyzed millions of interactions across Claude Code and our API to understand how much autonomy people grant to agents, where they’re deployed, and what risks they may pose. Read more: https://x.com/AnthropicAI/status/2024210035480678724
NEW: Pentagon is so furious with Anthropic for insisting on limiting use of AI for domestic surveillance + autonomous weapons they’re threatening to label the company a “supply chain risk,” forcing vendors to cut ties. With @m_ccuri and @mikeallen. https://x.com/DavidLawler10/status/2023425130148626767
Software engineering makes up ~50% of agentic tool calls on our API, but we see emerging use in other industries. As the frontier of risk and autonomy expands, post-deployment monitoring becomes essential. We encourage other model developers to extend this research. https://x.com/AnthropicAI/status/2024210053369385192
Something strange is happening with AI agents that this new Anthropic research quietly surfaces. The agents are asking us for help more than we’re stepping in to correct *them*. Anthropic analyzed data from Claude Code and their public API to measure how autonomous AI agents… https://x.com/omarsar0/status/2024864635120451588
OpenAI’s acquisition of OpenClaw signals the beginning of the end of the ChatGPT era | VentureBeat https://venturebeat.com/technology/openais-acquisition-of-openclaw-signals-the-beginning-of-the-end-of-the
GPT-5.2 derived a new result in theoretical physics. For decades it’s been assumed that certain gluon amplitudes (“single minus”) were zero, and that the maximally helicity violating amplitudes had two gluons of one helicity and n-2 of the other. It turns out that isn’t… https://x.com/kevinweil/status/2022388305434939693
GPT-5.2 derived a novel result in theoretical physics, showing that a type of particle interaction many physicists expected would not occur can in fact arise under specific conditions. There is great promise in the potential of AI to benefit people by accelerating science. https://x.com/gdb/status/2022394113971360145
GPT-5.2 derives a new result in theoretical physics | OpenAI https://openai.com/index/new-result-theoretical-physics/
I spent last night with Andrew Strominger and Alex Lupsasca, two of the top physicists in the world. They just released a paper, co-authored with OpenAI, that seems to me like ASI. Andrew, who helped develop string theory, told me that a year ago, his view was that he didn’t know… https://x.com/patrick_oshag/status/2022395157648195801
More on the gluon scattering/GPT-5.2 paper from @ALupsasca below 👇 If you’re in the Boston area on Tuesday, go see his lecture at Harvard! https://x.com/kevinweil/status/2023422106411974935
We’re committing $7.5M to @AISecurityInst’s Alignment Project to fund independent research on mitigations for safety and security risks from misaligned AI. https://x.com/OpenAINewsroom/status/2024546609485533442
Peter Steinberger is joining OpenAI to drive the next generation of personal agents. He is a genius with a lot of amazing ideas about the future of very smart agents interacting with each other to do very useful things for people. We expect this will quickly become core to our… https://x.com/sama/status/2023150230905159801
Introducing Lockdown Mode and Elevated Risk labels in ChatGPT | OpenAI https://openai.com/index/introducing-lockdown-mode-and-elevated-risk-labels-in-chatgpt/
🎉 Congrats to @Alibaba_Qwen on releasing Qwen3.5 on Chinese New Year’s Eve — day-0 support is ready in vLLM! Qwen3.5 is a multimodal MoE with Gated Delta Networks architecture — 397B total params, only 17B active. What makes it interesting for inference: 🧠 Gated Delta… https://x.com/vllm_project/status/2023341059343061138
🔥 Alibaba’s Qwen 3.5 just dropped — and Zhihu is dissecting it. Here’s a sharp breakdown from Zhihu contributor toyama nao 👇 🏆 Verdict: “The spearhead of the open-source elite.” 📊 Big picture Tongyi Lab’s pattern: new mid-size model leapfrogs old giant. • Last cycle: 80B… https://x.com/ZhihuFrontier/status/2024176484232155236
Qwen https://qwen.ai/blog?id=qwen3.5
Qwen3.5 is Live! Today we openweight the first model, Qwen3-397B-A17B, which is a native multimodal model supporting both thinking and non-thinking modes. We have strengthened its coding and agentic capabilities to foster productivity for developers and enterprises. Hope you… https://x.com/JustinLin610/status/2023332446713070039
Alibaba Yunqi: 7 models released in 4 days (Qwen3-Max, Qwen3-Omni, Qwen3-VL) and $52B roadmap | AINews https://news.smol.ai/issues/25-09-23-alibaba-yunqi
Alibaba’s new Qwen3.5-397B-A17B is the #3 open weights model in the Artificial Analysis Intelligence Index – a significant upgrade from Qwen3-235B-A22B-2507, and achieved with fewer active parameters than leading peers. Qwen3.5-397B-A17B is the first model released by Alibaba… https://x.com/ArtificialAnlys/status/2023794497055060262
Qwen https://qwen.ai/blog?id=qwen3.5#spatial-intelligence
Qwen3.5’s thinking is downright excessive. https://x.com/QuixiAI/status/2023995215690781143
The Future of Design Is Code and Canvas | Figma Blog https://www.figma.com/blog/the-future-of-design-is-code-and-canvas/
.agents/skills is now live in Cursor! https://x.com/leerob/status/2024141610796150903
The Secret to Scalable AI Agents: Virtual Filesystems with Deep Agents – YouTube https://www.youtube.com/watch?v=5oI_G8WL6rU
[2602.14721] WebWorld: A Large-Scale World Model for Web Agent Training https://arxiv.org/abs/2602.14721
📣 Meetup in San Francisco: Agent Observability Powers Agent Evaluation 📣 AI agents don’t fail like traditional software. When an agent takes hundreds of steps, repeatedly calls tools, updates state, and still produces the wrong result, there is no stack trace to inspect. https://x.com/LangChain/status/2023457846843551946
🔎 Use LangSmith Insights to group traces and find emergent usage patterns of your agents Now with the ability to set a schedule and run recurring jobs! Docs 👉 https://x.com/LangChain/status/2023804855136165932
🤖 From this week’s issue: A randomized controlled trial showing AI coding assistance decreased skill mastery by 17% among 52 software engineers, with debugging abilities most affected despite minimal productivity gains. https://x.com/dl_weekly/status/2023502798659125656
🚀 Congrats to @MiniMax_AI on releasing MiniMax-M2.5, a SOTA model in coding, agentic tool use and office work. Day-0 support is live in SGLang! 🧠 RL at scale: trained across hundreds of thousands of real-world environments 💻 Architect-level coding: plans, decomposes, and… https://x.com/lmsysorg/status/2022319102560555401
🚀 We just shipped a major update to LangSmith Agent Builder: • New agent chat: One always-available agent with access to all your workspace tools • Chat → Agent: Turn any conversation into a specialized agent with one click • File uploads: Attach files directly to Agent… https://x.com/LangChain/status/2024180357457989887
2 great lightweight alternatives to OpenClaw (99% smaller) 1. PicoClaw > Runs in <10MB RAM – ~99% less memory than OpenClaw > Works on ~$10 hardware > Boots in ~1 second > Single portable binary (RISC-V, ARM, x86) 2. nanobot > ~4,000 lines of code (≈99% smaller than… https://x.com/TheTuringPost/status/2023416488884129826
9 Observations from Building with AI Agents | Tomasz Tunguz https://tomtunguz.com/9-observations-using-ai-agents/
A Guide to Which AI to Use in the Agentic Era https://www.oneusefulthing.org/p/a-guide-to-which-ai-to-use-in-the
AI agents are shipping fast – and breaking core security assumptions Much of that breakage starts at the identity layer. A live webinar with Craig Matsumoto from @futuriom and @goteleport CEO Ev Kontsevoy → https://t.co/qCh6tSp2Im will unpack what many are missing: – Why… https://x.com/TheTuringPost/status/2022657346120618122
AI agents can completely transform your enterprise. But real transformation only works when you stay in control. If you want to deploy agents at full speed without risks, here’s something worth checking out 👇 A live webinar introducing Rubrik Agent Cloud by @rubrikInc →… https://x.com/TheTuringPost/status/2022434860246405510
Apologies, this was a docs clean up we rolled out that’s caused some confusion. Nothing is changing about how you can use the Agent SDK and MAX subscriptions! https://x.com/trq212/status/2024212378402095389
Building a good harness is hard and it compounds over time. We are building a good one, a video agent for Argil; it’s very fun to dev. https://x.com/brivael/status/2023203131329503583
Checkout ThunderAgent led by @GT_HaoKang, intern at @togethercompute! An agentic workflow involves multiple model and tool requests, but inference systems make scheduling decisions on a per-request basis. ThunderAgent introduces a simple “program abstraction” to track the end to… https://x.com/simran_s_arora/status/2023846852987421096
Coding agents are fundamentally changing software engineering in terms of velocity, role, and org structure. We published a memo to our internal engineering team detailing our growing expectations in terms of role/scope. 🟠 Before, the tasks of prioritization, engineering… https://x.com/jerryjliu0/status/2024611512858644561
Cursor can now use past conversations as context. https://x.com/cursor_ai/status/2024222146642497713
DSPy Weekly #23 is out! – 🛠️ optimize_anything & gskill (GEPA) released! – 🔁 The rise of Recursive Language Models (RLMs) & new RLM explorers – 🏦 Nubank scaling agents to 127M+ users ( with DSPy used in stack ) – 🧠 Generative Ontology & EmbeWebAgent papers – 📈 DSPy… https://x.com/getpy/status/2024865536929308889
Everything we’re doing to make codebases “agent-ready” (better docs, less dead code, smaller surfaces) engineers always needed too. Agents just have zero tolerance for the entropy humans learned to work around. They can’t “just know” a file is outdated or a code path is dead. https://x.com/dok2001/status/2022339274767520246
Extend Cursor with plugins · Cursor https://cursor.com/blog/marketplace
File systems are an agent’s natural work environment. The ability to process and create unstructured data allows agents to bring automation to most areas of knowledge work. Now you can easily integrate Box as a cloud filesystem into deepagents from LangChain. Stay tuned for more. https://x.com/levie/status/2022375298097111160
From Static Embeddings to Dynamic AI Assistants – Powered by Qdrant In this recent LinkedIn pulse by Michael Folino, he revisits a Kafka AI assistant project originally built in 2022 using FAISS – and upgrades it for 2025 using Qdrant. The key shift? Moving from a static… https://x.com/qdrant_engine/status/2024016471714918798
Happy CNY! We are glad to introduce our latest language model Seed-2.0. We have made great progress (agent, reasoning, vision understanding, etc.) since Seed-1.8, without any distillation. Right now it’s only available in CN, and will soon be ready globally. https://x.com/TsingYoga/status/2023764275874197964
hi elie, thanks for your question > 10B active parameters was intentional > M2.5 is getting close to “infinite agent scaling” > knowledge capacity is the main limit > tradeoffs are thoughtfully chosen for efficiency & practicality > pretraining innovations remain exciting areas… https://x.com/MiniMax_AI/status/2022370086397624476
How OpenClaw is built (and why it’s so different) > The key design decision – it’s a workspace directory of Markdown files > Identity, memory, skills, tool policies, heartbeat rules – all live on disk > At the center is Gateway – a single long-running process. Everything flows… https://x.com/TheTuringPost/status/2024540032590368790
How to specialize a coding agent for your codebase? GEPA for Skills makes it easy to create a skills file for a repository: 1. Run SWE-smith to create tasks on your repo. 2. Use GEPA optimize_anything to develop skills for your repo. 3. Done. https://x.com/AlexGDimakis/status/2024653629303771580
“I am the bottleneck now.” Few more thoughts. https://x.com/thorstenball/status/2022310010391302259
I started writing about Harness Engineering ~5-6 months ago. Here’s a blog on the actual recipes we use at LangChain to improve our Agents+Harnesses and get a Top 5 score on Terminal Bench 2.0. Some highlights: – Self-verification is a fast ramp for agents autonomously improving… https://x.com/Vtrivedy10/status/2023812467034329224
Implementing a secure sandbox for local agents · Cursor https://cursor.com/blog/agent-sandboxing
Interesting new work on adaptive reasoning depth for LLM agents. Not every agent step requires the same level of thinking. Some steps need strategic planning. Others are routine execution. This research introduces CogRouter, a framework inspired by ACT-R cognitive theory that… https://x.com/omarsar0/status/2023405531835277504
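The routing idea is easy to picture in code. A minimal illustrative sketch of per-step depth routing, not CogRouter’s actual algorithm; `route_step`, `difficulty_by_keywords`, and the threshold are all hypothetical names invented here:

```python
# Illustrative sketch only (not CogRouter's method): route each agent step
# to a cheap "shallow" pass or an expensive "deep" reasoning pass based on
# a per-step difficulty estimate.
def route_step(step, difficulty, threshold=0.5):
    """Pick a reasoning depth for one agent step."""
    if difficulty(step) >= threshold:
        return ("deep", f"plan carefully: {step}")   # strategic planning
    return ("shallow", f"just do: {step}")           # routine execution

def difficulty_by_keywords(step):
    # Toy difficulty estimate: planning-flavored verbs score high.
    hard = {"design", "debug", "refactor", "plan"}
    return 1.0 if set(step.lower().split()) & hard else 0.1
```

In a real system the difficulty signal would come from the model or a learned classifier rather than a keyword set; the point is that depth becomes a per-step decision instead of a global setting.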
Introducing OriOn: the SOTA Long-Context Engine That Powers Agentic Search & Reason – LightOn https://www.lighton.ai/lighton-blogs/introducing-orion
It really isn’t that hard to see the jagged frontier of AI. Just think about the parts of your job that are vital but that you would be insane to expect an AI to do, even if agents get 10x better. That’s the frontier. The more you use AI, the more accurate those assessments will be. https://x.com/emollick/status/2023870680081854744
Last week I posted about using file systems in deepagents with @LangChain_JS 🎥 https://t.co/2r6wFHVS3S 👀 Today, our friends from @Box forked the project and built their own Box backend to help you store files on their intelligent content management platform 🤩 Go check it… https://x.com/bromann/status/2022011713332420755
Let me bring the full embarrassing story here, our previous CLI was junk and had bugs that made it harder to work with. It embarrassed me so much that I couldn’t run my evals properly on our own CLI, so we complained a lot and our team went crazy and pulled some all-nighters… https://x.com/arafatkatze/status/2022415192932651302
Model intelligence isn’t the only dimension that matters for agents; how quickly they complete tasks is critical. We’ve added end-to-end speed tracking to our open-source agent harness, Stirrup! The latest version of our lightweight, open source agent framework, Stirrup, now… https://x.com/ArtificialAnlys/status/2022358995739254800
Must-read AI research of the week: ▪️ Towards Autonomous Mathematics Research ▪️ InternAgent-1.5 ▪️ Agent World Model ▪️ Weak-Driven Learning: How Weak Agents make Strong Agents Stronger ▪️ Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation ▪️… https://x.com/TheTuringPost/status/2023728292214579284
OpenClaw reveals something fundamental about startups in the AI era: One-person team. Coding agents as employees. 3 months. And you can build something that shifts the world. Of course, Peter’s taste and agency are elite. Years of coding experience. He can both build and sell. https://x.com/Yuchenj_UW/status/2023248474503094774
OpenClaw, OpenAI and the future | Peter Steinberger https://steipete.me/posts/2026/openclaw
Over the last three months, we’ve rolled out agent sandboxing on macOS, Linux, and Windows. Sandboxes allow agents to run freely and securely, only requesting approval when they need to step outside it. Here’s how we built it: https://x.com/cursor_ai/status/2024544628687687879
Retrieval brings the signal. Long context makes it scalable for agents. ✨ Introducing OriOn: the SOTA Long-Context VLM family built for agentic search & reasoning. OriOn processes up to 250 pages at full visual resolution in a single pass, with a 32B model that hits SOTA… https://x.com/LightOnIO/status/2024037191974834553
so ‘harness’ has increased a lot in the ai vocab over the last few months. mainly around agents, where it was once more in evals. this is how I understand it: please correct me if I’m wrong. the agent’s harness is the infrastructure that wraps the model to manage long-running… https://x.com/ben_burtenshaw/status/2023429103731269696
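That definition reduces to a small loop: the harness is everything around the model call. A toy sketch under that reading; `call_model`, the message dicts, and the tool registry are illustrative placeholders, not any specific framework’s API:

```python
# Minimal sketch of an agent harness: the wrapper that owns the history,
# executes tool calls, and enforces a step budget around a bare model.
from dataclasses import dataclass, field

@dataclass
class Harness:
    max_steps: int = 20
    history: list = field(default_factory=list)

    def run(self, task, call_model, tools):
        self.history.append({"role": "user", "content": task})
        for _ in range(self.max_steps):
            reply = call_model(self.history)        # model proposes next action
            self.history.append(reply)
            if reply.get("tool") is None:           # no tool call -> final answer
                return reply["content"]
            result = tools[reply["tool"]](**reply.get("args", {}))
            self.history.append({"role": "tool", "content": str(result)})
        return "step budget exhausted"
```

Real harnesses add sandboxing, retries, context compaction, and observability on top, but they share this shape: the model decides, the harness executes and keeps state.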
Sufficiently advanced agentic coding is essentially machine learning: the engineer sets up the optimization goal as well as some constraints on the search space (the spec and its tests), then an optimization process (coding agents) iterates until the goal is reached. The result… https://x.com/fchollet/status/2024519439140737442
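The framing above maps directly onto a loop: the tests are the objective, the agent is the optimizer. A minimal sketch of that framing, where `propose_patch` is a hypothetical stand-in for the coding agent:

```python
# Sketch of agentic coding as optimization: iterate patches until the
# spec's tests (the objective) all pass, within a fixed attempt budget.
def optimize(tests, propose_patch, initial_code, budget=10):
    code = initial_code
    failures = []
    for attempt in range(budget):
        failures = [t.__name__ for t in tests if not t(code)]
        if not failures:                         # objective reached
            return code, attempt
        code = propose_patch(code, failures)     # agent edits toward the goal
    raise RuntimeError(f"budget exhausted; still failing: {failures}")
```

The interesting consequence of this view is that the quality of the spec and its tests, not the agent, becomes the binding constraint on the result.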
Superagent: AI Tools for Business, Research Assistant https://superagent.com/
The ideas behind autonomous AI agents were recognized early, but the models just weren’t good enough until now – which is why clever ideas like OpenClaw can finally take off. Model ability gains (especially tool use & lower error rates) are the foundation for what comes next. https://x.com/emollick/status/2023255762525425839
The new hustle culture is undoubtedly making tokens work for you while you sleep, eat, shower, and scroll. I’ve got 6 ai agents working for me now. Some days it feels like an iron man suit for the mind, body, and soul — pure amplification. Other days it feels like we’re… https://x.com/bilawalsidhu/status/2023078239753703782
the problem isn’t OpenClaw. it’s the architecture. https://www.vulnu.com/p/the-problem-isnt-openclaw-its-the-architecture
Too many people working with multi-agent systems assume that if you just add enough agents and let them talk, interesting social dynamics will emerge. A new paper suggests that the assumption is fundamentally wrong. Researchers studied Moltbook, a social network with no humans… https://x.com/omarsar0/status/2023766916473733394
Training tool-use agents with RL requires diverse, executable environments. But these environments barely exist. This new research introduces Agent World Model (AWM), a fully synthetic pipeline that generates executable agentic environments at scale. Starting from high-level… https://x.com/dair_ai/status/2023748787949498804
Underestimated moment of agency/leverage is that many systems are being redesigned for AI & fluid for the first time in a long time (how do you hire? How does scientific publishing work?) That creates an opening for small groups to set patterns that could define future systems. https://x.com/emollick/status/2023439440811888873
Up to 3.9x faster agentic serving/rollout via ThunderAgents, without any quality tradeoff! Great work from Together AI by @GT_HaoKang @Chenfeng_X @_junxiong_wang @simran_s_arora https://x.com/ben_athi/status/2023852606842700198
Very interested in what the coming era of highly bespoke software might look like. Example from this morning – I’ve become a bit loosey-goosey with my cardio recently so I decided to do a more srs, regimented experiment to try to lower my Resting Heart Rate from 50 -> 45, over… https://x.com/karpathy/status/2024583544157458452
We struggled to get monitors to catch reward hacking in agents… so we built a new lens. 🔭 Hodoscope allows you to visualize and audit agent trajectories at scale. Using it, we found a brand new vulnerability in commit0 in just minutes! ⚡️ We’re excited to see what you… https://x.com/AdtRaghunathan/status/2024944182595289418
we’re excited to announce trajectory explorer: the first sane way to navigate agent traces. every decision your agent made is now searchable in seconds, only in @raindrop_ai https://x.com/benhylak/status/2024546696211083653
What’s new in Cline CLI 2.0: + Completely redesigned terminal UI with interactive mode + Parallel agents with isolated state per instance, without manual instance creation + Improved headless mode for CI/CD pipelines + Added ACP support for Zed, Neovim, and Emacs. https://x.com/cline/status/2022341258979717196
Less than a year and a half ago computer use was barely even a thing and now we’re near human-level capability. Another reminder that things are improving very fast. https://x.com/alexalbert__/status/2023820589983801796
Announcing AA-WER v2.0 Speech to Text accuracy benchmark, and AA-AgentTalk, a new proprietary dataset focused on speech directed at voice agents. AA-AgentTalk focuses on the speech that matters most to voice agents. As a held-out, proprietary dataset, AA-AgentTalk also mitigates… https://x.com/ArtificialAnlys/status/2024157398139883729
Today I am extremely proud to introduce the Fury Autonomous Vehicle Orchestrator 🇺🇸 For the past 12 months, the Scout AI team has been building this quietly. Now we’re finally ready to unveil it to the world. Just like AI agents are transforming the digital world, Fury… https://x.com/adcock_colby/status/2024210697304101021
Small update to the leaderboard at https://t.co/AU0F7BjYEh: it’s now all results from running with mini-SWE-agent v2, an upgrade over v1 that gets more juice out of the base models. https://x.com/OfirPress/status/2024177059895877802
We just updated the official SWE-bench leaderboard comparing all models with the exact same scaffold (mini-SWE-agent v2). Detailed cost analysis & links to browsable trajectories in 🧵 https://x.com/KLieret/status/2024176335782826336
WebMCP is the future of the web. Agents can now interact with any website without ever seeing the UI. I built a starter template to show how: A DoorDash-like app where the agent adds items to cart and checks out with the right address + promo code. The browser is now the API. https://x.com/skirano/status/2022387763421810989
Manus AI launched 24/7 Agent via Telegram and got suspended https://www.testingcatalog.com/manus-ai-launched-24-7-agent-via-telegram-and-got-suspended/
We just launched the world’s first coding agent sandbox for Windows! Now you can safely let the agent work, without needing to approve specific commands. We’ve been testing it experimentally for some time. Now it’s live in the CLI. Next up, IDE extension and Windows Codex app! https://x.com/embirico/status/2022378682749456870
Microsoft tests Researcher and Analyst agents in Copilot https://www.testingcatalog.com/microsoft-tests-researcher-and-analyst-agents-in-copilot-tasks/
Koyeb is Joining Mistral AI to Build the Future of AI Infrastructure – Koyeb https://www.koyeb.com/blog/koyeb-is-joining-mistral-ai-to-build-the-future-of-ai-infrastructure#serverless-inference-and-agents
🦞 https://x.com/sama/status/2023463428892094655
After initially being hyped about the speed, I have to say that 5.3-codex-spark, even on xhigh, is actually quite a bit dumber than 5.3-codex, to the point that I’m back to using the latter most of the time. https://x.com/giffmana/status/2023341811851473053
agents are up and productive, time for bed https://x.com/gdb/status/2023342301821734937
codex is so good at the toil — fixing merge conflicts, getting CI to green, rewriting between languages — it raises the ambition of what i even consider building https://x.com/gdb/status/2023135825970749637
codex momentum is strong, and many people are feeling just how big of a leap 5.3 is. if your organization hasn’t tried codex yet, it’s worth revisiting. https://x.com/gdb/status/2023299087974777061
Codex weekly users have more than tripled since the beginning of the year! https://x.com/sama/status/2023233085509410833
codex’s shell-fu is incredible to behold and learn from https://x.com/gdb/status/2022823856889827711
I am increasingly asked during candidate interviews how much dedicated inference compute they will have to build with Codex. Pairing this with usage per user growing significantly faster than the number of users, it’s pretty clear that compute will be something that is scarce. https://x.com/thsottiaux/status/2024635825997459841
I’m glad that 5.3 Codex has started making good use of sub-agents. With 5.2 Codex, I often saw it not using them much even when the feature was enabled. For reference: I wouldn’t recommend this for Plus users, but for Pro users, you can increase the maximum number of sub-agents… https://x.com/Hangsiin/status/2023297599764402627
I’ve had 3 or more agents running in parallel with Codex for 2+ hours. I’ve used 8% of my 5-hour window. 2% of my weekly. I am literally trying to hit the limits and still can’t. https://x.com/theo/status/2023718038198251904
measuring agentic security capabilities with smart contracts: https://x.com/gdb/status/2024200501055963593
We have a special thing launching to Codex users on the Pro plan later today. It sparks joy for me. I think you are going to love it… https://x.com/sama/status/2021984777470193767
On evaluating multi-step scientific tool use in LLM agents. SciAgentGym provides an interactive environment with 1,780 specialized tools across 4 scientific disciplines. The core finding: even advanced models like GPT-5 see success rates drop sharply from 60.6% to 30.9% as… https://x.com/dair_ai/status/2023404773031166320
One thing I feel not enough people know is that the Codex agent is open source. It also exposes an app-server interface that lets you integrate Codex into your application including sign-in with ChatGPT. It’s the same server that powers Codex in VSCode, Jetbrains and Xcode. https://x.com/dkundel/status/2024233673764257879?s=20
Introducing Lockdown Mode for ChatGPT. Lockdown mode is an advanced, optional security setting for higher-risk users, businesses, and enterprises. Lockdown Mode disables certain tools and capabilities in ChatGPT that an adversary could attempt to exploit to exfiltrate sensitive… https://x.com/cryps1s/status/2023441322838028362
Labor of love: We’re open-sourcing the runtime we use to run long-horizon agents at Southbridge. Something like this exists at almost every serious AI team I know. We ended up needing to build it because we couldn’t buy it. The problems were simple: – How do we stop throwing… https://x.com/hrishioa/status/2023807677089099914
MiniMax-M2.5 is now open source. Trained with reinforcement learning across hundreds of thousands of complex real-world environments, it delivers SOTA performance in coding, agentic tool use, search, and office workflows. Hugging Face: https://t.co/Wxksq9BB7t GitHub: … https://x.com/MiniMax_AI/status/2022310932693897628
You can now run open-source AI coding agents without paying for API keys 🤯 Cline CLI 2.0 just dropped with free access to Minimax M2.5. → Runs from your terminal → Parallel agents → Works with any editor Any model you want. 100% Open Source. https://x.com/dr_cintas/status/2022387444189139367
BREAKING 🚨: Cline released Cline CLI 2.0, an open-source AI coding agent powered by Kimi K2.5 and MiniMax M2.5, available for free! `npm install -g cline` on Windows, Mac, and Linux. Open-source strikes back 👀 https://x.com/testingcatalog/status/2022348951459172604
Introducing Cline CLI 2.0: An open-source AI coding agent that runs entirely in your terminal. Parallel agents, headless CI/CD pipelines, ACP support for any editor, and a completely redesigned developer experience. Minimax M2.5 and Kimi K2.5 are free to use for a limited time. https://x.com/cline/status/2022341254965772367
Agentic Generation of Simulation-Ready Indoor Scenes and Robot Test Environments. 📍 Paper AND Code: Instead of hand-building scenes in simulation, you write one prompt. SceneSmith builds the world for you. > Room layout. > Furniture. > Wall and ceiling objects. > Small”” https://x.com/IlirAliu_/status/2021870913713590338
[2602.16301] Multi-agent cooperation through in-context co-player inference https://arxiv.org/abs/2602.16301
xAI tests Arena Mode with Parallel Agents for Grok Build https://www.testingcatalog.com/xai-tests-parralel-agents-and-arena-mode-for-grok-build/
– hallucinating function names that don’t exist during agentic workflows – hallucinating incorrect structures when asked to generate structured outputs Sonnet 4.5 still works great, but 4.6 is completely crapping the bed on the same tasks. https://x.com/rishdotblog/status/2023848930430304648
@OpenHandsDev OpenHands leads to such a sharp improvement on Codex 5.1 (33.3% ➡️ 43.2%) and Claude Sonnet 4.5 (34.1% ➡️ 45.5%) that it makes me wonder if some of the other leaderboards would see improvement from trying out different frameworks as well. https://x.com/iamwaynechi/status/2022448462105842023
> Someone reverse-engineered how Claude Code’s Agent Teams communicate. > > No WebSocket. No gRPC. No message queue. > > They read and write JSON files on disk. > > Each agent gets an inbox at ~/.claude/teams/inboxes/{agent}.json. Messages append to a JSON array. Protocol… https://x.com/peter6759/status/2022156692985983266
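The file-per-inbox scheme described in that thread can be sketched in a few lines. The inbox path pattern comes from the tweet; the function names, message shape, and the absence of file locking are illustrative assumptions (a real implementation would need atomic writes to avoid lost updates between concurrent agents):

```python
import json
from pathlib import Path

DEFAULT_ROOT = Path.home() / ".claude" / "teams" / "inboxes"

def send_message(agent: str, message: dict, root: Path = DEFAULT_ROOT) -> None:
    """Append a message to an agent's JSON inbox (read-modify-write, no locking)."""
    inbox = root / f"{agent}.json"
    messages = json.loads(inbox.read_text()) if inbox.exists() else []
    messages.append(message)
    inbox.parent.mkdir(parents=True, exist_ok=True)
    inbox.write_text(json.dumps(messages, indent=2))

def read_inbox(agent: str, root: Path = DEFAULT_ROOT) -> list:
    """Return all messages currently in the agent's inbox."""
    inbox = root / f"{agent}.json"
    return json.loads(inbox.read_text()) if inbox.exists() else []
```

The appeal of the design is that it is trivially debuggable: `cat` the JSON file and you see the whole conversation, no broker required.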
👌 Tracing in LangSmith is as easy as copy/paste 📊 Get started in seconds with Claude Agent SDK, OpenAI, LangChain, Vercel AI SDK, and 20+ other frameworks. Pick your stack, copy the code, start debugging. Docs: https://t.co/DAQcQxkVsp Sign up for LangSmith:… https://x.com/LangChain/status/2023532973086159283
🚨 Breaking: Claude OAuth officially not allowed in OpenClaw This would be a GREAT time for @sama to step in and let us use @OpenAI subscriptions with @openclaw. https://x.com/AndrewWarner/status/2024168538508775674
Another thing I noticed writing my latest AI guide was how Anthropic seems to be alone in knowledge work apps. Not just Cowork, but Claude for PowerPoint & Excel, as well as job-specific skills, plugins & finance/healthcare data integrations. Surprised at the lack of challengers. https://x.com/emollick/status/2023968612881412957
Anthropic blocked his fren from using the claude sub in openclaw, switched to minimax – big boost for open models thanks anthropic https://x.com/Teknium/status/2023251135201738794
Anthropic might have already started slowing down. Since July 2025, Anthropic has grown its revenue at a rate of 7×/year rather than 10×. https://x.com/EpochAIResearch/status/2024536493721866668
Bruh It’s not just behind, it’s 50% more expensive than xhigh and 228% over 5.2 codex. That said, a vast improvement over Sonnet 4.5. https://x.com/teortaxesTex/status/2023890938125488289
Claude Code has regressed an absurd amount in the last few days. Timestamps no longer update unless you un-focus/re-focus the tab. "thinking" doesn’t show at all. I had a query run for 6 minutes with 0 output. This is genuinely unpleasant to use. https://x.com/theo/status/2024718133676867608
Claude Sonnet 4.6 same pricing as Sonnet 4.5! https://x.com/kimmonismus/status/2023820443359002922
Claude Sonnet 4.6 substantially improves on the aesthetic capabilities of Sonnet 4.5 for tasks like presentation and document generation in GDPval-AA. While we see effective analysis, and in some cases content similarities, between the two versions, the visual elements are… https://x.com/ArtificialAnlys/status/2023821899139293652
Computer use is the standout. For coding, it’s less prone to overengineering than Opus 4.5 and more consistent over long sessions. And 1M context window in beta on the API. We can’t wait to see what you build! https://x.com/mikeyk/status/2023853207731200176
Did Anthropic break something after releasing Sonnet 4.6? Seeing a ton of hallucinations everywhere for both Opus 4.6 and Sonnet 4.6 cc: @trq212, @alexalbert__ https://x.com/rishdotblog/status/2023848487285387693
Fun nugget from Sonnet 4.6: With a 1M context window, the model is better at long-horizon planning. In the Vending-Bench Arena, models compete to run a simulated business. Sonnet 4.6 developed a new strategy: invest heavily in capacity for the first 10 months, then pivot hard… https://x.com/felixrieseberg/status/2023823186484404443
GEPA for skills is here! Introducing gskill, an automated pipeline to learn agent skills with @gepa_ai. With learned skills, we boost Claude Code’s repository task resolution rate to near-perfect levels, while making it 47% faster. Here’s how we did it: https://x.com/ShangyinT/status/2024651061995458722
idk what you are all smoking clawdbot is just a passing hype that can be vibe-coded in a week anthropic lost absolutely nothing here except some aura points in the open source community but it’s not like they liked anthropic anyway https://x.com/scaling01/status/2023217588319277471
IMO it’s pretty clear that Claude Code needs to be rewritten from scratch at this point https://x.com/theo/status/2024726444283449781
Kind of crazy to read how much prompt caching influences the performance of Claude Code. It almost feels like, without it, we wouldn’t be anywhere near the experience we have in CC today. Super important read, especially as we enter this new era of agent harnesses. This backend… https://x.com/omarsar0/status/2024620142240333979
Kind of crazy watching Anthropic’s good will crumble in real time https://x.com/theo/status/2024225756981973214
Latest from @AnthropicAI: Claude Opus & Sonnet 4.6 are now in the Search Arena. 🌐 Check them out in Search Arena to see how well they can search, cite and output real-time, verifiable information online. https://x.com/arena/status/2024144830209966142
LCM extends on Recursive Language Models and outperforms Claude Code on long-context tasks. Pay close attention. So much innovation is happening in agent memory. https://x.com/omarsar0/status/2023765757117763820
Modular: The Claude C Compiler: What It Reveals About the Future of Software https://www.modular.com/blog/the-claude-c-compiler-what-it-reveals-about-the-future-of-software
On Anthropic’s Consumer Marketing | rohan ganapavarapu https://rohan.ga/blog/anthro_consumer/
Opus 4.6 is ludicrously better than any model I’ve ever tried at doing architecture and experimental critique. Most noticeably, it will start down a path, notice some deviation it hadn’t expected… and actually stop and reconsider. Hats off to Anthropic. https://x.com/eshear/status/2024148657797308747
Opus 4.6 keeps blowing through its *entire* token budget & eventually responding completely empty when I ask for max reasoning. Finish reason: "length". These are short prompts – 160 token input. Thinks for 20 mins then blows up & charges me money for the privilege. https://x.com/paul_cal/status/2024817020529766764
Piotr discovered something worrying: if you give an LLM a list of tools it’s allowed to call, it might decide to also call a tool you didn’t provide! Impacts all major US providers except @OpenAI. Be sure to check LLM tool call requests! (Lisette/Claudette check automatically) https://x.com/jeremyphoward/status/2024599416901103705
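The defensive check Jeremy is urging is easy to bolt onto any client loop: reject any requested tool that was not in the list you declared. This is a minimal sketch; the dict shapes and function names are illustrative assumptions, not the API of any specific library:

```python
def validate_tool_calls(tool_calls: list, allowed_tools: list) -> list:
    """Raise if the model requested a tool that was never declared to it.

    tool_calls: model-requested calls, each a dict with a "name" key.
    allowed_tools: the tool definitions you actually sent, same key.
    """
    allowed = {tool["name"] for tool in allowed_tools}
    unknown = [call["name"] for call in tool_calls if call["name"] not in allowed]
    if unknown:
        raise ValueError(f"Model requested undeclared tools: {unknown}")
    return tool_calls
```

Running this gate before dispatching any call means a hallucinated (or injected) tool name fails loudly instead of silently reaching an executor.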
Side by side example: same model (claude-opus-4-6), same task, two different agent harnesses. @LangChain Deep Agents CLI: 9s. Claude Code: 16s. The harness IS the performance. 1.7× difference, zero model changes. https://x.com/GitMaxd/status/2024137171217871106
Somehow I didn’t fully appreciate how strongly Claude Code’s prompt has to fight against the weights to make parallel tool calls. https://x.com/dbreunig/status/2024247669359788050
Sonnet 4.6 incoming! Let’s go! https://x.com/kimmonismus/status/2023814107846398015
Sonnet 4.6 is a beast for real-world work and agentic tasks, especially computer use. https://x.com/kimmonismus/status/2023844025011499052
Sonnet 4.6 is here. It’s our most capable Sonnet model by far, approaching Opus-class capabilities in many areas. Very excited for folks to try this one out. The performance jump over Sonnet 4.5 (which was released just over four months ago) is quite insane. https://x.com/alexalbert__/status/2023817479580221795
sonnet 4.6 is here. no sonnet 5 lol https://x.com/dejavucoder/status/2023817232732848501
Sonnet 4.6 is now available in Cursor. Our benchmarks show it as a notable improvement over Sonnet 4.5 on longer tasks, but below Opus 4.6 for intelligence. https://x.com/cursor_ai/status/2023841746577485894
Sonnet 4.6 used 74M output tokens to run the Artificial Analysis Intelligence Index, ~3x Sonnet 4.5 (Reasoning, 25M) and more than Opus 4.6 (Adaptive Reasoning, 58M). https://x.com/ArtificialAnlys/status/2024259815930012105
Sonnet and Slopus 4.6 are munching through my credits I miss Sonnet 3.5 just one-shotting everything https://x.com/scaling01/status/2023835207355560223
Sonnet’s progress from 4.5 to 4.6 is fucking insane, it’s just much better at everything. Taste is off the charts. The NYC skyline is the most ridiculous part. While other models just write SVG that look like some skyscraper, like a tall box with a few windows, Sonnet 4.6 is… https://x.com/scaling01/status/2023840565641556439
The clawdbot → open claw rename foreshadowed it all. Zuck must not be too happy. And interesting that Anthropic didn’t even make a play. So what does this mean? I suspect new functionality keeps coming to open claw first – and the best stuff graduates to chatgpt proper. A… https://x.com/bilawalsidhu/status/2023187986901344548
This is Claude Sonnet 4.6: our most capable Sonnet model yet. It’s a full upgrade across coding, computer use, long-context reasoning, agent planning, knowledge work, and design. It also features a 1M token context window in beta. https://x.com/claudeai/status/2023817132581208353
This is definitely something to be aware of both for benchmark builders and users IMO. For longer-running, more difficult tasks, the differences between which agent you use can be big, like a 10% gain in success rate when going from Claude Code to OpenHands. https://x.com/gneubig/status/2022451119310655909
Underrated dev upgrade from today’s launch: Claude’s web search and fetch tools now write and execute code to filter results before they reach the context window. When enabled, Sonnet 4.6 saw 13% higher accuracy on BrowseComp while using 32% fewer input tokens. https://x.com/alexalbert__/status/2023834863858769975
Warmer and kinder than Sonnet 4.5, but also smarter and more overcaffeinated than Sonnet 4.5. https://x.com/sleepinyourhat/status/2023821754859503650
When Anthropic CEO @DarioAmodei sat down with @dwarkesh_sp, the AI world saw a rare sight: a frontier lab leader was pressed not on what models can do, but why they haven’t transformed the economy yet. Is the bottleneck the technology, or is it us? I look at 3 core pressure… https://x.com/TheTuringPost/status/2024247179305451634
Worth noting Claude Cowork is quite different from Claude Code (and even more so from agents like OpenClaw) from a security perspective. It runs in a VM with default-deny networking & hard isolation baked in. A sign of a path forward for agents that will not terrify corporate IT. https://x.com/emollick/status/2023260943942135850
Anthropic has entrusted Amanda Askell to endow its AI chatbot, Claude, with a sense of right and wrong https://x.com/WSJ/status/2022629696261808173
Anthropic’s Philosopher Amanda Askell Is Teaching Claude AI to Have Morals – WSJ https://www.wsj.com/tech/ai/anthropic-amanda-askell-philosopher-ai-3c031883?mod=e2tw
📊 Let’s dive deeper into @AnthropicAI’s Sonnet 4.6 vs 4.5. Overall: Sonnet 4.6 ranks 3 places higher (#13 vs #16). Where Sonnet 4.6 gains: Code: ▪️ WebDev (+19 for Sonnet 4.6: #3 vs #22) Text: ▪️ Instruction Following (+6, #5 vs #11) ▪️ English (+5, #9 vs #14) ▪️ Hard Prompts (+5,… https://x.com/arena/status/2024892330743124246
Claude Sonnet 4.6 (medium) scores 66.1% on WeirdML, matching Opus 4.6 (no thinking) and a big advance from Sonnet 4.5 at 47.7%. I had to run it on medium reasoning level because the default (high) constantly hit the 64k max tokens limit. Even at medium it uses as many output… https://x.com/htihle/status/2024764946051907659
Claude Sonnet 4.6 takes second place in the Artificial Analysis Intelligence Index (behind Opus 4.6), but used ~3x more output tokens than Claude Sonnet 4.5 in its max effort mode. Sonnet 4.6 leads all models in GDPval-AA and TerminalBench, including a slight lead over Opus 4.6. https://x.com/ArtificialAnlys/status/2024259812176121952
When I joined METR I was really skeptical that we were evaling models using simple OS scaffolds rather than Claude Code / Codex / etc. I really appreciate Nikola looking into this and I’m surprised it still doesn’t seem to make much difference for CC on Opus 4.5. https://x.com/ajeya_cotra/status/2022419978495127828
GLM-5 scores 48.2% on WeirdML, beating Claude Sonnet 4.5 and tying gpt-oss-120b (high) for the best open model. This is a clear advance but still far from Opus-4.6 at 78% and gpt-5.2 at 72%. https://x.com/htihle/status/2023734346943775179
Anthropic raises $30 billion in Series G funding at $380 billion post-money valuation | Anthropic https://www.anthropic.com/news/anthropic-raises-30-billion-series-g-funding-380-billion-post-money-valuation
It’s extremely unreasonable to say a company is a "supply chain risk" because it wants terms that prevent using the AI for mass domestic surveillance and lethal autonomous weapons. (Insofar as this is the situation.) 1/ https://x.com/RyanPGreenblatt/status/2023524096592802207
OpenAI and Anthropic are much further ahead than what benchmarks show. While you are token constrained they are blasting millions of tokens at 4x the API speed without batting an eye and they scaffold like they are trying to build a skyscraper. https://x.com/scaling01/status/2023837889478758495
I unashamedly love Windows. Always have. Anthropic folks – apparently, not so much :-), Claude Code is super-buggy on Windows. If you want to avoid spending a lot of time fixing NTFS issues, add this to https://t.co/QG7xYArzFH ## Windows Shell Safety The Bash tool runs under Git… https://x.com/MParakhin/status/2024172856029171877
Wow, Codex is some sort of a miracle… (yes, I’ve tried Claude Code before that) https://x.com/TheTuringPost/status/2022079178703847607
I looked into how Claude Code and Codex compare to the default scaffolds METR uses for time horizon measurements. It looks like they don’t significantly outperform our default scaffolds on any models we’ve tried them on so far. https://x.com/nikolaj2030/status/2022398669337825737
Dario acknowledges the multi-trillion dollar robotics opportunity, yet Anthropic is not hiring robotics talent, even as OpenAI and Google DeepMind aggressively build out their own robotics teams. https://x.com/TheHumanoidHub/status/2022416551270662427
Introducing Claude Code Security, now in limited research preview. It scans codebases for vulnerabilities and suggests targeted software patches for human review, allowing teams to find and fix issues that traditional tools often miss. Learn more: https://x.com/claudeai/status/2024907535145468326
A paper worth paying close attention to. It presents Lossless Context Management (LCM), which reframes how agents handle long contexts. It outperforms Claude Code on long-context tasks. Recursive Language Models give the model full autonomy to write its own memory scripts. LCM… https://x.com/dair_ai/status/2023765147970662761
OpenClaw creator on Opus vs Codex: “Opus is like the coworker that is a little silly sometimes, but it’s really funny and you keep him around. Codex is like the weirdo in the corner that you don’t want to talk to, but he’s reliable and gets shit done.” LMAO. Accurate. https://x.com/bilawalsidhu/status/2022571001490325791
@OfficialLoganK Looks like Antigravity is working great now. Gemini CLI still doesn’t. Gemini Code Assist is still announcing it just got Flash 3 🤦♂️ https://x.com/matvelloso/status/2024566224152383824
3.1 Pro can even generate website-ready, animated SVGs from a simple text prompt. Since these are built in pure code — not pixels — they stay crisp at any scale and keep file sizes tiny compared to traditional video. Go ahead, try generating an animated SVG of a pelican riding… https://x.com/Google/status/2024519468395733477?s=20
Today, we’re releasing Gemini 3.1 Pro. It’s the same core intelligence that powers Gemini 3 Deep Think, now scaled for your practical applications. It’s a smarter model for your most complex tasks. See 3.1 Pro in action 🧵↓ https://x.com/Google/status/2024519455389192204
Writing my latest guide on what AI to use made it really clear how confusing the Google AI situation is. Great models with radically different harnesses in different apps. Great AI products, mixed in with some bad ones. None of which seem to clearly connect or interact together. https://x.com/emollick/status/2023965642357907854
The model is a step forward in reasoning, designed for workflows where a simple answer isn’t enough. On ARC-AGI-2 – which tests for novel logic patterns – it more than doubles 3 Pro’s score. This means it can help you visualize complex topics, organize scattered data, and bring… https://x.com/GoogleDeepMind/status/2024516467618656357
Today https://t.co/jFknDoasSy joins Hugging Face Together we will continue to build ggml, make llama.cpp more accessible and empower the open-source community. Our joint mission is to make local AI easy and efficient to use by everyone on their own hardware. https://x.com/ggerganov/status/2024839991482777976
GPT-5.3-Codex-Spark is launching today as a research preview for Pro. More than 1000 tokens per second! There are limitations at launch; we will rapidly improve. https://x.com/sama/status/2022011797524582726
Introducing the Codex app | OpenAI https://openai.com/index/introducing-the-codex-app/
It’s probably one of the best places to start with LLMs in practice. The full core of GPT fits in 243 lines, then you just scale and optimize on top of it. This kind of share is a real gift. https://x.com/TheTuringPost/status/2023348280961495396
Wow gpt 5.3 Codex is actually so good. It has significantly better taste for UI design. My bet is that it’ll be #1 on @designarena once API is available. https://x.com/tkkong/status/2022410732760117403
A few thoughts after reading the @OpenAI paper on scattering amplitudes over the weekend: – The result from GPT-5.2 Pro is technically impressive, clearly at the level of a strong grad student / postdoc – However, without the transcripts, it is unclear what role the human… https://x.com/_lewtun/status/2023334667064099207
GPT-5.2 derived a new result in theoretical physics. We’re releasing the result in a preprint with researchers from @the_IAS, @VanderbiltU, @Cambridge_Uni, and @Harvard. It shows that a gluon interaction many physicists expected would not occur can arise under specific… https://x.com/OpenAI/status/2022390096625078389
Making progress in Quantum Field Theory with GPT-5.2. It’s happening, for real. https://x.com/SebastienBubeck/status/2022439681573695638
🚀 Qwen3.5-397B-A17B is here: The first open-weight model in the Qwen3.5 series. 🖼️ Native multimodal. Trained for real-world agents. ✨ Powered by hybrid linear attention + sparse MoE and large-scale RL environment scaling. ⚡ 8.6x-19.0x decoding throughput vs Qwen3-Max 🌍 201… https://x.com/Alibaba_Qwen/status/2023331062433153103
Happy Chinese New Year!! What a week for open-source LLMs: > Qwen-3.5 > GLM-5 > MiniMax-M2.5 Are we just waiting on DeepSeek-V4 now? Also I’m hoping a US lab steps up with a true frontier open-source model. https://x.com/Yuchenj_UW/status/2023453819938763092
Qwen 3.5 goes bankrupt on Vending-Bench 2 https://x.com/andonlabs/status/2023450768406364238
So a new repo full of MLX-LM-LoRA examples to train your own LLM for Apple Silicon, fast and efficient on ultra long context lengths: Fine-tune Qwen3 4B Instruct on 32K context: https://t.co/yGZlR59fHD Train @IBMResearch Granite 350M model on RL-GRPO Reasoning:… https://x.com/ActuallyIsaak/status/2022414004623479014
🚀 Qwen3.5-397B-A17B-FP8 weights are now open! It took some time to adapt the inference frameworks, but here we are: ✅ SGLang support is merged 🔄 vLLM PR submitted → https://t.co/rJkuitOBWs Check the model card for example code. vLLM support landing in the next couple of days! https://x.com/Alibaba_Qwen/status/2024161147537232110
🚩 Cerebras’s MiniMax-M2 GGUF 2-bit model: https://t.co/udlviJQZqQ Qwen3-Coder-Next INT4 model:… https://x.com/HaihaoShen/status/2022293472796180676
A clarification of Qwen3.5 Plus and 397B: 1. for opensource, we follow the tradition to make parameters apparent so we use the name with the number of total parameters and active params. 2. Qwen3-Plus is a hosted API version of 397B. As the model natively supports 256K tokens,… https://x.com/JustinLin610/status/2023340126479569140
It’s Qwen 3.5 day today! 🥳 State of the art 800 GB model. Runs _locally_ with MLX using Q4, taking 225 GB of RAM. https://x.com/pcuenq/status/2023369902011121869
Let’s do the KV cache math for Qwen3.5: – KV heads: 2 – Head dimension: 256 – gated attention layers: 15 – bytes per element (BF16): 2 2 x 256 x 15 x 2 = 15 360 This is the same for K and V. So, we multiply by 2: 30 720 bytes Roughly 31 kB per token of context. Meaning at max… https://x.com/bnjmn_marie/status/2023424404504342608
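The arithmetic in that thread can be written out as a small helper. The figures (2 KV heads, head dimension 256, 15 attention layers, 2 bytes per BF16 element) are the tweet's; the function itself is an illustrative sketch of the standard per-token KV cache formula:

```python
def kv_cache_bytes_per_token(kv_heads: int, head_dim: int,
                             attn_layers: int, bytes_per_elem: int) -> int:
    """Per-token KV cache size: K and V each store
    kv_heads * head_dim values in every attention layer."""
    per_tensor = kv_heads * head_dim * attn_layers * bytes_per_elem  # K alone (or V alone)
    return 2 * per_tensor  # K and V together

# Figures quoted in the tweet for Qwen3.5 (BF16):
per_token = kv_cache_bytes_per_token(kv_heads=2, head_dim=256,
                                     attn_layers=15, bytes_per_elem=2)
print(per_token)  # 30720 bytes, i.e. roughly 31 kB per token of context
```

Multiplying by the context length then gives the total cache footprint, which is why a tiny per-token number still matters at million-token contexts.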
ollama run qwen3.5:cloud Qwen3.5-397B-A17B is the first open-weight model in the series. It’s available on Ollama’s cloud right now! Give it a try. Let’s go! 🚀🚀🚀 https://x.com/ollama/status/2023334181804069099
Qwen 3.5 Plus is now available on AI Gateway. Thanks @vercel_dev team. 🤝 Use model: ‘alibaba/qwen3.5-plus’ Try it now! https://x.com/Alibaba_Qwen/status/2024029499541909920
Qwen3.5 runs quite well in mlx-lm. Awesome that we have a frontier-level hybrid model. The context gets longer but the inference speed and memory use barely change. Here’s the Q4 generating a space invaders game on an M3 Ultra. Generated 4,120 tokens at 37.6 tok/s. https://x.com/awnihannun/status/2023462412092059679
So speaking of benchmarks, what can be said of the new open Qwen? First, it completely destroys Qwen3-VL-235B ofc, but more surprisingly it outscores Qwen3-Max-thinking. All the while it’s the same model as "Plus". Plus just has 1M context and some more bells and whistles. https://x.com/teortaxesTex/status/2023331885402009779
The new chonky Qwen 3.5 looks pretty solid, beating their own Qwen3-Max model everywhere and is much better at vision benchmarks than Qwen3-235B-A22B-VL Now what I sadly haven’t seen is anything on reasoning efficiency. https://x.com/scaling01/status/2023343368399704506
Kimi K2‑0905 and Qwen3‑Max preview: two 1T open weights models launched | AINews https://news.smol.ai/issues/25-09-05-1t-models
Skills are literally just markdown files how the hell can they have downtime??? https://x.com/theo/status/2024785367896072599
[2602.15763] GLM-5: from Vibe Coding to Agentic Engineering https://arxiv.org/abs/2602.15763
// From Vibe Coding to Agentic Engineering // GLM-5 is a foundation model designed to transition from vibe coding to agentic engineering. The model introduces novel asynchronous agent RL algorithms that enable learning from complex, long-horizon interactions. It also adopts DSA… https://x.com/omarsar0/status/2024122246688878644
🚀 Zhipu AI GLM-5: A Real Step Into the Top Tier? Zhihu contributor toyama nao offers a concise verdict: "A hard road upward — the stairway to godhood." 🔮 From recovery to contention: Over the past six months (4.5 → 5.0), Zhipu has climbed back into China’s first tier and now… https://x.com/ZhihuFrontier/status/2022161058321047681
Funny to see truthy-dpo show up randomly in /v1/completions (hallucination) requests to GLM-5, guess that dataset is still semi-useful! https://x.com/jon_durbin/status/2022291772617945546
GLM-5 is "Bigger, faster, better, and cheaper." @louszbd from @Zai_org broke down GLM-5 on @thursdai_pod with @altryne. New RL framework, DeepSeek sparse attention, 744B params, fully open source under MIT. Catch the full interview in the link below! https://x.com/wandb/status/2022389206572765697
GLM-5 Tech Report https://x.com/scaling01/status/2024050011164520683
Introducing GLM-5 from @Zai_org, the best-in-class open-source model for systems engineering and long-horizon agents. AI natives can now use GLM-5 on Together AI and benefit from reliable inference for production-scale reasoning, coding, and agent workflows. https://x.com/togethercompute/status/2022354579858289052
Presenting the GLM-5 Technical Report! https://t.co/ZTYEe7oM0Y After the launch of GLM-5, we’re pulling back the curtain on how it was built. Key innovations include: – DSA Adoption: Significantly reduces training and inference costs while preserving long-context fidelity… https://x.com/Zai_org/status/2023951884826849777
Really nice tech report, huge props to @Zai_org for still releasing these as they are very valuable for the open-source community. Nice to see many similarities with our recipe for intellect-3, excited for the further work on the RL recipe, already have some stuff cooking up. https://x.com/Grad62304977/status/2024170939248714118
Re-OCR’d the complete 1771 Encyclopaedia Britannica (2,724 pages) with a single command on @huggingface Jobs. – 0.9B model (GLM-OCR) ~$0.002/page ~$5 total on an L4 GPU Before (old Tesseract OCR) → After:… https://x.com/vanstriendaniel/status/2024445900102258846