Agents and Copilots: AI News Week Ending 09/05/2025

Image created with Flux Pro v1.1 Ultra. Image prompt: Agents, neat flowchart made from small bananas connected by thin stems, arrowheads implied by angled banana tips, photorealistic, editorial, minimal, soft studio light, high detail, 3:2 landscape

lee from cursor JUST showed me the future of coding in this 29 min tutorial. what if instead of thinking about coding as you sitting alone in front of a screen, you started to think about it as you and a swarm of agents each taking on very specific roles, each showing up https://x.com/gregisenberg/status/1962935654238847279

The real story in AI right now isn’t just agents. It’s agents + deterministic workflows. That combo is what’s actually driving results today. Free-roaming agents sound cool. But the highest ROI is coming from structured workflows that use models for judgment, not autonomy https://x.com/wadefoster/status/1962940151450976717
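A minimal sketch of the pattern described above, assuming a hypothetical support-ticket workflow: the step order and the routing table are fixed in code, and the model is consulted for a single judgment call. The `classify` callable stands in for a model call, and `keyword_classifier` is a hypothetical stub so the example runs offline; none of these names come from the linked post.

```python
from typing import Callable

# Hypothetical label set; anything outside it is treated as a failed judgment.
ALLOWED_LABELS = {"refund", "bug", "other"}

# Routing is plain data owned by the workflow, not by the model.
ROUTES = {
    "refund": "billing-queue",
    "bug": "engineering-queue",
    "other": "human-review",
}


def keyword_classifier(text: str) -> str:
    """Stand-in for a model call (an assumption so the sketch runs offline)."""
    lowered = text.lower()
    if "refund" in lowered:
        return "refund"
    if "error" in lowered or "crash" in lowered:
        return "bug"
    return "other"


def run_workflow(ticket: str, classify: Callable[[str], str]) -> str:
    """Deterministic pipeline: fixed steps, with the model used only for judgment."""
    cleaned = " ".join(ticket.split())   # step 1: normalize whitespace (no model)
    label = classify(cleaned)            # step 2: the single judgment call
    if label not in ALLOWED_LABELS:      # step 3: validate the model's output
        label = "other"
    return ROUTES[label]                 # step 4: routing decided by code


print(run_workflow("App crash  on login", keyword_classifier))
```

Swapping `keyword_classifier` for a real model client changes only step 2; the validation and routing stay deterministic and testable.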

China’s DeepSeek Preps AI Agent for End-2025 to Rival OpenAI https://finance.yahoo.com/news/china-deepseek-preps-ai-agent-152907224.html?guccounter=1&guce_referrer=aHR0cHM6Ly93d3cuZ29vZ2xlLmNvbS8&guce_referrer_sig=AQAAADiJs67uOGL7PzqX3MGgvD-A6UJVzmztcJfvPzJTz9iF2iWfg-h2zg2pcwJoIuJ-4IUs3BMrEvPbbpbf4j7qXCmM4BqK78UMZVzrZl3fSuokrWneWMYpy8S7L3-xciC9d74km3boS_g57OxikNZN7Owozd204A5KlQA0MSzkqp42

UI-TARS-2 Technical Report Advancing GUI Agent with Multi-Turn Reinforcement Learning https://x.com/_akhaliq/status/1963229296236937443

Can AI agents reliably navigate the web? Does the choice of agent scaffold affect web browsing ability? To answer these questions, we added Online Mind2Web, a web browsing benchmark, to the Holistic Agent Leaderboard (HAL). We evaluated 9 models (including GPT-5 and Sonnet 4) https://x.com/sayashk/status/1963343022252315112

We can finally share UI-TARS-2🥳🥳 — a native GUI agent trained with multi-turn agent RL ⚡️⚡️Key highlights (all-in-one model!): 💻Computer Use: 47.5 OSWorld · 50.6 WindowsAgentArena 📱Phone Use: 73.3 AndroidWorld 🛜Browser Use: 88.2% Online-Mind2Web 🎮Gameplay: ~60% human https://x.com/TsingYoga/status/1963629621326614940

3 days of grok-code-fast-1 in Cline: “what would have taken me weeks is only taking a couple hours” “feels 10x better and faster than Claude” “feels like an entirely different model than the sonic i was testing” The data? >level with Sonnet-4 in diff edits, and improving https://x.com/cline/status/1961488289803939915

Grok Code Fast 1 is versatile across the full stack and is particularly strong at TypeScript, Python, Java, Rust, C++, and Go. Using Grok Code Fast 1, @DannyLimanseta built the following game in a day. https://x.com/xai/status/1961129796349423944

Grok Code Fast from @xai scored 90% on Roo Code evals — top-tier performance at half the cost of its peers. ⚡️ Free to try in Roo Code Cloud until Sept 10. See why speed + savings make @grok a strong new addition: https://x.com/roo_code/status/1962571908224110673

Grok Code just hit #1 on the OpenRouter leaderboard, beating Claude Sonnet https://x.com/elonmusk/status/1961677739762790630

Grok Code lead increased to 60% higher usage than Claude Sonnet https://x.com/elonmusk/status/1962265197462110473

grok-code-fast-1 has good vibes. prob makes the best tradeoff on the speed / intelligence curve right now. gpt-5 is too spiky, sometimes it’s surprisingly good sometimes it overthinks something way too much. you end up spending too much time waiting for some pedantic output. https://x.com/dzhng/status/1961905091960791194

Humbling to see Grok-Code-Fast-1 smash daily token records. The community response has been so incredible that we’re extending our free promo until September 10th. 🧵 Here’s how to get set up in your favorite code editors: https://x.com/veggie_eric/status/1961877264599306573

I tried out @cline + @xai grok-code-fast-1 to assist me with my effort to port a large project (tinygrad) from python to c. So far, I’d been using combination of Claude Code + Claude 4.1 Sonnet/Opus and @roo_code + GPT5 medium for this, with success (though with a lot of hand https://x.com/QuixiAI/status/1962600301309108304

interesting trend from the @xAI team that we haven’t seen from other frontier model labs this is the second round of free access to @grok models they’ve provisioned to Cline users in exchange for rich @cline usage data why is cline data so valuable? it’s a heavyweight workout https://x.com/nickbaumann_/status/1961539461860487664

Some great quotes about Grok Code Fast 1 from our friends at Cline and opencode 🩵 https://x.com/veggie_eric/status/1961474457295622515

The improvement from `sonic` to `grok-code-fast-1` has been notable according to Cline users https://x.com/cline/status/1962628786366881795

New Scale research: Can smaller models reliably oversee stronger LLM agents? We red team monitoring systems to detect covert sabotage, like agents secretly downloading sensitive information. https://x.com/scale_AI/status/1961233659228557530

Apple released FastVLM
so I tried vibe coding a video captioning AI app with it
took 5 prompts to get a working app in anycoder and deployed it on Hugging Face
85x faster and 3.4x smaller than comparable sized VLMs
the deployed app works 100% locally in your browser powered by transformers.js and WebGPU https://x.com/_akhaliq/status/1962018549674684890

We can now say pretty definitively that AI progress is well ahead of expectations from a few years ago. In 2022, the Forecasting Research Institute had super forecasters & experts to predict AI progress. They gave a 2.3% & 8.6% probability of an AI Math Olympiad gold by 2025… https://x.com/emollick/status/1962859757674344823

Enterprise AI, Built Your Way | You.com https://you.com/home

We’re officially a YOUnicorn! Excited to share that @youdotcom just raised $100M Series C at a $1.5B valuation, led by @CoxEnterprises We’ve been heads down building the search infrastructure for the AI and agent future. Soon there will be more AI agents using the web than humans, but today’s search wasn’t built for this. Agents need deep, contextual information from both public web and internal private data to make real decisions. Our web search API delivers the most up-to-date, accurate, and fastest search results for LLMs and agents. Real benchmarks show we consistently outperform the competition on accuracy and speed while staying cost-effective. https://x.com/RichardSocher/status/1963277700711461241

We raised $85M in Series B funding at a $700M valuation, led by Benchmark. Exa is a research lab building the search engine for AI. https://x.com/ExaAILabs/status/1963262700123000947

Did you know you can build a Browser Agent that can navigate Chromium with Gemini 2.5 Flash and @browser_use in under 10 lines of code? https://x.com/_philschmid/status/1963233076034650481

Autonomous News Agent A LangGraph-powered AI agent that autonomously curates news briefings, extracts facts, and summarizes content with integrated human feedback and dynamic tool selection. https://x.com/LangChainAI/status/1962213801249710230

Get a free visual guidebook to learn MCPs from scratch (with 11 projects):
https://x.com/_avichawla/status/1961677843903185078

The funny thing about the prediction that AI would be writing 90% of all code by now is that the prediction’s failure distracts from the fact that AI adoption in code writing is actually extremely high, it was over 30% in December, 2024 according to one measure, with large impact https://x.com/emollick/status/1963262680271094229

This bit strikes me as true based on what I have seen. And a reason why AI agents shouldn’t be owned solely by the IT functions in organizations. https://x.com/emollick/status/1961925069539549479

~40% of daily code written at Coinbase is AI-generated. I want to get it to >50% by October. Obviously it needs to be reviewed and understood, and not all areas of the business can use AI-generated code. But we should be using it responsibly as much as we possibly can. https://x.com/brian_armstrong/status/1963315806248604035

xpander.ai is Backend-as-a-Service for autonomous agents. It abstracts the ops layer so AI engineers focus on behavior and outcomes GitHub repo: https://x.com/_avichawla/status/1962765005537059007

🚨 We’ve just published a recipe to train a frontier-level deep research agent using RL. With just 30 hours on an H200, any developer can now beat Sonnet-4 on DeepResearch Bench using open-source tools. (Thread 🧵) https://x.com/corbtt/status/1962954306078048297

Finnt (@finnt_app) builds AI agents that run your accounting & controllership SOPs end-to-end, replicating workflows, reconciling data & delivering ERP-ready journals. Faster close, audit-ready books. Congrats on the launch, @anjismail! https://x.com/ycombinator/status/1963265695304544271

Turn Claude Code into a Financial Analyst 🤖💹 In this video we point Claude Code at a bucket of 10k filing PDFs, and have it perform complex analysis across the entire set of docs! Claude Code doesn’t have file understanding out of the box (it kind of does, but it’s terrible / https://x.com/jerryjliu0/status/1962586155523940828

Cool research from Microsoft! They release rStar2-Agent, a 14B math reasoning model trained with agentic RL. It reaches frontier-level math reasoning in just 510 RL training steps. Here are my notes: https://x.com/omarsar0/status/1964045125115662847

rStar2-Agent: Agentic Reasoning Technical Report “We introduce rStar2-Agent, a 14B math reasoning model trained with agentic reinforcement learning to achieve frontier-level performance.” “three key innovations that makes agentic RL effective at scale: (i) an efficient RL https://x.com/iScienceLuvr/status/1962798181059817480

Comet is coming soon to mobile and is now available for pre-orders on Android Play Store https://x.com/AravSrinivas/status/1963620578344276366

Another major Perplexity iOS app update. Team cooked. Answers are now streamed smooth as butter. Tables, markdown, intermediate steps. Update and enjoy! https://x.com/AravSrinivas/status/1963758210281882029

Pro users in South Korea, Brazil, and Spain can now download Comet. https://x.com/perplexity_ai/status/1963638853975040456

🚀 Select PayPal and @Venmo customers can skip the waitlist for early access to @perplexity_ai’s AI-powered Comet browser and receive a free 12-month Perplexity Pro trial. This offer is part of the new PayPal Subscriptions Hub, where you can: ✨ Manage subscriptions ✨ Update https://x.com/PayPal/status/1963229273071698199

We are rolling out Comet to all students worldwide. Ask Comet to manage your schedule, order textbooks, or prepare for exams with Study Mode. https://x.com/perplexity_ai/status/1963285255198314951

Framer Raises $100 Million Series D at a $2 Billion Valuation to Redefine How Businesses Build Websites https://www.businesswire.com/news/home/20250828901842/en/Framer-Raises-%24100-Million-Series-D-at-a-%242-Billion-Valuation-to-Redefine-How-Businesses-Build-Websites

Hugging Face team just released an agent dataset. Training on it drastically improves the ability to execute code and analyze data. 📈 They use E2B sandboxes to simulate a real code execution environment. Check it out: https://x.com/e2b/status/1962945170736849262

‼️LangChain & LangGraph 1.0alpha releases Today we are announcing alpha releases of v1.0 for langgraph and langchain, in both Python and JS. 🕸️LangGraph is a low-level agent orchestration framework, giving developers durable execution and fine-grained control to run complex https://x.com/LangChainAI/status/1962934869065191457

.@Kimi_Moonshot just released Kimi-K2-0905 and here’s the scoop: > 2x context window (131k to 256k) > trained for improved tool use, esp in agent harnesses like Cline > it’s markedly better at frontend Try it now in Cline 🌔 https://x.com/cline/status/1963804927584833725

“Agent/Client Protocol” (ACP) Manages agent-IDE interactions, similar to an LSP. Claude Code and Gemini CLI supported. https://x.com/mathemagic1an/status/1963273618705482155

Alex – Xcode AI Coding Assistant https://www.alexcodes.app/blog/alex-team-joins-openai

AWorld Orchestrating the Training Recipe for Agentic AI https://x.com/_akhaliq/status/1961456228044873888

Finally, a production-ready backend for Agents that actually works! xpander is a plug-and-play backend for Agents that manages memory, tools, states, version control, guardrails, and more. Works with any framework, like CrewAI, Agno, Langchain, etc. Fully self-hostable! https://x.com/_avichawla/status/1962764993587564861

Get your data request done in your sleep. Playbooks are autonomous agent runs that: – Work with ANY existing infrastructure (Snowflake, BigQuery, Databricks, dbt, Looker, etc.) – Execute complex analysis workflows autonomously – Deploy entirely in your VPC with SOC2 Type II https://x.com/TextQL/status/1962925594620166364

Introducing the new Retool example on slime: a clean abstraction for multi-turn rollouts with tool use. https://x.com/Zai_org/status/1963836843633332457

Introduction – Agent Client Protocol https://agentclientprotocol.com/overview/introduction

Just released 1.0alpha releases for LangChain and LangGraph packages Targeting late october to go live with the first major versions for both packages Thoughts? Feedback? Let us know! https://x.com/hwchase17/status/1962935384490565926

Luna-2 Guardrails Tiny and fast models that enable protection at scale. • 20+ live metrics • <200ms latency at 100% sampling • 97% cheaper than LLM guardrails • Customizable for your use cases Think “real-time firewall” for agents. https://x.com/omarsar0/status/1962880989111197854

One of the nice parts of `langchain` 1.0 is standard content blocks for reasoning, citations, multimodal stuff, etc Has been annoying to see providers implement slightly different variants of these, this should help make it easy to treat these in a standard way https://x.com/hwchase17/status/1963287729007165488

SimpleTIR End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning https://x.com/_akhaliq/status/1963228487524679988

This was a long time coming, it was a lot of fun to finally write down all the thought that went into langgraph, lmk any comments! https://x.com/nfcampos/status/1963652967443435723

Today we’re releasing jina-code-embeddings, a new suite of code embedding models in two sizes—0.5B and 1.5B parameters—along with 1~4bit GGUF quantizations for both. Built on latest code generation LLMs, these models achieve SOTA retrieval performance despite their compact size. https://x.com/JinaAI_/status/1963637135439007824

try the codex vscode plugin — it’s already very good and will improve fast: https://x.com/gdb/status/1961349040056000719

vLLM is proud to support the great Kimi update from @Kimi_Moonshot , better tool-calling, longer context, and more! Check the deployment guide at https://x.com/vllm_project/status/1963805972352188895

We’ve been trying to wrangle Responses API, and here’s my response to this: Claim #1 -> Responses is just ChatCompletions++ Wrong! Responses makes managing your own context window a PINA and also introduces a brand new (not generally supported) API that is very obviously https://x.com/sarahwooders/status/1964016384889016787

Building AI agents for production comes with unique challenges. In our latest blog, we share how we designed LangGraph to tackle them: 🔹 Why heavy abstractions fail and what really matters for control & durability 🔹 The 6 features every production agent needs in practice 🔹 https://x.com/LangChainAI/status/1963646974315606428

https://x.com/omarsar0/status/1962875111037358540

AI Agents for notes and research is wild! Claude Code is a beast at this. https://x.com/omarsar0/status/1962268120069853538

Claude Code in Zed https://x.com/zeddotdev/status/1963258131191853285

We just added a few major tool updates to the code execution tool in the Anthropic API: – bash tool for running any bash command – str_replace for precise file editing – view for reading files, browsing dirs, displaying images – create for writing new files https://x.com/alexalbert__/status/1962912152555225296
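To illustrate why an exact-match string replacement suits precise file editing, here is a minimal sketch of the general idea. It is not Anthropic's implementation or API: the function name, its signature, and the rule that the target must match exactly once are assumptions made for this example.

```python
def str_replace(text: str, old: str, new: str) -> str:
    """Replace `old` with `new` only if `old` occurs exactly once in `text`.

    Refusing zero or multiple matches (an assumed rule for this sketch)
    keeps an edit from silently landing in the wrong place.
    """
    count = text.count(old)
    if count == 0:
        raise ValueError("old string not found")
    if count > 1:
        raise ValueError(f"old string is ambiguous: {count} matches")
    return text.replace(old, new, 1)


source = "def greet():\n    print('helo')\n"
print(str_replace(source, "'helo'", "'hello'"))
```

Compared with rewriting a whole file, an edit of this shape only needs the changed snippet plus enough surrounding text to make the match unique.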

Introducing SemTools – add blazing-fast semantic search to your entire filesystem without a vector database ⚡️ Coding agents like Claude Code/Cursor have full access to the CLI like grep, cat, and pipe operations for search. But they lack ‘proper’ semantic search that’s actually https://x.com/jerryjliu0/status/1961488443663597857

pip install -U mlx https://x.com/awnihannun/status/1961484829037330612

Smart ring rivalry heats up: Ultrahuman sues Oura over patent claims https://www.msn.com/en-us/money/other/smart-ring-rivalry-heats-up-ultrahuman-sues-oura-over-patent-claims/ar-AA1L2RLo?apiversion=v2&noservercache=1&domshim=1&renderwebcomponents=1&wcseo=1&batchservertelemetry=1&noservertelemetry=1

Evaluate Your AI Agents Like a Pro! 🔥 Agno’s Simple Agent Evals are unit tests for your Agents. You can use them to measure the accuracy, performance, and reliability. The best part is that they are easy to use and powerful. 100% Open-Source Link to code examples in the https://x.com/tinztwins/status/1962197412077842846

Everyone claims SOTA for Computer Use Agents (CUAs), but there’s no way to ensure reproducible results. We’re publicly releasing our OSWorld Verified leaderboard, starting with CUA models from OpenAI and Anthropic. We will include more evals and models soon. https://x.com/hud_evals/status/1963321238056796573

Improving the reliability of multi-agent systems is extremely hard. And it’s a must when deploying AI agents. Galileo offers one of the most comprehensive agent eval solutions I’ve seen. Here is how they help devs and huge companies deploy reliable AI agents: https://x.com/omarsar0/status/1962880974104014948

Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure (update) pretty interesting method and ranking. 2.5 Flash > 2.5 Pro suggests it’s not so much about model capacity, but still. V3.1-NS is far above R1-0528. OpenAI crushingly dominant. https://x.com/teortaxesTex/status/1961298849047117832

Which LM is better at agentic coding? We have a bunch of useful academic benchmarks like SWE-Bench, but we don’t have a good comparison of agentic coding LMs *in the wild*. To solve this, we released PR Arena: https://x.com/gneubig/status/1963267468853477809

Who is inducing failure in LLM Agentic Systems? This is a cool idea to diagnose errors in multi-agent interactions. AgenTracer-8B outperforms giant proprietary LLMs like Gemini-2.5-Pro and Claude-4-Sonnet by up to 18.18%. https://x.com/omarsar0/status/1963618829680218254

Today we’re launching Atla — the improvement engine for AI agents. Atla helps agent builders find and fix recurring failures. Instead of just surfacing traces, Atla automatically identifies your agent’s most critical failure patterns and suggests targeted fixes. https://x.com/Atla_AI/status/1963586200305836264

Gartner predicts 40% of projects will fail by 2027 due to reliability. This is why companies like MongoDB + Cisco use Galileo to protect 100s of agents and millions of queries/day. Love what this team is building. Can’t recommend them enough. https://x.com/omarsar0/status/1962880991569059950

I’ve been locked in the last few months building something big. Introducing Questera — your 24/7 agentic growth team. The ultimate AI operating system for customer engagement: real-time signals → instant next best action → across every channel (email, push, ads, in-app & https://x.com/richexplorer_/status/1962948237914202585

Releasing the Jupyter Agent Dataset! 🚀 Training on this data dramatically improves the ability to execute code and analyze data. Built from 7 TB of real Kaggle datasets + 20k notebooks, creating real code exec traces using Qwen3-Coder and E2B. https://x.com/a_yukh/status/1962911097452683710

Jina Code Embeddings: SOTA Code Retrieval at 0.5B and 1.5B https://x.com/JinaAI_/status/1963637141675843791

vibe coding app: https://x.com/_akhaliq/status/1962920607684730977

🧱Building LangGraph: Designing an Agent Runtime from first principles @nfcampos wrote a detailed blog on how he designed and built LangGraph, going deeper than any other piece of content we’ve put out. Includes: 1. what’s different about building agents compared to traditional https://x.com/hwchase17/status/1963647954587455568

Codex on VSCode > copilot (and the UI is so clean 🤌) https://x.com/flavioAd/status/1961094013567562195

Codex CLI hype, final verdict: The codex cli hype is real, I just tried it. GPT-5 (high) in codex is great: – It stays on track much longer than opus – Never “gives up” on your task even if takes a while – Much longer context window – Not arguing & “you’re absolutely right”ing https://x.com/Yampeleg/status/1963260958257578497

gpt-5 for Xcode: https://x.com/gdb/status/1961563165541777914

GPT-5 is now built into Xcode 26! https://x.com/OpenAIDevs/status/1961557515331862853

I now use gpt-5 exclusively for coding (well almost, o3 at times too). It took some getting used to as I find it a tad pedantic, requiring more exact prompting. But as of late it’s been consistently the best model across the domains I use it for. https://x.com/martin_casado/status/1961903651733307452

I think congrats again to OpenAI for cooking with GPT-5 Pro. This is the third time I’ve struggled on something complex/gnarly for an hour on and off with CC, then 5 Pro goes off for 10 minutes and comes back with code that works out of the box. I had CC read the 5 Pro version https://x.com/karpathy/status/1964020416139448359

people seem to really like the new codex features! https://x.com/sama/status/1961096744533647501

The Codex CLI and IDE extension are evolving at truly a mind blowing pace. https://x.com/dkundel/status/1963834846394147125

`langchain` 1.0, now in alpha, ships with improved standardization for reasoning, citations, tool calls, multimodal data, and other content across LLM providers. No more juggling APIs— just one consistent interface.
https://x.com/LangChainAI/status/1963285794954907750

AI Rails App Builder A natural language-powered system that builds and modifies Rails applications in real-time. Using LangGraph, it handles file operations and Rails commands through an intelligent agent with live previews. https://x.com/LangChainAI/status/1962183602185314525

Issue Triager Agent A GitHub issue management solution that uses LangGraph to automatically handle stale issues with human oversight through Agent Inbox. Built with LangSmith integration for comprehensive monitoring and control. https://x.com/LangChainAI/status/1962198699653861755

this repo is wild. 100+ production-ready AI Agents, RAG, Multi-Agent teams, Voice Agents, MCP, and LLM apps with step-by-step tutorials. 100% free and open source by @Saboo_Shubham_ 👏 link in next post 👀 https://x.com/MakerThrive/status/1962661273335742780

Agents really really need ultra long context https://x.com/Teknium1/status/1963807244190900618

Learning When to Plan LLM agents trained with dynamic planning learn when to spend test-time compute, balancing cost & performance. This is the first work to explore training LLM agents for dynamic test-time compute allocation in sequential decision-making tasks. https://x.com/arankomatsuzaki/status/1963820986668626156

We investigate Reinforcement Learning (RL) on Agentic search tasks without explicit gathering information from external search engines, e.g., LLMs, web engines. https://x.com/TheTuringPost/status/1961927988704076157

ChatMCP, now in FastMCP Cloud by @PrefectIO. – Push a commit. – Get a remote MCP server. – Get a chat client automatically connected to it. Deploy, use, test, dogfood, experiment all on one platform. https://x.com/fastmcp/status/1961436552057278512

Github: MCP Universe https://x.com/_philschmid/status/1962935892999331922

Introducing 20+ connectors powered by MCP and a fully controllable Memory in Le Chat—making it one of the most connected and relevant AI assistants for enterprises and consumers. Why switch to Le Chat? A 🧵 https://x.com/MistralAI/status/1962881084183527932

This is massive for AI! Everyone knows about MCP and A2A, but you can’t build complete agentic solutions without people! That’s what the Agent-User Interaction Protocol (AG-UI) is for. This is a protocol for building user-facing AI agents. It’s a bridge between a backend AI https://x.com/svpino/status/1962844250539962521

Together, these updates unlock new capabilities and make the code execution tool more efficient, requiring fewer tokens on average. Learn more in the docs: https://x.com/alexalbert__/status/1962912195983114725

BOOM! Now you can deploy powerful MCP servers to Google Cloud in just a single command🔥 > $ gradio deploy --provider gcloud > Built-in queue for scaling up to production workloads⚡ Keep reading to know more. https://x.com/Gradio/status/1963636954999754955

How can we benchmark Agents in realistic, complex environments? MCP-Universe is a new benchmark using Model Context Protocol (MCP) servers to test Agents on 231 challenging, practical tasks. Benchmark: 1️⃣ Tasks from 6 practical domains, Location Navigation, Repository https://x.com/_philschmid/status/1962935890415599650

MCP-Bench Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers https://x.com/_akhaliq/status/1961456699564294651

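At the start of the year, when I wanted to improve tool-calling, I badly lacked a reliable benchmark. I figured “MCP is hot, so an open-source mcp-bench will surely be available in a few days,” but after waiting several months none had appeared. So why are several mcp-bench releases suddenly coming out every week? https://x.com/bigeagle_xd/status/1961461441799852128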

Apple’s rumored AI search tool for Siri could rely on Google | The Verge https://www.theverge.com/news/770712/apple-ai-search-tool-siri-google-gemini

ollama-style CLI for running MLX models on Apple Silicon https://x.com/tom_doerr/status/1961309536406392877

often researcher’s ability to iterate on a capability is limited by our ability to measure that capability. i do believe progress is more eval-limited than people think. sometimes evals feel causal. did SWE-Bench follow agentic coding, or did agentic coding follow SWE-bench? we https://x.com/willdepue/status/1963739518554489250

Today we’re updating Artificial Analysis Intelligence Index to V3, now incorporating agentic evaluations Terminal-Bench Hard and 𝜏²-Bench Telecom! Tool calling and agentic workflows are increasingly the norm for how language models are used by both developers and consumers. https://x.com/ArtificialAnlys/status/1962881314925023355

airline customer service AI demo with agent handoffs https://x.com/tom_doerr/status/1962972766174339271

Now you can grep a PDF (and any document) Introducing SemTools – simple parsing and semantic search for the command line https://x.com/LoganMarkewich/status/1961448960184520945

📣Groq’s first agentic system is ready for production at scale. Already battle tested by 100K+ developers across 5M+ requests. Compound is now GA, available to everyone on GroqCloud. Go Build ⬇️ https://x.com/GroqInc/status/1963635205899710798

Mass Intelligence means we are going to be inundated with stories about people using AI to do amazing things and horrifying things as over a billion people increasingly get access to advanced (and easy-to-use) AI models. Things are going to get very weird. https://x.com/emollick/status/1961469787415949417

GPT-5 Pro and Gemini 2.5 Pro Deep Think are both very impressive models for hard problems. I think they were both undersold during their respective launches, in part because I am not sure the labs themselves really understand the market for a slow, “deep-thinking” model, yet. https://x.com/emollick/status/1962375003053216041

We just added OpenAI Codex CLI formal support in Hugging Face MCP Server – go play with it now!! 🔥 https://x.com/reach_vb/status/1963599978909008321

Finally, MCP servers can now deliver UI-rich experiences!
MCP servers in Claude/Cursor don’t offer any UI experience yet, like charts. It’s just text/JSON.
mcp-ui lets you add interactive web components to its output that can be rendered by the MCP client. https://x.com/_avichawla/status/1961677831861395495

Our most compact LLM from the Hermes 4 series is locally usable and optimized for consumer hardware, providing at-home access to its powerful hybrid reasoning and tool calling.
https://x.com/NousResearch/status/1963349882837897535

We’re collaborating with @Microsoft to bring you an intimate meetup in London on the challenge most AI agents face: handle complex projects that require real planning and execution over time. Join us as Harrison Chase (our co-founder & CEO) shares insights on building “Deep https://x.com/LangChainAI/status/1963316066735812876

Le Chat. Custom MCP connectors. Memories. | Mistral AI https://mistral.ai/news/le-chat-mcp-connectors-memories

From payments data and refunds to invoices and subscriptions, @MistralAI’s users can now handle it all inside Le Chat with @stripe’s MCP. Here’s how it works: https://x.com/emilygsands/status/1962884010289590583

Getting questions on why Codex web feels faster – We added container caching that helps new tasks and followups run 90% quicker, dropping the median start time from 48 seconds to 5 seconds ⏩ https://x.com/derrickcchoi/status/1961263884641194391
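As a quick check of the figures quoted above, a drop in median start time from 48 seconds to 5 seconds works out to a reduction of roughly 90%. The helper name below is hypothetical.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100


# (48 - 5) / 48 = 43 / 48, about 89.6%, consistent with the "90% quicker" claim.
print(round(percent_reduction(48, 5), 1))
```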

gpt-5 is great as your daily driver for coding: https://x.com/gdb/status/1961931756246024600

gpt-5 is incredible at coding, and really shines with the right prompting style:
https://x.com/gdb/status/1961839687619969288

In AI SDK 5, the OpenAI provider defaults to the Responses API. The Completions API remains available and fully supported. https://x.com/aisdk/status/1963999103626727518

Projects in ChatGPT are now available to Free users.
https://x.com/OpenAI/status/1963329936368046111

really cool to see how much people are loving codex; usage is up ~10x in the past two weeks! lots more improvements to come, but already the momentum is so impressive. https://x.com/sama/status/1963365966953505103

step function improvement in start time for Codex remote tasks: https://x.com/gdb/status/1961927789214626288

thanks andrej! do you care more about it getting smarter or faster? https://x.com/sama/status/1964032346975588371

VS Code adds support for custom OAI-compatible endpoints
https://x.com/ggerganov/status/1963255949373677959

We’re releasing new Codex features to make it a more effective coding collaborator: – A new IDE extension – Easily move tasks between the cloud and your local environment – Code reviews in GitHub – Revamped Codex CLI Powered by GPT-5 and available through your ChatGPT plan. https://x.com/OpenAIDevs/status/1960809814596182163

Alex – I’m excited to announce that we’re joining OpenAI’s Codex team! When we started out, Xcode had no AI. Building a “Cursor for Xcode” sounded crazy, but we managed to do it anyway. And, over time, we built the best coding agent for iOS & MacOS apps. https://x.com/danieledrisian/status/1963301872036712652

What is @OpenAI’s Responses API, and should you use it instead of Chat Completions? 🤔 TL;DW: → Built for agents (& remote MCPs 🤫) → Better streaming control → Better structured + multimodal outputs The Responses API is available now on @GroqInc in beta. https://x.com/benankdev/status/1961444239327240500

Nice to see open models (+ OpenHands) showing really strong performance here. Arguably for pure agentic coding tasks open models have almost caught up with closed ones. For more diverse tasks there’s still a little ways to go. https://x.com/gneubig/status/1963045532022010231

Glad to see Qwen3-Coder performing well on the GSO leaderboard! https://x.com/Alibaba_Qwen/status/1963049864474120475

Overview of Self-Evolving Agents
https://x.com/omarsar0/status/1962202247154352502

Modern AI teams need hyperscalers & neoclouds, but legacy tools like SLURM can’t keep up. @AbridgeHQ moved from SLURM to multi-cloud AI infra with @skypilot_org. ✅ 10x faster dev cycles ✅ SLURM-like convenience, K8s’ reliability ✅ Scale on any infra https://x.com/skypilot_org/status/1963637217055646139

it’s really fast https://x.com/vikhyatk/status/1961959454347501781