Agents and Copilots: AI News Week Ending 08/15/2025

August 15, 2025

Image created with Flux Pro v1.1 Ultra. Image prompt: CU Boulder brand style — CU Gold & Black, Helvetica Neue, Flatirons, Tuscan-vernacular sandstone + red-tile roofs; classroom whiteboard in Old Main, overcast soft light, over-shoulder student POV, subtle Flatirons contour linework; integrate the category “Agents” via Diagram: hand-drawn multi-agent flowchart with tool nodes and arrows labeled AGENTS; natural light, clean professional inspiring tone, crisp focus, subtle grain, editorial composition

Clodo (@ClodoAI) is the AI Assistant for Real Estate Agents. It helps agents remember and follow up with all their leads. Dominate your follow-up, dominate your market. https://x.com/ycombinator/status/1953546689278804034

After nearly four years as CEO, I’m leaving GitHub to become a startup founder again. With more than 1B repos and forks, 150M+ developers, and Copilot continuing to lead the most thriving market in AI with 20M users and counting, GitHub has never been stronger than it is today. https://x.com/ashtom/status/1954920157853172064

Apple App Intents Voice Control Feature for Siri, Apps; iOS 26 Release Timing – Bloomberg https://www.bloomberg.com/news/newsletters/2025-08-10/apple-app-intents-voice-control-feature-for-siri-apps-ios-26-release-timing

It’s sometimes hard to grasp the significance of the reasoning and logic updates that are starting to emerge in powerful models, like GPT-5. Here’s a *very simple* example of how powerful these models are getting.

I took a recent NVIDIA earnings call transcript document that came in at 23 pages long and had 7,800 words. I took part of the sentence “and gross margin will improve and return to the mid-70s” and modified “mid-70s” to “mid-60s”.

For a remotely tuned-in financial analyst, this would look out of place, because the margins wouldn’t “improve and return” to a lower number than the one described as a higher number elsewhere. But probably 95% of people reading this press release would not have spotted the modification because it easily fits right into the other 7,800 words that are mentioned.

With Box AI, testing a variety of AI models, I then asked a series of models “Are there any logical errors in this document? Please provide a one sentence answer.”

GPT-4.1, GPT-4.1 mini, and a handful of other models that were state of the art just ~6 months ago generally came back and returned that there were no logical errors in the document. For these models, the document probably seems coherent and follows what it would expect an earnings transcript to look like, so nothing really stands out for them on what to pay attention to – sort of a reverse hallucination.

GPT-5, on the other hand, quickly discovered the issue and responded with:

“Yes — the document contains an internal inconsistency about gross-margin guidance, at one point saying margins will “return to the mid-60s” and later saying they will be “in the mid-70s” later this year.”

Amazingly, this happened with GPT-5, GPT-5 mini, and, remarkably, *even* GPT-5 nano. Bear in mind, the output tokens of GPT-5 nano are priced at 1/20th of GPT-4.1’s tokens. So, more intelligent (at this use-case) for 5% of the cost.

Now, while doing error reviews on business documents isn’t often a daily occurrence for every knowledge worker, these types of issues show up in a variety of ways when dealing with large unstructured data sets, like financial documents, contracts, transcripts, reports, and more. It can be finding a fact, figuring out a logical fallacy, running a hypothetical, or requiring sophisticated deductive reasoning.

And the ability to apply more logic and reasoning to enterprise data becomes especially critical when deploying AI Agents in the enterprise. So, it’s amazing to see the advancements in this space right now, and this is going to open up a ton more use-cases for businesses.
https://x.com/levie/status/1953670264988016931
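The experiment described above can be approximated in a few lines. This is a minimal sketch, not the author's actual Box AI setup: `plant_inconsistency`, `run_check`, and the two stub models are hypothetical names, the transcript is a toy two-sentence stand-in, and each model is represented as a plain Python callable so that a real client could be swapped in.

```python
from typing import Callable, Dict

QUESTION = (
    "Are there any logical errors in this document? "
    "Please provide a one sentence answer."
)


def plant_inconsistency(text: str, original: str, altered: str) -> str:
    """Swap the first occurrence of `original` for `altered` (hypothetical helper)."""
    if original not in text:
        raise ValueError(f"phrase not found: {original!r}")
    return text.replace(original, altered, 1)


def run_check(
    document: str,
    models: Dict[str, Callable[[str], str]],
    keyword: str,
) -> Dict[str, bool]:
    """Ask every model the same question and record whether its answer mentions `keyword`."""
    prompt = f"{QUESTION}\n\n{document}"
    return {name: keyword.lower() in ask(prompt).lower() for name, ask in models.items()}


# Stand-ins for real model calls (assumptions, for illustration only).
def stub_misses(prompt: str) -> str:
    return "No, the document is logically consistent."


def stub_catches(prompt: str) -> str:
    return "Yes, the gross-margin guidance is inconsistent: mid-60s vs. mid-70s."


# Toy stand-in for the 7,800-word transcript.
transcript = (
    "... and gross margin will improve and return to the mid-70s. ... "
    "We expect gross margin to be in the mid-70s later this year."
)
modified = plant_inconsistency(transcript, "mid-70s", "mid-60s")

results = run_check(
    modified,
    {"stub_misses": stub_misses, "stub_catches": stub_catches},
    keyword="mid-60s",
)
print(results)

# The pricing claim in the post: 1/20th of the price is 5% of the cost.
print(f"{1 / 20:.0%}")
```

Checking for a keyword in the answer is a crude scoring rule; a real evaluation would more likely grade the answers by hand or with a separate judge.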

Here’s the thing: For 35 years, I’ve researched the immune system & have been fortunate to make many important, impactful discoveries, placing me in the top 0.5% of immunology experts.

The @OpenAI GPT-5 Thinking & Pro models now match or even surpass my expertise in immunology!

Comet for Enterprise is here. Comet is an AI-powered browser agent that thinks with you, linking tools for streamlined workflows and trusted answers. Enterprise Pro users maintain the security, privacy, and compliance standards that come with an Enterprise subscription. https://x.com/perplexity_ai/status/1956046685509210183

Exclusive | Perplexity Makes $34.5 Billion Offer for Google’s Chrome Browser – WSJ https://www.wsj.com/tech/perplexity-ai-google-chrome-offer-5ddb7a22

🔍🤖 LangChain + Oxylabs Guide Integrate LangChain’s AI framework with Oxylabs’ Web Scraper API for advanced web scraping. Includes dedicated module, MCP server, and built-in solutions for IP blocking and CAPTCHAs. Learn more about the integration 👉 https://x.com/LangChainAI/status/1954241268114182433

simulate a million bots in social networks https://x.com/tom_doerr/status/1952290852182647003

Apple’s AI Turnaround Plan: Robots, Lifelike Siri, Home Security Cameras (AAPL) – Bloomberg https://www.bloomberg.com/news/articles/2025-08-13/apple-s-ai-turnaround-plan-robots-lifelike-siri-and-home-security-cameras

GitHub just got less independent at Microsoft after CEO resignation | The Verge https://www.theverge.com/news/757461/microsoft-github-thomas-dohmke-resignation-coreai-team-transition

People assume that AI homogenizes creative writing, producing much less diverse work than groups of humans This paper finds this isn’t true: given stories to complete, GPT-4o writes as diversely as humans (stylistic, lexical, & semantic) when prompted with context & randomness https://x.com/emollick/status/1955265535714726303

Gemini 2.5 Pro has a 67% winrate against GPT-5 Thinking https://x.com/scaling01/status/1954546677185970271

GLM4.5V is out! it’s a multimodal reasoning MoE with 106B total and 12B active params 🔥 it comes with transformers support from get-go! 💗 you can also use with @huggingface Inference Providers powered by @novita_labs 👏 https://x.com/mervenoyann/status/1954907611368771728

Introducing DINOv3 🦕🦕🦕 A SotA-enabling vision foundation model, trained with pure self-supervised learning (SSL) at scale. High quality dense features, combining unprecedented semantic and geometric scene understanding. Three reasons why this matters… https://x.com/maxseitzer/status/1956029421602623787

new TRL comes packed for vision language models 🔥 we shipped support for > native supervised fine-tuning for VLMs > multimodal GRPO > MPO 🫡 read all about it in our blog 🤗 next one! https://x.com/mervenoyann/status/1955622287920537636

Say hello to DINOv3 🦖🦖🦖 A major release that raises the bar of self-supervised vision foundation models. With stunning high-resolution dense features, it’s a game-changer for vision tasks! We scaled model size and training data, but here’s what makes it special 👇 https://x.com/BaldassarreFe/status/1956027867860516867

zai-org/GLM-V: GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning https://github.com/zai-org/GLM-V

🏆NVIDIA AI-Q, an NVIDIA Blueprint for building AI agents with advanced reasoning skills, is now the leading open and portable #AIagent for high-fidelity research on the Deep Research Bench leaderboard. ➡️ https://x.com/NVIDIAAIDev/status/1952429440551547332

Worth reading what GPT-5 wrote in the intro to my new post when asked to do something dramatic. The second image explains the tricks it used. https://x.com/emollick/status/1953520251913564420

This is what Andrej predicted! GPT-5 + ElevenLabs = engagement gold. Think those ‘monkey economy’ videos? Same formula — but swap monkeys for AI-generated cats, dogs, whatever. Script with GPT-5, voice with ElevenLabs, visuals with AI. Low effort, high share potential https://x.com/Dvnagelx/status/1954096453594288285

OpenAI gpt-oss has over 5M downloads, 400+ fine-tunes and *the* most liked release this year so far! 🔥 Great job @OpenAI 🤗 https://x.com/reach_vb/status/1954909541805801799

OpenAI hasn’t open-sourced a base model since GPT-2 in 2019. they recently released GPT-OSS, which is reasoning-only… or is it? turns out that underneath the surface, there is still a strong base model. so we extracted it. introducing gpt-oss-20b-base 🧵 https://x.com/jxmnop/status/1955436067353502083

OpenAI gpt-oss 120B orchestrates a full video using Hugging Face spaces! 🤯 All of it, in one SINGLE prompt: create an image of a Labrador and use it to generate a simple video of it 🛠️ Tools used: 1. Flux.1 Krea Dev by @bfl_ml 2. LTX Fast by @Lightricks That’s it, gpt-oss https://x.com/reach_vb/status/1955678303395696821

The web’s next user isn’t human. AIs will soon use the internet far more than humans ever have. At Parallel, we are building for the web’s second user. Our API is the first to surpass humans and all leading AI models (including GPT-5) on deep web research tasks. https://x.com/p0/status/1956007609250492924

Introducing Parallel | Web Search Infrastructure for AIs | Parallel Web Systems | Enterprise Deep Research API
https://parallel.ai/blog/introducing-parallel

Lindy 3.0 is live. You can now create agents with a simple prompt and have them use a computer just like a human would. All year → this moment. Agent Builder, Autopilot, Team Collaboration. We accidentally built a website builder while testing autopilot. That’s how powerful https://x.com/getlindy/status/1952420360734847205

Agents will become a common way people shop. So today we are releasing 3 tools to make adding commerce to those agents trivial: – Checkout Kit: embed commerce widgets and checkout(!) directly into your agent and chat. This is already being used by Microsoft’s @Copilot. – Shopify https://x.com/tobi/status/1952800271257706676

It’s official. We’ve raised $14m led by @OpenAI Startup Fund to bring AI to Excel. Endex is the first AI agent to live inside Excel. For the past year, we’ve been working with financial firms. Today we’re releasing it to the world. Our capacity is limited; comment below for https://x.com/TarunAmasa/status/1953130965355905140

When and if AI development plateaus (and no indication that is happening yet), it may actually accelerate AI integration into our lives, because then it becomes easier to figure out what products & services are needed to complement AI. Right now capabilities are changing too fast https://x.com/emollick/status/1954855248679334261

📊 Behind every great business is… a spreadsheet. Budgets, forecasts, inventory, project plans. It is where the real work happens. Now, Manus builds them for you. From a single prompt, you can generate full spreadsheets with structure, formulas, charts, and logic baked in. https://x.com/ManusAI_HQ/status/1953115359609012640

OpenAI is optimizing to be a billion user consumer product over being a developer platform. https://x.com/bilawalsidhu/status/1955119548794839295

GPT-5 is the most significant product release in AI history, but not for the reason you might think. What it signals is that we’re moving from the “bigger model, better results” era to something much more nuanced. This is a genuine inflection point. The fact that people call a https://x.com/douwekiela/status/1955329657852834207

RT @jandotai: Introducing Jan-v1: 4B model for web search, an open-source alternative to Perplexity Pro. In our evals, Jan v1 delivers 91%… https://x.com/ggerganov/status/1955191376217297057

What can OpenAI’s new open models do with the news? I built a News Agent to find out. It can answer questions about the news in real time, and every answer comes with original source links so you can dive deeper. Runs with Hugging Face inference providers, letting you compare https://x.com/fdaudens/status/1955296761582358828

This guy literally built a viral website from scratch in 10 minutes with GPT-5 https://x.com/aaditsh/status/1954210152170893668

@dsog @Replit @amasad I built a product only using Replit that is now generating $11k MRR. Took me 1 month to build the MVP and 3.5 months until $10k MRR. Replit truly is a superpower and I feel like it’s only getting better 🙏🏼 https://x.com/ViktorThulin/status/1952309684758523995

📚 🛠️ Agent Reliability A practical guide to help you catch hallucinations, verify groundedness, and monitor tool usage for LangChain/LangGraph applications: https://x.com/LangChainAI/status/1954233716487958845

Agent Reinforcement Trainer has taken off like a rocket since we launched RULER a couple weeks ago. Today, we passed 5,000 stars on GitHub! The community is super friendly and active and it has never been easier to get started with RL. Come join us on GitHub/Discord! https://x.com/corbtt/status/1953277065115124054

built a fun camera-based color picker with @v0 and @figma tap anywhere on the live feed to grab a color and give it a name swipe right to save in localStorage, left to forget build palettes, download color cards. i call it Phantome 🎨📷 https://x.com/vi_harkrishnan/status/1953178965625000125

First fully open Action Reasoning Model (ARM); can ‘think’ in 3D & turn your instructions into real-world actions: [📍 Bookmark for later] A model that reasons in space, time, and motion. It breaks down your command into three steps: ✅ Grounds the scene with depth-aware https://x.com/IlirAliu_/status/1955684880110796952

GLM-4.5 is indeed an incredible AI model. https://x.com/Kilo_Code/status/1955629042205696084

I built 65 projects + websites with @lovable_dev and here’s how to get the best results: I literally lived inside this app for 2 months. 1. Start with the intention, not the function → Write what the design should feel like or achieve, not just what it is. 2. Use emotional https://x.com/patgetsit/status/1953113998506770845

i’m really surprised by *just* how common this sentiment is, one person even started to learn how to write code just to use DSPy last month :/ that said, you don’t need to do that per se, there’s a DSPy in almost every major language, e.g. @dosco maintains an excellent one in JS https://x.com/lateinteraction/status/1955419751246934187

Implemented @thorstenball’s famous agent post in < 200 LoC with Python and DSPy w/ arbitrary bash commands with human in the loop capability (no “rm -rf /” without my approval). Now polishing it (rich TUI ftw) and then writing a tutorial as well. (more ⬇️) https://x.com/rasmus1610/status/1955617801802260691
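The human-in-the-loop idea in the post above (no shell command runs without approval) can be illustrated with a small approval gate. This is a sketch under assumptions, not the linked implementation: it does not use DSPy, and `run_with_approval` and `ask_on_terminal` are hypothetical names.

```python
import subprocess
from typing import Callable, Optional


def run_with_approval(command: str, approve: Callable[[str], bool]) -> Optional[str]:
    """Run `command` in a shell only if `approve(command)` returns True.

    Returns the command's stdout, or None when the command was rejected.
    """
    if not approve(command):
        return None
    result = subprocess.run(
        command,
        shell=True,           # the agent proposes arbitrary shell strings
        capture_output=True,
        text=True,
        timeout=30,           # keep a runaway command from blocking the loop
    )
    return result.stdout


def ask_on_terminal(command: str) -> bool:
    """Interactive approver: show the command and wait for an explicit 'y'."""
    reply = input(f"Run `{command}`? [y/N] ")
    return reply.strip().lower() == "y"
```

In an agent loop, the model's proposed command would be passed as `run_with_approval(cmd, ask_on_terminal)`, so anything destructive has to get past a person first. Passing the approver as a callable also makes the gate easy to test without a terminal.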

Introducing Anything Agent that ships mobile apps & web. Designs that don’t look AI made. Everything you need built in. Live now, reply for 1 week of free credits https://x.com/anythingai/status/1953439316815786478

Launching Langbase Python SDK! Build serverless AI agents with full-stack context engineering (tools, memory, threads, parse, chunk, etc). 🤦‍♂️ Anti-framework: Zero frameworks needed. ✨ Unified layer over 600+ LLMs 📟 ZeroBloat™ https://x.com/MrAhmadAwais/status/1953486955389567335

Lovable is a team of 50+ and 200M$ and we built that alone. See how 👇🏻 https://x.com/ksaksham39/status/1953043519846924302

Meet @thedriveAI, the world’s first agentic workspace. Humans spend hours dealing with files: creating, sharing, writing, analyzing, and organizing them. The Drive AI can handle all of these operations in just a few seconds — even while you’re off-screen getting your coffee, on https://x.com/bgyankarki/status/1953510349157883958

MiniMax 150K USD AI Agent Challenge · Luma https://luma.com/2u17h1zw

minimax-ai-challenge https://minimax-agent-hackathon.space.minimax.io/

This was super fun to make! Thanks @lifeofmansoor for trusting me with this project 🙂 Built with @v0 https://x.com/robably__/status/1951874205840199932

Very interesting application of GEPA to Observable Javascript, thanks a lot @tomlarkworthy for the great blog! https://x.com/LakshyAAAgrawal/status/1955455810802421991

Vibe Minecraft: a multi-player, self-consistent, real-time world model that allows building anything and conjuring any objects. The function of tools and even the game mechanics itself can be programmed by natural language, such as “chrono-pickaxe: revert any block to a previous https://x.com/DrJimFan/status/1955293865579360299

We have used DSPy in production for over a year to build our agentic flows, structure the code around dspy.Module classes, and minimize our AI modules’ latency and inference cost. Congratulations to the #DSPy team and personally to Mr. Omar. https://x.com/JuiceSharp/status/1955460115957682444

Built this over the last couple months to manage my bookmarks on X Today I am making it available on Github as well as a Replit Remix. It connects to @X and @OpenAI to process (tag & summarize) your bookmarks, generate reports, create a graph view, etc. Links in the thread ⬇️ https://x.com/raymmar_/status/1953131094573801724

Researchers at Stanford and Carnegie Mellon analyzed over 1,000 Character AI users and 400,000 messages to gauge how AI companionship affects mental health. The study showed heavier reliance on bots for friendship or romance correlated with lower satisfaction and higher https://x.com/DeepLearningAI/status/1954226191071576552

a bit cringe but pretty proud… for the first time, I added actual code to the @lummipics codebase… went from prototype, a vibecoded design artifact… to an actual feature, built with @v0, then Claude Code, and shipped to production to my design peers… it can be done! lol https://x.com/pablostanley/status/1953111162540695589

Claude Code has a new /model option: Opus for plan mode. This setting uses Claude Opus 4.1 for plan mode and Claude Sonnet 4 for all other work—getting the best of both models while maximizing your usage. https://x.com/_catwu/status/1955694117264261609
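The routing rule described above (one model for planning, another for everything else) amounts to a one-line dispatch. The sketch below only illustrates that rule; it is not how Claude Code implements it, and the model labels and `pick_model` are placeholders rather than real API identifiers.

```python
# Placeholder labels (assumptions), not real API model identifiers.
PLAN_MODEL = "opus-4.1"
DEFAULT_MODEL = "sonnet-4"


def pick_model(mode: str) -> str:
    """Use the larger model only for plan mode; everything else gets the default."""
    return PLAN_MODEL if mode == "plan" else DEFAULT_MODEL


for mode in ("plan", "edit", "review"):
    print(mode, "->", pick_model(mode))
```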

How well can LLMs select the right MCP tool for solving real world tasks? Not good. LiveMCPBench is a new benchmark that evaluates agents on a large-scale, dynamic, and realistic set of 527 tools. It shows that most models struggle with tool retrieval and utilization leading to https://x.com/_philschmid/status/1955601309966447074

Opus 4.1 plan, Sonnet 4 execute Best model combo there is https://x.com/alexalbert__/status/1955687538129252807

RT @claudeai: Claude can now reference past chats, so you can easily pick up from where you left off. https://x.com/AnthropicAI/status/1954999404387242341

GPT-5 takes 55% more time than Sonnet 4, but is 40% cheaper on the RooCode Leaderboard Which one are you choosing? https://x.com/scaling01/status/1955669720843358502

On the big picture: GPT-5 as a model is pretty much on the same curve as the other top labs. I’d expect the usual leapfrogging between Gemini, Claude, OpenAI, & Grok to continue. Where there are some big gains is that GPT-5 seems well-trained for real world tasks in new ways. https://x.com/emollick/status/1953565365465964668

The self-driving analogy lands perfectly as the real breakthrough in autonomy wasn’t just better models, it was the ability to systematically engineer the right failure modes into training and evaluation. Simulation is the missing layer for AI agent reliability! https://x.com/apoorvapandhi/status/1956033885126468050

RT @karpathy: I’m noticing that due to (I think?) a lot of benchmarkmaxxing on long horizon tasks, LLMs are becoming a little too agentic b… https://x.com/teortaxesTex/status/1954398794604253335

RT @QodoAI: Qodo Command—our CLI AI agent—just scored 71.2 on the SWE-benchmark, high enough to put us in the top 5. It achieved this sc… https://x.com/hwchase17/status/1955110032720400464

State of torch.compile, August 2025. https://x.com/ezyang/status/1955820298907082876

Are frontier AI models really capable of “PhD-level” reasoning? To answer this question, we introduce FormulaOne, a new reasoning benchmark of expert-level Dynamic Programming problems. We have curated a benchmark consisting of three tiers, in increasing complexity, which we call https://x.com/shai_s_shwartz/status/1955968602978320727

There’s a window right now where AI agents will get built for every vertical and domain. The playbook is to go deep on the context engineering required for the vertical or particular space, figure out the right UX that ties into the existing workflows naturally, and connect to https://x.com/levie/status/1952110754276200929

To test reasoning, I got GPT-5 to create a complete launch plan for an AI app from a single idea It did competitor research, product specs, logo, pricing, GTM strategy, roadmap, and more for me If I were starting with zero business knowledge, this is an insane resource https://x.com/rowancheung/status/1953505326206013820

Today, I’m proud to announce our $3.6m seed round, and launch @TracelightAI to the world. Tracelight is the best AI agent for spreadsheet tasks. No waitlist – use Tracelight today. We’re giving away a month of our pro plan for free – comment and we will send you a code. https://x.com/peterfuller23/status/1953457581998878767

Your AI Agents can only analyze 1GB of data Enterprises have 1,000,000x as much data We fixed that… It took 3 years and building an entirely new execution environment to pull this off. TextQL lets AI agents analyze your ENTIRE enterprise data – not just the 0.0001% they can https://x.com/TheEthanDing/status/1953139460406665457

📚LangChain Academy: Deep Research One of the most popular use cases for agents is “deep research” We’ve added a new hour long course on how to build one of these Course: https://x.com/hwchase17/status/1956036358709108979

🔥 Our latest LangChain Academy course – Deep Research with LangGraph – is now live! 🔥 Deep research agents are taking off – from major AI labs to companies building their own. Research is inherently open-ended. You can’t always predict whether a question needs broad https://x.com/LangChainAI/status/1956027411302375631

Anyone can build useful AI Agents. But it requires having a solid framework to design and improve AI agents. That’s what we’ll teach in our new training on Building Effective AI Agents. Topics include context engineering, augmenting AI agents, multi-agent systems, and more. https://x.com/dair_ai/status/1955623925901353351

Is it possible to reinvent Resumés? We attempted that yesterday during a 6-hour stream. (Built via @lovable_dev ) https://x.com/steve_fau/status/1951993836802093137

🚨I BUILT ULTIMATE COPYTRADING BOT ON GPT 5 Most copytraders lose money because they copy the wrong wallets. I built a GPT 5 copytrading bot that already made me $200K Here’s how it grows my bag 🧵👇 https://x.com/onchainmilady/status/1953804379678511424

🤖 From this week’s issue: A Google Cloud blog post illustrating how to implement short-term and long-term memory for AI agents using the Agent Development Kit (ADK) and Vertex AI Memory Bank. https://x.com/dl_weekly/status/1954308710374760684

one-shotted @aisdk v5 migration with cursor cli + gpt-5, zero errors did this ssh’d into my kitchen raspberry pi from my phone you can literally cook anywhere now 👨‍🍳 https://x.com/ryolu_/status/1953847132706025926

Copilot Labs: Discover experimental AI initiatives https://copilot.microsoft.com/labs/experiments/copilot-3d

Guillermo Rauch says instead of fearing AI, they embraced it and built V0, a text-to-app tool. Inspired by using Copilot, he saw this as the next big shift in AI and acted early. Since launching V0, its growth has been massive, and it’s reshaping who can build software. Now, https://x.com/WesRothMoney/status/1953486296611238221

A compilation of experiences I made with GPT-5 in one shot. The poem camera app is particularly impressive because the model came up with all the details, like the way the photos stack in the gallery, the photo developing animation, etc https://x.com/skirano/status/1953516768317628818

GPT-5 Pro is an impressive geo-guesser. I gave it a cropped photo with metadata removed and it figured out the city. https://x.com/emollick/status/1954288373797203991

Today we’re open sourcing a “vibe coding agent” powered by GPT-5. It’s like @v0, but agnostic to framework, language, runtime. It can vibe code htmx and Haskell if you want. Built on @aisdk, Sandbox and AI Gateway. If you want to add codegen to your platform or build your own https://x.com/rauchg/status/1953539863703425336

We’ve been working closely with the @OpenAI team to integrate GPT-5 into Devin. Starting today, you can select a preview version of Devin that uses GPT-5 as part of our agent orchestration. GPT-5 eval results 👇 https://x.com/cognition/status/1953521661028053410

Codex CLI + GPT-5: https://x.com/gdb/status/1953556751762288653

Congrats to the whole OpenAI team on GPT-5, lots of work to make this level of progress given where we were just 2 years ago, and with such high expectations. https://x.com/OfficialLoganK/status/1953523549819613288

Credit where it’s due: seems like OpenAI has fixed a lot of GPT-5 issues in the last 12-24 hours, and Codex CLI works really well in auto mode Still terrible if you use in a “approve before making edits” mode, but hopefully they fix it soon🤞🏼 https://x.com/rishdotblog/status/1955318363653280185

Faster GPT-5 in Cursor: https://x.com/gdb/status/1955532973119508775

PSA: If you say “think deeply” then you get the thinking model in ChatGPT for free. If you click “ChatGPT Thinking”, it costs $20/month min to access, and you get limited usages. https://x.com/jeremyphoward/status/1954366856627978684

RT @gdb: gpt-5 is the best coding model in the world and is now the default in @cursor_ai. https://x.com/xikun_zhang_/status/1955049082772402643

RT @OpenAI: We’ve scored highly enough to achieve gold at this year’s IOI online competition with a reasoning system — placing #6 when rank… https://x.com/xikun_zhang_/status/1955049010257097080

ChatGPT-5 Pro is the first model to successfully do this non-puzzle consistently. GPT-5 Thinking and GPT-5 fail as every other model before has (except for, occasionally, Sonnet). https://x.com/emollick/status/1953604710205690212

RT @deedydas: Ridiculous that OpenAI claimed 74.9% on SWE-Bench just to prove they were above Opus 4.1’s 74.5%… By running it on 477 probl… https://x.com/akbirkhan/status/1954231799590301953

showcase of a type of hard, valuable task that gpt-5 can do where previous models struggled: https://x.com/gdb/status/1953700116365492552

figured out how to “”undo”” the RL and turn gpt-oss back into a base model will drop the weights tomorrow gn https://x.com/jxmnop/status/1955099965828526160

Artificial Analysis on X: “GPT-5 occupies both the #1 and #2 positions in our long context reasoning benchmark (AA-LCR) 🤯 AA-LCR tests long context performance through testing reasoning capabilities across multiple long documents (~100k tokens). Questions typically require considering multiple documents https://t.co/BEq9ZspRMs”
https://x.com/ArtificialAnlys/status/1953523986526351576

Bartosz Naskręcki on X: “Ok, some general comments while I am waiting for the other tasks. GPT 5 and GPT 5 Thinking are nothing very novel for research mathematicians compared to o3-pro and o4-mini-high. But GPT 5 Pro is different. It uses much more compute and the quality of the answers is superb.”
https://x.com/nasqret/status/1953566692686397885

Matthew Berman on X: “Vibe coding a full Excel clone step-by-step with GPT-5 is kinda nuts. (I know I’m still far from a full clone, but this is after about 45 minutes of work) https://t.co/upr5QSmluJ”
https://x.com/MatthewBerman/status/1954694677736956297

GPT-5 has been hovering around a 7% diff edit failure rate since its release to Cline last Thursday. How have you liked GPT-5 so far in Cline? https://x.com/cline/status/1955357460627329151

GPT-5 is live in Cline. We’ve been working with OpenAI to get this model ready, and here’s our take: it’s disciplined, persistent, & highly competent. It’s collaborative in planning and a diligent operator while acting. It plans thoroughly, asks optioned follow-ups when https://x.com/cline/status/1953525433808695319

GPT-5 is speed-running Pokemon It’s 3x faster than o3 https://x.com/scaling01/status/1955813023735828587

gpt-5 is the best coding model in the world and is now the default in @cursor_ai. https://x.com/gdb/status/1953521501548032512

GPT-5 Just Finished Pokemon Red! : r/singularity https://www.reddit.com/r/singularity/comments/1mq2irv/gpt5_just_finished_pokemon_red/

GPT-5 just finished Pokémon Red! 6,470 steps vs. 18,184 for o3! Check the stats site to compare! That’s a huge improvement! Well done, @OpenAI you cooked with GPT-5. What an incredible model. Next up: GPT-5 vs. Pokémon Crystal (16 Badges + Red). The run starts soon on Twitch. https://x.com/Clad3815/status/1955980772575268897
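The two step counts quoted above are easy to put on a common scale. The following is a small worked calculation using only the figures in the post; the variable names are illustrative.

```python
gpt5_steps = 6_470   # steps GPT-5 needed to finish Pokémon Red (from the post)
o3_steps = 18_184    # steps o3 needed (from the post)

ratio = o3_steps / gpt5_steps          # how many times more steps o3 took
reduction = 1 - gpt5_steps / o3_steps  # share of steps GPT-5 saved

print(f"o3 needed {ratio:.2f}x as many steps")  # 2.81x
print(f"GPT-5 used {reduction:.1%} fewer steps")  # 64.4%
```

If "faster" is measured in steps, this is roughly in line with the "3x faster than o3" figure quoted earlier in this roundup.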

GPT-5 now rolled out to 20% of paid users and doing >2B TPM on the API! so far so good… excellent work by the eng and infra teams! https://x.com/sama/status/1953563605733118317

gpt-5 is SOTA on FrontierMath: https://x.com/gdb/status/1953710811957858404

Cat’s out of the bag, our #1 requested feature: ⚡️ Unlimited Supabase databases, natively in Bolt Every project gets a built in DB by default- no auth, no signups, *no extra cost*. It just works! Full drop next week, reply “GIMME DB” to join the private beta 👇 https://x.com/EricSimons/status/1953249524807577903

David and I built Lovable for ad landing pages. • matches your ad campaign • connected to ad platforms • leads flow directly into your CRM This is the piece of software I’m most proud of in my entire life 🥹 https://x.com/jacintofleta/status/1953024749220557043

Introducing Agentic RAG with GPT-5! 🔥 We built a smart research agent that uses your reference links and delivers way deeper insights than your average LLM search. The code is completely open source and free! Built with @AgnoAgi , @lancedb & @streamlit ! Check it out! https://x.com/Arindam_1729/status/1953533944877793759

In short 49th to 98th percentile of performance in IOI in one year without training any specialised models. Same RL as for everything else we do https://x.com/MillionInt/status/1954977818128888311

Not all data is created equal. Scaling quality control for data that can challenge PhDs and the most advanced LLMs demands a different approach. To meet this demand, we built something new: autoraters powered by multi-agent model debate. Here’s how it works 🧵 https://x.com/scale_AI/status/1955405890288570617

RT @Zai_org: Presenting the GLM-4.5 technical report!👇 https://x.com/_lewtun/status/1955242926596035023

I’m noticing that due to (I think?) a lot of benchmarkmaxxing on long horizon tasks, LLMs are becoming a little too agentic by default, a little beyond my average use case. For example in coding, the models now tend to reason for a fairly long time, they have an inclination to https://x.com/karpathy/status/1954224651443544436

When you chat with Rufus on Amazon app, it is powered by vLLM! https://x.com/vllm_project/status/1956116150259212619

Intro Wooo! Say Wooo. Post on @LC, inside Telegram. > I came up with this idea 12 hours ago, and after tinkering with GPT-5 and Claude Code for a while, the bot is now online. I have to say, the Lens SDK is rock-solid. https://x.com/dao_leno/status/1953901099314033058

Just found out that there is an Opus Plan Mode in Claude Code. Opus 4.1 for Planning and Sonnet 4 otherwise. This makes a lot of sense. I feel like custom modes should be a thing in CC. But this is nice for now. https://x.com/omarsar0/status/1955339275806884016

piece by piece, ai is getting memory it’s also v instructive and surprisingly consistent how Anthropic solves the same class of problems vs {competitor}: – lean on long context – make it transparent/explainable – default self-drive but give user levers to take control https://x.com/swyx/status/1954990553566941399

Tip: Ask @claude_code to run your dev server in the background (Ctrl+B). Then have Claude code run integration tests against the dev server. No need to wait for users to copy-paste error traces. @claude_code continues until the integration succeeds. Builders review and give https://x.com/claude_code/status/1955210320244326460

The current state of AI for sustained work: exponential progress continues with no unexpected leaps but also no walls. (Yes, this METR measure is just one of many benchmarks, and like all benchmarks has flaws, but also has the advantage of having neither a ceiling nor a floor effect) https://x.com/emollick/status/1954180531785994670

We tested how autonomous AI agents perform on real software tasks from our recent developer productivity RCT. We found a gap between algorithmic scoring and real-world usability that may help explain why AI benchmarks feel disconnected from reality. https://x.com/METR_Evals/status/1955747420324946037

Measuring Thinking Efficiency in Reasoning Models: The Missing Benchmark https://x.com/NousResearch/status/1956090990005248341

To get a sense of GPT-5’s vibes, I exported my Tweet data over the last year and got it to write like my top posts Then took my newsletter and made it create 3 separate long-form tweets It’s not 100% there, but it beats Claude, which was previously my go-to for editing https://x.com/rowancheung/status/1953505497237029346

Our secure agentic AI platform, North, is now widely available. https://x.com/cohere/status/1953078403860709547

Check out @snowglobe_so, the new simulation engine for testing chatbots from @guardrails_ai https://x.com/goodfellow_ian/status/1956040393361121540

Introducing ❄️ @snowglobe_so, the simulation engine for AI chatbots. Magically simulate the behavior of your users to test and improve your chatbots. Find failures before your users do. https://x.com/ShreyaR/status/1956023326721368337

Synthetic Society tests your product with AI-powered user simulations. Their agents mimic real users to catch bugs, bad UX, and edge cases. Ship faster, kill manual testing, and build with confidence. https://x.com/ycombinator/status/1953154105641509348

Exclusive | Billions Flow to New Hedge Funds Focused on AI-Related Bets – WSJ https://www.wsj.com/finance/investing/billions-flow-to-new-hedge-funds-focused-on-ai-related-bets-48d97f41

New Google AI Studio landing page just dropped, @ammaar and the team are cooking 🔥 https://x.com/OfficialLoganK/status/1954264163347488887

🚀 Small upgrade to our Deep Research capabilities! 1️⃣ Smarter, more insightful reports 2️⃣ Deeper search for richer findings 3️⃣ More accurate information with less hallucination 4️⃣ Modular tools with parallel execution 5️⃣ Multi-modal input support: upload files, images Try it https://x.com/Alibaba_Qwen/status/1955295298957480298

“Copilot Mode, on the other hand, doesn’t replace your default search flow. Instead, it works alongside it.” Built for how your brain actually works – and it’s free for a limited time in @MicrosoftEdge. Access instructions below (and yes, it has GPT-5) https://x.com/mustafasuleyman/status/1955009697284850071

As AI models get commoditized, the value will be added in that final layer of orchestration. Not just routing to just one “best” model, but coordinating multiple models to combine strengths and create Chain of Debate. https://x.com/mustafasuleyman/status/1954956981330120832

Why is it the right move? Seriously? 1. Models already think more for harder problems in reasoning mode. 2. You could just always have it try to reason, then it’ll never fail you in case it needs to. 3. Any time an answer isn’t satisfactory if you didn’t have reasoning on, you https://x.com/Teknium1/status/1954519089902473436

GPT-5 for Computer-Use agents. Same tasks, same grounding model – we just swapped GPT-4o → GPT-5 as the thinking model. Left = 4o, right = 5. Watch GPT-5 pull away. 1/2 https://x.com/trycua/status/1953583236501631084

gpt-5 for immunology: https://x.com/gdb/status/1955445380310802845

gpt-5 for long context reasoning: https://x.com/gdb/status/1953747271666819380

gpt-5 for math research: https://x.com/gdb/status/1955662632771522650

gpt-5 for vibe coding whole applications: https://x.com/gdb/status/1954706670267035999

My first project at OpenAI involved teaching our models to reason and use tools by improving their competitive programming skills. Back then, GPT-4 struggled with even the simplest Codeforces problems, often oom-ing in the sandbox. It’s incredible to see that just 2.5 years https://x.com/ahelkky/status/1954973043320819907

New ChatGPT model selector. We are back to where we started 🙂 Nice to see cleaner naming though. I assume auto routing will get better over time, but for now I default to GPT-5 thinking for most queries and GPT-4.5 for writing tasks. https://x.com/bilawalsidhu/status/1955732509377089786

GPT-5 is pretty good at coding I kept adding features expecting something to break, but it just kept chugging along. I added music and sound using ElevenLabs. I *actually* enjoyed playing this game https://x.com/WesRothMoney/status/1953921754105299092

GPT-5 with high reasoning effort on SimpleBench https://x.com/scaling01/status/1953771276549358041

gpt-5: our smartest, fastest, and most useful model to date. it’s also incredible at coding. rolling out to everyone (excitingly including free ChatGPT users!) today. https://x.com/gdb/status/1953509854603358597

I had access to GPT-5. I think it is a very big deal as it is very smart & just does stuff for you Full write up in comments, but this is “make a procedural brutalist building creator where i can drag and edit buildings in cool ways” & “make it better” a bunch. I touched no code https://x.com/emollick/status/1953502029126549597

I saw a lot of people complaining about 32k context size in ChatGPT for plus users, which would be terrible for coding. But actually we are giving 196k context size for plus users when using GPT5 thinking and that’s the model you should use for coding use-cases! 32k is for the https://x.com/yanndubs/status/1955194413283737716

I suspect this is right. And I wouldn’t be surprised if the vast majority of the 700M users of ChatGPT already greatly prefer GPT-5 & that the opinion on X is not reflective of the typical experience. (Which doesn’t mean that the issues identified here aren’t very real) https://x.com/emollick/status/1954442950491902393

I used GPT-5 to leverage trade memes. From open to taking profit to setting stop losses, I followed every choice it made. Of course, I used @wasabi_protocol and decided to do this all with $troll on 3x leverage, which was freshly listed and giga sending. https://x.com/ChrisCoffeeEth/status/1954282100389281866

I’ve been using gpt-5 for a bit now. This model broke me. It is so good. I didn’t know what the price was. I assumed it would be o3-pro priced because it is that smart. Nope. Truly insane. Videos coming very soon. https://x.com/theo/status/1953507203979391011

If you have been following the GPT-5 rollout, one thing you might be noticing is how much of an attachment some people have to specific AI models. It feels different and stronger than the kinds of attachment people have had to previous kinds of technology (and so suddenly https://x.com/sama/status/1954703747495649670

Important GPT-5 PSA; if you want an answer that is maximally correct, do tell the model to think hard in your prompt. It literally will do so clearly we failed to communicate this well, apologies for that https://x.com/ericmitchellai/status/1954418339536683078

LLM meets analog. Turns out LLMs are a great brainstorming partner for synth patches. This was co-created with gpt-5 🎵 https://x.com/martin_casado/status/1953868101596192850

Let’s take a look into GPT-5’s record-setting performance on FrontierMath. How did it perform on the holdout vs. non-holdout set, how did it do across tiers, and what new Tier 4 problems did it solve? 🧵 https://x.com/EpochAIResearch/status/1955667249252978741

initial gpt-oss download stats looking exciting! https://x.com/gdb/status/1954992508964155587

i thought the transformers gpt-oss MoE finetuning was broken, how did you get it working? https://x.com/jxmnop/status/1955347764130254863

GPT-5-high is pretty good at competitive programming Just a 700 point gap in rating between GPT-5 and Gemini 2.5 Pro https://x.com/scaling01/status/1955053949637021732

my gpt-oss MFUmaxxer PR is here! ✅ cat/splice sink -> flexattn ✅ sin/cos pos embs -> complex freqs_cis ✅ moe for-loop -> grouped gemm ✅ checkpoint conversion ✅ matches huggingface fwd pass currently adding parallelism and ensuring training steps healthily ⬇️ https://x.com/khoomeik/status/1955433361402724679

tldr: Fireworks, Deepinfra, and TogetherAI are the accurate inference providers for hosting gpt-oss-120b. https://x.com/jeremyphoward/status/1955438370274087369

The first thing that is immediately noticeable about GPT-5 is the ability to code good front-end/UI GPT-5 generated this fully functioning budgeting app in one shot with ~1000 lines of code, and made it Tetris-themed It even added the sound effects https://x.com/rowancheung/status/1953502382681198610

My “Move 37” Moment with GPT-5 Today, I’m sharing one of my most remarkable experiences testing the GPT-5 Thinking and Pro models. In our lab, about 2 years ago we conducted a series of cutting-edge immunology experiments designed to manipulate the energy metabolism of T https://x.com/DeryaTR_/status/1954354352648225235

OpenAI’s o3 Crushes Grok 4 In Final, Wins Kaggle’s AI Chess Exhibition Tournament – Chess.com https://www.chess.com/news/view/kaggle-game-arena-chess-2025-day-3

when you get access to gpt-5, try a message like “use beatbot to make a sick beat to celebrate gpt-5”. it’s a nice preview of what we think this will be like as AI starts to generate its own UX and interfaces get more dynamic. it’s cool that you can interact with the https://x.com/sama/status/1953529799219319205

Overall, the general vibes of GPT-5 feel much more human-like It’s hard to measure *vibes*, but the combination of speed, lower hallucination rate, and intelligence is very noticeable As a power user, I’ve always enjoyed o3, but the speed makes it impossible for daily queries https://x.com/rowancheung/status/1953505371487600877

@SebastienBubeck Heads up, I’m fairly certain that the o3 run being compared to did not have the google search tool – which is important, since Bulbapedia gives the solution in one tool-use call to some puzzles that can take a much longer time if solved without solution info https://x.com/kiranvodrahalli/status/1956044490885751273

🎨Deep Agents UI Deep agents operate with a todo list, file system, and subagents We built a dedicated UI for running deep agents that properly highlights all of these things! Repo: https://x.com/LangChainAI/status/1955674201853247584

🥇Qwen3-Coder, try it now in Qwen-Code https://x.com/Alibaba_Qwen/status/1955436295603490864

Watching the model solve these IMO problems and achieve gold-level performance was magical. A few thoughts 🧵 https://x.com/SherylHsu02/status/1946478334013321231

RT @dorsa_rohani: New fastest shortest-path algorithm in 41 years! Tsinghua researchers broke Dijkstra’s 1984 “sorting barrier,” achieving… https://x.com/dilipkay/status/1954701721932046423

Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning “We introduce GFPO (Group Filtered Policy Optimization), which curbs this length explosion by sampling larger groups per problem during training and filtering responses to train on based on two https://x.com/iScienceLuvr/status/1955955524790575212

🖥️🤖 LangGraph CLI Connect to LangGraph Platform directly from the terminal! Featuring comprehensive management of assistants, threads, and runs with real-time streaming capabilities. Explore the CLI on GitHub 🚀 https://x.com/LangChainAI/status/1954226169412493544

RT @elonmusk: Grok wins hands-down at coding. It wasn’t close. https://x.com/Yuhu_ai_/status/1955058946861072642