Image created with OpenAI GPT-Image-1. Image prompt: over-the-top 1990s pro-wrestling promo poster, press-conference podium featuring “Benchmark Bruiser” slamming clipboards marked with record scores; flashbulb frenzy, grainy print texture, vivid neon titles

ARC-AGI-3 — which scores 0% for AI and 100% for humans — is now live with an API where you can test your agent: https://x.com/scaling01/status/1946261191782797717

Can AI file your taxes? Not yet. We tested the latest frontier models and the results were full of catastrophic errors. Letting AI do your taxes would mean IRS rejections, audits, and penalties (Thread with many posts):
https://x.com/michaelrbock/status/1948039876043313509

Now that this exists, AI will be able to do your taxes very well, very soon. https://x.com/Teknium1/status/1948668301829439846

Today, we’re releasing TaxCalcBench: a first-ever benchmark dataset & eval framework for testing AI’s ability to calculate US personal income tax returns.
Tax is a secretive industry, so we’re proud to release a research paper sharing our findings:
https://arxiv.org/abs/2507.16126

BREAKING: OpenAI just launched ChatGPT Agent. It allows ChatGPT to think, plan, and execute complex tasks on its own virtual computer while you do other things. I had early access, and ChatGPT Agent built me a complete early retirement plan in 20 minutes: > Found local tax laws https://x.com/rowancheung/status/1945896543263080736

ChatGPT agent did real, revenue-generating work that used to take @mhp_guy an entire day. We’re gradually entering the age of the agentic economy — and it’s going to reshape capitalism as we know it. Traditionally, capitalism relied on two inputs: labor and capital. In the… https://x.com/xikun_zhang_/status/1948244478265016327

ChatGPT agent Does Research & Actions – YouTube https://www.youtube.com/watch?v=Ht2QW5PV-eY

ChatGPT agent for finding a great Airbnb: https://x.com/gdb/status/1946075573476069580

ChatGPT agent for working with Excel, PowerPoint, etc.: https://x.com/gdb/status/1946007318824673534

ChatGPT agent is now fully rolled out to all Plus, Pro, and Team users. Sorry about the delay! https://x.com/OpenAI/status/1948530029580939539

ChatGPT agent Makes Slideshows – YouTube https://www.youtube.com/watch?v=szJI9YJNEZk

ChatGPT agent Makes Spreadsheets – YouTube https://www.youtube.com/watch?v=JAQ4p662It8

ChatGPT agent: “create a PDF of a novel D&D adventure, add illustrations, make it super interesting and deep, add tables, etc” “Fix the formatting, build it out more” Got a 19-page PDF. Agent doesn’t do layouts well, but pulls off building a coherent adventure, hard for LLMs. https://x.com/emollick/status/1946047390118445354

ChatGPT Agent: our first AI with access to a text browser, a visual browser, and a terminal. Rolling out in ChatGPT Pro, Plus, and Team today. https://x.com/gdb/status/1945907023444660644

I am finding ChatGPT agents to be useful. They are a better fit with the “intern” analogy than any former AI – requiring oversight, still saving lots of time overall. For example, I update an AI cost/performance chart frequently. The agent did all the grunt work, with guidance. https://x.com/emollick/status/1947482417888932258

I had early access & ChatGPT agent is, I think, a big step forward for getting AIs to do real work Even at this stage, it does a good job autonomously doing research & assembling Excel files (with formulas!), PowerPoint, etc. It gives a sense of how agents are coming together https://x.com/emollick/status/1945892669575647431

In the same way ChatGPT was the first AI experience for 90% of society, ChatGPT Agents will be the first Agent experience for 90% of society. If you are reading this, you are still early. https://x.com/AtomSilverman/status/1945895569437642782

Introduction to ChatGPT agent – YouTube https://www.youtube.com/watch?v=1jn_RpbPbEc

One implication from ChatGPT agent (not a creative name, but a descriptive one – a rare naming win!) is the labs are learning that many knowledge workers live in Excel & PowerPoint. Surprised that Microsoft did not do more to push past Copilots when they had this to themselves. https://x.com/emollick/status/1945926194043424954

OpenAI launches a general purpose agent in ChatGPT | TechCrunch https://techcrunch.com/2025/07/17/openai-launches-a-general-purpose-agent-in-chatgpt/

Recursion! I gave ChatGPT Agent access to my ChatGPT by logging in and then… https://x.com/emollick/status/1947829896845127983

RT @emollick: I had early access & ChatGPT agent is, I think, a big step forward for getting AIs to do real work Even at this stage, it do… https://x.com/nickaturley/status/1945975092342841487

RT @KerenGu: We’ve activated our strongest safeguards for ChatGPT Agent. It’s the first model we’ve classified as High capability in biolo… https://x.com/sama/status/1945995659682910540

tip for chatgpt agent slides: first ask it to do the research only, then ask it to make the slides! https://x.com/isafulf/status/1946231119751545014

Today we launched a new product called ChatGPT Agent. Agent represents a new level of capability for AI systems and can accomplish some remarkable, complex tasks for you using its own computer. It combines the spirit of Deep Research and Operator, but is more powerful than that. https://x.com/sama/status/1945900345378697650

watching chatgpt agent use a computer to do complex tasks has been a real “feel the agi” moment for me; something about seeing the computer think, plan, and execute hits different. https://x.com/sama/status/1945901039104004467

When we founded OpenAI (10 years ago!!), one of our goals was to create an agent that could use a computer the same way as a human — with keyboard, mouse, and screen pixels. ChatGPT Agent is a big step towards that vision, and bringing its benefits to the world thoughtfully. https://x.com/gdb/status/1945923067403984979

You can ask ChatGPT Agent to train an AI on datasets you are interested in, and do analyses for you. Building AI and doing data analysis will be automated end-to-end in the future. You are hearing it right. We are working hard to automate our own jobs :) https://x.com/xikun_zhang_/status/1946278266786189744

“Hey Comet, join my team meetings for me, turn off the camera and keep me muted, unmute and say ‘nothing from my end, thanks’ when it’s my turn to speak, mute again, end meeting when it’s done.” How many want this? https://x.com/AravSrinivas/status/1947501358007128149

Comet can make an entire Spotify playlist and start playing it for you! https://x.com/AravSrinivas/status/1948489790036365796

Comet can use LinkedIn for you and do all your work there https://x.com/AravSrinivas/status/1948835728798220539

Comet lets you search over everything like an agent would. Even stuff that’s not easy to index. https://x.com/AravSrinivas/status/1948056269958648309

How to watch YouTube on Comet https://x.com/AravSrinivas/status/1946240617031606672

Interesting Comet use case that a user pointed out just now to me: Use Comet to order food directly from the restaurant (eg: Chipotle) instead of an aggregator delivery app. Cheaper. Friction of having to deal with random websites gone. And you still get the same meal delivered. https://x.com/AravSrinivas/status/1948818172985196862

Just so that it’s clear to a bunch of confused folks. You lose nothing you already have in ad-blocking browsers, when you come to Comet. All ad-blockers work natively. No extensions needed. Even incognito. We have all the resources needed to keep working on this. https://x.com/AravSrinivas/status/1948102473597829200

perplexity comet browser ranks above the wikipedia page of comet on google serp, ~10 days since release https://x.com/AravSrinivas/status/1947173109083332988

Perplexity in talks with phone makers to pre-install Comet AI mobile browser on devices | Reuters https://www.reuters.com/business/perplexity-talks-with-phone-makers-pre-install-comet-ai-mobile-browser-devices-2025-07-18/

RT @JoannaStern: OK, Perplexity’s Assistant in the new Comet browser is good. Really good. https://x.com/AravSrinivas/status/1948215175976497394

the % of users who switch to comet as default browser has been steadily increasing since the launch day. and there’s still so much more to do to keep increasing this number. really promising future for comet. https://x.com/AravSrinivas/status/1948794199069110519

The TAM for Comet is bigger than Perplexity because it appeals to people who don’t even want AI. Just the best core browser in the market at the end of the day. https://x.com/AravSrinivas/status/1946035102150238475

The waitlist for Comet has doubled since launching. We will begin ramping up invites to waitlisted users starting today. https://x.com/AravSrinivas/status/1947407684996894969

This is an incredible end to end deep research workflow on Comet. Makes me realize how powerful and fast deep research can be with a hybrid client-server compute architecture https://x.com/AravSrinivas/status/1946398572955766979

Underrated aspect of Comet: better memory management than Chrome https://x.com/AravSrinivas/status/1947817943934587362

we’re going to be shipping so many awesome new things on comet https://x.com/AravSrinivas/status/1948415154330415350

With the release of comet, perplexity has turned from an “ask anything” company to a “do anything” company https://x.com/AravSrinivas/status/1947175881203683577

🚀 Introducing Qwen3-MT – our most powerful translation model yet! Trained on trillions of multilingual tokens, it supports 92+ languages—covering 95%+ of the world’s population. 🌍✨ 🔑 Why Qwen3-MT? ✅ Top-tier translation quality ✅ Customizable: terminology control, domain https://x.com/Alibaba_Qwen/status/1948406830688018471

🚀 We’re excited to introduce Qwen3-235B-A22B-Thinking-2507 — our most advanced reasoning model yet! Over the past 3 months, we’ve significantly scaled and enhanced the thinking capability of Qwen3, achieving: ✅ Improved performance in logical reasoning, math, science & coding https://x.com/Alibaba_Qwen/status/1948688466386280706

Less than two weeks after Kimi K2’s release, @Alibaba_Qwen’s new Qwen3-Coder surpasses it with half the size and double the context window. Despite a significant initial lead, open source models are catching up to closed source and seem to be reaching escape velocity. https://x.com/cline/status/1948072664075223319

Qwen COOKED – beats Kimi K2 and competitive to Claude Opus 4 at 25% total parameters 🤯 https://x.com/reach_vb/status/1947357343101960424

Qwen3-235B-A22B scored 41% on ARC-AGI-1 without thinking! That’s the same level as Gemini 2.5 Pro, Sonnet 4 or o3-low with thinking. But it might be trained on it; if not, then it’s insane. https://x.com/scaling01/status/1947351789222711455

RT @itsPaulAi: Wait so Alibaba Qwen has just released ANOTHER model?? Qwen3-Coder is simply one of the best coding models we’ve ever seen.… https://x.com/ClementDelangue/status/1947775783067603188

RT @lmstudio: Qwen/Qwen3-Coder with tool calling is supported in LM Studio 0.3.20, out now. 480B parameters, 35B active. Requires about 25… https://x.com/huybery/status/1948327670493970534

The new Qwen3 update takes back the benchmark crown from Kimi 2. Some highlights of how Qwen3 235B-A22B differs from Kimi 2: – 4.25x smaller overall but has more layers (transformer blocks); 235B vs 1 trillion – 1.5x fewer active parameters (22B vs. 32B) – much fewer experts in https://x.com/rasbt/status/1947393814496190712

The updated Qwen3-235B-A22B is now the best non-reasoning model, period. It beats Kimi-K2, Claude-4 Opus and DeepSeek V3 on multiple benchmarks like GPQA, AIME, ARC-AGI, LiveCodeBench or BFCLv3, just to name a few. https://x.com/scaling01/status/1947350866840748521

So to recap: – Yesterday, frontier closed model equivalent reasoning model from Qwen, – This morning, frontier closed model equivalent reasoning vision capabilities from stepfun – sometime today(?) a frontier video model from wan? All open source What is America doing? https://x.com/Teknium1/status/1948744914876920039

@OriolVinyalsML Impressive result, but let’s be clear, the Gemini model got heavy IMO-specific prep, curated solutions, hints, and strategy guides. That’s not general reasoning. OpenAI’s model hit IMO gold with zero task-specific tuning. One is coached, the other is capable. https://x.com/VraserX/status/1947368827253076001

@pli_cachete For OpenAI at least for this IMO competition: – No tool use, no calculators, internet, formal proof software, algebra packages – same time limits – the same input to the question as for students; no rewriting it to another more suitable format – only one submission https://x.com/BorisMPower/status/1946859525270859955

🤖 From this week’s issue: Gemini with Deep Think officially achieved gold-medal standard at the International Mathematical Olympiad (IMO) by solving five out of the six IMO problems. https://x.com/dl_weekly/status/1948105084480397503

1/N I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s most prestigious math competition—the International Math Olympiad (IMO). https://x.com/alexwei_/status/1946477742855532918

10. My career as a mathematician certainly isn’t threatened by AI; in fact, I hope to leverage AI to accelerate my work. However, I’m unsure whether “mathematician” will remain a career path for my son’s generation. (10/10) https://x.com/ErnestRyu/status/1946700798001574202

4. OpenAI surely knew GDM was working on the IMO, so they beat GDM to the punch with their Saturday morning announcement, generating hype. GDM’s slow-science scholarship cost them the PR battle. (4/10) https://x.com/ErnestRyu/status/1946699212307259659

5. In my experience using LLMs for math research, Gemini outperforms ChatGPT. We will see if the next-gen models (which seem to be what OpenAI and GDM are using for IMO) perform at research-level math. (5/10) https://x.com/ErnestRyu/status/1946699302308635130

Advanced version of Gemini Deep Think (announced at #GoogleIO) using parallel inference time computation achieved gold-medal performance at IMO, solving 5/6 problems with rigorous proofs as verified by official IMO judges! Congrats to all involved! https://x.com/koraykv/status/1947335096740049112

Advanced version of Gemini with Deep Think officially achieves gold-medal standard at the International Mathematical Olympiad – Google DeepMind https://deepmind.google/discover/blog/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad/

An advanced version of Gemini with Deep Think has officially achieved gold medal-level performance at the International Mathematical Olympiad. 🥇 It solved 5️⃣ out of 6️⃣ exceptionally difficult problems, involving algebra, combinatorics, geometry and number theory. Here’s how 🧵 https://x.com/GoogleDeepMind/status/1947333836594946337

As confirmed by the new IMO rankings, Grok 4’s eye-popping benchmarks were driven by the following innovations: – train on test – train on test – train on test https://x.com/nsaphra/status/1946804513114882227

DeepMind has the best research on using AI to solve hard Math: AlphaEvolve AlphaProof AlphaGeometry FunSearch AlphaDev AlphaTensor AlphaCode Despite making IMO Silver 28/42 in ’24, OpenAI announced Gold in ’25 35/42 before them Here’s DeepMind’s 10 best research papers on https://x.com/deedydas/status/1946987560875766212

Drastic progress on maths with Gemini 2.5! As a math undergrad, I am impressed 🤯 🥈 -> 🥇 ✅ Formal -> Informal ✅ Specialized model -> General model ✅ Available soon ✅ Huge thanks to IMO and congrats to all participants! Blog: https://x.com/OriolVinyalsML/status/1947341047547199802

Gary Marcus strikes again: “No pure LLM is anywhere near getting a silver medal in a math olympiad” “Pure deep learning had a good run, but it’s time to move on” 😂😂😂 https://x.com/scaling01/status/1946530148813025544

Gemini solved the math problems end-to-end in natural language (English). https://x.com/denny_zhou/status/1947360696590839976

Gold medal-level performance on the 2025 International Math Olympiad from our latest experimental reasoning LLM. Model operated in natural language (i.e. outputs natural language proofs) under the same rules as humans (e.g. 4.5 hours per session, no tools). Amazing milestone! https://x.com/gdb/status/1946479692485431465

Had a super fun time training this model. A big yolo run that resulted in a super strong model. Most important thing is to trust your model and give it moral support. 🦾 Was also a big eye opener to see how prep for IMO is done. Before this I knew absolutely zero about this https://x.com/YiTayML/status/1948464752545726886

hippo at IMO: 0/42 model trained by hippo: 35/42 🥇 😂😂😂 https://x.com/agihippo/status/1947348097144611123

IMO 2025 Solutions https://storage.googleapis.com/deepmind-media/gemini/IMO_2025.pdf

It wasn’t just OpenAI. Google also used a general purpose model to solve the very hard math problems of the International Math Olympiad in plain language. Last year they used specialized tool use. Increasing evidence of the ability of LLMs to generalize to novel problem solving. https://x.com/emollick/status/1947356382581137867

It’s hard to overstate the significance of this. It may end up looking like a “moon‑landing moment” for AI.
Just to spell it out as clearly as possible: a next-word prediction machine (because that’s really what it is here, no tools no nothing) just produced genuinely creative proofs for hard, novel math problems at a level reached only by an elite handful of pre‑college prodigies. https://x.com/SebastienBubeck/status/1946577650405056722

MathArena – IMO Blogpost https://matharena.ai/imo/

maybe a better headline would be that oai and gdm ranked 27 at the IMO. some talented kids here! https://x.com/damekdavis/status/1947357679040569520

Not Even Bronze: Evaluating LLMs on 2025 International Math Olympiad 🥉 https://x.com/hardmaru/status/1946942279807308210

Officially validated IMO gold medal, purely via search in token space, achieved in 4.5 hrs (unclear at what compute cost). The solutions read nicely as well https://x.com/fchollet/status/1947337944215523567

On IMO P6 (without going into too much detail about our setup), the model “knew” it didn’t have a correct solution. The model knowing when it didn’t know was one of the early signs of life that made us excited about the underlying research direction! https://x.com/alexwei_/status/1947461238512095718

One piece of info that seems important to me in terms of forecasting usefulness of new AI models for mathematics: did the gold-medal-winning models, which did not solve IMO problem 6, submit incorrect answers for it? https://x.com/littmath/status/1947398065209462981

Other AI models seem to have made big leaps in the International Math Olympiad, not just OpenAI. Not all announcements seem to be out yet. https://x.com/emollick/status/1947053944192082170

Our IMO gold model is not just an “experimental reasoning” model. It is way more general purpose than anyone would have expected. This general deep think model is going to be shipped so stay tuned! 🔥 https://x.com/YiTayML/status/1947350087941951596

P6 was definitely the hardest and most interesting problem. Most people can understand it, but very few can solve it. All models scored 0/7. https://x.com/deedydas/status/1946250774960537927

Right before #imo2025, together with colleagues from Mountain View, NYC, Singapore, etc, we all gathered at @GoogleDeepMind headquarter in London for our final push for IMO. I believe that week was when all magic happened! We put all individual recipes (that we figured out https://x.com/lmthang/status/1948458590492393834

RT @demishassabis: Btw as an aside, we didn’t announce on Friday because we respected the IMO Board’s original request that all AI labs sha… https://x.com/TheZachMueller/status/1947419062423982583

RT @demishassabis: Official results are in – Gemini achieved gold-medal level in the International Mathematical Olympiad! 🏆 An advanced ver… https://x.com/AndrewLampinen/status/1947370582393425931

RT @Mihonarium: 🚨 According to a friend, the IMO asked AI companies not to steal the spotlight from kids and to wait a week after the closi… https://x.com/AndrewLampinen/status/1947072974621982839

RT @ns123abc: Bruh… people already reproduced Google’s IMO results without RL with just prompting openai researchoors think they have the… https://x.com/_philschmid/status/1948304855837085717

RT @polynoamial: Today, we at @OpenAI achieved a milestone that many considered years away: gold medal-level performance on the 2025 IMO wi… https://x.com/kchonyc/status/1946526143433015349

The hardest high school math exam in the world, the 6 problem 9 hour IMO 2025, was this week. AI models performed poorly. Gemini 2.5 Pro scored the highest, just 13/42, costing $431.97, in a best of 32 eval. Bronze cutoff was 19. Long way to go for AI to solve hard Math. https://x.com/deedydas/status/1946244012278722616

The two cents: 1. The OpenAI IMO solutions to P1-P5 seem to be correct. 2. P6 is a significantly novel and more difficult problem. P1-P5 are arguably within reach of “standard” IMO problem-solving techniques, but P6 requires creativity. (2/10) https://x.com/ErnestRyu/status/1946698896375492746

There are always a flood of posts about what AI can or cannot do, so it is worth pausing and paying attention to this one. It is a very hard test, done without tools. It was also viewed as an unlikely goal. Prediction markets had put the chance of this happening this year at 20%. https://x.com/emollick/status/1946563737604743386

This wins my respect. https://x.com/Yuchenj_UW/status/1947339774257402217

Tough look for OpenAI. They’ve pissed off the international math community by jumping the gun; meanwhile @GoogleDeepMind has an officially-confirmed result that will be available commercially months earlier. https://x.com/mathemagic1an/status/1947352370037305643

Two cents on AI getting International Math Olympiad (IMO) Gold, from a mathematician. Background: Last year, Google DeepMind (GDM) got Silver in IMO 2024. This year, OpenAI solved problems P1-P5 for IMO 2025 (but not P6), and this performance corresponds to Gold. (1/10) https://x.com/ErnestRyu/status/1946698766305968446

we achieved gold medal level performance on the 2025 IMO competition with a general-purpose reasoning system! to emphasize, this is an LLM doing math and not a specific formal math system; it is part of our main push towards general intelligence. when we first started openai,… https://x.com/sama/status/1946569252296929727

We might be heading into a plot twist in the OpenAI vs. DeepMind IMO saga. Just saw a post from Joseph Myers (involved in the Math Olympiad since 1992): the IMO committee reportedly asked AI labs not to publish results until 7 days after the closing ceremony — out of respect for https://x.com/zjasper666/status/1947013036382068971

Why am I excited about IMO results we just published: – we did very little IMO-specific work, we just keep training general models – all natural language proofs – no evaluation harness We needed a new research breakthrough and @alexwei_ and team delivered. https://x.com/millionint/status/1946551400365994077

Baidu strikes deal to bring its driverless cars to Uber globally https://www.cnbc.com/2025/07/15/baidu-strikes-deal-to-bring-its-driverless-cars-to-uber-globally.html?__source=sharebar%7Ctwitter&par=sharebar

Uber to invest hundreds of millions of dollars in Lucid and Nuro in massive robotaxi deal | The Verge https://www.theverge.com/news/708479/uber-lucid-nuro-robotaxi-deal-investment

Today, we’re announcing a preview of ARC-AGI-3, the Interactive Reasoning Benchmark with the widest gap between easy for humans and hard for AI
We’re releasing:
* 3 games (environments)
* $10K agent contest
* AI agents API
Starting scores – Frontier AI: 0%, Humans: 100%
https://docs.arcprize.org/

OpenAI says ChatGPT users send over 2.5 billion prompts every day | The Verge https://www.theverge.com/news/710867/openai-chatgpt-daily-prompts-2-billion

timescope: testing if large models understand long videos or they just claim to do so 🤠 they randomly insert needles (short videos/static images) in long videos and ask questions about the needle itself 🤯 Gemini seems to be the best! very cool work by @orr_zohar et al 👏 https://x.com/mervenoyann/status/1948049876228452788

Perplexity Comet vs ChatGPT Agent https://x.com/AravSrinivas/status/1946076236683624616

Agentar‑Fin‑R1 shows that a 32B‑parameter finance‑tuned model can outscore much bigger general systems on Fineva, FinEval, FinanceIQ, and Finova. Today’s finance AI still misses strong reasoning and safety checks, so this paper builds a fresh pipeline to fix both. It starts by https://x.com/rohanpaul_ai/status/1948382668372193631

An example of the power & limitations of ChatGPT agent I asked it to analyze a dataset from Kaggle, and turn it into a PPT and Excel. It made no errors, but I thought some of the data was odd. I gave that feedback & the AI figured out the data was bad and why. Human + AI needed https://x.com/emollick/status/1945944153554104379

ChatGPT agent for investment banking: https://x.com/gdb/status/1946074958238765503

these results were eye-opening for me… chatgpt agent performed better than i expected on some pretty realistic investment banking tasks https://x.com/tejalpatwardhan/status/1945894313977860203

Natural language powered Stock Screener on Perplexity Finance. https://x.com/AravSrinivas/status/1948812710952796576

Intelligence isn’t a collection of skills. It’s the efficiency with which you acquire and deploy new skills. It’s an efficiency ratio. And that’s why benchmark scores can be very misleading about the actual intelligence of AI systems. https://x.com/fchollet/status/1946668452045029861

✅ Try out @Alibaba_Qwen 3 Coder on vLLM nightly with “qwen3_coder” tool call parser! Additionally, vLLM offers expert parallelism so you can run this model in flexible configurations where it fits. https://x.com/vllm_project/status/1947780382847603053
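For context, a minimal sketch of what serving this on vLLM with the tool-call parser might look like. The exact model ID, parallelism setting, and flag spellings are assumptions — check the vLLM docs for your version before relying on them:

```shell
# Sketch: serve Qwen3-Coder with tool calling enabled on vLLM.
# ASSUMPTIONS: model ID and --tensor-parallel-size are illustrative;
# the qwen3_coder parser name comes from the post above.
vllm serve Qwen/Qwen3-Coder-480B-A35B-Instruct \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --tensor-parallel-size 8
```

Once the server is up, any OpenAI-compatible client can pass a `tools` list in its chat request and the parser turns the model’s output into structured tool calls.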

🚨 Model Update: Qwen3-coder is in the WebDev Arena! @Alibaba_Qwen have released their best coding model to date and it’s now live in WebDev Arena awaiting your hardest prompts for real world testing. Prompt: “style a basic login form using Tailwind CSS with dark mode https://x.com/lmarena_ai/status/1948399802947084347

Another incredible OSS model release this summer: the new Qwen 3 update is now live on @togethercompute API. https://x.com/vipulved/status/1947871449282216055

Bye Qwen3-235B-A22B, hello Qwen3-235B-A22B-2507! After talking with the community and thinking it through, we decided to stop using hybrid thinking mode. Instead, we’ll train Instruct and Thinking models separately so we can get the best quality possible. Today, we’re releasing https://x.com/Alibaba_Qwen/status/1947344511988076547

Cerebras launches Qwen3-235B, world’s fastest frontier AI model with full 131K context support | Cerebras https://www.cerebras.ai/press-release/cerebras-launches-qwen3-235b-world-s-fastest-frontier-ai-model-with-full-131k-context-support

Did a benchmark with the new Qwen3 Reasoner 220B on Arena-Hard v1. It scores an 89% winrate over gpt4-0314; 4o scores an 81%. Don’t have numbers for o3/4o-mini etc., but it’s basically saturated, a near-perfect win rate. Nice. https://x.com/Teknium1/status/1948836009183224132

Open source models 📈 qwen3-coder is available in Cline https://x.com/cline/status/1948452627278430376

Please note, we’re not able to reproduce the 41.8% ARC-AGI-1 score claimed by the latest Qwen 3 release — neither on the public eval set nor on the semi-private set. The numbers we’re seeing are in line with other recent base models. In general, only rely on scores verified by… https://x.com/fchollet/status/1947821353358483547

Qwen just released a 480B coding model & a space to try it out for web dev. Fun! Model: https://x.com/ClementDelangue/status/1947780025886855171

Qwen-MT: Where Speed Meets Smart Translation | Qwen https://qwenlm.github.io/blog/qwen-mt/

RT @Alibaba_Qwen: Performance of Qwen3-Coder-480B-A35B-Instruct on SWE-bench Verified! https://x.com/QuixiAI/status/1947773200953217326

RT @cline: Qwen3-Coder is now available in Cline 🧵 New 480B parameter model with 35B active parameters. > 256K context window > comparabl… https://x.com/Alibaba_Qwen/status/1947954292738105359

RT @GregKamradt: Anyone have a connection at @Alibaba_Qwen? Trying to reproduce the results on @arcprize and getting different metrics Wa… https://x.com/clefourrier/status/1947994251410682198

RT @OpenRouterAI: 🟣New: Qwen3-Coder by @Alibaba_Qwen – 480B params (35B active) – Native 256K context length, extrapolates to 1M – Outperf… https://x.com/huybery/status/1947808085504102487

RT @UnslothAI: @Alibaba_Qwen Congrats guys on another epic release! We’re uploading Dynamic GGUFs, and one with 1M context length so you gu… https://x.com/QuixiAI/status/1947773516368994320

RT @WolframRvnwlf: I’m now using Qwen3-Coder in Claude Code. Works with any model actually, but this is surely the best one currently. The… https://x.com/huybery/status/1948184493631959536

We’ve updated Qwen3 and made excellent progress. The non‑reasoning model now delivers significant improvements across a wide range of tasks and many of its capabilities already rival those of reasoning models. It’s truly remarkable, and we hope you enjoy it! https://x.com/huybery/status/1947345040470380614

Wow the new qwen reasoner at only 232B params is as good as the top closed frontier lab models. Big day for OS. https://x.com/Teknium1/status/1948711699013665275

NVIDIA’s Canary-Qwen-2.5B takes 1st place on the @HuggingFace leaderboard for automatic speech recognition – lowest word error rate (WER) ever recorded on the Hugging Face OpenASR leaderboard: 5.63%. – it’s the first speech model built on top of an existing LLM. – At its core, it https://x.com/rohanpaul_ai/status/1946823138932863210

AudioRAG is becoming real! Just built a demo with ColQwen-Omni that does semantic search on raw audio, no transcription needed. Drop in a podcast, ask your question, and it finds the exact chunks where it happens. You can also get a written answer. What’s exciting: it skips https://x.com/fdaudens/status/1946226098905169967

so many open LLMs and image LoRAs dropped past week, here’s some picks for you 🫡 LLMs > ByteDance released a bunch of translation models called Seed-X-RM (7B) > NVIDIA released reasoning models of which 32B surpassing the giant Qwen3-235B with cc-by-4.0 license 👏 > LG released https://x.com/mervenoyann/status/1948018642462933149

Looking at the HuggingFace configs, this is a wider/shallower model compared to Qwen3. – 62 layers vs 94 – dim 6144 vs 4096 – 160 experts vs 128 – 96 attn heads vs 64 Curious why the architectural change? Qwen3.5? https://x.com/nrehiew_/status/1947770826943549732

RT @SIGKITTEN: qwen3-coder, running locally I had it set up testing infra using minunit and gcov and write some tests on a small ~5000 lo… https://x.com/huybery/status/1948184517673644466

missed this, @NVIDIAAIDev silently dropped Open Reasoning Nemotron models (1.5-32B), SoTA on LiveCodeBench, CC-BY 4.0 licensed 🔥 > 32B competing with Qwen3 235B and DeepSeek R1 > Available across 1.5B, 7B, 14B and 32B sizes > Supports up to 64K output tokens > Utilises GenSelect https://x.com/reach_vb/status/1947331118983696907

RT @reach_vb: Lets GOOO! @NVIDIAAIDev just dropped Canary Qwen 2.5 – SoTA on Open ASR Leaderboard, CC-BY licensed 🔥 > Works in both ASR an… https://x.com/reach_vb/status/1946087224346313175

Now it’s possible to do RAG with any-to-any models 🔥 Learn how to search in a video dataset and generate using OmniEmbed, an all-modality retriever, and Qwen2.5-Omni, an any-to-any model, in this notebook 🤝 https://x.com/mervenoyann/status/1947285360926494911

now AI can write novel proofs at the level of a world-class competitive mathematician, but it still can’t reliably book me a weekend trip to Boston. so strange https://x.com/jxmnop/status/1946675650686746879

This past week, Harmonic had the opportunity to represent our advanced mathematical reasoning model, Aristotle, at the International Mathematics Olympiad – the most prestigious mathematics competition in the world. To uphold the sanctity of the student competition, the IMO Board… https://x.com/HarmonicMath/status/1947023450578763991

Yes, there is an official marking guideline from the IMO organizers which is not available externally. Without the evaluation based on that guideline, no medal claim can be made. With one point deducted, it is a Silver, not Gold. https://x.com/lmthang/status/1946960256439058844

We now have audited data on water consumption for AI. Over the 18-month lifespan of Mistral Large 2, a 128B model, all water usage (including chats, training, hardware & data centers) took as much water as 678 US households use yearly. Each additional query is 45 mL. (Fixed) https://x.com/emollick/status/1947782699948675528
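The per-query figure is easy to scale up as a back-of-envelope check. This sketch assumes only the tweet's number (45 mL of water per additional query); the function name and query counts are illustrative.

```python
# Marginal water cost per query, per the audited Mistral figure quoted above.
ML_PER_QUERY = 45

def marginal_water_liters(num_queries: int, ml_per_query: float = ML_PER_QUERY) -> float:
    """Total marginal water use in liters for a given number of queries."""
    return num_queries * ml_per_query / 1000  # 1000 mL per liter

# One million queries at 45 mL each works out to 45,000 liters.
print(marginal_water_liters(1_000_000))  # 45000.0
```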

NEW: Higgs Audio V2 from @boson_ai open, unified TTS model w/ voice cloning that beats GPT-4o-mini-tts and ElevenLabs v2 🔥 > Trained on 10M hours (speech, music, events) > Built on top of Llama 3.2 3B > Works real-time and on edge > Beats GPT-4o-mini-tts, ElevenLabs v2 in prosody https://x.com/reach_vb/status/1947997596456272203

It’s pretty funny that the Turing Test used to be a very big deal a couple of years ago and now it isn’t. (I know retrospectively we all know how flawed it was, but for decades it was The Test, and the only way to beat it was through limited interaction & trickery, e.g. Eugene Goostman) https://x.com/emollick/status/1946791395894714758

ReasonVQA drops a giant 4.2M‑question stress‑test on Visual Question Answering, forcing models to pull facts from Wikidata and link them to what they actually see. The authors wrap up by stressing that ReasonVQA is a low‑cost, 4.2M‑question benchmark that forces models to pull… https://x.com/rohanpaul_ai/status/1948319999636148226

Yo we made it to #1 yall, thanks for checkin out the dataset https://x.com/Teknium1/status/1946824832764785135

GPT‑4 Turbo grades code summaries almost like humans yet flags only 50% of faulty functions. The study asks whether models can replace fragile test suites and BLEU scores for everyday evaluation. Researchers checked 374 Java and Python tasks where 8 LLMs wrote or reviewed code… https://x.com/rohanpaul_ai/status/1948679870328045968

RT @sdrzn: Seriously blown away by Moonshot’s new Kimi K2 model in @cline. It beats Claude Opus 4 on coding benchmarks and is up to 90% che… https://x.com/ClementDelangue/status/1946316382313869778

The Tiny Teams Playbook – by Shawn swyx Wang – Latent.Space https://www.latent.space/p/tiny

Incredible results! Open source is winning. https://x.com/AravSrinivas/status/1947810865685925906

🚨 BIG NEWS 🚨 Search Arena is live with 7 top models with search capabilities ready for testing. Be sure to have the “Search” modality selected in the chat box, and get testing. 🌐 @xAi: Grok 4 @anthropic: Claude Opus 4 @perplexity: Sonar Pro High & Reasoning Pro High https://x.com/lmarena_ai/status/1948053410139541626

Matching training data exactly to the evaluation tasks lets language models hit the same scores with roughly half the compute needed before. Today most teams still filter web text by fuzzy ideas of quality instead of task relevance. Apple researchers test this idea with a… https://x.com/rohanpaul_ai/status/1946887816106967066

Shrek inspired, multi-person generation (with voice cloning) – this is possible now with a *single* TTS model! https://x.com/reach_vb/status/1948012058630303857

RT @andrewwhite01: HLE has recently become the benchmark to beat for frontier agents. We @FutureHouseSF took a closer look at the chem and… https://x.com/_arohan_/status/1948154873217994971

Lovable becomes a unicorn with $200M Series A just 8 months after launch | TechCrunch https://techcrunch.com/2025/07/17/lovable-becomes-a-unicorn-with-200m-series-a-just-8-months-after-launch/

How well do LLMs reason across languages? Introducing MultiNRC, our latest SEAL Leaderboard addition built to test native multilingual reasoning. ⬇️ https://x.com/scale_AI/status/1948073498791801074

From GPT to MoE: I reviewed & compared the main LLMs of 2025 in terms of their architectural design, from DeepSeek-V3 to Kimi K2. Multi-head Latent Attention, sliding window attention, new Post- & Pre-Norm placements, NoPE, shared-expert MoEs, and more… https://x.com/rasbt/status/1946549778319339931

The Big LLM Architecture Comparison https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison
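Of the design choices the article surveys, sliding-window attention is the easiest to show in a few lines: each query position attends only to a fixed window of recent key positions instead of the full causal prefix. This is a minimal sketch of the mask, not any specific model's implementation; the window size is illustrative.

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """True where query position i may attend to key position j.

    Causal (j <= i) and restricted to the last `window` positions.
    """
    i = np.arange(seq_len)[:, None]  # query positions, column vector
    j = np.arange(seq_len)[None, :]  # key positions, row vector
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(seq_len=6, window=3)
print(mask.astype(int))  # each row has at most 3 ones, hugging the diagonal
```

The payoff is that attention cost per token becomes O(window) instead of O(sequence length), at the price of no direct long-range attention within a layer.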

🖼️ Bytedance’s Seedream 3.0 is in the Arena! Tied at #4 with top Text-to-Image models: – Flux 1 Kontext Max & Pro @bfl_ml – Imagen 3.0 Generate @googledeepmind Let’s see how it does with a few use cases 👇🧵 https://x.com/lmarena_ai/status/1947341726534013391

RT @pashmerepat: I’d like to point out that for the real world tasks (not benchmarks), Kimi K2 outperforms Gemini. This is telemetry acro… https://x.com/cline/status/1946389822043504745

This is the dumbest open source models will ever be. https://x.com/reach_vb/status/1947364340799283539
