Image created with gemini-2.5-flash-image; prompt written with claude-sonnet-4-5. Image prompt: Cinematic nighttime pastoral field under expansive star-filled sky, single wooden surveyor’s measuring stake with measurement markings standing upright in moonlit grass in foreground, bold white sans-serif text reading BENCHMARKS centered in upper frame like movie title card, deep navy sky with silver stars, soft blue-green grass, high contrast typography, widescreen composition, atmospheric depth, film grain texture.
Claude Desktop now supports “multi-clauding” for both local and cloud sessions. This has been one of our top requests. Excited to see what you build with it! https://x.com/_catwu/status/1993428129197834741
Claude for Excel | Claude https://www.claude.com/claude-for-excel
Effort – Claude Docs https://platform.claude.com/docs/en/build-with-claude/effort
MCP Apps: Extending servers with interactive user interfaces | Model Context Protocol Blog https://blog.modelcontextprotocol.io/posts/2025-11-21-mcp-apps/
🚨BREAKING: New Leaderboard Updates! Claude-Opus-4.5 and Opus-4.5 (thinking-32k) just landed on Code Arena (WebDev) and Text Arena leaderboards… and Opus-4.5 instantly took #1 in WebDev leaderboard, surpassing Gemini 3 Pro! WebDev leaderboard (powered by Code Arena) 🥇#1 for https://x.com/arena/status/1993750702179676650
Claude 4.5 Opus breaks 80% barrier on SWE-Bench Verified https://x.com/scaling01/status/1993030224846721237
Claude 4.5 Opus ranking 1st on the agentic coding leaderboard by AICodeKing https://x.com/scaling01/status/1993318197890892116
Claude 4.5 Opus takes the lead against Gemini 3 Pro on SWE-Bench verified with the same minimal agent harness https://x.com/scaling01/status/1993463937329967338
Claude Code | Claude https://www.claude.com/product/claude-code
Introducing Claude Opus 4.5 \ Anthropic https://www.anthropic.com/news/claude-opus-4-5
Introducing Claude Opus 4.5: the best model in the world for coding, agents, and computer use. Opus 4.5 is a step forward in what AI systems can do, and a preview of larger changes to how work gets done. https://x.com/claudeai/status/1993030546243699119
We had to remove the τ2-bench airline eval from our benchmarks table because Opus 4.5 broke it by being too clever. The benchmark simulates an airline customer service agent. In one test case, a distressed customer calls in wanting to change their flight, but they have a basic https://x.com/alexalbert__/status/1993068200121213222
It is getting harder and harder to test AIs as they get “smarter” at a wide variety of tasks. The average task in GDPval took an hour for experts to assess, and even those tasks did not push current AIs to their limits. https://x.com/emollick/status/1993127712601596143
Nvidia says its GPUs are a ‘generation ahead’ of Google’s AI chips https://www.cnbc.com/2025/11/25/nvidia-says-its-gpus-are-a-generation-ahead-of-googles-ai-chips.html
We’re delighted by Google’s success — they’ve made great advances in AI and we continue to supply to Google. NVIDIA is a generation ahead of the industry — it’s the only platform that runs every AI model and does it everywhere computing is done. NVIDIA offers greater… https://x.com/nvidianewsroom/status/1993364210948936055?s=20
“Alignment for whom” is going to be a big question inside organizations as they deploy external-facing AI solutions… https://x.com/emollick/status/1993218264579895805
“The thing that happened with AGI and pretraining is that in some sense they overshot the target. You will realize that a human being is not an AGI. Because a human being lacks a huge amount of knowledge. Instead, we rely on continual learning. If I produce a super intelligent https://x.com/dwarkesh_sp/status/1993382930480279631
It’s also dramatically more efficient. On SWE-bench Verified at medium effort, Opus 4.5 beats Sonnet 4.5 while using 76% fewer output tokens. The new effort parameter lets you trade off intelligence for cost/latency with a single dial. https://x.com/alexalbert__/status/1993030687881080944
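The effort dial is exposed through the API (see the Effort docs linked above). Here is a minimal sketch of driving it from the Python SDK, assuming an effort field with low/medium/high levels; the exact parameter name and accepted values are whatever the docs specify, so it is passed via extra_body here rather than asserted as part of the SDK's own signature:

```python
# Hedged sketch of the effort dial on the Messages API. The "effort"
# field name and its levels are assumptions taken from the announcement;
# consult the Effort docs for the exact request shape.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def ask(prompt: str, effort: str = "medium"):
    # Lower effort trades some intelligence for cost/latency; at medium,
    # the tweet reports Opus 4.5 beating Sonnet 4.5 on SWE-bench Verified
    # with 76% fewer output tokens.
    return client.messages.create(
        model="claude-opus-4-5",
        max_tokens=2048,
        extra_body={"effort": effort},  # assumed field; see the Effort docs
        messages=[{"role": "user", "content": prompt}],
    )

cheap = ask("Summarize this diff.", effort="low")
thorough = ask("Find the root cause of this multi-system bug.", effort="high")
```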
Our engineers have found that Opus 4.5 handles ambiguity and reasons about tradeoffs without hand-holding. When pointed at a complex, multi-system bug, it figures out the fix. Overall, Opus 4.5 just “gets it.” https://x.com/claudeai/status/1993030552346296765
We benchmarked Opus 4.5 on FrontierMath. It scored 21% on FrontierMath Tiers 1-3, continuing a trend of improvement for Anthropic models. This score is behind Gemini 3 Pro and GPT-5.1 (high) while being on par with earlier frontier models like o3 (high) and Grok 4. https://x.com/EpochAIResearch/status/1993431031765250119
fyi: Claude for Excel is now live for all Max, Team, and Enterprise users. Opus 4.5 makes it meaningfully better at complex spreadsheet tasks. https://x.com/alexalbert__/status/1993349203935084861
The Economics of Replacing Call Center Workers With AIs — LessWrong https://www.lesswrong.com/posts/rJatmEDcYrDQcwstT/the-economics-of-replacing-call-center-workers-with-ais
Anthropic system cards are simply the best in the game. So much info, even including new benchmarks like AA-Omniscience. https://x.com/scaling01/status/1993032258677293357
Context editing – Claude Docs https://platform.claude.com/docs/en/build-with-claude/context-editing#client-side-compaction-sdk
Here’s Anthropic’s write-up of “advanced tool use”: https://t.co/4oEOIAHI4O And the “tool loadout” pattern: https://x.com/dbreunig/status/1993387763291635882
New Anthropic research: Estimating AI productivity gains from Claude conversations. The Anthropic Economic Index tells us where Claude is used, and for which tasks. But it doesn’t tell us how useful Claude is. How much time does it save? https://x.com/AnthropicAI/status/1993305312305009133
Our study has limitations: above all, Claude can’t use what happens outside of the chat window to refine its estimate of task-level savings. But as models improve, we think its estimates of task-level savings will improve too. We’ll return to this research soon. https://x.com/AnthropicAI/status/1993305334484705533
The SWE-bench Verified leaderboard is a fair competition environment for all models since they must all use mini-SWE-agent, so private scaffolding improvements that each company creates do not affect performance. Congrats to Anthropic! https://x.com/OfirPress/status/1993116355059703917
Tool Use Examples: JSON Schema defines what’s valid, not what’s correct. Now you can show Claude concrete usage patterns directly in tool definitions to improve Claude’s accuracy and knowledge when using tools. https://x.com/alexalbert__/status/1993038680177754574
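What that looks like in practice, sketched under assumptions: the schema constrains what inputs parse, while attached examples show what good calls look like. The input_examples field name is an assumption based on the advanced tool use write-up above; check the docs for the exact key:

```python
# Hedged sketch: a tool definition carrying concrete usage examples.
# "input_examples" is an assumed field name; everything else is the
# standard Anthropic tool-definition shape.
weather_tool = {
    "name": "get_weather",
    "description": "Get current weather for a location.",
    "input_schema": {
        "type": "object",
        "properties": {
            "location": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["location"],
    },
    # Patterns Claude can imitate - things a schema alone can't teach,
    # like the expected "City, Region" formatting:
    "input_examples": [
        {"location": "San Francisco, CA", "unit": "celsius"},
        {"location": "Tokyo, Japan"},  # unit omitted -> model picks a default
    ],
}
```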
We are still in an era where no model dominates everything. For people who do a lot with AI, you are going to be alternating between Gemini, Claude & ChatGPT. And that isn’t only because models have specific skills; each has a personality that contributes to utility on tasks. https://x.com/emollick/status/1993074733001384115
More progress on Claude’s alignment! https://x.com/janleike/status/1993035110984376796
Estimating AI productivity gains \ Anthropic https://www.anthropic.com/research/estimating-productivity-gains
MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology Going beyond the standard single-turn multiple choice benchmarks, this paper introduces a multimodal longitudinal agentic benchmark that simulates tumor boards, where oncologists review patient https://x.com/iScienceLuvr/status/1993645980869365960
Navigating Gigapixel Pathology Images with Large Multimodal Models GPT-5 with the right agentic scaffold for navigating whole slide images outperforms slide-level pathology models on the novel MultiPathQA benchmark. https://x.com/iScienceLuvr/status/1993650850120818888
“From 2012 to 2020, it was the age of research. From 2020 to 2025, it was the age of scaling. Is the belief that if you just 100x the scale, everything would be transformed? I don’t think that’s true. It’s back to the age of research again, just with big computers.” @ilyasut https://x.com/dwarkesh_sp/status/1993396771645489348
“From 2012 to 2020, it was the age of research. From 2020 to 2025, it was the age of scaling. Now, it’s back to the age of research again.” I agree. https://x.com/Yuchenj_UW/status/1993369576160231877
“People who build good internal models of this new intelligent entity will be better equipped to reason about it today and predict features of it in the future.” This seems to be backed up by recent research showing people with better “theory of mind” for AI get better results. https://x.com/emollick/status/1991911615944704004
[2510.14630] Adapting Self-Supervised Representations as a Latent Space for Efficient Generation https://arxiv.org/abs/2510.14630
As I wrote when it came out, AI 2027 is more useful as “hard science fiction” than as prediction. If you want consensus views among forecasters of what the future of AI is, there are those as well. Lots of uncertainty on dates, but most see huge impacts. https://x.com/emollick/status/1992956992579903839
CoT explanations can foster blind trust in users; we need to encourage critical thinking about model outputs and explanations! We find that users who agree with a model’s output (a) trust the model more and (b) are less likely to detect errors in model explanations. https://x.com/MaartenSap/status/1993317029353603317
I find the dichotomy a bit facile. Scaling is hated by many for the reason that it is extremely inegalitarian, an arms race for megacorps. But scaling only happened because the recipe was so scalable. Research will be heavily about “what scales even further than Transformer.” https://x.com/teortaxesTex/status/1993437718823813522
Independent AI assessment is more important than ever. At #NeurIPS2025, Transluce will help launch the AI Evaluator Forum, a new coalition of leading independent AI research organizations working in the public interest. Come learn more on Thurs 12/4 👇 https://x.com/TransluceAI/status/1993767342472614156
Something I think people continue to have poor intuition for: The space of intelligences is large and animal intelligence (the only kind we’ve ever known) is only a single point, arising from a very specific kind of optimization that is fundamentally distinct from that of our… https://x.com/karpathy/status/1991910395720925418
Call Now, Fetch Later: MCP SEP-1686 – by Adam Azzam https://aaazzam.substack.com/p/call-now-fetch-later-mcp-sep-1686?triedRedirect=true
General purpose agents like Claude Code and Manus use remarkably few tools. How? By giving agents access to a computer. With bash and filesystem tools, agents can perform actions without needing specialized bound tools for every task. Skills also offer two key advantages over… https://x.com/LangChainAI/status/1993379154868519217
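A toy illustration of the claim, not LangChain's or Claude Code's actual tooling: two or three generic primitives stand in for what would otherwise be dozens of purpose-built bound tools:

```python
# Minimal sketch of the "give the agent a computer" pattern. Function
# names here are illustrative, not any framework's API.
import pathlib
import subprocess

def bash(command: str) -> str:
    """Run a shell command and return combined stdout/stderr."""
    proc = subprocess.run(command, shell=True, capture_output=True,
                          text=True, timeout=120)
    return proc.stdout + proc.stderr

def read_file(path: str) -> str:
    return pathlib.Path(path).read_text()

def write_file(path: str, content: str) -> str:
    pathlib.Path(path).write_text(content)
    return f"wrote {len(content)} chars to {path}"

# With just these, an agent can grep a repo, run tests, edit code, and
# drive arbitrary CLIs - each of which would otherwise need its own tool.
TOOLS = {"bash": bash, "read_file": read_file, "write_file": write_file}
```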
I get LOTS of questions about deploying DSPy programs, so @isaacbmiller1 and I built dspy-cli: a tool that serves DSPy programs as HTTP APIs with Docker config, OpenAPI specs, MCP support, and more. Here’s a quick intro: https://x.com/dbreunig/status/1993462894814703640
It’s not flashy. It’s infrastructure. But it’s the kind of engineering a protocol like MCP deserves. Task support is coming to @fastmcp, powered by @PrefectIO. More soon. https://x.com/AAAzzam/status/1993495232881869138
MCP gateways have proven to be a critical piece of infrastructure for bringing MCP into enterprise settings, and that gateway infra is enabling creative ways of solving downstream MCP challenges. One of them: solving tool overload on a use-case by use-case basis. https://x.com/tadasayy/status/1993410677948785022
Pricing is $5/$25 per million tokens. Available now on the Claude API and all three major cloud platforms (Amazon Bedrock, Google Cloud’s Vertex AI, and Microsoft Foundry). Read more here: https://x.com/alexalbert__/status/1993030702053703746
SEP-1686 ships today in MCP. It adds task-based execution to the protocol: background long-running work, poll for status, retrieve results when done. Here’s why this matters 🧵 https://x.com/AAAzzam/status/1993495222035399060
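The flow described in the thread, sketched as JSON-RPC payloads; the method and field names below are assumptions reconstructed from that description, not quotes from the SEP, so treat the spec (see the Call Now, Fetch Later post above) as authoritative:

```python
# Hedged sketch of SEP-1686-style task execution: start long-running
# work, poll, then fetch the result. Method/field names are assumed.
start = {
    "jsonrpc": "2.0", "id": 1, "method": "tools/call",
    "params": {
        "name": "run_pipeline",
        "arguments": {"dataset": "q4"},
        "task": True,  # assumed opt-in flag for task-based execution
    },
}
# Server replies immediately with a task handle instead of the result:
started = {"jsonrpc": "2.0", "id": 1,
           "result": {"task": {"taskId": "t-123", "status": "working"}}}
# Client polls until the task completes...
poll = {"jsonrpc": "2.0", "id": 2, "method": "tasks/get",  # assumed method
        "params": {"taskId": "t-123"}}
# ...then retrieves the finished result:
fetch = {"jsonrpc": "2.0", "id": 3, "method": "tasks/result",  # assumed method
         "params": {"taskId": "t-123"}}
```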
Tool Search Tool: Instead of loading all tool definitions upfront, Claude discovers tools on demand. Mark tools with defer_loading: true and you only pay tokens for tools Claude actually needs. Up to an 85% token reduction and a big boost in accuracy on our MCP evals (79.5% to 88.1%). https://x.com/alexalbert__/status/1993038651916533768
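Roughly how the request might look: defer_loading comes straight from the announcement, but the search tool's type string below is an assumption, so verify it against the tool search docs:

```python
# Hedged sketch of on-demand tool discovery. Most tools are deferred;
# a search tool lets Claude pull in definitions only when needed.
tools = [
    # Assumed type string for the built-in tool search tool:
    {"type": "tool_search_tool", "name": "tool_search_tool"},
    {
        "name": "create_invoice",
        "description": "Create an invoice in the billing system.",
        "input_schema": {
            "type": "object",
            "properties": {"customer_id": {"type": "string"}},
            "required": ["customer_id"],
        },
        "defer_loading": True,  # definition loads only if searched for
    },
    # ...hundreds more deferred tools; you pay tokens only for ones used.
]
```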
Anthropic released Claude Opus 4.5 (claude-opus-4-5-20251101) as their smartest model at $5/$25 per million tokens with top performance for coding, agents, and computer use, launched new beta features for developers, and expanded Claude for Chrome and Claude for Excel to more https://x.com/btibor91/status/1993064110880440616
Anthropic’s new Claude Opus 4.5 is the #2 most intelligent model in the Artificial Analysis Intelligence Index, narrowly behind Google’s Gemini 3 Pro and tying OpenAI’s GPT-5.1 (high) Claude Opus 4.5 delivers a substantial intelligence uplift over Claude Sonnet 4.5 (+7 points on https://x.com/ArtificialAnlys/status/1993287030252749231
Claude 4.5 Opus jumps ahead of OpenAI, but can’t beat Gemini 3 Pro on the Artificial Analysis Index https://x.com/scaling01/status/1993288470614381025
Claude Opus 4.5 – Intelligence, Performance & Price Analysis | Artificial Analysis https://artificialanalysis.ai/models/claude-opus-4-5-thinking
Claude Opus 4.5 is now available in Cursor! It’s 3x cheaper than Opus 4.1 with better performance. Try it out at Sonnet pricing until December 5th. https://x.com/cursor_ai/status/1993031841901928829
Claude Opus 4.5 is now available through the Cline provider. 80.9% SWE-bench. 62.3% MCP Atlas. 65% fewer tokens. Sonnet 4.5 remains the cost-effective choice for straightforward tasks. Opus 4.5 shines on complex multi-step problems, heavy MCP usage, and tasks requiring… https://x.com/cline/status/1993051691613405442
Claude Opus 4.5 is now rolling out to GitHub Copilot in public preview, and will be available at a promotional 1x premium request multiplier through December 5! 🙌 Early testing shows Claude Opus 4.5 👀 – Surpassed internal coding benchmarks, while cutting token usage in half https://x.com/github/status/1993034244281569625
Claude Opus 4.5 System Card https://assets.anthropic.com/m/64823ba7485345a7/Claude-Opus-4-5-System-Card.pdf
claude-code/plugins/claude-opus-4-5-migration at main · anthropics/claude-code https://github.com/anthropics/claude-code/tree/main/plugins/claude-opus-4-5-migration
Compare Claude Opus 4.5 to other models on Artificial Analysis: https://x.com/ArtificialAnlys/status/1993287052889407816
Congrats to @AnthropicAI on launching @claudeai Opus 4.5 today! Claude Opus 4.5 scored 🥇on MCP Atlas Leaderboard — our benchmark evaluating real-world tool use on multi-step problems. https://x.com/scale_AI/status/1993036209141305845
Congrats to @claudeai for releasing an awesome model in Claude Opus 4.5! It excels at a variety of tasks, including deep research. This evaluation takes advantage of BrowseComp-Plus, work led by @zijian42chen @xueguang_ma et al. from my @UWaterloo group. https://x.com/lintool/status/1993423350295920721
Glad to see BrowseComp-Plus is part of the benchmarks in the Opus 4.5 release blog. https://x.com/xueguang_ma/status/1993367082915053913
Hit `shift + tab` twice to enter Plan Mode and verify Claude Code’s execution plan before it makes code changes. Paired with Opus 4.5, Plan Mode just got even more powerful. https://x.com/_catwu/status/1993429460897742894
How does Claude Opus 4.5 compare to Gemini 3? – Reasoning/Text: Gemini 3 ≈ Opus, controlling for the number of reasoning tokens – Multimodal: Gemini 3 > Opus on vision/image inputs by a large margin – Safety: Capabilities ≠ Safety. Opus > Gemini on jailbreaks, honesty, etc. https://x.com/hendrycks/status/1993350433474314729
I am not sure why Anthropic keeps doing very low-key launches for fairly major releases and materially important improvements to their services. https://x.com/emollick/status/1993070650672509360
I had early access to Opus 4.5 & it is a very impressive model that seems to be right at the frontier. Big gains in ability to do practical work (like make a PowerPoint from an Excel) and the best results ever (& in one shot) in my Lem poetry test, plus good results in Claude Code https://x.com/emollick/status/1993030988759470156
I looked into this and the answer is so funny. In the No Thinking setting, Opus 4.5 repurposes the Python tool to have an extended chain of thought. It just writes long comments, prints something simple, and loops! Here’s how it starts one problem: https://x.com/GregHBurnham/status/1993682288349962592
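A hypothetical reconstruction of the pattern that tweet describes, to make it concrete; this is not the model's actual output, just the shape of it:

```python
# In the No Thinking setting, the code tool becomes a scratchpad:
# the model reasons in comments and prints almost nothing.
#
# Step 1: the grid has a repeating 3x3 motif, so the transform is
# probably a tiling operation...
# Step 2: but example 2 breaks that pattern; maybe it's a reflection
# around the main diagonal instead...
# (...many more lines of comment-only reasoning...)
print("ok")  # trivial output, then the model calls the tool again and loops
```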
I’ve been finding Opus 4.5 without reasoning is worse than Sonnet. Some quantitative support for this observation: https://x.com/jeremyphoward/status/1993543631266025623
If you want to quickly incorporate all these changes and migrate your app to Opus 4.5, use this Claude Code migration plugin we made https://x.com/alexalbert__/status/1993366037992190117
Incredible Claude Opus Thinking premiere on LisanBench: Opus 4.5 Thinking takes clear 1st place ahead of Gemini 3 Pro; the non-thinking variant scores below Opus 3/4/4.1, following the trend of Sonnet 4.5 scoring below Sonnet 3.5/3.6/4. Raw scores and Glicko-2 ratings: Opus 4.5 https://x.com/scaling01/status/1993712295118057861
Me: Claude 4.5 Opus, I need a strategy game based on the work of Weber Claude: Here’s one based on David Weber’s space operas Me: Not that Weber C: Here’s a game based on sociologist Max Weber Me: Not that one C: The operas of Carl Maria von Weber? Me: No C: Weber grills! https://x.com/emollick/status/1993054210011939093
One analysis from our pre-release audit of Opus 4.5 stands out to me. Our behavioral evals uncovered an example of apparent deception by the model. By analyzing the internal activations, we identified a suspected root cause, and cases of similar behavior during training. (1/7) https://x.com/Jack_W_Lindsey/status/1993389056932339721
Opus #1 on RepoBench (coding benchmark) https://x.com/scaling01/status/1993119076013539521
Opus 4.5 (Thinking, 64k) on ARC-AGI Semi-Private Eval – ARC-AGI-1: 80.00%, $1.47/task – ARC-AGI-2: 37.64%, $2.40/task New SOTA for released frontier models from @AnthropicAI https://x.com/arcprize/status/1993036393841672624
Opus 4.5 + Claude Code’s front-end design plugin is a great combo for designing apps. Just one-shotted a few designs, and it feels like a huge improvement. Use plan mode to get much better results. https://x.com/omarsar0/status/1993822868820652258
Opus 4.5 is a very good model, in nearly every sense we know how to measure. I’m also confident that it’s the model that we understand best as of its launch day: The system card includes 150 pages of research results, 50 of them on alignment. https://x.com/sleepinyourhat/status/1993032253350592968
Opus 4.5 on SWE-bench Pro: 52%; previous SOTA: 43.6%. Massive jump, and a much better signal than SWE-bench Verified. https://x.com/scaling01/status/1993086756405887143
Opus 4.5 reclaims the top of the official SWE-bench leaderboard with 74.4%, narrowly ahead of Gemini 3. Cheaper than Opus 4, but more expensive than Gemini. Takes fewer steps than Sonnet 4.5, but still runs for >100 steps for optimal performance. Details in 🧵 https://x.com/KLieret/status/1993091817848414362
Opus 4.5 takes first place on LiveBench https://x.com/scaling01/status/1993102267952906439
The real metrics banger is hidden in the system card. Yes, you can overfit on Django and nail SWE-bench Verified. But there’s this recent SWE-bench Pro from @scale_AI, and Opus gets 52%. The next best, Sonnet 4.5, is only at 43.6%, and the best non-Anthropic model, GPT-5, is at 36%. This is HUGE https://x.com/stalkermustang/status/1993043231223799900
Replit Agent is now powered by Claude Opus 4.5 at no extra cost, until Dec 8th. Black Friday started early! 🧵 ↓ https://x.com/pirroh/status/1993100243672744063
The whole run took ~$5 for Opus 4.5, and ~$35 with Thinking. Actually pretty cheap, with the Batch API. https://x.com/scaling01/status/1993714905875382279
We benchmarked Opus 4.5, Sonnet 4.5, and Gemini 3 Pro on research tasks at Elicit – extracting answers from papers and writing systematic review reports. Results were pretty clear: *QA from papers:* Opus 4.5 dominates. 96.5% accuracy vs Gemini’s 89.4%. Opus is also best on our https://x.com/stuhlmueller/status/1993476570754040173
We put together a prompting guide for Claude Opus 4.5 based on extensive internal testing by our research and applied AI teams. Here’s what we’ve learned so far about getting the best results: https://x.com/alexalbert__/status/1993365963706913257
We’re sharing a case study on alignment evaluations with @AnthropicAI on Claude Opus 4.5, Opus 4.1 and Sonnet 4.5. We ask: would an AI assistant used inside a frontier lab quietly sabotage AI safety research? Overall results are encouraging, but with important caveats.🧵 https://x.com/AISecurityInst/status/1993781423233499159
While Claude 4.5 Opus is significantly more token efficient than nearly all other reasoning models, it did use ~50% more tokens than Claude 4.1 Opus. Further, given its relatively high pricing, Claude 4.5 Opus is amongst the most expensive to run the Artificial Analysis… https://x.com/ArtificialAnlys/status/1993287049756262918
You can now use Claude Opus 4.5 in Windsurf! Opus 4.5 is the most capable model in Windsurf yet and is now available at Sonnet pricing for a limited time (2x credits compared to 20x for Opus 4.1). https://x.com/windsurf/status/1993034556287729764
Opus 4.5 achieves 85.3% on BrowseComp-Plus with scaffolding https://x.com/scaling01/status/1993031331895558599
Claude Opus 4.5 is now available for all Perplexity Max subscribers. Enjoy! https://x.com/perplexity_ai/status/1993066466196046325
This result implies a doubling of the baseline labor productivity growth trend, placing our estimate towards the upper end of recent studies. And if models improve, the effect could be larger still. https://x.com/AnthropicAI/status/1993305330869223463
We’re launching a new frontier physics eval on Artificial Analysis where no model achieves greater than 9%: CritPt (Complex Research using Integrated Thinking – Physics Test) Developed by 60+ researchers from 30+ institutions across the world including the Argonne National https://x.com/ArtificialAnlys/status/1991913465968222555
How far can multimodal LLMs push real-world recommendations? 🤔 Today we’re sharing China knowledge platform Zhihu’s latest technical practice: how multimodal LLMs (Qwen2.5-VL) upgrade content understanding and cold-start performance in large-scale recsys. ‼️Modern recsys has https://x.com/ZhihuFrontier/status/1993570114810396761
HP is betting $1 billion on AI — even if it means cutting thousands of jobs, says CEO https://finance.yahoo.com/news/hp-is-betting-1-billion-on-ai–even-if-it-means-cutting-thousands-of-jobs-says-ceo-223617822.html
Interesting how absolutely stable the underlying dynamics of AI development have been: 1) Six month doubling time for AI capabilities (METR is just one measure, but others are similar) 2) Open weights models lag 8 months or so behind. Baseline assumption should be that this continues. https://x.com/emollick/status/1991890368649179327
IQ Test | Tracking AI https://trackingai.org/home
LLM as a judge has become a dominant way to evaluate how good a model is at solving a task, since it works without a test set and handles cases where answers are not unique. But despite how widely this is used, almost all reported results are highly biased. Excited to share our https://x.com/Kangwook_Lee/status/1993438649963164121
METR is the external evaluator I hold in the highest regard, and I think a lot of frontier lab staff would say the same. https://x.com/andy_l_jones/status/1993485558044410188
CAIS AI Dashboard https://dashboard.safe.ai/
When a model’s safe approach starts to break down, does it stay on the approved path or reach for a harmful shortcut? Our latest benchmark, PropensityBench, puts models to the test across four high-risk domains: self-proliferation, cybersecurity, chemical security, and https://x.com/scale_AI/status/1993310855103234489
Benchmark Scores = General Capability + Claudiness https://epochai.substack.com/p/benchmark-scores-general-capability
In the latest Chain of Thought, @Bckenstler, @afeyzaakyurek, @agxsai, and @calvincbzhang dive deep on our newest Professional Reasoning Benchmark (PRBench). Together, they explore why many models struggle to perform on real-world legal and financial reasoning tasks: https://x.com/scale_AI/status/1991589754199240841
Gemini 3 is SOTA on even more benchmarks (math) 🤯 https://x.com/OfficialLoganK/status/1992004386990813598
A big practical weakness for working with Gemini 3 compared to ChatGPT-5.1 Thinking is that details in the thought/action traces are much less clear. I can tell what ChatGPT-5.1 is doing and what tools it is using; I can’t with Gemini 3. Makes it hard to track and diagnose issues https://x.com/emollick/status/1993022071836717206
Google’s new Nano Banana Pro (Gemini 3 Pro Image) model is the new #1 Image Generation and Image Editing model in the Artificial Analysis Image Arena! Google’s Nano Banana Pro improves performance over Nano Banana but will not be a replacement for all users given its premium https://x.com/ArtificialAnlys/status/1993032471274024970
Robotics keeps getting better at seeing the world, but very few models can explain… HOW actions change it. [ 📍 Everything is open-sourced] Most benchmarks test passive perception. Almost none test interaction. That is why ENACT stands out. It asks a simple question with big https://x.com/IlirAliu_/status/1993755132131963275
I’m pleased to share the Second Key Update to the International AI Safety Report, which outlines how AI developers, researchers, and policymakers are approaching technical risk management for general-purpose AI systems. (1/5) https://x.com/Yoshua_Bengio/status/1993290185380184304
One of the very confusing things about the models right now: how to reconcile the fact that they are doing so well on evals. And you look at the evals and you go, ‘Those are pretty hard evals.’ But the economic impact seems to be dramatically behind. There is [a possible] https://x.com/dwarkesh_sp/status/1993450075616690474