Anthropic: AI News Week Ending 10/25/2024

October 25, 2024

I got to play with the new Claude model that controls a mouse & keyboard last week. Full post shortly, but I had it play Paperclip Clicker (of course) and it did well over a hundred moves executing a coherent strategy without any intervention.

Agents start to come into view. pic.twitter.com/ruxXhOyErk
— Ethan Mollick (@emollick) October 22, 2024

Computer use API

We've built an API that allows Claude to perceive and interact with computer interfaces.

You feed in a screenshot to Claude, and Claude returns the next action to take on the computer (e.g. move mouse, click, type text, etc). pic.twitter.com/slISN3fVKP
— Alex Albert (@alexalbert__) October 22, 2024
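The loop described in the tweet above (screenshot in, next action out) can be sketched roughly as follows. This is a minimal illustration, not Anthropic's API: `Action`, `FakeDesktop`, `run_agent`, and `scripted_model` are hypothetical names, the model call is replaced by a scripted stand-in, and the "screenshot" is a text summary rather than pixels so the example runs without any external services.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Action:
    """One step returned by the model. Field names are hypothetical."""
    kind: str                              # "move", "click", "type", or "done"
    xy: Optional[Tuple[int, int]] = None   # target position for "move"
    text: Optional[str] = None             # text payload for "type"


class FakeDesktop:
    """Stand-in for a real machine: tracks cursor position, clicks, typed text."""

    def __init__(self) -> None:
        self.cursor: Tuple[int, int] = (0, 0)
        self.clicks: List[Tuple[int, int]] = []
        self.typed: str = ""

    def screenshot(self) -> str:
        # A real loop would capture pixels; a text summary keeps this runnable.
        return f"cursor={self.cursor} typed={self.typed!r}"

    def apply(self, action: Action) -> None:
        if action.kind == "move" and action.xy is not None:
            self.cursor = action.xy
        elif action.kind == "click":
            self.clicks.append(self.cursor)
        elif action.kind == "type" and action.text is not None:
            self.typed += action.text
        else:
            raise ValueError(f"unsupported or incomplete action: {action}")


def run_agent(desktop: FakeDesktop,
              next_action: Callable[[str], Action],
              max_steps: int = 20) -> int:
    """Repeat screenshot -> model -> action until the model reports it is done.

    Returns the number of actions applied (capped at max_steps).
    """
    for step in range(max_steps):
        action = next_action(desktop.screenshot())
        if action.kind == "done":
            return step
        desktop.apply(action)
    return max_steps


def scripted_model(script: List[Action]) -> Callable[[str], Action]:
    """Hypothetical stand-in for the model call: replays a fixed action list."""
    remaining = iter(script)

    def next_action(_screenshot: str) -> Action:
        return next(remaining, Action("done"))

    return next_action


if __name__ == "__main__":
    desktop = FakeDesktop()
    script = [Action("move", xy=(120, 40)), Action("click"), Action("type", text="hello")]
    steps = run_agent(desktop, scripted_model(script))
    print(steps, desktop.cursor, desktop.clicks, desktop.typed)
```

The `max_steps` cap is a design choice for the sketch: since the model decides when to stop, a hard limit keeps a confused agent from looping indefinitely.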

Show HN: Agent.exe, a cross-platform app to let 3.5 Sonnet control your machine | Hacker News

https://news.ycombinator.com/item?id=41926770

The most significant gains are in coding.

The new 3.5 Sonnet sets a new state-of-the-art on SWE-bench Verified with a score of 49% (using no complex scaffolding)—besting all models including reasoning models like OpenAI o1-preview and specialized models for agentic coding. pic.twitter.com/3liZqf9yY3
— Alex Albert (@alexalbert__) October 22, 2024

I can't tell you the last time I was so excited to see a new AI capability in action.

We plugged in Claude computer use in @Replit Agent as a human feedback replacement.
And… it just works! I feel it won't take long until our agent will become fully autonomous. pic.twitter.com/2rkSeL7IeW
— Michele Catasta (@pirroh) October 22, 2024

Claude 3.5 Haiku

3.5 Haiku replaces 3.0 Haiku as our fastest and least expensive model.

It outperforms many state-of-the-art models on coding tasks—including the original Claude 3.5 Sonnet and GPT-4o.

3.5 Haiku will be made available in the coming weeks. pic.twitter.com/SmMGCJGdYt
— Alex Albert (@alexalbert__) October 22, 2024

The new Claude 3.5 Sonnet has *insane* capabilities when used as a Minecraft agent.

It’s powered by a project called Mindcraft.

Running this code allows you to spawn AI bots that will follow your instructions, build, and play the game.

Here’s how to set it up in <15min. pic.twitter.com/9JZixgEhFu
— Mckay Wrigley (@mckaywrigley) October 24, 2024

🚨Anthropic just released the most amazing AI technology I've ever used

I'm not kidding

AI agents are here and you can now build your own personal army of AI's that will do work for you

Here is your demo and complete beginner's guide:

(trust me, you want to bookmark this) pic.twitter.com/MueqisKpmd
— Alex Finn (@AlexFinnX) October 22, 2024

New @AnthropicAI Computer Use feels surreal.

But don't take my word for it. We made a template on Replit for you to try.

Watch me fork the template, ask the agent to go to YouTube, find a video, and even skip the ads — all in a few minutes. pic.twitter.com/qbeJAJVz1o
— Amjad Masad (@amasad) October 22, 2024

Anthropic announces AI agents for complex tasks, racing OpenAI

https://www.cnbc.com/2024/10/22/anthropic-announces-ai-agents-for-complex-tasks-racing-openai.html

Claude 3.5 Sonnet's current ability to use computers is imperfect. Some actions that people perform effortlessly—scrolling, dragging, zooming—currently present challenges. So we encourage exploration with low-risk tasks.

We expect this to rapidly improve in the coming months.
— Anthropic (@AnthropicAI) October 22, 2024

Introducing an upgraded Claude 3.5 Sonnet, and a new model, Claude 3.5 Haiku. We’re also introducing a new capability in beta: computer use.

Developers can now direct Claude to use computers the way people do—by looking at a screen, moving a cursor, clicking, and typing text. pic.twitter.com/ZlywNPVIJP
— Anthropic (@AnthropicAI) October 22, 2024

Playing with Claude Computer Use is very worthwhile.
It's obvious that it's something that'll be used in the future, much like when you first try ChatGPT or amazing tech like AirPods. BUT, it's clear its integration will take some serious time.

Here's an example web task,… https://t.co/zO3dAIeXmy pic.twitter.com/r6JwgXUdqf
— Nathan Lambert (@natolambert) October 23, 2024

Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku \ Anthropic

https://www.anthropic.com/news/3-5-models-and-computer-use

Claude | Computer use for coding – YouTube

Claude | Computer use for automating operations – YouTube

Claude’s “computer use” beta is wild because you don’t need to make custom tools for LLMs to use — automation is about to look a lot more like screen recording a task/workflow involving any desktop apps, and asking Claude to take control and do it for you. pic.twitter.com/EaupsSgoWz
— Bilawal Sidhu (@bilawalsidhu) October 22, 2024

Anthropic's computer use can operate mobile devices including iOS, Android, and mobile browsers 📱

Here it is ordering me an Uber and posting for me on X. https://t.co/WZZUvndASw pic.twitter.com/y9A51LCMhy
— Ethan Sutin (@EthanSutin) October 23, 2024

Just got our first AI-ordered pizza with Lindy + Claude computer use 🙂 pic.twitter.com/p5vaLATx5Z
— Flo Crivello (@Altimor) October 25, 2024

Computer use (beta) – Anthropic

https://docs.anthropic.com/en/docs/build-with-claude/computer-use

We're trying something fundamentally new.

Instead of making specific tools to help Claude complete individual tasks, we're teaching it general computer skills—allowing it to use a wide range of standard tools and software programs designed for people. pic.twitter.com/42u8VeTvXd
— Anthropic (@AnthropicAI) October 22, 2024

Introducing the analysis tool in Claude.ai \ Anthropic

https://www.anthropic.com/news/analysis-tool

anthropic-quickstarts/computer-use-demo at main · anthropics/anthropic-quickstarts · GitHub

https://github.com/anthropics/anthropic-quickstarts/tree/main/computer-use-demo

One of those "AI feels like a superpower" moments.

I went to an old tweet about a diagram of city streets by entropy, and pasted the scientific paper and image into Claude and asked it to create code to replicate it.

It built the code in one shot, even replicated color scheme https://t.co/wHOEBihoIa pic.twitter.com/s4s55XQE7r
— Ethan Mollick (@emollick) October 24, 2024

Claude | Computer use for orchestrating tasks – YouTube

The ability of multimodal AI to “understand” images is underrated.

I just took these. Given the first photo Claude guesses where I am. Given the second it identifies the type of plane. These aren’t obvious. pic.twitter.com/cn0S5CV15v
— Ethan Mollick (@emollick) October 23, 2024

Anthropic computer use API + iPhone mirroring to a Mac = AI controlled phone.

Watch Claude control my phone and successfully look up stats in my Sports app.

I even got it to play a game in the Chess app against another AI – pretty crazy.

And this is the worst it’ll ever be. pic.twitter.com/35mh7T6PPs
— Mckay Wrigley (@mckaywrigley) October 23, 2024

The new Claude 3.5 Sonnet is the first frontier AI model to offer computer use in public beta.

While groundbreaking, computer use is still experimental—at times error-prone. We're releasing it early for feedback from developers. pic.twitter.com/a5SZQMKvLj
— Anthropic (@AnthropicAI) October 22, 2024

We've built an API that allows Claude to perceive and interact with computer interfaces.

This API enables Claude to translate prompts into computer commands. Developers can use it to automate repetitive tasks, conduct testing and QA, and perform open-ended research. pic.twitter.com/eK0UCGEozm
— Anthropic (@AnthropicAI) October 22, 2024

Initial explorations of Anthropic’s new Computer Use capability

https://simonwillison.net/2024/Oct/22/computer-use

Last week in AI was 🔥 Does @nvidia's Llama 3.1 fine-tune outperform @OpenAI GPT-4o and @AnthropicAI Claude 3.5? New Zyphra's Zamba2 challenges the Transformer architecture?

> NVIDIA Llama 3.1 Nemotron 70B topped Arena Hard (85.0) & AlpacaEval 2 LC (57.6)
> Zamba2 7B matched… pic.twitter.com/c3ot1Fy134
— Philipp Schmid (@_philschmid) October 21, 2024

Sabotage evaluations for frontier models \ Anthropic

https://www.anthropic.com/research/sabotage-evaluations

Claude 3.5 Opus is no longer mentioned at all on https://docs.anthropic.com/en/d… | Hacker News

https://news.ycombinator.com/item?id=41920044

Evaluating feature steering: A case study in mitigating social biases \ Anthropic

https://www.anthropic.com/research/evaluating-feature-steering

Claude 3 Model Card October Addendum.pdf

https://assets.anthropic.com/m/1cd9d098ac3e6467/original/Claude-3-Model-Card-October-Addendum.pdf

The new Sonnet also sets SOTA on aider’s more demanding refactoring benchmark with a score of 92.1%!

92% Sonnet 10/22
75% o1-preview
72% Opus
64% Sonnet 06/20
49% GPT-4o 08/06
45% o1-mini

https://twitter.com/paulgauthier/status/1848839965201076618