Update (Dec 15, 2024): Apple has since published developer documentation titled “Making onscreen content available to Siri and Apple Intelligence”:
https://developer.apple.com/documentation/appintents/making-onscreen-content-available-to-siri-and-apple-intelligence
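
If you’re curious what that looks like in practice, here’s a minimal sketch of how I read that doc: describe the onscreen thing as an AppEntity, make it Transferable so the system can read its content, and tag the view’s NSUserActivity with the entity’s identifier. The BookEntity/ReaderView names are my own stand-ins, and the exact EntityIdentifier initializer is my best reading of the docs, not verified code.

```swift
import AppIntents
import SwiftUI

// Hypothetical example: BookEntity, BookQuery, and ReaderView are my own names.
struct BookEntity: AppEntity, Transferable {
    static let typeDisplayRepresentation: TypeDisplayRepresentation = "Book"
    static let defaultQuery = BookQuery()

    var id: String
    var title: String
    var fullText: String

    var displayRepresentation: DisplayRepresentation {
        DisplayRepresentation(title: "\(title)")
    }

    // Transferable is how Siri/Apple Intelligence actually reads the content.
    static var transferRepresentation: some TransferRepresentation {
        ProxyRepresentation(exporting: \.fullText)
    }
}

struct BookQuery: EntityQuery {
    func entities(for identifiers: [BookEntity.ID]) async throws -> [BookEntity] {
        [] // resolve identifiers against your own store
    }
}

struct ReaderView: View {
    let book: BookEntity

    var body: some View {
        ScrollView { Text(book.fullText) }
            // Tag the screen's user activity so "what's on my screen"
            // questions can resolve to this entity.
            .userActivity("com.example.reading") { activity in
                activity.title = book.title
                activity.appEntityIdentifier = EntityIdentifier(
                    for: BookEntity.self, identifier: book.id)
            }
    }
}
```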
I believe Apple is purposefully slowing their AI releases because the impacts would be too disruptive.

Apple is sitting on a Large Action Model that could “use your phone for you”. It could use apps, navigate interfaces, and take actions… possibly within the OS, without having to “open” anything.
Most people are still talking about using “AI” as writing assistants or content aggregators.
Few people are thinking about the future of apps and browsers. Agents – AIs that do things for you – are finally getting a bit of traction. This week Anthropic announced that its frontier model Claude can use a computer, and TED AI held a panel on agents in San Francisco. Occasionally, the “death of page views” shows up on the radar of publishers.
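
To make the Anthropic news concrete: “computer use” ships as an ordinary API call, and the model doesn’t click anything itself – it returns actions for your code to execute. Here’s a rough one-turn sketch against the beta as announced in October 2024 (the prompt and screen size are placeholders; field names are those of that beta):

```swift
import Foundation

// One turn of the computer-use loop: Claude returns tool_use blocks
// describing actions; your own code performs them and reports back.
let body: [String: Any] = [
    "model": "claude-3-5-sonnet-20241022",
    "max_tokens": 1024,
    "tools": [[
        "type": "computer_20241022",      // virtual screen/mouse/keyboard tool
        "name": "computer",
        "display_width_px": 1280,
        "display_height_px": 800
    ]],
    "messages": [[
        "role": "user",
        "content": "Open the weather site and tell me tomorrow's forecast."
    ]]
]

var request = URLRequest(url: URL(string: "https://api.anthropic.com/v1/messages")!)
request.httpMethod = "POST"
request.setValue(ProcessInfo.processInfo.environment["ANTHROPIC_API_KEY"],
                 forHTTPHeaderField: "x-api-key")
request.setValue("2023-06-01", forHTTPHeaderField: "anthropic-version")
request.setValue("computer-use-2024-10-22", forHTTPHeaderField: "anthropic-beta")
request.setValue("application/json", forHTTPHeaderField: "Content-Type")
request.httpBody = try JSONSerialization.data(withJSONObject: body)

// The reply's content will include actions such as {"action": "screenshot"}
// or {"action": "left_click", "coordinate": [x, y]}; an agent loop executes
// each one and posts the result back as a tool_result message.
let (data, _) = try await URLSession.shared.data(for: request)
print(String(data: data, encoding: .utf8) ?? "")
```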
I think browsers and apps themselves are going to disappear. In April, I cautioned:
“Stop worrying about LLMs crawling the web. Start worrying about LLMs learning how to use computers and eating the entire concept of user interfaces. There will be no way to “block” AI because AI will be driving the operating systems. Entire new industries and disciplines are coming.”
Think about the iPhone. For it to be the magic it is today, it needs a lot of “parts” that add up to a sum that’s greater than any one thing. Wi-Fi connectivity. Digital rights management. Content creators. Apps. A camera. Speech-to-text. Along the way, the phone ate everything on our desk. See this photo (below)? That’s what’s going to happen to everything we’re using now.

In addition to gobbling up hardware and software, the iPhone enabled entirely new businesses to appear. Uber could not work without mobile phones, an app store… and, most importantly, the addition of GPS with the iPhone 3G. Uber is an entire business with a market cap of $121 billion, and it depends on that sum of parts.
We need to start thinking of AI as a sum rather than the parts: a language model is not a writing assistant; it’s a plain-language interface. Object tracking, segmentation, and depth estimation are not cool tricks for tracking objects; they are an interface between AI and real life.
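
That interface is already sitting in the SDKs. As a small illustration (the photo path is a placeholder), Apple’s Vision framework turns a photo into a machine-readable segmentation mask in a few lines – the same kind of output an agent would act on:

```swift
import Foundation
import Vision
import CoreVideo

// Person segmentation: real life in, a per-pixel mask out. The "cool trick"
// and the "interface between AI and real life" are the same call.
let photoURL = URL(fileURLWithPath: "photo.jpg")   // placeholder path

let request = VNGeneratePersonSegmentationRequest()
request.qualityLevel = .balanced

let handler = VNImageRequestHandler(url: photoURL, options: [:])
try handler.perform([request])

if let mask = request.results?.first?.pixelBuffer {
    // `mask` marks person vs. background, pixel by pixel – a map of the
    // physical scene that downstream models can reason over and act on.
    print("Got a segmentation mask \(CVPixelBufferGetWidth(mask))px wide")
}
```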
First, we’ll chat with the bots. Then, the bots will recognize things in images and videos. Then the bots will understand real life physical objects and context. Then they’ll be embodied and we’ll be talking to them in plain language. Our phones, a drone, a car, a robot… they will all be the same thing.
It won’t be long until you can just talk out loud and get what you want, with continued conversations everywhere you go. It’s already like this in your pocket, but you have to take the phone out.
It’s critical to look at all elements of AI as pieces that will come together to replace phones and laptops… and to be embodied in speakers, screens, robots, and cars.
A year ago, in October 2023, Apple released an open model called Ferret that could identify and ground objects in an image – essentially segmentation.

In November 2023 I wrote an article, published in January 2024, called “The AI Future: Exploring the Adjacent Possible with Emerging AI Solutions”. In it I wrote a section called The Future of Interfaces:
The Future of Interfaces
“The ‘content’ of any medium is always another medium. The content of writing is speech, just as the written word is the content of print, and print is the content of the telegraph.” – Marshall McLuhan, 1964
Each new medium both contains and can emulate the media it replaces. The internet contains and emulates film, radio, television, publishing, and retail. The content of AI will include… the Internet.
Language models communicate through conversations, and if we gather and refine information through dialog, we’re not visiting websites. If we need to see, hear, or watch something, the agent can deliver it.
Bill Gates predicted agents in 1995. In the November 2023 edition of “Gates Notes,” he reiterated: “You won’t have different apps for different tasks. You’ll simply tell your device, in everyday language, what you want to do… Agents are not only going to change how everyone interacts with computers. They’re going to upend the software industry.”
As we converse with our tools using plain, intuitive language, they blend into our lives and depart from the constructs of laptops, browsers, and phones. If you use an Amazon Echo or Apple Siri (early agents) to get what you need, you won’t need to open your laptop or pick up the phone.
In April 2024, Apple published a paper called “Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs”.

I believe Apple is sitting on a Large Action Model that could “use your phone for you” right now. It could use apps, navigate interfaces, and take actions… possibly within the OS, without having to “open” anything.
From Apple in April:
Recent advancements in multimodal large language models (MLLMs) have been noteworthy, yet, these general-domain MLLMs often fall short in their ability to comprehend and interact effectively with user interface (UI) screens. In this paper, we present Ferret-UI, a new MLLM tailored for enhanced understanding of mobile UI screens, equipped with referring, grounding, and reasoning capabilities.
https://huggingface.co/papers/2404.05719
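
Put referring, grounding, and reasoning together and the loop of a Large Action Model almost writes itself. Here’s a deliberately hypothetical sketch – every type and function name below is invented for illustration, not an Apple API:

```swift
import Foundation
import CoreGraphics

// Invented types: what a Ferret-UI-style model hands back to an agent.
struct UIElement {
    let label: String          // e.g. "Pay Now button"
    let box: CGRect            // grounded screen coordinates
}

enum AgentAction {
    case tap(CGPoint)
    case type(String)
    case done(String)
}

protocol ScreenGroundingModel {
    // Screenshot + instruction in, referred-and-grounded elements out.
    func ground(screenshot: Data, instruction: String) async throws -> [UIElement]
}

func runAgent(model: ScreenGroundingModel,
              goal: String,
              takeScreenshot: () -> Data,
              perform: (AgentAction) -> Void) async throws {
    for _ in 0..<20 {                       // cap the loop; agents must halt
        let screen = takeScreenshot()
        let elements = try await model.ground(screenshot: screen, instruction: goal)
        guard let target = elements.first else {
            perform(.done("No grounded element found for: \(goal)"))
            return
        }
        // Tap the center of the grounded element; the next screenshot shows
        // the model what changed, closing the act-observe loop.
        perform(.tap(CGPoint(x: target.box.midX, y: target.box.midY)))
    }
}
```

The point isn’t the code; it’s that every piece in that loop – screen understanding, grounding, acting – already shows up in Apple’s published research.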
The sooner we start to see the parts combining into the sum, the better we can prepare for what’s coming our way… I predict that for a lot of people it’s going to be as sudden as the scene in Braveheart.
Postscript: I share all of my posts with AI to see what it thinks (and get myself into the training data). GPT cracks me up. I gave it the URL of my article and it replied “They may take our home screens, but they’ll never take our… ecosystem!” SOLID answer!!!