Ethan B. Holland

Over 54,400 manually organized AI links and counting

AI Six-Month Recap: The Biggest Stories in Multimodality (Feb–Aug 2025)

September 17, 2025

March 7, 2025
- Microsoft’s Phi-4 Multimodal Takes Top Spot in Speech Recognition (OpenSource)
- Sesame Launches Voice AI That Rivals Human Interaction
March 21, 2025
- OpenAI has released three advanced audio models, which include text-to-speech and speech-to-text with better transcription
March 28, 2025
- Researchers developed an AI model that can identify endometrial cancer in tissue samples with 99.2% accuracy. This is significantly better than the current best-in-class methods, which reach 80%.
April 11, 2025
- Google has integrated vision into its AI tools, finally catching up to OpenAI’s multimodal product.
- Audio leader ElevenLabs has adopted Anthropic’s Model Context Protocol (MCP), letting locally running servers create voice agents that make outbound calls and clone voices.

April 25, 2025
- Microsoft added the ability to “see” to its web browser. It’s also launched computer use and web browsing within the operating system.
May 9, 2025
- OpenAI’s o3 analyzed a Harvard Business School case study PDF, extracted scattered financial data, and built a business model comparable to one an MBA student would produce.
May 16, 2025
- Amazon launched Nova Sonic, which can have natural back-and-forth conversations in real time. Sounds like Alexa wants to drink Siri’s milkshake.
June 6, 2025
- The FDA approved the first AI tool to predict breast cancer risk from mammograms.
June 14, 2025
- Meta released a robot training system that learns how the physical world works by watching videos, similar to children developing intuitions by watching the world.
- Instagram is testing features that convert static photos into 3-D images.
- OpenAI updated their voice mode to be a lot more expressive. You can ask the voice to sound nervous, excited, or jittery, and the new voice features can capture those emotions.

June 20, 2025
- OpenAI rolled out a record mode for ChatGPT that can capture meetings and voice notes.
June 27, 2025
- Amazon Ring launched video descriptions where your security camera simply tells you what’s happening outside so you don’t need to look at the camera. It could email you what happened or describe it out loud as it happens. “A guy drove up in a UPS truck and put a box on your front steps.” “A woman is at the door with a dog on a leash and keeps looking into the window.” Etc.
- A few months ago, Google DeepMind released a tool called VideoPrism that is suddenly gaining attention. VideoPrism can watch a video and provide deep context and details about every frame, a task that would be virtually impossible for a human to accomplish. Think real estate, legal discovery, etc…
  - https://research.google/blog/videoprism-a-foundational-visual-encoder-for-video-understanding/ <- Quick look at it in action

- Alibaba developed a system that can identify gastric cancer from standard CT scans with greater accuracy than doctors. China has already deployed the system and screened over 78,000 patients!
- Google also launched an update of its Gemma model, which is designed to run locally on smartphones and edge devices. Gemma provides multimodal AI capabilities without any Internet connection. In addition to being able to understand text, images, audio, and video, the model can handle translation and is open source.
- Google DeepMind released an AI system called AlphaGenome that predicts how genetic variations affect biological processes and can analyze DNA sequences up to 1 million letters long.
July 4, 2025
- Researchers have developed an artificial intelligence system that can detect Parkinson’s disease by analyzing short videos of people smiling, achieving 88% accuracy.
- Mayo Clinic researchers have created an artificial intelligence tool that can identify nine different types of dementia from a single brain scan.

- Nvidia released an 8-billion-parameter vision model designed for document processing and character recognition. It can extract and understand information from complex documents, PDFs, images, tables, charts, formulas, and diagrams. This will lead to automated workflows in finance, healthcare, and law firms. These models will also help with invoice processing and compliance.
- Google released a hurricane forecasting model with long-range forecasts, up to two weeks ahead of a storm.
July 18, 2025
- ChatGPT is on track to edit and understand Excel and PowerPoint within chats, without opening Microsoft Office.
July 25, 2025
- Meta released a dataset of 4,000+ videos of face-to-face conversations and over 65,000 social interactions with full annotations to help models learn and emulate behavior.
- Google DeepMind released an AI tool called Aeneas that will help historians interpret fragments of Latin inscriptions.
August 8, 2025
- Google released a world simulation model that creates entire explorable 3D worlds with just a prompt.