Image created with Flux Pro v1.1 Ultra. Image prompt: Ornate showgirl glamour in orange-and-teal tones, radiant crystal collage of feathers, lights, and props symbolizing multiple forms, stylized text “Multimodality” in bold glitter marquee script across the backdrop; spotlit, dramatic contrast, vintage grain, cinematic, high-detail
How AI is helping advance the science of bioacoustics to save endangered species – Google DeepMind https://deepmind.google/discover/blog/how-ai-is-helping-advance-the-science-of-bioacoustics-to-save-endangered-species/
“Cameras as Relative Positional Encoding” TL;DR: a comparison of ways to condition transformers on cameras: token-level raymaps, attention-level relative pose encodings, and a new relative scheme, Projective Positional Encoding, which encodes camera frustums (both intrinsics and extrinsics) as relative positional encodings https://x.com/Almorgand/status/1951331762463822212
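The token-level raymap baseline mentioned above is standard pinhole-camera geometry: each pixel becomes a 6-D token (ray origin + unit direction in world coordinates). A minimal NumPy sketch, assuming world-to-camera extrinsics `[R|t]` (the paper's exact tokenization may differ):

```python
import numpy as np

def raymap(K, R, t, h, w):
    """Per-pixel ray origins and directions in the world frame, from
    intrinsics K (3x3) and world-to-camera extrinsics R (3x3), t (3,).
    Returns an (h, w, 6) array: [origin(3), unit direction(3)] per pixel."""
    # Pixel centers in homogeneous image coordinates.
    ys, xs = np.mgrid[0:h, 0:w] + 0.5
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1)          # (h, w, 3)
    # Back-project through K^{-1}, then rotate camera rays into the world frame.
    dirs_cam = pix @ np.linalg.inv(K).T
    dirs_world = dirs_cam @ R                                     # row-wise R^T d
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    # Camera center in world coordinates is -R^T t; shared by all rays.
    origin = -R.T @ t
    origins = np.broadcast_to(origin, dirs_world.shape)
    return np.concatenate([origins, dirs_world], axis=-1)
```

A relative encoding like the one the thread describes would instead feed the transformer functions of ray geometry between camera pairs rather than these absolute per-pixel values.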
We’re excited to introduce a new parsing mode within LlamaCloud that lets you get complex visual recognition capabilities over documents 🖼️📑 at a cheaper price compared to pretty much anything else out there ⚡️ There’s a variety of VLM-enabled document parsing solutions out https://x.com/jerryjliu0/status/1953227974716665996
RT @__sunil_kumar_: “Excited to share some new work: we show how to efficiently train small vision-language models to use a zoom tool with G…” https://x.com/andersonbcdefg/status/1953324558485713316
RT @SergioPaniego: “Latest TRL release brings major upgrades for multimodal alignment! We dive into 3 new techniques to improve VLM post-tr…” https://x.com/algo_diver/status/1953484920724005206
Now live for Motorola users in moto ai: Copilot Vision, so you can show, not tell. Translating street signs? Figuring out what’s wrong with your vacuum? Here to help, in 50+ languages. Just one click away, right on your Motorola device. https://x.com/mustafasuleyman/status/1953139009674170780
Technology Trends 2025 | Technology Vision | Accenture https://www.accenture.com/us-en/insights/technology/technology-trends-2025
🚨 Just released: X-Humanoid unveils the world’s first general-purpose multimodal perception system for humanoid robots — HumanoidOccupancy. Why does this matter? Just like humans, robots need multiple senses — eyes, ears, even a nose — to perceive and understand the world. https://x.com/reborn_agi/status/1950053236628697468
Truly general-purpose robots must be able to navigate spaces they have never seen before. ⦿ Skild Brain enables end-to-end autonomous locomotion from raw vision and joint inputs, without mapping or pre-planning. ⦿ The model adapts in real time to new terrain such as stairs, https://x.com/TheHumanoidHub/status/1953237429537894550
It’s wild how much performance can differ depending on the provider implementation (probably the result of overly aggressive quantization)! I’ve also seen cases where providers silently fail to respect tool-calling or structured-generation formats: always have your own mini-benchmark https://x.com/AymericRoucher/status/1953115586273394873
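The mini-benchmark advice above is easy to act on: send the same prompt to each provider and assert on the properties you rely on (here, structured-output compliance). A minimal stdlib sketch, where `providers` maps names to your own client wrappers (hypothetical callables — plug in whatever SDKs you actually use):

```python
import json

def check_structured_output(reply: str, required_keys: set) -> bool:
    """True iff the provider reply is valid JSON containing every
    required key — a minimal compliance check, not a quality score."""
    try:
        obj = json.loads(reply)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and required_keys <= obj.keys()

def run_mini_benchmark(providers, prompt, required_keys):
    """providers: mapping name -> callable(prompt) -> str.
    Returns a pass/fail verdict per provider for this one prompt."""
    return {
        name: check_structured_output(call(prompt), required_keys)
        for name, call in providers.items()
    }
```

Extending this with a handful of prompts and a tool-call format check catches most of the silent failures the tweet describes.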
For whatever reason, @modal_labs always gets compared to “inference providers”, which always confuses me? Modal was always built to be a general-purpose platform for AI/ML/data. Yes – we do inference really well! But we also do batch processing, sandboxes, training, … https://x.com/bernhardsson/status/1951729049866514508