Ethan B. Holland

Over 54,400 manually organized AI links and counting

Multimodal: AI News Week Ending 08/01/2025

August 1, 2025

Image created with Flux Pro v1.1 Ultra. Image prompt: photorealistic still image of a middle-aged man standing behind a woman, woman covering part of her face with her hand, man looking over her shoulder, both illuminated with warm stadium jumbotron lighting, natural skin tones, subtle lens flare, shallow depth of field, exact color temperature of a live event projection, man wearing a jacket decorated with icons for text, video, and audio, woman holding a matching tote, cinematic realism –no text, captions, watermarks

we plugging ViTPose into Basketball AI. according to @NBA rules, a player is considered to be in the paint only if both feet are inside the paint. notebook: https://x.com/skalskip92/status/1950231824933982428
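The both-feet rule quoted above can be sketched as a point-in-polygon check on the two ankle keypoints. This is a minimal illustration, not the code from the linked notebook: it assumes a pose model has already produced left and right ankle positions and that those positions have been mapped into the same coordinate system as the paint outline. The names `point_in_polygon`, `player_in_paint`, and the rectangle used for the paint are hypothetical.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def point_in_polygon(point: Point, polygon: Sequence[Point]) -> bool:
    """Ray-casting test. Points exactly on an edge are not handled specially."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point matter.
        if (y1 > y) != (y2 > y):
            # x-coordinate where this edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def player_in_paint(
    left_ankle: Point, right_ankle: Point, paint: Sequence[Point]
) -> bool:
    """Apply the rule from the post: both feet must be inside the paint."""
    return point_in_polygon(left_ankle, paint) and point_in_polygon(
        right_ankle, paint
    )


# Hypothetical paint outline in court coordinates (assumption, for illustration).
PAINT = [(0.0, 0.0), (16.0, 0.0), (16.0, 19.0), (0.0, 19.0)]

both_feet_in = player_in_paint((4.0, 5.0), (6.0, 5.5), PAINT)
one_foot_out = player_in_paint((4.0, 5.0), (17.0, 5.0), PAINT)
```

Requiring both ankles, rather than the center of the detection box, is what separates a player straddling the line from one fully inside the paint.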

what player is that? in the upcoming supervision-0.27.0 release, you’ll be able to freely control text position, including applying custom offsets from the detection box. supervision annotators are now so advanced, you can literally use them to create full visual content. link: https://x.com/skalskip92/status/1950984077617799534

Runway, Luma Target Sales to Robotics Companies — The Information https://www.theinformation.com/articles/runway-luma-target-sales-robotics-companies

Alibaba to launch AI-powered smart glasses creating rival to Meta https://www.cnbc.com/2025/07/28/alibaba-ai-smart-glasses-creates-rival-to-meta.html

NotebookLM updates: Video Overviews, Studio upgrades https://blog.google/technology/google-labs/notebooklm-video-overviews-studio-upgrades/

Pierre and team really cooked with this vision language model (VLM)! Excited for you to try it out! 111B open parameters https://x.com/JayAlammar/status/1950931480349143259

RT @1vnzh: Command A Vision – SOTA enterprisemaxx multimodal model – Outperforms GPT 4.1, Llama 4 Maverick, and Mistral Medium 3 in enterpr… https://x.com/aidangomez/status/1950927454383616343

RT @nickfrosst: cohere vision model 🙂 weights on huggingface https://x.com/andrew_n_carr/status/1951068402090647608

Step3: Cost-Effective Multimodal Intelligence | StepFun https://stepfun.ai/research/en/step3

Introducing Command A Vision: Multimodal AI https://cohere.com/blog/command-a-vision

OpenAI agent is blocked by OpenAI captcha. https://x.com/gneubig/status/1948915714955641159

i stopped using GQA as an eval when i found this woman was labeled a bird and the phone as white. the annotations have a 20-30% error rate. (and it’s supposed to be a “cleaned up” version of visual genome, so steer clear of that one too) https://x.com/vikhyatk/status/1949365273901060474

Step3 benchmarks at last. The first «DeepSeek-like» that’s strongly multimodal (Ernie disappointed). It’s very different from V3, too – another in-house attention, the logic around inference economics. A big release. https://x.com/teortaxesTex/status/1951008169989382218

very surprising that fifteen years of hardcore computer vision research contributed ~nothing toward AGI except better optimizers. we still don’t have models that get smarter when we give them eyes https://x.com/jxmnop/status/1949869844142473322

Unitree Introducing | Unitree R1 Intelligent Companion. Price from $5900. Join us to develop/customize, ultra-lightweight at approximately 25kg, integrated with a Large Multimodal Model for voice and images, let’s accelerate the advent of the agent era!🥰 https://x.com/UnitreeRobotics/status/1948681325277577551