Image created with Flux Pro v1.1 Ultra. Image prompt: CU Boulder brand style — CU Gold & Black, Helvetica Neue, Flatirons, Tuscan-vernacular sandstone + red-tile roofs; CASE interior lobby, evening ambient light, ground-level wide, sandstone texture band; integrate the category “Multimodality” via Overlay: connected icons for text, image, audio, and video under the label “MULTIMODALITY”; natural light, clean professional inspiring tone, crisp focus, subtle grain, editorial composition
Gemma 3 270M 4-bit DWQ is up. Same speed, same memory, much better quality: https://x.com/awnihannun/status/1956089788240728467
Gemma 3 270M 4-bit generates text at over 650 (!) tok/sec on an M4 Max with mlx-lm and uses < 200 MB (not sped up): https://x.com/awnihannun/status/1956053493216895406
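For anyone who wants to reproduce this, here is a minimal mlx-lm sketch; the repo id is an assumption, so check the mlx-community page on Hugging Face for the actual 4-bit DWQ upload:

```python
# Minimal sketch: run Gemma 3 270M with mlx-lm (pip install mlx-lm).
from mlx_lm import load, generate

# Assumed repo id -- look up the real 4-bit DWQ checkpoint on mlx-community.
model, tokenizer = load("mlx-community/gemma-3-270m-it-4bit-DWQ")

messages = [{"role": "user", "content": "Explain KV caching in two sentences."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# verbose=True prints generation speed (tok/sec) and peak memory.
response = generate(model, tokenizer, prompt=prompt, verbose=True)
```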
Gemma 3 270M running on my Pixel 7a! Absolutely crazy (not sped up) https://x.com/1littlecoder/status/1956065040563331344
Google just dropped a new tiny LLM with outstanding performance — Gemma3 270M. Now available on KerasHub. Try the new presets `gemma3_270m` and `gemma3_instruct_270m`! https://x.com/fchollet/status/1956059444523286870
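A quick sketch of trying those presets with KerasHub; the two preset strings come from the tweet, the rest is standard KerasHub usage:

```python
# Load the instruct preset named in the announcement (pip install keras-hub).
import keras_hub

gemma = keras_hub.models.Gemma3CausalLM.from_preset("gemma3_instruct_270m")
print(gemma.generate("Write a haiku about tiny language models.", max_length=64))
```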
Google releases Gemma 3 270M, a new model that runs locally on just 0.5 GB RAM.✨ Trained on 6T tokens, it runs fast on phones & handles chat, coding & math. Run at ~50 t/s with our Dynamic GGUF, or fine-tune via Unsloth & export to your phone. Details: https://x.com/UnslothAI/status/1956027720288366883
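One hedged way to try the GGUF from Python is llama-cpp-python; the repo id and quant filename below are assumptions, so check Unsloth's Hugging Face page for the real names:

```python
# Pull and run an Unsloth GGUF via llama-cpp-python
# (pip install llama-cpp-python huggingface_hub).
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="unsloth/gemma-3-270m-it-GGUF",  # assumed repo id
    filename="*Q4_K_M.gguf",                 # assumed quant; globs are allowed
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "List three on-device LLM use cases."}],
)
print(out["choices"][0]["message"]["content"])
```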
Introducing Gemma 3 270M: The compact model for hyper-efficient AI – Google Developers Blog https://developers.googleblog.com/en/introducing-gemma-3-270m/
Introducing Gemma 3 270M! 🚀 It sets a new standard for instruction-following in compact models, while being extremely efficient for specialized tasks. https://x.com/googleaidevs/status/1956023961294131488
The new Gemma 3 270M is here https://x.com/ggerganov/status/1956026718013014240
Introducing Gemma 3 270M, a new compact open model engineered for hyper-efficient AI. Built on the Gemma 3 architecture with 170 million embedding parameters and 100 million for transformer blocks. – Sets a new performance standard for its size on IFEval. – Built for domain adaptation. https://x.com/_philschmid/status/1956024995701723484
`ollama run gemma3:270m` Gemma 3 270M is here! Small model that is extremely efficient to run on-device, and designed for fine-tuning to serve specific agentic use cases! https://x.com/ollama/status/1956034607373222042
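The same tag also works from the ollama Python client (a sketch; it assumes the Ollama daemon is already running locally):

```python
# pip install ollama; talks to a local Ollama daemon.
import ollama

resp = ollama.chat(
    model="gemma3:270m",
    messages=[{"role": "user", "content": "Summarize why small models matter."}],
)
print(resp["message"]["content"])
```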
LFM2-VL: Efficient Vision-Language Models | Liquid AI https://www.liquid.ai/blog/lfm2-vl-efficient-vision-language-models
Introducing DINOv3: a state-of-the-art computer vision model trained with self-supervised learning (SSL) that produces powerful, high-resolution image features. For the first time, a single frozen vision backbone outperforms specialized solutions on multiple long-standing dense prediction tasks. https://x.com/AIatMeta/status/1956027795051831584
GLM-4.5V is out! It’s a multimodal reasoning MoE with 106B total and 12B active params 🔥 It comes with transformers support from the get-go! 💗 You can also use it with @huggingface Inference Providers powered by @novita_labs 👏 https://x.com/mervenoyann/status/1954907611368771728
Introducing DINOv3 🦕🦕🦕 A SotA-enabling vision foundation model, trained with pure self-supervised learning (SSL) at scale. High quality dense features, combining unprecedented semantic and geometric scene understanding. Three reasons why this matters… https://x.com/maxseitzer/status/1956029421602623787
new TRL comes packed for vision language models 🔥 we shipped support for > native supervised fine-tuning for VLMs > multimodal GRPO > MPO 🫡 read all about it in our blog 🤗 next one! https://x.com/mervenoyann/status/1955622287920537636
Say hello to DINOv3 🦖🦖🦖 A major release that raises the bar of self-supervised vision foundation models. With stunning high-resolution dense features, it’s a game-changer for vision tasks! We scaled model size and training data, but here’s what makes it special 👇 https://x.com/BaldassarreFe/status/1956027867860516867
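A minimal sketch of pulling dense DINOv3 features through transformers; the checkpoint id is an assumption, so browse Meta's Hugging Face org for the released backbones:

```python
# Extract DINOv3 patch features as a frozen backbone
# (pip install transformers torch pillow).
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

ckpt = "facebook/dinov3-vitb16-pretrain-lvd1689m"  # assumed checkpoint id
processor = AutoImageProcessor.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt)

image = Image.open("scene.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# Patch tokens double as dense features for segmentation, depth, matching, etc.
print(out.last_hidden_state.shape)  # (1, num_tokens, hidden_dim)
```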
zai-org/GLM-V: GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning https://github.com/zai-org/GLM-V
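Given the claim of day-one transformers support, a hedged sketch might look like the following; the auto class and chat-template flow are assumptions, and the 106B-total MoE needs multi-GPU memory even with only 12B active params:

```python
# Hedged sketch: GLM-4.5V through transformers' image-text-to-text interface.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

repo = "zai-org/GLM-4.5V"
processor = AutoProcessor.from_pretrained(repo)
model = AutoModelForImageTextToText.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": [
    {"type": "image", "url": "https://example.com/chart.png"},  # placeholder image
    {"type": "text", "text": "What trend does this chart show?"},
]}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
print(processor.decode(model.generate(**inputs, max_new_tokens=128)[0]))
```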
Introducing LangExtract: A Gemini-powered information extraction library – Google Developers Blog https://developers.googleblog.com/en/introducing-langextract-a-gemini-powered-information-extraction-library/
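A sketch of the extraction flow, loosely following the library's README; treat the exact argument and class names as assumptions and check the repo:

```python
# Few-shot structured extraction with LangExtract (pip install langextract).
import langextract as lx

# One worked example teaches the model the target schema.
examples = [
    lx.data.ExampleData(
        text="Dr. Smith prescribed 20mg of lisinopril daily.",
        extractions=[
            lx.data.Extraction(
                extraction_class="medication",
                extraction_text="lisinopril",
                attributes={"dose": "20mg", "frequency": "daily"},
            ),
        ],
    ),
]

result = lx.extract(
    text_or_documents="The patient was started on 5mg amlodipine each morning.",
    prompt_description="Extract medications with their dose and frequency.",
    examples=examples,
    model_id="gemini-2.5-flash",  # Gemini-powered, per the blog post
)
for e in result.extractions:
    print(e.extraction_class, e.extraction_text, e.attributes)
```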
[Note: reference the quotes from Sotheby’s presentation.]
The web’s next user isn’t human. AIs will soon use the internet far more than humans ever have. At Parallel, we are building for the web’s second user. Our API is the first to surpass humans and all leading AI models (including GPT-5) on deep web research tasks. https://x.com/p0/status/1956007609250492924
Introducing Parallel | Web Search Infrastructure for AIs | Parallel Web Systems | Enterprise Deep Research API https://parallel.ai/blog/introducing-parallel
RT @Saboo_Shubham_: Google just released LangExtract Python library. It can extract structured data from unstructured docs with precise so… https://x.com/algo_diver/status/1954424008767951106
Natural conversation includes interruptions and talking over people, which is hard for an LLM to model as a single autoregressive sequence. I’m sure you can get pretty far by creating a text sequence with movie-script-like breaks mid sentence, but it seems like the real solution… https://x.com/ID_AA_Carmack/status/1954930438322954532
Introducing Higgsfield Draw-to-Video. RIP Prompts. Turn your sketch into an absolute cinema. Works with all our video models: MiniMax, Veo 3 & Seedance Pro. This is possible ONLY in Higgsfield. Retweet to unlock the full capacity of the best video models in your DMs. https://x.com/higgsfield_ai/status/1955742643704750571
Runway Aleph can precisely replace, retexture or entirely reimagine specific parts of a video, making it possible to rapidly ideate and iterate new concepts with existing footage. All you need to do is tell Aleph what you want. https://x.com/runwayml/status/1955615613583519917
A compilation of experiments I made with GPT-5 in one shot. The poem camera app is particularly impressive because the model came up with all the details, like the way the photos stack in the gallery, the photo-developing animation, etc. https://x.com/skirano/status/1953516768317628818
GPT-5 Pro is an impressive geo-guesser. I gave it a cropped photo with metadata removed and it figured out the city. https://x.com/emollick/status/1954288373797203991
HTC Unveils VIVE Eagle AI Glasses https://www.vive.com/us/newsroom/2025-08-14/
LightSwitch: Multi-view Relighting with Material-guided Diffusion TL;DR: material-relighting diffusion framework; relights an arbitrary number of input images to a target lighting condition while incorporating cues from inferred intrinsic properties; (1/2) https://x.com/Almorgand/status/1955655723985309967
The enhance meme from Blade Runner, except the AI is asking the computer to enhance. https://x.com/emollick/status/1954534598903275605
Tencent Hunyuan (腾讯混元) https://vision.hunyuan.tencent.com/zh?tabIndex=0
There have been a lot of crazy many-camera rigs created for the purpose of capturing full spatial video. I recall a conversation at Meta that was basically “we are going to lean in as hard as possible on classic geometric computer vision before looking at machine learning…” https://x.com/ID_AA_Carmack/status/1955302165653926058
Farewell Microsoft Lens – popular mobile PDF scanner app set to be ditched soon | TechRadar https://www.techradar.com/pro/microsoft-is-killing-off-its-well-loved-lens-pdf-scanner-app-in-favor-of-ai
🚨 Big news! We decided that @huggingface’s post-training library, TRL, will natively support training Vision Language Models 🖼️ This builds on our recent VLM support in SFTTrainer — and we’re not stopping until TRL is the #1 VLM training library 🥇 More here 👉 https://x.com/QGallouedec/status/1956066332488950020
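In rough strokes, the native VLM path looks like plain SFT; this is a sketch, with the model and dataset ids as placeholders, so check the TRL docs for the expected conversational-with-images dataset format:

```python
# Hedged sketch of TRL's native supervised fine-tuning for VLMs
# (pip install trl datasets).
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

train_dataset = load_dataset("trl-lib/llava-instruct-mix", split="train")  # assumed id

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-VL-3B-Instruct",  # placeholder VLM checkpoint
    args=SFTConfig(output_dir="vlm-sft", per_device_train_batch_size=1),
    train_dataset=train_dataset,
)
trainer.train()
```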
[2507.22229] TRIBE: TRImodal Brain Encoder for whole-brain fMRI response prediction https://www.arxiv.org/abs/2507.22229
when you get access to gpt-5, try a message like “use beatbot to make a sick beat to celebrate gpt-5”. it’s a nice preview of what we think this will be like as AI starts to generate its own UX and interfaces get more dynamic. it’s cool that you can interact with the… https://x.com/sama/status/1953529799219319205
RT @ggerganov: whisper.cpp is coming to ffmpeg https://x.com/ggerganov/status/1955161982023131645
🎨 Deep Agents UI: deep agents operate with a todo list, file system, and subagents. We built a dedicated UI for running deep agents that properly highlights all of these things! Repo: https://x.com/LangChainAI/status/1955674201853247584