Image created with Flux Pro v1.1 Ultra. Image prompt: Local, smartphone on table with on-device chip tray illustrated by arrays of tiny bananas arranged like circuitry, antenna off, photorealistic, editorial, minimal, high detail, 3:2 landscape

Apple released FastVLM, so I tried vibe coding a video captioning AI app with it.
It took 5 prompts to get a working app in anycoder and deploy it on Hugging Face.
It's 85x faster and 3.4x smaller than comparably sized VLMs.
The deployed app works 100% locally in your browser, powered by transformers.js and WebGPU. https://x.com/_akhaliq/status/1962018549674684890

🚨 Apple just released FastVLM on Hugging Face – 0.5B, 1.5B and 7B real-time VLMs with WebGPU support 🤯
> 85x faster and 3.4x smaller than comparably sized VLMs
> 7.9x faster TTFT for larger models
> designed to output fewer tokens and reduce encoding time for high
https://x.com/reach_vb/status/1961471154197053769

And FastVLM was released by Apple today! 🚀 All about on-device use. Model sizes: 0.5B, 1.5B, 7B. Available in MLX and Core ML. Vision encoder designed to output fewer tokens and reduce encoding time. Which means much faster time-to-first-token. https://x.com/pcuenq/status/1961464859465269757

Holy crap! That is some fast video captioning — all happening locally in your browser 🤯 This is the aptly named FastVLM by Apple; available on HF: https://x.com/bilawalsidhu/status/1962545148136444380

NEW: Apple releases FastVLM and MobileCLIP2 on Hugging Face! 🤗 The models are up to 85x faster and 3.4x smaller than previous work, enabling real-time VLM applications! 🤯 It can even do live video captioning 100% locally in your browser (zero install). Huge for accessibility! https://x.com/xenovacom/status/1961454543503344036

Google’s on a roll. That’s a lot of performance for that tiny size! I just embedded 1.4 million documents in ~80 mins on my M2 Max for free. Would’ve been ~$200 with text-embedding-3-large, with worse quality. https://x.com/rishdotblog/status/1963805087014502497

If you think Apple is not doing much in AI, you’re getting blindsided by the chatbot hype and not paying enough attention! They just released FastVLM and MobileCLIP2 on Huggingface. The models are up to 85x faster and 3.4x smaller than previous work, enabling real-time vision language model (VLM) applications! It can even do live video captioning 100% locally in your browser 🤯🤯🤯 https://x.com/ClementDelangue/status/1962526559115358645

EmbeddingGemma is our new best-in-class open embedding model designed for on-device AI. 📱 At just 308M parameters, it delivers state-of-the-art performance while being small and efficient enough to run anywhere – even without an internet connection. https://x.com/GoogleDeepMind/status/1963635422698856705
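The "small enough to run anywhere" claim is easy to sanity-check with back-of-the-envelope arithmetic. This sketch (plain Python; the 308M parameter count is from the announcement, the byte widths are the standard precisions) estimates the weight-only footprint at different quantization levels:

```python
# Rough weight-only memory footprint for a 308M-parameter model.
# Real deployments add activation, embedding, and runtime buffers on top.
PARAMS = 308_000_000

def weight_mb(bytes_per_param: float) -> float:
    """Model weight size in megabytes at a given precision."""
    return PARAMS * bytes_per_param / 1e6

print(f"fp32: {weight_mb(4):.0f} MB")   # ~1232 MB
print(f"fp16: {weight_mb(2):.0f} MB")   # ~616 MB
print(f"int8: {weight_mb(1):.0f} MB")   # ~308 MB
print(f"int4: {weight_mb(0.5):.0f} MB") # ~154 MB
```

The int4 figure (~154 MB) is consistent with the "runs on less than 200MB of RAM with quantization" claim quoted further down.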

Embeddings go on-device ⬇️ EmbeddingGemma – a new open multilingual embedding model with 308M parameters, optimized for speed, privacy, and efficiency. It’s based on Gemma 3 and trained on 100+ languages. Why it matters: ▸ A top open multilingual embedding model under 500M on https://x.com/TheTuringPost/status/1963666849364836606

Google just dropped Gemma embeddings! Perfect for on-device semantic search. Here’s what makes Gemma embeddings special: 🌍 100+ languages supported – truly global AI 📱 <300MB RAM usage with QAT – fits on edge devices 📊 8k token https://x.com/weaviate_io/status/1963683200368304613

Google just launched EmbeddingGemma: an efficient, multilingual 308M embedding model that’s ready for semantic search & more on just about any hardware, CPU included. Details in 🧵: https://x.com/tomaarsen/status/1963639557653422304

Introducing EmbeddingGemma, our new open embedding model for on-device AI applications. – Highest ranking open model under 500M on the MTEB benchmark. – Runs on less than 200MB of RAM with quantization. – Dynamic output dimensions from 768 down to 128. – Input context length of https://x.com/_philschmid/status/1963634786636841461
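The "dynamic output dimensions from 768 down to 128" refers to Matryoshka-style embeddings, where a prefix of the full vector is itself a usable embedding. A minimal numpy sketch of the usual convention (truncate the leading dimensions, then re-normalize; the random vector here is just a stand-in for a real model output):

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length,
    the standard way Matryoshka embeddings are shrunk (768 -> 512/256/128)."""
    head = vec[:dim]
    return head / np.linalg.norm(head)

rng = np.random.default_rng(0)
full = rng.normal(size=768)            # stand-in for a 768-dim model embedding
small = truncate_embedding(full, 128)  # 6x less storage per vector
print(small.shape)  # (128,)
```

The trade-off is a modest drop in retrieval quality for a large cut in index size and search cost.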

Introducing EmbeddingGemma: The Best-in-Class Open Model for On-Device Embeddings – Google Developers Blog https://developers.googleblog.com/en/introducing-embeddinggemma/

We’re excited to be a day 0 partner for EmbeddingGemma, Google’s new open-source embedding model! You can deploy it directly from our model library – our engineers are continually rolling out additional performance optimizations. https://x.com/basetenco/status/1963724754315284720

You can now run 100B parameter models on your local CPU without GPUs.
Microsoft finally open-sourced their 1-bit LLM inference framework called bitnet.cpp:
https://x.com/LiorOnAI/status/1963316578612605327
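bitnet.cpp targets BitNet-style models whose weights are constrained to {-1, 0, +1} (about 1.58 bits each), so matrix multiplies reduce to additions and subtractions. Below is a hedged numpy sketch of the absmean ternary quantizer described in the BitNet b1.58 paper, not bitnet.cpp's actual kernels:

```python
import numpy as np

def ternary_quantize(w: np.ndarray, eps: float = 1e-8):
    """Absmean quantization: scale by mean |w|, round, clip to {-1, 0, +1}.
    Returns the ternary matrix plus the per-tensor scale for dequantization."""
    scale = np.mean(np.abs(w)) + eps
    w_q = np.clip(np.round(w / scale), -1, 1)
    return w_q, scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(4, 8))  # toy weight matrix
w_q, scale = ternary_quantize(w)
assert set(np.unique(w_q)) <= {-1.0, 0.0, 1.0}
# y = (x @ w_q) * scale approximates x @ w, with only adds/subtracts
# inside the matmul -- which is why CPU-only inference becomes feasible.
```

Models must be trained (or fine-tuned) with this constraint; naively quantizing an existing fp16 model this way degrades quality badly.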

A 14B model just beat a 671B model on math reasoning. Here’s how Microsoft’s rStar2-Agent achieves frontier math performance in 1 week of RL training
https://x.com/FrankYouChill/status/1962180218053144655

Meet Google’s new best small embedding model, EmbeddingGemma. It’s a 300M embedding model made for retrieval augmented generation (RAG) use cases. ollama pull embeddinggemma 🧵 https://x.com/ollama/status/1963667967184617703
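For the RAG use case the tweet mentions, retrieval is just nearest-neighbor search over document embeddings. A toy numpy sketch with random unit vectors standing in for EmbeddingGemma outputs (in practice you would get the vectors from ollama's embedding API or sentence-transformers):

```python
import numpy as np

def top_k(query: np.ndarray, docs: np.ndarray, k: int = 2) -> np.ndarray:
    """Return indices of the k documents most cosine-similar to the query.
    Assumes `query` and the rows of `docs` are unit-normalized embeddings."""
    scores = docs @ query             # dot product == cosine for unit vectors
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(0)
docs = rng.normal(size=(5, 128))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = docs[3] + 0.02 * rng.normal(size=128)  # query close to doc 3
query /= np.linalg.norm(query)
print(top_k(query, docs))  # doc 3 should rank first
```

The retrieved passages are then stuffed into the LLM prompt; for large corpora you would swap the brute-force scan for an ANN index.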

ollama-style CLI for running MLX models on Apple Silicon https://x.com/tom_doerr/status/1961309536406392877

Introducing ChromaSwift – in beta! Build search and retrieval into your iOS apps – Includes on-device persistence – Packaged with on-device MLX embedding inference https://x.com/trychroma/status/1962917927382122857

Introducing EmbeddingGemma🎉 🔥With only 308M params, this is the top open model under 500M 🌏Trained on 100+ languages 🪆Flexible embeddings (768 to 128 dims) with Matryoshka 🤗Works with your favorite open tools 🤏Runs with as little as 200MB https://x.com/osanseviero/status/1963635281032040914

Our most compact LLM from the Hermes 4 series is locally usable and optimized for consumer hardware, providing at-home access to its powerful hybrid reasoning and tool calling.
https://x.com/NousResearch/status/1963349882837897535

GPT-4o level intelligence running on your phone! MiniCPM-V 4.5 delivers enterprise-grade AI performance in just 8B parameters, outperforming models like GPT-4o and Gemini-2.0 Pro on vision and language tasks. – 30+ language support – Runs smoothly on iPhone/iPad – 100% open-source! https://x.com/akshay_pachaar/status/1962132670126981459

❤️ Thanks to @mervenoyann & @huggingface , MiniCPM-V 4.5 is officially live on Hugging Face Spaces. Come check it out! https://x.com/OpenBMB/status/1963623940028563910

Towards High Data Efficiency in Reinforcement Learning with Verifiable Reward. “We propose DEPO, a Data-Efficient Policy Optimization pipeline that combines optimized strategies for both offline and online data selection. In the offline phase, we curate a high-quality subset of https://x.com/iScienceLuvr/status/1963169113007895020

Best small vision LM with reasoning has dropped on @huggingface 🔥 Tencent dropped R-4B, a small vision LM that claims SOTA, with an Apache 2.0 license 💗 The model enables different thinking options and transformers support through custom code! https://x.com/mervenoyann/status/1962917635932229797


Discover more from Ethan B. Holland
