Locally Run: AI News Week Ending 04/03/2026

Image created with gemini-3.1-flash-image-preview with claude-sonnet-4-5. Image prompt: Using the provided reference image, preserve the exact square faceted perfume bottle with amber-gold liquid, crystal stopper, pure white background, soft shadow, and high-key studio lighting. Replace the label text with ‘Local’ in the same black serif font style. Add a delicate sterling silver chain draped naturally around the bottle neck with a small dainty pendant: a miniature cluster of three connected houses forming a tiny neighborhood icon in refined jewelry aesthetic, small and precise like a Tiffany charm.

🚀 Imagine running Claude 4.6 Opus-level reasoning… but entirely on your own GPU with just 16GB VRAM. This 27B Qwen3.5 variant, distilled on Claude 4.6 Opus reasoning traces, delivers frontier coding power locally. It’s beating Claude Sonnet 4.5 on SWE-bench in 4-bit
https://x.com/outsource_/status/2038999111039357302

This model has been #1 trending for 3 weeks now. It’s Qwen3.5-27B fine-tuned on distilled data from Claude-4.6-Opus (reasoning). Trained via Unsloth. Runs locally on 16GB in 4-bit or 32GB in 8-bit. Model:
https://x.com/UnslothAI/status/2038625148354679270

Very bullish on open source and local models Imagine running near-Opus-level model locally on that $600, 16GB Mac Mini you bought last month This 27B Qwen3.5 distill was trained on Claude 4.6 Opus reasoning traces and is putting up real numbers: – beats Claude Sonnet 4.5 on
https://x.com/TheCraigHewitt/status/2039303217620627604

Local Claude Code builds have been achieved internally
https://x.com/theo/status/2039079267905261831

. @googlegemma have open sourced the perfect model for local open source agents. Gemma 4 comes in all the sizes we need for mobile, local, and code. This is how I’ll be switching my @thdxr opencode agent over. Let’s go local agents.
https://x.com/ben_burtenshaw/status/2039740590091362749

🎉 Gemma 4 is officially available on vLLM! Byte-for-byte, these are the most capable open models for advanced reasoning and agentic workflows. Key features include: – Native Multimodal Support: Full vision and audio capabilities with up to a 256K context window. – Broad
https://x.com/vllm_project/status/2039762998563418385

A 12-month time difference between Gemma 3 27b and Gemma 4 31b. The jump is absolutely enormous. Just look at the evaluations between the two models. GPQA doubled, AIME 2026 went from ~20% to ~90%, and so on. Crazy.
https://x.com/kimmonismus/status/2039759264680747219?s=20

A Visual Guide to Gemma 4 With almost 40 (!) custom visuals, explore the new models from Google DeepMind. We explore various techniques, ranging from Mixture of Experts and the Vision Encoder all the way up to Per-Layer Embeddings and the Audio Encoder. Link below 👇
https://x.com/MaartenGr/status/2040099556948390075

Gemma 4 — Google DeepMind
https://deepmind.google/models/gemma/gemma-4/

Gemma 4 31B (Reasoning) is very token efficient, using ~1.2M tokens on the GPQA Diamond evaluation, fewer than peers models such as Qwen3.5 27B (~1.5M) and Qwen3.5 35B A3B (~1.6M)
https://x.com/ArtificialAnlys/status/2039752015811866652

Gemma 4 31B running with TurboQuant KV cache on MLX 🔥 128K context: → KV Memory: 13.3 GB → 4.9 GB (63% reduction) → Peak Memory: 75.2 GB → 65.8 GB (-9.4 GB) → Quality preserved TurboQuant compression scales with sequence length, so the longer the context, the bigger the
https://x.com/Prince_Canuma/status/2039840313074753896

Gemma 4 outperforms models over 10x their size! (note the x-axis is log scale!)
https://x.com/demishassabis/status/2040067244349063326

Gemma 4: Our most capable open models to date
https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/

Gemma-4-31B is now live in Text Arena – ranking #3 among open models (#27 overall), matching much larger models at 10× smaller scale! A significant jump from Gemma-3-27B (+87 pts). Highlights: – #3 open (#27 overall), on par with the best open models Kimi-K2.5, Qwen-3.5-397b –
https://x.com/arena/status/2039739427715735645

Getting Started with Gemma 4 in AI Studio
https://x.com/GoogleAIStudio/status/2040090067709075732

Google just open-sourced Gemma 4. Unprecedented performance for advanced reasoning and agentic workflows, and big leap in efficiency on a parameter basis. Use it now in KerasHub. I recommend the JAX backend – best performance!
https://x.com/fchollet/status/2039845249334510016

Google just re-entered the game 🔥🔥 They want to take the crown 👑 back from Chinese open source AI. And… Gemma 4 is FINALLY Apache 2.0 aka real-open-source-licensed. From what I’ve seen it’s going to be a pretty significant model. But give it a try yourself today: brew
https://x.com/ClementDelangue/status/2039941213244072173

got Gemma 4 up and running at 34 tokens per second this is the 26B-A4B model, running on my mac mini m4 with 16GB ram next time i hit my claude session limits i’ll have this fast free local AI as a backup :]
https://x.com/measure_plan/status/2040069272613834847

Got Gemma-4-26B-A4 MoE running on iPhone w/Flash SSD in Swift MLX. Still pretty slow, I expect 10+ t/s once optimized properly for Swift.
https://x.com/anemll/status/2040126326708031969

Introducing a Visual Guide to Gemma 4 👀 An in-depth, architectural deep dive of the Gemma 4 family of models. From Per-Layer Embeddings to the vision and audio encoders. Take a look!
https://x.com/osanseviero/status/2040105484061954349

Let’s look at how the open model Gemma has progressed across its last three versions. – Gemma 4 ranks 100 places above Gemma 3 – Gemma 3 ranks 87 above Gemma 2 All three models from @GoogleDeepMind are roughly the same size (31B, 27B, 27B), and these gains came only 9 and 13
https://x.com/arena/status/2039848959301361716

Lets go: Running a full AI assistant locally on a MacBook Air M4 with 16GB, completely free, open source, no API keys needed. Atomic Bot makes it really simple: install, pick Gemma 4, and you have an always-on AI agent running on your machine. No cloud. No subscription. No data
https://x.com/kimmonismus/status/2039989730901623049

Meet Gemma 4: our new family of open models you can run on your own hardware. Built for advanced reasoning and agentic workflows, we’re releasing them under an Apache 2.0 license. Here’s what’s new 🧵
https://x.com/GoogleDeepMind/status/2039735446628925907

NEW: Google releases Gemma 4, their most capable open models yet! 🤯 Apache-2.0, multimodal (text, image, and audio input), and multilingual (140 languages)! They can even run 100% locally in your browser on WebGPU. Watch it describe the Artemis II launch! 🚀 Try the demo! 👇
https://x.com/xenovacom/status/2039741226337935430

To explain why I consider Gemma 4 a bigger release than most people realize. This is a big deal because models like Gemma 4 E4B can run directly on devices, bringing powerful AI (even a 2B model ~60% on MMLU Pro) to phones, laptops, and edge systems without relying on the cloud,
https://x.com/kimmonismus/status/2039978863644537048

Today, we’re launching Gemma 4, our most intelligent open models to date. Built with the same breakthrough technology as Gemini 3, Gemma 4 brings advanced reasoning to your personal hardware and devices. Here’s what Gemma 4 unlocks for developers: — Intelligence-per-parameter:
https://x.com/GoogleAI/status/2039735543068504476

We just released Gemma 4 — our most intelligent open models to date. Built from the same world-class research as Gemini 3, Gemma 4 brings breakthrough intelligence directly to your own hardware for advanced reasoning and agentic workflows. Released under a commercially
https://x.com/Google/status/2039736220834480233

You can run Gemma 4 100% locally in your browser thanks to HF transformers.js. That means 100% private and 100% free! @xenovacom created a demo for it here:
https://x.com/ClementDelangue/status/2039782910996148508

run OpenClaw, Hermes Agent and Pi with Gemma 4 with few lines of change 🔥
https://x.com/mervenoyann/status/2039788257815261400

So happy to see Google release Gemma 4 today in apache 2.0 that gives you frontier capabilities locally. You can use it right away in all your favorite open agent platforms like openclaw, opencode, pi, Hermes by asking it to change your model to local gemma 4 with
https://x.com/ClementDelangue/status/2039740419899056152

One way to see the advancement of AI is to see how much further you can get with new models on the same hardware Here is “”an otter using a laptop on an airplane”” generated on my home computer using the open weights Wan 2.1, first try. We have come pretty far in 18 months.
https://x.com/emollick/status/2037616578787713194

.@GoogleDeepMind Gemma 4 is here with state-of-the-art models targeting edge and workstations. Requires Ollama 0.20+ that is rolling out. 4 models: 4B Effective (E4B) ollama run gemma4:e4b 2B Effective (E2B) ollama run gemma4:e2b 26B (4B active MoE) ollama run gemma4:26b
https://x.com/ollama/status/2039738348647108680

.@UnslothAI supports @GoogleGemma 4 models, optimized for RTX GPUs. 🦥 Run & fine-tune locally in Unsloth Studio.
https://x.com/NVIDIA_AI_PC/status/2040096993800761579

Axolotl support for Gemma 4 is in v0.16.1 is released! Finetune @GoogleAIStudio Gemma4 26B-A4B on your own 5090 using our optimized fused MoE+LoRA kernels!
https://x.com/winglian/status/2039823559363629432

Deploy Gemma4 31B and 26B-A4B with one click on Hugging Face Inference Endpoints 🔥👇
https://x.com/ErikKaum/status/2040008281796513939

Excited to launch Gemma 4: the best open models in the world for their respective sizes. Available in 4 sizes that can be fine-tuned for your specific task: 31B dense for great raw performance, 26B MoE for low latency, and effective 2B & 4B for edge device use – happy building!
https://x.com/demishassabis/status/2039736628659269901

Flagship open-weight release days are always exciting. Was just reading through the Gemma 4 reports, configs, and code, and here are my takeaways: Architecture-wise, besides multi-model support, Gemma 4 (31B) looks pretty much unchanged compared to Gemma 3 (27B). Gemma 4
https://x.com/rasbt/status/2039780905619705902

future is local 🔥 Google DeepMind just released Gemma 4: local frontier in many sizes, all modalities with free license 🤯 we ship Gemma 4 in transformers, llama.cpp, transformers.js and more for your convenience 🫡 plug-and-play with your agents 🙌🏻 read our blog ⤵️
https://x.com/mervenoyann/status/2039739097611215344

Gemma
https://x.com/OfficialLoganK/status/2039486016751366431

Gemma 4 26B MoE (4B active) on a single RTX 4090: – 162 t/s decode – 8,400 t/s prefill – Full 262K native context — 19.5 GB VRAM – Only 10 Elo below the 31B dense Q8_0 on dual 4090+3090: 9,024 t/s prefill at 10K. 2,537 t/s at full 262K — that’s a novel in about 100
https://x.com/basecampbernie/status/2039847254534852783

Gemma 4 architecture analysis thread Just as Gemma3n, this thing has a galaxybrained architecture, very much not a standard transformer
https://x.com/norpadon/status/2039740827975500251

Gemma 4 by @GoogleDeepMind debuts at 3rd and 6th on the open source leaderboard, making it the #1 ranked US open source model. By total parameter count, Gemma 4 31B is 24× smaller than GLM-5 and 34× smaller than Kimi-K2.5-Thinking, delivering comparable performance at a
https://x.com/arena/status/2039782449648214247

Gemma 4 is here! The best open-source model you can run on your machine. Day-0 support in a llama.cpp. Check it out!
https://x.com/ggerganov/status/2039744468899811419

Gemma 4 is live on Baseten and available to all customers on day 0 via the Baseten model library. All models in the Gemma 4 family are multimodal, supporting text and image inputs with text output. Key capabilities include: -> Advanced reasoning and thinking -> Coding and
https://x.com/baseten/status/2039751071284015393

Gemma4 is amazing. You’ll read that everywhere. Let’s focus on what is HUGE here: the revenge of dense models…. Throw away your b200, not needed anymore, throw away the millions of lines of code we had to write to make MOEs faster, training stable etc… throw away your
https://x.com/art_zucker/status/2039740402517893361

Google Deep Mind’s impressive fully-open Gemma 4 is live day-zero on Modular Cloud. Modular provides the fastest performance on NVIDIA Blackwell and AMD MI355X, thanks to MAX and Mojo🔥. The team took this impressive new model to production inference in days.🚀
https://x.com/clattner_llvm/status/2039738590213910558

google gemma 4 architecture is very interesting and every model has some subtle differences, here is a recap: > per layer embedding only on the small variant > no attention scale (usually you divide qk^T by sqrt(d), they don’t) > they do QK norm + V norm as well > they share
https://x.com/eliebakouch/status/2039751171556954531

Google has released Gemma 4, a new family of multimodal open-weight models including Gemma 4 E2B, Gemma 4 E4B, Gemma 4 31B and Gemma 4 26B A4B @GoogleDeepMind’s new Gemma 4 family introduces four multimodal models supporting text, image, and video inputs. We evaluated Gemma 4
https://x.com/ArtificialAnlys/status/2039752013249212600

Google releases Gemma 4. ✨ Gemma 4 introduces 4 models: E2B, E4B, 26B-A4B, 31B. The multimodal reasoning models are under Apache 2.0. Run E2B and E4B on ~6GB RAM, and on phones. Run 26B-A4B and 31B on ~18GB. GGUFs:
https://t.co/fpX21yWbge Guide:
https://x.com/UnslothAI/status/2039739190536286313

I have to give credit to Google for Apache 2.0 on Gemma 4! This is huge!
https://x.com/QuixiAI/status/2039862230452252926

Intel is partnering with @GoogleAI to deliver fully functional #Gemma4 models on Intel hardware from day zero–across Intel Xeon CPUs, Intel Xe GPUs, and Intel Core Ultra processors, with support across open frameworks including @vllm_project and @huggingface. This means
https://x.com/intelnews/status/2040106767258906707

Just do this: brew install llama.cpp –HEAD Then; llama-server -hf ggml-org/gemma-4-26B-A4B-it-GGUF:Q4_K_M
https://x.com/julien_c/status/2039746054355067002

Let me demonstrate the true power of llama.cpp: – Running on Mac Studio M2 Ultra (3 years old) – Gemma 4 26B A4B Q8_0 (full quality) – Built-in WebUI (ships with llama.cpp) – MCP support out of the box (web-search, HF, github, etc.) – Prompt speculative decoding The result:
https://x.com/ggerganov/status/2039752638384709661

Say hello to Gemma 4 from @GoogleDeepMind 🚀🔥 💎 Comes in 4 sizes: E2B, E4B, 26B A4B, 31B 💎 Supports vision and reasoning 💎 Apache 2.0 💎 Available now in LM Studio
https://x.com/lmstudio/status/2039738625525502426

Son lead the development on HF/llama.cpp side for adding support for the new Gemma 4 models. As always, he did an outstanding job throughout the collaboration with the Google DeepMind team. Day-0 support is possible thanks to his hard work!
https://x.com/ggerganov/status/2039943099284140286

Thanks for following us! We’re excited to see what you all build with Gemma 4! In case you missed it, you can find all our checkpoints, with an Apache 2.0 License, on Hugging Face:
https://x.com/googlegemma/status/2040107948010242075

thinking about google’s gemma 4 and what it means a few months ago running something this capable locally meant serious hardware and serious tradeoffs on quality now it runs on your laptop, works offline on your phone (!!!), speaks 140 languages natively, 256k context window,
https://x.com/gregisenberg/status/2039853864082424198

Today we’re releasing Gemma 4, our new family of open foundation models, built on the same research and technology as our Gemini 3 series. These models set a new standard for open intelligence, offering SOTA reasoning capabilities from edge-scale (2B and 4B w/ vision/audio) up
https://x.com/JeffDean/status/2039748604232122707

Two years ago, we released Gemma, Google DeepMind family of open models. Today, I’m thrilled to share a new milestone: Gemma 400M downloads and 100,000 variants! Thank you to every developer, partner, and contributor. We can’t wait to see what you build next!👀
https://x.com/osanseviero/status/2039120000095547722

What you need to know about @googlegemma 4: 4️⃣ 4 sizes (E2B, E4B, 26B4A, 31B) 🪟 Up to 256K context window 🛠️ Native function-calling, structured JSON output 👁️ + audio on edge models (E2B/E4B) 🌍 Trained on 140+ languages 🏆 31B ranks #3 open model on Arena AI 🪪 Apache 2.0
https://x.com/_philschmid/status/2039736207676965264

Yowza! @ollama is on it with new Gemma 4 models
https://x.com/MichaelGannotti/status/2039903041642508541

Gemma 4 31B shifts the Pareto frontier, scoring +30 Arena points above similarly priced models like DeepSeek 3.2. Its position on the Pareto frontier is based on early pricing indicators from third parties.
https://x.com/arena/status/2040128319719670101

impressive, very nice. now let’s compare a 31b dense to a 31b active 670b total instead. flop for flop
https://x.com/stochasticchasm/status/2039912148676264334

MoE models differ from the likes of DeepSeek and Qwen: instead of using shared experts in parallel to the routed ones, Gemma adds MoE blocks as separate layers in addition to the normal MLP blocks. So the architecture is Attention -> MLP -> MoE
https://x.com/norpadon/status/2039750841754697767

Nemotron Super / Ultra Arcee Trinity Large (soon) Gemma 4 (eventually) Reflection’s first models (maybe) GPT OSS 2? (maybe) Thinky? Other neolabs? Things looking up for open models built in the US in 2026. We had 0 for a bit there.
https://x.com/natolambert/status/2039499358325129530

DreamLite A Lightweight On-Device Unified Model for Image Generation and Editing paper:
https://x.com/_akhaliq/status/2039011853460819999

1-bit Bonsai 8B running locally on an M4 Pro (MLX) alongside a standard 16-bit 8B model. Same class of model, very different deployment profile: far lower memory use and substantially higher throughput.
https://x.com/PrismML/status/2039049404209148007

A new addition to Claw-style agents AutoClaw – a local-first agent runner from
https://t.co/3QuHijMYPx promising full autonomy – No API keys, no cloud dependency – No data leaving your machine – Runs custom models + GLM-5-Turbo (tool-optimized) – Start tasks directly from a
https://x.com/TheTuringPost/status/2038900836794081287

Cohere transcribe running locally in the browser!
https://x.com/nickfrosst/status/2037680223445975131#m

Demo of 1-bit Bonsai 8B from @PrismML running on-device on iPhone 17 Pro More than 40tk/s for a dense 8B model on iPhone, that’s a first Powered by Apple MLX and available now in Locally AI
https://x.com/adrgrondin/status/2039066539022778613

llama.cpp at 100k stars now that 90% of the code worldwide is being written by AI agents, I predict that within 3-6 months, 90% of all AI agents will be running locally with llama.cpp 😄 Jokes aside, I am going to use this small milestone as an opportunity to reflect a bit on
https://x.com/ggerganov/status/2038632534414680223

My self-sovereign / local / private / secure LLM setup, April 2026
https://vitalik.eth.limo/general/2026/04/02/secure_llms.html