[Header image: photorealistic hot-air balloon launch at dawn in a crimson, ivory, and navy palette, with a town-map overlay and a blue location pin; generated with OpenAI GPT-Image-1.]
if you block AI from accessing your content, there will be no one left to read it. to elaborate: the future of search is lightweight research agents. a lot of businesses are already seeing significant traffic come from ChatGPT. if you block AI scrapers, i will never read your content because o4-mini-high will send me to your competitor’s website. https://x.com/vikhyatk/status/1940227029389255109
The race for LLM “cognitive core” – a few billion param model that maximally sacrifices encyclopedic knowledge for capability. It lives always-on and by default on every computer as the kernel of LLM personal computing.
Its features are slowly crystallizing:
– Natively multimodal text/vision/audio at both input and output.
– Matryoshka-style architecture allowing a dial of capability up and down at test time.
– Reasoning, also with a dial. (system 2)
– Aggressively tool-using.
– On-device finetuning LoRA slots for test-time training, personalization and customization.
– Delegates and double checks just the right parts with the oracles in the cloud if internet is available.
It doesn’t know that William the Conqueror’s reign ended on September 9, 1087, but it vaguely recognizes the name and can look up the date. It can’t recite the SHA-256 of the empty string as e3b0c442…, but it can calculate it quickly should you really want it.
What LLM personal computing lacks in broad world knowledge and top tier problem-solving capability it will make up in super low interaction latency (especially as multimodal matures), direct / private access to data and state, offline continuity, sovereignty (“not your weights not your brain”). i.e. many of the same reasons we like, use and buy personal computers instead of having thin clients access a cloud via remote desktop or so. https://x.com/karpathy/status/1938626382248149433
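The SHA-256 example above is easy to make concrete: the digest of the empty string is a fixed constant that a small model needn’t memorize, since one stdlib tool call reproduces it. A minimal Python check:

```python
import hashlib

# SHA-256 of the empty string: memorizing this digest is wasted capacity
# for a small "cognitive core" model, but a trivial tool call recomputes it.
digest = hashlib.sha256(b"").hexdigest()
print(digest)  # e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
```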
Apple just released a Sage Mixtral 8x7b fine-tune w/ Apache license on the hub 👀 Uses State-Action Chains (SAC) to enhance dialogue generation by incorporating latent variables for emotional states and conversational strategies. Key comparisons: SAC vs. standard LM… https://x.com/reach_vb/status/1939970610702028899
This is what efficient AI looks like: Gemma 3n just dropped – a natively multimodal model that runs entirely on your device. No cloud. No API calls. 🧠 Text, image, audio, and video – handled locally. ⚡️Only needs 2B in GPU memory to run 🤯 First sub-10B model to hit 1300+ Elo https://x.com/fdaudens/status/1938304519344992493
Tencent released Hunyuan-A13B, a new open-source hybrid reasoning model It nears or matches models like o1 and DeepSeek R1 on major benchmarks, while remaining efficient enough to run on a single GPU Also includes “fast and slow” modes to adjust efficiency levels https://x.com/rowancheung/status/1939601169271197973
Gemma 3N quirks! 1. Vision NaNs on float16 2. Conv2D weights are large; FP16 overflows to infinity 3. Large activations fixed vs Gemma 3 4. 6-7 training losses: normal for multimodal? 5. Large nums in msfa_ffn_pw_proj 6. NaNs fixed in @UnslothAI Details: https://x.com/danielhanchen/status/1940073369648734571
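Quirk 2 (large Conv2D weights overflowing FP16) is the classic half-precision failure mode and is easy to reproduce. A minimal NumPy sketch (toy values, not Gemma’s actual weights):

```python
import numpy as np

# float16 tops out at 65504; any larger magnitude overflows to infinity.
print(np.finfo(np.float16).max)  # 65504.0

big = np.float16(70000.0)        # overflows on conversion
print(big)                       # inf

# An inf meeting a zero (e.g. a masked activation) then yields NaN,
# which is how one oversized weight poisons a whole forward pass.
print(big * np.float16(0.0))     # nan
```

This is why such layers are typically kept in float32 or bfloat16, whose wider exponent range avoids the overflow.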
Local MCP servers can now be installed with one click on Claude Desktop. Desktop Extensions (.dxt files) package your server, handle dependencies, and provide secure configuration. https://x.com/AnthropicAI/status/1938272883618312670
In preparation for OpenAI’s upcoming open-source model, I’m building the world’s best local agent It seamlessly integrates with my OS, auto accesses my clipboard, is Finder-aware, creates/reads files, searches the web, and updates text in any app. Local Jarvis. https://x.com/skirano/status/1940818055703208036
Okay Wispr Flow is actually amazing. Order of magnitude faster to write & review docs vs. using something like superwhisper on mac. Is there anything local that comes close to the user experience of Flow? Because rn I think the tradeoff of cloud processing seems totally worth it. https://x.com/bilawalsidhu/status/1940550340144775251
LeoAM shows that long-context LLM inference can run on a single consumer GPU by keeping only the key-value chunks that matter in GPU memory, while representing the rest with lightweight summaries on disk and streaming them through a three-tier GPU-CPU-disk pipeline. https://x.com/rohanpaul_ai/status/1940335638714441872
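The core selection idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not LeoAM’s actual code: the chunk size, the mean-key summary, and the dot-product scoring are all placeholders for whatever the paper really uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: 8 cached KV chunks of 16 key vectors each (dim 64).
chunks = [rng.normal(size=(16, 64)).astype(np.float32) for _ in range(8)]

# Lightweight per-chunk summary (here: the mean key vector). In a
# LeoAM-style design only these summaries stay resident for evicted
# chunks; the full chunks live on CPU/disk and stream in on demand.
summaries = np.stack([c.mean(axis=0) for c in chunks])

def select_chunks(query, budget=2):
    """Score each summary against the query and promote the top-`budget`
    full chunks to (simulated) GPU memory; the rest stay summarized."""
    scores = summaries @ query
    keep = np.argsort(scores)[::-1][:budget]
    return sorted(keep.tolist())

query = rng.normal(size=64).astype(np.float32)
print(select_chunks(query))  # indices of the chunks promoted to GPU
```

The budget parameter is what caps GPU memory: attention only ever sees `budget` full chunks at once, regardless of total context length.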
Want to learn about the research behind Gemma 3n? Altup – https://x.com/osanseviero/status/1940127957730959494
FLUX.1 Kontext runs on your laptop with MFLUX + MLX: https://x.com/awnihannun/status/1938947706350903401
Introducing Mistral Small 3.2, a small update to Mistral Small 3.1 to improve: – Instruction following: Small 3.2 is better at following precise instructions – Repetition errors: Small 3.2 produces fewer infinite generations or repetitive answers – Function calling: … https://x.com/MistralAI/status/1936093325116781016




