Ethan B. Holland

Over 54,900 manually organized AI links and counting

Audio: AI News Week Ending 07/25/2025

July 25, 2025

Image created with OpenAI GPT-Image-1. Image prompt: over-the-top 1990s pro-wrestling promo poster, ring-side DJ booth featuring “Audio Assault” spinning vinyl before leaping with a mic-cord whip; booming speakers, grainy print texture, vivid neon titles

NVIDIA’s Canary-Qwen-2.5B takes 1st place on the @HuggingFace leaderboard for automatic speech recognition – lowest word error rate (WER) ever recorded on the Hugging Face OpenASR leaderboard: 5.63%. – It’s the first speech model built on top of an existing LLM. – At its core, it https://x.com/rohanpaul_ai/status/1946823138932863210

AudioRAG is becoming real! Just built a demo with ColQwen-Omni that does semantic search on raw audio, no transcription needed. Drop in a podcast, ask your question, and it finds the exact chunks where it happens. You can also get a written answer. What’s exciting: it skips https://x.com/fdaudens/status/1946226098905169967

2024: Voice Cloning. 2025: What about personality cloning? Hume’s voice AI can now not only mimic your voice but also speaking style and language. It’s now available via our TTS and new speech-to-speech model, EVI 3, which is also launching today. https://x.com/hume_ai/status/1945900611334979712

NEW: Higgs Audio V2 from @boson_ai: an open, unified TTS model w/ voice cloning that beats GPT-4o-mini-tts and ElevenLabs v2 🔥 > Trained on 10M hours (speech, music, events) > Built on top of Llama 3.2 3B > Works real-time and on edge > Beats GPT-4o-mini-tts, ElevenLabs v2 in prosody https://x.com/reach_vb/status/1947997596456272203

Dubbing for Everyone: Data-Efficient Visual Dubbing using Neural Rendering Priors https://dubbingforeveryone.github.io/

It feels significant that we crossed a line for AI music where you can continuously create entirely new AI songs in less time than it takes to listen to an AI song. Especially now that AI music has reached a point where some people enjoy it. And I assume that number will grow. https://x.com/emollick/status/1945696499012088016

It has not ceased to be weird that I can put Rilke’s First Elegy into Suno and get out a coherent 8 minute performance with music. You might not like the interpretation, but it is genuinely amazing that this audio, with apparent emotion, is all 100% AI from the verses alone. https://x.com/emollick/status/1947179948420088065

Shrek inspired, multi-person generation (with voice cloning) – this is possible now with a *single* TTS model! https://x.com/reach_vb/status/1948012058630303857

It’s been fun collaborating with @superwhisperapp, a blazing fast AI speech-to-text app, with support for local and cloud models. By putting @vercel CDN in front of their model API, they’re seeing up to 350ms gains in some geos 🤯 To be clear, this is 350ms+ faster by *just* https://x.com/rauchg/status/1947019072908230674