Ethan B. Holland

Over 54,900 manually organized AI links and counting

Audio: AI News Week Ending 05/09/2025

May 9, 2025

Image created with GPT Image 1. Image prompt: Wearing a zoot suit made of speaker mesh and velvet headphones coiled like a crown, a suave Black jazz dandy moonwalks down a glowing vinyl runway; each step sends ripples through the neon-lit club — captured on a long exposure echoing the pulse of Black sonic invention and AI audio remix culture.

Nvidia released Parakeet V2, a new open-source automatic speech recognition (ASR) model. It transcribes an hour of audio in about a second, tops the Open ASR Leaderboard ahead of ElevenLabs’ Scribe and OpenAI’s Whisper, posts a 6.05% Word Error Rate, and is available under the CC-BY-4.0 license. https://x.com/rowancheung/status/1919656472574615857

HeyGen dropped Avatar IV, an AI model for expressive animations. With one photo and a voice script, its audio-to-expression engine captures tone, rhythm, and emotion to generate facial motion. It also supports different subjects, camera shots, and formats. https://x.com/rowancheung/status/1920018838462095760

Avatar IV is here and it changes everything. The most advanced avatar model we’ve ever built. Upload one photo and a script. That’s it. Our new audio to expression engine captures your tone, rhythm, and emotion, then generates facial motion so real it feels alive. And it’s https://x.com/HeyGen_Official/status/1919824467821551828

🏆 With our new Parakeet model (parakeet-tdt-0.6b-v2), we have achieved a new standard for automatic speech recognition (ASR) with an 👀 industry-best 6.05% Word Error Rate on the @HuggingFace Open-ASR-Leaderboard. 🦜 Parakeet V2 takes performance to the next level with https://x.com/NVIDIAAIDev/status/1917976429939351944
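For context on the 6.05% figure: word error rate (WER) is commonly computed as the word-level edit distance between a reference transcript and the model's output (substitutions, deletions, and insertions), divided by the number of reference words. The sketch below is a minimal illustration of that formula, not the leaderboard's evaluation code. The function name `word_error_rate` is hypothetical, and the sketch skips the text normalization (casing, punctuation) that real evaluations typically apply first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Read this way, a 6.05% WER corresponds to roughly six word-level errors per hundred reference words, averaged over the benchmark's test sets.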

Suno v4.5 released with new AI music generation features, including new genres and enhanced voices; complex, textured sounds; better prompting and adherence; and the ability to create extended 8-minute songs. https://x.com/adcock_brett/status/1919060448264987109

Google expanded NotebookLM’s podcast-generating Audio Overviews to 50+ languages, including Hindi and Spanish. It also opened AI Mode in Google Search to all Labs users in the U.S. and added new features for better shopping and planning. https://x.com/adcock_brett/status/1919060329226486125

I built a voice-enabled agent using Google’s ADK + ElevenLabs in minutes. With native MCP integration out of the box, I spun it up by using ElevenLabs’ MCP server for speech-to-text capabilities. One thing I really like is the web client – makes things feel interactive and https://x.com/reymerekar7/status/1917555533977993504

7/ @reymerekar7 built a voice-enabled agent using Google’s ADK + ElevenLabs in minutes. With native MCP integration out of the box, spun it up by using ElevenLabs’ MCP server for speech-to-text capabilities. https://x.com/AtomSilverman/status/1919824763205144725

20/ You can control your Spotify using natural language with the help of Claude using an MCP server (Model Context Protocol). 🎧 @aafimalek2032 https://x.com/AtomSilverman/status/1920168610296967474

Now I can control my Spotify using natural language with the help of Claude using an MCP server (Model Context Protocol) #MCP #AIAgent #API #Spotify #Claude Sound on 🔊 https://x.com/aafimalek2032/status/1917201305342202314

Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play https://x.com/_akhaliq/status/1919674858998202407

✨ Observe and Evaluate Multi-Modal Agents with LangSmith 🖼️ LangSmith now supports images, PDFs, and audio files across the playground, annotation queues, and datasets — making it easier than ever to build and evaluate multimodal applications. 📹 Watch how we evaluate a https://x.com/LangChainAI/status/1920207008462201054

Learn to build conversational AI voice agents in “Building AI Voice Agents for Production”, created in collaboration with @livekit and @realavatarai, and taught by @dsa (Co-founder & CEO of LiveKit), @shayneparlo (Developer Advocate, LiveKit), and @nedteneva (Head of AI at https://x.com/AndrewYNg/status/1920161212312268988

New short course ➡️ Building AI Voice Agents for Production LLMs can write and reason, but getting them to talk in real time, with low latency, and in a way that actually feels human, is a different challenge. In this course, created with @LiveKitAgent and @realavatarai, you’ll https://x.com/DeepLearningAI/status/1920153317562323095

3/ @DKundel + team published a new sample app for building multimodal agents with the @OpenAIDevs Agents SDK 🤖 how to wrap your existing agents in a VoicePipeline 🎙️ capture/play audio in a React 🔌 send the audio between Python & your frontend https://x.com/AtomSilverman/status/1919066831190470933

Voice AI Masterclass: Kwindla Hultman Kramer and swyx (YouTube) https://www.youtube.com/watch?v=AbToUiWRhn4&t=972s

Just launched on Hugging Face: ACE-Step v1-3.5B — ultra-fast, open-source music generation model! 🎵 Key features: > 4 mins of music in 20s (15× faster than LLMs) > Diffusion + compressed audio + linear transformer > Wide style/genre support with structure control > Tasks: https://x.com/Tu7uruu/status/1919748788903621048

Nvidia just open sourced Parakeet TDT 0.6B – the BEST Speech Recognition model on Open ASR Leaderboard 🔥 Can transcribe 60 minutes of audio in 1 second 🤯 600M parameters, with CC-BY-4.0 license (commercially permissive) Congrats Nvidia on the brilliant release and beating https://x.com/reach_vb/status/1919422953256587376

Forget everything you know about transcription models – NVIDIA’s parakeet-tdt-0.6b-v2 changed the game for me! – 6% word error rate (best in class) – incredible speed (RTFx: 3386.02) – cc-by-4.0 license https://x.com/fdaudens/status/1918325068020760967
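The RTFx figure and the "60 minutes of audio in 1 second" claim quoted in this roundup line up arithmetically. RTFx (inverse real-time factor) is the audio duration divided by the time taken to transcribe it, so higher is faster. The sketch below only does that division; the function name `seconds_to_transcribe` is hypothetical, and it assumes the reported RTFx holds for your hardware and batch size, which it may not outside the benchmark setup.

```python
def seconds_to_transcribe(audio_seconds: float, rtfx: float) -> float:
    """Estimate processing time from RTFx = audio duration / processing time."""
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


if __name__ == "__main__":
    one_hour = 60 * 60  # seconds of audio
    # RTFx value as quoted in the post above.
    estimate = seconds_to_transcribe(one_hour, 3386.02)
    print(f"Estimated time for one hour of audio: {estimate:.2f} s")
```

At the quoted RTFx of 3386.02, an hour of audio works out to a little over one second of processing, which is consistent with the headline claim.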

Nvidia’s state-of-the-art Parakeet ASR model has an MLX implementation! The 0.6B model is at the top of the Hugging Face ASR leader board and runs super fast locally with MLX. https://x.com/awnihannun/status/1919984733968040030

nvidia/parakeet-tdt-0.6b-v2 · Hugging Face https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2

I’m having more fun with @runwayml’s Gen-4 References than I’ve had in a while with an AI model. This evening I started with the one image and created a whole video sequence in a couple of hours. It feels like a natural way to create. Audio is a track I generated a while ago https://x.com/TomLikesRobots/status/1917711787857768558

v4.5 (Suno) just dropped for Pro & Premier subscribers 🔥 A wider range of genres, richer vocals, & enhanced prompt understanding for songs that match your vision. What’s New: 🙌 Expanded genres & smarter mashups: More genre options — Blends like midwest emo + neosoul or EDM + folk https://x.com/SunoMusic/status/1917979468699931113

Ace Studio dropped ACE-Step v1-3.5B, an ultra-fast, open-source music generation model. It can generate 4 minutes of music in 20 seconds (15× faster than LLMs), with support for several genres and structure control. https://x.com/rowancheung/status/1920018927670685914

ace-step/ACE-Step: ACE-Step: A Step Towards Music Generation Foundation Model https://github.com/ace-step/ACE-Step