Ethan B. Holland

Over 54,900 manually organized AI links and counting

a realistic baroque oil painting in an ornate gilded frame, depicting a deconstructed scene with a variety of image and audio tools, a small, elegant nameplate at the bottom of the fame is engraved with the title: "Multimodal" --chaos 20 --ar 4:3 --style raw --personalize t9u6ckr --v 6.1

Multimodality News: Week Ending 07/12/2024

July 12, 2024

French startup Kyutai just introduced Moshi, an open-sourced ‘real-time’ AI voice assistant.

It's capable of responding to a range of emotions and styles in a similar fashion to OpenAI’s Voice Mode pic.twitter.com/1m5FmPAWvV
— Brett Adcock (@adcock_brett) July 7, 2024

SenseTime also revealed SenseNova 5o, a real-time multimodal model capable of processing audio, text, image, and video.

Here's a video of a live demonstration of SenseTime 5o in action (it's incredibly similar to the GPT-4o demo) pic.twitter.com/aVF8UDQyxK
— Rowan Cheung (@rowancheung) July 8, 2024

🚨 Chinese AI company SenseTime just revealed SenseNova 5.5, an AI model that claims to beat GPT-4o across key metrics

Plus, big developments from Apple, YouTube, KLING, Neuralink, and Google DeepMind.

Here's everything going on in AI right now:
— Rowan Cheung (@rowancheung) July 8, 2024

At the World Artificial Intelligence Conference (WAIC) in Shanghai this weekend, SenseTime unveiled SenseNova 5.5.

The company claims the model outperforms GPT-4o in 5 out of 8 key metrics.

While I'd take it with a grain of salt, China's AI startups are showing major progress pic.twitter.com/1ZFbojHs3v
— Rowan Cheung (@rowancheung) July 8, 2024

Vision language models can see
We introduce a new benchmark named Avocado360, and evaluate four arbitrarily selected VLMs on this benchmark. We show, for the first time ever, that VLMs can determine whether or not a given image contains an avocado. https://t.co/WVQJ2JXjUB
— vik (@vikhyatk) July 11, 2024

Deploying ML for Voice Safety

https://corp.roblox.com/newsroom/2024/07/deploying-ml-for-voice-safety

OpenAI

Introducing Whisper Timestamped: Multilingual speech recognition with word-level timestamps, running 100% locally in your browser thanks to 🤗 Transformers.js!

This unlocks a world of possibilities for in-browser video editing! 🤯 What will you build? 😍

Demo (+ source code) 👇 pic.twitter.com/PIqtcgk17Q
— Xenova (@xenovacom) July 10, 2024

Game-changer alert: Navigate your video by clicking transcribed words with Whisper Timestamped! 🚀

Key features:
– Multilingual transcription (35+ languages)
– Click any word to jump to that moment in the video
– Works with audio & video files
– 100% browser-based for total… pic.twitter.com/A9jkhSgyeb
— Florent Daudens (@fdaudens) July 10, 2024
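For anyone curious how "click any word to jump to that moment" can work, here is a minimal sketch, not the demo's actual source. It assumes each transcribed word arrives with a `[start, end]` timestamp in seconds; the `WordChunk` shape and both function names are hypothetical and chosen only for illustration.

```typescript
// Hypothetical shape for one transcribed word (an assumption, not the demo's type).
interface WordChunk {
  text: string;
  timestamp: [number, number]; // [start, end] in seconds
}

// Clicking a word should seek playback to the moment that word starts.
function seekTimeForWord(words: WordChunk[], index: number): number {
  if (index < 0 || index >= words.length) {
    throw new RangeError(`word index ${index} is out of range`);
  }
  return words[index].timestamp[0];
}

// Reverse lookup for highlighting: which word is being spoken at `time`?
// Assumes `words` is sorted by start time. Returns -1 during silence.
function wordIndexAtTime(words: WordChunk[], time: number): number {
  let lo = 0;
  let hi = words.length - 1;
  let found = -1;
  // Binary search for the last word whose start time is <= time.
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (words[mid].timestamp[0] <= time) {
      found = mid;
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  if (found === -1) return -1;
  // Only count it as a hit if the word has not already ended.
  return time < words[found].timestamp[1] ? found : -1;
}
```

In a browser, a click handler could then set `video.currentTime = seekTimeForWord(words, i)` on the page's video element to jump to the clicked word.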

Segmentation

ConceptExpress: Unsupervised Concept Extraction (UCE): We focus on the unsupervised problem of extracting multiple concepts from a single image. Given an image that contains multiple concepts, we aim to harness a frozen pretrained diffusion model to automatically learn the conceptual tokens. Using the learned conceptual tokens, we can regenerate the extracted concepts with high quality.

https://haoosz.github.io/ConceptExpress

Researchers leverage shadows to model 3D scenes, including objects blocked from view | MIT News | Massachusetts Institute of Technology

https://news.mit.edu/2024/researchers-leverage-shadows-model-3d-scenes-blocked-objects-0618

Cool demo by @mervenoyann for real-time object tracking with RT-DETR: https://twitter.com/fdaudens/status/1811029049638011000

Be Sure To Read This Week’s Main Post:

This week’s executive overview and top links are here:

AI News #41: Week Ending 07/12/2024 with Executive Summary and Top 58 Links

The post you just read is a deep-dive extension of my weekly newsletter, This Week In AI, an executive summary of the top things to know in AI. Each week, I create an accessible overview so that laypeople can feel confident they are conversant with the week’s AI developments. I include a curated list of the week’s must-click links, giving everyone a hands-on opportunity to explore the most intriguing updates in artificial intelligence across categories including robotics, imagery, video, AR/VR, science, ethics, and more. Beyond the overview, I post these topic-based deeper dives (below). If you haven’t read this week’s overview, I recommend starting there.

Credits/Sources

Most of these weekly links come from just a few prolific oversharing sources. Please follow them, as they work hard to find the news each week and they make it a lot easier for me to compile.