Ethan B. Holland

Over 56,100 manually organized AI links and counting

A fashion photoshoot of a runway look inspired by theater. A large screen displays the word "Multimodal" --ar 4:3 --style raw

Multimodality News: Week Ending 06/07/2024

June 7, 2024

Twelve Labs Earns $50 Million Series A Co-led by NEA and NVIDIA’s NVentures to Build the Future of Multimodal AI

https://www.prweb.com/releases/twelve-labs-earns-50-million-series-a-co-led-by-nea-and-nvidias-nventures-to-build-the-future-of-multimodal-ai-302163279.html

ShareGPT4Video

Improving Video Understanding and Generation with Better Captions

We present the ShareGPT4Video series, aiming to facilitate the video understanding of large video-language models (LVLMs) and the video generation of text-to-video models (T2VMs) pic.twitter.com/EAe9SKXeW8
— AK (@_akhaliq) June 7, 2024

The No Language Left Behind paper (NLLB) just appeared in Nature.
High-quality translation between 200 languages in any direction, with sparse training data, and many low-resource languages. https://t.co/Ur9lcR3WZo
— Yann LeCun (@ylecun) June 5, 2024

Newly published today in @Nature: No Language Left Behind (NLLB) is an AI model created by researchers at Meta capable of delivering high-quality translations directly between 200 languages – including low-resource languages.

Read more in Nature ⬇️ https://t.co/2xIqVojNuM
— AI at Meta (@AIatMeta) June 5, 2024

AI is cracking a hard problem – giving computers a sense of smell

https://theconversation.com/ai-is-cracking-a-hard-problem-giving-computers-a-sense-of-smell-221731

Amazon unveiled Project P.I., an AI system that scans products in fulfillment centers to detect damaged or incorrect items before they ship.

Amazon also utilizes a multimodal LLM to investigate issues further, combining customer feedback with Project P.I. images. pic.twitter.com/3eWmNRZDov
— Rowan Cheung (@rowancheung) June 5, 2024

Today we’re announcing LiveKit’s $22.5M Series A to build the transport layer for AI.

This wasn’t an easy fundraise. Late last year, we pitched investors that realtime voice and video would become THE way we interact with computers. A few didn’t agree; most said it was at least…
— dsa (@dsa) June 4, 2024

Amazon: AI spots product defects, reduces waste

https://www.aboutamazon.com/news/innovation-at-amazon/amazon-ai-sustainability-carbon-footprint-product-defects

Amazon’s Project PI AI looks for product defects before they ship – The Verge

https://www.theverge.com/2024/6/3/24170567/amazons-project-pi-product-defect-return-ai-computer-vision

The future of AI glasses is normal looking, light weight and affordable – meet Frame, AI Glasses by @brilliantlabsAR

It is shipping to hackers and creators already. Frame is open-source platform with mic, camera, AR display. It leverages your phone (connectivity & audio) and… pic.twitter.com/mRcCZBglHX
— Sander Saar (@sandersaar) June 7, 2024

Using AI to decode dog vocalizations | University of Michigan News

In a new paper by Nature, researchers from the University of Zurich developed a more effective vision model that keeps autonomous cars from crashing.

It works by combining a 5,000 FPS event camera with a 20-FPS RGB camera. pic.twitter.com/44CPDrGQZX
— Brett Adcock (@adcock_brett) June 2, 2024

Announcing Dragonfly, a set of vision-language models leveraging multi-resolution encoding & zoom-in patch selection to unlock fine-grained visual understanding. https://t.co/wXP7nhr0Og pic.twitter.com/5G02eJoKlS
— Together AI (@togethercompute) June 6, 2024

Dragonfly: A large vision-language model with multi-resolution zoom

https://www.together.ai/blog/dragonfly-v1

🌟Introducing "🤖SpatialRGPT: Grounded Spatial Reasoning in Vision Language Model" https://t.co/Yaj2K91Yx8

SpatialRGPT is a powerful region-level VLM that can understand both 2D and 3D spatial arrangements. It can process any region proposal (e.g., boxes or masks) and provide… pic.twitter.com/qWEWMbIxfT
— An-Chieh Cheng (@anjjei) June 4, 2024

Video-MME

The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

In the quest for artificial general intelligence, Multi-modal Large Language Models (MLLMs) have emerged as a focal point in recent advancements. However, the predominant focus pic.twitter.com/dUpMESsQHw
— AK (@_akhaliq) June 3, 2024

OpenAI

My AI smart speaker using @OpenAI – now with vision! #GPT4 🔊 📷

Code: https://t.co/HMrN4LCEnc
3D Print files: https://t.co/f1hHvXHbnh #AI #ArtificialIntelligence #SmartSpeaker #TechTrends #VoiceAssistant #OpenAI pic.twitter.com/2L6Za3PzyD
— Ben (@Olney1Ben) June 8, 2024

🚨 ChatGPT adds “Background Conversations” in its latest update.

It allows you to keep the conversation going even if you are using other apps or your screen is off.

GPT-4o new voice feature might be coming soon! pic.twitter.com/tQh9byGGtn
— Alvaro Cintas (@dr_cintas) June 5, 2024

Google

This is great – Gemini 1.5 Pro and Flash outperforms GPT-4o in Multi-modal LLMs in Video Analysis.

Specialy given the pricing of Gemini 1.5 Flash

Paper – 'Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis' https://t.co/0rTwsFqKWJ pic.twitter.com/imOoTYWbNj
— Rohan Paul (@rohanpaul_ai) June 5, 2024

Phi

Introducing Phi-3 WebGPU, a private and powerful AI chatbot that runs locally in your browser, powered by 🤗 Transformers.js and onnxruntime-web!

🔒 On-device inference: no data sent to a server
⚡️ WebGPU-accelerated (> 20 t/s)
📥 Model downloaded once and cached

Try it out! 👇 pic.twitter.com/Y79fTIghv7
— Xenova (@xenovacom) May 8, 2024

Phi-3 Medium (14B) and Small (7B) models are on the @lmsysorg leaderboard! 😍 Medium ranks near GPT-3.5-Turbo-0613, but behind Llama 3 8B. Phi-3 Small is close to Llama-2-70B, and Mistral fine-tunes.

This proves that we cannot purely optimize for academic benchmarks. We need to… pic.twitter.com/f6EDuqW3cI
— Philipp Schmid (@_philschmid) June 3, 2024

This week’s executive overview and top links are here:

AI News #36: Week Ending 06/07/2024 with Executive Summary and Top 40 Links

The post you just read is a deep-dive extension of my weekly newsletter, This Week In AI, an executive summary of the top things to know in AI. Each week, I create an accessible overview so that laypeople can feel confident they are conversant with the week’s AI developments. I include a curated list of must-click links of the week, offering everyone a hands-on opportunity to explore the most intriguing updates in artificial intelligence across categories including robotics, imagery, video, AR/VR, science, ethics, and more. Beyond the overview, I post these topic-based deeper dives (below). If you haven’t read this week’s overview, I recommend starting there.

Credits/Sources

Most of these weekly links come from just a few prolific, oversharing sources. Please follow them: they work hard to find the news each week, and they make it much easier for me to compile.