Zuckerberg touts Meta’s latest video vision AI with Nvidia CEO Jensen Huang | TechCrunch
AI can see what’s on your screen by reading HDMI electromagnetic radiation | TechSpot – https://www.techspot.com/news/104015-ai-can-see-what-screen-reading-hdmi-electromagnetic.html
OpenAI invests in a webcam company turned AI startup. – The Verge
“OpenAI is leading a $60 million investment in Opal, which sells $300 professional webcams and plans to develop other types of devices powered by OpenAI’s AI models. w/ @steph_palazzolo
“Want a robot to assist you in the kitchen **without any instructions** simply by watching you?🤖🏠 🚀 Presenting our recent paper on action anticipation from short video context for human-robot collaboration, accepted at Robotics and Automation Letters (RA-L).
“Although there is probably too much AI hype these days, I am excited about my Ray-Ban smart glasses for many reasons (e.g., listening to music, live streaming, image capture, etc.). The “killer app” is that these glasses are now powered by Meta’s Llama AI model! 1/3
“Idefics3-Llama is out! 💥 It’s a multimodal model based on Llama 3.1 that accepts an arbitrary number of interleaved images with text with a huge context window (10k tokens!) 😍 Link to demo and model in the next one 😏
“This is clever. A diffusion model picks up features in common across datasets and we can use that to find subtle visual patterns. It identifies geographies like a geo-guesser (utility poles, bollards), decades by eyeglasses shape & fashion trends, and shows promise for medicine
“MiniCPM V 2.6 is out! 🤩 A VLM marrying SigLIP 400M 🤝🏻 Qwen2-7B 💪🏻 Outperforms proprietary models on OpenCompass benchmarks and video benchmarks 🎬 Accepts multiple images, videos, can do in-context learning I will unpack it with details once 2.6 technical report is out 😊
Segmentation
“Our SAM 2 pod with @nikhilaravi is out! Fun SAM1 quote from guest cohost @josephofiowa: “I recently pulled statistics from the usage of SAM in @RoboFlow over the course of the last year. And users have labeled about 49 million images using SAM on the hosted side of the RoboFlow
“SAM 2 from Meta FAIR is the first unified model for real-time, promptable object segmentation in images & videos. Using the model in our web-based demo you can segment, track and apply effects to objects in video in just a few clicks. Try SAM 2 ➡️
“Grounded SAM 2: Ground and Track Anything in Videos
“Computer vision + Journalism + #Olympics = 😍 – The @nytimes used computer vision to detect the positions of the athletes on photos taken every 100ms – Speeds were then computed by combining their positions and the timestamp of each photograph – Manual verification and
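The NYT pipeline described in that tweet boils down to simple kinematics: detect each athlete’s position in photos taken every 100 ms, then divide the distance between successive detections by the elapsed time. A minimal sketch of that speed calculation (the helper name and the sample coordinates are illustrative, not from the NYT’s actual code):

```python
import math

def speeds_from_positions(positions, timestamps):
    """Estimate speeds (m/s) from successive (x, y) positions in meters
    and capture timestamps in seconds, as with photos taken every 100 ms."""
    speeds = []
    for (x0, y0), (x1, y1), t0, t1 in zip(
        positions, positions[1:], timestamps, timestamps[1:]
    ):
        dist = math.hypot(x1 - x0, y1 - y0)  # straight-line distance between detections
        speeds.append(dist / (t1 - t0))      # speed = distance / elapsed time
    return speeds

# Illustrative: a sprinter detected every 100 ms, moving about 1 m per frame
positions = [(0.0, 0.0), (1.0, 0.0), (2.01, 0.0)]
timestamps = [0.0, 0.1, 0.2]
print(speeds_from_positions(positions, timestamps))
```

In practice the hard part is the detection step (mapping pixel positions in each photo to real-world track coordinates); once positions and timestamps are aligned, the speed math is just this.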
![Flux[dev]: A surreal machine with letters and text going into it and music notes coming out. The machine is operated by a futuristic robot, sleek smooth humanoid design. Smooth, glossy black faceplate with no visible facial features, high-tech, minimalist appearance. The robot's body is matte black or dark gray, with articulated joints and mechanical parts that resemble those of a human, including fingers. In the background, "Multimodal" is written in fog.](https://ethanbholland.com/wp-content/uploads/2024/09/multimodal-2.png)



