Ethan B. Holland

Over 56,100 manually organized AI links and counting

a movie theater in a forest with a trail sign that reads "Multimodal" --ar 5:3 --style raw

Multimodality News: Week Ending 05/24/2024

May 24, 2024


This week’s category cover theme is a sign in a forest. Each category’s image prompt is derived from the formula “an [category themed object] in a forest with a trail sign that reads ‘[category name]’”. Using a theme each week cuts the cover creation time to about 20 minutes, rather than several hours.

Blog – DeepDataSpace | Unleashing the Power of Cutting-Edge Computer Vision Technology

https://www.deepdataspace.com/blog/Grounding-DINO-1.5-Pro

“What makes up the abstract concept of an apple? We read the word ‘apple’ as a string, see 2D pictures online, 3D shape in real life, and moving apples in videos. We touch the apple, feel its geometry in our palms and texture through the rich tactile sensation on our fingers. Do all these different modalities converge to the same representation space, given sufficient learning capacity? After all, they are all shadows of one ‘true reality’ projected onto our different senses. I really like this study paper from MIT, ‘The Platonic Representation Hypothesis’. The authors show that highly capable LLMs and vision models learn very similar representations, even though the modalities are never explicitly co-trained. Concretely, the experiments compare the similarity between strings ‘apple’ and ‘orange’ with the similarity between a picture of an apple vs orange. These two turn out to agree with each other in a wide selection of off-the-shelf models.”

— Jim Fan (@DrJimFan) May 22, 2024
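The comparison described in the quote (how similar “apple” and “orange” are as text versus as images) can be sketched in a few lines. This is only a toy illustration: the vectors below are made up rather than taken from any real model, the function names are hypothetical, and a plain correlation of cosine similarities stands in for whatever alignment metric the paper actually uses.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_similarities(embeddings):
    """Cosine similarity for every unordered pair of concepts."""
    names = sorted(embeddings)
    return {
        (a, b): cosine(embeddings[a], embeddings[b])
        for a, b in combinations(names, 2)
    }


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def representation_agreement(text_emb, image_emb):
    """Correlate the pairwise similarities seen by two different models.

    The two embedding spaces may have different dimensions; only the
    similarity structure between concepts is compared.
    """
    text_sims = pairwise_similarities(text_emb)
    image_sims = pairwise_similarities(image_emb)
    pairs = sorted(text_sims)
    return pearson(
        [text_sims[p] for p in pairs],
        [image_sims[p] for p in pairs],
    )


# Made-up embeddings (assumption): a 3-d "text model" and a 4-d "vision model".
TEXT_EMBEDDINGS = {
    "apple": [0.9, 0.1, 0.0],
    "orange": [0.8, 0.3, 0.1],
    "car": [0.0, 0.2, 0.9],
}
IMAGE_EMBEDDINGS = {
    "apple": [0.7, 0.2, 0.1, 0.0],
    "orange": [0.6, 0.4, 0.1, 0.1],
    "car": [0.1, 0.0, 0.3, 0.9],
}

score = representation_agreement(TEXT_EMBEDDINGS, IMAGE_EMBEDDINGS)
print(f"agreement between text and image similarity structure: {score:.3f}")
```

A score near 1 means both models rank the concept pairs the same way (apple is closer to orange than to car in both spaces). With only three concepts the number is not meaningful on its own; a real comparison would use many concepts and real model embeddings.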

“The PaliGemma vision-language model is included as part of the latest KerasNLP release! Works with JAX, TF, and torch. There’s a lot you can do with it: describing images, captioning, object detection and image segmentation, OCR, visual question answering… it even has…”

— François Chollet (@fchollet) May 22, 2024

OpenAI

“This is a big leap forward to doing real data analysis with GPT-4o. It still can’t quite handle the way many people use spreadsheets (Excel formulas, etc.) but it is pretty impressive. You can zoom into spreadsheets and ask about cells, modify graphs, etc. I trimmed 45 seconds.”

— Ethan Mollick (@emollick) May 21, 2024

Google

PaliGemma: Open Source Multimodal Model by Google

https://blog.roboflow.com/paligemma-multimodal-vision

Grok

Elon Musk’s xAI is working on making Grok multimodal – The Verge

https://www.theverge.com/2024/5/21/24161764/elon-musk-xai-grok-multimodal-ai

Phi

“Phi-3-vision with 4.2B parameters”

— Rohan Paul (@rohanpaul_ai) May 21, 2024

Segmentation

“In honor of the playoffs, I’d like to showcase what we’ve been working on here at Nexavision — a new way to generate basketball analytics through tracking with computer vision and AI: 🧵” https://twitter.com/AmarSVS/status/1793037268690579787


Be Sure To Read This Week’s Main Post:

This week’s executive overview and top links are here:

AI News #34: Week Ending 05/24/2024 with Executive Summary and Top 47 Links

The post you just read is a deep-dive extension of my weekly newsletter, This Week In AI, an executive summary of the top things to know in AI. Each week, I create an accessible overview so that laypeople can feel confident they are conversant with the week’s AI developments. I include a curated list of the week’s must-click links, offering everyone a hands-on opportunity to explore the most intriguing updates in artificial intelligence across various categories, including robotics, imagery, video, AR/VR, science, ethics, and more. Beyond the overview, I post these topic-based deeper dives. If you haven’t read this week’s overview, I recommend starting there.

Credits/Sources

Most of these weekly links come from just a few prolific, oversharing sources. Please follow them: they work hard to find the news each week, and they make it a lot easier for me to compile.