Multimodality News: Week Ending 09/06/2024

A donut with the word “document” written in sprinkles sits on a desk covered with papers. A rubber stamp with the word “Data” on it is also on the desk.

“Multimodal AI models are getting even faster with Groq’s newest LLaVA v1.5 7B release. It’s reportedly 4x as fast as GPT-4o and can have conversations with images, audio, and text. LLaVA v1.5 7B is currently free in “Preview Mode” for developers.”

https://twitter.com/rowancheung/status/1831554191988261116

“friends, here I’m talking about multimodal RAG or document retrieval if you want short, structured and concise answers from documents of same structure and have labelled data I suggest fine-tuning a model like Donut or LayoutLM series or UDOP”

https://twitter.com/mervenoyann/status/1831467222012920164

Google

“🩺 Enhancing Healthcare Diagnostics with Multimodal RAG Systems With Qdrant & Gemini, you can transform the way healthcare professionals approach diagnostics by combining the power of *both* text and image data. 🖼 🔠 In this article, Pragnesh Prajapati shows how to create a…”

https://twitter.com/qdrant_engine/status/1831240904293728327

Google Gemini will again support AI image generation of people

https://www.cnbc.com/2024/08/28/google-gemini-will-again-support-ai-image-generation-of-people.html

Google Photos: Search improvements and early access to Ask Photos

https://blog.google/products/photos/google-ask-photos-early-access

Google’s AI-powered Ask Photos feature begins US rollout | TechCrunch

Segmentation

“Those are some crispy and consistent depth maps! DepthCrafter looks like the new SOTA for video depth estimation tasks. Take any video, of any length, and get temporally coherent depth maps that you can use for VFX, or as an input to other AI models. Quick thread”

https://twitter.com/bilawalsidhu/status/1832086639428132940