The Latent Space LIVE mini-conference at NeurIPS 2024 surveyed the year's progress in computer vision. Speakers from Roboflow and Moondream discussed major trends, such as the shift from per-image to video processing seen in models like Sora and SAM 2, and noted that DETR-based models are now outperforming YOLO in real-time object detection. They also addressed the difficulty large language models (LLMs) have in capturing fine visual details, referencing research such as MMVP, Florence-2, PaliGemma 2, and AIMv2. The session included a discussion of the shortcomings of current vision-language models (VLMs) on complex visual tasks and explored how synthetic data and chain-of-thought prompting could improve their performance.
![2024 in Vision [LS Live @ NeurIPS] Episode cover](https://substackcdn.com/feed/podcast/1084089/post/153472517/18645731b5bd7579807b2e396b2e994c.jpg)