2024 in Vision [LS Live @ NeurIPS]

The Latent Space LIVE mini-conference at NeurIPS 2024 included a session on the year's progress in computer vision. Speakers from Roboflow and Moondream discussed major trends, such as the shift from per-image to video processing seen in models like Sora and SAM 2, and noted that DETR models are now outpacing YOLO in real-time object detection. They also addressed the difficulty large language models (LLMs) have in capturing fine-grained visual details, referencing research such as MMVP, Florence-2, PaliGemma 2, and AIMv2. The session covered the shortcomings of current vision-language models (VLMs) on complex visual tasks and explored how synthetic data and chain-of-thought prompting could improve their performance.

Outlines

Latent Space: The AI Engineer Podcast

00:03 Latent Space LIVE Mini-Conference Introduction

02:29 Computer Vision Trends: Video Generation and Object Detection

10:17 Deep Dive into SAM 2 and DETR Advancements

22:44 LLMs and Fine-Grained Visual Details: The MMVP Challenge

32:22 Improving Vision-Language Models: Florence-2 and PaliGemma

40:58 AIMv2 and the Future of Vision-Language Models

41:29 Moondream: A Vision Language Model for Edge Devices

55:23 Concluding Thoughts on Vision-Language Model Advancements