Combining next-token prediction and video diffusion in computer vision and robotics | MIT CSAIL | Podwise