In this lecture, the speaker details how to write high-performance GPU code, with an eye toward Assignment 2, which involves profiling and writing a Triton kernel for FlashAttention-2. After a review of GPU components, the lecture covers benchmarking and profiling techniques for kernels written as C++ CUDA kernels, in Triton, and via PyTorch's JIT compiler. Using a simple MLP as a running example, the speaker emphasizes benchmarking and profiling code to identify bottlenecks, then demonstrates PyTorch's built-in profiler and NVIDIA's Nsight Systems, showing how to analyze GPU behavior and performance, including the interaction between the CPU and GPU. The lecture also covers kernel fusion, writing CUDA kernels in C++, and GPU programming with Triton, comparing performance across the different implementations and concluding with a demonstration of a fast Triton implementation of Softmax.
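As a minimal sketch of the benchmarking practice the lecture emphasizes: because CUDA kernel launches are asynchronous from the CPU's perspective, a timing harness must warm up first and synchronize before reading the clock. The helper below is illustrative (the function name `benchmark` and the matmul-plus-ReLU workload are assumptions, not the lecture's exact code) and falls back to CPU when no GPU is present.

```python
import time

import torch


def benchmark(fn, *, warmup: int = 3, iters: int = 10) -> float:
    """Return mean seconds per call of `fn`, synchronizing around the timed region."""
    for _ in range(warmup):
        fn()  # warm up: CUDA context creation, caches, any JIT compilation
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # kernel launches are async: drain the queue first
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for the last kernel to actually finish
    return (time.perf_counter() - start) / iters


# Hypothetical MLP-style workload for illustration
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(512, 512, device=device)
w = torch.randn(512, 512, device=device)
seconds = benchmark(lambda: torch.relu(x @ w))
print(f"{seconds * 1e3:.3f} ms per iteration")
```

Without the synchronization calls, the timer would measure only how fast the CPU can enqueue kernels, not how long the GPU takes to run them, which is a common pitfall the profiling tools discussed in the lecture help expose.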