CUDA Programming Course – High-Performance Computing with GPUs | freeCodeCamp.org

CUDA programming enables high-performance computing by using NVIDIA GPUs to accelerate deep learning workflows and other parallel workloads. Mastering it requires understanding GPU architecture, kernel launch configurations, and memory management, with particular attention to bottlenecks such as memory bandwidth and on-chip communication. Practical implementation involves writing CUDA kernels in C/C++, optimizing matrix multiplication, and extending PyTorch with custom extensions to reach production-scale performance. The deep learning ecosystem relies on a range of tools, including cuDNN for neural network primitives, NCCL for communication across distributed clusters, and Triton for higher-level kernel development. Developers can use cloud-based GPU instances to experiment with these technologies, while profiling tools such as NVIDIA Nsight Compute report memory throughput and execution efficiency. Ultimately, effective GPU programming speeds up large neural network training runs by mapping the iterations of nested loops onto many threads that execute in parallel, which reduces computation time and raises hardware utilization.
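
The idea of turning nested loops into parallel work can be made concrete with a small host-only sketch. The code below is plain C++, not CUDA, and is not taken from the course: it assumes row-major square-free matrices stored in flat arrays, and the names `matmul_loops`, `matmul_element`, `matmul_emulated_grid`, and `block_dim` are hypothetical. It compares an ordinary triple loop with a version in which the two outer loops are replaced by a grid of (block, thread) coordinates, each computing one output element from the usual `block index * block size + thread index` formula.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential reference: nested loops over rows, columns, and the shared dimension.
// a is m x k, b is k x n, both row-major; the result is m x n.
std::vector<float> matmul_loops(const std::vector<float>& a,
                                const std::vector<float>& b,
                                std::size_t m, std::size_t k, std::size_t n) {
    assert(a.size() == m * k && b.size() == k * n);
    std::vector<float> c(m * n, 0.0f);
    for (std::size_t row = 0; row < m; ++row)
        for (std::size_t col = 0; col < n; ++col)
            for (std::size_t i = 0; i < k; ++i)
                c[row * n + col] += a[row * k + i] * b[i * n + col];
    return c;
}

// Hypothetical stand-in for a kernel body: the work one thread would do.
// It computes exactly one output element and ignores out-of-range coordinates.
void matmul_element(const float* a, const float* b, float* c,
                    std::size_t m, std::size_t k, std::size_t n,
                    std::size_t row, std::size_t col) {
    if (row >= m || col >= n) return;  // the grid may overshoot the matrix edge
    float acc = 0.0f;
    for (std::size_t i = 0; i < k; ++i)
        acc += a[row * k + i] * b[i * n + col];
    c[row * n + col] = acc;
}

// Host-side emulation of a 2D launch with square blocks of block_dim x block_dim.
// On a GPU these four loops would not exist: the hardware would run the
// per-element function for every (block, thread) pair concurrently.
std::vector<float> matmul_emulated_grid(const std::vector<float>& a,
                                        const std::vector<float>& b,
                                        std::size_t m, std::size_t k, std::size_t n,
                                        std::size_t block_dim) {
    assert(block_dim > 0);
    assert(a.size() == m * k && b.size() == k * n);
    std::vector<float> c(m * n, 0.0f);
    // Ceiling division so the grid covers sizes that are not multiples of block_dim.
    const std::size_t grid_x = (n + block_dim - 1) / block_dim;
    const std::size_t grid_y = (m + block_dim - 1) / block_dim;
    for (std::size_t by = 0; by < grid_y; ++by)
        for (std::size_t bx = 0; bx < grid_x; ++bx)
            for (std::size_t ty = 0; ty < block_dim; ++ty)
                for (std::size_t tx = 0; tx < block_dim; ++tx) {
                    const std::size_t row = by * block_dim + ty;  // global row index
                    const std::size_t col = bx * block_dim + tx;  // global column index
                    matmul_element(a.data(), b.data(), c.data(), m, k, n, row, col);
                }
    return c;
}
```

Calling both functions on the same inputs, for example a 2x3 matrix times a 3x2 matrix with `block_dim` set to 4, should give identical results, because every output element is independent of the others. That independence is what lets a real kernel drop the outer loops and hand each element to its own thread.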

Outline

Part 1: Careers, Ecosystem, Infrastructure

Part 2: Setup, C/C++ Foundations

Part 3: Hardware Architecture

Part 4: CUDA Programming Model

Part 5: Profiling, Concurrency, Atomics

Part 6: NVIDIA Libraries, Distributed Training

Part 7: Advanced Kernel Optimization

Part 8: Triton, PyTorch Integration

Part 9: MNIST Project, Future Trends

Part 1: Careers, Ecosystem, Infrastructure

00:00 Course Overview and GPU Kernel Engineering Career Paths

08:44 Prerequisites and Key Takeaways for Performance Engineering

16:54 Deep Learning Ecosystem: Frameworks and Production Runtimes

24:51 Low-Level Infrastructure and Cloud GPU Providers

Part 2: Setup, C/C++ Foundations

37:44 Environment Setup for Windows WSL and Ubuntu Linux

47:07 C and C++ Review: Pointers and Memory Management

1:00:22 Array Indexing and Custom Data Types in C

1:14:16 Macros, Compilers, and Build Automation with Makefiles

Part 3: Hardware Architecture

1:35:51 Hardware Comparison: CPU vs. GPU vs. TPU

1:52:07 GPU Microarchitecture and Compute Capability

Part 4: CUDA Programming Model

2:00:01 CUDA Programming Model: Host, Device, and Hierarchy

2:11:37 Thread Indexing Math and Warp Scheduling

2:31:37 Vector Addition: Comparing 1D and 3D Kernel Performance

Part 5: Profiling, Concurrency, Atomics

2:45:15 Naive Matrix Multiplication and Tiling Intuition

3:14:06 Performance Profiling with NVIDIA Nsight Compute

3:27:51 Preventing Race Conditions with Atomic Operations

3:37:12 CUDA Streams and Asynchronous Concurrency

Part 6: NVIDIA Libraries, Distributed Training

3:56:05 Accelerating Linear Algebra with the cuBLAS API

4:25:04 Specialized cuBLAS-LT and cuBLAS-XT Libraries

4:46:07 Deep Learning Primitives and Fusion with cuDNN

5:15:13 Convolution Algorithms and Distributed Training Tools

Part 7: Advanced Kernel Optimization

5:35:24 Optimizing MatMul: Global Memory Coalescing

6:05:01 Shared Memory Caching and Tiling Strategies

6:27:47 Advanced 1D and 2D Block Tiling for High Throughput

7:41:56 Vectorized Memory Access with float4 and Assembly Analysis

8:18:43 Programmatic Access to Tensor Cores via WMMA

Part 8: Triton, PyTorch Integration

8:23:03 Triton: Block-Level Abstractions for Python Developers

9:04:57 Building Custom PyTorch CUDA Extensions

Part 9: MNIST Project, Future Trends

9:17:51 Final Project: MNIST MLP Architecture and Theory

9:35:04 Backpropagation Mechanics and Neuron Intuition

10:16:44 Implementing the MLP in NumPy and Pure C

11:14:00 Training the MNIST MLP on the GPU with CUDA

11:41:14 Advanced Optimization Trends and Course Conclusion