CUDA Programming

CUDA programming basics

Understand the hardware
- Architecture Generations
  - P100: Pascal / sm_60
  - V100: Volta / sm_70
  - A100: Ampere / sm_80
- CUDA Core vs. Tensor Core
Programming model
- Thread
- Block
- Grid
- Stream
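To make the thread / block / grid / stream terms concrete, here is a minimal sketch of a kernel launch. The names `AddOne` and `RunAddOne` are hypothetical and exist only for this illustration; the block size of 256 is an arbitrary but common choice, not a recommendation from this page.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread handles one element. Its global index combines the block index,
// the block size, and the thread index within the block.
__global__ void AddOne(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] += 1.0f;  // guard: the last block may be only partly used
}

// Hypothetical helper: returns true if every element was incremented once.
bool RunAddOne(int n) {
  std::vector<float> host(n, 1.0f);
  const size_t bytes = static_cast<size_t>(n) * sizeof(float);

  cudaStream_t stream;
  float* dev = nullptr;
  if (cudaStreamCreate(&stream) != cudaSuccess) return false;
  if (cudaMalloc(&dev, bytes) != cudaSuccess) {
    cudaStreamDestroy(stream);
    return false;
  }

  // All three operations are queued on the same stream, so they run in order.
  cudaMemcpyAsync(dev, host.data(), bytes, cudaMemcpyHostToDevice, stream);
  const int threads = 256;                           // threads per block
  const int blocks = (n + threads - 1) / threads;    // blocks in the grid
  AddOne<<<blocks, threads, 0, stream>>>(dev, n);
  cudaMemcpyAsync(host.data(), dev, bytes, cudaMemcpyDeviceToHost, stream);

  // Wait for this stream only, then release resources.
  cudaError_t err = cudaStreamSynchronize(stream);
  cudaFree(dev);
  cudaStreamDestroy(stream);
  if (err != cudaSuccess) return false;

  for (float v : host) {
    if (v != 2.0f) return false;
  }
  return true;
}
```

Calling `RunAddOne(1000)` from a host `main` launches a grid of 4 blocks of 256 threads; the bounds check keeps the 24 surplus threads in the last block from writing out of range.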
Must-know functions
- cudaMalloc() vs. cudaFree()
- cudaMemcpy() vs. cudaMemcpyAsync()
- cudaMemset() vs. cudaMemsetAsync()
- cudaStreamSynchronize() vs. cudaDeviceSynchronize()
- cudaEventRecord() vs. cudaStreamWaitEvent()
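The functions listed above can be combined as in the following sketch, which orders work across two streams with an event instead of a device-wide synchronization. The helper `RunEventOrdering`, the macro `CUDA_OK`, and the stream names are assumptions made for this example only.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Sketch-level error handling: bail out on the first failure. Resources are
// not released on the error path; production code would need to handle that.
#define CUDA_OK(call) do { if ((call) != cudaSuccess) return false; } while (0)

// Hypothetical helper: stream `producer` fills a device buffer, stream
// `consumer` waits for that via an event and copies the result to the host.
bool RunEventOrdering(size_t n) {
  cudaStream_t producer, consumer;
  cudaEvent_t filled;
  unsigned char* dev = nullptr;
  std::vector<unsigned char> host(n, 0);

  CUDA_OK(cudaStreamCreate(&producer));
  CUDA_OK(cudaStreamCreate(&consumer));
  CUDA_OK(cudaEventCreateWithFlags(&filled, cudaEventDisableTiming));
  CUDA_OK(cudaMalloc(&dev, n));

  // Queue the fill on `producer` and mark the point where it completes.
  CUDA_OK(cudaMemsetAsync(dev, 0x2A, n, producer));
  CUDA_OK(cudaEventRecord(filled, producer));

  // `consumer` will not start the copy until `filled` has been reached.
  CUDA_OK(cudaStreamWaitEvent(consumer, filled, 0));
  CUDA_OK(cudaMemcpyAsync(host.data(), dev, n, cudaMemcpyDeviceToHost, consumer));

  // Only the stream whose result the host needs is synchronized.
  CUDA_OK(cudaStreamSynchronize(consumer));

  CUDA_OK(cudaFree(dev));
  CUDA_OK(cudaEventDestroy(filled));
  CUDA_OK(cudaStreamDestroy(producer));
  CUDA_OK(cudaStreamDestroy(consumer));

  for (unsigned char v : host) {
    if (v != 0x2A) return false;
  }
  return true;
}
```

`cudaStreamSynchronize()` blocks the host until one stream has drained, whereas `cudaDeviceSynchronize()` waits for all outstanding work on the device, so the narrower call is usually the better default when only one result is needed.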

Common tricks

Avoid memcpy
Avoid unnecessary synchronization
Preprocess data on the CPU
When to use #pragma unroll?
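On the `#pragma unroll` question, one common case is a short loop whose trip count is known at compile time, as in the sketch below. The kernel `ScaleUnrolled`, the helper `RunScaleUnrolled`, and the constant `kItemsPerThread` are hypothetical; whether unrolling actually helps depends on the kernel and should be confirmed by profiling.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kItemsPerThread = 4;  // compile-time trip count (assumed value)

// Each thread scales kItemsPerThread consecutive elements. Because the trip
// count is a compile-time constant, the compiler can fully unroll the loop.
__global__ void ScaleUnrolled(const float* in, float* out, int n, float alpha) {
  int base = (blockIdx.x * blockDim.x + threadIdx.x) * kItemsPerThread;
#pragma unroll
  for (int k = 0; k < kItemsPerThread; ++k) {
    int i = base + k;
    if (i < n) out[i] = alpha * in[i];
  }
}

// Hypothetical helper: returns true if out[i] == alpha * in[i] for all i.
bool RunScaleUnrolled(int n, float alpha) {
  std::vector<float> host_in(n), host_out(n, 0.0f);
  for (int i = 0; i < n; ++i) host_in[i] = static_cast<float>(i);
  const size_t bytes = static_cast<size_t>(n) * sizeof(float);

  float* dev_in = nullptr;
  float* dev_out = nullptr;
  if (cudaMalloc(&dev_in, bytes) != cudaSuccess) return false;
  if (cudaMalloc(&dev_out, bytes) != cudaSuccess) {
    cudaFree(dev_in);
    return false;
  }
  cudaMemcpy(dev_in, host_in.data(), bytes, cudaMemcpyHostToDevice);

  const int threads = 128;
  const int per_block = threads * kItemsPerThread;   // elements per block
  const int blocks = (n + per_block - 1) / per_block;
  ScaleUnrolled<<<blocks, threads>>>(dev_in, dev_out, n, alpha);

  // The blocking copy on the default stream waits for the kernel to finish.
  cudaError_t err =
      cudaMemcpy(host_out.data(), dev_out, bytes, cudaMemcpyDeviceToHost);
  cudaFree(dev_in);
  cudaFree(dev_out);
  if (err != cudaSuccess) return false;

  for (int i = 0; i < n; ++i) {
    if (host_out[i] != alpha * host_in[i]) return false;
  }
  return true;
}
```

Unrolling removes loop-counter overhead and gives the compiler more room to schedule independent loads, at the cost of larger code and potentially higher register use, which is why it tends to suit small fixed-size loops rather than long or data-dependent ones.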

CUDA Kernel Examples

Easy: Dropout/DropGrad
Medium: SoftmaxCrossEntropyLoss(Grad)
Hard: LayerNormalization, ReduceSum, GatherGrad

Debugging CUDA kernels

printf() works inside CUDA code
Memcpy data to CPU for inspection?
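Both debugging approaches above can be sketched together: a device-side `printf()` restricted to a single thread, followed by a copy back to the host for inspection. The kernel `Square` and the helper `DebugSquare` are hypothetical names used only for this illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <vector>

__global__ void Square(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    data[i] = data[i] * data[i];
    // Print from one thread only; unguarded printf floods the output buffer.
    if (i == 0) printf("thread 0: data[0] = %f\n", data[i]);
  }
}

// Hypothetical helper: squares `value` on the GPU in every element and returns
// element 0 after copying the buffer to the host (or -1.0f on failure).
float DebugSquare(float value, int n) {
  std::vector<float> host(n, value);
  const size_t bytes = static_cast<size_t>(n) * sizeof(float);

  float* dev = nullptr;
  if (cudaMalloc(&dev, bytes) != cudaSuccess) return -1.0f;
  cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

  Square<<<(n + 255) / 256, 256>>>(dev, n);

  // Launch errors (bad configuration) and execution errors are reported
  // separately; device printf output is flushed at the synchronization.
  cudaError_t launch_err = cudaGetLastError();
  cudaError_t sync_err = cudaDeviceSynchronize();

  // Copy the buffer back so it can be examined in a host debugger or logged.
  cudaError_t copy_err =
      cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
  cudaFree(dev);

  if (launch_err != cudaSuccess || sync_err != cudaSuccess ||
      copy_err != cudaSuccess) {
    return -1.0f;
  }
  return host[0];
}
```

The extra synchronization and copy are debugging aids only; they serialize the GPU and should be removed or compiled out once the kernel is verified.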

Understanding IO bound and compute bound
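One rough way to tell the two regimes apart is to measure the bandwidth a kernel actually achieves and compare it with the peak memory bandwidth in the GPU's specification. The sketch below times a pure copy kernel, which does no arithmetic and is therefore IO (memory) bound, using CUDA events. The names `CopyKernel` and `MeasureCopyBandwidthGBs` are hypothetical, and a single timed launch is a simplification; real measurements would average several runs or use a profiler.

```cuda
#include <cuda_runtime.h>

// Reads one float and writes one float per thread, with no arithmetic.
__global__ void CopyKernel(const float* in, float* out, size_t n) {
  size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i];
}

// Hypothetical helper: returns effective bandwidth in GB/s, or -1.0 on error.
double MeasureCopyBandwidthGBs(size_t n) {
  const size_t bytes = n * sizeof(float);
  float* in = nullptr;
  float* out = nullptr;
  if (cudaMalloc(&in, bytes) != cudaSuccess) return -1.0;
  if (cudaMalloc(&out, bytes) != cudaSuccess) {
    cudaFree(in);
    return -1.0;
  }
  cudaMemset(in, 0, bytes);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  const unsigned int threads = 256;
  const unsigned int blocks =
      static_cast<unsigned int>((n + threads - 1) / threads);

  CopyKernel<<<blocks, threads>>>(in, out, n);  // warm-up launch, not timed

  // Events are recorded on the default stream around the timed launch.
  cudaEventRecord(start, 0);
  CopyKernel<<<blocks, threads>>>(in, out, n);
  cudaEventRecord(stop, 0);
  cudaError_t err = cudaEventSynchronize(stop);

  float ms = 0.0f;
  if (err == cudaSuccess) err = cudaEventElapsedTime(&ms, start, stop);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaFree(in);
  cudaFree(out);
  if (err != cudaSuccess) return -1.0;

  // Each element is read once and written once: 2 * bytes of traffic.
  const double seconds = static_cast<double>(ms) * 1e-3;
  return (2.0 * static_cast<double>(bytes)) / seconds / 1e9;
}
```

If a kernel's measured bandwidth is close to what this copy achieves, it is likely limited by memory traffic, and reducing reads and writes (for example by fusing kernels) tends to help more than optimizing arithmetic. A kernel that moves far less data per unit time while keeping the arithmetic units busy is more likely compute bound.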

Please use the learning roadmap on the home wiki page for building general understanding of ORT.
