With the focus on performance and getting the most out of hardware, kernel fusion has become a popular technique. At times, researchers and practitioners rewrite their code as native CUDA or CPU kernels to get optimal performance, but projects such as torch.compile aim to make this simpler. This talk focuses on generating fused kernels and how to leverage torch.compile to do that. We will shift away from all the LLM talk and look at recommendation algorithms instead. In the process, we will create fused kernels (Triton and CUDA) with the help of torch.compile.
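As a taste of what the lecture covers, here is a minimal sketch of pointwise fusion with torch.compile; the function and tensor shapes are illustrative, not from the lecture, and on a CUDA GPU the default inductor backend should emit a single fused Triton kernel for the chain of pointwise ops.

```python
# Minimal sketch: torch.compile fusing a chain of pointwise ops.
# The function and shapes below are illustrative placeholders.
import torch

def bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # add -> gelu -> scale: three pointwise ops, eligible for fusion
    return torch.nn.functional.gelu(x + bias) * 2.0

compiled = torch.compile(bias_gelu)  # inductor backend by default

x = torch.randn(1024, 1024, device="cuda")
bias = torch.randn(1024, device="cuda")
out = compiled(x, bias)  # first call compiles; later calls hit one fused kernel
```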
- Lecture Data: https://github.com/kapilsh/cuda-mode-lecture/tree/main/data
- How to open a Chrome trace: chrome://tracing
- DLRM Blog Post: https://ai.meta.com/blog/dlrm-an-advanced-open-source-deep-learning-recommendation-model/
- DLRM Paper: https://arxiv.org/pdf/1906.00091
- DLRM GitHub repo: https://github.com/facebookresearch/dlrm
- Criteo Dataset: https://ailab.criteo.com/download-criteo-1tb-click-logs-dataset/
- PyTorch Profiler with TensorBoard (see the sketch after this list)
- TORCH_LOGS with torch.compile (see the sketch after this list)
- LoRA Paper: https://arxiv.org/abs/2106.09685
- LoRA from scratch: https://lightning.ai/lightning-ai/studios/code-lora-from-scratch
- Netron: https://netron.app/
- GPUs go brrr: https://horace.io/brrr_intro.html
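For the profiler item above, a minimal sketch, assuming the torch-tb-profiler TensorBoard plugin is installed; the model, shapes, and log directory are arbitrary placeholders.

```python
# Minimal sketch: PyTorch Profiler writing a trace that TensorBoard can read.
# Model, shapes, and the "./log" directory are arbitrary placeholders.
import torch
from torch.profiler import ProfilerActivity, profile, tensorboard_trace_handler

model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device="cuda")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    on_trace_ready=tensorboard_trace_handler("./log"),
) as prof:
    for _ in range(10):
        model(x)

# prof.export_chrome_trace("trace.json") would instead produce a JSON
# trace you can open at chrome://tracing, as linked above.
```

View the result with `tensorboard --logdir ./log`.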
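And for the TORCH_LOGS item, a minimal sketch of dumping the code inductor generates; setting the `TORCH_LOGS="output_code"` environment variable before launching Python achieves the same thing, and the function here is just a placeholder.

```python
# Minimal sketch: inspect the generated (fused) kernel code from
# torch.compile. Equivalent to launching with TORCH_LOGS="output_code".
import torch
import torch._logging

torch._logging.set_logs(output_code=True)  # log inductor's generated code

@torch.compile
def f(x):  # placeholder function: two pointwise ops that should fuse
    return torch.relu(x) + 1.0

f(torch.randn(1024, device="cuda"))  # compilation logs the Triton kernel
```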