
Commit

Merge pull request #1075 from pritesh2000/gram-1/pytorch2.0
extras/pytorch_2_intro.ipynb
mrdbourke authored Sep 12, 2024
2 parents b99a203 + c844a13 commit 5fbccf6
Showing 1 changed file with 10 additions and 10 deletions.
20 changes: 10 additions & 10 deletions extras/pytorch_2_intro.ipynb
@@ -175,7 +175,7 @@
"\n",
"Why?\n",
"\n",
"Modern GPUs have so much compute power they are often not compute limited, as in, the main bottleneck to training models is how fast can you get data from your CPU to your GPU.\n",
"Modern GPUs have so much compute power they are often not compute limited, as in, the main bottleneck to training models is how fast you can get data from your CPU to your GPU.\n",
"This is known as bandwidth or memory bandwidth.\n",
"\n",
"You want to reduce your bandwidth costs as much as possible.\n",
@@ -209,7 +209,7 @@
"\n",
"Graph capture I’m less confident explaining.\n",
"\n",
"But the way I think about is that graph capture or graph tracing is:\n",
"But the way I think about it is that graph capture or graph tracing is:\n",
"\n",
"* Going through a series of operations that need to happen, such as the operations in a neural network.\n",
"* And capturing or tracing what needs to happen ahead of time.\n",
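As a rough illustration of graph capture, the sketch below traces a tiny module with `torch.fx` and prints the captured sequence of operations; this is an assumption-level example for intuition, not the exact mechanism `torch.compile()` uses internally.

```python
import torch
from torch import nn

# Illustrative only: torch.fx tracing is one way to see a captured graph
# of operations ahead of time.
class TinyNet(nn.Module):
    def forward(self, x):
        return torch.relu(x) + 1

traced = torch.fx.symbolic_trace(TinyNet())
print(traced.graph)  # prints the traced operations (relu, add) captured ahead of time
```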
@@ -272,7 +272,7 @@
"source": [
"## What we're going to cover\n",
"\n",
"Since many of the upgrades in PyTorch 2.0 are speed focused and happen behind the scenes (e.g. PyTorch takes care of them for you), in this notebook we're going to run a compartive speed test.\n",
"Since many of the upgrades in PyTorch 2.0 are speed focused and happen behind the scenes (e.g. PyTorch takes care of them for you), in this notebook we're going to run a comparative speed test.\n",
"\n",
"Namely we'll make two of the same models, one using the default PyTorch setup and the other using the new `torch.compile()` setup and we'll train them on the same dataset.\n",
"\n",
@@ -391,7 +391,7 @@
"\n",
"And GPUs which are datacenter-class (e.g. A100, A10, H100) are likely to see more significant speedups than desktop-class GPUs (e.g. RTX 3090, RTX 3080, RTX 3070, RTX 3060 Ti).\n",
"\n",
"We can check the compute capbility score of our GPU using [`torch.cuda.get_device_capability()`](https://pytorch.org/docs/stable/generated/torch.cuda.get_device_capability.html).\n",
"We can check the compute capability score of our GPU using [`torch.cuda.get_device_capability()`](https://pytorch.org/docs/stable/generated/torch.cuda.get_device_capability.html).\n",
"\n",
"This will output a tuple of `(major, minor)` compute capability scores, for example, `(8, 0)` for the A100.\n",
"\n",
@@ -737,7 +737,7 @@
"* **Increasing the batch size** - More samples per batch means more samples on the GPU, for example, using a batch size of 256 instead of 32.\n",
"* **Increasing data size** - For example, using larger image size, 224x224 instead of 32x32. A larger data size means that more tensor operations will be happening on the GPU.\n",
"* **Increasing model size** - For example, using a larger model such as ResNet101 instead of ResNet50. A larger model means that more tensor operations will be happening on the GPU.\n",
"* **Decreasing data transfer** - For example, setting up all your tensors to be on GPU memory, this minizes the amount of data transfer between the CPU and GPU.\n",
"* **Decreasing data transfer** - For example, setting up all your tensors to be on GPU memory, this minimizes the amount of data transfer between the CPU and GPU.\n",
"\n",
"All of these result in *more* data being on the GPU.\n",
"\n",
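A small sketch of the "decreasing data transfer" point from the list above, with illustrative sizes:

```python
import torch

# Create a batch-sized tensor, move it to the GPU once, then keep subsequent
# operations on the GPU to avoid repeated CPU<->GPU transfers.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.rand(256, 3, 224, 224)       # larger batch/image size -> more work per transfer
x = x.to(device, non_blocking=True)    # single CPU->GPU copy
y = (x * 2).mean()                     # further ops stay on the GPU
```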
@@ -806,7 +806,7 @@
"\n",
"Let's write some code to use a larger batch size if more GPU memory is available.\n",
"\n",
"> **Note:** The ideal batch size you use will depend on the specific GPU and dataset and model you're working with. The code below is specifically targeted for the A100 GPU available on Google Colab Pro. However, you may to adjust it for your own GPU. As if you set the batch size too high, you may run into CUDA out of memory errors.\n",
"> **Note:** The ideal batch size you use will depend on the specific GPU and dataset and model you're working with. The code below is specifically targeted for the A100 GPU available on Google Colab Pro. However, you may adjust it for your own GPU. As if you set the batch size too high, you may run into CUDA out of memory errors.\n",
"\n",
"If the total memory on the GPU available is **above 16GB**, let's use a batch size of 128 and an image size of 224 (both of these values can be increased on GPUs with more memory).\n",
"\n",
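A sketch of that memory check, assuming the above-16GB thresholds from the text; the smaller-GPU fallback values are illustrative:

```python
import torch

# Thresholds follow the text above (>16GB -> batch size 128, image size 224);
# the fallback values are assumptions, not necessarily what the notebook uses.
GB = 1024 ** 3
total_gpu_memory_gb = torch.cuda.get_device_properties(0).total_memory / GB

if total_gpu_memory_gb > 16:
    BATCH_SIZE, IMAGE_SIZE = 128, 224
else:
    BATCH_SIZE, IMAGE_SIZE = 32, 128   # assumed smaller-GPU fallback

print(f"GPU memory: {total_gpu_memory_gb:.1f} GB -> batch size {BATCH_SIZE}, image size {IMAGE_SIZE}")
```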
@@ -985,7 +985,7 @@
"\n",
"As in, the transfer speed from CPU to GPU.\n",
"\n",
"As we're discussed before you want to get your data to the GPU as fast as possible.\n",
"As we've discussed before you want to get your data to the GPU as fast as possible.\n",
"\n",
"Let's create our `DataLoaders` using `torch.utils.data.DataLoader`.\n",
"\n",
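A minimal sketch of those `DataLoaders`, assuming `train_dataset`, `test_dataset` and `BATCH_SIZE` already exist in the notebook:

```python
from torch.utils.data import DataLoader

# pin_memory=True pins batches in page-locked host memory, which speeds up
# CPU->GPU transfer; num_workers is an illustrative value.
train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE,
                              shuffle=True, num_workers=2, pin_memory=True)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE,
                             shuffle=False, num_workers=2, pin_memory=True)
```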
@@ -1123,7 +1123,7 @@
" # 5. Optimizer step\n",
" optimizer.step()\n",
"\n",
" # Calculate and accumulate accuracy metric across all batches\n",
" # Calculate and accumulate accuracy metrics across all batches\n",
" y_pred_class = torch.argmax(torch.softmax(y_pred, dim=1), dim=1)\n",
" train_acc += (y_pred_class == y).sum().item()/len(y_pred)\n",
"\n",
@@ -2801,7 +2801,7 @@
"\n",
"### 4.4 Save multi run results to file with GPU details\n",
"\n",
"Let's also save our results dataframes for experiments 3 and 4 to file to in case we'd like to inspect them later or compare them to other kinds of models."
"Let's also save our results dataframes for experiments 3 and 4 to file in case we'd like to inspect them later or compare them to other kinds of models."
]
},
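A hypothetical sketch of that save step; the dataframe names and file paths are placeholders, not the notebook's own:

```python
import torch

# Placeholder dataframe names and paths; the GPU name is appended so results
# from different hardware can be told apart later.
gpu_name = torch.cuda.get_device_name(0).replace(" ", "_")
single_run_results_df.to_csv(f"results_experiment_3_{gpu_name}.csv", index=False)
multi_run_results_df.to_csv(f"results_experiment_4_{gpu_name}.csv", index=False)
```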
{
@@ -2856,7 +2856,7 @@
"For even more speedups, I'd recommend researching/trying the following:\n",
"\n",
"* **More powerful CPUs** - I have a sneaking suspicion that Google Colab instances are limited to 2 CPU cores, speedup numbers could be improved with more CPUs. This could be tracked via the [PyTorch Profiler](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html) (a tool to find what processes take what time).\n",
"* **Using mixed precision training** - newer GPUs have the ability to handle difference precision types (e.g. [`torch.float16`](https://pytorch.org/docs/stable/tensors.html#data-types) and [`torch.bfloat16`](https://pytorch.org/docs/stable/generated/torch.Tensor.bfloat16.html)) which enable faster training and inference. I'd suspect you'll see an even larger speedup than we've seen here by using mixed precision training. For more on this, see the [PyTorch documentation for automatic mixed precision](https://pytorch.org/docs/stable/notes/amp_examples.html#amp-examples) (also called AMP) with PyTorch. \n",
"* **Using mixed precision training** - newer GPUs have the ability to handle different precision types (e.g. [`torch.float16`](https://pytorch.org/docs/stable/tensors.html#data-types) and [`torch.bfloat16`](https://pytorch.org/docs/stable/generated/torch.Tensor.bfloat16.html)) which enable faster training and inference. I'd suspect you'll see an even larger speedup than we've seen here by using mixed precision training. For more on this, see the [PyTorch documentation for automatic mixed precision](https://pytorch.org/docs/stable/notes/amp_examples.html#amp-examples) (also called AMP) with PyTorch. \n",
"* **Transformer based models may see more *relative* speedups than convolutional models** - PyTorch 2.0 includes a [stable release for accelerated transformer models](https://pytorch.org/blog/pytorch-2.0-release/#stable-accelerated-pytorch-2-transformers) (models which use the attention mechanism). The main speedups come from an improved implementation of [`scaled_dot_product_attention()`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention.html?highlight=scaled_dot_product#torch.nn.functional.scaled_dot_product_attention) which automatically selects the best version of attention to use based on the hardware you're computing on. You can see more in the [dedicated PyTorch tutorial](https://pytorch.org/tutorials/intermediate/scaled_dot_product_attention_tutorial.html). \n",
"* **Train for longer** - As previously discussed, the speedups from `torch.compile()` are likely to be more noticeable when training for longer. A great exercise would be to train over a longer number of epochs, potentially on a different dataset with a different model (e.g. a transformer) and see how the speedups compare."
]
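A minimal mixed-precision (AMP) training-loop sketch for the second bullet above, assuming `model`, `loss_fn`, `optimizer` and `dataloader` exist as in the notebook:

```python
import torch

# Gradient scaling avoids float16 underflow when backpropagating.
scaler = torch.cuda.amp.GradScaler()

for X, y in dataloader:
    X, y = X.to("cuda"), y.to("cuda")
    optimizer.zero_grad()
    # Forward pass runs in float16 where safe, float32 elsewhere.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        y_pred = model(X)
        loss = loss_fn(y_pred, y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```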

