Final October docs edits (#313)
This PR includes final edits for October release of docs, including:
- Changing Nemo2 images to `.png` and cropping
- Fixing broken links throughout docs
- Adds labeling guidelines to contributing.md per #309

---------

Signed-off-by: Tyler Shimko <[email protected]>
tshimko-nv authored Oct 15, 2024
1 parent 8a29c8d commit 70f3dc1
Showing 20 changed files with 17 additions and 13 deletions.
(7 binary image files changed; binary content not displayed)
14 changes: 7 additions & 7 deletions docs/docs/user-guide/background/nemo2.md
@@ -24,7 +24,7 @@
Synchronization of gradients occurs after the backward pass is complete for each
that ensures all GPUs have synchronized parameters for the next iteration. Here is an example of how this might appear
on your cluster with a small model:

![Data Parallelism Diagram](site:assets/images/megatron_background/data_parallelism.jpg)
![Data Parallelism Diagram](site:assets/images/megatron_background/data_parallelism.png)
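For concreteness, here is a minimal PyTorch sketch of the DDP pattern described above. It is a sketch under assumptions: a `torchrun`-style launcher sets `LOCAL_RANK` and the rendezvous variables, and all sizes and hyperparameters are illustrative.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")  # torchrun provides MASTER_ADDR/PORT, RANK, WORLD_SIZE
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda(local_rank)
ddp_model = DDP(model, device_ids=[local_rank])  # every rank holds a full replica
opt = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

x = torch.randn(8, 1024, device=f"cuda:{local_rank}")
loss = ddp_model(x).square().mean()
loss.backward()  # gradients are all-reduced (averaged) across ranks here
opt.step()       # each rank applies the same update, keeping replicas in sync

dist.destroy_process_group()
```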

### FSDP background
FSDP extends DDP by sharding (splitting) model weights across GPUs in your cluster to optimize memory usage.
@@ -40,8 +40,8 @@
Note that this process parallelizes the storage in a way that enables too large
layer is not too large to fit on a GPU). Megatron (discussed next) co-locates both storage and compute.

The following two figures show two steps through the forward pass of a model that has been sharded with FSDP.
![FSDP Diagram Step 1](site:assets/images/megatron_background/fsdp_slide1.jpg)
![FSDP Diagram Step 2](site:assets/images/megatron_background/fsdp_slide2.jpg)
![FSDP Diagram Step 1](site:assets/images/megatron_background/fsdp_slide1.png)
![FSDP Diagram Step 2](site:assets/images/megatron_background/fsdp_slide2.png)
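As a hedged sketch of the difference, the same training step with PyTorch's built-in FSDP wrapper might look like the following (it assumes the process group is already initialized as in the DDP sketch above; the model and sizes are illustrative):

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).cuda()

# Each rank now stores only a shard of every parameter. Full weights are
# all-gathered just in time for a layer's forward/backward, then freed.
fsdp_model = FSDP(model)

loss = fsdp_model(torch.randn(8, 1024, device="cuda")).square().mean()
loss.backward()  # reduce-scatter leaves each rank holding only its gradient shard
```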

### Model Parallelism
Model parallelism is the catch-all term for the variety of different parallelism strategies
@@ -55,7 +55,7 @@
Pipeline parallelism is similar to FSDP, but the model blocks that are sharded a
nodes that own the model weight in question. You can think of this as a larger simulated GPU that happens to be spread
across several child GPUs. Examples of this include `parallel_state.is_pipeline_last_stage()` which is commonly
used to tell whether a particular node is on the last pipeline stage, where you compute the final head outputs, loss, etc.
![Pipeline Parallelism](../assets/images/megatron_background/pipeline_parallelism.jpg). Similarly there are convenience
![Pipeline Parallelism](site:/assets/images/megatron_background/pipeline_parallelism.png). Similarly, there is a
convenience lookup for the first pipeline stage (where you compute the embedding, for example):
`parallel_state.is_pipeline_first_stage()`.
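The following sketch shows how such lookups are typically used to branch a forward step on pipeline stage. Note the assumptions: `recv_prev`/`send_next` are hypothetical point-to-point communication callables, and the `embed`/`local_blocks`/`head` attributes are hypothetical stand-ins, not megatron-core APIs; only the `parallel_state` checks come from the text above.

```python
from megatron.core import parallel_state


def forward_step(batch, model, recv_prev, send_next):
    """Sketch of one pipeline-parallel forward step (helpers are placeholders)."""
    if parallel_state.is_pipeline_first_stage():
        hidden = model.embed(batch["tokens"])  # embedding lives on the first stage
    else:
        hidden = recv_prev()  # receive activations from the upstream stage

    hidden = model.local_blocks(hidden)  # this rank's slice of the transformer layers

    if parallel_state.is_pipeline_last_stage():
        return model.head(hidden)  # head outputs and loss only on the last stage
    send_next(hidden)  # forward activations to the downstream stage
    return None
```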

@@ -64,7 +64,7 @@
Tensor parallelism represents splitting single layers across GPUs. This can also
layers could in theory be too large to fit on a single GPU, which would make FSDP not possible. This would still work
since individual layer weights (and computations) are distributed. Examples of this in megatron include `RowParallelLinear` and
`ColumnParallelLinear` layers.
![Tensor Parallelism](site:assets/images/megatron_background/tensor_parallelism.jpg)
![Tensor Parallelism](site:assets/images/megatron_background/tensor_parallelism.png)
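Here is a conceptual, single-process sketch of the column-parallel idea in plain PyTorch, with no communication: two simulated GPUs each hold half of the output features of one linear layer. In a real `ColumnParallelLinear` the final concatenation is an all-gather across tensor-parallel ranks.

```python
import torch

x = torch.randn(8, 16)
full = torch.nn.Linear(16, 32, bias=False)

# nn.Linear stores its weight as (out_features, in_features), so chunking dim 0
# splits the output features, i.e. the "columns" of the underlying matmul.
w0, w1 = full.weight.chunk(2, dim=0)
y0 = x @ w0.t()  # computed on simulated GPU 0
y1 = x @ w1.t()  # computed on simulated GPU 1

assert torch.allclose(torch.cat([y0, y1], dim=-1), full(x), atol=1e-6)
```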

#### Sequence Parallelism
In megatron, "sequence parallelism" refers to the parallelization of the dropout and layernorm blocks of a transformer.
@@ -102,12 +102,12 @@
Below is a figure demonstrating how mixing strategies results in larger "virtual GPUs" and
fewer distinct micro-batches in flight across your cluster. Also note that the number of virtual GPUs is multiplicative,
so if you have `TP=2` and `PP=2` then you are creating a larger virtual GPU out of `2*2=4` GPUs, so your cluster size
needs to be a multiple of 4 in this case.
![Mixing Tensor and Pipeline Parallelism](site:assets/images/megatron_background/tensor_and_pipeline_parallelism.jpg)
![Mixing Tensor and Pipeline Parallelism](site:assets/images/megatron_background/tensor_and_pipeline_parallelism.png)
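The arithmetic is easy to sanity-check in a few lines (the numbers are illustrative):

```python
world_size, tp, pp = 16, 2, 2

# Each "virtual GPU" uses tp * pp physical GPUs; what remains is data parallelism.
assert world_size % (tp * pp) == 0, "cluster size must be a multiple of TP * PP"
dp = world_size // (tp * pp)
print(f"virtual GPU size = {tp * pp}, data-parallel replicas = {dp}")  # 4 and 4
```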

#### Scheduling model parallelism
You can improve on naive schedules by splitting up micro-batches into smaller pieces, executing multiple stages of the
model on single GPUs, and starting the backward pass of one micro-batch while another is still going through the forward
pass. These optimizations improve cluster GPU utilization. For example, the following figure shows how more advanced
splitting techniques in megatron (e.g., the interleaved scheduler) provide better utilization when model parallelism is
used. Again, when you can get away without using model parallelism (DDP), that is generally the best approach.
![Execution Schedulers](site:assets/images/megatron_background/execution_schedulers.jpg)
![Execution Schedulers](site:assets/images/megatron_background/execution_schedulers.png)
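As a rough illustration of why this helps, the commonly cited pipeline "bubble" estimate from the Megatron scheduling literature (stated here as an assumption, not something this page derives) shrinks as micro-batches are split more finely and stages are interleaved:

```python
def bubble_fraction(pp_stages: int, num_microbatches: int, interleave: int = 1) -> float:
    """Approximate idle fraction of a pipeline schedule: (p - 1) / (m * v)."""
    return (pp_stages - 1) / (num_microbatches * interleave)


print(bubble_fraction(4, 8))                 # naive/1F1B schedule: 0.375
print(bubble_fraction(4, 8, interleave=2))   # interleaved schedule: 0.1875
```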
4 changes: 2 additions & 2 deletions docs/docs/user-guide/contributing/code-review.md
@@ -155,7 +155,7 @@
Before becoming an approver, study this document so that you are completely familiar with the
responsibilities of reviewers and approvers. Additionally, make sure that you are intimately
familiar with our coding style guides and best practices:

- [CONTRIBUTING](CONTRIBUTING.md)
- [Contributing](contributing.md)
- In addition, make sure that you understand and can apply all elements of the
[Google Python style guide](https://google.github.io/styleguide/pyguide.html), which we adhere
to for all Python code
@@ -206,7 +206,7 @@
a fruitful interaction across the team members.
to keep working on.

- Follow code styling and rules stated in the project's documents
(e.g., [CONTRIBUTING.md](CONTRIBUTING.md), of which the [Google Python
(for example, [contributing.md](contributing.md), of which the [Google Python
Style Guide](https://google.github.io/styleguide/pyguide.html) is a subset), as these define the
look and feel of the code, establish the fundamentals of how the code should be
developed, and allow reviewers to focus on the most important aspects of a new piece of code.
5 changes: 5 additions & 0 deletions docs/docs/user-guide/contributing/contributing.md
@@ -63,6 +63,11 @@
repository (unless external constraints prevent it).


## Pull Request (PR) Guidelines

### Labeling Your PR

If you are an external contributor (not an NVIDIA employee), please add the `contribution` label to your PR before submitting. Labels can be accessed in the right sidebar of the GitHub user interface when creating or editing a PR.

### Signing Your Work

* We require that all contributors "sign-off" on their commits (not GPG signing, just adding the `-s | --signoff`
2 changes: 1 addition & 1 deletion docs/docs/user-guide/examples/bionemo-esm2/pretrain.md
@@ -272,7 +272,7 @@
llm.train(
)
```

Or simply call [`esm2_pretrain.py`](../../../../../scripts/protein/esm2/esm2_pretrain.py) directly.
Or simply call `esm2_pretrain.py` directly.
```bash
DATA_DIR=$(download_bionemo_data esm2/testdata_esm2_pretrain:2.0 --source pbss)

2 changes: 1 addition & 1 deletion docs/docs/user-guide/getting-started/access-startup.md
@@ -71,7 +71,7 @@
docker login nvcr.io

This command will prompt you to enter your API key. Fill in the details as shown below. Note that you should enter the
string `$oauthtoken` as your username. Replace the password (`<YOUR_API_KEY>`) with the API key that you generated in
the [NGC Account and API Key Configuration](#NGC-Account-and-API-Key-Configuration) section above:
the NGC Account and API Key Configuration section above:

```bash
Username: $oauthtoken
3 changes: 1 addition & 2 deletions docs/docs/user-guide/getting-started/development.md
@@ -79,8 +79,7 @@
The scripts provide various options that can be customized for pretraining, such
You can specify these options when running the script using command-line arguments. For each of the available scripts,
you can use the `--help` option for an explanation of the available options for that model.

For more information on pretraining a model, refer to the [ESM2 Pretraining
Tutorial](../examples/bionemo-esm2/pretrain.md).
For more information on pretraining a model, refer to the [ESM2 Pretraining Tutorial](../examples/bionemo-esm2/pretrain.md).

## Fine-Tuning

