Final October docs edits (#313)
This PR includes final edits for October release of docs, including:
- Changing Nemo2 images to `.png` and cropping
- Fixing broken links throughout docs
- Adds labeling guidelines to contributing.md per #309

---------

Signed-off-by: Tyler Shimko <[email protected]>
tshimko-nv authored Oct 15, 2024
1 parent 8a29c8d commit 70f3dc1
Showing 20 changed files with 17 additions and 13 deletions.
(7 binary image files changed; binary content not displayed)
14 changes: 7 additions & 7 deletions docs/docs/user-guide/background/nemo2.md
@@ -24,7 +24,7 @@
Synchronization of gradients occurs after the backward pass is complete for each
that ensures all GPUs have synchronized parameters for the next iteration. Here is an example of how this might appear
on your cluster with a small model:

![Data Parallelism Diagram](site:assets/images/megatron_background/data_parallelism.jpg)
![Data Parallelism Diagram](site:assets/images/megatron_background/data_parallelism.png)
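For concreteness, here is a minimal PyTorch sketch of the DDP pattern described above. It is a sketch under assumptions: a `torchrun`-style launcher sets `LOCAL_RANK` and the rendezvous variables, and all sizes and hyperparameters are illustrative.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")  # torchrun provides MASTER_ADDR/PORT, RANK, WORLD_SIZE
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda(local_rank)
ddp_model = DDP(model, device_ids=[local_rank])  # every rank holds a full replica
opt = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

x = torch.randn(8, 1024, device=f"cuda:{local_rank}")
loss = ddp_model(x).square().mean()
loss.backward()  # gradients are all-reduced (averaged) across ranks here
opt.step()       # each rank applies the same update, keeping replicas in sync

dist.destroy_process_group()
```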

### FSDP background
FSDP extends DDP by sharding (splitting) model weights across GPUs in your cluster to optimize memory usage.
@@ -40,8 +40,8 @@
Note that this process parallelizes the storage in a way that enables too large
layer is not too large to fit on a GPU). Megatron (discussed next) co-locates both storage and compute.

The following two figures show two steps through the forward pass of a model that has been sharded with FSDP.
![FSDP Diagram Step 1](site:assets/images/megatron_background/fsdp_slide1.jpg)
![FSDP Diagram Step 2](site:assets/images/megatron_background/fsdp_slide2.jpg)
![FSDP Diagram Step 1](site:assets/images/megatron_background/fsdp_slide1.png)
![FSDP Diagram Step 2](site:assets/images/megatron_background/fsdp_slide2.png)
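As a hedged sketch of the difference, the same training step with PyTorch's built-in FSDP wrapper might look like the following (it assumes the process group is already initialized as in the DDP sketch above; the model and sizes are illustrative):

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).cuda()

# Each rank now stores only a shard of every parameter. Full weights are
# all-gathered just in time for a layer's forward/backward, then freed.
fsdp_model = FSDP(model)

loss = fsdp_model(torch.randn(8, 1024, device="cuda")).square().mean()
loss.backward()  # reduce-scatter leaves each rank holding only its gradient shard
```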

### Model Parallelism
Model parallelism is the catch-all term for the variety of different parallelism strategies
@@ -55,7 +55,7 @@
Pipeline parallelism is similar to FSDP, but the model blocks that are sharded a
nodes that own the model weight in question. You can think of this as a larger simulated GPU that happens to be spread
across several child GPUs. Examples of this include `parallel_state.is_pipeline_last_stage()` which is commonly
used to tell whether a particular node is on the last pipeline stage, where you compute the final head outputs, loss, etc.
![Pipeline Parallelism](../assets/images/megatron_background/pipeline_parallelism.jpg). Similarly there are convenience
![Pipeline Parallelism](site:/assets/images/megatron_background/pipeline_parallelism.png). Similarly, there is a
convenience lookup for the first pipeline stage (where you compute the embedding, for example):
`parallel_state.is_pipeline_first_stage()`.
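The following sketch shows how such lookups are typically used to branch a forward step on pipeline stage. Note the assumptions: `recv_prev`/`send_next` are hypothetical point-to-point communication callables, and the `embed`/`local_blocks`/`head` attributes are hypothetical stand-ins, not megatron-core APIs; only the `parallel_state` checks come from the text above.

```python
from megatron.core import parallel_state


def forward_step(batch, model, recv_prev, send_next):
    """Sketch of one pipeline-parallel forward step (helpers are placeholders)."""
    if parallel_state.is_pipeline_first_stage():
        hidden = model.embed(batch["tokens"])  # embedding lives on the first stage
    else:
        hidden = recv_prev()  # receive activations from the upstream stage

    hidden = model.local_blocks(hidden)  # this rank's slice of the transformer layers

    if parallel_state.is_pipeline_last_stage():
        return model.head(hidden)  # head outputs and loss only on the last stage
    send_next(hidden)  # forward activations to the downstream stage
    return None
```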

@@ -64,7 +64,7 @@
Tensor parallelism represents splitting single layers across GPUs. This can also
layers could in theory be too large to fit on a single GPU, which would make FSDP not possible. This would still work
since individual layer weights (and computations) are distributed. Examples of this in megatron include `RowParallelLinear` and
`ColumnParallelLinear` layers.
![Tensor Parallelism](site:assets/images/megatron_background/tensor_parallelism.jpg)
![Tensor Parallelism](site:assets/images/megatron_background/tensor_parallelism.png)
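Here is a conceptual, single-process sketch of the column-parallel idea in plain PyTorch, with no communication: two simulated GPUs each hold half of the output features of one linear layer. In a real `ColumnParallelLinear` the final concatenation is an all-gather across tensor-parallel ranks.

```python
import torch

x = torch.randn(8, 16)
full = torch.nn.Linear(16, 32, bias=False)

# nn.Linear stores its weight as (out_features, in_features), so chunking dim 0
# splits the output features, i.e. the "columns" of the underlying matmul.
w0, w1 = full.weight.chunk(2, dim=0)
y0 = x @ w0.t()  # computed on simulated GPU 0
y1 = x @ w1.t()  # computed on simulated GPU 1

assert torch.allclose(torch.cat([y0, y1], dim=-1), full(x), atol=1e-6)
```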

#### Sequence Parallelism
In megatron, "sequence parallelism" refers to the parallelization of the dropout and layernorm blocks of a transformer.
@@ -102,12 +102,12 @@
Below is a figure demonstrating how mixing strategies results in larger "virtual GPUs" and
fewer distinct micro-batches in flight across your cluster. Also note that the number of virtual GPUs is multiplicative,
so if you have `TP=2` and `PP=2` then you are creating a larger virtual GPU out of `2*2=4` GPUs, so your cluster size
needs to be a multiple of 4 in this case.
![Mixing Tensor and Pipeline Parallelism](site:assets/images/megatron_background/tensor_and_pipeline_parallelism.jpg)
![Mixing Tensor and Pipeline Parallelism](site:assets/images/megatron_background/tensor_and_pipeline_parallelism.png)
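The arithmetic is easy to sanity-check in a few lines (the numbers are illustrative):

```python
world_size, tp, pp = 16, 2, 2

# Each "virtual GPU" uses tp * pp physical GPUs; what remains is data parallelism.
assert world_size % (tp * pp) == 0, "cluster size must be a multiple of TP * PP"
dp = world_size // (tp * pp)
print(f"virtual GPU size = {tp * pp}, data-parallel replicas = {dp}")  # 4 and 4
```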

#### Scheduling model parallelism
You can improve on naive schedules by splitting up micro-batches into smaller pieces, executing multiple stages of the
model on single GPUs, and starting the backward pass of one micro-batch while another is still going through the forward
pass. These optimizations improve cluster GPU utilization. For example, the following figure shows how more advanced
splitting techniques in megatron (e.g., the interleaved scheduler) provide better utilization when model parallelism is
used. Again, when you can get away without using model parallelism (DDP), that is generally the best approach.
![Execution Schedulers](site:assets/images/megatron_background/execution_schedulers.jpg)
![Execution Schedulers](site:assets/images/megatron_background/execution_schedulers.png)
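As a rough illustration of why this helps, the commonly cited pipeline "bubble" estimate from the Megatron scheduling literature (stated here as an assumption, not something this page derives) shrinks as micro-batches are split more finely and stages are interleaved:

```python
def bubble_fraction(pp_stages: int, num_microbatches: int, interleave: int = 1) -> float:
    """Approximate idle fraction of a pipeline schedule: (p - 1) / (m * v)."""
    return (pp_stages - 1) / (num_microbatches * interleave)


print(bubble_fraction(4, 8))                 # naive/1F1B schedule: 0.375
print(bubble_fraction(4, 8, interleave=2))   # interleaved schedule: 0.1875
```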
4 changes: 2 additions & 2 deletions docs/docs/user-guide/contributing/code-review.md
@@ -155,7 +155,7 @@
Before becoming an approver, study this document so that you are completely familiar with the
responsibilities of reviewers and approvers. Additionally, make sure that you are intimately
familiar with our coding style guides and best practices:

- [CONTRIBUTING](CONTRIBUTING.md)
- [Contributing](contributing.md)
- In addition, make sure that you understand and can apply all elements of the
[Google Python style guide](https://google.github.io/styleguide/pyguide.html), which we adhere
to for all Python code
@@ -206,7 +206,7 @@
a fruitful interaction across the team members.
to keep working on.

- Follow code styling and rules stated in the project's documents
(e.g., [CONTRIBUTING.md](CONTRIBUTING.md), of which the [Google Python
(for example, [contributing.md](contributing.md), of which the [Google Python
Style Guide](https://google.github.io/styleguide/pyguide.html) is a subset), as these define the
look and feel of the code, establish the fundamentals of how the code should be
developed, and allow reviewers to focus on the most important aspects of a new piece of code.
5 changes: 5 additions & 0 deletions docs/docs/user-guide/contributing/contributing.md
@@ -63,6 +63,11 @@
repository (unless external constraints prevent it).


## Pull Request (PR) Guidelines

### Labeling Your PR

If you are an external contributor (not an NVIDIA employee), please add the `contribution` label to your PR before submitting. Labels can be accessed in the right sidebar of the GitHub user interface when creating or editing a PR.

### Signing Your Work

* We require that all contributors "sign-off" on their commits (not GPG signing, just adding the `-s | --signoff`
2 changes: 1 addition & 1 deletion docs/docs/user-guide/examples/bionemo-esm2/pretrain.md
@@ -272,7 +272,7 @@
llm.train(
)
```

Or simply call [`esm2_pretrain.py`](../../../../../scripts/protein/esm2/esm2_pretrain.py) directly.
Or simply call `esm2_pretrain.py` directly.
```bash
DATA_DIR=$(download_bionemo_data esm2/testdata_esm2_pretrain:2.0 --source pbss)

2 changes: 1 addition & 1 deletion docs/docs/user-guide/getting-started/access-startup.md
@@ -71,7 +71,7 @@
docker login nvcr.io

This command will prompt you to enter your API key. Fill in the details as shown below. Note that you should enter the
string `$oauthtoken` as your username. Replace the password (`<YOUR_API_KEY>`) with the API key that you generated in
the [NGC Account and API Key Configuration](#NGC-Account-and-API-Key-Configuration) section above:
the NGC Account and API Key Configuration section above:

```bash
Username: $oauthtoken
3 changes: 1 addition & 2 deletions docs/docs/user-guide/getting-started/development.md
@@ -79,8 +79,7 @@
The scripts provide various options that can be customized for pretraining, such
You can specify these options when running the script using command-line arguments. For each of the available scripts,
you can use the `--help` option for an explanation of the available options for that model.

For more information on pretraining a model, refer to the [ESM2 Pretraining
Tutorial](../examples/bionemo-esm2/pretrain.md).
For more information on pretraining a model, refer to the [ESM2 Pretraining Tutorial](../examples/bionemo-esm2/pretrain.md).

## Fine-Tuning

