Add BERT e2e training test #467

Open · wants to merge 50 commits into base: main
Conversation

@mattcjo (Contributor) commented on Aug 7, 2024

Issue #, if available:

Description of changes:

This change adds an E2E BERT training test. The test was validated on a four-node cluster of p3.16xlarge instances.

The results of running the training test are shown below. These logs were obtained from the master pod that coordinated the E2E BERT training job.

[1,31]<stdout>:Process 31 - Training time: 10.09 seconds
[1,31]<stdout>:Process 31 - Throughput: 9.91 samples/second
[1,29]<stdout>:Process 29 - Training time: 10.05 seconds
[1,29]<stdout>:Process 29 - Throughput: 9.95 samples/second
[1,28]<stdout>:Process 28 - Training time: 10.09 seconds
[1,28]<stdout>:Process 28 - Throughput: 9.91 samples/second
[1,25]<stdout>:Process 25 - Training time: 10.04 seconds
[1,25]<stdout>:Process 25 - Throughput: 9.96 samples/second
[1,27]<stdout>:Process 27 - Training time: 10.10 seconds
[1,27]<stdout>:Process 27 - Throughput: 9.90 samples/second
[1,20]<stdout>:Process 20 - Training time: 10.09 seconds
[1,20]<stdout>:Process 20 - Throughput: 9.91 samples/second
[1,3]<stdout>:Process 3 - Training time: 10.07 seconds
[1,3]<stdout>:Process 3 - Throughput: 9.93 samples/second
[1,0]<stdout>:Process 0 - Training time: 10.03 seconds
[1,0]<stdout>:Process 0 - Throughput: 9.97 samples/second
[1,23]<stdout>:Process 23 - Training time: 10.04 seconds
[1,23]<stdout>:Process 23 - Throughput: 9.96 samples/second
[1,24]<stdout>:Process 24 - Training time: 10.10 seconds
[1,24]<stdout>:Process 24 - Throughput: 9.90 samples/second
[1,2]<stdout>:Process 2 - Training time: 10.14 seconds
[1,2]<stdout>:Process 2 - Throughput: 9.86 samples/second
[1,5]<stdout>:Process 5 - Training time: 10.08 seconds
[1,5]<stdout>:Process 5 - Throughput: 9.92 samples/second
[1,21]<stdout>:Process 21 - Training time: 10.08 seconds
[1,21]<stdout>:Process 21 - Throughput: 9.92 samples/second
[1,22]<stdout>:Process 22 - Training time: 10.07 seconds
[1,22]<stdout>:Process 22 - Throughput: 9.93 samples/second
[1,30]<stdout>:Process 30 - Training time: 10.09 seconds
[1,30]<stdout>:Process 30 - Throughput: 9.91 samples/second
[1,1]<stdout>:Process 1 - Training time: 10.07 seconds
[1,1]<stdout>:Process 1 - Throughput: 9.93 samples/second
[1,17]<stdout>:Process 17 - Training time: 10.11 seconds
[1,17]<stdout>:Process 17 - Throughput: 9.89 samples/second
[1,12]<stdout>:Process 12 - Training time: 10.01 seconds
[1,12]<stdout>:Process 12 - Throughput: 9.99 samples/second
[1,6]<stdout>:Process 6 - Training time: 10.04 seconds
[1,6]<stdout>:Process 6 - Throughput: 9.96 samples/second
[1,18]<stdout>:Process 18 - Training time: 10.12 seconds
[1,18]<stdout>:Process 18 - Throughput: 9.88 samples/second
[1,7]<stdout>:Process 7 - Training time: 10.11 seconds
[1,7]<stdout>:Process 7 - Throughput: 9.89 samples/second
[1,15]<stdout>:Process 15 - Training time: 10.14 seconds
[1,15]<stdout>:Process 15 - Throughput: 9.86 samples/second
[1,19]<stdout>:Process 19 - Training time: 10.12 seconds
[1,19]<stdout>:Process 19 - Throughput: 9.89 samples/second
[1,14]<stdout>:Process 14 - Training time: 9.96 seconds
[1,14]<stdout>:Process 14 - Throughput: 10.04 samples/second
[1,13]<stdout>:Process 13 - Training time: 10.05 seconds
[1,13]<stdout>:Process 13 - Throughput: 9.95 samples/second
[1,16]<stdout>:Process 16 - Training time: 10.10 seconds
[1,16]<stdout>:Process 16 - Throughput: 9.90 samples/second
[1,26]<stdout>:Process 26 - Training time: 10.11 seconds
[1,26]<stdout>:Process 26 - Throughput: 9.89 samples/second
[1,10]<stdout>:Process 10 - Training time: 10.12 seconds
[1,10]<stdout>:Process 10 - Throughput: 9.88 samples/second
[1,11]<stdout>:Process 11 - Training time: 10.10 seconds
[1,11]<stdout>:Process 11 - Throughput: 9.90 samples/second
[1,8]<stdout>:Process 8 - Training time: 10.09 seconds
[1,8]<stdout>:Process 8 - Throughput: 9.91 samples/second
[1,4]<stdout>:Process 4 - Training time: 10.05 seconds
[1,4]<stdout>:Process 4 - Throughput: 9.95 samples/second
[1,9]<stdout>:Process 9 - Training time: 10.08 seconds
[1,9]<stdout>:Process 9 - Throughput: 9.92 samples/second
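Not part of the PR itself, just a helper sketch for eyeballing the logs above: each rank reports a training time and a throughput, and multiplying the two suggests each process saw roughly 100 samples. A minimal Python parser for these MPI-prefixed lines (the regex and function name are my own, not from the test code):

```python
import re

# Matches "Process N - Training time: X" and "Process N - Throughput: Y"
# inside the MPI-prefixed stdout lines shown above.
LINE_RE = re.compile(r"Process (\d+) - (Training time|Throughput): ([\d.]+)")

def parse_metrics(log_lines):
    """Collect per-rank training time and throughput from raw log lines."""
    metrics = {}
    for line in log_lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        rank, kind, value = int(m.group(1)), m.group(2), float(m.group(3))
        key = "time" if kind == "Training time" else "throughput"
        metrics.setdefault(rank, {})[key] = value
    return metrics

sample = [
    "[1,31]<stdout>:Process 31 - Training time: 10.09 seconds",
    "[1,31]<stdout>:Process 31 - Throughput: 9.91 samples/second",
]
parsed = parse_metrics(sample)
# throughput * time: 9.91 * 10.09 is roughly 100 samples per rank
print(parsed[31])  # -> {'time': 10.09, 'throughput': 9.91}
```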

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

mattcjo and others added 30 commits June 26, 2024 21:15
Comment on lines +74 to +80
resources:
requests:
nvidia.com/gpu: 8
vpc.amazonaws.com/efa: 0
limits:
nvidia.com/gpu: 8
vpc.amazonaws.com/efa: 0
Contributor

I don't think we should hardcode this, since we might need to run it in different node configurations (e.g. node type, node count).
See these references for how to avoid hardcoding it:
https://github.com/aws/aws-k8s-tester/blob/main/e2e2/test/cases/nvidia/main_test.go#L98-L144
https://github.com/aws/aws-k8s-tester/blob/main/e2e2/test/cases/nvidia/manifests/mpi-job-nccl-test-multi-node.yaml

Contributor Author


Sure, we can parameterize it for future-proofing. Right now all tests will be run on instances with 8 NVIDIA GPUs, but I have no problem with this. Will make the update.
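A minimal sketch of the parameterization idea being discussed (the repo's Go tests do this via manifest template rendering, per the links above; the placeholder names and this Python form are illustrative assumptions, not the actual implementation):

```python
from string import Template

# Hypothetical manifest fragment with placeholders instead of the
# hardcoded counts from the diff above.
MANIFEST_TEMPLATE = Template("""\
resources:
  requests:
    nvidia.com/gpu: $gpus_per_node
    vpc.amazonaws.com/efa: $efa_per_node
  limits:
    nvidia.com/gpu: $gpus_per_node
    vpc.amazonaws.com/efa: $efa_per_node
""")

def render_resources(gpus_per_node: int, efa_per_node: int = 0) -> str:
    """Substitute per-node device counts into the manifest fragment."""
    return MANIFEST_TEMPLATE.substitute(
        gpus_per_node=gpus_per_node, efa_per_node=efa_per_node
    )

print(render_resources(8))
```

Rendering with `render_resources(4)` would then target a 4-GPU instance type without touching the manifest.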

return ctx
}).
Teardown(func(ctx context.Context, t *testing.T, cfg *envconf.Config) context.Context {
// Delete the manifest
Contributor


If we print out the pod logs before deleting the resources, it will make troubleshooting test failures easier.
Reference: https://github.com/aws/aws-k8s-tester/blob/main/e2e2/test/cases/nvidia/mpi_test.go#L128-L135

@@ -85,6 +86,8 @@ def train_bert(rank, world_size, local_rank, model, tokenizer):

start_time = time.time()

print(f"starting training for rank: {rank}")

for epoch in range(1): # Short run for testing
ddp_model.train()
for batch in train_dataloader:
Contributor


The test is broken.

[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>:  File "/app/train.py", line 138, in <module>
[1,0]<stderr>:    main()
[1,0]<stderr>:  File "/app/train.py", line 123, in main
[1,0]<stderr>:    num_gpus_per_node = int(os.environ["NUM_GPUS_PER_NODE"]) 
[1,0]<stderr>:  File "/usr/local/lib/python3.10/os.py", line 680, in __getitem__
[1,0]<stderr>:    raise KeyError(key) from None
[1,0]<stderr>:KeyError: 'NUM_GPUS_PER_NODE'
[1,1]<stderr>:Traceback (most recent call last):
[1,1]<stderr>:  File "/app/train.py", line 138, in <module>
[1,1]<stderr>:    main()
[1,1]<stderr>:  File "/app/train.py", line 123, in main
[1,1]<stderr>:    num_gpus_per_node = int(os.environ["NUM_GPUS_PER_NODE"]) 
[1,1]<stderr>:  File "/usr/local/lib/python3.10/os.py", line 680, in __getitem__
[1,1]<stderr>:    raise KeyError(key) from None
[1,1]<stderr>:KeyError: 'NUM_GPUS_PER_NODE'
[1,2]<stderr>:Traceback (most recent call last):
[1,2]<stderr>:  File "/app/train.py", line 138, in <module>
[1,2]<stderr>:    main()
[1,2]<stderr>:  File "/app/train.py", line 123, in main
[1,2]<stderr>:    num_gpus_per_node = int(os.environ["NUM_GPUS_PER_NODE"]) 
[1,2]<stderr>:  File "/usr/local/lib/python3.10/os.py", line 680, in __getitem__
[1,2]<stderr>:    raise KeyError(key) from None
[1,2]<stderr>:KeyError: 'NUM_GPUS_PER_NODE'
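The traceback shows `os.environ["NUM_GPUS_PER_NODE"]` raising `KeyError` because the variable is never set in the job manifest. Beyond adding the env var to the manifest, the script could read it defensively; a sketch (the helper name and fallback value are assumptions, not code from the PR):

```python
import os

def get_num_gpus_per_node(default: int = 8) -> int:
    """Read NUM_GPUS_PER_NODE, failing with a clear message on bad values."""
    raw = os.environ.get("NUM_GPUS_PER_NODE")
    if raw is None:
        # Assumed fallback; the manifest should still set this explicitly.
        return default
    try:
        return int(raw)
    except ValueError:
        raise SystemExit(f"NUM_GPUS_PER_NODE must be an integer, got {raw!r}")

os.environ["NUM_GPUS_PER_NODE"] = "8"
print(get_num_gpus_per_node())  # -> 8
```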

Contributor


Also, you forgot to add the test binary to Dockerfile.kubetest2.
