Add single node Neuron test to the e2e tester #450

Closed

@weicongw (Contributor) commented Jun 18, 2024

Issue #, if available:

Description of changes:
This PR adds single-node Neuron tests to the e2e2 tester. These tests serve as unit tests for the Neuron device and include the following:

  • testNeuronSingleAllReduce: Tests all-reduce NCCL communications.
  • testNeuronMlp: Runs simple training tasks.
  • testNeuronParallelState: Tests parallel distribution of data and model.

These test scripts are copied from https://github.com/aws/deep-learning-containers/blob/master/test/dlc_tests/container_tests/bin/pytorch_tests

Testing

 go test -v . -args -neuronSingleNodeTestImage public.ecr.aws/o5d5x8n6/weicongw:latest
=== RUN   TestMPIJobPytorchTraining
=== RUN   TestMPIJobPytorchTraining/single-node
=== RUN   TestMPIJobPytorchTraining/single-node/Single_node_test_Job_succeeds
--- PASS: TestMPIJobPytorchTraining (110.44s)
    --- PASS: TestMPIJobPytorchTraining/single-node (110.44s)
        --- PASS: TestMPIJobPytorchTraining/single-node/Single_node_test_Job_succeeds (110.08s)
PASS
ok      github.com/aws/aws-k8s-tester/e2e2/test/cases/neuron    117.961s
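
For readers unfamiliar with passing custom flags through `go test -args`, here is a minimal sketch of how a flag such as `-neuronSingleNodeTestImage` could be parsed and validated. Only the flag name and the image URI come from the command above; the helper `parseImageFlag` and its error message are hypothetical and are not taken from this PR. In a real test package the flag would more likely be registered on the default flag set at package level.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// parseImageFlag is a hypothetical helper, not code from this PR.
// It parses the arguments that `go test` forwards after `-args`
// and returns the value of -neuronSingleNodeTestImage.
func parseImageFlag(args []string) (string, error) {
	fs := flag.NewFlagSet("neuron", flag.ContinueOnError)
	image := fs.String("neuronSingleNodeTestImage", "",
		"container image used by the single-node Neuron test")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	// Fail early instead of submitting a Job with an empty image.
	if *image == "" {
		return "", fmt.Errorf("-neuronSingleNodeTestImage must be set")
	}
	return *image, nil
}

func main() {
	image, err := parseImageFlag(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("using image:", image)
}
```

Validating the flag before any cluster resources are created keeps a missing image from surfacing later as a less obvious Job failure.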

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@weicongw weicongw marked this pull request as ready for review June 18, 2024 20:25
@cartermckinnon (Member)

also please add a CI job that tests the image build for the new dockerfile 👍

@@ -0,0 +1,5 @@
# Start with the NVIDIA CUDA base image
Member

comment needs to be updated

Member

we don't need this in addition to Dockerfile.neuronx-tests, right?

Member

you've got dupes of these files with and without extensions

Member

I assume we're just lifting these from somewhere else, can you add code comments in the scripts with permalinks to the original source?
