Add single node Neuron test to the e2e tester #452

Merged
merged 3 commits into aws:main on Jul 3, 2024

Conversation

weicongw
Contributor

Issue #, if available:

Description of changes:
I accidentally closed #450, so I am reopening it as this PR.

This PR adds single-node Neuron tests to the e2e2 tester. These tests serve as unit tests for the Neuron device and include the following:

  • testNeuronSingleAllReduce: Tests all-reduce NCCL communications.
  • testNeuronMlp: Runs simple training tasks.
  • testNeuronParallelState: Tests parallel distribution of data and model.

These test scripts are replicated from https://github.com/aws/deep-learning-containers/blob/master/test/dlc_tests/container_tests/bin/pytorch_tests
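
As background for testNeuronSingleAllReduce, here is a minimal single-process sketch of what a sum all-reduce is expected to produce: every rank ends up holding the element-wise sum of all ranks' buffers. This is not the test script from the linked repository and does not touch Neuron devices; the function name allReduceSum and the in-memory "ranks" are assumptions made purely for illustration.

```go
package main

import "fmt"

// allReduceSum simulates a sum all-reduce in one process (illustrative only).
// Each inner slice stands in for one rank's local buffer. After the call,
// every buffer holds the element-wise sum over all ranks.
func allReduceSum(ranks [][]float64) error {
	if len(ranks) == 0 {
		return nil
	}
	n := len(ranks[0])
	sum := make([]float64, n)
	for r, buf := range ranks {
		if len(buf) != n {
			return fmt.Errorf("rank %d has %d elements, want %d", r, len(buf), n)
		}
		for i, v := range buf {
			sum[i] += v
		}
	}
	// Write the reduced result back to every rank.
	for _, buf := range ranks {
		copy(buf, sum)
	}
	return nil
}

func main() {
	ranks := [][]float64{{1, 2}, {3, 4}}
	if err := allReduceSum(ranks); err != nil {
		panic(err)
	}
	fmt.Println(ranks) // prints [[4 6] [4 6]]
}
```

A real all-reduce test would compare each device's output buffer against such an expected sum.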

Testing

 go test -v . -args -neuronSingleNodeTestImage public.ecr.aws/o5d5x8n6/weicongw:latest
=== RUN   TestMPIJobPytorchTraining
=== RUN   TestMPIJobPytorchTraining/single-node
=== RUN   TestMPIJobPytorchTraining/single-node/Single_node_test_Job_succeeds
--- PASS: TestMPIJobPytorchTraining (110.44s)
    --- PASS: TestMPIJobPytorchTraining/single-node (110.44s)
        --- PASS: TestMPIJobPytorchTraining/single-node/Single_node_test_Job_succeeds (110.08s)
PASS
ok      github.com/aws/aws-k8s-tester/e2e2/test/cases/neuron    117.961s
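
In the command above, `-args` makes `go test` pass the remaining arguments to the test binary, so the package has to define `-neuronSingleNodeTestImage` itself. This page does not show how main_test.go does that; the following is a minimal sketch using the standard `flag` package. The helper name parseImageFlag, the usage string, and the empty-value check are assumptions, and a dedicated FlagSet is used only to keep the example self-contained.

```go
package main

import (
	"flag"
	"fmt"
)

// parseImageFlag is a hypothetical helper: it parses a
// -neuronSingleNodeTestImage flag from args and rejects an empty value.
func parseImageFlag(args []string) (string, error) {
	fs := flag.NewFlagSet("neuron", flag.ContinueOnError)
	image := fs.String("neuronSingleNodeTestImage", "",
		"container image used by the single-node Neuron test (assumed usage text)")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	if *image == "" {
		return "", fmt.Errorf("neuronSingleNodeTestImage must be set")
	}
	return *image, nil
}

func main() {
	image, err := parseImageFlag([]string{
		"-neuronSingleNodeTestImage", "public.ecr.aws/o5d5x8n6/weicongw:latest",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(image)
}
```

Failing fast on a missing image gives a clearer error than letting the Job fail later on an empty image reference.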

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Resolved review threads on:

  • e2e2/test/cases/neuron/main_test.go
  • e2e2/test/cases/neuron/neuron_test.go
  • e2e2/internal/framework_extensions/conditions.go
  • e2e2/test/images/neuron/Dockerfile
@cartermckinnon cartermckinnon merged commit 1da1bf3 into aws:main Jul 3, 2024
4 checks passed