Add docker image for BERT e2e training task #454
Conversation
.github/workflows/ci.yaml (outdated)
steps:
  - uses: actions/checkout@v3
  - run: docker build --file e2e2/test/images/bert-training/Dockerfile.bert-training e2e2/test/images/bert-training
An idea for the future (since there will be more to come): maybe we standardize an images/XXX/{Dockerfile, ...} structure for all our images and then create a job matrix for the test image builds.
os.environ['MASTER_ADDR'] = os.environ['MASTER_ADDR']  # Kubernetes sets this
os.environ['MASTER_PORT'] = os.environ['MASTER_PORT']  # Kubernetes sets this
move this to docker ENV
with defaults
What are your thoughts on just setting the defaults here in the script (I should have done this in the first place) versus in the dockerfile?
The default values should only be used during local dev, which is why I could see them belonging in the training script. Kubernetes should set them at runtime.
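A minimal sketch of what script-side defaults could look like (the fallback values here are assumptions for local development; Kubernetes would override them through the pod spec):

```python
import os

# Assumed local-dev fallbacks; in-cluster, Kubernetes injects the real values.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "12355")
```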
"Kubernetes should set them at runtime."

Meaning it's part of the container spec, right? Those envs should override what's in the docker ENV, so I think it generally makes sense to put it in the Dockerfile/image definition.
Yep, it would be in the container spec. Works for me, I'll make the change.
@ndbaker1 Done
print(f"Process {rank} - Training time: {training_time:.2f} seconds") | ||
print(f"Process {rank} - Throughput: {throughput:.2f} samples/second") |
Do we need to dump this output to disk so we can use it to upload to S3?
Potentially... I was also considering writing directly to S3, but was curious to hear others' perspectives. My intuition says writing to S3 is the long-term solution (once a stable schema is solidified), but in the short term just doing something like writing to disk or stdout might be the way to go.
I agree it can go to S3, CloudWatch, etc. once we know where this is going. We should definitely have this output printed as well, though.
It should be fine to dump to disk for the short term, and you have enough flexibility to POC different long-term destinations.
@Issacwww Are there any concerns/considerations with writing from the container to the host machine?
Oh, good callout. This runs on the tod worker, so dumping to disk is no different from stdout... stdout should be fine for now.
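Since the metrics are staying on stdout for now, one way to keep them easy to ship to S3 or CloudWatch later is to print a single structured record. This is just a sketch, and the field names are made up for illustration:

```python
import json

# Placeholder values; in train.py these come from the timing code shown above.
rank, training_time, throughput = 0, 123.45, 678.90

# Hypothetical metrics record; the field names are illustrative, not a settled schema.
metrics = {
    "rank": rank,
    "training_time_seconds": round(training_time, 2),
    "throughput_samples_per_second": round(throughput, 2),
}

# A single JSON line on stdout is easy to read now and to redirect to S3/CloudWatch later.
print(json.dumps(metrics))
```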
start_time = time.time()

for epoch in range(1): # Short run for testing
Should we let the program read the epoch count from an environment variable or argument? That way we could allow larger instances (e.g. p5) to run more epochs without changing the code.
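If we went that route, the change could be as small as the following sketch (EPOCHS is a hypothetical variable name, and the default of 1 preserves today's short run):

```python
import os

# Hypothetical: EPOCHS would come from the pod spec or a CLI argument;
# the default keeps the current single-epoch test run.
num_epochs = int(os.environ.get("EPOCHS", "1"))

for epoch in range(num_epochs):
    ...  # training loop body unchanged
```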
What would be the purpose of having more epochs for larger instance sizes? Are you thinking about it purely from the perspective of wanting the tests to last the same amount of time for each instance type?
Just some random thoughts. I was thinking we could run more epochs for larger instances to get more accurate performance data. Additionally, we could reuse this code for our future long-running tests (like soak tests).
Gotcha... Yeah, I certainly appreciate the idea behind reusability, but there's a good chance this current test isn't the best option for a soak test anyway.
As far as more epochs for larger instance types go, it depends on what your end goal is. For the tests we're running, and the metrics we're looking to gather, I don't see any benefit in doing this at this time.
Had a discussion with @cartermckinnon. I think we could reuse the e2e2/test/images/nvidia/Dockerfile, so we don't need to maintain multiple images and Dockerfiles.
I mean that's fine from a base image perspective, since many of the dependencies will be shared among test types (i.e. unit/training/inference), but training and inference will both require unique dependencies on top of what's included in e2e2/test/images/nvidia/Dockerfile. The dependencies between training and inference might be the same at this point, but this could very well change in the future.
Can we add those unique dependencies to the e2e2/test/images/nvidia/Dockerfile?
@cartermckinnon I agree though that further thought needs to be put into the test directory structure before we go too much further. I'm not sure how many more tests we're looking to add, but the current approach doesn't scale particularly well.
@weicongw I mean, sure, we can, but then you're adding another ~7GB of deps to that image, which are totally unnecessary for the unit tests. Also, if we ever add another test (e.g. ResNet), it could very well have its own unique dependencies as well. This will especially be true if we ever want to validate frameworks other than the ones currently being utilized.
A few major updates were made with the last few commits. The GPU was incorrectly being assigned based on the process's world rank rather than its local rank, which led to a failure when trying to run on a multi-node cluster. I've rectified the problem, and the script now runs successfully on a multi-node cluster. Here's the output from a successful workload run on a cluster of 4 nodes (collapsed output omitted).
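For context, the core of the fix is roughly the following sketch (the GPUS_PER_NODE variable and the defaults are assumptions for illustration, not the literal train.py code):

```python
import os
import torch

# Sketch of the fix described above: pin each process to a GPU by its rank
# *within the node* (local rank), not its global world rank across the cluster.
num_gpus_per_node = int(os.environ.get("GPUS_PER_NODE", "8"))  # assumed env var and default
world_rank = int(os.environ.get("RANK", "0"))                  # global rank across all processes
local_rank = world_rank % num_gpus_per_node                    # rank within this node
torch.cuda.set_device(local_rank)
device = torch.device(f"cuda:{local_rank}")
```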
# TODO: Consider parameterizing for nodes of any GPU count
num_gpus_per_node = 8 # Adjust this based on your setup
We need to do this now, right, in case we want to call a job with a different instance type entirely? I guess the testing instances have had the same capacity so far.
Yeah, at the moment all instance types requested have 8 GPUs. I don't really ever see us running the training test on any instance type with less than 8 GPUs, unless EC2 suddenly changes their pattern for instance configurations.
It would future-proof things, but it would also add a touch more complexity and another point of failure if we have to configure it at runtime.
@ndbaker1 I'd actually probably advocate for taking out the TODO comment and just leaving it hard-coded. Thoughts?
Was thinking we could just add an ENV GPUS_PER_NODE=8 in the Dockerfile, and it wouldn't complicate much, right?
Yeah, that works. I'm thinking further downstream (upstream..?) for parameterization of the manifest, and collection of the number of GPUs on a node by our Go test.
So it would really need to be something like ARG GPUS_PER_NODE=8, unless I'm misunderstanding something.
@ndbaker1 Made the change, verified with local testing
Issue #, if available:
Description of changes:
A distributed training script (e2e2/test/images/bert-training/train.py) has been added, along with its dependencies (e2e2/test/images/bert-training/requirements.txt), in a new Dockerfile (e2e2/test/images/bert-training/Dockerfile.bert-training). Building the Dockerfile will produce an image that runs a distributed BERT training job.
The testing of the docker image took place on a p3.16xlarge instance utilizing the AMI ami-05e885690ca33b527. The goal of the image is to start an isolated training process per GPU, with communication between the processes to consolidate the weights from each one. The test is run for a single epoch. The results show that the docker image starts up and executes the distributed BERT training job as expected.
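For reference, the process-per-GPU pattern with weight consolidation described above generally looks like the following sketch (the function and variable names are illustrative, not the exact contents of train.py):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_for_distributed_training(model: torch.nn.Module) -> DDP:
    # One process per GPU; the NCCL backend all-reduces gradients so every
    # process ends up with the same consolidated weights after each step.
    dist.init_process_group(backend="nccl")  # reads MASTER_ADDR/MASTER_PORT from the environment
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)
    return DDP(model.to(f"cuda:{local_rank}"), device_ids=[local_rank])
```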
Also included in this PR is a new GitHub Action to verify that the docker image builds successfully.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.