"Error at extracting the layers from bucket." with the DDP communication hook, when there are other all_reduce calls outside of DDP #5
jrcavani changed the title
from: "Error at extracting the layers from bucket." with the DDP communication hook, but not with envvar setting
to: "Error at extracting the layers from bucket." with the DDP communication hook, when there are other all_reduce calls outside of DDP
(Sep 8, 2023)
Here are a few runs with various compression settings for my code out of the box. I resumed a run and measured speed and loss values. Without excluding those `DistCrossEntropyFunc` calls from compression, 4-bit / 8-bit quantization gave very unstable outcomes. If only I could exclude them from being compressed...
Hello,
I am using this library to speed up the card-to-card communication. I was able to run my code, modified to support CGX, using the Dockerfile provided (based on `pytorch:22.10-py3`). The code without the CGX modification is here.

Supplying the envvars worked fine:
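For reference, the envvar-only setup was roughly the following; the `CGX_*` variable names are taken from my reading of the CGX README, and the exact names and values I used may differ:

```python
# Sketch of the envvar-only CGX setup (no DDP comm hook). The CGX_* variable
# names are assumptions based on the CGX README and may not match exactly.
import os
import torch.distributed as dist
import torch_cgx  # registers the "cgx" process-group backend

os.environ.setdefault("CGX_COMPRESSION_QUANTIZATION_BITS", "8")  # assumed name
os.environ.setdefault("CGX_COMPRESSION_BUCKET_SIZE", "1024")     # assumed name

dist.init_process_group(
    backend="cgx",
    init_method="env://",
    rank=int(os.environ["RANK"]),
    world_size=int(os.environ["WORLD_SIZE"]),
)
```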
However, I then added the DDP communication hook.
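The registration looked roughly like this; the `CGXState` / `cgx_hook` names and the `compression_params` keys follow the CGX examples as I remember them, so they may not be exact:

```python
# Rough sketch of how the comm hook was attached. CGXState / cgx_hook and the
# compression_params keys are my recollection of the CGX examples and may differ.
import os
import torch.distributed as dist
import torchvision
from torch.nn.parallel import DistributedDataParallel as DDP
from cgx_utils import cgx_hook, CGXState  # assumed import path

local_rank = int(os.environ["LOCAL_RANK"])
backbone = torchvision.models.resnet50().cuda(local_rank)
model = DDP(backbone, device_ids=[local_rank])

state = CGXState(dist.group.WORLD,
                 compression_params={"bits": 8, "bucket_size": 1024})
model.register_comm_hook(state, cgx_hook)
```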
I got this error:
The code that throws the error is at `DistCrossEntropyFunc`. The line numbers and file names don't correspond exactly to the above exception, because I was running on internal code, but they are very similar for the purpose of illustration.

This trainer is essentially a big distributed classifier, similar to the ImageNet classifier, using DDP for distributed data parallelism over the ResNet model backbone. However, it additionally uses model parallelism for the last classification layer, and there are a few `all_reduce` calls involved outside of the model-backbone DDP.
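To make the structure concrete, here is a simplified sketch (with my own naming, not the actual `DistCrossEntropyFunc`) of the kind of autograd function involved; the point is only to show where `all_reduce` runs outside the DDP bucket machinery:

```python
# Simplified illustration of the pattern: the classification layer is sharded
# across ranks, so computing the softmax denominator needs all_reduce calls
# that run outside the DDP-wrapped backbone.
import torch
import torch.distributed as dist


class ShardedSoftmaxDenominator(torch.autograd.Function):
    """Sums exp(logits) over every rank's shard of the class dimension."""

    @staticmethod
    def forward(ctx, local_logits):
        # Stabilize with the global max over all shards.
        local_max = local_logits.max(dim=1, keepdim=True).values
        dist.all_reduce(local_max, op=dist.ReduceOp.MAX)   # all_reduce outside DDP
        exp = (local_logits - local_max).exp()
        denom = exp.sum(dim=1, keepdim=True)
        dist.all_reduce(denom, op=dist.ReduceOp.SUM)       # all_reduce outside DDP
        ctx.save_for_backward(exp)
        return denom

    @staticmethod
    def backward(ctx, grad_out):
        (exp,) = ctx.saved_tensors
        # d(denom)/d(local_logits), treating the global max as a constant.
        return grad_out * exp
```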
calls involved, outside of model backbone DDP.So my understanding of the situation is as follows. It works without the hook, and errors with the hook because in the hook there might be some assumptions on what goes into the buckets, and because there are
all_reduce
calls outside of DDP, the hook is not able to anticipate them.Is this true? I can live without the DDP hook. What I really want to do is the exclude compression for these all_reduce calls outside of the DDP backbone. In my experience even fp16 compression for these
I can live without the DDP hook. What I really want to do is exclude these `all_reduce` calls outside of the DDP backbone from compression. In my experience, even fp16 compression for these `DistCrossEntropyFunc` `all_reduce` calls makes the loss forward/backward inaccurate. If I could turn them off through something like `exclude_layers(layer_type_name)` (I can't find the function anymore), that would be great.
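Something along these lines is what I am imagining; `exclude_layers` here is purely hypothetical, since I cannot find such a function in the current code:

```python
# Hypothetical API, just to illustrate the request; this function does not
# (as far as I can tell) exist in torch_cgx today.
import torch_cgx

# Keep the model-parallel classifier's collectives uncompressed while the
# DDP backbone gradients still go through CGX compression.
torch_cgx.exclude_layers("DistCrossEntropyFunc")  # hypothetical call
```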