Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug with forked S3PathHandler #16

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

DavidMChan
Copy link

When using the S3PathHandler in a large number of distributed pytorch dataloaders, you can get errors of the form:

SSLError: [SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:1748)

As discussed in dask/dask#1292, this is caused when the PID of the process that owns the client is not the same as the PID of the process that makes the S3 request.

This PR addresses the issue by storing the PID of the client, and refreshing the client when the PIDs no longer match.

The S3PathHandler suffers from the same problem described here: dask/dask#1292. This commit stores the PID that the client was created for, and refreshes the client if the PID does not match (if the PathManager is now part of a different process).
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 29, 2022
@sujitoc
Copy link
Contributor

sujitoc commented Aug 22, 2022

@DavidMChan can you elaborate on the use case that exposes this error?

@DavidMChan
Copy link
Author

Sure - this happens when you're training a model (We're training HuBERT) in fairseq using iopath's S3 manager. It seems like the fork which fairseq executes can cause this SSL error (probably because of the same underlying root cause as the Dask issue above).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants