Docker shared memory issue and solution #369

Open
peteflorence opened this issue Feb 14, 2019 · 15 comments
@peteflorence
Collaborator

peteflorence commented Feb 14, 2019

I am not sure if this is happening in our various other configurations, but it was happening in my spartan Docker container, inside which I had installed PyTorch and was trying to do some training.

Symptom

I was getting an error along the lines of "Bus error (core dumped) model share memory". It's related to this issue: pytorch/pytorch#2244
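
For context, here is a minimal sketch (my own reconstruction, not the actual training code) of the kind of PyTorch usage that hits the limit: with num_workers > 0, DataLoader worker processes hand batches back to the main process through shared memory in /dev/shm.

# Minimal sketch only -- the dataset shapes and sizes are made up for illustration.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(256, 3, 224, 224))  # dummy image tensors

# With num_workers > 0, each prefetched batch is passed through /dev/shm, so a
# 64M shm mount fills up quickly and the job dies with "Bus error (core dumped)".
loader = DataLoader(dataset, batch_size=32, num_workers=4)

for (batch,) in loader:
    pass  # training step would go here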

Cause

The comments by apaszke (a PyTorch author) are helpful here (pytorch/pytorch#1355 (comment)). Running inside the Docker container, it appears the only available shared memory is 64 MB:

peteflo@08482dc37efa:~$ df -h | grep shm
shm              64M     0   64M   0% /dev/shm

Temp Solution

As mentioned by apaszke,

sudo mount -o remount,size=8G /dev/shm

(choose more than 8G if you'd like)

This fixes it, as visible here:

peteflo@08482dc37efa:~$ df -h | grep shm
shm             8.0G     0  8.0G   0% /dev/shm

Other notes

In some places on the internet you will find that --ipc=host (or other flags to docker run) is supposed to avoid this issue, but those didn't work for me, and they require re-creating the container. I suspect something about my configuration is wrong. The remount above fixes it even from inside an already-running container.
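
For reference, a rough sketch of where those flags would go when launching a container (this is not the project's actual docker_run.py; the image name is a placeholder). --ipc=host makes the container share the host's IPC namespace, so /dev/shm inside the container is the host's; --shm-size keeps the container's private /dev/shm but makes it bigger. Either way, the flag has to be supplied when the container is created.

# Rough sketch only -- "spartan-image" is a placeholder image name.
import subprocess

image = "spartan-image"

# Option 1: share the host's IPC namespace (and therefore the host's /dev/shm).
ipc_cmd = ["docker", "run", "-it", "--ipc=host", image, "/bin/bash"]

# Option 2: keep the container's own /dev/shm, but allocate 8G for it.
shm_cmd = ["docker", "run", "-it", "--shm-size", "8G", image, "/bin/bash"]

subprocess.check_call(ipc_cmd)  # either command works; flags go before the image name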

Long term solution

It would first be useful to identify whether anybody else's Docker containers have this issue, which can be checked simply by running df -h | grep shm inside the container. Then we could diagnose who it is happening to and why. It might just be me.

@peteflorence
Collaborator Author

@manuelli @gizatt @weigao95

Has anybody else seen this?

@gizatt
Collaborator

gizatt commented Feb 14, 2019 via email

@peteflorence
Collaborator Author

Yes, that would work, but first I would like to ascertain whether anybody else has this issue.

I've done a lot of work with PyTorch in Docker before and haven't hit this, so I would like to understand what's different.

It's easy to test your own Docker setup; just run:

df -h | grep shm

@gizatt
Collaborator

gizatt commented Feb 14, 2019 via email

@peteflorence
Collaborator Author

peteflorence commented Feb 14, 2019 via email

@gizatt
Collaborator

gizatt commented Feb 14, 2019 via email

@patmarion
Member

Why not use: docker run --shm-size 8G

@peteflorence
Collaborator Author

peteflorence commented Feb 14, 2019 via email

@manuelli
Collaborator

Yeah, I have it inside my spartan container as well:

manuelli@paladin-44:~/spartan$ df -h | grep shm
shm              64M     0   64M   0% /dev/shm

But inside the pdc container I have 31G:

manuelli@paladin-44:~/code$ df -h | grep shm
tmpfs            32G  882M   31G   3% /dev/shm

So there must be something different between the pdc and spartan Docker containers that is causing this.

@peteflorence
Collaborator Author

peteflorence commented Feb 14, 2019 via email

@peteflorence
Collaborator Author

Resolved by either passing --ipc=host or --shm-size 8G.

I did have the arg in the wrong spot in the docker run command string that docker_run.py builds up!
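
As a hedged illustration of the kind of mistake (this is not the actual docker_run.py code): in a docker run command, anything placed after the image name is handed to the container's command rather than interpreted by Docker, so a flag appended at the end has no effect on the container's configuration.

# Illustrative only -- not spartan's docker_run.py; the image name is a placeholder.
image = "spartan-image"

# Wrong: the flag comes after the image name, so Docker treats it as an
# argument to /bin/bash instead of an option to `docker run`.
cmd_wrong = "docker run -it " + image + " /bin/bash --ipc=host"

# Right: docker options must appear before the image name.
cmd_right = "docker run -it --ipc=host " + image + " /bin/bash"

print(cmd_right)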

@peteflorence
Collaborator Author

Looked at it with @manuelli this morning.

We might just want to add --ipc=host by default to spartan.

@austinmw

@peteflorence If both --ipc=host and --shm-size work for increasing shared memory, could you help me understand the difference?

@gjstein

gjstein commented Aug 19, 2020

Both solutions worked for me (though in a separate container that runs PyTorch). Is the root cause still unknown? Otherwise, perhaps this issue can be closed as resolved.

@depshad

depshad commented Oct 1, 2020

Is there a way to override the path used by PyTorch multiprocessing (/dev/shm)? Unfortunately, increasing shared memory is not possible for me.
I'm looking for something like %env JOBLIB_TEMP_FOLDER=/tmp, which works for sklearn.
