too many .bin files for dataloader, crashed #1252

Closed
exnx opened this issue Jul 13, 2024 · 0 comments
Labels
bug Something isn't working

Comments

Contributor

exnx commented Jul 13, 2024

Hello, I am training on a very large dataset (7T tokens) spread across 45 .bin files. When I try to use more than 32 GPUs, the dataloader crashes with an error saying too many files are open (Errno 24). Has anyone else come across this? Here's the error I receive. Thanks so much!

GPUCA6E:     with socket.fromfd(conn.fileno(), socket.AF_UNIX, socket.SOCK_STREAM) as s:
GPUCA6E:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/socket.py", line 546, in fromfd
GPUCA6E:     fd, addr = self._accept()
GPUCA6E:     return recvfds(s, 1)[0]
GPUCA6E:                ^^^^^^^^^^^^^^
GPUCA6E:    OSError  : [Errno 24] Too many open files
GPUCA6E:        nfd = dup(fd)
GPUCA6E:           ^^^^^^^
GPUCA6E: OSError: [Errno 24] Too many open files
GPUCA6E:    ^^^^    return recvfds(s, 1)[0]
GPUCA6E:           ^ ^^^^^^^^^^^^^^^^
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/multiprocessing/reduction.py", line 159, in recvfds
GPUCA6E: ^^^^^
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/multiprocessing/reduction.py", line 164, in recvfds
GPUCA6E:     raise EOFError    
GPUCA6E: EOFErrorTraceback (most recent call last):
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/multiprocessing/resource_sharer.py", line 145, in _serve
GPUCA6E: 
GPUCA6E: Exception in thread raise RuntimeError('received %d items of ancdata' %
GPUCA6E: Thread-4 (_pin_memory_loop):
GPUCA6E: Traceback (most recent call last):
GPUCA6E: RuntimeError: received 0 items of ancdata
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
GPUCA6E:     send(conn, destination_pid)
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/multiprocessing/resource_sharer.py", line 50, in send
GPUCA6E:     reduction.send_handle(conn, new_fd, pid)
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/multiprocessing/reduction.py", line 183, in send_handle
GPUCA6E:     with socket.fromfd(conn.fileno(), socket.AF_UNIX, socket.SOCK_STREAM) as s:
GPUCA6E:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^    ^self.run()^
GPUCA6E: ^^^^^^^^^^^
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/socket.py", line 546, in fromfd
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/threading.py", line 975, in run
GPUCA6E:     nfd = dup(fd)
GPUCA6E:             self._target(*self._args, **self._kwargs) 
GPUCA6E:  ^^^^^^^
GPUCA6E: OSError: [Errno 24] Too many open files
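
Errno 24 means a process has hit its per-process limit on open file descriptors (`RLIMIT_NOFILE`). One possible workaround, not confirmed as the resolution of this issue, is to raise the soft limit at the start of each training process. Below is a minimal sketch using Python's standard `resource` module (Unix only); the helper name `raise_open_file_limit` and the target of 65536 are illustrative assumptions, not part of this project.

```python
import resource


def raise_open_file_limit(target=65536):
    """Try to raise the soft RLIMIT_NOFILE toward `target` (hypothetical helper).

    The soft limit can never exceed the hard limit without extra privileges,
    so the requested value is capped at the hard limit.
    Returns the (soft, hard) limits in effect after the attempt.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # Cap the request at the hard limit unless the hard limit is unlimited.
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)

    # Only ever raise the limit; leave it alone if it is already high enough.
    if soft != resource.RLIM_INFINITY and new_soft > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
        except (ValueError, OSError):
            # Some systems reject the value; keep the existing limit.
            pass

    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print("open-file limits (soft, hard):", raise_open_file_limit())
```

The same effect can usually be had from the shell with `ulimit -n` before launching the job. Because the traceback passes through `multiprocessing`'s `recvfds`, another option sometimes used with PyTorch dataloaders is `torch.multiprocessing.set_sharing_strategy("file_system")`, which avoids sending file descriptors between worker processes; whether either change helps in this setup is untested here.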
@exnx exnx added the bug Something isn't working label Jul 13, 2024
@exnx exnx closed this as completed Jul 15, 2024