Hello, I am training on a very large dataset (7T tokens) spread across 45 .bin files. When I try to use more than 32 GPUs, I get an error saying that too many files are open. Has anyone else come across this? Here is the error I receive. Thanks so much!
GPUCA6E: with socket.fromfd(conn.fileno(), socket.AF_UNIX, socket.SOCK_STREAM) as s:
GPUCA6E: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
GPUCA6E: File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/socket.py", line 546, in fromfd
GPUCA6E: fd, addr = self._accept()
GPUCA6E: return recvfds(s, 1)[0]
GPUCA6E: ^^^^^^^^^^^^^^
GPUCA6E: OSError : [Errno 24] Too many open files
GPUCA6E: nfd = dup(fd)
GPUCA6E: ^^^^^^^
GPUCA6E: OSError: [Errno 24] Too many open files
GPUCA6E: ^^^^ return recvfds(s, 1)[0]
GPUCA6E: ^ ^^^^^^^^^^^^^^^^
GPUCA6E: File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/multiprocessing/reduction.py", line 159, in recvfds
GPUCA6E: ^^^^^
GPUCA6E: File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/multiprocessing/reduction.py", line 164, in recvfds
GPUCA6E: raise EOFError
GPUCA6E: EOFErrorTraceback (most recent call last):
GPUCA6E: File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/multiprocessing/resource_sharer.py", line 145, in _serve
GPUCA6E:
GPUCA6E: Exception in thread raise RuntimeError('received %d items of ancdata' %
GPUCA6E: Thread-4 (_pin_memory_loop):
GPUCA6E: Traceback (most recent call last):
GPUCA6E: RuntimeError: received 0 items of ancdata
GPUCA6E: File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
GPUCA6E: send(conn, destination_pid)
GPUCA6E: File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/multiprocessing/resource_sharer.py", line 50, in send
GPUCA6E: reduction.send_handle(conn, new_fd, pid)
GPUCA6E: File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/multiprocessing/reduction.py", line 183, in send_handle
GPUCA6E: with socket.fromfd(conn.fileno(), socket.AF_UNIX, socket.SOCK_STREAM) as s:
GPUCA6E: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^self.run()^
GPUCA6E: ^^^^^^^^^^^
GPUCA6E: File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/socket.py", line 546, in fromfd
GPUCA6E: File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/threading.py", line 975, in run
GPUCA6E: nfd = dup(fd)
GPUCA6E: self._target(*self._args, **self._kwargs)
GPUCA6E: ^^^^^^^
GPUCA6E: OSError: [Errno 24] Too many open files
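
For anyone triaging this: `[Errno 24]` means the process hit its per-process open-file-descriptor limit (`RLIMIT_NOFILE`). The sketch below is only a diagnostic aid and not a confirmed fix for this issue. It uses Python's standard `resource` module to print the current limit and to raise the soft limit up to the hard limit. The helper name `raise_open_file_limit` and the target of 65536 are illustrative assumptions, and the code assumes a Unix-like system.

```python
import resource


def raise_open_file_limit(target=65536):
    """Try to raise the soft open-file limit toward `target` (hypothetical helper).

    The soft limit can never exceed the hard limit, so the request is capped.
    Returns the (soft, hard) limits in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # Nothing to do if the soft limit is already unlimited.
    if soft == resource.RLIM_INFINITY:
        return soft, hard

    # Cap the request at the hard limit unless the hard limit is unlimited.
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)

    if new_soft > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
        except (ValueError, OSError):
            # Some platforms reject the request; keep the existing limit.
            pass

    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print("before:", resource.getrlimit(resource.RLIMIT_NOFILE))
    print("after: ", raise_open_file_limit())
```

This would have to run in each training process before the data loader workers start, since the limit applies per process. The frames in the traceback (`_pin_memory_loop`, `resource_sharer`, `recvfds`) suggest the descriptors are consumed while tensors are passed between data loader workers, so another option that may be worth trying is `torch.multiprocessing.set_sharing_strategy("file_system")`. Whether either change resolves this particular setup is untested here.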