Sudden data error during training #766
🐛 Describe the bug
I'm trying to run a tiny OLMo 2 training and had done so successfully for some steps, when suddenly I got errors like this:

Thanks!

Versions
Python 3.10.9
Comments
Hey @faresobeid, I tried to recreate the issue, but it works fine for me.
Oh, thanks for replying quickly. I was wondering what the easiest way to solve this issue is.
Are you able to save global_indices.npy? When you run …
Oh interesting, I didn't realize that. But what would I do with it if I could save it (also, how large is it)? To clarify, I'm trying to run the OLMo 2 7B config with a smaller model and fewer steps, so I was also wondering if I should edit the data in the config to support this.
global_indices.npy is the train data. I am not sure at this point why it is throwing the error. Can you provide more details on what exactly you are implementing, so that I can help you in the best possible way?
I'm not at my machine right now, but I just ran the OLMo 2 7B stage 1 config, only modifying d_model, n_layers, n_heads, and mlp_hidden_size, as well as switching from FSDP to DDP. If more details are needed, I will send them over tomorrow. Thanks a lot!
OK, update: it seems to work now but produces a new error after around 80 steps of training; you can see my config here: https://github.com/faresobeid/OLMo/blob/main/configs/official-1124/OLMo2-7B-stage1.yaml. Also, an unrelated question: if I only want to train on a subset of the data, is the easiest way just to change the max duration rather than touch any of the data sources? Thanks once again!
Hey @faresobeid, it seems like you have an unstable network. If you want to train on a subset of the data, the easiest way is to edit …
Oh OK, and for editing the training data, is there a recommended way, as in how many links from each source to get rid of? I was also wondering whether, to get past the unstable network issues, I could pre-download and tokenize the dataset, for example; what would be the easiest way to do so?
Hi! I am having the same problem. However, my global_indices.npy is 3.48 GB, when my entire train dataset (Dolma v1.5) should be > 2 TB. I'm guessing this is because the iterable dataset streams the download links. Is there a way to get one global_indices.npy for the entire dataset?
global_indices.npy does not contain the whole dataset. It just contains offsets into the whole dataset. That size for global_indices seems OK.
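For anyone else checking their own file, a minimal sketch for inspecting it; the uint32 dtype is an assumption, so check what your OLMo version actually writes:

```python
# Sketch: inspect global_indices.npy. It stores shuffled instance indices
# (offsets into the concatenated training data), not the data itself.
# The uint32 dtype is an assumption -- verify against your OLMo version.
import numpy as np

indices = np.memmap("global_indices.npy", mode="r", dtype=np.uint32)
print(f"entries:   {len(indices):,}")            # one entry per training instance
print(f"file size: {indices.nbytes / 1e9:.2f} GB")
print(f"head:      {indices[:5]}")
```

At 4 bytes per entry, a 3.48 GB file is roughly 870 million instances, which is plausible for a multi-terabyte token dataset.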
File "/root/OLMo/olmo/util.py", line 712, in _http_get_bytes_range The original error seems to be just that the URL does not return the right amount of data. If it's a connection/network issue, it would more likely manifest as an exception. One option is to update util.py:712 to include the url, curl the url and check if the returned data length is expected. |
I don't think it's an issue with the URL. I ended up writing a try/except block around the function's contents, forcing it to retry on the same URL if any assertion error or other exception occurred, which resolved all issues on my end.
Did you happen to verify that the retry actually happened? For the try/except to solve the OP's problem, an exception would have to be raised by the requests.get line. In that case, the assert len(result) == num_bytes would not even be triggered.
Both lines errored occasionally for me, so I ended up wrapping the entire body of the function in a try/except inside a while True loop. If everything ran successfully, the loop would break. That's what has worked for me in the end.
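For reference, a sketch of that workaround, assuming the function body shown above; unbounded retries can spin forever on a permanently bad URL, so treat it as a stopgap:

```python
# Sketch of the workaround described above: wrap the whole function body in a
# try/except inside a while True loop and retry until the range comes back
# complete. Note this never gives up -- stopgap only.
import requests

def _http_get_bytes_range(scheme: str, host_name: str, path: str, bytes_start: int, num_bytes: int) -> bytes:
    url = f"{scheme}://{host_name}/{path}"
    while True:
        try:
            response = requests.get(url, headers={"Range": f"bytes={bytes_start}-{bytes_start + num_bytes - 1}"})
            result = response.content
            assert len(result) == num_bytes, f"expected {num_bytes} bytes from {url}, got {len(result)}"
            return result  # success: leave the loop
        except (requests.RequestException, AssertionError):
            continue  # failed request or truncated body: retry the same range
```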
We should do this in the code in general and put the fix into main. The root cause is that the server sometimes returns truncated content instead of an error.
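If this lands in main, a bounded version is safer than a bare while True. A sketch with capped attempts and backoff; the attempt count and delays are illustrative choices, not project settings:

```python
# Sketch of a bounded fix: retry a few times with exponential backoff and
# treat a truncated body the same as a network error. MAX_ATTEMPTS and the
# sleep schedule are arbitrary illustrations, not OLMo defaults.
import time
import requests

MAX_ATTEMPTS = 5

def _http_get_bytes_range(scheme: str, host_name: str, path: str, bytes_start: int, num_bytes: int) -> bytes:
    url = f"{scheme}://{host_name}/{path}"
    for attempt in range(MAX_ATTEMPTS):
        try:
            response = requests.get(url, headers={"Range": f"bytes={bytes_start}-{bytes_start + num_bytes - 1}"})
            response.raise_for_status()
            result = response.content
            if len(result) == num_bytes:
                return result
            # 2xx status but truncated body: fall through to the retry path.
        except requests.RequestException:
            pass
        time.sleep(2 ** attempt)  # 1s, 2s, 4s, ... between attempts
    raise RuntimeError(
        f"failed to fetch bytes {bytes_start}-{bytes_start + num_bytes - 1} from {url} "
        f"after {MAX_ATTEMPTS} attempts"
    )
```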