Sudden data error during training #766
🐛 Describe the bug
I'm trying to run a tiny OLMo 2 training and had done so successfully for some steps, when suddenly I got errors like this:

Thanks!

Versions
Python 3.10.9
Comments
Hey @faresobeid, I tried to recreate the issue, but it works fine for me.
Oh, thanks for replying quickly. I was wondering what the easiest way to solve this issue is.
Are you able to save global_indices.npy? When you run …
Oh interesting, I didn't realize that. But what would I do with it if I could save it (also, how large is it)? To clarify, I'm trying to run the OLMo 2 7B config with a smaller model and fewer steps, so I was also wondering if I should edit the data in the config to support this.
global_indices.npy is the train data. I am not sure at this point why it is throwing the error. Can you provide more details on what exactly you are implementing, so that I can help you in the best possible way?
I'm not at my machine right now, but I just ran the OLMo 2 7B stage 1 config, only modifying d_model, n_layers, n_heads, and mlp_hidden_size, as well as switching from FSDP to DDP. If more details are needed, I will send them over tomorrow. Thanks a lot!
OK, update: it seems to work now but produces a new error after around 80 steps of training; you can see my config here: https://github.com/faresobeid/OLMo/blob/main/configs/official-1124/OLMo2-7B-stage1.yaml. Also, an unrelated question: if I only want to train on a subset of the data, is the easiest way just to change the max duration rather than touch any of the data sources? Thanks once again!
Hey @faresobeid, it seems like you have an unstable network. If you want to train on a subset of the data, the easiest way is to edit …
Oh OK, and for editing the training data, is there a recommended way, as in how many links from each source to get rid of? I was also wondering whether, to get past the unstable network issues, I could pre-download and tokenize the dataset, for example; what would be the easiest way to do so?
Hi! I am having the same problem. However, my global_indices.npy is 3.48 GB, when my entire train dataset (Dolma v1.5) should be > 2 TB. I'm guessing this is because the iterable dataset streams the download links. Is there a way to get one global_indices.npy for the entire dataset?
global_indices.npy does not contain the whole dataset. It just contains offsets into the whole dataset. That size for global_indices seems OK.
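For anyone else checking their own file, a minimal sketch for inspecting it; the uint32 dtype is an assumption, so check what your OLMo version actually writes:

```python
# Sketch: inspect global_indices.npy. It stores shuffled instance indices
# (offsets into the concatenated training data), not the data itself.
# The uint32 dtype is an assumption -- verify against your OLMo version.
import numpy as np

indices = np.memmap("global_indices.npy", mode="r", dtype=np.uint32)
print(f"entries:   {len(indices):,}")            # one entry per training instance
print(f"file size: {indices.nbytes / 1e9:.2f} GB")
print(f"head:      {indices[:5]}")
```

At 4 bytes per entry, a 3.48 GB file is roughly 870 million instances, which is plausible for a multi-terabyte token dataset.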
File "/root/OLMo/olmo/util.py", line 712, in _http_get_bytes_range The original error seems to be just that the URL does not return the right amount of data. If it's a connection/network issue, it would more likely manifest as an exception. One option is to update util.py:712 to include the url, curl the url and check if the returned data length is expected. |
I don't think it's an issue with the URL. I ended up writing a try/except block around the function's contents, forcing it to retry on the same URL if any assertion error or other exception occurred, which resolved all issues on my end.
Did you happen to verify that the retry actually happened? For the try/except to solve the OP's problem, an exception would have to be raised by the requests.get line. In that case, the assert len(result) == num_bytes would not even be triggered.
Both lines errored occasionally for me, so I ended up wrapping the entire body of the function in a try/except inside a while True loop. If everything ran successfully, the loop would break. That's what has worked for me in the end.
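For reference, a sketch of that workaround, assuming the function body shown above; unbounded retries can spin forever on a permanently bad URL, so treat it as a stopgap:

```python
# Sketch of the workaround described above: wrap the whole function body in a
# try/except inside a while True loop and retry until the range comes back
# complete. Note this never gives up -- stopgap only.
import requests

def _http_get_bytes_range(scheme: str, host_name: str, path: str, bytes_start: int, num_bytes: int) -> bytes:
    url = f"{scheme}://{host_name}/{path}"
    while True:
        try:
            response = requests.get(url, headers={"Range": f"bytes={bytes_start}-{bytes_start + num_bytes - 1}"})
            result = response.content
            assert len(result) == num_bytes, f"expected {num_bytes} bytes from {url}, got {len(result)}"
            return result  # success: leave the loop
        except (requests.RequestException, AssertionError):
            continue  # failed request or truncated body: retry the same range
```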
We should do this in the code in general and put the fix into main. The root cause is that the server sometimes returns truncated content instead of an error.
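If this lands in main, a bounded version is safer than a bare while True. A sketch with capped attempts and backoff; the attempt count and delays are illustrative choices, not project settings:

```python
# Sketch of a bounded fix: retry a few times with exponential backoff and
# treat a truncated body the same as a network error. MAX_ATTEMPTS and the
# sleep schedule are arbitrary illustrations, not OLMo defaults.
import time
import requests

MAX_ATTEMPTS = 5

def _http_get_bytes_range(scheme: str, host_name: str, path: str, bytes_start: int, num_bytes: int) -> bytes:
    url = f"{scheme}://{host_name}/{path}"
    for attempt in range(MAX_ATTEMPTS):
        try:
            response = requests.get(url, headers={"Range": f"bytes={bytes_start}-{bytes_start + num_bytes - 1}"})
            response.raise_for_status()
            result = response.content
            if len(result) == num_bytes:
                return result
            # 2xx status but truncated body: fall through to the retry path.
        except requests.RequestException:
            pass
        time.sleep(2 ** attempt)  # 1s, 2s, 4s, ... between attempts
    raise RuntimeError(
        f"failed to fetch bytes {bytes_start}-{bytes_start + num_bytes - 1} from {url} "
        f"after {MAX_ATTEMPTS} attempts"
    )
```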