RuntimeError: [enforce fail at inline_container.cc:145] . PytorchStreamReader failed reading file data/94862161006912: file read failed #35
Hi, it's likely that you have a corrupted model file. This can happen if training was terminated while the checkpoint was being saved. Solutions:
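A quick way to confirm whether the checkpoint is actually corrupted is to try deserializing it directly (a minimal sketch, not taken from YANMTT; the path below is a placeholder for your own checkpoint file):

```python
# Minimal sketch: try to deserialize the checkpoint to see whether the file is readable.
# The path is a placeholder for your own checkpoint; loading on CPU so no GPU is needed.
import torch

checkpoint_path = "gen_model/mbart/mbart-50-v1"  # placeholder path

try:
    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    print("Checkpoint loaded successfully; top-level type:", type(checkpoint))
except RuntimeError as err:
    # A truncated or partially written file typically raises the
    # "PytorchStreamReader failed reading file" error seen in this issue.
    print("Checkpoint appears to be corrupted:", err)
```

If this fails with the same PytorchStreamReader error, the file was most likely cut off mid-save and you will need to fall back to an earlier checkpoint.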
@prajdabre Oh, thanks for your patience. In fact, I had already guessed that the problem is the model not being fully saved, and after checking I confirmed that the error happens when a partially saved checkpoint is reloaded. As I understand it, the big model (6.9 GB) does not really need to be saved every time. Can I save and reload only the pure_model every 1000 steps, and for the optimizer and scheduler, can I keep using them from the previous step? Maybe this would resolve my error on multiple GPUs; and finally, I would save only one big model at the last step. Can I make these changes, or do you have other suggestions? Actually, I don't understand the reason for saving this large model.
Hi, I'm not sure of the exact problem, but I typically pass the flag --save_intermediate_checkpoints. This will save a separate checkpoint every 10k iterations, and this 10k can be set to any value using another flag, --long_save_every. I then use an appropriate checkpoint. I have never used the last checkpoint, to be honest; I should look at whether there's a bug in the last-checkpoint saving or not. That being said, using the .pure_model for fine-tuning is not a problem. I designed the training so that the big checkpoint with the optimizer, scheduler, and counter can be used to resume training after a failed run. The pure_model checkpoint is the one that should be used for fine-tuning on a downstream task where optimizer params are not needed, for sharing with someone, or for uploading to Hugging Face. Hope this makes sense.
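For fine-tuning from the weights-only checkpoint, something along these lines should work (a minimal sketch assuming the .pure_model file is a plain state_dict saved with torch.save; YANMTT's exact serialization and key names may differ, and the paths are placeholders):

```python
# Minimal sketch: reuse a weights-only (.pure_model) checkpoint for fine-tuning.
# Assumes the file is a plain state_dict saved with torch.save.
import torch
from transformers import MBartForConditionalGeneration

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
state_dict = torch.load("gen_model/mbart/mbart-50-v1.pure_model", map_location="cpu")  # placeholder path
model.load_state_dict(state_dict, strict=False)  # no optimizer/scheduler state is needed for fine-tuning
```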
@prajdabre Thank you for your reply. After many adjustments, I finally decided to remove some of the intermediate model-loading code, as follows. Another point that I haven't understood is where the stopping mechanism of this pre-training stage is adjusted; I can't seem to find a parameter that controls the number of training steps.
Hi, your modification is OK. There's no real reason to load a saved model; I just wanted to make sure that everything is hard-synchronized. The argument you are looking for is --num_batches.
Thanks a lot!
@prajdabre By the way, I just want to ask: when I pre-train on more than one language, such as 8 languages, is it necessary to adjust this parameter? My script errored out again when I switched from monolingual pre-training (one language) to multilingual pre-training (8 languages):
-- Process 0 terminated with the following error: ... and my script setting is: ...
No, you don't need the domain classifier flags. The reason for the failure is that the language IDs you use should correspond to what is used in mBART: en should be en_XX. Look at the official mBART model repo and find the IDs for the other languages.
@prajdabre Thanks very much for your quick reply. This is the official format of the language codes in mBART-large-50, from https://huggingface.co/facebook/mbart-large-50. What I don't understand is that when I pre-trained on a single language I just used "en", not "en_XX", and it ran successfully, and the project's examples also appear to use plain codes like "en".
That's because, coincidentally, the token en exists in the mBART tokenizer. I'm betting that the token zh isn't present in the tokenizer and is split, e.g. as "z# h". Whenever a language token is split into multiple parts, my code crashes (intentionally).
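To see why a plain tag like zh fails while en happens to work, you can inspect how the mBART-50 tokenizer segments each candidate tag (an illustrative check, not part of YANMTT; the tag list is just an example):

```python
# Minimal sketch: check whether a candidate language tag survives as a single token.
# Tags that get split into several pieces (or that map to <unk>) will not work as
# language indicator tokens.
from transformers import MBart50TokenizerFast

tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50")

for tag in ["en", "en_XX", "zh", "zh_CN"]:
    pieces = tokenizer.tokenize(tag)
    tag_id = tokenizer.convert_tokens_to_ids(tag)
    print(f"{tag}: pieces={pieces}, id={tag_id}, is_unk={tag_id == tokenizer.unk_token_id}")
```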
OK, thanks, I will try it. By the way, besides the tokenizer training files, do the model training files also need language-code suffixes?
No, the training files can have any suffix. Only during tokenizer training should the training files have the proper suffixes, which act as the language indicator tokens you plan to use for model training and decoding.
Thank you very much. After some setup, it has been successful. By the way, does that mean that if I continue pre-training with a new language, for example continuing from mBART-50, which has Simplified Chinese but not Traditional Chinese, and I want to add Traditional Chinese, I have to pre-train from scratch? @prajdabre
To my understanding, mBART-50 does not officially support Traditional Chinese. First you will have to check whether the mBART-50 tokenizer can handle all Traditional characters or not. If it does, then you may train directly.
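One way to check tokenizer coverage is to tokenize some Traditional Chinese text and count how many tokens fall back to the unknown token (an illustrative sketch; the sample sentence is a placeholder, and in practice you would run this over a representative slice of your corpus):

```python
# Minimal sketch: count how many tokens of a Traditional Chinese sample map to <unk>
# under the mBART-50 tokenizer. A high <unk> rate suggests the tokenizer cannot handle
# the script well and extra vocabulary (or a new tokenizer) would be needed.
from transformers import MBart50TokenizerFast

tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50")

sample = "這是一個繁體中文的測試句子。"  # placeholder Traditional Chinese sentence
ids = tokenizer(sample, add_special_tokens=False)["input_ids"]
unk_count = sum(1 for i in ids if i == tokenizer.unk_token_id)
print(f"{unk_count} of {len(ids)} tokens are <unk>")
```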
Thanks a lot @prajdabre
@prajdabre Oh, I have another confusion. I'm running a continued pre-training task, and the language with the smallest corpus (Korean) has already finished many epochs while the other languages haven't finished even one. Do you know the reason?
By the way, some supplementary information about the data (the per-language corpus sizes) and my settings: --num_batches = 2000000 and batch_size = 512.
Hi, your supplementary information about the corpora sizes answers it all. Since Korean has the smallest data, it will finish far more epochs before the others finish one epoch. This is because there is a data sampling hyperparameter called --data_sampling_temperature, which is set to 5. This means that smaller datasets will be seen more often, to keep training from focusing on the higher-resource languages. I think you will see 1 epoch for Thai after 20 or so epochs for Korean.
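For intuition, temperature-based sampling typically gives language i a probability proportional to (n_i / N)^(1/T), which upweights small corpora as T grows. A small sketch with made-up corpus sizes (YANMTT's exact implementation may differ in detail):

```python
# Minimal sketch of temperature-based data sampling: p_i ∝ (n_i / N) ** (1 / T).
# The corpus sizes are hypothetical placeholders, not the sizes from this thread.
corpus_sizes = {"ko": 1_000_000, "th": 20_000_000, "en": 50_000_000}
T = 5.0  # --data_sampling_temperature

total = sum(corpus_sizes.values())
weights = {lang: (n / total) ** (1.0 / T) for lang, n in corpus_sizes.items()}
norm = sum(weights.values())
probs = {lang: w / norm for lang, w in weights.items()}
print(probs)  # the smallest corpus gets a far larger share than its raw proportion
```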
Oh, thank you very much, I see. I had neglected the parameter --data_sampling_temperature, and if that's the case then I'll have to think about resetting it. But even now, after 380k batches, not even one epoch has finished for the other 7 languages, so probably after 2000000 batches some high-resource languages will not be fully trained. I don't know if I'm right in thinking this way. @prajdabre
A correction: the batch size doesn't indicate lines but the number of tokens. If your corpus contains paragraphs, then each batch per GPU contains only 2 or 3 entries for a batch of 512 tokens; if it's sentences, then probably 8 or 10 sentences. Note that you should set --hard_truncate_length to 512 and --max_length to 512 as well; otherwise your training will skip all the data in the case of paragraphs. Anyway, with just 2 GPUs you are going to need several tens of millions of steps. I'm afraid that at the scale of data you want to work with, you need more GPUs to get results quickly. I recommend filtering the dataset down to a more manageable size, like 14 million examples across all languages.
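As a back-of-the-envelope check of what token-based batching implies (all numbers here are illustrative assumptions, not measurements from this thread):

```python
# Minimal sketch: rough estimate of sentences processed when batch_size counts tokens.
batch_size_tokens = 512
avg_tokens_per_sentence = 60   # assumed average; measure this on your own data
num_gpus = 2
num_batches = 2_000_000

sentences_per_batch_per_gpu = batch_size_tokens // avg_tokens_per_sentence  # ~8
total_sentences_seen = sentences_per_batch_per_gpu * num_gpus * num_batches  # ~32M
print(sentences_per_batch_per_gpu, total_sentences_seen)
```

If the corpora contain tens of millions of sentences, this is why the suggestion above is to either run many more steps, use more GPUs, or filter the dataset down.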
@prajdabre Thank you very much. It seems my previous settings weren't reasonable. Actually, my training data for every language consists of sentences, not paragraphs. I will set --hard_truncate_length to 512 and --max_length to 512, use a higher value of --num_batches, and reduce the size of the training set to less than half of its current size.
Hi, when I use train_mbart_model.sh to further pre-train from mBART-large-50 (https://huggingface.co/facebook/mbart-large-50), there is no problem when I run on a single GPU, but when I set it up as follows to run on 2 GPUs:
export CUDA_VISIBLE_DEVICES=0,1 # Change to the GPU IDs corresponding to GPUs that are free.
nohup python pretrain_nmt.py -n 1 -nr 0 -g 2 --model_path gen_model/mbart/mbart-50-v1 --tokenizer_name_or_path pretrain_model/mbart-50 --langs en --mono_src examples/test_data/test_mbart_train.en --encoder_layers 12 --decoder_layers 12 --encoder_attention_heads=12 --decoder_attention_heads=12 --encoder_ffn_dim=128 --decoder_ffn_dim=4096 --d_model=1024 --batch_size 128 --use_official_pretrained --pretrained_model pretrain_model/mbart-50 --no_reload_optimizer_ctr_and_scheduler --shard_files > gen_model/mbart/run_train.log 2>&1 &
the following error occurs:
Number of model parameters: 610879488
Total number of params to be optimized are: 610879488
Percentage of parameters to be optimized: 100.0
Initial LR is: 1.25e-07
Training from official pretrained model
/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:247: UserWarning: To get the last learning rate computed by the scheduler, please use `get_last_lr()`.
  warnings.warn("To get the last learning rate computed by the scheduler, "
/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
Traceback (most recent call last):
File "pretrain_nmt.py", line 968, in
run_demo()
File "pretrain_nmt.py", line 965, in run_demo
mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,)) #
File "/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "****/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/mt_mbart/yanmtt/pretrain_nmt.py", line 313, in model_create_load_run_save
checkpoint_dict = torch.load(CHECKPOINT_PATH, map_location=map_location)
File "/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/serialization.py", line 594, in load
return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
File "/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/serialization.py", line 853, in _load
result = unpickler.load()
File "/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/serialization.py", line 845, in persistent_load
load_tensor(data_type, size, key, _maybe_decode_ascii(location))
File "*/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/serialization.py", line 833, in load_tensor
storage = zip_file.get_storage_from_record(name, size, dtype).storage()
RuntimeError: [enforce fail at inline_container.cc:145] . PytorchStreamReader failed reading file data/94862161006912: file read failed
Can you help me with this problem? I have spent a long time on it and haven't been able to solve it.