Fine-tuning pretrained model #51

Open
ashleyyy94 opened this issue Mar 21, 2019 · 3 comments

Comments

@ashleyyy94

I'm trying to fine-tune the provided pretrained model on my custom dataset. The command is:

nvidia-docker run -it --rm -v `pwd`:/decaNLP/ -u $(id -u):$(id -g) bmccann/decanlp:cuda9_torch041 bash -c "python /decaNLP/train.py --load /decaNLP/mqan_decanlp_better_sampling_cove_cpu/iteration_560000.pth --resume --train_tasks mwo

While trying to initialise the MQAN model, it throws this error:
RuntimeError: Error(s) in loading state_dict for MultitaskQuestionAnsweringNetwork: Missing key(s) in state_dict: "encoder_embeddings.projection.linear.weight", "encoder_embeddings.projection.linear.bias". Unexpected key(s) in state_dict: "cove.rnn1.weight_ih_l0", "cove.rnn1.weight_hh_l0", "cove.rnn1.bias_ih_l0", "cove.rnn1.bias_hh_l0", "cove.rnn1.weight_ih_l0_reverse", "cove.rnn1.weight_hh_l0_reverse", "cove.rnn1.bias_ih_l0_reverse", "cove.rnn1.bias_hh_l0_reverse", "cove.rnn1.weight_ih_l1", "cove.rnn1.weight_hh_l1", "cove.rnn1.bias_ih_l1", "cove.rnn1.bias_hh_l1", "cove.rnn1.weight_ih_l1_reverse", "cove.rnn1.weight_hh_l1_reverse", "cove.rnn1.bias_ih_l1_reverse", "cove.rnn1.bias_hh_l1_reverse", "project_cove.linear.weight", "project_cove.linear.bias".

Kindly advise how to go about fine-tuning the model. Thank you.

@hot-cheeto

Hello,
I am having issues with fine-tuning the pretrained model mqan_decanlp_better_sampling_cove_cpu. I ran the following command:

python train.py --name test_run --load /path/to/mqan_decanlp_better_sampling_cove_cpu/iteration_560000.pth --resume --device 0 --cove --train_tasks new_task

but I received the following error message: ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group

I have double-checked the parameters in config.json in mqan_decanlp_better_sampling_cove_cpu.

What could be the problem? Am I missing something?

Thank you in advance!

@diarmidmackenzie

These queries are quite old, but I've been hitting the same problems. Posting some answers in case they might be useful for others.

@ashleyyy94 It looks like you were running without the --cove parameter. The pretrained model you were trying to load was trained with CoVe, so you need to pass --cove as well to continue training from it.
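For example, a command along these lines should get past the missing/unexpected key error (same paths as in your original command; new_task is just a placeholder for your own task name, and the --resume issue discussed below may still apply):

nvidia-docker run -it --rm -v `pwd`:/decaNLP/ -u $(id -u):$(id -g) bmccann/decanlp:cuda9_torch041 bash -c "python /decaNLP/train.py --load /decaNLP/mqan_decanlp_better_sampling_cove_cpu/iteration_560000.pth --cove --train_tasks new_task"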

@hot-cheeto I have been hitting this problem too. I don't fully understand it, but it seems to be possible to work around it by dropping the --resume parameter.

The issue seems to be that the stored optimizer state has a mismatched number of parameters.

It has 153 parameters.

>>> import torch
>>> a = torch.load("/decaNLP/results/checkpoints/iteration_560000_rank_0_optim.pth", map_location='cpu')
>>> len(a['param_groups'][0]['params'])
153

Whereas if I start training with the same parameters as you, the optimizer state only has 137 parameters (16 fewer).

>>> b = torch.load("/decaNLP/diarmid_learning/1/iteration_1000_rank_0_optim.pth", map_location='cpu')
>>> len(b['param_groups'][0]['params'])
137
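If anyone wants to dig further, a minimal sketch like the one below (assuming these files are ordinary torch.optim state dicts, which the keys above suggest) lists one stored buffer shape per parameter, which might help pin down which 16 entries exist only in the 560k checkpoint:

import torch

old = torch.load("/decaNLP/results/checkpoints/iteration_560000_rank_0_optim.pth", map_location="cpu")
new = torch.load("/decaNLP/diarmid_learning/1/iteration_1000_rank_0_optim.pth", map_location="cpu")

def shapes(sd):
    # 'state' maps parameter ids to per-parameter buffers (e.g. exp_avg for Adam);
    # use the shape of the first tensor buffer as a proxy for the parameter's shape
    return [next((tuple(v.shape) for v in bufs.values() if torch.is_tensor(v)), None)
            for bufs in sd["state"].values()]

old_shapes, new_shapes = shapes(old), shapes(new)
print(len(old_shapes), len(new_shapes))
print([s for s in old_shapes if s not in set(new_shapes)])  # shapes present only in the 560k state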

I have not yet understood what accounts for these extra 16 optimizer parameters, so I have no idea how to correct for them. But I believe it is reasonable to discard the optimizer state and continue training from the model state, and I seem to have got reasonable results doing so.

You can do that by dropping the --resume parameter.

Once you have a learning checkpoint that you have generated yourself, you can then continue from this checkpoint using the --resume parameter.
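Concretely, the sequence would look something like this (the run name and paths are placeholders; the flags are the same ones used in the commands above):

python train.py --name my_finetune --load /path/to/mqan_decanlp_better_sampling_cove_cpu/iteration_560000.pth --device 0 --cove --train_tasks new_task

python train.py --name my_finetune --load /path/to/your/own/checkpoints/iteration_1000.pth --resume --device 0 --cove --train_tasks new_task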

@diarmidmackenzie

Adding a note that you can't set "strict=False" on the call to load_state_dict for the optimizer. The reason why is explained here: pytorch/pytorch#3852.
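To illustrate the difference (a toy example, not decaNLP code): nn.Module.load_state_dict accepts strict=False and will ignore missing/unexpected keys, but torch.optim.Optimizer.load_state_dict has no such flag and raises the ValueError above whenever the saved param groups don't line up with the optimizer's:

import torch
import torch.nn as nn

model = nn.Linear(4, 2)
# unexpected keys are simply ignored when strict=False
model.load_state_dict({**model.state_dict(), "unexpected.weight": torch.zeros(1)}, strict=False)

small = torch.optim.Adam(nn.Linear(4, 2).parameters())   # one group with 2 parameters
big = torch.optim.Adam(nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2)).parameters())  # 4 parameters
try:
    small.load_state_dict(big.state_dict())               # there is no strict=False here
except ValueError as err:
    print(err)  # "loaded state dict contains a parameter group that doesn't match ..."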

I suspect there has been some change to the model since the pre-trained data referenced in the README was generated.

The pre-trained data logs say:
process_0 - MultitaskQuestionAnsweringNetwork has 18,199,502 parameters

What I see when training is:
process_0 - MultitaskQuestionAnsweringNetwork has 14,589,902 trainable parameters

The wording "parameters" vs. "trainable parameters" in these logs seems to imply that the pre-trained data set was generated from code prior to this commit (26 Oct 2018):
2c837ea
(even if the training itself seems to have taken place in December 2018, it seems to have been using older code).

The 3.5M difference in the number of trainable parameters seems concerning too, and it doesn't seem to be down to configuration (unless I have missed something).
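For anyone comparing these numbers themselves, the two counts correspond to summing over all parameters vs. only parameters with requires_grad=True, e.g. (a generic sketch, not decaNLP internals; frozen pretrained embeddings would only show up in the first count):

import torch.nn as nn

model = nn.Sequential(nn.Embedding(1000, 300), nn.Linear(300, 10))
model[0].weight.requires_grad = False   # freeze the embedding, as one might for GloVe/CoVe

total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{total:,} parameters, {trainable:,} trainable parameters")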

There have been quite a few changes to the repo since 26 Oct 2018 (including a bunch on Oct 26 itself). I've not analyzed them all, but it seems plausible that one of these changes might have resulted in the incompatibility of the optimizer's stored state, causing this problem.
