
finetune.py does not save checkpoints #57

Open
soroushhashemifar opened this issue Jun 25, 2021 · 1 comment
@soroushhashemifar

I run the following command to fine-tune the model:

python finetune.py --transcript_file ./cv-corpus-6.1-2020-12-11/vi/clips/clips.trans.txt --pretrain_model /content/self-supervised-speech-recognition/outputs/2021-06-25/14-39-00/checkpoints/checkpoint_best.pt --dict_file /content/self-supervised-speech-recognition/save_dir/dict.ltr.txt

and I get the following logs:

2021-06-25 15:31:21 | INFO | fairseq_cli.train | training on 1 devices (GPUs/TPUs)
2021-06-25 15:31:21 | INFO | fairseq_cli.train | max tokens per GPU = 2800000 and batch size per GPU = None
2021-06-25 15:31:21 | INFO | fairseq.trainer | no existing checkpoint found checkpoints/checkpoint_last.pt
2021-06-25 15:31:21 | INFO | fairseq.trainer | loading train data for epoch 1
2021-06-25 15:31:21 | INFO | fairseq.data.audio.raw_audio_dataset | loaded 587, skipped 0 samples
2021-06-25 15:31:21 | INFO | fairseq.trainer | NOTE: your device does NOT support faster training with --fp16, please switch to FP32 which is likely to be faster
2021-06-25 15:31:21 | INFO | fairseq.trainer | begin training epoch 1
2021-06-25 15:31:36 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 64.0
2021-06-25 15:31:36 | INFO | fairseq_cli.train | end of epoch 1 (average epoch stats below)
2021-06-25 15:31:36 | INFO | train | {"epoch": 1, "train_lr": "5e-07", "train_loss_scale": "64", "train_train_wall": "14", "train_wall": "15"}
2021-06-25 15:31:36 | INFO | fairseq.trainer | begin training epoch 2
2021-06-25 15:31:50 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 32.0
2021-06-25 15:31:50 | INFO | fairseq_cli.train | end of epoch 2 (average epoch stats below)
2021-06-25 15:31:50 | INFO | train | {"epoch": 2, "train_lr": "5e-07", "train_loss_scale": "32", "train_train_wall": "13", "train_wall": "29"}
2021-06-25 15:31:50 | INFO | fairseq.trainer | begin training epoch 3
2021-06-25 15:32:04 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 16.0
2021-06-25 15:32:04 | INFO | fairseq_cli.train | end of epoch 3 (average epoch stats below)
2021-06-25 15:32:04 | INFO | train | {"epoch": 3, "train_lr": "5e-07", "train_loss_scale": "16", "train_train_wall": "13", "train_wall": "43"}
2021-06-25 15:32:04 | INFO | fairseq.trainer | begin training epoch 4
2021-06-25 15:32:18 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 8.0
2021-06-25 15:32:18 | INFO | fairseq_cli.train | end of epoch 4 (average epoch stats below)
2021-06-25 15:32:18 | INFO | train | {"epoch": 4, "train_lr": "5e-07", "train_loss_scale": "8", "train_train_wall": "13", "train_wall": "57"}
2021-06-25 15:32:18 | INFO | fairseq.trainer | begin training epoch 5
2021-06-25 15:32:32 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 4.0
2021-06-25 15:32:32 | INFO | fairseq_cli.train | end of epoch 5 (average epoch stats below)
2021-06-25 15:32:32 | INFO | train | {"epoch": 5, "train_lr": "5e-07", "train_loss_scale": "4", "train_train_wall": "14", "train_wall": "71"}
2021-06-25 15:32:33 | INFO | fairseq.trainer | begin training epoch 6
2021-06-25 15:32:47 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 2.0
2021-06-25 15:32:47 | INFO | fairseq_cli.train | end of epoch 6 (average epoch stats below)
2021-06-25 15:32:47 | INFO | train | {"epoch": 6, "train_lr": "5e-07", "train_loss_scale": "2", "train_train_wall": "14", "train_wall": "86"}
2021-06-25 15:32:47 | INFO | fairseq.trainer | begin training epoch 7
2021-06-25 15:33:02 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 1.0
2021-06-25 15:33:02 | INFO | fairseq_cli.train | end of epoch 7 (average epoch stats below)
2021-06-25 15:33:02 | INFO | train | {"epoch": 7, "train_lr": "5e-07", "train_loss_scale": "1", "train_train_wall": "14", "train_wall": "101"}
2021-06-25 15:33:02 | INFO | fairseq.trainer | begin training epoch 8
2021-06-25 15:33:17 | INFO | fairseq_cli.train | end of epoch 8 (average epoch stats below)
2021-06-25 15:33:17 | INFO | train | {"epoch": 8, "train_loss": "1038.75", "train_ntokens": "18316", "train_nsentences": "587", "train_nll_loss": "33.29", "train_wps": "0", "train_ups": "0", "train_wpb": "18316", "train_bsz": "587", "train_num_updates": "1", "train_lr": "5.38077e-07", "train_gnorm": "227.223", "train_loss_scale": "1", "train_train_wall": "14", "train_wall": "115"}
2021-06-25 15:33:17 | INFO | fairseq.trainer | begin training epoch 9
2021-06-25 15:33:30 | INFO | fairseq_cli.train | end of epoch 9 (average epoch stats below)
2021-06-25 15:33:30 | INFO | train | {"epoch": 9, "train_loss": "1037.46", "train_ntokens": "18316", "train_nsentences": "587", "train_nll_loss": "33.249", "train_wps": "1324.7", "train_ups": "0.07", "train_wpb": "18316", "train_bsz": "587", "train_num_updates": "2", "train_lr": "5.76154e-07", "train_gnorm": "241.803", "train_loss_scale": "1", "train_train_wall": "13", "train_wall": "129"}
2021-06-25 15:33:30 | INFO | fairseq.trainer | begin training epoch 10
2021-06-25 15:33:44 | INFO | fairseq_cli.train | end of epoch 10 (average epoch stats below)
2021-06-25 15:33:44 | INFO | train | {"epoch": 10, "train_loss": "1037.83", "train_ntokens": "18316", "train_nsentences": "587", "train_nll_loss": "33.261", "train_wps": "1307.6", "train_ups": "0.07", "train_wpb": "18316", "train_bsz": "587", "train_num_updates": "3", "train_lr": "6.14231e-07", "train_gnorm": "222.575", "train_loss_scale": "1", "train_train_wall": "13", "train_wall": "143"}
2021-06-25 15:33:44 | INFO | fairseq.trainer | begin training epoch 11
2021-06-25 15:33:58 | INFO | fairseq_cli.train | end of epoch 11 (average epoch stats below)
2021-06-25 15:33:58 | INFO | train | {"epoch": 11, "train_loss": "1038.17", "train_ntokens": "18316", "train_nsentences": "587", "train_nll_loss": "33.272", "train_wps": "1326.1", "train_ups": "0.07", "train_wpb": "18316", "train_bsz": "587", "train_num_updates": "4", "train_lr": "6.52308e-07", "train_gnorm": "232.343", "train_loss_scale": "1", "train_train_wall": "13", "train_wall": "157"}
2021-06-25 15:33:58 | INFO | fairseq.trainer | begin training epoch 12
...
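For context on the log: the repeated "overflow detected, setting loss scale to: X" lines are fairseq's dynamic FP16 loss scaling. When the scaled gradients overflow, the optimizer step is skipped and the scale is halved, so each of the first seven epochs here performs zero real updates (which matches "train_num_updates": "1" only appearing at the end of epoch 8). A minimal sketch of the idea, with a hypothetical function name that is not fairseq's actual API:

```python
def step_with_dynamic_scaling(grads_overflowed, loss_scale, min_scale=1e-4):
    """Sketch of dynamic FP16 loss scaling: on overflow, skip the
    optimizer update and halve the scale; otherwise apply the update
    and keep the current scale. Returns (new_scale, update_applied)."""
    if grads_overflowed:
        return max(loss_scale / 2.0, min_scale), False  # step skipped
    return loss_scale, True  # step applied

# Mirror the log: three consecutive overflows halve the scale each time,
# then the first clean step goes through at the reduced scale.
scale = 128.0
for overflowed in [True, True, True, False]:
    scale, applied = step_with_dynamic_scaling(overflowed, scale)
```

Skipped steps still count as an "epoch" in the log, which is why the early epochs report only "train_lr" and "train_loss_scale" with no loss statistics.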

When I check the output directory, there is no *.pt file saved. What's going on?
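One thing worth ruling out: the trainer log references a relative path ("checkpoints/checkpoint_last.pt"), so any checkpoints would land relative to the directory the script was launched from, not necessarily the directory being inspected. A recursive search over the launch directory would confirm whether any *.pt file was written at all. The `demo_run` tree below is a hypothetical stand-in so the snippet is self-contained; point `find` at the actual launch directory instead:

```shell
# Stand-in directory tree (replace with your real launch directory).
mkdir -p demo_run/checkpoints
touch demo_run/checkpoints/checkpoint_last.pt

# Recursively list every checkpoint file under the tree.
find demo_run -name '*.pt' -print
```

If the search turns up nothing anywhere under the launch directory, the trainer genuinely never reached a save point, which would be consistent with every early update being skipped due to FP16 overflow.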

@EdenJin20171503024

I have the same error. Have you solved it?
