
finetune.py does not save checkpoints #57

Open
soroushhashemifar opened this issue Jun 25, 2021 · 1 comment
@soroushhashemifar

I run the following command to fine-tune the model:

python finetune.py --transcript_file ./cv-corpus-6.1-2020-12-11/vi/clips/clips.trans.txt --pretrain_model /content/self-supervised-speech-recognition/outputs/2021-06-25/14-39-00/checkpoints/checkpoint_best.pt --dict_file /content/self-supervised-speech-recognition/save_dir/dict.ltr.txt

and I get the following logs:

2021-06-25 15:31:21 | INFO | fairseq_cli.train | training on 1 devices (GPUs/TPUs)
2021-06-25 15:31:21 | INFO | fairseq_cli.train | max tokens per GPU = 2800000 and batch size per GPU = None
2021-06-25 15:31:21 | INFO | fairseq.trainer | no existing checkpoint found checkpoints/checkpoint_last.pt
2021-06-25 15:31:21 | INFO | fairseq.trainer | loading train data for epoch 1
2021-06-25 15:31:21 | INFO | fairseq.data.audio.raw_audio_dataset | loaded 587, skipped 0 samples
2021-06-25 15:31:21 | INFO | fairseq.trainer | NOTE: your device does NOT support faster training with --fp16, please switch to FP32 which is likely to be faster
2021-06-25 15:31:21 | INFO | fairseq.trainer | begin training epoch 1
2021-06-25 15:31:36 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 64.0
2021-06-25 15:31:36 | INFO | fairseq_cli.train | end of epoch 1 (average epoch stats below)
2021-06-25 15:31:36 | INFO | train | {"epoch": 1, "train_lr": "5e-07", "train_loss_scale": "64", "train_train_wall": "14", "train_wall": "15"}
2021-06-25 15:31:36 | INFO | fairseq.trainer | begin training epoch 2
2021-06-25 15:31:50 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 32.0
2021-06-25 15:31:50 | INFO | fairseq_cli.train | end of epoch 2 (average epoch stats below)
2021-06-25 15:31:50 | INFO | train | {"epoch": 2, "train_lr": "5e-07", "train_loss_scale": "32", "train_train_wall": "13", "train_wall": "29"}
2021-06-25 15:31:50 | INFO | fairseq.trainer | begin training epoch 3
2021-06-25 15:32:04 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 16.0
2021-06-25 15:32:04 | INFO | fairseq_cli.train | end of epoch 3 (average epoch stats below)
2021-06-25 15:32:04 | INFO | train | {"epoch": 3, "train_lr": "5e-07", "train_loss_scale": "16", "train_train_wall": "13", "train_wall": "43"}
2021-06-25 15:32:04 | INFO | fairseq.trainer | begin training epoch 4
2021-06-25 15:32:18 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 8.0
2021-06-25 15:32:18 | INFO | fairseq_cli.train | end of epoch 4 (average epoch stats below)
2021-06-25 15:32:18 | INFO | train | {"epoch": 4, "train_lr": "5e-07", "train_loss_scale": "8", "train_train_wall": "13", "train_wall": "57"}
2021-06-25 15:32:18 | INFO | fairseq.trainer | begin training epoch 5
2021-06-25 15:32:32 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 4.0
2021-06-25 15:32:32 | INFO | fairseq_cli.train | end of epoch 5 (average epoch stats below)
2021-06-25 15:32:32 | INFO | train | {"epoch": 5, "train_lr": "5e-07", "train_loss_scale": "4", "train_train_wall": "14", "train_wall": "71"}
2021-06-25 15:32:33 | INFO | fairseq.trainer | begin training epoch 6
2021-06-25 15:32:47 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 2.0
2021-06-25 15:32:47 | INFO | fairseq_cli.train | end of epoch 6 (average epoch stats below)
2021-06-25 15:32:47 | INFO | train | {"epoch": 6, "train_lr": "5e-07", "train_loss_scale": "2", "train_train_wall": "14", "train_wall": "86"}
2021-06-25 15:32:47 | INFO | fairseq.trainer | begin training epoch 7
2021-06-25 15:33:02 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 1.0
2021-06-25 15:33:02 | INFO | fairseq_cli.train | end of epoch 7 (average epoch stats below)
2021-06-25 15:33:02 | INFO | train | {"epoch": 7, "train_lr": "5e-07", "train_loss_scale": "1", "train_train_wall": "14", "train_wall": "101"}
2021-06-25 15:33:02 | INFO | fairseq.trainer | begin training epoch 8
2021-06-25 15:33:17 | INFO | fairseq_cli.train | end of epoch 8 (average epoch stats below)
2021-06-25 15:33:17 | INFO | train | {"epoch": 8, "train_loss": "1038.75", "train_ntokens": "18316", "train_nsentences": "587", "train_nll_loss": "33.29", "train_wps": "0", "train_ups": "0", "train_wpb": "18316", "train_bsz": "587", "train_num_updates": "1", "train_lr": "5.38077e-07", "train_gnorm": "227.223", "train_loss_scale": "1", "train_train_wall": "14", "train_wall": "115"}
2021-06-25 15:33:17 | INFO | fairseq.trainer | begin training epoch 9
2021-06-25 15:33:30 | INFO | fairseq_cli.train | end of epoch 9 (average epoch stats below)
2021-06-25 15:33:30 | INFO | train | {"epoch": 9, "train_loss": "1037.46", "train_ntokens": "18316", "train_nsentences": "587", "train_nll_loss": "33.249", "train_wps": "1324.7", "train_ups": "0.07", "train_wpb": "18316", "train_bsz": "587", "train_num_updates": "2", "train_lr": "5.76154e-07", "train_gnorm": "241.803", "train_loss_scale": "1", "train_train_wall": "13", "train_wall": "129"}
2021-06-25 15:33:30 | INFO | fairseq.trainer | begin training epoch 10
2021-06-25 15:33:44 | INFO | fairseq_cli.train | end of epoch 10 (average epoch stats below)
2021-06-25 15:33:44 | INFO | train | {"epoch": 10, "train_loss": "1037.83", "train_ntokens": "18316", "train_nsentences": "587", "train_nll_loss": "33.261", "train_wps": "1307.6", "train_ups": "0.07", "train_wpb": "18316", "train_bsz": "587", "train_num_updates": "3", "train_lr": "6.14231e-07", "train_gnorm": "222.575", "train_loss_scale": "1", "train_train_wall": "13", "train_wall": "143"}
2021-06-25 15:33:44 | INFO | fairseq.trainer | begin training epoch 11
2021-06-25 15:33:58 | INFO | fairseq_cli.train | end of epoch 11 (average epoch stats below)
2021-06-25 15:33:58 | INFO | train | {"epoch": 11, "train_loss": "1038.17", "train_ntokens": "18316", "train_nsentences": "587", "train_nll_loss": "33.272", "train_wps": "1326.1", "train_ups": "0.07", "train_wpb": "18316", "train_bsz": "587", "train_num_updates": "4", "train_lr": "6.52308e-07", "train_gnorm": "232.343", "train_loss_scale": "1", "train_train_wall": "13", "train_wall": "157"}
2021-06-25 15:33:58 | INFO | fairseq.trainer | begin training epoch 12
...
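For context on the log: the repeated "overflow detected, setting loss scale to: X" lines are fairseq's dynamic FP16 loss scaling. When the scaled gradients overflow, the optimizer step is skipped and the scale is halved, so each of the first seven epochs here performs zero real updates (which matches "train_num_updates": "1" only appearing at the end of epoch 8). A minimal sketch of the idea, with a hypothetical function name that is not fairseq's actual API:

```python
def step_with_dynamic_scaling(grads_overflowed, loss_scale, min_scale=1e-4):
    """Sketch of dynamic FP16 loss scaling: on overflow, skip the
    optimizer update and halve the scale; otherwise apply the update
    and keep the current scale. Returns (new_scale, update_applied)."""
    if grads_overflowed:
        return max(loss_scale / 2.0, min_scale), False  # step skipped
    return loss_scale, True  # step applied

# Mirror the log: three consecutive overflows halve the scale each time,
# then the first clean step goes through at the reduced scale.
scale = 128.0
for overflowed in [True, True, True, False]:
    scale, applied = step_with_dynamic_scaling(overflowed, scale)
```

Skipped steps still count as an "epoch" in the log, which is why the early epochs report only "train_lr" and "train_loss_scale" with no loss statistics.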

When I check the output directory, there is no *.pt file saved. What's going on?
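One thing worth ruling out: the trainer log references a relative path ("checkpoints/checkpoint_last.pt"), so any checkpoints would land relative to the directory the script was launched from, not necessarily the directory being inspected. A recursive search over the launch directory would confirm whether any *.pt file was written at all. The `demo_run` tree below is a hypothetical stand-in so the snippet is self-contained; point `find` at the actual launch directory instead:

```shell
# Stand-in directory tree (replace with your real launch directory).
mkdir -p demo_run/checkpoints
touch demo_run/checkpoints/checkpoint_last.pt

# Recursively list every checkpoint file under the tree.
find demo_run -name '*.pt' -print
```

If the search turns up nothing anywhere under the launch directory, the trainer genuinely never reached a save point, which would be consistent with every early update being skipped due to FP16 overflow.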

@EdenJin20171503024

I have the same error. Have you solved it?
