
Can I use an English dataset for this repo? #7

Open
Shaobo-Z opened this issue Jul 2, 2023 · 13 comments

Comments

@Shaobo-Z

Shaobo-Z commented Jul 2, 2023

In the source code, you used Vietnamese data for training and validation. If I want to fine-tune a model on English with an English dataset, is there anything I should change?

@khanld
Owner

khanld commented Jul 2, 2023

No, you just have to prepare the English dataset.
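As a concrete sketch of what "prepare the dataset" could mean here: a later comment in this thread describes a pipe-separated text file with a path column and a transcript column, so a minimal helper might look like the following. The file name, column names, and lowercasing step are assumptions for illustration, not the repo's exact spec.

```python
# Hypothetical sketch: write a pipe-separated manifest (path|transcript),
# matching the two-column layout described later in this thread.
# Column names, file name, and text normalization are assumptions.
import csv

def write_manifest(rows, out_path):
    """rows: iterable of (audio_path, transcript) tuples."""
    with open(out_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="|")
        writer.writerow(["path", "transcript"])
        for audio_path, transcript in rows:
            # Lowercasing keeps transcripts consistent with a lowercase vocab.
            writer.writerow([audio_path, transcript.lower().strip()])

write_manifest(
    [("clips/0001.wav", "Hello world"), ("clips/0002.wav", "Good morning")],
    "train.txt",
)
```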

@khanld khanld closed this as completed Jul 2, 2023
@khanld khanld reopened this Jul 2, 2023
@Shaobo-Z
Author

Shaobo-Z commented Jul 2, 2023

This is what my dataset looks like ⬇
image

And this is what I got ⬇. The train_loss, train_lr, etc. change between steps; however, the train_wer is always 1.0000.
image

Checked:

  1. Sample rate: verified with librosa.get_samplerate; it returns 16000.
  2. The transcripts are correct.
  3. Only the file_path and iteration settings in the config file were modified.
  4. The pre-trained model is facebook/wav2vec2-base.

I tried multiple things, but the result stays the same. Any ideas? Please.
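For anyone reproducing the sample-rate check in item 1: librosa.get_samplerate works, and the same check can also be done with only the standard library, which is handy on a server without librosa. A minimal sketch:

```python
# Dependency-free double-check of WAV sample rates using the stdlib
# (librosa.get_samplerate, as used above, gives the same answer for WAV).
import wave

def check_sample_rate(path, expected=16000):
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
    if rate != expected:
        raise ValueError(f"{path}: got {rate} Hz, expected {expected} Hz")
    return rate
```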

@khanld
Owner

khanld commented Jul 2, 2023

I can see that your dataset is relatively small, so the number of update steps per epoch is only 5. Have you tried a longer run to check whether the behavior persists? Also take a look at the vocab.json file to verify that it contains the correct English characters.
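The vocab.json check above can be automated with a few lines: if any English letter is missing from the token-to-id mapping, the CTC head can never emit correct English text and WER will stay at 1.0. A minimal sketch (assuming the usual wav2vec2 convention of a lowercase character vocabulary; this repo's special-token names may differ):

```python
# Sanity check of vocab.json: every lowercase English letter should be
# present, otherwise the model cannot emit correct English transcripts.
# Assumes a lowercase character vocab, per the usual wav2vec2 convention.
import json
import string

def missing_english_chars(vocab_path):
    with open(vocab_path, encoding="utf-8") as f:
        vocab = json.load(f)  # token -> id mapping
    return sorted(set(string.ascii_lowercase) - set(vocab))
```

An empty return value means all 26 letters are covered; anything else lists the characters the tokenizer cannot produce.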

@ghosthunterk

I encountered the same problem even with a larger dataset (91 steps per epoch, 20 epochs).

@khanld
Owner

khanld commented Jul 21, 2023

I have not tried other-language datasets yet. Can you share more information about your dataset, config, TensorBoard logs, etc.?

@ghosthunterk

Python 3.8.
I pip-installed everything in requirements.txt, except torch 1.7.1, which I had to install with (conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=11.0 -c pytorch) because I have CUDA 11.4.
I tried both the VIVOS and Common Voice datasets, stored as .txt files written with pandas, separated by "|", with two columns: path (path on the server) and transcript (UTF-8 encoded).
When I printed the pred and label, I got these:
image

@ghosthunterk

The audio files are already preprocessed to a 16000 Hz sample rate and .wav format.
image

@khanld
Owner

khanld commented Jul 21, 2023

I can see that your model has not converged yet; the train loss is still high. Try increasing the learning rate for faster training.

@khanld
Owner

khanld commented Jul 21, 2023

Ping me at [email protected] for better debugging, since I rarely check GitHub notifications.

@ghosthunterk

> Ping me at mail [email protected] for better debugging since I rarely check the GitHub notifications

Already did, thanks.

@Shaobo-Z
Author

Shaobo-Z commented Jul 21, 2023 via email

@khanld
Owner

khanld commented Jul 21, 2023

I will take a look at my code, run some experiments on English datasets, and respond to you soon @Shaobo-Z.

@ghosthunterk

image
After experimenting for a while, I found that increasing the learning rate (to above 1e-5) and setting the scheduler's max learning rate to >= 1e-4 helped the model actually start learning after a while. Just be patient.
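To make the suggestion above concrete, here is a pure-Python sketch of a warmup-then-linear-decay schedule peaking at the 1e-4 max learning rate mentioned in this comment. The repo's actual scheduler is likely different (e.g. a torch LR scheduler configured in the config file); the function name and warmup fraction are assumptions for illustration.

```python
# Sketch of a warmup-then-linear-decay LR schedule peaking at max_lr=1e-4,
# the value suggested above. Names and warmup fraction are illustrative;
# the repo's real scheduler may differ.
def lr_at_step(step, total_steps, max_lr=1e-4, warmup_frac=0.1):
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp from ~0 up to max_lr over the warmup phase.
        return max_lr * (step + 1) / warmup_steps
    remaining = total_steps - warmup_steps
    progress = (step - warmup_steps) / max(1, remaining)
    # Linear decay from max_lr back toward 0.
    return max_lr * (1.0 - progress)

print(lr_at_step(99, 1000))  # end of a 100-step warmup -> 1e-04 peak
```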
