
Can I use an English dataset for this repo? #7

Open
Shaobo-Z opened this issue Jul 2, 2023 · 13 comments

Comments

@Shaobo-Z

Shaobo-Z commented Jul 2, 2023

In the source code, you used Vietnamese data for training and validation. If I want to fine-tune a model on English with an English dataset, is there anything I should change?

@khanld
Owner

khanld commented Jul 2, 2023

No, you just have to prepare the English dataset.
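As a concrete sketch of what "prepare the dataset" could mean here: a later comment in this thread describes a pipe-separated text file with a path column and a transcript column, so a minimal helper might look like the following. The file name, column names, and lowercasing step are assumptions for illustration, not the repo's exact spec.

```python
# Hypothetical sketch: write a pipe-separated manifest (path|transcript),
# matching the two-column layout described later in this thread.
# Column names, file name, and text normalization are assumptions.
import csv

def write_manifest(rows, out_path):
    """rows: iterable of (audio_path, transcript) tuples."""
    with open(out_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="|")
        writer.writerow(["path", "transcript"])
        for audio_path, transcript in rows:
            # Lowercasing keeps transcripts consistent with a lowercase vocab.
            writer.writerow([audio_path, transcript.lower().strip()])

write_manifest(
    [("clips/0001.wav", "Hello world"), ("clips/0002.wav", "Good morning")],
    "train.txt",
)
```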

@khanld khanld closed this as completed Jul 2, 2023
@khanld khanld reopened this Jul 2, 2023
@Shaobo-Z
Author

Shaobo-Z commented Jul 2, 2023

This is what my dataset looks like ⬇
image

And this is what I got ⬇. The train_loss, train_lr, etc. change between steps; however, the train_wer is always 1.0000.
image

Checked:

  1. Sample rate: verified with librosa.get_samplerate; it returns 16000.
  2. The transcripts are correct.
  3. Only the file_path and iteration settings in the config file were modified.
  4. The pre-trained model is facebook/wav2vec2-base.

I tried multiple things, but the result stays the same. Any ideas? Please.
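For anyone reproducing the sample-rate check in item 1: librosa.get_samplerate works, and the same check can also be done with only the standard library, which is handy on a server without librosa. A minimal sketch:

```python
# Dependency-free double-check of WAV sample rates using the stdlib
# (librosa.get_samplerate, as used above, gives the same answer for WAV).
import wave

def check_sample_rate(path, expected=16000):
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
    if rate != expected:
        raise ValueError(f"{path}: got {rate} Hz, expected {expected} Hz")
    return rate
```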

@khanld
Owner

khanld commented Jul 2, 2023

I can see that your dataset is relatively small, so the number of update steps per epoch is only 5. Have you tried a longer run to check whether the behavior persists? Also take a look at the vocab.json file to verify that it contains the correct English characters.
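The vocab.json check above can be automated with a few lines: if any English letter is missing from the token-to-id mapping, the CTC head can never emit correct English text and WER will stay at 1.0. A minimal sketch (assuming the usual wav2vec2 convention of a lowercase character vocabulary; this repo's special-token names may differ):

```python
# Sanity check of vocab.json: every lowercase English letter should be
# present, otherwise the model cannot emit correct English transcripts.
# Assumes a lowercase character vocab, per the usual wav2vec2 convention.
import json
import string

def missing_english_chars(vocab_path):
    with open(vocab_path, encoding="utf-8") as f:
        vocab = json.load(f)  # token -> id mapping
    return sorted(set(string.ascii_lowercase) - set(vocab))
```

An empty return value means all 26 letters are covered; anything else lists the characters the tokenizer cannot produce.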

@ghosthunterk

I encountered the same problem even with a larger dataset (91 steps per epoch, 20 epochs).

@khanld
Owner

khanld commented Jul 21, 2023

I have not tried other-language datasets yet. Can you share more information about your dataset, config, TensorBoard logs, etc.?

@ghosthunterk

Python 3.8.
I pip-installed everything in requirements.txt, except torch 1.7.1, which I had to install with (conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=11.0 -c pytorch) because I have CUDA 11.4.
I tried both the VIVOS and Common Voice datasets, stored as .txt files written with pandas, separated by "|", with two columns: path (path on the server) and transcript (UTF-8 encoded).
When I printed the pred and label, I got these:
image

@ghosthunterk

The audio files are already preprocessed to a 16000 Hz sample rate and .wav format.
image

@khanld
Owner

khanld commented Jul 21, 2023

I can see that your model has not converged yet; the train loss is still high. Try increasing the learning rate for faster training.

@khanld
Owner

khanld commented Jul 21, 2023

Ping me at [email protected] for better debugging, since I rarely check GitHub notifications.

@ghosthunterk

> Ping me at mail [email protected] for better debugging since I rarely check the GitHub notifications

Already did, thanks.

@Shaobo-Z
Author

Shaobo-Z commented Jul 21, 2023 via email

@khanld
Owner

khanld commented Jul 21, 2023

I will take a look at my code, run some experiments on English datasets, and respond to you soon @Shaobo-Z.

@ghosthunterk

image
After experimenting for a while, I found that increasing the learning rate (to above 1e-5) and setting the scheduler's max learning rate to >= 1e-4 helped the model actually start learning after a while. Just be patient.
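To make the suggestion above concrete, here is a pure-Python sketch of a warmup-then-linear-decay schedule peaking at the 1e-4 max learning rate mentioned in this comment. The repo's actual scheduler is likely different (e.g. a torch LR scheduler configured in the config file); the function name and warmup fraction are assumptions for illustration.

```python
# Sketch of a warmup-then-linear-decay LR schedule peaking at max_lr=1e-4,
# the value suggested above. Names and warmup fraction are illustrative;
# the repo's real scheduler may differ.
def lr_at_step(step, total_steps, max_lr=1e-4, warmup_frac=0.1):
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp from ~0 up to max_lr over the warmup phase.
        return max_lr * (step + 1) / warmup_steps
    remaining = total_steps - warmup_steps
    progress = (step - warmup_steps) / max(1, remaining)
    # Linear decay from max_lr back toward 0.
    return max_lr * (1.0 - progress)

print(lr_at_step(99, 1000))  # end of a 100-step warmup -> 1e-04 peak
```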
