Add FreeVC implementation #201
base: main
Conversation
Here are some examples of our results: 1_src.mp4, 1_dst.mp4, 1_output.mp4, 2_src.mp4, 2_dst.mp4, 2_output.mp4, 3_src.mp4, 3_dst.mp4, 3_output.mp4
The quality of the samples sounds good. @Adorable-Qin Please check the code and documentation carefully.
Here are some examples of our results, using the checkpoint from epoch 183 (120k steps) of training (the examples above are from the pretrained checkpoint): 1_src.mp4, 1_tgt.mp4, 1_output.mp4, 2_src.mp4, 2_tgt.mp4, 2_output.mp4, 3_src.mp4, 3_tgt.mp4, 3_output.mp4
Our AutoDL server will expire tomorrow. Here is a demo video recording the training status: demo-video.mp4
```python
@torch.no_grad()
def load_sample(self, filename):
    filepath = os.path.join(self.vctk_16k_dir, filename)
```
Why is this line hard-coded? Is it possible to select datasets in config?
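One way to avoid the hard-coded directory would be to read the dataset root from the experiment config. A minimal sketch of that idea (the `data_dir` key and `Dataset16k` name are illustrative assumptions, not taken from the PR code):

```python
import json
import os


class Dataset16k:
    """Illustrative dataset wrapper whose root directory comes
    from a JSON experiment config instead of a hard-coded VCTK path."""

    def __init__(self, config_path):
        with open(config_path) as f:
            cfg = json.load(f)
        # e.g. {"data_dir": "/data/vctk-16k"} in the experiment config
        self.data_dir = cfg["data_dir"]

    def sample_path(self, filename):
        # Replaces the hard-coded self.vctk_16k_dir lookup.
        return os.path.join(self.data_dir, filename)
```

With this shape, switching datasets only requires pointing `data_dir` at a different root in the config, provided the preprocessing produces the same file layout.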
The original implementation only trains on the VCTK dataset.
- Data preprocessing relies on the VCTK directory structure to retrieve speaker tags.
- When splitting the train/val/test sets, each speaker's samples are split randomly, which ensures that every speaker appears in all three sets.
It should be possible to support other datasets if we can perform the same operations on them.
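The per-speaker split described above could be sketched as follows (a minimal illustration; the `samples_by_speaker` mapping and the 80/10/10 ratios are assumptions, not taken from the PR code):

```python
import random


def split_per_speaker(samples_by_speaker, ratios=(0.8, 0.1, 0.1), seed=1234):
    """Randomly split each speaker's samples so that every speaker
    appears in the train, val, and test sets."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for speaker, samples in samples_by_speaker.items():
        samples = list(samples)
        rng.shuffle(samples)
        n = len(samples)
        n_train = int(n * ratios[0])
        n_val = int(n * ratios[1])
        train += samples[:n_train]
        val += samples[n_train:n_train + n_val]
        test += samples[n_train + n_val:]
    return train, val, test
```

Because the split is done inside the per-speaker loop rather than over the pooled file list, no speaker can end up missing from any of the three sets (as long as each speaker has enough samples).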
Thank you for your explanation!
@RMSnow For this implementation, do we expect a universal model that can be trained on any dataset?
Yes, a universal FreeVC for any dataset is welcome. I think only FreeVC's model part needs to be integrated into Amphion.
What's the purpose of this file?
The whole directory (models/vc/FreeVC/speaker_encoder) is copied from:
- https://github.com/OlaWod/FreeVC/tree/81c169cdbfc97ff07ee2f501e9b88d543fc46126/speaker_encoder (MIT license)
- https://github.com/liusongxiang/ppg-vc/tree/b59cb9862cf4b82a3bdb589950e25cab85fc9b03/speaker_encoder (Apache-2.0 license)
We keep it unchanged to match the original implementation.
However, copying this much code and a pretrained checkpoint from other repos may be a problem. I'm not sure what the best practice is.
Thank you for your explanation.
@RMSnow Any advice about this?
@Adorable-Qin I think introducing such a pretrained speaker encoder is acceptable; it is just like WeNet. However, please add an acknowledgement to our main README before integrating it.
BTW, I think the .pt.txt file is strange. If it is a pretrained model, we can follow our pretrained models' convention to integrate it.
models/vc/FreeVC/speaker_encoder/data_objects/speaker_verification_dataset.py (resolved)
models/vc/FreeVC/train.py (outdated)
Is it possible to support multi-GPU training using an external library, such as the Accelerate library already used in Amphion?
We have tried multi-GPU training in another repo, using the Lightning framework to enable DDP training automatically, but it exits with an error soon after starting. Single-GPU training works well.
✨ Description
FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion
This PR is a part of AIR6063 final project.
FYI, we also have another repo which refactors the training pipeline. Both the PR code and the custom code can produce good checkpoints.
Here are our checkpoints trained with the PR code on a single NVIDIA RTX 4090.
🚧 Related Issues
During the project, we have opened some issues and another PR to help improve Amphion.
preprocessors/popbutfy.py may be incorrect #196

👨💻 Changes Proposed
🧑🤝🧑 Who Can Review?
[Please use the '@' symbol to mention any community member who is free to review the PR once the tests have passed. Feel free to tag members or contributors who might be interested in your PR.]
@zhizhengwu @RMSnow @Adorable-Qin
✅ Checklist