
[BUG]: Lower quality than the examples on the demo page #334

Open
GalenMarek14 opened this issue Nov 6, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@GalenMarek14

Describe the bug

After following the installation instructions (plus replacing phonemizer with https://github.com/justinjohn0306/phonemizer to make it work on Win 11), and using the same examples from the demo page, I was unable to replicate the quality of the examples. For example, the whispering voice always outputs something between a whisper and a normal voice. I tried both the inference script and the Gradio app, with the same result. Additionally, the duration calculator seems to be broken for Chinese: when set to auto, the output comes out twice as fast.

This is the demo page result:
https://vocaroo.com/15JxVNPRScwD

This is mine:
https://vocaroo.com/13b14dZCkNau

How To Reproduce

Steps to reproduce the behavior:
Follow the installation instructions on Windows 11 (with the replacement phonemizer fork) and generate audio.
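As an aside, instead of swapping in a phonemizer fork, the stock phonemizer can often be pointed at an existing eSpeak NG install on Windows via an environment variable. A minimal sketch; the DLL path below is an assumption (a typical default install location), so adjust it to your machine:

```python
import os

# Tell phonemizer where the eSpeak NG shared library lives.
# Must be set BEFORE phonemizer is imported; the path is a typical
# default install location and may differ on your system.
os.environ["PHONEMIZER_ESPEAK_LIBRARY"] = (
    r"C:\Program Files\eSpeak NG\libespeak-ng.dll"
)
```

If this works, no code changes to Amphion itself are needed.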

Expected behavior

Quality should be the same as the examples


Environment Information

  • Operating System: Windows 11
  • Python Version: Python 3.10.15
  • Driver & CUDA Version: Driver 546.92 & CUDA 12.4
  • Error Messages and Logs: posted above; quoted version below:

./models/tts/maskgct/g2p\sources\g2p_chinese_model\poly_bert_model.onnx
Start loading: facebook/w2v-bert-2.0
D:\AIMaskGCTTTS\Amphion\models\tts\maskgct\gradio_demo.py:103: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
stat_mean_var = torch.load("./models/tts/maskgct/ckpt/wav2vec2bert_stats.pt")
D:\AIMaskGCTTTS\venv\lib\site-packages\torch\nn\utils\weight_norm.py:143: FutureWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
WeightNorm.apply(module, name, dim)
Models built successfully.
Checkpoints downloaded successfully.
Checkpoints loaded successfully.

To create a public link, set share=True in launch().
===== New task submitted =====
Start inference...
Audio loaded.
D:\AIMaskGCTTTS\venv\lib\site-packages\whisper\__init__.py:150: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
checkpoint = torch.load(fp, map_location=device)
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Ozkan\AppData\Local\Temp\jieba.cache
Loading model cost 0.321 seconds.
Prefix dict has been built successfully.
D:\AIMaskGCTTTS\Amphion\models\tts\maskgct\g2p\g2p\chinese_model_g2p.py:100: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\utils\tensor_new.cpp:281.)
batch_label_starts = torch.tensor(batch_label_starts, dtype=torch.long)
Saved: ./output/output_0.wav
===== New task submitted =====
Start inference...
Audio loaded.
Saved: ./output/output_1.wav

Additional context

Thank you very much for this project

@yuantuo666
Collaborator

Hi, to me it sounds like the generated speech is trying to speak in a whisper style. You may run inference multiple times to get the best result. Beyond that, this could be improved by fine-tuning on high-quality whispered speech, or by adding more whispered speech during training.
Note that MaskGCT was not designed for whispered-speech generation; we only discovered this capability while testing the model, which is why it cannot produce whispered speech every time.
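Since generation is stochastic, one concrete way to act on the run-it-multiple-times advice is to generate several takes and keep the best one. A minimal sketch; `synthesize` and `score` are hypothetical placeholders (not project APIs) for the real MaskGCT inference call and whatever quality judgment you use:

```python
def pick_best_take(synthesize, score, text, n_tries=5):
    """Run a stochastic TTS call several times and return the best output.

    `synthesize(text, seed=...)` is a placeholder for the real inference call;
    `score(audio)` is any quality metric (an automatic MOS predictor, or
    simply listening to the takes and ranking them by hand).
    """
    takes = [synthesize(text, seed=s) for s in range(n_tries)]
    return max(takes, key=score)
```

Varying the seed between runs matters: with a fixed seed every take would be identical, so there would be nothing to pick between.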

Answer for: #340 (comment)

TKsavy and yuantuo666, I can run the project on my Windows 11 machine, but I couldn't reproduce the demo page examples. What could be the problem? For example, the whisper voice example on the demo page: I downloaded the sample from there and generated the same text, but it always outputs something between a whisper and a low voice, whereas the demo page examples are successful clones. My generations are generally of lower quality regardless of the number of inference steps; I've tried up to 100.

I've also tried every version, including this one, the Windows fork, and Google Colab (to try it on a Linux environment), but all of them produce inferior results compared to your examples. Are the shared models from a previous training point, by any chance? Are you able to reproduce those results with the current shared models?

This was my issue for this matter with detailed logs and outputs: #334

Since I did not participate in training MaskGCT or generating the demo, I don't know the details. Could @HeCheng0625 help with this?

@steven8274

I have the same problem. I followed the steps at https://github.com/open-mmlab/Amphion/tree/main/models/tts/maskgct and started the Gradio demo, but the generated audio is not as good as the official examples. I used the reference audio downloaded from https://maskgct.github.io/audios/icl_smaples/icl_10.wav with the target text '顿时,气氛变得沉郁起来。乍看之下,一切的困扰仿佛都围绕在我身边。我皱着眉头,感受着那份压力,但我知道我不能放弃,不能认输。于是,我深吸一口气,心底的声音告诉我:"无论如何,都要冷静下来,重新开始。"', both from the first example of 'Zero-shot In-context Learning'.
