Preprocessing code for Chinese #14
Comments
These are what I'm going to try:
Any other suggestions? Thanks!
When I trained the multilingual PL-BERT (English, Japanese, Chinese), I tried two preprocessing methods for Chinese and didn't notice any difference in quality on the downstream TTS tasks (possibly also because the AiShell dataset is simple, much like VCTK, with no clear context or emotion).

The simplest way is character-level P2G, i.e., you treat each character as a grapheme. You should also take into account the change of pronunciation for polyphonic characters in different contexts (for example, "了" can be read as either "liao" or "le" depending on the context).

Another, more complicated way is to represent graphemes at the word level. For example, you treat "了" (particle) as one grapheme, but you treat "了解" as another grapheme (instead of two graphemes, "了" and "解"). This probably helps for Japanese too, since a lot of graphemes are shared between Chinese and Japanese.

I don't think there's any need to train a BPE tokenizer; I'm not sure what it would be for. I will leave this issue open in case someone else needs to train a PL-BERT in Chinese.
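A minimal sketch of the character-level scheme, using pypinyin as an illustrative G2P tool (my own choice, not necessarily what was used above); its phrase dictionary is assumed to resolve polyphonic characters such as "了":

```python
# Character-level graphemes with context-aware pinyin as the phonemes.
# Assumes pypinyin (pip install pypinyin) and a pure-Chinese input string,
# so lazy_pinyin returns one syllable per character.
from pypinyin import lazy_pinyin, Style

def char_level_pairs(text):
    """Return (grapheme, phoneme) pairs, one per character."""
    phones = lazy_pinyin(text, style=Style.TONE3)
    return list(zip(list(text), phones))

print(char_level_pairs("我了解了"))
# e.g. [('我', 'wo3'), ('了', 'liao3'), ('解', 'jie3'), ('了', 'le')]
```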
I've tried `tokenizer = BertTokenizer.from_pretrained("uer/gpt2-chinese-cluecorpussmall")` and got some wrong output from `tokenizer.tokenize('这是一句中文文本,时间是2023年7月12日。')`: digits like '2023', '7', and '12' should be read out, and '2023' probably should not be split into separate tokens, so maybe I need to run a text normalization module first.

Also, you mentioned treating "了解" as a single grapheme; this requires a Chinese word tokenizer. May I know which tokenizer you used?

I'm quite surprised that there is no difference in quality on the downstream TTS tasks, because inputting a wrong grapheme hurts naturalness a lot. May I know how you evaluated the output Chinese speech? Does the result outperform a baseline that is not pretrained on a large-scale corpus? Thanks!
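A minimal text-normalization sketch for the date example above; the digit-reading rules and the `normalize_dates` helper are hand-rolled for illustration, not part of any existing module:

```python
import re

DIGITS = "零一二三四五六七八九"

def read_digits(num: str) -> str:
    """Read a number digit by digit, e.g. '2023' -> '二零二三' (year style)."""
    return "".join(DIGITS[int(d)] for d in num)

def read_small_int(num: str) -> str:
    """Read 1-99 with 十, e.g. '7' -> '七', '12' -> '十二'."""
    n = int(num)
    if n < 10:
        return DIGITS[n]
    tens, ones = divmod(n, 10)
    return (DIGITS[tens] if tens > 1 else "") + "十" + (DIGITS[ones] if ones else "")

def normalize_dates(text: str) -> str:
    """Rewrite 'YYYY年M月D日' so every token has a Chinese reading."""
    return re.sub(
        r"(\d{4})年(\d{1,2})月(\d{1,2})日",
        lambda m: read_digits(m.group(1)) + "年"
                  + read_small_int(m.group(2)) + "月"
                  + read_small_int(m.group(3)) + "日",
        text,
    )

print(normalize_dates("这是一句中文文本,时间是2023年7月12日。"))
# 这是一句中文文本,时间是二零二三年七月十二日。
```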
Hi @TinaChen95, how is your attempt going? Could you please share the desired input and output format for preprocessing Chinese? A few examples would be very helpful. Thank you.
@TinaChen95 You can use the tokenizers here, https://fengshenbang-doc.readthedocs.io/zh/latest/index.html, which offer word-level tokenization instead of character-level. It is true that you will need to normalize dates and numbers to their readings.

As for performance on Mandarin, I only tested on the AiShell dataset, which is similar to VCTK in that it has no emotion or context, so the difference is probably not that big. I could not find any Chinese audiobook or emotional speech dataset with context like LJSpeech or LibriTTS, so if you know of one I can test on, please let me know.

Also, since PL-BERT is eventually fine-tuned with the TTS model, as long as the phonemes are correct in the TTS dataset, incorrect phonemization during pre-training has little effect. This is only confirmed for English datasets, but I believe similar things should hold true for Chinese.
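For the word-level grapheme scheme, any Chinese word segmenter can supply the boundaries. Here is a minimal sketch using jieba as a stand-in segmenter (the linked Fengshenbang tokenizers could be dropped in instead); the `word_level_graphemes` helper is purely illustrative:

```python
# Word-level graphemes: "了解" stays one grapheme, the particle "了" another.
# Assumes jieba (pip install jieba) as an illustrative segmenter and reuses
# pypinyin from the earlier sketch for the phoneme side.
import jieba
from pypinyin import lazy_pinyin, Style

def word_level_graphemes(text):
    """Return (word_grapheme, phoneme_list) pairs from a segmented sentence."""
    return [(w, lazy_pinyin(w, style=Style.TONE3)) for w in jieba.lcut(text)]

print(word_level_graphemes("我了解了这件事"))
# e.g. [('我', ['wo3']), ('了解', ['liao3', 'jie3']), ('了', ['le']),
#       ('这件事', ['zhe4', 'jian4', 'shi4'])]
```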
Do you have any suggestions for Chinese data preprocessing?
For example, text normalization, g2p, etc.
In your experience, will the accuracy of the g2p model have a great impact on model performance?