
Commit 2940c43

Update MaskGCT env setup and notebook (#316)
* Update MaskGCT env setup and notebook
1 parent 415a0a6 commit 2940c43

6 files changed: +353 additions, −58 deletions


models/tts/maskgct/README.md

Lines changed: 85 additions & 10 deletions
@@ -5,7 +5,7 @@
 [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-demo-pink)](https://huggingface.co/spaces/amphion/maskgct)
 [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](../../../models/tts/maskgct/README.md)

-[正式版公测地址(趣丸千音](https://voice.funnycp.com/)
+Public beta version address 公测版地址: [趣丸千音](https://voice.funnycp.com/)

 ## Overview

@@ -21,17 +21,93 @@ MaskGCT (**Mask**ed **G**enerative **C**odec **T**ransformer) is *a fully non-au

 - **2024/10/19**: We release **MaskGCT**, a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision. MaskGCT is trained on the [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) dataset and achieves SOTA zero-shot TTS performance.

+## Issues
+
+If you encounter any issue when using MaskGCT, feel free to open an issue in this repository. Please describe it in **English**, which makes keyword searching easier and lets more people join the discussion.
+
 ## Quickstart

-**Clone and install**
+### Clone and Environment
+
+In this part, follow the steps below to clone the repository and set up the environment:
+
+1. Clone the repository; you can choose (a) a partial clone or (b) a full clone.
+2. Install the environment following the guide below.
+
+#### 1. (a) Partial clone
+
+Since the whole Amphion repository is large, you can use sparse-checkout to download only the needed code.
+
+```bash
+# download meta info only
+git clone --no-checkout --filter=blob:none https://github.com/open-mmlab/Amphion.git
+
+# enter the repository directory
+cd Amphion
+
+# set up sparse-checkout
+git sparse-checkout init --cone
+git sparse-checkout set models/tts/maskgct
+
+# download the needed code
+git checkout main
+git sparse-checkout add models/codec utils
+```
+
+#### 1. (b) Full clone
+
+If you prefer to download the whole repository, you can use the following commands.

 ```bash
 git clone https://github.com/open-mmlab/Amphion.git
-# create env
-bash ./models/tts/maskgct/env.sh
+
+# enter the repository directory
+cd Amphion
+```
+
+#### 2. Install the environment
+
+Before you start installing, make sure you are in the `Amphion` directory. If not, use `cd` to enter it.
+
+Since we use `phonemizer` to convert text to phonemes, you need to install `espeak-ng` first. More details can be found [here](https://bootphon.github.io/phonemizer/install.html). Choose the correct installation command according to your operating system:
+
+```bash
+# For Debian-like distributions (e.g. Ubuntu, Mint, etc.)
+sudo apt-get install espeak-ng
+# For RedHat-like distributions (e.g. CentOS, Fedora, etc.)
+sudo yum install espeak-ng
+
+# For Windows
+# Please visit https://github.com/espeak-ng/espeak-ng/releases to download the .msi installer
+```
+
+It is recommended to use conda to configure the environment. You can use the following commands to create and activate a new conda environment.
+
+```bash
+conda create -n maskgct python=3.10
+conda activate maskgct
 ```

-**Model download**
+Then, install the Python packages.
+
+```bash
+pip install -r models/tts/maskgct/requirements.txt
+```
+
+### Jupyter Notebook
+
+We provide a [Jupyter notebook](../../../models/tts/maskgct/maskgct_demo.ipynb) that shows how to run MaskGCT inference.
+
+After installing the environment, you can open this notebook and start running it.
+
+### Start from Scratch
+
+If you do not want to use the Jupyter notebook, you can start from scratch with the following steps:
+
+1. Download the pretrained model.
+2. Load the model and run inference.
+
+#### 1. Model download

 We provide the following pretrained checkpoints:
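If `espeak-ng` or `phonemizer` is missing, inference fails at the text frontend, so a quick check of the environment set up above is worthwhile. The sketch below is not part of this commit; it assumes the `maskgct` conda environment is active and the packages from `requirements.txt` are installed.

```python
# Minimal sanity check for the text-to-phoneme path used by MaskGCT.
# Assumes `phonemizer` (from requirements.txt) and the system package
# `espeak-ng` are both installed.
from phonemizer import phonemize

text = "We do not break. We never give in. We never back down."
phonemes = phonemize(text, language="en-us", backend="espeak", strip=True)
print(phonemes)  # prints a phoneme string if espeak-ng is correctly installed
```

If this raises a backend error, `espeak-ng` is not installed or not visible on the system path.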

@@ -63,10 +139,12 @@ s2a_1layer_ckpt = hf_hub_download("amphion/MaskGCT", filename="s2a_model/s2a_mod
 s2a_full_ckpt = hf_hub_download("amphion/MaskGCT", filename="s2a_model/s2a_model_full/model.safetensors")
 ```

-**Basic Usage**
+#### 2. Basic Inference

 You can use the following code to generate speech from text and a prompt speech (the code is also provided in [inference.py](../../../models/tts/maskgct/maskgct_inference.py)).

+Run it with `python -m models.tts.maskgct.maskgct_inference`.
+
 ```python
 from models.tts.maskgct.maskgct_utils import *
 from huggingface_hub import hf_hub_download
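The same `hf_hub_download` pattern shown in the hunk above can pre-fetch checkpoints into a chosen folder for offline reuse. This is an illustrative sketch, not part of the commit; the `./maskgct_ckpts` directory name is an arbitrary choice, and only the two s2a files visible above are listed.

```python
# Optional pre-fetch of MaskGCT checkpoints into a local folder.
# `local_dir` is a standard hf_hub_download argument.
from huggingface_hub import hf_hub_download

FILES = [
    "s2a_model/s2a_model_1layer/model.safetensors",
    "s2a_model/s2a_model_full/model.safetensors",
    # extend with the remaining checkpoint files listed in the README
]
for name in FILES:
    path = hf_hub_download("amphion/MaskGCT", filename=name, local_dir="./maskgct_ckpts")
    print("downloaded:", path)
```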
@@ -92,7 +170,7 @@ if __name__ == "__main__":
 s2a_model_full = build_s2a_model(cfg.model.s2a_model.s2a_full, device)

 # download checkpoint
-...
+# ...

 # load semantic codec
 safetensors.torch.load_model(semantic_codec, semantic_code_ckpt)
@@ -132,9 +210,6 @@
 sf.write(save_path, recovered_audio, 24000)
 ```

-**Jupyter Notebook**
-
-We also provide a [jupyter notebook](../../../models/tts/maskgct/maskgct_demo.ipynb) to show more details of MaskGCT inference.

 ## Training Dataset

models/tts/maskgct/env.sh

Lines changed: 0 additions & 25 deletions
This file was deleted.

models/tts/maskgct/maskgct_demo.ipynb

Lines changed: 243 additions & 16 deletions
Large diffs are not rendered by default.

models/tts/maskgct/maskgct_inference.py

Lines changed: 1 addition & 1 deletion
@@ -65,7 +65,7 @@

 # inference
 prompt_wav_path = "./models/tts/maskgct/wav/prompt.wav"
-save_path = "[YOUR SAVE PATH]"
+save_path = "generated_audio.wav"
 prompt_text = " We do not break. We never give in. We never back down."
 target_text = "In this paper, we introduce MaskGCT, a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision."
 # Specify the target duration (in seconds). If target_len = None, we use a simple rule to predict the target duration.
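With the placeholder replaced by a concrete file name, the output can be inspected after a run. A small sketch, not part of the commit; it assumes `python -m models.tts.maskgct.maskgct_inference` has already been run from the Amphion root, and that `soundfile` is importable (it is already used as `sf` in the script).

```python
# Read back the file written by maskgct_inference.py (save_path above).
import soundfile as sf

audio, sr = sf.read("generated_audio.wav")
print(f"{sr} Hz, {len(audio) / sr:.2f} s")  # the script writes audio at 24000 Hz
```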

models/tts/maskgct/maskgct_utils.py

Lines changed: 0 additions & 6 deletions
@@ -8,12 +8,6 @@
 import torch.nn.functional as F
 import numpy as np
 import librosa
-import os
-import pickle
-import math
-import json
-import accelerate
-import safetensors
 from utils.util import load_config
 from tqdm import tqdm

models/tts/maskgct/requirements.txt

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+setuptools
+onnxruntime
+torch==2.0.1
+transformers===4.41.1
+tensorboard
+tensorboardX
+accelerate==0.31.0
+unidecode
+numpy==1.23.5
+
+librosa
+encodec
+phonemizer
+g2p_en
+jieba
+cn2an
+pypinyin
+LangSegment
+pyopenjtalk
+pykakasi
+
+json5
+black==24.1.1
+ruamel.yaml
+tqdm
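To confirm the pinned packages above resolved as intended, a rough version check can be run in the new environment. This sketch is illustrative and not part of the commit; exact version strings may carry build suffixes (e.g. a CUDA tag on torch).

```python
# Rough confirmation that the pinned versions from requirements.txt are active.
import accelerate
import numpy
import torch
import transformers

print("torch       ", torch.__version__)         # pinned to 2.0.1
print("transformers", transformers.__version__)  # pinned to 4.41.1
print("accelerate  ", accelerate.__version__)    # pinned to 0.31.0
print("numpy       ", numpy.__version__)         # pinned to 1.23.5
```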
