
Commit 2940c43

Update MaskGCT env setup and notebook (#316)
* Update MaskGCT env setup and notebook
1 parent 415a0a6 commit 2940c43

6 files changed: +353 additions, −58 deletions


models/tts/maskgct/README.md

Lines changed: 85 additions & 10 deletions
@@ -5,7 +5,7 @@
 [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-demo-pink)](https://huggingface.co/spaces/amphion/maskgct)
 [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](../../../models/tts/maskgct/README.md)

-[正式版公测地址(趣丸千音](https://voice.funnycp.com/)
+Public beta version address 公测版地址: [趣丸千音](https://voice.funnycp.com/)

 ## Overview

@@ -21,17 +21,93 @@ MaskGCT (**Mask**ed **G**enerative **C**odec **T**ransformer) is *a fully non-au

 - **2024/10/19**: We release **MaskGCT**, a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision. MaskGCT is trained on the [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) dataset and achieves SOTA zero-shot TTS performance.

+## Issues
+
+If you encounter any issue when using MaskGCT, feel free to open an issue in this repository. Please describe it in **English**, which makes keyword searching easier and lets more people join the discussion.
+
 ## Quickstart

-**Clone and install**
+### Clone and Environment
+
+In this part, follow the steps below to clone the repository and set up the environment:
+
+1. Clone the repository; you can choose (a) a partial clone or (b) a full clone.
+2. Install the environment following the guide below.
+
+#### 1. (a) Partial clone
+
+Since the whole Amphion repository is large, you can use sparse-checkout to download only the needed code.
+
+```bash
+# download meta info only
+git clone --no-checkout --filter=blob:none https://github.com/open-mmlab/Amphion.git
+
+# enter the repository directory
+cd Amphion
+
+# set up sparse-checkout
+git sparse-checkout init --cone
+git sparse-checkout set models/tts/maskgct
+
+# download the needed code
+git checkout main
+git sparse-checkout add models/codec utils
+```
+
+#### 1. (b) Full clone
+
+If you prefer to download the whole repository, you can use the following commands.

 ```bash
 git clone https://github.com/open-mmlab/Amphion.git
-# create env
-bash ./models/tts/maskgct/env.sh
+
+# enter the repository directory
+cd Amphion
+```
+
+#### 2. Install the environment
+
+Before you start installing, make sure you are in the `Amphion` directory. If not, use `cd` to enter it.
+
+Since we use `phonemizer` to convert text to phonemes, you need to install `espeak-ng` first. More details can be found [here](https://bootphon.github.io/phonemizer/install.html). Choose the correct installation command according to your operating system:
+
+```bash
+# For Debian-like distributions (e.g. Ubuntu, Mint, etc.)
+sudo apt-get install espeak-ng
+# For RedHat-like distributions (e.g. CentOS, Fedora, etc.)
+sudo yum install espeak-ng
+
+# For Windows
+# Please visit https://github.com/espeak-ng/espeak-ng/releases to download the .msi installer
+```
+
+It is recommended to use conda to configure the environment. You can use the following commands to create and activate a new conda environment.
+
+```bash
+conda create -n maskgct python=3.10
+conda activate maskgct
 ```

-**Model download**
+Then, install the Python packages.
+
+```bash
+pip install -r models/tts/maskgct/requirements.txt
+```
+
+### Jupyter Notebook
+
+We provide a [Jupyter notebook](../../../models/tts/maskgct/maskgct_demo.ipynb) that shows how to run MaskGCT inference.
+
+After installing the environment, you can open this notebook and start running it.
+
+### Start from Scratch
+
+If you do not want to use the Jupyter notebook, you can start from scratch with the following steps:
+
+1. Download the pretrained model.
+2. Load the model and run inference.
+
+#### 1. Model download

 We provide the following pretrained checkpoints:
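If `espeak-ng` or `phonemizer` is missing, inference fails at the text frontend, so a quick check of the environment set up above is worthwhile. The sketch below is not part of this commit; it assumes the `maskgct` conda environment is active and the packages from `requirements.txt` are installed.

```python
# Minimal sanity check for the text-to-phoneme path used by MaskGCT.
# Assumes `phonemizer` (from requirements.txt) and the system package
# `espeak-ng` are both installed.
from phonemizer import phonemize

text = "We do not break. We never give in. We never back down."
phonemes = phonemize(text, language="en-us", backend="espeak", strip=True)
print(phonemes)  # prints a phoneme string if espeak-ng is correctly installed
```

If this raises a backend error, `espeak-ng` is not installed or not visible on the system path.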

@@ -63,10 +139,12 @@ s2a_1layer_ckpt = hf_hub_download("amphion/MaskGCT", filename="s2a_model/s2a_mod
 s2a_full_ckpt = hf_hub_download("amphion/MaskGCT", filename="s2a_model/s2a_model_full/model.safetensors")
 ```

-**Basic Usage**
+#### 2. Basic Inference

 You can use the following code to generate speech from text and a prompt speech (the code is also provided in [inference.py](../../../models/tts/maskgct/maskgct_inference.py)).

+Run it with `python -m models.tts.maskgct.maskgct_inference`.
+
 ```python
 from models.tts.maskgct.maskgct_utils import *
 from huggingface_hub import hf_hub_download
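The same `hf_hub_download` pattern shown in the hunk above can pre-fetch checkpoints into a chosen folder for offline reuse. This is an illustrative sketch, not part of the commit; the `./maskgct_ckpts` directory name is an arbitrary choice, and only the two s2a files visible above are listed.

```python
# Optional pre-fetch of MaskGCT checkpoints into a local folder.
# `local_dir` is a standard hf_hub_download argument.
from huggingface_hub import hf_hub_download

FILES = [
    "s2a_model/s2a_model_1layer/model.safetensors",
    "s2a_model/s2a_model_full/model.safetensors",
    # extend with the remaining checkpoint files listed in the README
]
for name in FILES:
    path = hf_hub_download("amphion/MaskGCT", filename=name, local_dir="./maskgct_ckpts")
    print("downloaded:", path)
```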
@@ -92,7 +170,7 @@ if __name__ == "__main__":
 s2a_model_full = build_s2a_model(cfg.model.s2a_model.s2a_full, device)

 # download checkpoint
-...
+# ...

 # load semantic codec
 safetensors.torch.load_model(semantic_codec, semantic_code_ckpt)
@@ -132,9 +210,6 @@
 sf.write(save_path, recovered_audio, 24000)
 ```

-**Jupyter Notebook**
-
-We also provide a [jupyter notebook](../../../models/tts/maskgct/maskgct_demo.ipynb) to show more details of MaskGCT inference.

 ## Training Dataset

models/tts/maskgct/env.sh

Lines changed: 0 additions & 25 deletions
This file was deleted.

models/tts/maskgct/maskgct_demo.ipynb

Lines changed: 243 additions & 16 deletions
Large diffs are not rendered by default.

models/tts/maskgct/maskgct_inference.py

Lines changed: 1 addition & 1 deletion
@@ -65,7 +65,7 @@

 # inference
 prompt_wav_path = "./models/tts/maskgct/wav/prompt.wav"
-save_path = "[YOUR SAVE PATH]"
+save_path = "generated_audio.wav"
 prompt_text = " We do not break. We never give in. We never back down."
 target_text = "In this paper, we introduce MaskGCT, a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision."
 # Specify the target duration (in seconds). If target_len = None, we use a simple rule to predict the target duration.
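With the placeholder replaced by a concrete file name, the output can be inspected after a run. A small sketch, not part of the commit; it assumes `python -m models.tts.maskgct.maskgct_inference` has already been run from the Amphion root, and that `soundfile` is importable (it is already used as `sf` in the script).

```python
# Read back the file written by maskgct_inference.py (save_path above).
import soundfile as sf

audio, sr = sf.read("generated_audio.wav")
print(f"{sr} Hz, {len(audio) / sr:.2f} s")  # the script writes audio at 24000 Hz
```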

models/tts/maskgct/maskgct_utils.py

Lines changed: 0 additions & 6 deletions
@@ -8,12 +8,6 @@
 import torch.nn.functional as F
 import numpy as np
 import librosa
-import os
-import pickle
-import math
-import json
-import accelerate
-import safetensors
 from utils.util import load_config
 from tqdm import tqdm

models/tts/maskgct/requirements.txt

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+setuptools
+onnxruntime
+torch==2.0.1
+transformers===4.41.1
+tensorboard
+tensorboardX
+accelerate==0.31.0
+unidecode
+numpy==1.23.5
+
+librosa
+encodec
+phonemizer
+g2p_en
+jieba
+cn2an
+pypinyin
+LangSegment
+pyopenjtalk
+pykakasi
+
+json5
+black==24.1.1
+ruamel.yaml
+tqdm
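To confirm the pinned packages above resolved as intended, a rough version check can be run in the new environment. This sketch is illustrative and not part of the commit; exact version strings may carry build suffixes (e.g. a CUDA tag on torch).

```python
# Rough confirmation that the pinned versions from requirements.txt are active.
import accelerate
import numpy
import torch
import transformers

print("torch       ", torch.__version__)         # pinned to 2.0.1
print("transformers", transformers.__version__)  # pinned to 4.41.1
print("accelerate  ", accelerate.__version__)    # pinned to 0.31.0
print("numpy       ", numpy.__version__)         # pinned to 1.23.5
```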
