diff --git a/models/tts/maskgct/README.md b/models/tts/maskgct/README.md
index 811ba209..b75a1ea6 100644
--- a/models/tts/maskgct/README.md
+++ b/models/tts/maskgct/README.md
@@ -5,7 +5,7 @@
[](https://huggingface.co/spaces/amphion/maskgct)
[](../../../models/tts/maskgct/README.md)
-[正式版公测地址(趣丸千音)](https://voice.funnycp.com/)
+Public beta version: [趣丸千音](https://voice.funnycp.com/)
## Overview
@@ -21,17 +21,93 @@ MaskGCT (**Mask**ed **G**enerative **C**odec **T**ransformer) is *a fully non-au
- **2024/10/19**: We release **MaskGCT**, a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision. MaskGCT is trained on the [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) dataset and achieves SOTA zero-shot TTS performance.
+## Issues
+
+If you encounter any issues when using MaskGCT, feel free to open an issue in this repository. Please describe it in **English**: this makes keyword searching easier and lets more people take part in the discussion.
+
## Quickstart
-**Clone and install**
+### Clone and Environment
+
+Follow the steps below to clone the repository and install the environment.
+
+1. Clone the repository, using either (a) a partial clone or (b) a full clone.
+2. Install the environment by following the guide below.
+
+#### 1. (a) Partial clone
+
+Since the whole Amphion repository is large, you can use sparse-checkout to download only the needed code.
+
+```bash
+# download meta info only
+git clone --no-checkout --filter=blob:none https://github.com/open-mmlab/Amphion.git
+
+# enter the repository directory
+cd Amphion
+
+# set up sparse-checkout
+git sparse-checkout init --cone
+git sparse-checkout set models/tts/maskgct
+
+# download the needed code
+git checkout main
+git sparse-checkout add models/codec utils
+```
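
After a partial clone it can be useful to confirm that the three directories requested above (`models/tts/maskgct`, `models/codec`, `utils`) are actually present. The following is a minimal sketch, not part of the repository; the helper name `missing_dirs` is hypothetical, and it assumes you run it from the `Amphion` directory.

```python
from pathlib import Path

# Directories requested by the sparse-checkout commands above.
REQUIRED_DIRS = ("models/tts/maskgct", "models/codec", "utils")


def missing_dirs(repo_root="."):
    """Return the required directories that are absent under repo_root."""
    root = Path(repo_root)
    return [d for d in REQUIRED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs()
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("Sparse checkout looks complete.")
```

If anything is reported missing, re-running `git sparse-checkout add` for that path is the likely fix.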
+
+#### 1. (b) Full clone
+
+If you prefer to download the whole repository, you can use the following command.
```bash
git clone https://github.com/open-mmlab/Amphion.git
-# create env
-bash ./models/tts/maskgct/env.sh
+
+# enter the repository directory
+cd Amphion
+```
+
+#### 2. Install the environment
+
+Before installing, make sure you are in the `Amphion` directory. If not, use `cd` to enter it.
+
+Since we use `phonemizer` to convert text to phonemes, you need to install `espeak-ng` first. More details can be found [here](https://bootphon.github.io/phonemizer/install.html). Choose the correct installation command according to your operating system:
+
+```bash
+# For Debian-like distributions (e.g. Ubuntu, Mint, etc.)
+sudo apt-get install espeak-ng
+# For RedHat-like distributions (e.g. CentOS, Fedora, etc.)
+sudo yum install espeak-ng
+
+# For Windows
+# Please visit https://github.com/espeak-ng/espeak-ng/releases to download .msi installer
+```
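
To check quickly whether the install succeeded, one option is to look for the `espeak-ng` executable on `PATH`. This is a rough sketch under that assumption: the helper name `find_executable` is hypothetical, and `phonemizer` may locate espeak-ng through other means (for example a shared library), so a miss here is only a hint, not proof that the backend is unavailable.

```python
import shutil


def find_executable(names=("espeak-ng",)):
    """Return the path of the first executable found on PATH, or None."""
    for name in names:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


if __name__ == "__main__":
    path = find_executable()
    if path is None:
        print("espeak-ng was not found on PATH.")
    else:
        print(f"Found espeak-ng at {path}")
```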
+
+It is recommended to use conda to configure the environment. You can use the following command to create and activate a new conda environment.
+
+```bash
+conda create -n maskgct python=3.10
+conda activate maskgct
```
-**Model download**
+Then, install the Python packages.
+
+```bash
+pip install -r models/tts/maskgct/requirements.txt
+```
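
Several entries in `requirements.txt` are pinned (for example `torch==2.0.1` and `accelerate==0.31.0`), so it may be worth confirming that the active environment matches them. The sketch below is illustrative only: the helper names `parse_pins` and `check_pins` are hypothetical, and the parser handles only plain `name==version` and `name===version` lines.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_pins(requirements_text):
    """Map package name -> pinned version for lines using '==' or '==='."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if "==" not in line:
            continue  # unpinned requirement or blank line
        name, _, pinned = line.partition("==")
        pins[name.strip()] = pinned.lstrip("=").strip()
    return pins


def check_pins(pins):
    """Return {package: (pinned, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            # Ignore a local version label such as '+cu117'.
            installed = version(name).split("+", 1)[0]
        except PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches
```

Reading `models/tts/maskgct/requirements.txt`, passing its text to `parse_pins`, and then passing the result to `check_pins` returns an empty dict when every pinned package matches.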
+
+### Jupyter Notebook
+
+We provide a [Jupyter notebook](../../../models/tts/maskgct/maskgct_demo.ipynb) that shows how to run inference with MaskGCT.
+
+After installing the environment, you can open this notebook and start running.
+
+### Start from Scratch
+
+If you do not want to use the Jupyter notebook, you can start from scratch by following the steps below.
+
+1. Download the pretrained model.
+2. Load the model and run inference.
+
+#### 1. Model download
We provide the following pretrained checkpoints:
@@ -63,10 +139,12 @@ s2a_1layer_ckpt = hf_hub_download("amphion/MaskGCT", filename="s2a_model/s2a_mod
s2a_full_ckpt = hf_hub_download("amphion/MaskGCT", filename="s2a_model/s2a_model_full/model.safetensors")
```
-**Basic Usage**
+#### 2. Basic Inference
You can use the following code to generate speech from text and a prompt speech (the code is also provided in [inference.py](../../../models/tts/maskgct/maskgct_inference.py)).
+Run it with `python -m models.tts.maskgct.maskgct_inference`.
+
```python
from models.tts.maskgct.maskgct_utils import *
from huggingface_hub import hf_hub_download
@@ -92,7 +170,7 @@ if __name__ == "__main__":
s2a_model_full = build_s2a_model(cfg.model.s2a_model.s2a_full, device)
# download checkpoint
- ...
+ # ...
# load semantic codec
safetensors.torch.load_model(semantic_codec, semantic_code_ckpt)
@@ -132,9 +210,6 @@ if __name__ == "__main__":
sf.write(save_path, recovered_audio, 24000)
```
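
The demo notebook updated in this change prints a predicted semantic shape of 900 for `target_len = 18` and 600 for `target_len = 12`. That is consistent with about 50 semantic tokens per second, although this rate is inferred from those outputs and not stated in the README. A small sketch under that assumption, with hypothetical helper names, for estimating sizes before running inference:

```python
# Assumption: inferred from the notebook outputs (18 s -> 900, 12 s -> 600).
SEMANTIC_TOKENS_PER_SECOND = 50
# Sample rate passed to sf.write in the example above.
SAMPLE_RATE = 24000


def expected_semantic_tokens(target_len_seconds):
    """Estimate the number of semantic tokens for a target duration."""
    return int(target_len_seconds * SEMANTIC_TOKENS_PER_SECOND)


def audio_duration_seconds(num_samples, sample_rate=SAMPLE_RATE):
    """Duration of a waveform with num_samples samples at sample_rate."""
    return num_samples / sample_rate


if __name__ == "__main__":
    print(expected_semantic_tokens(18))
    print(audio_duration_seconds(24000 * 12))
```

Comparing `audio_duration_seconds(len(recovered_audio))` with the requested `target_len` is one way to sanity-check a generated clip.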
-**Jupyter Notebook**
-
-We also provide a [jupyter notebook](../../../models/tts/maskgct/maskgct_demo.ipynb) to show more details of MaskGCT inference.
## Training Dataset
diff --git a/models/tts/maskgct/env.sh b/models/tts/maskgct/env.sh
deleted file mode 100644
index ed595673..00000000
--- a/models/tts/maskgct/env.sh
+++ /dev/null
@@ -1,25 +0,0 @@
-pip install setuptools ruamel.yaml tqdm
-pip install tensorboard tensorboardX torch==2.0.1
-pip install transformers===4.41.1
-pip install -U encodec
-pip install black==24.1.1
-pip install oss2
-sudo apt-get install espeak-ng
-pip install phonemizer
-pip install g2p_en
-pip install accelerate==0.31.0
-pip install funasr zhconv zhon modelscope
-# pip install git+https://github.com/lhotse-speech/lhotse
-pip install timm
-pip install jieba cn2an
-pip install unidecode
-pip install -U cos-python-sdk-v5
-pip install pypinyin
-pip install jiwer
-pip install omegaconf
-pip install pyworld
-pip install py3langid==0.2.2 LangSegment
-pip install onnxruntime
-pip install pyopenjtalk
-pip install pykakasi
-pip install -U openai-whisper
\ No newline at end of file
diff --git a/models/tts/maskgct/maskgct_demo.ipynb b/models/tts/maskgct/maskgct_demo.ipynb
index 84609fbd..a9cacbf0 100644
--- a/models/tts/maskgct/maskgct_demo.ipynb
+++ b/models/tts/maskgct/maskgct_demo.ipynb
@@ -1,15 +1,68 @@
{
"cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## MaskGCT Demo\n",
+ "\n",
+ "This Jupyter notebook introduces the basic usage of MaskGCT.\n",
+ "\n",
+ "Please follow the guide in README.md to set up the environment before starting this notebook."
+ ]
+ },
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
+ "source": [
+ "import os\n",
+ "\n",
+ "# change to root directory of Amphion\n",
+ "cur_dir = os.getcwd()\n",
+ "if os.path.basename(cur_dir) == \"maskgct\":\n",
+ " pkg_rootdir = os.path.dirname(os.path.dirname(os.path.dirname(cur_dir)))\n",
+ " os.chdir(pkg_rootdir)\n",
+ "\n",
+ "os.getcwd()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/root/miniconda3/envs/maskgct/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
+ " from .autonotebook import tqdm as notebook_tqdm\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "./models/tts/maskgct/g2p/sources/g2p_chinese_model/poly_bert_model.onnx\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/root/miniconda3/envs/maskgct/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py:69: UserWarning: Specified provider 'CUDAExecutionProvider' is not in available provider names.Available providers: 'AzureExecutionProvider, CPUExecutionProvider'\n",
+ " warnings.warn(\n"
+ ]
+ }
+ ],
"source": [
"import torch\n",
"import numpy as np\n",
"import librosa\n",
"import safetensors\n",
+ "from IPython.display import Audio\n",
"from utils.util import load_config\n",
"\n",
"from models.codec.kmeans.repcodec_model import RepCodec\n",
@@ -23,7 +76,7 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
@@ -33,7 +86,7 @@
},
{
"cell_type": "code",
- "execution_count": 1,
+ "execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
@@ -84,7 +137,7 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
@@ -174,7 +227,7 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
@@ -198,7 +251,7 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
@@ -228,7 +281,19 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "\n",
+ "# change endpoint if needed\n",
+ "# os.environ[\"HF_ENDPOINT\"] = \"https://hf-mirror.com\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
@@ -248,9 +313,20 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 10,
"metadata": {},
- "outputs": [],
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "(set(), [])"
+ ]
+ },
+ "execution_count": 10,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
"source": [
"# load semantic codec\n",
"safetensors.torch.load_model(semantic_codec, semantic_code_ckpt)\n",
@@ -264,26 +340,177 @@
"safetensors.torch.load_model(s2a_model_full, s2a_full_ckpt)"
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Fixed length generation"
+ ]
+ },
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 11,
"metadata": {},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "predict semantic shape torch.Size([1, 900])\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ " \n",
+ " "
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "execution_count": 11,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
"source": [
"prompt_wav_path = \"./models/tts/maskgct/wav/prompt.wav\"\n",
"prompt_text = \" We do not break. We never give in. We never back down.\"\n",
"target_text = \"In this paper, we introduce MaskGCT, a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision.\"\n",
- "target_len = 18 # Specify the target duration (in seconds). If target_len = None, we use a simple rule to predict the target duration.\n",
- "recovered_audio = maskgct_inference(prompt_wav_path, prompt_text, target_text, \"en\", \"en\", target_len=target_len)"
+ "\n",
+ "# Specify the target duration (in seconds). If target_len = None, we use a simple rule to predict the target duration.\n",
+ "target_len = 18\n",
+ "\n",
+ "recovered_audio = maskgct_inference(\n",
+ " prompt_wav_path,\n",
+ " prompt_text,\n",
+ " target_text,\n",
+ " language=\"en\",\n",
+ " target_language=\"en\",\n",
+ " target_len=target_len\n",
+ ")\n",
+ "\n",
+ "Audio(recovered_audio, rate=24000)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Speed change"
]
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 12,
"metadata": {},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "predict semantic shape torch.Size([1, 600])\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ " \n",
+ " "
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "execution_count": 12,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
"source": [
- "from IPython.display import Audio\n",
+ "prompt_wav_path = \"./models/tts/maskgct/wav/prompt.wav\"\n",
+ "prompt_text = \" We do not break. We never give in. We never back down.\"\n",
+ "target_text = \"In this paper, we introduce MaskGCT, a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision.\"\n",
+ "\n",
+ "# Specify the target duration (in seconds). If target_len = None, we use a simple rule to predict the target duration.\n",
+ "target_len = 12 # Make it faster\n",
+ "\n",
+ "recovered_audio = maskgct_inference(\n",
+ " prompt_wav_path,\n",
+ " prompt_text,\n",
+ " target_text,\n",
+ " language=\"en\",\n",
+ " target_language=\"en\",\n",
+ " target_len=target_len\n",
+ ")\n",
+ "\n",
+ "Audio(recovered_audio, rate=24000)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Cross-language generation"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "predict semantic shape torch.Size([1, 644])\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ " \n",
+ " "
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "execution_count": 14,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "prompt_wav_path = \"./models/tts/maskgct/wav/prompt.wav\"\n",
+ "prompt_text = \" We do not break. We never give in. We never back down.\"\n",
+ "target_text = \"在本文中,我们介绍了 MaskGCT,这是一种完全非自回归 TTS 模型,它不需要文本和语音监督之间的明确对齐信息。\"\n",
+ "\n",
+ "# Specify the target duration (in seconds). If target_len = None, we use a simple rule to predict the target duration.\n",
+ "target_len = None\n",
+ "\n",
+ "recovered_audio = maskgct_inference(\n",
+ " prompt_wav_path,\n",
+ " prompt_text,\n",
+ " target_text,\n",
+ " language=\"en\",\n",
+ " target_language=\"zh\", # use ISO 639-1 code, support: en, zh, ja, de, fr, ko\n",
+ " target_len=target_len\n",
+ ")\n",
+ "\n",
"Audio(recovered_audio, rate=24000)"
]
}
@@ -306,7 +533,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.9.2"
+ "version": "3.10.15"
}
},
"nbformat": 4,
diff --git a/models/tts/maskgct/maskgct_inference.py b/models/tts/maskgct/maskgct_inference.py
index 631ad2ce..d990f147 100644
--- a/models/tts/maskgct/maskgct_inference.py
+++ b/models/tts/maskgct/maskgct_inference.py
@@ -65,7 +65,7 @@
# inference
prompt_wav_path = "./models/tts/maskgct/wav/prompt.wav"
- save_path = "[YOUR SAVE PATH]"
+ save_path = "generated_audio.wav"
prompt_text = " We do not break. We never give in. We never back down."
target_text = "In this paper, we introduce MaskGCT, a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision."
# Specify the target duration (in seconds). If target_len = None, we use a simple rule to predict the target duration.
diff --git a/models/tts/maskgct/maskgct_utils.py b/models/tts/maskgct/maskgct_utils.py
index 35217c4c..75690a25 100644
--- a/models/tts/maskgct/maskgct_utils.py
+++ b/models/tts/maskgct/maskgct_utils.py
@@ -8,12 +8,6 @@
import torch.nn.functional as F
import numpy as np
import librosa
-import os
-import pickle
-import math
-import json
-import accelerate
-import safetensors
from utils.util import load_config
from tqdm import tqdm
diff --git a/models/tts/maskgct/requirements.txt b/models/tts/maskgct/requirements.txt
new file mode 100644
index 00000000..9db71fb8
--- /dev/null
+++ b/models/tts/maskgct/requirements.txt
@@ -0,0 +1,24 @@
+setuptools
+onnxruntime
+torch==2.0.1
+transformers===4.41.1
+tensorboard
+tensorboardX
+accelerate==0.31.0
+unidecode
+numpy==1.23.5
+
+librosa
+encodec
+phonemizer
+g2p_en
+jieba
+cn2an
+pypinyin
+LangSegment
+pyopenjtalk
+pykakasi
+
+json5
+black==24.1.1
+ruamel.yaml
+tqdm