diff --git a/models/tts/maskgct/README.md b/models/tts/maskgct/README.md
index 811ba209..b75a1ea6 100644
--- a/models/tts/maskgct/README.md
+++ b/models/tts/maskgct/README.md
@@ -5,7 +5,7 @@
 [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-demo-pink)](https://huggingface.co/spaces/amphion/maskgct)
 [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](../../../models/tts/maskgct/README.md)
 
-[正式版公测地址(趣丸千音)](https://voice.funnycp.com/)
+Public beta: [趣丸千音](https://voice.funnycp.com/)
 
 ## Overview
 
@@ -21,17 +21,93 @@ MaskGCT (**Mask**ed **G**enerative **C**odec **T**ransformer) is *a fully non-au
 
 - **2024/10/19**: We release **MaskGCT**, a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision. MaskGCT is trained on [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) dataset and achieves SOTA zero-shot TTS perfermance.
 
+## Issues
+
+If you encounter any issues when using MaskGCT, feel free to open an issue in this repository. Please describe it in **English**: this makes keyword searches easier and lets more people join the discussion.
+
 ## Quickstart
 
-**Clone and install**
+### Clone and Environment
+
+In this part, follow the steps below to clone the repository and install the environment.
+
+1. Clone the repository: you can choose either (a) partial clone or (b) full clone.
+2. Install the environment following the guide below.
+
+#### 1. (a) Partial clone
+
+Since the whole Amphion repository is large, you can use sparse-checkout to download only the needed code.
+
+```bash
+# download metadata only
+git clone --no-checkout --filter=blob:none https://github.com/open-mmlab/Amphion.git
+
+# enter the repository directory
+cd Amphion
+
+# set up sparse-checkout
+git sparse-checkout init --cone
+git sparse-checkout set models/tts/maskgct
+
+# download the needed code
+git checkout main
+git sparse-checkout add models/codec utils
+```
+
+#### 1. (b) Full clone
+
+If you prefer to download the whole repository, you can use the following commands.
 
 ```bash
 git clone https://github.com/open-mmlab/Amphion.git
-# create env
-bash ./models/tts/maskgct/env.sh
+
+# enter the repository directory
+cd Amphion
+```
+
+#### 2. Install the environment
+
+Before installing, make sure you are in the `Amphion` directory. If not, use `cd` to enter it.
+
+Since we use `phonemizer` to convert text to phonemes, you need to install `espeak-ng` first. More details can be found [here](https://bootphon.github.io/phonemizer/install.html). Choose the correct installation command for your operating system:
+
+```bash
+# For Debian-like distributions (e.g. Ubuntu, Mint, etc.)
+sudo apt-get install espeak-ng
+# For RedHat-like distributions (e.g. CentOS, Fedora, etc.)
+sudo yum install espeak-ng
+
+# For Windows
+# Please visit https://github.com/espeak-ng/espeak-ng/releases to download the .msi installer
+```
+
+It is recommended to use conda to configure the environment. You can use the following commands to create and activate a new conda environment.
+
+```bash
+conda create -n maskgct python=3.10
+conda activate maskgct
 ```
 
-**Model download**
+Then, install the Python packages.
+
+```bash
+pip install -r models/tts/maskgct/requirements.txt
+```
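+
+To check that `espeak-ng` and the Python packages are wired up correctly, you can run a quick sanity check (a minimal sketch; the exact phoneme string may vary across `espeak-ng` versions):
+
+```python
+# should print an IPA-like transcription such as "həloʊ wɜːld"
+from phonemizer import phonemize
+
+print(phonemize("hello world", language="en-us", backend="espeak"))
+```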
+
+### Jupyter Notebook
+
+We provide a [Jupyter notebook](../../../models/tts/maskgct/maskgct_demo.ipynb) that shows how to run MaskGCT inference.
+
+After installing the environment, you can open this notebook and start running it.
+
+### Start from Scratch
+
+If you do not want to use the Jupyter notebook, you can follow the steps below to start from scratch.
+
+1. Download the pretrained models.
+2. Load the models and run inference.
+
+#### 1. Model download
 
 We provide the following pretrained checkpoints:
@@ -63,10 +139,12 @@ s2a_1layer_ckpt = hf_hub_download("amphion/MaskGCT", filename="s2a_model/s2a_mod
 s2a_full_ckpt = hf_hub_download("amphion/MaskGCT", filename="s2a_model/s2a_model_full/model.safetensors")
 ```
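+
+If the Hugging Face Hub is slow or unreachable from your network, you can point `huggingface_hub` at a mirror first (the mirror URL below is the one referenced in the demo notebook; substitute any endpoint you trust):
+
+```python
+import os
+
+# set this before `huggingface_hub` is imported
+os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
+```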
 
-**Basic Usage**
+#### 2. Basic Inference
 
 You can use the following code to generate speech from text and a prompt speech (the code is also provided in [inference.py](../../../models/tts/maskgct/maskgct_inference.py)).
 
+Run it with `python -m models.tts.maskgct.maskgct_inference`.
+
 ```python
 from models.tts.maskgct.maskgct_utils import *
 from huggingface_hub import hf_hub_download
@@ -92,7 +170,7 @@ if __name__ == "__main__":
     s2a_model_full = build_s2a_model(cfg.model.s2a_model.s2a_full, device)
 
     # download checkpoint
-    ...
+    # ...
 
     # load semantic codec
     safetensors.torch.load_model(semantic_codec, semantic_code_ckpt)
@@ -132,9 +210,6 @@ if __name__ == "__main__":
 
     sf.write(save_path, recovered_audio, 24000)
 ```
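+
+When the script finishes, the synthesized speech is written to `save_path` as a 24 kHz wav file. A quick way to verify the output (a small check, not part of the original script):
+
+```python
+import soundfile as sf
+
+audio, sr = sf.read("generated_audio.wav")
+print(sr, audio.shape)  # expect a sample rate of 24000
+```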
\n", + " " + ], + "text/plain": [ + "" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "prompt_wav_path = \"./models/tts/maskgct/wav/prompt.wav\"\n", "prompt_text = \" We do not break. We never give in. We never back down.\"\n", "target_text = \"In this paper, we introduce MaskGCT, a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision.\"\n", - "target_len = 18 # Specify the target duration (in seconds). If target_len = None, we use a simple rule to predict the target duration.\n", - "recovered_audio = maskgct_inference(prompt_wav_path, prompt_text, target_text, \"en\", \"en\", target_len=target_len)" + "\n", + "# Specify the target duration (in seconds). If target_len = None, we use a simple rule to predict the target duration.\n", + "target_len = 18\n", + "\n", + "recovered_audio = maskgct_inference(\n", + " prompt_wav_path,\n", + " prompt_text,\n", + " target_text,\n", + " language=\"en\",\n", + " target_language=\"en\",\n", + " target_len=target_len\n", + ")\n", + "\n", + "Audio(recovered_audio, rate=24000)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Speed change" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 12, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "predict semantic shape torch.Size([1, 600])\n" + ] + }, + { + "data": { + "text/html": [ + "\n", + " \n", + " " + ], + "text/plain": [ + "" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "from IPython.display import Audio\n", + "prompt_wav_path = \"./models/tts/maskgct/wav/prompt.wav\"\n", + "prompt_text = \" We do not break. We never give in. We never back down.\"\n", + "target_text = \"In this paper, we introduce MaskGCT, a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision.\"\n", + "\n", + "# Specify the target duration (in seconds). If target_len = None, we use a simple rule to predict the target duration.\n", + "target_len = 12 # Make it faster\n", + "\n", + "recovered_audio = maskgct_inference(\n", + " prompt_wav_path,\n", + " prompt_text,\n", + " target_text,\n", + " language=\"en\",\n", + " target_language=\"en\",\n", + " target_len=target_len\n", + ")\n", + "\n", + "Audio(recovered_audio, rate=24000)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Cross-language generation" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "predict semantic shape torch.Size([1, 644])\n" + ] + }, + { + "data": { + "text/html": [ + "\n", + " \n", + " " + ], + "text/plain": [ + "" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "prompt_wav_path = \"./models/tts/maskgct/wav/prompt.wav\"\n", + "prompt_text = \" We do not break. We never give in. We never back down.\"\n", + "target_text = \"在本文中,我们介绍了 MaskGCT,这是一种完全非自回归 TTS 模型,它不需要文本和语音监督之间的明确对齐信息。\"\n", + "\n", + "# Specify the target duration (in seconds). 
  ],
@@ -306,7 +533,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.9.2"
+   "version": "3.10.15"
   }
  },
  "nbformat": 4,
diff --git a/models/tts/maskgct/maskgct_inference.py b/models/tts/maskgct/maskgct_inference.py
index 631ad2ce..d990f147 100644
--- a/models/tts/maskgct/maskgct_inference.py
+++ b/models/tts/maskgct/maskgct_inference.py
@@ -65,7 +65,7 @@
 
     # inference
     prompt_wav_path = "./models/tts/maskgct/wav/prompt.wav"
-    save_path = "[YOUR SAVE PATH]"
+    save_path = "generated_audio.wav"
     prompt_text = " We do not break. We never give in. We never back down."
     target_text = "In this paper, we introduce MaskGCT, a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision."
     # Specify the target duration (in seconds). If target_len = None, we use a simple rule to predict the target duration.
diff --git a/models/tts/maskgct/maskgct_utils.py b/models/tts/maskgct/maskgct_utils.py
index 35217c4c..75690a25 100644
--- a/models/tts/maskgct/maskgct_utils.py
+++ b/models/tts/maskgct/maskgct_utils.py
@@ -8,12 +8,6 @@
 import torch.nn.functional as F
 import numpy as np
 import librosa
-import os
-import pickle
-import math
-import json
-import accelerate
-import safetensors
 
 from utils.util import load_config
 from tqdm import tqdm
diff --git a/models/tts/maskgct/requirements.txt b/models/tts/maskgct/requirements.txt
new file mode 100644
index 00000000..9db71fb8
--- /dev/null
+++ b/models/tts/maskgct/requirements.txt
@@ -0,0 +1,25 @@
+setuptools
+onnxruntime
+torch==2.0.1
+transformers==4.41.1
+tensorboard
+tensorboardX
+accelerate==0.31.0
+unidecode
+numpy==1.23.5
+
+librosa
+encodec
+phonemizer
+g2p_en
+jieba
+cn2an
+pypinyin
+LangSegment
+pyopenjtalk
+pykakasi
+
+json5
+black==24.1.1
+ruamel.yaml
+tqdm