Merge branch 'vc_noro' of https://github.com/kenxxxxx/Amphion into vc_noro
open-mmlab · Jul 18, 2024 · 099052a
2 parents 070df5b + 4a484fb
commit 099052a
Showing 2 changed files with 109 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -28,6 +28,7 @@
 In addition to the specific generation tasks, Amphion includes several **vocoders** and **evaluation metrics**. A vocoder is an important module for producing high-quality audio signals, while evaluation metrics are critical for ensuring consistent metrics in generation tasks. Moreover, Amphion is dedicated to advancing audio generation in real-world applications, such as building **large-scale datasets** for speech synthesis.
 
 ## 🚀 News
+- **2024/07/17**: Amphion releases the Noro model, a noise-robust one-shot voice conversion system. The Noro model significantly improves performance in noisy environments by introducing a dual-branch reference encoding module and a noise-agnostic contrastive speaker loss. It performs well in both clean and noisy conditions, making it suitable for real-world applications. Additionally, we explore the potential of the reference encoder as a self-supervised speaker encoder, showing competitive performance in speaker representation tasks. [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](egs/vc/noro/README.md)
 - **2024/07/01**: Amphion now releases **Emilia**, the first open-source multilingual in-the-wild dataset for speech generation with over 101k hours of speech data, and the **Emilia-Pipe**, the first open-source preprocessing pipeline designed to transform in-the-wild speech data into high-quality training data with annotations for speech generation! [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2407.05361) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Dataset-yellow)](https://huggingface.co/datasets/amphion/Emilia) [![demo](https://img.shields.io/badge/WebPage-Demo-red)](https://emilia-dataset.github.io/Emilia-Demo-Page/) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](preprocessors/Emilia/README.md)
 - **2024/06/17**: Amphion has a new release for its **VALL-E** model! It uses Llama as its underlying architecture and has better model performance, faster training speed, and more readable code compared to our first version. [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](egs/tts/VALLE_V2/README.md)
 - **2024/03/12**: Amphion now supports **NaturalSpeech3 FACodec** and releases pretrained checkpoints. [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2403.03100) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-model-yellow)](https://huggingface.co/amphion/naturalspeech3_facodec) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-demo-pink)](https://huggingface.co/spaces/amphion/naturalspeech3_facodec) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](models/codec/ns3_codec/README.md)

diff --git a/egs/vc/README.md b/egs/vc/README.md
@@ -0,0 +1,108 @@
+# Noro: A Noise-Robust One-shot Voice Conversion System
+
+## Project Overview
+Noro is a noise-robust one-shot voice conversion (VC) system designed to convert the timbre of speech from a source speaker to a target speaker using only a single reference speech sample, while preserving the semantic content of the original speech. Noro introduces components tailored for VC with noisy reference speech, including a dual-branch reference encoding module and a noise-agnostic contrastive speaker loss.
+
+## Features
+- **Noise-Robust Voice Conversion**: Utilizes a dual-branch reference encoding module and noise-agnostic contrastive speaker loss to maintain high-quality voice conversion in noisy environments.
+- **One-shot Voice Conversion**: Achieves timbre conversion using only one reference speech sample.
+- **Speaker Representation Learning**: Explores the potential of the reference encoder as a self-supervised speaker encoder.
+
+## Installation Requirements
+
+Set up your environment as described in the Amphion README (you'll need a conda environment, and we recommend using Linux).
+
+### Prepare the HuBERT Model
+
+The HuBERT checkpoint and k-means model can be downloaded [here](https://github.com/facebookresearch/fairseq/tree/main/examples/hubert).
+Set the downloaded model path in `models/vc/hubert_kmeans.py`.
+
+
+## Usage
+
+### Download pretrained weights
+You need to download our pretrained weights from [Google Drive](https://drive.google.com/drive/folders/1NPzSIuSKO8o87g5ImNzpw_BgbhsZaxNg?usp=drive_link). 
+
+### Inference
+1. Configure inference parameters:
+    Modify the pretrained checkpoint path, source voice path, and reference voice path in the `egs/vc/noro_inference.sh` file.
+    They are currently at line 35.
+```
+    checkpoint_path="path/to/checkpoint/model.safetensors"
+    output_dir="path/to/output/directory"
+    source_path="path/to/source/audio.wav"
+    reference_path="path/to/reference/audio.wav"
+```
+2. Start inference:
+    ```bash
+    bash path/to/Amphion/egs/vc/noro_inference.sh
+    ```
+
+3. The reconstructed mel spectrogram is saved to the output directory.
+   Then use [BigVGAN](https://github.com/NVIDIA/BigVGAN) to synthesize the wav file from it.
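
Before launching the inference script, it can help to confirm that the files configured in step 1 actually exist. The following is only a minimal sketch: the helper name `check_inference_paths` is hypothetical and not part of Amphion, and the paths are the placeholders from the snippet above.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Amphion): check that the checkpoint,
# source audio, and reference audio exist before running noro_inference.sh.
check_inference_paths() {
  local f missing=0
  for f in "$@"; do
    if [[ ! -f "$f" ]]; then
      echo "missing file: $f" >&2
      missing=1
    fi
  done
  # Returns 0 only if every path given is an existing regular file.
  return "$missing"
}

# Example usage with the placeholder paths from step 1:
#   check_inference_paths \
#     "path/to/checkpoint/model.safetensors" \
#     "path/to/source/audio.wav" \
#     "path/to/reference/audio.wav" \
#     && bash path/to/Amphion/egs/vc/noro_inference.sh
```

The check is kept separate from the script itself so that `noro_inference.sh` stays unmodified apart from the path settings.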
+
+## Training from Scratch
+
+### Data Preparation
+
+We use the LibriLight dataset for training and evaluation. You can download it using the following commands:
+```bash
+    wget https://dl.fbaipublicfiles.com/librilight/data/large.tar
+    wget https://dl.fbaipublicfiles.com/librilight/data/medium.tar
+    wget https://dl.fbaipublicfiles.com/librilight/data/small.tar
+```
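
The archives then need to be unpacked before training. As a rough sketch, the download and extraction steps could be wrapped as below; `data_root`, `DRY_RUN`, `run`, and `fetch_librilight` are hypothetical names introduced for illustration, and by default the script only prints the commands it would run.

```bash
#!/usr/bin/env bash
# Sketch under assumptions: data_root and DRY_RUN are hypothetical
# variables, not Amphion settings.
data_root="${data_root:-path/to/librilight}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
  if [[ "$DRY_RUN" == "1" ]]; then
    echo "$*"
  else
    "$@"
  fi
}

fetch_librilight() {
  local subset
  for subset in "$@"; do
    # -c resumes a partial download; -P sets the download directory.
    run wget -c -P "$data_root" \
      "https://dl.fbaipublicfiles.com/librilight/data/${subset}.tar"
    # Unpack the archive next to where it was downloaded.
    run tar -xf "${data_root}/${subset}.tar" -C "$data_root"
  done
}

# Start with the smallest subset; add "medium" and "large" as needed.
fetch_librilight small
```

Set `DRY_RUN=0` once the printed commands look right for your storage layout.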
+
+### Training the Model with Clean Reference Voice
+
+Configure training parameters:
+Our configuration file for training the clean Noro model is at `egs/vc/exp_config_4gpu_clean.json`, and the one for the noisy Noro model is at `egs/vc/exp_config_4gpu_noisy.json`.
+
+To train your model, you need to modify the dataset settings in the JSON configuration.
+They are currently at line 40; set the entries of "directory_list" to your training data directories.
+```
+    "directory_list": [
+      "path/to/your/training_data_directory1",
+      "path/to/your/training_data_directory2",
+      "path/to/your/training_data_directory3"
+    ],
+```
+You should also select a reasonable batch size at the "batch_size" entry (currently it's set at 8).
+
+You can change other experiment settings, such as the learning rate and optimizer, in the same configuration file.
+
+  **Set a smaller batch_size if you run out of memory😢😢**
+
+I used `max_tokens = 3200000` to run successfully on a single card; if you run out of memory, try a smaller value.
+
+```json
+    "batch_size": 8,
+    "max_tokens": 3200000,
+    "max_sentences": 64,
+```
+### Resume from an existing checkpoint
+Our framework supports resuming from an existing checkpoint.
+If this is a new experiment, use the following command:
+```
+CUDA_VISIBLE_DEVICES=$gpu accelerate launch --main_process_port 26667 --mixed_precision fp16 \
+"${work_dir}/bins/vc/train.py" \
+    --config $exp_config \
+    --exp_name $exp_name \
+    --log_level debug
+```
+To resume training or fine-tune from a checkpoint, use the same command and
+ensure the options `--resume`, `--resume_type resume`, and `--checkpoint_path` are set.
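
The README names the resume options but does not show them in a full command. A sketch of how they might be appended to the launch command above follows. It assumes `--resume` takes no value and `--checkpoint_path` takes a path; all variable values are hypothetical placeholders. The command is only printed, not executed.

```bash
#!/usr/bin/env bash
# Placeholder values (assumptions); adjust them to your setup.
gpu="0"
work_dir="path/to/Amphion"
exp_config="${work_dir}/egs/vc/exp_config_4gpu_clean.json"
exp_name="my_noro_experiment"                # hypothetical experiment name
checkpoint_path="path/to/saved/checkpoint"   # hypothetical checkpoint location

# Same launch command as for a new experiment, plus the three resume options.
resume_cmd=(
  accelerate launch --main_process_port 26667 --mixed_precision fp16
  "${work_dir}/bins/vc/train.py"
  --config "$exp_config"
  --exp_name "$exp_name"
  --log_level debug
  --resume
  --resume_type resume
  --checkpoint_path "$checkpoint_path"
)

# Print the command for review. To actually run it:
#   CUDA_VISIBLE_DEVICES="$gpu" "${resume_cmd[@]}"
echo "CUDA_VISIBLE_DEVICES=$gpu ${resume_cmd[*]}"
```

Building the command as an array keeps paths with spaces intact when it is eventually executed.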
+
+### Run the Training Command
+Start clean training:
+```bash
+bash path/to/Amphion/egs/vc/noro_train_clean.sh
+```
+
+Start noisy training:
+```bash
+bash path/to/Amphion/egs/vc/noro_train_noisy.sh
+```
+
+
+