Conan: A Chunkwise Online Network for Zero-Shot Adaptive Voice Conversion


This is the official implementation of our ASRU 2025 paper "Conan: A Chunkwise Online Network for Zero-Shot Adaptive Voice Conversion".

📄 Read the Paper (arXiv)  |  🎧 Demo Page

Architecture

Zero-shot online voice conversion (VC) holds significant promise for real-time communications and entertainment. However, current VC models struggle to preserve semantic fidelity under real-time constraints, deliver natural-sounding conversions, and adapt effectively to unseen speaker characteristics. To address these challenges, we introduce Conan, a chunkwise online zero-shot voice conversion model that preserves the content of the source while matching the speaker representation of reference speech. Conan comprises three core components:

  1. A Stream Content Extractor that leverages Emformer for low-latency streaming content encoding;
  2. An Adaptive Style Encoder that extracts fine-grained stylistic features from reference speech for enhanced style adaptation;
  3. A Causal Shuffle Vocoder that implements a fully causal HiFi-GAN using a pixel-shuffle mechanism.

Experimental evaluations demonstrate that Conan outperforms baseline models in both subjective and objective metrics.

🌟 Features

  • Streaming Voice Conversion: Real-time voice conversion with low latency (~80ms)
  • Emformer Integration: Efficient transformer-based content encoding
  • High-Quality Vocoding: Pixel-shuffle causal HiFi-GAN vocoder for natural-sounding audio output
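The sub-pixel (pixel-shuffle) rearrangement behind the vocoder's upsampling can be illustrated with a minimal NumPy sketch. This shows the general 1-D sub-pixel idea only, not the exact layer used in this repo: a tensor of shape (C*r, T) is rearranged into (C, T*r), trading channel depth for temporal resolution without transposed convolutions.

```python
import numpy as np

def pixel_shuffle_1d(x: np.ndarray, r: int) -> np.ndarray:
    """Rearrange (C*r, T) -> (C, T*r), the 1-D analogue of sub-pixel upsampling."""
    cr, t = x.shape
    assert cr % r == 0, "channel dim must be divisible by the upscale factor"
    c = cr // r
    # split channels into (C, r), then interleave the r sub-channels along time
    return x.reshape(c, r, t).transpose(0, 2, 1).reshape(c, t * r)

x = np.arange(8).reshape(4, 2)   # C*r = 4 channels, T = 2 frames, r = 2
y = pixel_shuffle_1d(x, 2)
print(y.shape)  # (2, 4)
```

Because the rearrangement is a pure reshape, it stays causal as long as the convolutions producing the channels are causal.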

Workflow

Our workflow (inference procedure) is shown in the figure below. We first feed the entire reference speech into the model to provide timbre and stylistic information. During chunkwise online inference, we wait until the input reaches a predefined chunk size before passing it to the model. Because each chunk is generated faster than the chunk's duration, online generation becomes possible. To ensure temporal continuity, we employ a sliding context window strategy: at each generation step, we input not only the source speech of the current chunk but also the preceding context, and from the model's output we keep only the segment for the current chunk. Since the context covers the receptive field, consistent overlapping segments are generated, ensuring smooth transitions at chunk boundaries.
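The sliding-context scheduling described above can be sketched in plain Python. This is a toy illustration over sample indices; `chunk` and `context` are hypothetical values, not the repo's actual configuration:

```python
def chunk_stream(num_samples: int, chunk: int, context: int):
    """Yield (input_start, input_end, out_start, out_end) per chunk.

    Each model call sees [input_start, input_end); only [out_start, out_end)
    of its output is kept, so the overlapping context region smooths
    transitions at chunk boundaries."""
    for out_start in range(0, num_samples, chunk):
        out_end = min(out_start + chunk, num_samples)
        input_start = max(0, out_start - context)  # prepend past context
        yield input_start, out_end, out_start, out_end

for win in chunk_stream(10, 4, 2):
    print(win)
# (0, 4, 0, 4)
# (2, 8, 4, 8)
# (6, 10, 8, 10)
```

Note how every call after the first re-feeds `context` samples of already-seen input but discards the corresponding output, which is what makes the overlapping segments consistent.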

📋 Requirements

System Requirements

  • Python 3.10+

🚀 Installation

  1. Clone the repository:
git clone https://github.com/User-tian/Conan.git
cd Conan
  2. Create a virtual environment:
conda create -n conan python=3.10
conda activate conan
  3. Install dependencies:
pip install -r requirements.txt

📊 Data Preparation

Dataset Structure

You only need to prepare the metadata.json file in the data/processed/ directory.

data/
└── processed/
    ├── metadata.json
    └── spker_set.json

Metadata Format

There is an example file, example_metadata.json, in the data/processed/vc/ directory. The metadata.json file should contain entries like the following (the // annotations are explanatory only and must not appear in real JSON):

[
  {
    "item_name": "speaker1_audio1",
    "wav_fn": "data/raw/speaker1/audio1.wav", // Path to the raw audio file
    "spk_embed": "0.1 0.2 0.3 ...", // Speaker embedding vector
    "duration": 3.5, // Duration in seconds
    "hubert": "12 34 56 ..." // HuBERT features as space-separated string
  }
]
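A minimal loader for this format might look like the following sketch. The field names follow the example above; `load_metadata` is a hypothetical helper, not part of the repo:

```python
import json

def load_metadata(path: str) -> list[dict]:
    """Parse metadata.json, converting the space-separated string fields
    (speaker embedding, HuBERT tokens) into numeric lists."""
    with open(path, encoding="utf-8") as f:
        entries = json.load(f)
    for e in entries:
        e["spk_embed"] = [float(v) for v in e["spk_embed"].split()]
        e["hubert"] = [int(v) for v in e["hubert"].split()]
    return entries
```

Validating the fields up front like this catches malformed entries before binarization rather than mid-training.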

Data Preprocessing Steps

  1. Extract F0 features using RMVPE (needed only for main model training):
export PYTHONPATH=/storage/baotong/workspace/Conan:$PYTHONPATH # (optional) you may need to set the PYTHONPATH for import dependencies
python trials/extract_f0_rmvpe.py \
    --config  egs/conan.yaml \
    --batch-size 80 \
    --save-dir /path/to/audio  

F0 files are saved to a sibling folder of the audio folder, for example:

└── audio/
    ├── p225_001.wav
    ├── ...
└── audio_f0/
    ├── p225_001.npy
    ├── ...
  2. Binarize the dataset:
python data_gen/tts/runs/binarize.py --config egs/conan.yaml

(This config can be used for binarization in all three training stages.)
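The sibling-folder convention from the F0 extraction step can be expressed as a small path helper. This is a convenience sketch; the repo's scripts may construct the paths differently:

```python
from pathlib import Path

def f0_path_for(wav_path: str) -> Path:
    """Map audio/p225_001.wav -> audio_f0/p225_001.npy (sibling folder)."""
    wav = Path(wav_path)
    f0_dir = wav.parent.with_name(wav.parent.name + "_f0")  # audio -> audio_f0
    return f0_dir / (wav.stem + ".npy")

print(f0_path_for("data/audio/p225_001.wav"))
```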

Configuration

Update the configuration files in egs/ directory to match your dataset:

  • egs/conan_emformer.yaml: Main training configuration
  • egs/emformer.yaml: Emformer training configuration
  • egs/hifi_16k320_shuffle.yaml: Vocoder training configuration

Key parameters to adjust:

# Dataset paths
binary_data_dir: 'data/binary/vc'
processed_data_dir: 'data/processed/vc'

🎯 Training

Stage 1: Train Emformer

We first prepare the data and HuBERT tokens using the s3prl package via s3prl.nn.S3PRLUpstream("hubert").

CUDA_VISIBLE_DEVICES=0 python tasks/run.py \
    --config egs/emformer.yaml \
    --exp_name emformer_training \
    --reset

Stage 2: Train Main Conan Model

We freeze the Emformer and Vocoder components, and prepare the hubert entries of the data by running the trained Emformer over the datasets chunk-wise.

CUDA_VISIBLE_DEVICES=0 python tasks/run.py \
    --config egs/conan_emformer.yaml \
    --exp_name conan_training \
    --reset

Stage 3: Train HiFi-GAN Vocoder

CUDA_VISIBLE_DEVICES=0 python tasks/run.py \
    --config egs/hifi_16k320_shuffle.yaml \
    --exp_name hifigan_training \
    --reset

🔮 Inference

Streaming Voice Conversion

CUDA_VISIBLE_DEVICES=0 python inference/Conan.py \
    --config egs/conan_emformer.yaml \
    --exp_name conan

Use the exp_name that contains the trained main-model checkpoints, and update your config with the paths to the trained Emformer and HiFi-GAN checkpoints.

Checkpoints

You can download pre-trained model checkpoints from Google Drive.

Main system checkpoint folders: Emformer, Conan, hifigan_vc

Fast system checkpoint folders: Emformer_fast, Conan_fast, hifigan_vc (you may need to change "right_context" in the config file from 2 to 0)

Note: As we previously developed the Emformer training branch in another codebase, we provide a separate inference script for it: inference/Conan_previous.py.

๐Ÿ“ Project Structure

Conan/
├── modules/                   # Core model implementations
│   ├── Conan/                 # Main Conan model
│   ├── Emformer/              # Emformer feature extractor
│   ├── vocoder/               # HiFi-GAN vocoder
│   └── ...
├── tasks/                     # Training and evaluation tasks
│   ├── Conan/                 # Conan training task
│   └── ...
├── inference/                 # Inference scripts
│   ├── Conan.py               # Main inference script
│   ├── run_voice_conversion.py
│   └── ...
├── data_gen/                  # Data preprocessing
│   ├── conan_binarizer.py     # Data binarization
│   └── ...
├── egs/                       # Configuration files
│   ├── conan.yaml             # Main training config
│   ├── emformer.yaml          # Emformer config
│   └── ...
├── utils/                     # Utility functions
└── checkpoints/               # Model checkpoints

📈 Performance

The Conan system achieves state-of-the-art performance on voice conversion tasks:

  • Latency: ~80ms streaming latency (37ms latency for fast system)
  • Quality: High-quality voice conversion with natural prosody
  • Robustness: Robust to different speaking styles and content
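The streaming-feasibility condition implied by the workflow (each chunk must be synthesized faster than it lasts) reduces to a real-time-factor check. The numbers below are illustrative assumptions, not measured figures from the paper:

```python
def is_realtime(chunk_ms: float, gen_ms_per_chunk: float) -> bool:
    """Streaming works when each chunk is synthesized faster than its
    playback duration, i.e. the real-time factor is below 1."""
    return gen_ms_per_chunk / chunk_ms < 1.0

# e.g. an 80 ms chunk synthesized in (an assumed) 30 ms keeps the stream
# ahead of playback, while 90 ms would stall it
print(is_realtime(80.0, 30.0))  # True
print(is_realtime(80.0, 90.0))  # False
```

Under this model the user-perceived latency is dominated by waiting to buffer a full chunk, which is consistent with the ~80 ms figure above.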

📄 Citation

If you use Conan in your research, please cite our work:

@article{zhang2025conan,
  title={Conan: A Chunkwise Online Network for Zero-Shot Adaptive Voice Conversion},
  author={Zhang, Yu and Tian, Baotong and Duan, Zhiyao},
  journal={arXiv preprint arXiv:2507.14534},
  year={2025}
}

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgements

  • FastSpeech2 for the codebase and base TTS architectures
  • HiFi-GAN for the neural vocoder
  • Emformer for efficient transformer implementation
