This is the official implementation of our ASRU 2025 paper "Conan: A Chunkwise Online Network for Zero-Shot Adaptive Voice Conversion".
Read the Paper (arXiv) | Demo Page
Zero-shot online voice conversion (VC) holds significant promise for real-time communications and entertainment. However, current VC models struggle to preserve semantic fidelity under real-time constraints, deliver natural-sounding conversions, and adapt effectively to unseen speaker characteristics. To address these challenges, we introduce Conan, a chunkwise online zero-shot voice conversion model that preserves the content of the source while matching the speaker representation of reference speech. Conan comprises three core components:
- A Stream Content Extractor that leverages Emformer for low-latency streaming content encoding;
- An Adaptive Style Encoder that extracts fine-grained stylistic features from reference speech for enhanced style adaptation;
- A Causal Shuffle Vocoder that implements a fully causal HiFi-GAN using a pixel-shuffle mechanism.

Experimental evaluations demonstrate that Conan outperforms baseline models in both subjective and objective metrics.
- Streaming Voice Conversion: Real-time voice conversion with low latency (~80ms)
- Emformer Integration: Efficient transformer-based content encoding
- High-Quality Vocoding: Pixel-shuffle causal HiFi-GAN vocoder for natural-sounding audio output
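As a rough illustration of the pixel-shuffle upsampling idea used by the vocoder, the 1-D sub-pixel rearrangement can be sketched in NumPy (this is an illustrative sketch, not the repo's vocoder code; the helper name and shapes are hypothetical):

```python
import numpy as np

def pixel_shuffle_1d(x: np.ndarray, r: int) -> np.ndarray:
    """Rearrange (batch, channels*r, time) -> (batch, channels, time*r).

    Sub-pixel ("pixel shuffle") upsampling: a convolution produces r
    interleaved sub-sequences per output channel, which are then folded
    into the time axis instead of using a transposed convolution.
    """
    b, cr, t = x.shape
    assert cr % r == 0, "channel count must be divisible by the upsample factor"
    c = cr // r
    # Split the channel axis into (c, r), then interleave r into time.
    return x.reshape(b, c, r, t).transpose(0, 1, 3, 2).reshape(b, c, t * r)

x = np.arange(12).reshape(1, 4, 3)   # 4 channels, 3 time steps, r=2
y = pixel_shuffle_1d(x, r=2)
print(y.shape)  # (1, 2, 6)
```

Because the rearrangement is a pure reshape of channels into time, it introduces no look-ahead, which is what keeps the vocoder fully causal.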
Our inference workflow is shown in the following figure. We first feed the entire reference speech into the model to provide timbre and stylistic information. During chunkwise online inference, we wait until the input reaches a predefined chunk size before passing it to the model. Because our generation speed for each chunk is faster than the chunk's duration, online generation becomes possible. To ensure temporal continuity, we employ a sliding context window strategy: at each generation step, we input not only the source speech of the current chunk but also the preceding context, and from the model's output we extract only the segment for this chunk. As the context covers the receptive field, consistent overlapping segments can be generated, ensuring smooth transitions at chunk boundaries.
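The sliding-context procedure can be sketched with a stand-in causal model (here a simple FIR filter; the model, chunk size, and receptive field are illustrative assumptions, not Conan's actual components or values):

```python
import numpy as np

def causal_model(x: np.ndarray, R: int = 16) -> np.ndarray:
    """Stand-in for the converter: a causal filter with receptive field R."""
    kernel = np.ones(R) / R
    # Truncating the full convolution to len(x) keeps the output strictly causal.
    return np.convolve(x, kernel)[: len(x)]

def stream_convert(x: np.ndarray, chunk: int, R: int = 16) -> np.ndarray:
    """Chunkwise inference with a sliding context window.

    Each step feeds the current chunk plus the preceding R-1 samples of
    context, then keeps only the segment belonging to the current chunk.
    """
    out = []
    for start in range(0, len(x), chunk):
        ctx_start = max(0, start - (R - 1))           # preceding context
        seg = causal_model(x[ctx_start : start + chunk], R)
        out.append(seg[start - ctx_start :])          # keep current chunk only
    return np.concatenate(out)

x = np.random.default_rng(0).standard_normal(1000)
offline = causal_model(x)
online = stream_convert(x, chunk=80)
print(np.allclose(offline, online))  # True: chunk boundaries are seamless
```

Because the context spans the model's receptive field, the chunkwise outputs match offline processing exactly, which is what guarantees smooth transitions at chunk boundaries.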
- Python 3.10+
- Clone the repository:

```shell
git clone https://github.com/User-tian/Conan.git
cd Conan
```

- Create a virtual environment:

```shell
conda create -n conan python=3.10
conda activate conan
```

- Install dependencies:

```shell
pip install -r requirements.txt
```

You only need to prepare the metadata.json file in the data/processed/ directory.
```
data/
└── processed/
    ├── metadata.json
    └── spker_set.json
```
There is an example "example_metadata.json" file in the data/processed/vc/ directory.
The metadata.json file should contain entries like:
```
[
  {
    "item_name": "speaker1_audio1",
    "wav_fn": "data/raw/speaker1/audio1.wav", // Path to the raw audio file
    "spk_embed": "0.1 0.2 0.3 ...", // Speaker embedding vector
    "duration": 3.5, // Duration in seconds
    "hubert": "12 34 56 ..." // HuBERT features as space-separated string
  }
]
```

- Extract F0 features using RMVPE (needed only for main model training):
```shell
export PYTHONPATH=/storage/baotong/workspace/Conan:$PYTHONPATH  # (optional) set PYTHONPATH so imports resolve
python trials/extract_f0_rmvpe.py \
    --config egs/conan.yaml \
    --batch-size 80 \
    --save-dir /path/to/audio
```

F0 files are saved to a sibling folder of the audio folder, for example:
```
├── audio/
│   ├── p225_001.wav
│   └── ...
└── audio_f0/
    ├── p225_001.npy
    └── ...
```
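Assuming the layout above, each utterance's F0 contour can be loaded back by mapping the wav path to its sibling _f0 folder (a hypothetical sketch with made-up F0 values; 0.0 conventionally marks unvoiced frames):

```python
import tempfile
from pathlib import Path

import numpy as np

# Recreate the layout above with a dummy F0 file (values are made up).
root = Path(tempfile.mkdtemp())
(root / "audio").mkdir()
(root / "audio_f0").mkdir()
np.save(root / "audio_f0" / "p225_001.npy",
        np.array([110.0, 112.5, 0.0, 115.2]))  # 0.0 = unvoiced frame

# Map an audio path to its F0 path: "<audio>/x.wav" -> "<audio>_f0/x.npy".
wav = root / "audio" / "p225_001.wav"
f0 = np.load(root / f"{wav.parent.name}_f0" / f"{wav.stem}.npy")
print(f0.shape)  # (4,)
```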
- Binarize the dataset:
```shell
python data_gen/tts/runs/binarize.py --config egs/conan.yaml
```

(You can use this config for binarization in all three training stages.)
Update the configuration files in egs/ directory to match your dataset:
- egs/conan_emformer.yaml: main training configuration
- egs/emformer.yaml: Emformer training configuration
- egs/hifi_16k320_shuffle.yaml: vocoder training configuration
Key parameters to adjust:
```yaml
# Dataset paths
binary_data_dir: 'data/binary/vc'
processed_data_dir: 'data/processed/vc'
```

We first prepare the data and HuBERT tokens from the s3prl package using s3prl.nn.S3PRLUpstream("hubert").
```shell
CUDA_VISIBLE_DEVICES=0 python tasks/run.py \
    --config egs/emformer.yaml \
    --exp_name emformer_training \
    --reset
```

We then fix the Emformer and vocoder components, and prepare the hubert entries of the data by applying the trained Emformer over the datasets (extracted chunk-wise).
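The chunk-wise extraction can be pictured as segmenting a feature sequence into fixed-size chunks, each with a few look-ahead (right-context) frames, as in this rough sketch (the chunk and right-context sizes here are made up; the actual values come from the egs configs, and the fast system sets right_context to 0):

```python
import numpy as np

def emformer_segments(feats: np.ndarray, chunk: int, right_context: int):
    """Split a (T, D) feature sequence into Emformer-style segments.

    Each segment is the current chunk plus `right_context` look-ahead
    frames (illustrative only; not the repo's actual segmentation code).
    """
    segments = []
    for start in range(0, len(feats), chunk):
        end = min(start + chunk + right_context, len(feats))
        segments.append((start, feats[start:end]))
    return segments

feats = np.zeros((10, 4))  # 10 frames, 4-dim features
segs = emformer_segments(feats, chunk=4, right_context=2)
print([(s, x.shape[0]) for s, x in segs])  # [(0, 6), (4, 6), (8, 2)]
```

With right_context set to 0, each segment reduces to a plain chunk, trading a little accuracy for lower latency.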
```shell
CUDA_VISIBLE_DEVICES=0 python tasks/run.py \
    --config egs/conan_emformer.yaml \
    --exp_name conan_training \
    --reset
```

Train the vocoder:

```shell
CUDA_VISIBLE_DEVICES=0 python tasks/run.py \
    --config egs/hifi_16k320_shuffle.yaml \
    --exp_name hifigan_training \
    --reset
```

Run inference:

```shell
CUDA_VISIBLE_DEVICES=0 python inference/Conan.py \
    --config egs/conan_emformer.yaml \
    --exp_name conan
```

Use the exp_name that contains the trained main-model checkpoints, and update your config with the trained Emformer and HiFi-GAN checkpoints.
You can download pre-trained model checkpoints from Google Drive.
Main system checkpoint folders: Emformer, Conan, hifigan_vc
Fast system checkpoint folders: Emformer_fast, Conan_fast, hifigan_vc (you may need to change the "right_context" in the config file to 0 instead of 2)
Note: Since the Emformer training branch was originally developed on another codebase, we provide a separate inference script for it: inference/Conan_previous.py.
```
Conan/
├── modules/                    # Core model implementations
│   ├── Conan/                  # Main Conan model
│   ├── Emformer/               # Emformer feature extractor
│   ├── vocoder/                # HiFi-GAN vocoder
│   └── ...
├── tasks/                      # Training and evaluation tasks
│   ├── Conan/                  # Conan training task
│   └── ...
├── inference/                  # Inference scripts
│   ├── Conan.py                # Main inference script
│   ├── run_voice_conversion.py
│   └── ...
├── data_gen/                   # Data preprocessing
│   ├── conan_binarizer.py      # Data binarization
│   └── ...
├── egs/                        # Configuration files
│   ├── conan.yaml              # Main training config
│   ├── emformer.yaml           # Emformer config
│   └── ...
├── utils/                      # Utility functions
└── checkpoints/                # Model checkpoints
```
The Conan system achieves state-of-the-art performance on voice conversion tasks:
- Latency: ~80ms streaming latency (37ms latency for fast system)
- Quality: High-quality voice conversion with natural prosody
- Robustness: Robust to different speaking styles and content
If you use Conan in your research, please cite our work:
```
@article{zhang2025conan,
  title={Conan: A Chunkwise Online Network for Zero-Shot Adaptive Voice Conversion},
  author={Zhang, Yu and Tian, Baotong and Duan, Zhiyao},
  journal={arXiv preprint arXiv:2507.14534},
  year={2025}
}
```

This project is licensed under the MIT License; see the LICENSE file for details.
- FastSpeech2 for the codebase and base TTS architectures
- HiFi-GAN for the neural vocoder
- Emformer for efficient transformer implementation
