This is the official implementation of our ASRU 2025 paper "Conan: A Chunkwise Online Network for Zero-Shot Adaptive Voice Conversion".
Read the Paper (arXiv) | Demo Page
Zero-shot online voice conversion (VC) holds significant promise for real-time communications and entertainment. However, current VC models struggle to preserve semantic fidelity under real-time constraints, deliver natural-sounding conversions, and adapt effectively to unseen speaker characteristics. To address these challenges, we introduce Conan, a chunkwise online zero-shot voice conversion model that preserves the content of the source while matching the speaker representation of reference speech. Conan comprises three core components:
- A Stream Content Extractor that leverages Emformer for low-latency streaming content encoding;
- An Adaptive Style Encoder that extracts fine-grained stylistic features from reference speech for enhanced style adaptation;
- A Causal Shuffle Vocoder that implements a fully causal HiFi-GAN using a pixel-shuffle mechanism.

Experimental evaluations demonstrate that Conan outperforms baseline models in both subjective and objective metrics.
- Streaming Voice Conversion: Real-time voice conversion with low latency (~80ms)
- Emformer Integration: Efficient transformer-based content encoding
- High-Quality Vocoding: Pixel-shuffle causal HiFi-GAN vocoder for natural-sounding audio output
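As a rough illustration of the pixel-shuffle upsampling idea used by the vocoder, the 1-D sub-pixel rearrangement can be sketched in NumPy (this is an illustrative sketch, not the repo's vocoder code; the helper name and shapes are hypothetical):

```python
import numpy as np

def pixel_shuffle_1d(x: np.ndarray, r: int) -> np.ndarray:
    """Rearrange (batch, channels*r, time) -> (batch, channels, time*r).

    Sub-pixel ("pixel shuffle") upsampling: a convolution produces r
    interleaved sub-sequences per output channel, which are then folded
    into the time axis instead of using a transposed convolution.
    """
    b, cr, t = x.shape
    assert cr % r == 0, "channel count must be divisible by the upsample factor"
    c = cr // r
    # Split the channel axis into (c, r), then interleave r into time.
    return x.reshape(b, c, r, t).transpose(0, 1, 3, 2).reshape(b, c, t * r)

x = np.arange(12).reshape(1, 4, 3)   # 4 channels, 3 time steps, r=2
y = pixel_shuffle_1d(x, r=2)
print(y.shape)  # (1, 2, 6)
```

Because the rearrangement is a pure reshape of channels into time, it introduces no look-ahead, which is what keeps the vocoder fully causal.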
Our inference workflow is shown in the following figure. We first feed the entire reference speech into the model to provide timbre and stylistic information. During chunkwise online inference, we wait until the input reaches a predefined chunk size before passing it to the model. Because our generation speed for each chunk is faster than the chunk's duration, online generation becomes possible. To ensure temporal continuity, we employ a sliding context window strategy: at each generation step, we input not only the source speech of the current chunk but also the preceding context, and from the model's output we extract only the segment for this chunk. As the context covers the receptive field, consistent overlapping segments can be generated, ensuring smooth transitions at chunk boundaries.
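The sliding-context procedure can be sketched with a stand-in causal model (here a simple FIR filter; the model, chunk size, and receptive field are illustrative assumptions, not Conan's actual components or values):

```python
import numpy as np

def causal_model(x: np.ndarray, R: int = 16) -> np.ndarray:
    """Stand-in for the converter: a causal filter with receptive field R."""
    kernel = np.ones(R) / R
    # Truncating the full convolution to len(x) keeps the output strictly causal.
    return np.convolve(x, kernel)[: len(x)]

def stream_convert(x: np.ndarray, chunk: int, R: int = 16) -> np.ndarray:
    """Chunkwise inference with a sliding context window.

    Each step feeds the current chunk plus the preceding R-1 samples of
    context, then keeps only the segment belonging to the current chunk.
    """
    out = []
    for start in range(0, len(x), chunk):
        ctx_start = max(0, start - (R - 1))           # preceding context
        seg = causal_model(x[ctx_start : start + chunk], R)
        out.append(seg[start - ctx_start :])          # keep current chunk only
    return np.concatenate(out)

x = np.random.default_rng(0).standard_normal(1000)
offline = causal_model(x)
online = stream_convert(x, chunk=80)
print(np.allclose(offline, online))  # True: chunk boundaries are seamless
```

Because the context spans the model's receptive field, the chunkwise outputs match offline processing exactly, which is what guarantees smooth transitions at chunk boundaries.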
- Python 3.10+
- Clone the repository:

```shell
git clone https://github.com/User-tian/Conan.git
cd Conan
```

- Create a virtual environment:

```shell
conda create -n conan python=3.10
conda activate conan
```

- Install dependencies:

```shell
pip install -r requirements.txt
```

You only need to prepare the metadata.json file in the data/processed/ directory.
```
data/
└── processed/
    ├── metadata.json
    └── spker_set.json
```
There is an example "example_metadata.json" file in the data/processed/vc/ directory.
The metadata.json file should contain entries like:
```
[
  {
    "item_name": "speaker1_audio1",
    "wav_fn": "data/raw/speaker1/audio1.wav", // Path to the raw audio file
    "spk_embed": "0.1 0.2 0.3 ...", // Speaker embedding vector
    "duration": 3.5, // Duration in seconds
    "hubert": "12 34 56 ..." // HuBERT features as space-separated string
  }
]
```

- Extract F0 features using RMVPE (needed only for main model training):
```shell
export PYTHONPATH=/storage/baotong/workspace/Conan:$PYTHONPATH  # (optional) set PYTHONPATH so imports resolve
python trials/extract_f0_rmvpe.py \
    --config egs/conan.yaml \
    --batch-size 80 \
    --save-dir /path/to/audio
```

F0 files are saved to a sibling folder of the audio folder, for example:
```
├── audio/
│   ├── p225_001.wav
│   └── ...
└── audio_f0/
    ├── p225_001.npy
    └── ...
```
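Assuming the layout above, each utterance's F0 contour can be loaded back by mapping the wav path to its sibling _f0 folder (a hypothetical sketch with made-up F0 values; 0.0 conventionally marks unvoiced frames):

```python
import tempfile
from pathlib import Path

import numpy as np

# Recreate the layout above with a dummy F0 file (values are made up).
root = Path(tempfile.mkdtemp())
(root / "audio").mkdir()
(root / "audio_f0").mkdir()
np.save(root / "audio_f0" / "p225_001.npy",
        np.array([110.0, 112.5, 0.0, 115.2]))  # 0.0 = unvoiced frame

# Map an audio path to its F0 path: "<audio>/x.wav" -> "<audio>_f0/x.npy".
wav = root / "audio" / "p225_001.wav"
f0 = np.load(root / f"{wav.parent.name}_f0" / f"{wav.stem}.npy")
print(f0.shape)  # (4,)
```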
- Binarize the dataset:
```shell
python data_gen/tts/runs/binarize.py --config egs/conan.yaml
```

(You can use this config for binarization in all three training stages.)
Update the configuration files in egs/ directory to match your dataset:
- egs/conan_emformer.yaml: main training configuration
- egs/emformer.yaml: Emformer training configuration
- egs/hifi_16k320_shuffle.yaml: vocoder training configuration
Key parameters to adjust:
```yaml
# Dataset paths
binary_data_dir: 'data/binary/vc'
processed_data_dir: 'data/processed/vc'
```

We first prepare the data and HuBERT tokens from the s3prl package using s3prl.nn.S3PRLUpstream("hubert").
```shell
CUDA_VISIBLE_DEVICES=0 python tasks/run.py \
    --config egs/emformer.yaml \
    --exp_name emformer_training \
    --reset
```

We then fix the Emformer and vocoder components, and prepare the hubert entries of the data by applying the trained Emformer over the datasets (extracted chunk-wise).
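The chunk-wise extraction can be pictured as segmenting a feature sequence into fixed-size chunks, each with a few look-ahead (right-context) frames, as in this rough sketch (the chunk and right-context sizes here are made up; the actual values come from the egs configs, and the fast system sets right_context to 0):

```python
import numpy as np

def emformer_segments(feats: np.ndarray, chunk: int, right_context: int):
    """Split a (T, D) feature sequence into Emformer-style segments.

    Each segment is the current chunk plus `right_context` look-ahead
    frames (illustrative only; not the repo's actual segmentation code).
    """
    segments = []
    for start in range(0, len(feats), chunk):
        end = min(start + chunk + right_context, len(feats))
        segments.append((start, feats[start:end]))
    return segments

feats = np.zeros((10, 4))  # 10 frames, 4-dim features
segs = emformer_segments(feats, chunk=4, right_context=2)
print([(s, x.shape[0]) for s, x in segs])  # [(0, 6), (4, 6), (8, 2)]
```

With right_context set to 0, each segment reduces to a plain chunk, trading a little accuracy for lower latency.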
```shell
CUDA_VISIBLE_DEVICES=0 python tasks/run.py \
    --config egs/conan_emformer.yaml \
    --exp_name conan_training \
    --reset
```

Train the vocoder:

```shell
CUDA_VISIBLE_DEVICES=0 python tasks/run.py \
    --config egs/hifi_16k320_shuffle.yaml \
    --exp_name hifigan_training \
    --reset
```

Run inference:

```shell
CUDA_VISIBLE_DEVICES=0 python inference/Conan.py \
    --config egs/conan_emformer.yaml \
    --exp_name conan
```

Use the exp_name that contains the trained main-model checkpoints, and update your config with the trained Emformer and HiFi-GAN checkpoints.
You can download pre-trained model checkpoints from Google Drive.
Main system checkpoint folders: Emformer, Conan, hifigan_vc
Fast system checkpoint folders: Emformer_fast, Conan_fast, hifigan_vc (you may need to change the "right_context" in the config file to 0 instead of 2)
Note: Since the Emformer training branch was originally developed on another codebase, we provide a separate inference script for it: inference/Conan_previous.py.
```
Conan/
├── modules/                    # Core model implementations
│   ├── Conan/                  # Main Conan model
│   ├── Emformer/               # Emformer feature extractor
│   ├── vocoder/                # HiFi-GAN vocoder
│   └── ...
├── tasks/                      # Training and evaluation tasks
│   ├── Conan/                  # Conan training task
│   └── ...
├── inference/                  # Inference scripts
│   ├── Conan.py                # Main inference script
│   ├── run_voice_conversion.py
│   └── ...
├── data_gen/                   # Data preprocessing
│   ├── conan_binarizer.py      # Data binarization
│   └── ...
├── egs/                        # Configuration files
│   ├── conan.yaml              # Main training config
│   ├── emformer.yaml           # Emformer config
│   └── ...
├── utils/                      # Utility functions
└── checkpoints/                # Model checkpoints
```
The Conan system achieves state-of-the-art performance on voice conversion tasks:
- Latency: ~80ms streaming latency (37ms latency for fast system)
- Quality: High-quality voice conversion with natural prosody
- Robustness: Robust to different speaking styles and content
If you use Conan in your research, please cite our work:
```
@article{zhang2025conan,
  title={Conan: A Chunkwise Online Network for Zero-Shot Adaptive Voice Conversion},
  author={Zhang, Yu and Tian, Baotong and Duan, Zhiyao},
  journal={arXiv preprint arXiv:2507.14534},
  year={2025}
}
```

This project is licensed under the MIT License; see the LICENSE file for details.
- FastSpeech2 for the codebase and base TTS architectures
- HiFi-GAN for the neural vocoder
- Emformer for efficient transformer implementation
