Commit

Update Info of SingVisio (#274)
Update Info of SingVisio citation, resources links, and Emilia TODOs
yuantuo666 authored Sep 23, 2024
1 parent 251c669 commit d9243b8
Showing 11 changed files with 173 additions and 17 deletions.
15 changes: 8 additions & 7 deletions README.md
@@ -4,7 +4,7 @@
<a href="https://arxiv.org/abs/2312.09911"><img src="https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg"></a>
<a href="https://huggingface.co/amphion"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Amphion-pink"></a>
<a href="https://openxlab.org.cn/usercenter/Amphion"><img src="https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg"></a>
<a href="https://discord.com/invite/ZxxREr3Y"><img src="https://img.shields.io/badge/Discord-Join%20chat-blue.svg">
<a href="https://discord.com/invite/ZxxREr3Y"><img src="https://img.shields.io/badge/Discord-Join%20chat-blue.svg"></a>
<a href="egs/tts/README.md"><img src="https://img.shields.io/badge/README-TTS-blue"></a>
<a href="egs/svc/README.md"><img src="https://img.shields.io/badge/README-SVC-blue"></a>
<a href="egs/tta/README.md"><img src="https://img.shields.io/badge/README-TTA-blue"></a>
@@ -31,11 +31,12 @@ In addition to the specific generation tasks, Amphion includes several **vocoder
## 🚀 News
- **2024/09/01**: [Amphion](https://arxiv.org/abs/2312.09911) and [Emilia](https://arxiv.org/abs/2407.05361) got accepted by IEEE SLT 2024! 🤗
- **2024/08/28**: Welcome to join Amphion's [Discord channel](https://discord.com/invite/ZxxREr3Y) to stay connected and engage with our community!
- **2024/08/20**: [SingVisio](https://arxiv.org/abs/2402.12660) got accepted by Computers & Graphics, [available here](https://www.sciencedirect.com/science/article/pii/S0097849324001936)! 🎉
- **2024/08/27**: *The Emilia dataset is now publicly available!* Discover the most extensive and diverse speech generation dataset with 101k hours of in-the-wild speech data now at [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Dataset-yellow)](https://huggingface.co/datasets/amphion/Emilia-Dataset) or [![OpenDataLab](https://img.shields.io/badge/OpenDataLab-Dataset-blue)](https://opendatalab.com/Amphion/Emilia)! 👑👑👑
- **2024/07/01**: Amphion now releases **Emilia**, the first open-source multilingual in-the-wild dataset for speech generation with over 101k hours of speech data, and the **Emilia-Pipe**, the first open-source preprocessing pipeline designed to transform in-the-wild speech data into high-quality training data with annotations for speech generation! [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2407.05361) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Dataset-yellow)](https://huggingface.co/datasets/amphion/Emilia) [![demo](https://img.shields.io/badge/WebPage-Demo-red)](https://emilia-dataset.github.io/Emilia-Demo-Page/) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](preprocessors/Emilia/README.md)
- **2024/06/17**: Amphion has a new release for its **VALL-E** model! It uses Llama as its underlying architecture and has better model performance, faster training speed, and more readable code compared to our first version. [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](egs/tts/VALLE_V2/README.md)
- **2024/03/12**: Amphion now supports **NaturalSpeech3 FACodec** and releases pretrained checkpoints. [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2403.03100) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-model-yellow)](https://huggingface.co/amphion/naturalspeech3_facodec) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-demo-pink)](https://huggingface.co/spaces/amphion/naturalspeech3_facodec) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](models/codec/ns3_codec/README.md)
- **2024/02/22**: The first Amphion visualization tool, **SingVisio**, release. [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2402.12660) [![openxlab](https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg)](https://openxlab.org.cn/apps/detail/Amphion/SingVisio) [![Video](https://img.shields.io/badge/Video-Demo-orange)](https://github.com/open-mmlab/Amphion/assets/33707885/0a6e39e8-d5f1-4288-b0f8-32da5a2d6e96) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](egs/visualization/SingVisio/README.md)
- **2024/02/22**: The first Amphion visualization tool, **SingVisio**, is released. [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2402.12660) [![openxlab](https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg)](https://openxlab.org.cn/apps/detail/Amphion/SingVisio) [![Video](https://img.shields.io/badge/Video-Demo-orange)](https://drive.google.com/file/d/15097SGhQh-SwUNbdWDYNyWEP--YGLba5/view) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](egs/visualization/SingVisio/README.md)
- **2023/12/18**: Amphion v0.1 release. [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2312.09911) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Amphion-pink)](https://huggingface.co/amphion) [![youtube](https://img.shields.io/badge/YouTube-Demo-red)](https://www.youtube.com/watch?v=1aw0HhcggvQ) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](https://github.com/open-mmlab/Amphion/pull/39)
- **2023/11/28**: Amphion alpha release. [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](https://github.com/open-mmlab/Amphion/pull/2)

@@ -87,7 +88,7 @@ Amphion provides a comprehensive objective evaluation of the generated audio. Th

Amphion provides visualization tools to interactively illustrate the internal processing mechanism of classic models. This provides an invaluable resource for educational purposes and for facilitating understandable research.

Currently, Amphion supports [SingVisio](egs/visualization/SingVisio/README.md), a visualization tool of the diffusion model for singing voice conversion. [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2402.12660) [![openxlab](https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg)](https://openxlab.org.cn/apps/detail/Amphion/SingVisio) [![Video](https://img.shields.io/badge/Video-Demo-orange)](https://github.com/open-mmlab/Amphion/assets/33707885/0a6e39e8-d5f1-4288-b0f8-32da5a2d6e96)
Currently, Amphion supports [SingVisio](egs/visualization/SingVisio/README.md), a visualization tool of the diffusion model for singing voice conversion. [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2402.12660) [![openxlab](https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg)](https://openxlab.org.cn/apps/detail/Amphion/SingVisio) [![Video](https://img.shields.io/badge/Video-Demo-orange)](https://drive.google.com/file/d/15097SGhQh-SwUNbdWDYNyWEP--YGLba5/view)


## 📀 Installation
@@ -158,9 +159,9 @@ Amphion is under the [MIT License](LICENSE). It is free for both research and co

```bibtex
@inproceedings{amphion,
author={Zhang, Xueyao and Xue, Liumeng and Gu, Yicheng and Wang, Yuancheng and Li, Jiaqi and He, Haorui and Wang, Chaoren and Song, Ting and Chen, Xi and Fang, Zihao and Chen, Haopeng and Zhang, Junan and Tang, Tze Ying and Zou, Lexiao and Wang, Mingxuan and Han, Jun and Chen, Kai and Li, Haizhou and Wu, Zhizheng},
title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit},
booktitle={Proc.~of SLT},
year={2024}
author={Zhang, Xueyao and Xue, Liumeng and Gu, Yicheng and Wang, Yuancheng and Li, Jiaqi and He, Haorui and Wang, Chaoren and Song, Ting and Chen, Xi and Fang, Zihao and Chen, Haopeng and Zhang, Junan and Tang, Tze Ying and Zou, Lexiao and Wang, Mingxuan and Han, Jun and Chen, Kai and Li, Haizhou and Wu, Zhizheng},
title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit},
booktitle={{IEEE} Spoken Language Technology Workshop, {SLT} 2024},
year={2024}
}
```
2 changes: 1 addition & 1 deletion egs/visualization/README.md
@@ -2,7 +2,7 @@

## Quick Start

We provides a **[beginner recipe](SingVisio/)** to demonstrate how to implement interactive visualization for classic audio, music and speech generative models. Specifically, it is also an official implementation of the paper "[SingVisio: SingVisio: Visual Analytics of Diffusion Model for Singing Voice Conversion](https://arxiv.org/pdf/2402.12660.pdf)". The **SingVisio** can be experienced [here](https://openxlab.org.cn/apps/detail/Amphion/SingVisio).
We provide a **[beginner recipe](SingVisio/)** to demonstrate how to implement interactive visualization for classic audio, music, and speech generative models. Specifically, it is also an official implementation of the paper "SingVisio: Visual Analytics of the Diffusion Model for Singing Voice Conversion", which can be accessed via [arXiv](https://arxiv.org/abs/2402.12660) or [Computers & Graphics](https://www.sciencedirect.com/science/article/pii/S0097849324001936). **SingVisio** can be experienced [here](https://openxlab.org.cn/apps/detail/Amphion/SingVisio).

## Supported Models

36 changes: 29 additions & 7 deletions egs/visualization/SingVisio/README.md
@@ -2,17 +2,19 @@

[![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2402.12660)
[![openxlab](https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg)](https://openxlab.org.cn/apps/detail/Amphion/SingVisio)
[![Video](https://img.shields.io/badge/Video-Demo-orange)](https://github.com/open-mmlab/Amphion/assets/33707885/0a6e39e8-d5f1-4288-b0f8-32da5a2d6e96)
[![Video](https://img.shields.io/badge/Video-Demo-orange)](https://drive.google.com/file/d/15097SGhQh-SwUNbdWDYNyWEP--YGLba5/view)

<div align="center">
<img src="../../../imgs/visualization/SingVisio_system.png" width="85%">
<img src="../../../imgs/visualization/SingVisio_system.jpg" width="85%">
</div>

This is the official implementation of the paper "[SingVisio: Visual Analytics of the Diffusion Model for Singing Voice Conversion](https://arxiv.org/abs/2402.12660)." **SingVisio** system can be experienced [here](https://openxlab.org.cn/apps/detail/Amphion/SingVisio).
This is the official implementation of the paper "SingVisio: Visual Analytics of the Diffusion Model for Singing Voice Conversion", which can be accessed via [arXiv](https://arxiv.org/abs/2402.12660) or [Computers & Graphics](https://www.sciencedirect.com/science/article/pii/S0097849324001936).

The online **SingVisio** system can be experienced [here](https://openxlab.org.cn/apps/detail/Amphion/SingVisio).

The **SingVisio** system comprises two main components: a web-based front-end user interface and a back-end generation model.

- The web-based user interface was developed using [D3.js](https://d3-graph-gallery.com/index.html), a JavaScript library designed for creating dynamic and interactive data visualizations. The code can be accessed [here](../../../visualization/SingVisio/webpage/).
- The web-based user interface was developed using [D3.js](https://d3js.org/), a JavaScript library designed for creating dynamic and interactive data visualizations. The code can be accessed [here](../../../visualization/SingVisio/webpage/).
- The core generative model, [MultipleContentsSVC](https://arxiv.org/abs/2310.11160), is a diffusion-based model tailored for singing voice conversion (SVC). The code for this model is available in Amphion, with the recipe accessible [here](../../svc/MultipleContentsSVC/).
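What the front end visualizes is the back end's sequence of intermediate diffusion states. The sketch below is a toy, self-contained illustration of that idea only — it is *not* the actual MultipleContentsSVC model, just a linear "denoising" trajectory whose intermediate states a front end such as the D3.js UI could plot step by step:

```python
import random

def toy_diffusion_trajectory(target, steps=50, seed=0):
    """Toy reverse-diffusion sketch: start from Gaussian noise and
    linearly interpolate toward `target`, recording every intermediate
    state so a visualization front end could render the evolution.
    Illustrative only -- not the MultipleContentsSVC model."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # pure noise at t = T
    trajectory = [list(x)]
    for t in range(steps):
        alpha = (t + 1) / steps  # denoising progress in (0, 1]
        x = [(1 - alpha) * xi + alpha * ti for xi, ti in zip(x, target)]
        trajectory.append(list(x))
    return trajectory  # steps + 1 states for the front end to plot

# Example: a 3-dimensional "feature" denoised over 10 steps.
traj = toy_diffusion_trajectory(target=[0.2, -0.5, 1.0], steps=10)
```

In the real system, each state would be a mel-spectrogram frame sequence rather than a short list, but the front end/back end contract — "back end emits per-step states, front end renders them" — is the same.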

## Development Workflow for Visualization Systems
@@ -57,12 +59,32 @@ The user interface of **SingVisio** comprises five views:

## Detailed System Introduction of SingVisio

For a detailed introduction to **SingVisio** and user instructions, please refer to [this online document](https://x8gvg3n7v3.feishu.cn/docx/IMhUdqIFVo0ZjaxlBf6cpjTEnvf?from=from_copylink) (with animation) or [offline document](../../../visualization/SingVisio/System_Introduction_of_SingVisio.pdf) (without animation).
For a detailed introduction to **SingVisio** and user instructions, please refer to [this document](../../../visualization/SingVisio/System_Introduction_of_SingVisio_V2.pdf).

Additionally, explore the SingVisio demo to see the system's functionalities and usage in action.

[SingVisio_Demo](https://github.com/open-mmlab/Amphion/assets/33707885/0a6e39e8-d5f1-4288-b0f8-32da5a2d6e96)

## User Study of SingVisio

If you're interested, please participate in the [user study](https://www.wjx.cn/vm/wkIH372.aspx#) of **SingVisio**. We encourage you to complete the study after experiencing the **SingVisio** system. Your valuable feedback is greatly appreciated.

## Citations 📖

Please cite the following papers if you use **SingVisio** in your research:

```bibtex
@article{singvisio,
author={Xue, Liumeng and Wang, Chaoren and Wang, Mingxuan and Zhang, Xueyao and Han, Jun and Wu, Zhizheng},
title={SingVisio: Visual Analytics of the Diffusion Model for Singing Voice Conversion},
journal={Computers \& Graphics},
year={2024}
}
```

```bibtex
@inproceedings{amphion,
author={Zhang, Xueyao and Xue, Liumeng and Gu, Yicheng and Wang, Yuancheng and Li, Jiaqi and He, Haorui and Wang, Chaoren and Song, Ting and Chen, Xi and Fang, Zihao and Chen, Haopeng and Zhang, Junan and Tang, Tze Ying and Zou, Lexiao and Wang, Mingxuan and Han, Jun and Chen, Kai and Li, Haizhou and Wu, Zhizheng},
title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit},
booktitle={{IEEE} Spoken Language Technology Workshop, {SLT} 2024},
year={2024}
}
```
Binary file removed imgs/visualization/SingVisio_demo.png
Binary file added imgs/visualization/SingVisio_system.jpg
Binary file removed imgs/visualization/SingVisio_system.png
17 changes: 16 additions & 1 deletion preprocessors/Emilia/README.md
@@ -186,6 +186,21 @@ The processed audio (default 24k sample rate) files will be saved into `input_fo
]
```
## TODOs 📝
Here are some potential improvements for the Emilia-Pipe pipeline:
- [x] Optimize the pipeline for better processing speed.
- [ ] Support input audio files larger than 4 GB (as measured in WAVE format).
- [ ] Update the source separation model to better handle noisy audio (e.g., reverberation).
- [ ] Ensure a single speaker per segment in the speaker diarization step.
- [ ] Move VAD to the first step to filter out non-speech segments early, for better speed.
- [ ] Extend the ASR-supported maximum segment length beyond 30s while maintaining speed.
- [ ] Fine-tune the ASR model to improve transcription accuracy on punctuation.
- [ ] Add multimodal features to the pipeline for better transcription accuracy.
- [ ] Filter out segments with residual background noise, speaker overlap, hallucinated transcriptions, etc.
- [ ] Label the data: speaker info (e.g., gender, age, native language, health), emotion, speaking style (pitch, rate, accent), acoustic features (e.g., fundamental frequency, formants), and environmental factors (background noise, microphone setup). Non-verbal cues (e.g., laughter, coughing, silence, fillers) and paralinguistic features could be labeled as well.
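The labeling TODO above implies a fairly rich per-segment annotation schema. One way it could be sketched in code — all field names here are hypothetical, chosen only to mirror the categories listed, and are not part of the actual Emilia-Pipe output:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SegmentLabels:
    """Hypothetical per-segment annotation schema mirroring the labeling
    TODO above; illustrative only, not the Emilia-Pipe output format."""
    # Speaker info
    gender: Optional[str] = None
    age: Optional[int] = None
    native_language: Optional[str] = None
    # Emotion and speaking style
    emotion: Optional[str] = None
    pitch: Optional[float] = None          # e.g., mean F0 in Hz
    speaking_rate: Optional[float] = None  # e.g., syllables per second
    accent: Optional[str] = None
    # Environmental factors and non-verbal cues
    background_noise: Optional[str] = None
    non_verbal_cues: list = field(default_factory=list)  # "laughter", "coughing", ...

# Example annotation for one segment.
labels = SegmentLabels(gender="female", emotion="happy", non_verbal_cues=["laughter"])
```

Keeping every field optional lets annotations accumulate incrementally as each labeling sub-task in the TODO list is completed.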
## Acknowledgement 🔔
We acknowledge the wonderful work by these excellent developers!
- Source Separation: [UVR-MDX-NET-Inst_HQ_3](https://github.com/TRvlvr/model_repo/releases/tag/all_public_uvr_models)
@@ -209,7 +224,7 @@ If you use the Emilia dataset or the Emilia-Pipe pipeline, please cite the follo
@inproceedings{amphion,
author={Zhang, Xueyao and Xue, Liumeng and Gu, Yicheng and Wang, Yuancheng and Li, Jiaqi and He, Haorui and Wang, Chaoren and Song, Ting and Chen, Xi and Fang, Zihao and Chen, Haopeng and Zhang, Junan and Tang, Tze Ying and Zou, Lexiao and Wang, Mingxuan and Han, Jun and Chen, Kai and Li, Haizhou and Wu, Zhizheng},
title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit},
booktitle={Proc.~of SLT},
booktitle={{IEEE} Spoken Language Technology Workshop, {SLT} 2024},
year={2024}
}
```
118 changes: 118 additions & 0 deletions preprocessors/Emilia/main_multi.py
@@ -0,0 +1,118 @@
# Copyright (c) 2024 Amphion.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

import argparse
import multiprocessing
import os
import subprocess
import time

from utils.logger import Logger
from utils.tool import get_gpu_nums


def run_script(args, gpu_id, self_id):
"""
Run the script by passing the GPU ID and self ID to environment variables and execute the main.py script.
Args:
args (argparse.Namespace): Parsed command-line arguments.
gpu_id (int): ID of the GPU.
self_id (int): ID of the process.
Returns:
None
"""
env = os.environ.copy()
env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
env["SELF_ID"] = str(self_id)

command = (
f"source {args.conda_path} && "
'eval "$(conda shell.bash hook)" && '
f"conda activate {args.conda_env_name} && "
f"python main.py {args.main_command_args}"
)

try:
process = subprocess.Popen(command, shell=True, env=env, executable="/bin/bash")
process.wait()
logger.info(f"Process for GPU {gpu_id} completed successfully.")
except KeyboardInterrupt:
logger.warning(f"GPU {gpu_id}: interrupted by keyboard, exiting...")
except Exception as e:
logger.error(f"Error occurred for GPU {gpu_id}: {e}")


def main(args, self_id):
"""
Start multiple script tasks using multiple processes, each process using one GPU.
Args:
args (argparse.Namespace): Parsed command-line arguments.
self_id (str): Identifier for the current process.
Returns:
None
"""
disabled_ids = []
if args.disabled_gpu_ids:
disabled_ids = [int(i) for i in args.disabled_gpu_ids.split(",")]
logger.info(f"CUDA_DISABLE_ID is set, not using: {disabled_ids}")

gpus_count = get_gpu_nums()

available_gpus = [i for i in range(gpus_count) if i not in disabled_ids]
processes = []

for gpu_id in available_gpus:
process = multiprocessing.Process(
target=run_script, args=(args, gpu_id, self_id)
)
process.start()
logger.info(f"GPU {gpu_id}: started...")
time.sleep(1)
processes.append(process)

for process in processes:
process.join()


if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--self_id", type=str, default="main_multi", help="Log ID")
parser.add_argument(
"--disabled_gpu_ids",
type=str,
default="",
help="Comma-separated list of disabled GPU IDs, default uses all available GPUs",
)
parser.add_argument(
"--conda_path",
type=str,
default="/opt/conda/etc/profile.d/conda.sh",
help="Conda path",
)
parser.add_argument(
"--conda_env_name",
type=str,
default="AudioPipeline",
help="Conda environment name",
)
parser.add_argument(
"--main_command_args",
type=str,
default="",
help="Main command args, check available options by `python main.py --help`",
)
args = parser.parse_args()

self_id = args.self_id
if "SELF_ID" in os.environ:
self_id = f"{self_id}_#{os.environ['SELF_ID']}"

logger = Logger.get_logger(self_id)

logger.info(f"Starting main_multi.py with self_id: {self_id}, args: {vars(args)}.")
main(args, self_id)
logger.info("Exiting main_multi.py...")
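The core pattern in `main_multi.py` — pinning each worker to one GPU by setting `CUDA_VISIBLE_DEVICES` in a copied environment before spawning a subprocess — can be demonstrated in isolation. In this sketch the child simply echoes the variable back instead of running `main.py`:

```python
import os
import subprocess
import sys

def run_pinned(gpu_id):
    """Spawn a child process that sees only the given GPU, the same way
    main_multi.py pins each worker via CUDA_VISIBLE_DEVICES."""
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    # The child prints the value it observes; in the real pipeline,
    # main.py's deep-learning framework would pick the variable up
    # and only enumerate the devices it names.
    out = subprocess.run(
        [sys.executable, "-c",
         "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"],
        env=env, capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()
```

Because the variable lives in the child's environment rather than the parent's, multiple workers can each be pinned to a different GPU without interfering with one another.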
2 changes: 1 addition & 1 deletion visualization/SingVisio/webpage/README.md
@@ -1,6 +1,6 @@
## SingVisio Webpage

This is the source code for the SingVisio Webpage. This README file will introduce the project and provide an installation guide.
This is the source code for the SingVisio Webpage. This README file will introduce the project and provide an installation guide. For an introduction to SingVisio itself, please check this [README.md](../../../egs/visualization/SingVisio/README.md) file.

### Tech Stack

