Audio Separation Project

An assignment for the artificial intelligence department. A simple implementation of a music source separation project, based on the Demucs paper.

1 Direct Implementation

1.0 OS

  • Windows10
  • Ubuntu20.04
  • macOS (CPU only)

1.1 Modify configs/config.yaml

1.2 Run in a terminal:

python run.py --c configs/config.yaml

The results are then written to the records folder.

2 Train model

2.1 Dataset

Data comes first. We use a dataset from SigSep, an open-source site that hosts datasets for audio source separation. We selected the following one:

For Southeast University students, we have uploaded the dataset to pan.seu.edu.cn to speed up the download. Here is the link:

After downloading and unzipping, please convert the files from .mp3 to .wav, since torchaudio (as of Nov 2023) only supports the wav format. You can run the following bash script directly (remember to change the paths!). We recommend placing the musicdb18 folder next to the project folder:

audioSep Project
|-- changeAudioFormat.bash
|-- ...
musicdb18
|-- piece1.mp3
|-- piece2.mp3
|-- ...

# make sure you are in the project workspace
chmod +x changeAudioFormat.bash
./changeAudioFormat.bash

After running it, you will see the folder musicdb18_wav in your project folder. For more details about this dataset, please refer to the introduction site or open the readme in the downloaded dataset folder.
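The contents of changeAudioFormat.bash are not shown here, so the following is only a minimal Python sketch of the same idea, not the script itself. It assumes ffmpeg is installed and on PATH. The function names `plan_conversions` and `convert_all` are hypothetical, and the folder names simply follow the layout above.

```python
import subprocess
from pathlib import Path


def plan_conversions(src_dir, dst_dir):
    """Pair every .mp3 under src_dir with a .wav path under dst_dir."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    pairs = []
    for mp3 in sorted(src_dir.rglob("*.mp3")):
        # Mirror the relative layout and swap the extension.
        wav = dst_dir / mp3.relative_to(src_dir).with_suffix(".wav")
        pairs.append((mp3, wav))
    return pairs


def convert_all(src_dir, dst_dir):
    """Run ffmpeg once per file (assumption: ffmpeg is on PATH)."""
    for mp3, wav in plan_conversions(src_dir, dst_dir):
        wav.parent.mkdir(parents=True, exist_ok=True)
        # -y overwrites an existing output file, -i names the input file.
        subprocess.run(["ffmpeg", "-y", "-i", str(mp3), str(wav)], check=True)


# Assumed usage, run from the project folder:
# convert_all("../musicdb18", "musicdb18_wav")
```

Keeping the path planning separate from the ffmpeg call makes it easy to check which files would be written before converting anything.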

3 Others

3.1 About metrics

STOI (Short-Time Objective Intelligibility)

  • Mono audio only.

The stoi function is designed to evaluate the intelligibility of speech signals, which are typically mono. Intelligibility is a measure of how comprehensible speech is in given conditions, and for this measurement, stereo or multi-channel audio does not provide additional information compared to mono audio.

If the source or predicted audio is stereo (i.e., has 2 channels), it's common practice to either:

  • (The method we adopt) Average the channels to get a mono signal.
  • Evaluate the metric on each channel separately and then average the results.
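The two options can be sketched as follows. This is an illustration under assumptions, not the project's code: the helper names are hypothetical, `metric` stands for any function that scores two mono signals, and plain Python lists are used to stay dependency-free. With a torch tensor of shape (channels, time), the downmix would typically be `waveform.mean(dim=0)`.

```python
def downmix_to_mono(channels):
    """Average equally long channels sample by sample."""
    if not channels:
        raise ValueError("need at least one channel")
    length = len(channels[0])
    if any(len(ch) != length for ch in channels):
        raise ValueError("all channels must have the same length")
    return [sum(samples) / len(channels) for samples in zip(*channels)]


def metric_on_mono(metric, reference, estimate):
    """Option 1: downmix both signals, then evaluate the metric once."""
    return metric(downmix_to_mono(reference), downmix_to_mono(estimate))


def metric_per_channel(metric, reference, estimate):
    """Option 2: evaluate each channel pair, then average the scores."""
    scores = [metric(ref, est) for ref, est in zip(reference, estimate)]
    return sum(scores) / len(scores)
```

The two options do not always agree: errors that cancel between the left and right channels disappear after downmixing but still count when each channel is scored separately.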

PESQ (Perceptual Evaluation of Speech Quality)

  • Mono audio only.

Like STOI, PESQ is designed for mono signals and particularly for evaluating the quality of speech signals. For stereo or multi-channel audio, the same approach as STOI can be taken.

Caveat: PESQ relies on perceptual models of speech, so its scores may be unreliable when applied to non-speech signals such as music.

  • (The method we adopt) Average the channels to get a mono signal.

SDR (Source-to-Distortion Ratio)

Multi-channel capable: SDR can be computed for multi-channel audio, typically channel by channel, with the results then averaged.

SNR (Signal-to-Noise Ratio)

Multi-channel capable: like SDR, SNR is typically computed for each channel separately and then averaged.
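As an illustration of the channel-wise averaging, here is a minimal sketch using the plain textbook definition SNR = 10 * log10(sum(ref^2) / sum((ref - est)^2)). The function names and the `eps` guard are assumptions made for this example, and the project's own metric code may differ.

```python
import math


def snr_db(reference, estimate, eps=1e-12):
    """SNR in dB for one channel; eps avoids division by zero."""
    signal = sum(r * r for r in reference)
    noise = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal + eps) / (noise + eps))


def multichannel_snr_db(reference, estimate):
    """Compute SNR per channel, then average the per-channel values."""
    scores = [snr_db(ref, est) for ref, est in zip(reference, estimate)]
    return sum(scores) / len(scores)
```

Note that the average is taken over the dB values of the channels, so a single poor channel lowers the overall score.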

SIR (Signal-to-Interference Ratio)

Multi-channel capable: SIR measures the amount of interference from other sources in the separated source. A higher SIR means the separated source contains less interference from other sources, i.e. the model separates better.