This repository explores speech enhancement using convolutional autoencoders.
In particular, we look at how a U-Net model, i.e. a convolutional autoencoder with skip connections, can be used for speech enhancement. U-Net is a convolutional network architecture for fast and precise segmentation of images and was first used in biomedical imaging, but we can take advantage of it for denoising noisy speech. Intuitively, we can treat log-spectrograms as 2-D images fed to the network, from which relevant features are encoded and then decoded back into a spectrogram with reduced noise.
Steps involved in denoising noisy speech data:
- Generate noisy speech files by adding different kinds of noise to clean speech files.
- Convert the noisy speech files to time series data (waveforms).
- Convert the time series data to log-spectrograms.
- Train a U-Net model to learn the noise spectrograms.
- Generate clean speech spectrograms and convert them back to .wav files.
- Calculate WER (Word Error Rate), MER (Match Error Rate) and WIL (Word Information Lost) to evaluate the performance of the model.
We can use the script provided by MS-SNSD to generate the required noisy speech files. I have already generated a few for my project; you can access them through this drive link.
To load .wav files for computation we can use librosa.load with a sampling rate of 8000 Hz. For computational purposes we limit the size of each extracted audio file to 2 × 8064 samples (~2 s) and stack the resulting frames in a 2-D array of size (number_of_audio_files × 8064).
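The loading-and-stacking step might look like the following minimal numpy sketch. The `frame_audio` helper and the random stand-in waveforms are illustrative assumptions; in the real pipeline each waveform would come from `librosa.load(path, sr=8000)`:

```python
import numpy as np

SR = 8000            # sampling rate used throughout the project
CLIP_LEN = 2 * 8064  # ~2 s of audio at 8 kHz
FRAME_LEN = 8064     # row length of the stacked training matrix

def frame_audio(y, clip_len=CLIP_LEN, frame_len=FRAME_LEN):
    """Truncate/zero-pad a waveform to clip_len samples and split it
    into fixed-size frames (rows of the training matrix)."""
    y = np.asarray(y, dtype=np.float32)[:clip_len]
    if y.size < clip_len:
        y = np.pad(y, (0, clip_len - y.size))
    return y.reshape(-1, frame_len)

# Stand-ins for waveforms returned by librosa.load("noisy_0.wav", sr=SR).
waveforms = [np.random.randn(20000), np.random.randn(5000)]
stacked = np.vstack([frame_audio(y) for y in waveforms])
print(stacked.shape)  # (4, 8064): each ~2 s clip contributes two rows
```

Zero-padding short clips keeps every row the same length, which is what lets us batch the spectrograms later.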
Time series data carries no explicit frequency information, so we use the Fourier transform. In particular, we compute the short-time Fourier transform (STFT) of the time series to obtain a complex matrix (spectrogram), which gives a clear picture of how different frequency components evolve over time. Since human perception of sound is logarithmic, we convert our spectrograms into log-spectrograms, i.e. convert power into decibels. For model input we only need the magnitude part; the phase part is required for converting spectrograms back into audio files. We generate spectrograms for both noisy speech files and clean speech files.
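The spectrogram step can be sketched with scipy's STFT (a stand-in for the librosa calls used in the project; the 440 Hz test tone and the `nperseg=256` window size are illustrative assumptions):

```python
import numpy as np
from scipy.signal import stft

SR = 8000
# One second of a 440 Hz tone as a stand-in for a loaded waveform.
t = np.arange(SR) / SR
y = np.sin(2 * np.pi * 440 * t)

# Complex spectrogram: rows are frequency bins, columns are time frames.
_, _, Z = stft(y, fs=SR, nperseg=256)

magnitude = np.abs(Z)   # fed to the model
phase = np.angle(Z)     # kept aside for reconstruction
# Log-spectrogram: amplitude in decibels
# (librosa.amplitude_to_db performs an equivalent conversion).
log_spec = 20 * np.log10(magnitude + 1e-10)
print(magnitude.shape)  # (129, n_frames): nperseg//2 + 1 frequency bins
```

Splitting `Z` into magnitude and phase up front mirrors the text: the magnitude becomes the 2-D "image" for the network, while the phase is stored untouched.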
The U-Net model at its core is a convolutional autoencoder with skip connections. Max-pooling is used in the encoder and up-sampling in the decoder. Skip connections from encoder to decoder help tackle the vanishing-gradient issue and recover detail lost during down-sampling. The U-Net takes the noisy speech spectrogram as input and predicts the noise spectrogram as output, which is equal to the noisy speech spectrogram minus the clean speech spectrogram.
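As a rough illustration of the architecture, a minimal two-level U-Net could be sketched in Keras as follows. The filter counts, depth and 128×128 input shape are illustrative assumptions, not the trained model shipped in this repository:

```python
from tensorflow import keras
from tensorflow.keras import layers

def tiny_unet(input_shape=(128, 128, 1)):
    inp = keras.Input(shape=input_shape)
    # Encoder: convolutions followed by max-pooling.
    c1 = layers.Conv2D(16, 3, padding="same", activation="relu")(inp)
    p1 = layers.MaxPooling2D()(c1)
    c2 = layers.Conv2D(32, 3, padding="same", activation="relu")(p1)
    p2 = layers.MaxPooling2D()(c2)
    # Bottleneck.
    b = layers.Conv2D(64, 3, padding="same", activation="relu")(p2)
    # Decoder: up-sampling plus skip connections from the encoder.
    u2 = layers.UpSampling2D()(b)
    u2 = layers.concatenate([u2, c2])
    c3 = layers.Conv2D(32, 3, padding="same", activation="relu")(u2)
    u1 = layers.UpSampling2D()(c3)
    u1 = layers.concatenate([u1, c1])
    c4 = layers.Conv2D(16, 3, padding="same", activation="relu")(u1)
    # One output channel: the predicted noise spectrogram.
    out = layers.Conv2D(1, 1, padding="same")(c4)
    return keras.Model(inp, out)

model = tiny_unet()
model.compile(optimizer="adam", loss="mae")
print(model.output_shape)  # (None, 128, 128, 1)
```

The `concatenate` calls are the skip connections: each decoder stage sees both the up-sampled features and the matching encoder features.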
To generate clean speech spectrograms, we take the noise spectrograms produced by our model and subtract them from the noisy speech spectrograms given as input. To convert a log-spectrogram back into an audio file, we first rebuild the complex matrix by combining the magnitude with its respective phase spectrogram, then use librosa.core.istft to generate time series data, which can be written back to a .wav file using soundfile.write.
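The subtraction-and-inversion step can be sketched with scipy (standing in for the librosa/soundfile calls above). Note the "predicted" noise magnitude here is faked from the clean signal purely for illustration; in the real pipeline it comes from the trained U-Net:

```python
import numpy as np
from scipy.signal import stft, istft

SR = 8000
t = np.arange(SR) / SR
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.1 * np.random.randn(SR)

_, _, Z_noisy = stft(noisy, fs=SR, nperseg=256)
mag_noisy, phase_noisy = np.abs(Z_noisy), np.angle(Z_noisy)

# Stand-in for the U-Net's output: the true noise magnitude.
_, _, Z_clean = stft(clean, fs=SR, nperseg=256)
predicted_noise_mag = mag_noisy - np.abs(Z_clean)

# Denoised magnitude = noisy magnitude - predicted noise magnitude,
# clipped at zero since magnitudes cannot be negative.
mag_denoised = np.maximum(mag_noisy - predicted_noise_mag, 0.0)

# Rebuild the complex matrix from magnitude and the noisy phase, then invert.
# With librosa this would be librosa.core.istft, followed by
# soundfile.write("denoised.wav", y_denoised, SR).
Z_denoised = mag_denoised * np.exp(1j * phase_noisy)
_, y_denoised = istft(Z_denoised, fs=SR, nperseg=256)
print(y_denoised.shape)
```

Reusing the noisy phase is the standard shortcut here: only the magnitude is denoised, and the phase is carried over unchanged.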
| Input | Output |
| --- | --- |
| Input audio file - 1 | Output audio file - 1 |
| Input audio file - 2 | Output audio file - 2 |
Finally, to evaluate the performance of our model we use the WER (Word Error Rate), MER (Match Error Rate) and WIL (Word Information Lost) metrics. Thanks to my friend Simon, who has prepared a notebook for this (Link).
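As a self-contained illustration, WER can be computed with a word-level Levenshtein distance; MER and WIL are derived from the same alignment counts, and libraries such as jiwer implement all three. The example sentences are made up:

```python
def wer(reference, hypothesis):
    """Word Error Rate: (substitutions + deletions + insertions)
    divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1/6 ≈ 0.1667
```

In practice the reference is the transcript of the clean audio and the hypothesis is an ASR transcript of the denoised output, so a lower WER means better enhancement.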
There are two options to denoise data:
- Retrain the whole network on some new data.
- Use the pre-trained model from the Models folder and get the denoised output.
- Research papers on speech enhancement Link
- Speech Enhancement In Multiple-Noise Conditions using Deep Neural Networks by Anurag Kumar, Dinei Florencio. Link
- A Fully Convolutional Neural Network for Speech Enhancement by Se Rim Park and Jin Won Lee. Link
- Speech Denoising DNN Link
- Sound Of AI youtube channel Link
- Digital signal processing Link
- Speech Enhancement Link
- U-Net model Link
- Why skip connections are needed in U-Net Link
- Understanding Semantic Segmentation with U-Net Link
- Understanding up-sampling Link
- Understanding Convolutional Neural Networks Link
- Padding in CNN Link
- Denoising images Link