This toolkit provides the voice activity detection (VAD) code and our recorded dataset.
- Good news! We have uploaded a speech enhancement toolkit based on deep neural networks. It provides several useful components, such as a data generation script. You can find the toolkit here.
- The test script, fully written in Python, has been uploaded to the 'py' branch.
The VAD toolkit in this project was used in the following paper:
J. Kim and M. Hahn, "Voice Activity Detection Using an Adaptive Context Attention Model," in IEEE Signal Processing Letters, vol. PP, no. 99, pp. 1-1.
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8309294&isnumber=4358004
If our VAD toolkit supports your research, we would appreciate it if you cite this paper.
ACAM is based on the recurrent attention model (RAM) [1]; implementations of RAM can be found in the jlindsey15 and jtkim-kaist repositories.
VAD in this toolkit follows the procedure below:
In this toolkit, we use the multi-resolution cochleagram (MRCG) [2] as the acoustic feature, implemented in MATLAB. Note that MRCG extraction takes relatively long compared to running the classifier.
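The exact MRCG implementation lives in the MATLAB code of this toolkit; the following Python sketch only illustrates the multi-resolution idea, using a log-mel spectrogram as a stand-in for the 64-channel gammatone cochleagram (librosa and scipy are assumptions here, not toolkit dependencies, and the function names are ours):

```python
import numpy as np
import librosa
from scipy.ndimage import uniform_filter

def mrcg_like(sig, fs=16000, hop=160, n_ch=64):
    """Rough sketch of the MRCG idea [2]: short- and long-window analyses
    plus two smoothed copies, stacked along the channel axis."""
    def logmel(win):
        m = librosa.feature.melspectrogram(y=sig, sr=fs, n_fft=win,
                                           hop_length=hop, n_mels=n_ch)
        return np.log(m + 1e-8)
    g1 = logmel(int(0.020 * fs))               # high temporal resolution (20 ms window)
    g2 = logmel(int(0.200 * fs))               # low temporal resolution (200 ms window)
    g3 = uniform_filter(g1, size=(11, 11))     # 11x11 mean-smoothed copy of g1
    g4 = uniform_filter(g1, size=(23, 23))     # 23x23 mean-smoothed copy of g1
    return np.concatenate([g1, g2, g3, g4], axis=0)   # (4 * n_ch) x n_frames
```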
This toolkit supports four types of MRCG-based classifiers, implemented in Python with TensorFlow (a minimal sketch of the bDNN-style boosted decision follows the list):
- Adaptive context attention model (ACAM)
- Boosted deep neural network (bDNN) [2]
- Deep neural network (DNN) [2]
- Long short term memory recurrent neural network (LSTM-RNN) [3]
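The bDNN decision rule can be summarized with a small sketch (this is not the toolkit's actual code; the function and argument names are illustrative): each context window centred at frame i predicts labels for all frames inside the window, so every frame receives several predictions, which are averaged before thresholding [2].

```python
import numpy as np

def boosted_decision(window_preds, offsets, n_frames, threshold=0.5):
    """window_preds[i, k] is the prediction for frame (i + offsets[k])
    produced by the context window centred at frame i."""
    acc = np.zeros(n_frames)
    cnt = np.zeros(n_frames)
    for i in range(window_preds.shape[0]):
        for k, off in enumerate(offsets):
            t = i + off
            if 0 <= t < n_frames:
                acc[t] += window_preds[i, k]   # accumulate every prediction made for frame t
                cnt[t] += 1
    avg = acc / np.maximum(cnt, 1)             # average the overlapping predictions
    return (avg > threshold).astype(int)       # final per-frame VAD decision
```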
- Python 3
- TensorFlow 1.1-1.3
- MATLAB 2017b (will be deprecated)
The default model provided in this toolkit was trained on our dataset, which is described in our submitted paper.
The example MATLAB script is main.m. Just run it in MATLAB.
The result will look like the following figure.
Note: To apply this toolkit to other speech data, the data should be sampled at a 16 kHz sampling rate.
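If your data has a different sampling rate, one possible way to resample it beforehand is shown below (librosa and soundfile are not dependencies of this toolkit, and the file names are placeholders):

```python
import librosa
import soundfile as sf

# Resample an arbitrary wav file to the 16 kHz rate this toolkit expects.
x, _ = librosa.load('input.wav', sr=16000)   # librosa resamples on load
sf.write('input_16k.wav', x, 16000)
```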
- A sample database is provided in 'path/to/project/data/raw'. Please refer to it to understand the data format.
- The model specifications are described in 'path/to/project/configure'.
- The training procedure has 2 steps: (i) MRCG extraction; (ii) Model training.
Note: Do not forget to add this project's path in MATLAB (e.g., with `addpath(genpath('path/to/project'))`).
```bash
# train.sh
# train script options:
#   -m 0 : ACAM
#   -m 1 : bDNN
#   -m 2 : DNN
#   -m 3 : LSTM
#   -e   : extract MRCG features (1) or not (0)
python3 $train -m 0 -e 1 --prj_dir=$curdir
```
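For example, `python3 $train -m 3 -e 0 --prj_dir=$curdir` would train the LSTM model without re-extracting the MRCG features, assuming they were already extracted in a previous run.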
Our recorded dataset is freely available: Download
- Environments: bus stop, construction site, park, and room.
- Recording device: a smartphone (Samsung Galaxy S8).
In each environment, conversational speech by two Korean male speakers was recorded. The ground-truth labels were manually annotated. Because the recording was carried out in the real world, the dataset includes unexpected noises such as a baby crying, insects chirping, and mouse clicks. The details of the dataset are given in the following table:
| | Bus stop | Cons. site | Park | Room | Overall |
|---|---|---|---|---|---|
| Dur. (min) | 30.02 | 30.03 | 30.07 | 30.05 | 120.17 |
| Avg. SNR (dB) | 5.61 | 2.05 | 5.71 | 18.26 | 7.91 |
| % of speech | 40.12 | 26.71 | 26.85 | 30.44 | 31.03 |
- Although MRCG shows good performance, its extraction time is somewhat long, so we plan to replace it with another feature such as the spectrogram.
If you find any errors in the code, please contact us.
E-mail: [email protected]
Copyright (c) 2017 Speech and Audio Information Laboratory, KAIST, South Korea
License
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
[1] J. Ba, V. Mnih, and K. Kavukcuoglu, “Multiple object recognition with visual attention,” arXiv preprint arXiv:1412.7755, 2014.
[2] X.-L. Zhang and D. Wang, “Boosting contextual information for deep neural network based voice activity detection,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, no. 2, pp. 252-264, 2016.
[3] R. Zazo, T. N. Sainath, G. Simko, and C. Parada, “Feature learning with raw-waveform CLDNNs for voice activity detection,” in Proc. Interspeech, 2016.
Jaeseok Kim (KAIST) contributed to this project by converting the MATLAB scripts to Python.