data and model question #13
@jtkim-kaist, sorry to disturb you. Thanks
Thank you for the detailed question. Since there are several questions, let's solve them step by step.
If your questions 1 and 2 are solved, I will answer the remaining ones. Thx
Hi @jtkim-kaist,
Thanks
Hi @jtkim-kaist,
Actually, I found that the performance of ACAM is still not perfect. For example: Thanks
I'm really sorry for the late answer; these days I'm very busy :( . For the above questions: maybe your tested speech is from Aurora, which contains short utterances sampled at 8,000 Hz. In my experience, the VADs in this project perform rather poorly on 8 kHz datasets, even when they are upsampled to 16 kHz. (In contrast, the VADs work well on datasets with sampling rates higher than 16 kHz.) The reason might be that the VADs are trained on 16 kHz data only. The lower probability means that the VAD makes its decision with low confidence. To solve this problem, apply
post-processing: https://github.com/jtkim-kaist/end-point-detection
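As a rough illustration of the upsampling step discussed above, here is a minimal NumPy sketch (the helper name `upsample_to_16k` is hypothetical; it uses simple linear interpolation, and a proper polyphase resampler such as `scipy.signal.resample_poly` would be preferable in practice). Note that upsampling cannot restore the 4-8 kHz band that an 8 kHz recording never captured, which may be why performance still lags behind true 16 kHz data:

```python
import numpy as np

def upsample_to_16k(wav, orig_sr=8000, target_sr=16000):
    # Linear-interpolation resampling from orig_sr to target_sr.
    n_out = int(len(wav) * target_sr / orig_sr)
    t_in = np.arange(len(wav)) / orig_sr
    t_out = np.arange(n_out) / target_sr
    return np.interp(t_out, t_in, wav)
```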
For the remaining questions,
and compare the DNN with the bDNN-based VAD in this project.
Hi @jtkim-kaist,
I guess the problem might be:
Thanks
To solve this kind of problem, we have to use a situation-robust feature. For example, the MRCG feature in this project normalizes the power of the speech when calculating the feature values, so it is robust to energy variation. This means the VAD performs well regardless of distance, according to my investigation.
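To make the volume-invariance point concrete, here is a simplified sketch of per-utterance power normalization (the function name `power_normalized_logenergy` is hypothetical; real MRCG is computed from gammatone-filterbank cochleagrams at multiple resolutions, which this does not attempt). Because the waveform is scaled to unit RMS first, multiplying the input by any constant leaves the feature unchanged:

```python
import numpy as np

def power_normalized_logenergy(wav, frame_len=400, hop=160):
    # Normalize the whole utterance to unit RMS, so any volume
    # (amplitude) scaling cancels out before framing.
    wav = wav / (np.sqrt(np.mean(wav ** 2)) + 1e-12)
    n = 1 + max(0, (len(wav) - frame_len) // hop)
    frames = np.stack([wav[i * hop : i * hop + frame_len] for i in range(n)])
    # Log frame energy of the normalized signal.
    return np.log(np.mean(frames ** 2, axis=1) + 1e-12)
```

For example, `power_normalized_logenergy(wav)` and `power_normalized_logenergy(0.1 * wav)` give (near-)identical features, which mirrors why a power-normalized feature like MRCG is robust to recording distance.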
Hi @jtkim-kaist, yes, the MRCG feature stays the same if we change the volume. Let me study it further. Thanks for your help.
Hi @jtkim-kaist, Thanks
No, the threshold is used only in bDNN and ACAM. Please refer to the model definitions of DNN and LSTM: their predictions are made by an argmax over the softmax dimension.
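The two decision rules described above can be sketched as follows (the function names are hypothetical, not taken from the project's code). For DNN/LSTM, argmax over the class dimension needs no threshold, since softmax is monotonic; for bDNN/ACAM, a scalar speech probability is compared against a tunable threshold:

```python
import numpy as np

def decide_argmax(logits):
    # DNN / LSTM style: pick the highest-scoring class per frame.
    # Softmax is monotonic, so argmax on logits equals argmax on probabilities.
    return np.argmax(logits, axis=-1)

def decide_threshold(speech_probs, threshold=0.5):
    # bDNN / ACAM style: frame is speech iff its probability >= threshold.
    return (speech_probs >= threshold).astype(int)
```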
Hi @jtkim-kaist,
Thanks
Hi @jtkim-kaist,
Thanks for your great project; it helps me a lot.
I have several questions about this project. Could you help me? Thank you in advance.
1.1 This image shows TIMIT test data. I ran it with your saved model, and from 1.4 s to 1.5 s the output probability is still very high. Is that right? I think the probability should drop in this period.
1.2 This is your clean_speech.wav; the green line is the label. From 2.73 s to 2.83 s, the gap is >= 1 frame length, but all of it is labeled as speech. Is that right?
Thanks
Jinhong