data and model question #13
@jtkim-kaist, sorry to disturb you. Thanks
Thank you for the detailed question. Since there are several questions, let's solve them step by step.
If your questions 1 and 2 are solved, I will answer the remaining ones. Thx
Hi @jtkim-kaist,
Thanks
Hi @jtkim-kaist,
Actually, I found that the performance of ACAM is still not perfect. For example: Thanks
I'm really sorry for the late answer; these days I'm very busy :( . For the above questions: maybe your tested speech is from Aurora, which contains short utterances sampled at 8,000 Hz. In my experience, the VADs in this project perform rather poorly on 8 kHz datasets, even when they are upsampled to 16 kHz. (In contrast, the VADs work well on datasets with sampling rates higher than 16 kHz.) The reason might be that the VADs are trained on 16 kHz data only. The lower probability means that the VAD makes its decision with low confidence. To solve this problem, apply
post-processing: https://github.com/jtkim-kaist/end-point-detection
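As a rough illustration of the upsampling step discussed above, here is a minimal NumPy sketch (the helper name `upsample_to_16k` is hypothetical; it uses simple linear interpolation, and a proper polyphase resampler such as `scipy.signal.resample_poly` would be preferable in practice). Note that upsampling cannot restore the 4-8 kHz band that an 8 kHz recording never captured, which may be why performance still lags behind true 16 kHz data:

```python
import numpy as np

def upsample_to_16k(wav, orig_sr=8000, target_sr=16000):
    # Linear-interpolation resampling from orig_sr to target_sr.
    n_out = int(len(wav) * target_sr / orig_sr)
    t_in = np.arange(len(wav)) / orig_sr
    t_out = np.arange(n_out) / target_sr
    return np.interp(t_out, t_in, wav)
```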
For the remaining questions,
and compare the DNN with the bDNN-based VAD in this project.
Hi @jtkim-kaist,
I guess the problem might be:
Thanks
To solve this kind of problem, we have to use a situation-robust feature. For example, the MRCG feature in this project normalizes the power of the speech when calculating the feature values, so it is robust to energy variation. This means the VAD performs well regardless of distance, according to my investigation.
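To make the volume-invariance point concrete, here is a simplified sketch of per-utterance power normalization (the function name `power_normalized_logenergy` is hypothetical; real MRCG is computed from gammatone-filterbank cochleagrams at multiple resolutions, which this does not attempt). Because the waveform is scaled to unit RMS first, multiplying the input by any constant leaves the feature unchanged:

```python
import numpy as np

def power_normalized_logenergy(wav, frame_len=400, hop=160):
    # Normalize the whole utterance to unit RMS, so any volume
    # (amplitude) scaling cancels out before framing.
    wav = wav / (np.sqrt(np.mean(wav ** 2)) + 1e-12)
    n = 1 + max(0, (len(wav) - frame_len) // hop)
    frames = np.stack([wav[i * hop : i * hop + frame_len] for i in range(n)])
    # Log frame energy of the normalized signal.
    return np.log(np.mean(frames ** 2, axis=1) + 1e-12)
```

For example, `power_normalized_logenergy(wav)` and `power_normalized_logenergy(0.1 * wav)` give (near-)identical features, which mirrors why a power-normalized feature like MRCG is robust to recording distance.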
Hi @jtkim-kaist, yes, the MRCG feature stays the same if we change the volume. Let me study it further. Thanks for your help.
Hi @jtkim-kaist, Thanks
No, the threshold is used only in bDNN and ACAM. Please refer to the model definitions of DNN and LSTM: their predictions are made by an argmax over the softmax dimension.
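The two decision rules described above can be sketched as follows (the function names are hypothetical, not taken from the project's code). For DNN/LSTM, argmax over the class dimension needs no threshold, since softmax is monotonic; for bDNN/ACAM, a scalar speech probability is compared against a tunable threshold:

```python
import numpy as np

def decide_argmax(logits):
    # DNN / LSTM style: pick the highest-scoring class per frame.
    # Softmax is monotonic, so argmax on logits equals argmax on probabilities.
    return np.argmax(logits, axis=-1)

def decide_threshold(speech_probs, threshold=0.5):
    # bDNN / ACAM style: frame is speech iff its probability >= threshold.
    return (speech_probs >= threshold).astype(int)
```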
Hi @jtkim-kaist,
Thanks
Hi @jtkim-kaist,
Thanks for your great project; it helps me a lot.
I have several questions about this project. Could you help me? Thank you in advance.
1.1 This image shows TIMIT test data. I ran it with your saved model, and from 1.4 s to 1.5 s the output probability is still very high. Is that right? I think the probability should drop in this period.
1.2 This is your clean_speech.wav; the green line is the label. From 2.73 s to 2.83 s, the gap is >= 1 frame length, but all of it is labeled as speech. Is that right?
Thanks
Jinhong