I read in the audio in 1-second chunks and append them to a buffer. I convert the audio to an envelope and run a smoothing filter, then detect silence gaps of approximately 500 ms by thresholding the envelope. I search for a gap from the most recent data back in time, and when one is found, I send the data up to the center of the gap off to a thread that calls whisper_full, then erase that data from the buffer. This works very well. You can easily make Whisper execute on complete sentences, and also process short phrases as a single chunk, keeping in mind as you speak that a pause of more than one second will create a new chunk. The downside of this method is that the thresholding makes it dependent on mic volume. The upside is that chunk processing never occurs in the middle of words.
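For illustration, here is a minimal C++ sketch of that flow. It assumes 16 kHz mono float PCM, a pre-initialized `whisper_context`, and made-up values for the gap length, envelope threshold, and smoothing window; the threshold is the mic-volume-dependent part mentioned above, so it would need tuning.

```cpp
// Sketch of silence-gap chunking in front of whisper_full.
// Assumed constants (not from the original post): kThreshold, kSmoothWin.
#include <cmath>
#include <cstdio>
#include <thread>
#include <vector>
#include "whisper.h"

static const int   kSampleRate = WHISPER_SAMPLE_RATE;   // 16000 Hz mono float PCM
static const int   kGapSamples = kSampleRate / 2;        // ~500 ms silence gap
static const float kThreshold  = 0.01f;                  // envelope threshold (mic dependent)
static const int   kSmoothWin  = kSampleRate / 100;      // ~10 ms moving-average window

// Envelope = moving average of the absolute signal value.
static std::vector<float> envelope(const std::vector<float> & pcm) {
    std::vector<float> env(pcm.size(), 0.0f);
    float acc = 0.0f;
    for (size_t i = 0; i < pcm.size(); ++i) {
        acc += std::fabs(pcm[i]);
        if (i >= (size_t) kSmoothWin) acc -= std::fabs(pcm[i - kSmoothWin]);
        env[i] = acc / kSmoothWin;
    }
    return env;
}

// Scan backwards from the newest samples for a ~500 ms run below threshold.
// Returns the index of the gap's center, or -1 if no gap was found.
static long find_gap_center(const std::vector<float> & env) {
    long run = 0;
    for (long i = (long) env.size() - 1; i >= 0; --i) {
        if (env[i] < kThreshold) {
            if (++run >= kGapSamples) return i + run / 2;
        } else {
            run = 0;
        }
    }
    return -1;
}

// Called once per second with the newest 1 s of captured audio.
void on_audio_chunk(whisper_context * ctx, std::vector<float> & buffer,
                    const std::vector<float> & chunk) {
    buffer.insert(buffer.end(), chunk.begin(), chunk.end());

    const long center = find_gap_center(envelope(buffer));
    if (center < 0) return; // no silence gap yet, keep accumulating

    // Hand everything up to the gap center to a thread running whisper_full,
    // then drop those samples from the buffer.
    std::vector<float> segment(buffer.begin(), buffer.begin() + center);
    buffer.erase(buffer.begin(), buffer.begin() + center);

    // Note: in practice the calls should be serialized on a single worker
    // thread, since one whisper_context is not safe for concurrent whisper_full calls.
    std::thread([ctx, seg = std::move(segment)]() {
        whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
        if (whisper_full(ctx, params, seg.data(), (int) seg.size()) == 0) {
            for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
                printf("%s", whisper_full_get_segment_text(ctx, i));
            }
        }
    }).detach();
}
```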
Awesome work, @ggerganov!
I am looking for suggestions on how to go about real-time transcription + diarization using the stream example.
I have found a few examples that combine Whisper + pyannote.audio to transcribe and figure out who is saying what, but I am looking to create a solution that works with this high-performance version of Whisper to do both in real time.
Here is a real-time implementation of pyannote that might be useful, but I am not sure how to combine it with whisper.cpp.
Any ideas on how to approach this would be greatly appreciated!