I read in the audio in 1-second chunks and append them to a buffer. I convert the audio to an envelope and run a smoothing filter, then detect silence gaps of approximately 500 ms by thresholding the envelope. I search for a gap from the most recent data back in time, and when one is found, I send the data up to the center of the gap off to a thread that calls whisper_full, then erase that data from the buffer. This works very well. You can easily make Whisper execute on complete sentences, and also process short phrases as a single chunk, keeping in mind as you speak that a pause of more than one second will create a new chunk. The downside of this method is that the thresholding makes it dependent on mic volume. The upside is that chunk processing never occurs in the middle of words.
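For illustration, here is a minimal C++ sketch of that flow. It assumes 16 kHz mono float PCM, a pre-initialized `whisper_context`, and made-up values for the gap length, envelope threshold, and smoothing window; the threshold is the mic-volume-dependent part mentioned above, so it would need tuning.

```cpp
// Sketch of silence-gap chunking in front of whisper_full.
// Assumed constants (not from the original post): kThreshold, kSmoothWin.
#include <cmath>
#include <cstdio>
#include <thread>
#include <vector>
#include "whisper.h"

static const int   kSampleRate = WHISPER_SAMPLE_RATE;   // 16000 Hz mono float PCM
static const int   kGapSamples = kSampleRate / 2;        // ~500 ms silence gap
static const float kThreshold  = 0.01f;                  // envelope threshold (mic dependent)
static const int   kSmoothWin  = kSampleRate / 100;      // ~10 ms moving-average window

// Envelope = moving average of the absolute signal value.
static std::vector<float> envelope(const std::vector<float> & pcm) {
    std::vector<float> env(pcm.size(), 0.0f);
    float acc = 0.0f;
    for (size_t i = 0; i < pcm.size(); ++i) {
        acc += std::fabs(pcm[i]);
        if (i >= (size_t) kSmoothWin) acc -= std::fabs(pcm[i - kSmoothWin]);
        env[i] = acc / kSmoothWin;
    }
    return env;
}

// Scan backwards from the newest samples for a ~500 ms run below threshold.
// Returns the index of the gap's center, or -1 if no gap was found.
static long find_gap_center(const std::vector<float> & env) {
    long run = 0;
    for (long i = (long) env.size() - 1; i >= 0; --i) {
        if (env[i] < kThreshold) {
            if (++run >= kGapSamples) return i + run / 2;
        } else {
            run = 0;
        }
    }
    return -1;
}

// Called once per second with the newest 1 s of captured audio.
void on_audio_chunk(whisper_context * ctx, std::vector<float> & buffer,
                    const std::vector<float> & chunk) {
    buffer.insert(buffer.end(), chunk.begin(), chunk.end());

    const long center = find_gap_center(envelope(buffer));
    if (center < 0) return; // no silence gap yet, keep accumulating

    // Hand everything up to the gap center to a thread running whisper_full,
    // then drop those samples from the buffer.
    std::vector<float> segment(buffer.begin(), buffer.begin() + center);
    buffer.erase(buffer.begin(), buffer.begin() + center);

    // Note: in practice the calls should be serialized on a single worker
    // thread, since one whisper_context is not safe for concurrent whisper_full calls.
    std::thread([ctx, seg = std::move(segment)]() {
        whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
        if (whisper_full(ctx, params, seg.data(), (int) seg.size()) == 0) {
            for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
                printf("%s", whisper_full_get_segment_text(ctx, i));
            }
        }
    }).detach();
}
```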
Awesome work, @ggerganov!
I am looking for suggestions on how to go about real-time transcription + diarization using the stream example.
I have found a few examples that combine Whisper + pyannote.audio to transcribe and figure out who is saying what, but I am looking to create a solution that works with this high-performance version of Whisper to do both in real time.
Here is a real-time implementation of pyannote that might be useful, but I am not sure how to combine it with whisper.cpp.
Any ideas on how to approach this would be greatly appreciated!