
[Feature Request]: Speaker labels (Diarization) #74

Closed
thewh1teagle opened this issue May 21, 2024 · 23 comments

Comments

@thewh1teagle
Owner

thewh1teagle commented May 21, 2024

Goal

Provide speaker labels along with the transcription (e.g. Speaker 1: ..., Speaker 2: ...).
Do it at the same time as transcribing, and keep it efficient and lightweight.

Research

https://github.com/wq2012/awesome-diarization

Possible ways:
Use C/C++ diarization libs from Rust via bindgen
Replicate pyannote-audio in Rust with tch-rs

Use ONNX runtime with ort

pykeio/ort#208

pyannote/pyannote-audio#1322

Best combination:
pyannote segmentation-3.0
WespeakerVoxcelebResnet34LM

@florianchevallier

florianchevallier commented May 27, 2024

I don't know if it helps, but here is the most successful notebook I know of to perform this. Maybe it's adaptable to Rust?

https://github.com/MahmoudAshraf97/whisper-diarization/tree/main

@thewh1teagle
Owner Author

thewh1teagle commented Jun 2, 2024

Thanks! Most of the Python implementations use pyannote.
Since we use Rust in Vibe it's more challenging, as I can't find any OSS project that does it in Rust.

We'll probably use the ONNX runtime via https://github.com/pykeio/ort for segmentation and https://github.com/nkeenan38/voice_activity_detector for VAD

@oleole39
Contributor

oleole39 commented Jun 5, 2024

Or cheat a bit? https://rustpython.github.io (never used it myself though)

@thewh1teagle
Owner Author

Looks like a useful crate! But I hope we can keep avoiding Python for as long as possible, to maintain top-notch performance and quality.

thewh1teagle changed the title from "Speaker labels (Diarization)" to "[Feature Request]: Speaker labels (Diarization)" on Jun 8, 2024
@altunenes

altunenes commented Jul 5, 2024

I'm not an expert in this area, so maybe this is naive, but I tried to create a minimal Python script using an ONNX model (heavily based on https://github.com/pengzhendong/pyannote-onnx) to gain more insight into the process of converting this to Rust (using ort).

Converting this script to Rust doesn't seem like a big deal to me, but of course I might be missing something critical here (especially on the segmentation part), haha.

Warning: I tested this only with this file: https://github.com/pengzhendong/pyannote-onnx/blob/master/data/test_16k.wav, so it may not be appropriate as a general solution...

import numpy as np
import onnxruntime as ort
import soundfile as sf
from itertools import permutations

class MinimalSpeakerDiarization:
    def __init__(self, model_path):
        self.num_classes = 4                # column 0: non-speech, columns 1-3: speaker activations
        self.vad_sr = 16000                 # expected input sample rate (Hz)
        self.duration = 10 * self.vad_sr    # 10-second analysis window, in samples
        self.session = ort.InferenceSession(model_path)

    def sample2frame(self, x):
        # Sample index (16 kHz) -> model output frame index; 721 and 270 are the
        # offset and frame hop (in samples) used by pyannote-onnx for segmentation-3.0.
        return (x - 721) // 270

    def frame2sample(self, x):
        # Inverse of sample2frame: model frame index -> sample index.
        return (x * 270) + 721

    def sliding_window(self, waveform, window_size, step_size):
        start = 0
        num_samples = len(waveform)
        while start <= num_samples - window_size:
            yield waveform[start : start + window_size]
            start += step_size
        if start < num_samples:
            last_window = np.pad(waveform[start:], (0, window_size - (num_samples - start)))
            yield last_window

    def reorder(self, x, y):
        perms = [np.array(perm).T for perm in permutations(y.T)]
        diffs = np.sum(np.abs(np.sum(np.array(perms)[:, : x.shape[0], :] - x, axis=1)), axis=1)
        return perms[np.argmin(diffs)]

    def process_audio(self, audio_path):
        wav, sr = sf.read(audio_path)
        if sr != self.vad_sr:
            raise ValueError(f"Audio sample rate {sr} does not match required {self.vad_sr}")

        wav = wav.astype(np.float32)

        step = 5 * self.vad_sr
        step = max(min(step, int(0.9 * self.duration)), self.duration // 2)
        overlap = self.sample2frame(self.duration - step)
        overlap_chunk = np.zeros((overlap, self.num_classes), dtype=np.float32)

        results = []
        for window in self.sliding_window(wav, self.duration, step):
            window = window.astype(np.float32)
            ort_outs = np.exp(self.session.run(None, {"input": window[None, None, :]})[0][0])
            ort_outs = np.concatenate(
                (
                    1 - ort_outs[:, :1],  # speech probabilities
                    self.reorder(
                        overlap_chunk[:, 1 : self.num_classes],
                        ort_outs[:, 1 : self.num_classes],
                    ),  # speaker probabilities
                ),
                axis=1,
            )
            if len(results) > 0:
                ort_outs[:overlap, :] = (ort_outs[:overlap, :] + overlap_chunk) / 2
            overlap_chunk = ort_outs[-overlap:, :]
            results.extend(ort_outs[:-overlap])

        return np.array(results)

    def get_speech_segments_with_speakers(self, results, threshold=0.5, min_speech_duration_ms=100):
        speech_prob = results[:, 0]
        speaker_probs = results[:, 1:]
        segments = []
        in_speech = False
        start = 0
        
        # First, determine active speakers
        speech_duration = np.sum(speaker_probs > threshold, axis=0)
        speech_duration_ms = self.frame2sample(speech_duration) * 1000 / self.vad_sr
        active_speakers = np.where(speech_duration_ms > min_speech_duration_ms)[0]
        
        for i, (speech, speakers) in enumerate(zip(speech_prob, speaker_probs)):
            if not in_speech and speech >= threshold:
                start = i
                in_speech = True
            elif in_speech and speech < threshold:
                speaker_index = np.argmax(np.mean(speaker_probs[start:i], axis=0))
                if speaker_index in active_speakers:
                    speaker = f'speaker{np.where(active_speakers == speaker_index)[0][0] + 1}'
                    segments.append({
                        'start': self.frame2sample(start) / self.vad_sr,
                        'end': self.frame2sample(i) / self.vad_sr,
                        'speaker': speaker
                    })
                in_speech = False
        if in_speech:
            speaker_index = np.argmax(np.mean(speaker_probs[start:], axis=0))
            if speaker_index in active_speakers:
                speaker = f'speaker{np.where(active_speakers == speaker_index)[0][0] + 1}'
                segments.append({
                    'start': self.frame2sample(start) / self.vad_sr,
                    'end': self.frame2sample(len(speech_prob)) / self.vad_sr,
                    'speaker': speaker
                })
        return segments, len(active_speakers)

    def get_num_speakers(self, results, threshold=0.5, min_speech_duration_ms=100):
        speaker_probs = results[:, 1:]
        speech_duration = np.sum(speaker_probs > threshold, axis=0)
        speech_duration_ms = self.frame2sample(speech_duration) * 1000 / self.vad_sr
        return np.sum(speech_duration_ms > min_speech_duration_ms)

if __name__ == "__main__":
    model_path = "segmentation-3.0.onnx"
    audio_path = "test_16k.wav"

    diarizer = MinimalSpeakerDiarization(model_path)
    results = diarizer.process_audio(audio_path)

    speech_segments, num_speakers = diarizer.get_speech_segments_with_speakers(results)
    print("Speech segments with speakers:")
    for segment in speech_segments:
        print(f"Start: {segment['start']:.2f}s, End: {segment['end']:.2f}s, Speaker: {segment['speaker']}")

    print(f"Number of speakers detected: {num_speakers}")`
    
    
    note model: https://github.com/pengzhendong/pyannote-onnx/blob/master/pyannote_onnx/segmentation-3.0.onnx

@thewh1teagle
Owner Author

thewh1teagle commented Jul 5, 2024

@altunenes

Thanks for helping :)
I got another idea that may be easier to start with.
We can get word-level timestamps from whisper for each word (using max_len=1 and split_on_word=true).
After getting the segments, we can go through them; each one has a start and stop timestamp. Then we can run a speaker embedding model such as spkrec-ecapa-voxceleb with the ONNX runtime (pykeio/ort), and this way we'll have a speaker label for each word segment.
Then we can easily reconstruct the sentences from it.
We don't even need VAD (voice activity detection) or segmenting the audio; whisper does the heavy lifting already.
The downside is that we'll run the model on each word instead of an entire segment, which is less efficient.
What do you think?
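A minimal Python sketch of that idea (assuming a hypothetical embed() function that wraps a speaker-embedding ONNX model such as spkrec-ecapa-voxceleb, and whisper word segments with start/end times in seconds; the names and the 0.7 threshold are illustrative, not Vibe's actual code):

import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def label_words(words, wav, sr, embed, threshold=0.7):
    # words: list of {"start": float, "end": float, "text": str} (seconds)
    # wav:   mono float32 numpy waveform at sample rate sr
    # embed: hypothetical function mapping a waveform slice to an embedding vector
    centroids = []  # one running centroid per discovered speaker
    labeled = []
    for w in words:
        chunk = wav[int(w["start"] * sr): int(w["end"] * sr)]
        if len(chunk) == 0:
            labeled.append({**w, "speaker": None})
            continue
        emb = embed(chunk)
        # Greedy online clustering: attach the word to the most similar
        # existing speaker, or open a new speaker if nothing is close enough.
        sims = [cosine(emb, c) for c in centroids]
        if sims and max(sims) >= threshold:
            idx = int(np.argmax(sims))
            centroids[idx] = (centroids[idx] + emb) / 2  # drift the centroid
        else:
            centroids.append(emb)
            idx = len(centroids) - 1
        labeled.append({**w, "speaker": f"speaker{idx + 1}"})
    return labeled

Consecutive words that end up with the same label can then be joined back into sentences; the similarity threshold would need tuning per embedding model.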

@altunenes

@altunenes

Very creative!! this probably provides more accurate diarization across various languages and word lengths.

@thewh1teagle
Owner Author

Great :)
I started working on https://github.com/thewh1teagle/sherpa-rs to replicate https://github.com/k2-fsa/sherpa-onnx/blob/master/python-api-examples/speaker-identification.py

@thewh1teagle
Owner Author

I implemented diarization in sherpa-rs/examples/diarize.rs and added it to the Vibe source code.
It works pretty well and fast.
The only missing piece is that sherpa has a small issue with the VAD model (k2-fsa/sherpa-onnx#1084) where speech is sometimes not detected.
Once it's fixed, I'll change the whisper logic to transcribe only the diarized parts, and it will work in Vibe.

@altunenes

Nice! And thank you for your contributions. Maybe we should continue the discussion in sherpa-rs...

@csukuangfj

The only missing piece is that sherpa has a small issue with the VAD model (k2-fsa/sherpa-onnx#1084) where speech is sometimes not detected.

It is fixed in k2-fsa/sherpa-onnx#1099

@atsalyuk

atsalyuk commented Aug 1, 2024

@thewh1teagle any chance this will be added soon? Hoping to use this app for a project I'm working on instead of my current workflow. Thanks!

@thewh1teagle
Owner Author

Thanks for the interest :)
It's a very hard feature to add.
But in short, it works like this:

  1. We enable diarization from the advanced options in Vibe (already added). Once it's clicked, it asks to download the required models.
  2. When diarization is enabled, we tell whisper to enable word_timestamps, meaning every word gets a timestamp.
  3. After the transcription completes, we start diarizing the audio: we find exactly when there is speech, and then identify who is speaking in each of those regions.
  4. With that information, we know for each word whisper transcribed who spoke it (a rough sketch of steps 3-4 follows below).
  5. To detect speech I used Silero VAD in sherpa-rs, but it doesn't work well; to identify who is speaking I used nemo_en_speakerverification_speakernet in sherpa-rs, which works OK.
  6. I also integrated a custom whisper.cpp so the whisper timestamps are accurate. [Feature Request]: Mark pauses #152

I think the best chance is to add pyannote to sherpa-onnx: k2-fsa/sherpa-onnx#1197
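A rough Python sketch of steps 3-4 above (illustrative only, not the actual Vibe or sherpa-rs code): each whisper word is matched to the diarized segment it overlaps the most.

def overlap(a_start, a_end, b_start, b_end):
    # Length (in seconds) of the intersection of two time intervals.
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(words, diarized):
    # words:    [{"start", "end", "text"}] from whisper word timestamps
    # diarized: [{"start", "end", "speaker"}] from the diarization step
    # Returns the words with a "speaker" key (None if no segment overlaps).
    out = []
    for w in words:
        best_speaker, best_overlap = None, 0.0
        for seg in diarized:
            ov = overlap(w["start"], w["end"], seg["start"], seg["end"])
            if ov > best_overlap:
                best_speaker, best_overlap = seg["speaker"], ov
        out.append({**w, "speaker": best_speaker})
    return out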

@atsalyuk

atsalyuk commented Aug 1, 2024

Ah, I see. Thank you! Appreciate the quick response.

@thewh1teagle
Owner Author

thewh1teagle commented Aug 6, 2024

Some updates:
I created a simple diarization solution in pyannote-rs
and even added it to Vibe in another branch. It's accurate and also makes the transcription much more accurate.
The only issue is that it makes the transcription slower, since whisper is optimized for 30-second chunks but speech segments are often shorter.

On macOS with the medium model, a 40-second audio file takes 7 s normally and 15 s with diarization.
We could also just feed whisper big chunks as usual and take the timestamps from it, but its timestamps aren't accurate.

The diarization itself is fast: roughly 30 s for 1 hour of audio.

Todo: download the models instead of embedding them into the exe, to keep the exe lightweight.
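One possible way to reduce the slowdown (just a sketch of the idea, not what Vibe currently does): merge consecutive same-speaker diarized segments into chunks closer to whisper's 30-second window before transcribing, so whisper processes fewer, longer windows.

def merge_for_whisper(segments, max_len=30.0, max_gap=1.0):
    # segments: [{"start", "end", "speaker"}] sorted by start time, in seconds.
    # Merge consecutive same-speaker segments while the merged chunk stays under
    # max_len seconds and the silence between them is shorter than max_gap seconds.
    chunks = []
    for seg in segments:
        if (chunks
                and chunks[-1]["speaker"] == seg["speaker"]
                and seg["start"] - chunks[-1]["end"] <= max_gap
                and seg["end"] - chunks[-1]["start"] <= max_len):
            chunks[-1]["end"] = seg["end"]  # extend the current chunk
        else:
            chunks.append(dict(seg))        # start a new chunk
    return chunks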

@thewh1teagle
Owner Author

thewh1teagle commented Aug 8, 2024

Speaker diarization is released! (Beta)
You can try it here: https://github.com/thewh1teagle/vibe/releases/tag/v2.4.0-beta.0

A few things about it:

  1. Enable it through the transcription options in the main window.
  2. It's recommended to run the tiny model instead of the medium one; diarization makes the transcription slower.
  3. It's recommended to set the max number of speakers in the transcription settings.

@altunenes

exciting news!

@oleole39
Contributor

oleole39 commented Aug 8, 2024

Interesting, but FYI I can't run it on an Ubuntu 22.04-based OS: #207

@thewh1teagle
Owner Author

thewh1teagle commented Aug 8, 2024

I just released a stable release, including builds for 22.04:
https://github.com/thewh1teagle/vibe/releases/tag/v2.4.0

By the way, on Linux I strongly recommend using the tiny model for speed.
See the pre-built section in the latest release.

@altunenes

altunenes commented Aug 8, 2024

The tiny model is nice, especially in terms of speed, but in terms of transcription accuracy I found the sherpa versions of the models nicer, in my tests at least. :)

Note: maybe I should play with the params more...

@oleole39
Contributor

oleole39 commented Aug 9, 2024

I just released a stable release, including builds for 22.04

Thanks, now it works.

the tiny model is nice, especially in terms of speed, but in terms of transcription accuracy

Same here, the tiny model gives results that are too inaccurate to be practical, even with a higher temperature than the default.

By the way, on Linux I strongly recommend using the tiny model for speed

Fortunately, the medium model also works on Linux with diarization, even if it's slow.

Diarization works well! However, in the context of a podcast where the host talks for a while and then interviews other people (I tested with only 2 speakers in total), it behaves like this (using the text format for the output):

Speaker 1:
blablablabla

Speaker 1:
again blablablabla

Speaker 1: 
blablablabla ?

Speaker 2: 
Yes blablalbalba

Speaker 2: 
And blablablalba

To my mind, it should ideally not split the successive content of the same speaker into several labels, but rather into several paragraphs under a single label, i.e.

Speaker 1:
blablablabla
again blablablabla
blablablabla ?

Speaker 2: 
Yes blablalbalba 
And blablablalba
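The grouping described above could be handled entirely in the output formatter. A small illustrative sketch (not Vibe's actual formatter): emit one speaker label per run of consecutive segments from the same speaker.

def format_transcript(segments):
    # segments: [{"speaker": "Speaker 1", "text": "..."}] in chronological order.
    # Emit the label only when the speaker changes between consecutive segments.
    lines = []
    previous = None
    for seg in segments:
        if seg["speaker"] != previous:
            lines.append(f"\n{seg['speaker']}:")
            previous = seg["speaker"]
        lines.append(seg["text"])
    return "\n".join(lines).strip()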

@thewh1teagle
Owner Author

However, in the context of a podcast where the host talks for a while and then interviews other people, it behaves like this:

Could you please open a separate issue for this?

Phew, that was a huge feature! Now that Vibe supports diarization, we can close this issue :)

@altunenes

And also in Rust!
Congrats!!!!
