
[Feature Request]: Speaker labels (Diarization) #74

Closed
thewh1teagle opened this issue May 21, 2024 · 23 comments

Comments

@thewh1teagle
Owner

thewh1teagle commented May 21, 2024

Goal

Provide speaker labels along with the transcription (e.g. Speaker 1: ..., Speaker 2: ...).
Do it at the same time as transcribing, and keep it efficient and lightweight.

Research

https://github.com/wq2012/awesome-diarization

Possible ways:
Use C/C++ diarization libs from Rust via bindgen
Replicate pyannote-audio in Rust with tch-rs

Use ONNX runtime with ort

pykeio/ort#208

pyannote/pyannote-audio#1322

Best combination:
pyannote segmentation-3.0
WespeakerVoxcelebResnet34LM

@florianchevallier

florianchevallier commented May 27, 2024

I don't know if it helps, but here is the most successful notebook I know of to perform this. Maybe it's adaptable to Rust?

https://github.com/MahmoudAshraf97/whisper-diarization/tree/main

@thewh1teagle
Owner Author

thewh1teagle commented Jun 2, 2024

Thanks! Most of the Python implementations use pyannote.
Since we use Rust in Vibe it's more challenging, as I can't find any OSS project that does it in Rust.

We'll probably use the ONNX runtime via https://github.com/pykeio/ort for segmentation and https://github.com/nkeenan38/voice_activity_detector for VAD

@oleole39
Contributor

oleole39 commented Jun 5, 2024

Or cheat a bit? https://rustpython.github.io (never used it myself though)

@thewh1teagle
Owner Author

Looks like a useful crate! But I hope we can keep avoiding Python for as long as possible, to maintain top-notch performance and quality.

thewh1teagle changed the title from "Speaker labels (Diarization)" to "[Feature Request]: Speaker labels (Diarization)" on Jun 8, 2024
@altunenes

altunenes commented Jul 5, 2024

I'm not an expert in this area, so maybe this is naive, but I tried to create a minimal Python script using an ONNX model (heavily based on https://github.com/pengzhendong/pyannote-onnx) to gain more insight into the process of converting this to Rust (using ort).

Converting this script to Rust doesn't seem like a big deal to me, but of course I might be missing something critical here (especially on the segmentation part), haha.

Warning: I tested this only with this file: https://github.com/pengzhendong/pyannote-onnx/blob/master/data/test_16k.wav, so it may not be appropriate as a general solution...

import numpy as np
import onnxruntime as ort
import soundfile as sf
from itertools import permutations

class MinimalSpeakerDiarization:
    def __init__(self, model_path):
        self.num_classes = 4                # column 0: non-speech, columns 1-3: speaker activations
        self.vad_sr = 16000                 # expected input sample rate (Hz)
        self.duration = 10 * self.vad_sr    # 10-second analysis window, in samples
        self.session = ort.InferenceSession(model_path)

    def sample2frame(self, x):
        # Sample index (16 kHz) -> model output frame index; 721 and 270 are the
        # offset and frame hop (in samples) used by pyannote-onnx for segmentation-3.0.
        return (x - 721) // 270

    def frame2sample(self, x):
        # Inverse of sample2frame: model frame index -> sample index.
        return (x * 270) + 721

    def sliding_window(self, waveform, window_size, step_size):
        start = 0
        num_samples = len(waveform)
        while start <= num_samples - window_size:
            yield waveform[start : start + window_size]
            start += step_size
        if start < num_samples:
            last_window = np.pad(waveform[start:], (0, window_size - (num_samples - start)))
            yield last_window

    def reorder(self, x, y):
        perms = [np.array(perm).T for perm in permutations(y.T)]
        diffs = np.sum(np.abs(np.sum(np.array(perms)[:, : x.shape[0], :] - x, axis=1)), axis=1)
        return perms[np.argmin(diffs)]

    def process_audio(self, audio_path):
        wav, sr = sf.read(audio_path)
        if sr != self.vad_sr:
            raise ValueError(f"Audio sample rate {sr} does not match required {self.vad_sr}")

        wav = wav.astype(np.float32)

        step = 5 * self.vad_sr
        step = max(min(step, int(0.9 * self.duration)), self.duration // 2)
        overlap = self.sample2frame(self.duration - step)
        overlap_chunk = np.zeros((overlap, self.num_classes), dtype=np.float32)

        results = []
        for window in self.sliding_window(wav, self.duration, step):
            window = window.astype(np.float32)
            ort_outs = np.exp(self.session.run(None, {"input": window[None, None, :]})[0][0])
            ort_outs = np.concatenate(
                (
                    1 - ort_outs[:, :1],  # speech probabilities
                    self.reorder(
                        overlap_chunk[:, 1 : self.num_classes],
                        ort_outs[:, 1 : self.num_classes],
                    ),  # speaker probabilities
                ),
                axis=1,
            )
            if len(results) > 0:
                ort_outs[:overlap, :] = (ort_outs[:overlap, :] + overlap_chunk) / 2
            overlap_chunk = ort_outs[-overlap:, :]
            results.extend(ort_outs[:-overlap])

        return np.array(results)

    def get_speech_segments_with_speakers(self, results, threshold=0.5, min_speech_duration_ms=100):
        speech_prob = results[:, 0]
        speaker_probs = results[:, 1:]
        segments = []
        in_speech = False
        start = 0
        
        # First, determine active speakers
        speech_duration = np.sum(speaker_probs > threshold, axis=0)
        speech_duration_ms = self.frame2sample(speech_duration) * 1000 / self.vad_sr
        active_speakers = np.where(speech_duration_ms > min_speech_duration_ms)[0]
        
        for i, (speech, speakers) in enumerate(zip(speech_prob, speaker_probs)):
            if not in_speech and speech >= threshold:
                start = i
                in_speech = True
            elif in_speech and speech < threshold:
                speaker_index = np.argmax(np.mean(speaker_probs[start:i], axis=0))
                if speaker_index in active_speakers:
                    speaker = f'speaker{np.where(active_speakers == speaker_index)[0][0] + 1}'
                    segments.append({
                        'start': self.frame2sample(start) / self.vad_sr,
                        'end': self.frame2sample(i) / self.vad_sr,
                        'speaker': speaker
                    })
                in_speech = False
        if in_speech:
            speaker_index = np.argmax(np.mean(speaker_probs[start:], axis=0))
            if speaker_index in active_speakers:
                speaker = f'speaker{np.where(active_speakers == speaker_index)[0][0] + 1}'
                segments.append({
                    'start': self.frame2sample(start) / self.vad_sr,
                    'end': self.frame2sample(len(speech_prob)) / self.vad_sr,
                    'speaker': speaker
                })
        return segments, len(active_speakers)

    def get_num_speakers(self, results, threshold=0.5, min_speech_duration_ms=100):
        speaker_probs = results[:, 1:]
        speech_duration = np.sum(speaker_probs > threshold, axis=0)
        speech_duration_ms = self.frame2sample(speech_duration) * 1000 / self.vad_sr
        return np.sum(speech_duration_ms > min_speech_duration_ms)

if __name__ == "__main__":
    model_path = "segmentation-3.0.onnx"
    audio_path = "test_16k.wav"

    diarizer = MinimalSpeakerDiarization(model_path)
    results = diarizer.process_audio(audio_path)

    speech_segments, num_speakers = diarizer.get_speech_segments_with_speakers(results)
    print("Speech segments with speakers:")
    for segment in speech_segments:
        print(f"Start: {segment['start']:.2f}s, End: {segment['end']:.2f}s, Speaker: {segment['speaker']}")

    print(f"Number of speakers detected: {num_speakers}")`
    
    
    note model: https://github.com/pengzhendong/pyannote-onnx/blob/master/pyannote_onnx/segmentation-3.0.onnx

@thewh1teagle
Owner Author

thewh1teagle commented Jul 5, 2024

@altunenes

Thanks for helping :)
I got another idea that may be easier to start with.
We can get word-level timestamps from whisper for each word (using max_len=1 and split_on_word=true).
After getting the segments, we can go through them; each one has a start and stop timestamp. Then we can run a speaker embedding model such as spkrec-ecapa-voxceleb with the ONNX runtime (pykeio/ort), and this way we'll have a speaker label for each word segment.
Then we can easily reconstruct the sentences from it.
We don't even need VAD (voice activity detection) or segmenting the audio; whisper does the heavy lifting already.
The downside is that we'll run the model on each word instead of an entire segment, which is less efficient.
What do you think?
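A minimal Python sketch of that idea (assuming a hypothetical embed() function that wraps a speaker-embedding ONNX model such as spkrec-ecapa-voxceleb, and whisper word segments with start/end times in seconds; the names and the 0.7 threshold are illustrative, not Vibe's actual code):

import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def label_words(words, wav, sr, embed, threshold=0.7):
    # words: list of {"start": float, "end": float, "text": str} (seconds)
    # wav:   mono float32 numpy waveform at sample rate sr
    # embed: hypothetical function mapping a waveform slice to an embedding vector
    centroids = []  # one running centroid per discovered speaker
    labeled = []
    for w in words:
        chunk = wav[int(w["start"] * sr): int(w["end"] * sr)]
        if len(chunk) == 0:
            labeled.append({**w, "speaker": None})
            continue
        emb = embed(chunk)
        # Greedy online clustering: attach the word to the most similar
        # existing speaker, or open a new speaker if nothing is close enough.
        sims = [cosine(emb, c) for c in centroids]
        if sims and max(sims) >= threshold:
            idx = int(np.argmax(sims))
            centroids[idx] = (centroids[idx] + emb) / 2  # drift the centroid
        else:
            centroids.append(emb)
            idx = len(centroids) - 1
        labeled.append({**w, "speaker": f"speaker{idx + 1}"})
    return labeled

Consecutive words that end up with the same label can then be joined back into sentences; the similarity threshold would need tuning per embedding model.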

@altunenes

@altunenes

Very creative!! this probably provides more accurate diarization across various languages and word lengths.

@thewh1teagle
Owner Author

Great :)
I started working on https://github.com/thewh1teagle/sherpa-rs to replicate https://github.com/k2-fsa/sherpa-onnx/blob/master/python-api-examples/speaker-identification.py

@thewh1teagle
Owner Author

I implemented diarization in sherpa-rs/examples/diarize.rs and added it to the Vibe source code.
It works pretty well and fast.
The only missing piece is that sherpa has a small issue with the VAD model (k2-fsa/sherpa-onnx#1084) where speech is sometimes not detected.
Once it's fixed, I'll change the whisper logic to transcribe only the diarized parts, and it will work in Vibe.

@altunenes

Nice! And thank you for your contributions. Maybe we should continue the discussion in sherpa-rs...

@csukuangfj

The only missing piece is that sherpa has a small issue with the VAD model (k2-fsa/sherpa-onnx#1084) where speech is sometimes not detected.

It is fixed in k2-fsa/sherpa-onnx#1099

@atsalyuk

atsalyuk commented Aug 1, 2024

@thewh1teagle any chance this will be added soon? Hoping to use this app for a project I'm working on instead of my current workflow. Thanks!

@thewh1teagle
Owner Author

Thanks for the interest :)
It's a very hard feature to add.
But in short, it works like this:

  1. We enable diarization from the advanced options in Vibe (already added). Once it's clicked, it asks to download the required models.
  2. When diarization is enabled, we tell whisper to enable word_timestamps, meaning every word gets a timestamp.
  3. After the transcription completes, we start diarizing the audio: we find exactly when there is speech, and then identify who is speaking in each of those regions.
  4. With that information, we know for each word whisper transcribed who spoke it (a rough sketch of steps 3-4 follows below).
  5. To detect speech I used Silero VAD in sherpa-rs, but it doesn't work well; to identify who is speaking I used nemo_en_speakerverification_speakernet in sherpa-rs, which works OK.
  6. I also integrated a custom whisper.cpp so the whisper timestamps are accurate. [Feature Request]: Mark pauses #152

I think the best chance is to add pyannote to sherpa-onnx: k2-fsa/sherpa-onnx#1197
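A rough Python sketch of steps 3-4 above (illustrative only, not the actual Vibe or sherpa-rs code): each whisper word is matched to the diarized segment it overlaps the most.

def overlap(a_start, a_end, b_start, b_end):
    # Length (in seconds) of the intersection of two time intervals.
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(words, diarized):
    # words:    [{"start", "end", "text"}] from whisper word timestamps
    # diarized: [{"start", "end", "speaker"}] from the diarization step
    # Returns the words with a "speaker" key (None if no segment overlaps).
    out = []
    for w in words:
        best_speaker, best_overlap = None, 0.0
        for seg in diarized:
            ov = overlap(w["start"], w["end"], seg["start"], seg["end"])
            if ov > best_overlap:
                best_speaker, best_overlap = seg["speaker"], ov
        out.append({**w, "speaker": best_speaker})
    return out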

@atsalyuk

atsalyuk commented Aug 1, 2024

Ah, I see. Thank you! Appreciate the quick response.

@thewh1teagle
Owner Author

thewh1teagle commented Aug 6, 2024

Some updates:
I created a simple diarization solution in pyannote-rs
and even added it to Vibe in another branch. It's accurate and also makes the transcription much more accurate.
The only issue is that it makes the transcription slower, since whisper is optimized for 30-second chunks but speech segments are often shorter.

On macOS with the medium model, a 40-second audio file takes 7 s normally and 15 s with diarization.
We could also just feed whisper big chunks as usual and take the timestamps from it, but its timestamps aren't accurate.

The diarization itself is fast: roughly 30 s for 1 hour of audio.

Todo: download the models instead of embedding them into the exe, to keep the exe lightweight.
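One possible way to reduce the slowdown (just a sketch of the idea, not what Vibe currently does): merge consecutive same-speaker diarized segments into chunks closer to whisper's 30-second window before transcribing, so whisper processes fewer, longer windows.

def merge_for_whisper(segments, max_len=30.0, max_gap=1.0):
    # segments: [{"start", "end", "speaker"}] sorted by start time, in seconds.
    # Merge consecutive same-speaker segments while the merged chunk stays under
    # max_len seconds and the silence between them is shorter than max_gap seconds.
    chunks = []
    for seg in segments:
        if (chunks
                and chunks[-1]["speaker"] == seg["speaker"]
                and seg["start"] - chunks[-1]["end"] <= max_gap
                and seg["end"] - chunks[-1]["start"] <= max_len):
            chunks[-1]["end"] = seg["end"]  # extend the current chunk
        else:
            chunks.append(dict(seg))        # start a new chunk
    return chunks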

@thewh1teagle
Owner Author

thewh1teagle commented Aug 8, 2024

Speaker diarization is released! (Beta)
You can try it here: https://github.com/thewh1teagle/vibe/releases/tag/v2.4.0-beta.0

A few things about it:

  1. Enable it through the transcription options in the main window.
  2. It's recommended to run the tiny model instead of the medium one; diarization makes the transcription slower.
  3. It's recommended to set the max number of speakers in the transcription settings.

@altunenes

exciting news!

@oleole39
Contributor

oleole39 commented Aug 8, 2024

Interesting, but FYI I can't run it on an Ubuntu 22.04-based OS: #207

@thewh1teagle
Owner Author

thewh1teagle commented Aug 8, 2024

I just released a stable release, including builds for 22.04:
https://github.com/thewh1teagle/vibe/releases/tag/v2.4.0

By the way, on Linux I strongly recommend using the tiny model for speed.
See the pre-built section in the latest release.

@altunenes

altunenes commented Aug 8, 2024

The tiny model is nice, especially in terms of speed, but in terms of transcription accuracy I found the sherpa versions of the models nicer, in my tests at least. :)

Note: maybe I should play with the params more...

@oleole39
Contributor

oleole39 commented Aug 9, 2024

I just released a stable release, including builds for 22.04

Thanks, now it works.

the tiny model is nice, especially in terms of speed, but in terms of transcription accuracy

Same here, the tiny model gives results that are too inaccurate to be practical, even with a higher temperature than the default.

By the way, on Linux I strongly recommend using the tiny model for speed

Fortunately, the medium model also works on Linux with diarization, even if it's slow.

Diarization works well! However, in the context of a podcast where the host talks for a while and then interviews other people (I tested with only 2 speakers in total), it behaves like this (using the text format for the output):

Speaker 1:
blablablabla

Speaker 1:
again blablablabla

Speaker 1: 
blablablabla ?

Speaker 2: 
Yes blablalbalba

Speaker 2: 
And blablablalba

To my mind, it should ideally not split the successive content of the same speaker into several labels, but rather into several paragraphs under a single label, i.e.

Speaker 1:
blablablabla
again blablablabla
blablablabla ?

Speaker 2: 
Yes blablalbalba 
And blablablalba
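The grouping described above could be handled entirely in the output formatter. A small illustrative sketch (not Vibe's actual formatter): emit one speaker label per run of consecutive segments from the same speaker.

def format_transcript(segments):
    # segments: [{"speaker": "Speaker 1", "text": "..."}] in chronological order.
    # Emit the label only when the speaker changes between consecutive segments.
    lines = []
    previous = None
    for seg in segments:
        if seg["speaker"] != previous:
            lines.append(f"\n{seg['speaker']}:")
            previous = seg["speaker"]
        lines.append(seg["text"])
    return "\n".join(lines).strip()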

@thewh1teagle
Owner Author

However, in the context of a podcast where the host talks for a while and then interviews other people, it behaves like this:

Could you please open a separate issue for this?

Phew, that was a huge feature! Now that Vibe supports diarization, we can close this issue :)

@altunenes

And also in Rust!
Congrats!!!!
