Does Pywhispercpp support batching and what gives if not? #56

Open
BBC-Esq opened this issue Sep 6, 2024 · 11 comments

Comments


BBC-Esq commented Sep 6, 2024

See here... start thinking about true batching. 😉

shashikg/WhisperS2T#33

@BBC-Esq BBC-Esq changed the title Does Whisper.cpp support batching and what gives if not? Does Pywhispercpp support batching and what gives if not? Sep 6, 2024
@abdeladim-s
Owner

@BBC-Esq, are you talking about batch decoding?
If whisper.cpp supports it, then I believe it will be supported here as well.


UsernamesLame commented Sep 7, 2024

@BBC-Esq, are you talking about batch decoding? If whisper.cpp supports it, then I believe it will be supported here as well.

I think he means batch prepping?

Edit:

Nope, batch transcribing!


UsernamesLame commented Sep 7, 2024

cough

import os
from pywhispercpp.model import Model
import multiprocessing
from glob import glob

files = [f for f in glob("*") if os.path.isfile(f) and not f.endswith((".py"))]

def transcribeFile(file, queue):
    model = Model("base")
    segments = model.transcribe(file)
    queue.put([file, segments])
    return True

if __name__ == "__main__":
    queue = multiprocessing.Queue()
    processes = []
    for file in files:
        process = multiprocessing.Process(target=transcribeFile, args=(file, queue))
        processes.append(process)
    
    for process in processes:
        process.start()

    for transcriptions in iter(queue.get, None):
        print(transcriptions)

@BBC-Esq @abdeladim-s here's some simple code that batch-processes files with multiple independent whisper instances, so no context is ever shared between them under any circumstances.

It doesn't run them in parallel as of now because of the join call, but that's fine. It's still batching. I'll clean this up later.

Cleaned it up and fixed it to run in parallel. And oh boy, is it a CPU killer.


UsernamesLame commented Sep 7, 2024


So a quick heads up: it is painfully slow to do this in parallel. Like dog slow, and the more files you throw at it, the slower it gets. But this is just POC code. There's room for improvement, such as batching based on file length, file size, core counts, etc.

I'll see if I can beat a few optimizations out of this.

Edit:

I completely forgot that with multiprocessing, queues must be emptied before the main process can finish as they hold open pipes.

    while not queue.empty():
        print(queue.get())

Quick fix over iter: `iter(queue.get, None)` blocks forever because nothing ever puts the `None` sentinel on the queue, while the `while not queue.empty()` loop drains it and exits.
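For reference, the sentinel pattern that the `iter(queue.get, None)` idiom expects looks like this. The worker below is a stand-in (it just doubles a number) rather than a real transcribe call:

```python
import multiprocessing

def worker(item, queue):
    # stand-in for the real per-file transcription work
    queue.put(item * 2)

if __name__ == "__main__":
    queue = multiprocessing.Queue()
    processes = [multiprocessing.Process(target=worker, args=(i, queue))
                 for i in [1, 2, 3]]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    queue.put(None)  # the sentinel: tells the consumer no more results are coming
    # iter(queue.get, None) now terminates instead of blocking forever
    print(sorted(iter(queue.get, None)))  # [2, 4, 6]
```

The key is that the sentinel is enqueued only after every worker has been joined, so the consumer drains all real results before hitting it.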

@UsernamesLame
Contributor

import os
from pywhispercpp.model import Model
import multiprocessing
from glob import glob

# every regular file in the current directory except Python sources
files = [f for f in glob("*") if os.path.isfile(f) and not f.endswith(".py")]

def transcribeFile(file, queue):
    # each process loads its own independent model instance
    model = Model("base")
    segments = model.transcribe(file)
    queue.put([file, segments])

if __name__ == "__main__":
    queue = multiprocessing.Queue()
    processes = []

    # one worker process per file
    for file in files:
        process = multiprocessing.Process(target=transcribeFile, args=(file, queue))
        processes.append(process)
        process.start()

    for process in processes:
        process.join()

    # drain the queue after all workers finish; a multiprocessing.Queue
    # must be emptied or its feeder pipes stay open
    while not queue.empty():
        print(queue.get())

This is where I'm at. It can queue up lots of files to process in parallel, but there's no limit on how many processes it spawns; that needs improvement. I also need to make it accept new files being added to its queue.
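One way to put a ceiling on the process count is `multiprocessing.Pool`, which reuses a fixed number of workers instead of spawning one per file. A sketch of that shape; `transcribe_stub` is a hypothetical placeholder, and in the real script its body would build `Model("base")` and call `transcribe(path)`:

```python
import multiprocessing
import os
from glob import glob

def transcribe_stub(path):
    # placeholder: the real worker would do
    #   model = Model("base"); segments = model.transcribe(path)
    return path, f"fake transcript of {path}"

if __name__ == "__main__":
    files = [f for f in glob("*") if os.path.isfile(f) and not f.endswith(".py")]
    # Pool caps concurrency at cpu_count() workers instead of one process per file
    with multiprocessing.Pool(processes=os.cpu_count()) as pool:
        for path, text in pool.imap_unordered(transcribe_stub, files):
            print(path, text)
```

`imap_unordered` yields results as workers finish, so fast files don't wait behind slow ones, and the queue-draining problem disappears because the pool handles result plumbing itself.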

@UsernamesLame
Contributor

If you're after serial batch transcriptions:

from pywhispercpp.model import Model
import os
from glob import glob


if __name__ == "__main__":
    files = [file for file in glob("*")
             if os.path.isfile(file) and not file.endswith((".py", ".cfg", ".txt"))]

    # load the model once and reuse it for every file
    model = Model("base")

    for file in files:
        segments = model.transcribe(file)

        with open(f"{file}-transcription.txt", "w") as f:
            for segment in segments:
                f.write(segment.text)

@abdeladim-s
Owner

@UsernamesLame, that's multi-processing.
The scripts look like they work great 👍


BBC-Esq commented Sep 8, 2024

Unfortunately, as @abdeladim-s knows, I can't get pywhispercpp to even install correctly...

@UsernamesLame
Contributor

@UsernamesLame, that's multi-processing.

The scripts look like they work great 👍

It's "batch" processing 😅

@UsernamesLame
Contributor

Unfortunately, as @abdeladim-s knows, I can't get pywhispercpp to even install correctly...

Dump logs. Let's get this working.


BBC-Esq commented Sep 8, 2024

Logs dumped, and now I'm flushing the toilet. 😉 jk. Won't have time today as I'm working on the benchmarking repo for a bit... need to get an appropriate dataset and then learn/use the jiwer library? lol
