Work in progress for batching with multiple audio files [WIP] #1359
Concerns issues #915, #1177, #1264, #1192, #840, #1330, #59, #66
This work continues where PR #1302 left off. The goal is to transcribe multiple audio files truly in parallel and increase GPU throughput.
This PR is not done yet, but I wanted to give a preview and discuss some of the choices I have made so far.
Here's an overview of what I've done so far:

- Batch a list of files (`batch_audio_files()`) and pad each decoded audio to the max length within that batch (see the sketch after this list). Users can also just pass in an already batched numpy array, where it's assumed that the first dimension corresponds to different audio files.
- I modified the `transcribe` function in `BatchedInferencePipeline` to support batching across multiple files, while still enabling batching across an individual large file. This was a little tricky, but essentially we keep track of how many chunks correspond to each audio file within a batch (added a `num_chunks` data member to `TranscriptionInfo`). Then we flatten everything by stacking the features on top of each other. Finally, we pass the result to `_batched_segments_generator`, which doesn't have to change since it already supports batching.
- In addition to performing parallel transcription, I also set up batched VAD. The post-inference algorithm is run for each individual audio file within the batch and the results are appended together.
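To make the padding step concrete, here is a minimal sketch of the behavior described above; `pad_and_stack` is a hypothetical helper name for illustration, not the actual code in this PR:

```python
import numpy as np

def pad_and_stack(audios: list[np.ndarray]) -> np.ndarray:
    """Hypothetical sketch: zero-pad each decoded audio (1-D float array)
    to the longest clip in the batch, then stack so that axis 0 indexes
    the individual audio files, matching the batched-array input format."""
    max_len = max(audio.shape[0] for audio in audios)
    padded = [np.pad(audio, (0, max_len - audio.shape[0])) for audio in audios]
    return np.stack(padded)  # shape: (num_files, max_len)
```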
Some of the decisions I made:

- The issues linked above describe some workarounds for supporting multiple audio files, e.g. chunking the decoded audio, padding it, and then manually passing in clip or VAD parameters. I think the main downside is that you'd have to do a lot more work to run VAD on the individual audio clips in a batch, and it puts much more work on the user.
- Right now, the user is responsible for regrouping the flattened segments array so that the transcriptions for each audio file are independent. To enable this, I return a list of generators (one for each batch) and use `num_chunks` to know when to stop processing segments for a particular audio file; a rough sketch follows this list. See `test_transcribe::test_batched_transcribe_many` for more info. I wasn't sure of the best way to approach this, since the original code yields a single segment at a time. Happy to discuss if there are better methods; ideally the user should not have to worry about regrouping a flat segment array.
- It is assumed that all audio files are in the same language. In theory you could support batching different languages by instantiating a new `Tokenizer` within the batch loop, with the language indexed from an array (provided by the user or detected automatically). I'm just not sure how common that use case is.
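As a rough sketch of the regrouping logic referenced above, assuming the flattened stream yields one item per chunk so that `num_chunks[i]` items belong to file `i` (`regroup_segments` is a hypothetical name, not part of this PR):

```python
from itertools import islice

def regroup_segments(flat_segments, num_chunks):
    """Hypothetical sketch: split a flat stream of chunk results back into
    one list per audio file, consuming num_chunks[i] items for file i."""
    it = iter(flat_segments)
    return [list(islice(it, n)) for n in num_chunks]

# Assumed usage: one generator covers a whole batch, and
# info.num_chunks says how many chunks belong to each file.
# per_file = regroup_segments(batch_generator, info.num_chunks)
```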
Some pending todos:

- I have only handled batch transcription when the language is specified, VAD is enabled, and `clip_timestamps` is not provided. I still need to make modifications so that the other conditional branches in `transcribe` work (multiple tests are failing because of this).
- I have not exhaustively tested the batching to make sure the results are as expected. One additional test worth running is to batch-transcribe two audio files and compare the results with calling normal transcribe twice, once for each file separately (a possible sketch follows this list).
- Run performance tests on GPU to make sure batching results in a speed-up (and that utilization goes up as expected).
- Documentation needs to be written on how to use this new batch mode.
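For the consistency check in the second todo, something along these lines might work; the batched call signature, return shape, and fixture names are assumptions based on this description, not a finalized API:

```python
def test_batched_matches_sequential(pipeline, tmp_audio_files):
    """Sketch: batch-transcribing two files should yield the same text
    as transcribing each file separately with identical options.
    `pipeline` is a BatchedInferencePipeline; `tmp_audio_files` is a
    hypothetical fixture providing two audio file paths."""
    # Assumed: passing a list of files returns one segment generator
    # per file (after any regrouping), plus a TranscriptionInfo.
    batched_generators, _ = pipeline.transcribe(tmp_audio_files, language="en")
    batched_texts = ["".join(seg.text for seg in gen) for gen in batched_generators]

    # Reference: transcribe each file on its own and compare the text.
    for path, batched_text in zip(tmp_audio_files, batched_texts):
        segments, _ = pipeline.transcribe(path, language="en")
        assert "".join(seg.text for seg in segments) == batched_text
```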
Please share any concerns or suggestions you have!