
Conversation

@j-silv commented Sep 8, 2025

Concerns issues #915, #1177, #1264, #1192, #840, #1330, #59, #66

This work continues where PR #1302 left off. The goal is to transcribe multiple audio files truly in parallel and increase GPU throughput.

This PR is not done yet, but I wanted to give a preview and discuss some of the choices I have made so far.

Here's an overview of what I've done so far:

  1. Batch a list of files (batch_audio_files()) and pad each decoded audio to the max length within that batch (a rough sketch follows this list). Users can also pass in an already batched numpy array, where the first dimension is assumed to index the individual audio files.

  2. I modified the transcribe function in BatchedInferencePipeline to support batching across multiple files, while still allowing batching within an individual large file. This was a little tricky, but essentially we keep track of how many chunks correspond to each audio file within a batch (I added a num_chunks member to TranscriptionInfo), then flatten everything by stacking the features on top of each other. Finally, we pass it into _batched_segments_generator, which doesn't have to change since it already supports batching.

  3. In addition to performing parallel transcription, I also set up batched VAD. The post-inference algorithm is run for each individual audio file within the batch and the results are appended together.
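To make item 1 concrete, here is a rough sketch of the padding step; the helper name, signature, and dtype are assumptions based on the description above rather than the exact code in this PR.

```python
import numpy as np

def batch_audio_files(audios: list[np.ndarray]) -> np.ndarray:
    """Pad each decoded waveform to the longest in the batch and stack them.

    Each element of `audios` is a 1-D float32 array of decoded samples; the
    result has shape (num_files, max_len), matching the convention that the
    first dimension indexes the individual audio files.
    """
    max_len = max(audio.shape[0] for audio in audios)
    batch = np.zeros((len(audios), max_len), dtype=np.float32)
    for i, audio in enumerate(audios):
        batch[i, : audio.shape[0]] = audio
    return batch
```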

Some of the decisions I took:

  1. The issues I linked describe workarounds for handling multiple audio files, e.g. chunking the decoded audio, padding it, and then manually passing in clip or VAD parameters. The main downsides are that running VAD on the individual audio clips within a batch takes considerably more effort, and that burden falls entirely on the user.

  2. Right now, the user is responsible for regrouping the flattened segments array so that the transcriptions for each audio file are independent. To make this possible, I return a list of generators (one for each batch) and use num_chunks to know when to stop processing segments for a particular audio file (a regrouping sketch follows this list). See test_transcribe::test_batched_transcribe_many for more details. I wasn't sure of the best way to approach this, since the original code yields a single segment at a time; happy to discuss if there are better methods. Ideally the user should not have to worry about regrouping a flat segment array...

  3. It is assumed that all audio files are in the same language. In theory, batching different languages could be supported by instantiating a new Tokenizer within the batch loop with a language indexed from an array (provided by the user or detected automatically); I'm just not sure how common that use case is.
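As a hedged illustration of the regrouping currently left to the caller in decision 2, here is one way it could look. It assumes a flat segment iterator returned by transcribe() and per-file segment counts derived from num_chunks; both are assumptions based on the description above rather than a finalized API.

```python
from itertools import islice

def regroup_segments(segments, segments_per_file):
    """Split a flat stream of segments back into one list per audio file.

    `segments` is the flattened segment iterator and `segments_per_file[i]`
    is how many of those segments belong to the i-th audio file in the batch
    (assumed here to be derivable from info.num_chunks).
    """
    segments = iter(segments)
    return [list(islice(segments, n)) for n in segments_per_file]
```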

Some pending todos:

  1. I have only handled batch transcription when the language is specified, VAD is enabled, and clip_timestamps is not provided. I still have to make some modifications so that the other conditional branches in transcribe work (multiple tests are failing because of this).

  2. I have not exhaustively tested the batching to make sure the results are as expected. One additional test to run is to batch-transcribe two audio files and compare the results against calling normal transcribe twice, once per file (a rough sketch of such a test follows this list).

  3. Run performance tests on GPU to make sure batching results in a speed-up (and that utilization goes up as expected).

  4. Documentation needs to be written up on how to use this new batch mode.
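Here is a rough sketch of the consistency test proposed in todo 2. The model size, file paths, and especially the return shape of the batched call (a list of generators, per decision 2 above) are assumptions and may not match the final API.

```python
from faster_whisper import BatchedInferencePipeline, WhisperModel

def test_batched_matches_sequential(audio_a="a.wav", audio_b="b.wav"):
    model = WhisperModel("tiny")
    pipeline = BatchedInferencePipeline(model)

    # Reference: transcribe each file separately.
    expected = []
    for path in (audio_a, audio_b):
        segments, _ = pipeline.transcribe(path, language="en")
        expected.append([segment.text for segment in segments])

    # Batched: transcribe both files in one call (assumed interface).
    generators, info = pipeline.transcribe([audio_a, audio_b], language="en")
    actual = [[segment.text for segment in gen] for gen in generators]

    assert actual == expected
```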

Please share any concerns or suggestions you have!

@MahmoudAshraf97 (Collaborator)

Hi, and thanks for the effort. I guess adding multiple-audio batching to BatchedInferencePipeline will have minimal benefit, since it already has very good utilization. I suggest focusing only on the regular transcription, since it under-utilizes the GPU; there is an implementation for that in the transformers library that you can transfer here with minimal modifications to save effort.

@j-silv (Author) commented Sep 10, 2025

Hi, and thanks for the effort. I guess adding multiple-audio batching to BatchedInferencePipeline will have minimal benefit, since it already has very good utilization. I suggest focusing only on the regular transcription, since it under-utilizes the GPU; there is an implementation for that in the transformers library that you can transfer here with minimal modifications to save effort.

Would you mind linking to the implementation you had in mind? I get a little lost navigating the transformers library.

Also, if I make changes to the regular transcription, then I would only be supporting batching for cases 3 and 4 below, correct?

  1. x1 where x1 is an audio clip < 30 seconds
  2. x1 where x1 is an audio clip > 30 seconds
  3. [x1, x2, x3] where x1, x2, and x3 are audio clips < 30 seconds
  4. [x1, x2, x3] where x1 is an audio clip > 30 seconds, and x2 and x3 are audio clips < 30 seconds

In case 2, for example, I believe that the normal transcription calls whisper in 30-second chunks sequentially. So I guess in case 4 we would have a sliding window of 30 seconds across a batch of size N, where N corresponds to the number of audio files? (See the sketch at the end of this comment.)

EDIT: Also I think the main benefit this PR addresses in the BatchedInferencePipeline is for case 3. If the user wants to transcribe a bunch of audio files < 30 seconds then they could get processed in parallel using the modifications I proposed. The batched algorithm that exists now only batches audio files > 30 seconds, as I understand it.
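For illustration, here is a minimal sketch of that sliding-window idea. None of these names exist in the codebase, and the fixed 30-second advance is a simplification (actual Whisper decoding seeks based on the last predicted timestamp).

```python
import numpy as np

SAMPLE_RATE = 16000
WINDOW = 30 * SAMPLE_RATE  # 30-second window in samples

def sliding_window_batches(audios: list[np.ndarray]):
    """Yield (num_files, WINDOW) batches until every audio file is exhausted."""
    offsets = [0] * len(audios)
    while any(offset < len(audio) for offset, audio in zip(offsets, audios)):
        windows = []
        for i, audio in enumerate(audios):
            chunk = audio[offsets[i]: offsets[i] + WINDOW]
            # Zero-pad finished or short files so the batch stays rectangular.
            windows.append(np.pad(chunk, (0, WINDOW - len(chunk))))
            offsets[i] += WINDOW  # simplification: a real decoder seeks by timestamps
        yield np.stack(windows)
```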

@MahmoudAshraf97 (Collaborator)

I'll explain the current approaches for batching in both short-form (<30 s) and long-form (>30 s) transcription:

  1. Multiple short-form files can already be batched without VAD using the batched pipeline, so there is no need to work on that case in this PR; leave it for later.
  2. Multiple long files
    a. Using batched inference: there is no gain here, so I wouldn't be concerned with this case either.
    b. Using sequential inference: this is where the biggest speedups are, and it should be the sole focus of this PR. You should use this as a reference: https://github.com/huggingface/transformers/blob/7d57b31e16d5a08fda29578c31fc1924fefc50b2/src/transformers/models/whisper/generation_whisper.py#L796
