Batched inference on multiple audios #1177

Open
Deep-unlearning opened this issue Nov 27, 2024 · 4 comments

@Deep-unlearning

Hello,

We are currently looking to add faster-whisper to the Open ASR Leaderboard.

Here is the script we are using to run the evals: https://github.com/huggingface/open_asr_leaderboard/tree/main/ctranslate2

We noticed that it does not natively support batched inference on multiple audios, which makes the RTFx significantly lower than the original Whisper, which is evaluated with a batch_size of 64.

Is there something that we are doing wrong?

Thanks

@Vaibhavs10

cc: @MahmoudAshraf97 maybe - would be great to get another set of eyes on the eval scripts.

@MahmoudAshraf97
Collaborator

Unfortunately it doesn't yet; the batching we support is on a single file that is segmented using VAD, and AFAIK the Open ASR Leaderboard discourages the use of VAD.
If you are willing to monkey-patch, CT2 inference can be inserted into any library that already supports multiple files, such as transformers or the original Whisper implementation, so that the heavy lifting is done by CT2 while the pre/post-processing logic is handled by the other library.

@Vaibhavs10

Hi @MahmoudAshraf97 - thanks for the response. Do you have sample code showing how this would look with transformers? Or can you please point us to the right documentation? 🙏

@MahmoudAshraf97
Collaborator

You'll have to replace this with:

# Batched CT2 generation: encoder_output and prompts are batched across audios.
results = self.model.model.generate(
encoder_output,
prompts,
beam_size=options.beam_size,
patience=options.patience,
length_penalty=options.length_penalty,
max_length=max_length,
suppress_blank=options.suppress_blank,
suppress_tokens=options.suppress_tokens,
return_scores=True,
return_no_speech_prob=True,
sampling_temperature=options.temperatures[0],
repetition_penalty=options.repetition_penalty,
no_repeat_ngram_size=options.no_repeat_ngram_size,
)
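
For reference, here is a rough standalone sketch of the same idea outside the leaderboard harness, assuming a CT2-converted checkpoint in ./whisper-large-v3-ct2 and transformers' WhisperProcessor for feature extraction (the paths and the transcribe_batch helper below are just placeholders, not part of either library):

import ctranslate2
import numpy as np
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")
model = ctranslate2.models.Whisper("./whisper-large-v3-ct2", device="cuda")

def transcribe_batch(audios, sampling_rate=16000):
    # Log-mel features for all clips at once (padded/truncated to 30 s windows).
    features = processor(audios, sampling_rate=sampling_rate,
                         return_tensors="np").input_features
    features = ctranslate2.StorageView.from_array(features.astype(np.float32))

    # Same prompt for every item so <|startoftranscript|> sits at the same
    # position across the whole batch (see the tips below).
    prompt = processor.tokenizer.convert_tokens_to_ids(
        ["<|startoftranscript|>", "<|en|>", "<|transcribe|>", "<|notimestamps|>"]
    )
    results = model.generate(features, [prompt] * len(audios), beam_size=5)
    return [processor.tokenizer.decode(r.sequences_ids[0], skip_special_tokens=True)
            for r in results]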

Just some quick tips:

  1. CT2 has a limitation where the <|startoftranscript|> token must be in the same position for the whole batch, and padding is not officially supported yet (there is a PR for that if you are willing to use a custom build), so you will have to turn off condition_on_previous_text.
  2. CT2 is extremely slow when batched and the <|notimestamps|> token is not present, which I think is a requirement for accurate long-form generation. I don't know what the solution for that is, but keep it in mind.
  3. If you want to test batching without VAD, I suggest you use the HF long-form chunking implementation, which is engaged when chunk_length_s is not None in the pipeline class; a minimal example follows below.
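
For example, a minimal chunked-pipeline call of that kind (the model id, chunk length, and batch size are placeholders; adjust them to your setup):

from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    device=0,
)

# chunk_length_s engages the chunked long-form algorithm; batch_size batches
# the resulting 30 s chunks (and multiple files) through the model together.
outputs = asr(["clip1.wav", "clip2.wav"], chunk_length_s=30, batch_size=16)
print([o["text"] for o in outputs])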
