Batched inference on multiple audios #1177

Open
Deep-unlearning opened this issue Nov 27, 2024 · 4 comments

@Deep-unlearning

Hello,

We are currently looking to add faster-whisper to the Open ASR Leaderboard.

Here is the script we are using to run the evals: https://github.com/huggingface/open_asr_leaderboard/tree/main/ctranslate2

We noticed that it does not natively support batched inference on multiple audios, which makes the RTFx significantly lower than the original Whisper, which is evaluated with a batch_size of 64.

Is there something that we are doing wrong?

Thanks

@Vaibhavs10

cc: @MahmoudAshraf97 maybe - would be great to get another set of eyes on the eval scripts.

@MahmoudAshraf97
Collaborator

Unfortunately it doesn't yet; the batching we support is on a single file that is segmented using VAD, and AFAIK the Open ASR Leaderboard discourages the use of VAD.
If you are willing to monkey-patch, CT2 inference can be inserted into any library that already supports multiple files, such as transformers or the original Whisper implementation, so that the heavy lifting is done by CT2 while the pre/post-processing logic is handled by the other library.

@Vaibhavs10

Hi @MahmoudAshraf97 - thanks for the response. Do you have sample code showing how this would look with transformers? Or can you please point us to the right documentation? 🙏

@MahmoudAshraf97
Collaborator

You'll have to replace this with:

# Batched CT2 generation: encoder_output and prompts are batched across audios.
results = self.model.model.generate(
encoder_output,
prompts,
beam_size=options.beam_size,
patience=options.patience,
length_penalty=options.length_penalty,
max_length=max_length,
suppress_blank=options.suppress_blank,
suppress_tokens=options.suppress_tokens,
return_scores=True,
return_no_speech_prob=True,
sampling_temperature=options.temperatures[0],
repetition_penalty=options.repetition_penalty,
no_repeat_ngram_size=options.no_repeat_ngram_size,
)
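
For reference, here is a rough standalone sketch of the same idea outside the leaderboard harness, assuming a CT2-converted checkpoint in ./whisper-large-v3-ct2 and transformers' WhisperProcessor for feature extraction (the paths and the transcribe_batch helper below are just placeholders, not part of either library):

import ctranslate2
import numpy as np
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")
model = ctranslate2.models.Whisper("./whisper-large-v3-ct2", device="cuda")

def transcribe_batch(audios, sampling_rate=16000):
    # Log-mel features for all clips at once (padded/truncated to 30 s windows).
    features = processor(audios, sampling_rate=sampling_rate,
                         return_tensors="np").input_features
    features = ctranslate2.StorageView.from_array(features.astype(np.float32))

    # Same prompt for every item so <|startoftranscript|> sits at the same
    # position across the whole batch (see the tips below).
    prompt = processor.tokenizer.convert_tokens_to_ids(
        ["<|startoftranscript|>", "<|en|>", "<|transcribe|>", "<|notimestamps|>"]
    )
    results = model.generate(features, [prompt] * len(audios), beam_size=5)
    return [processor.tokenizer.decode(r.sequences_ids[0], skip_special_tokens=True)
            for r in results]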

Just some quick tips:

  1. CT2 has a limitation where the <|startoftranscript|> token must be in the same position for the whole batch, and padding is not officially supported yet (there is a PR for that if you are willing to use a custom build), so you will have to turn off condition_on_previous_text.
  2. CT2 is extremely slow when batched and the <|notimestamps|> token is not present, which I think is a requirement for accurate long-form generation. I don't know what the solution for that is, but keep it in mind.
  3. If you want to test batching without VAD, I suggest you use the HF long-form chunking implementation, which is engaged when chunk_length_s is not None in the pipeline class; a minimal example follows below.
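
For example, a minimal chunked-pipeline call of that kind (the model id, chunk length, and batch size are placeholders; adjust them to your setup):

from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    device=0,
)

# chunk_length_s engages the chunked long-form algorithm; batch_size batches
# the resulting 30 s chunks (and multiple files) through the model together.
outputs = asr(["clip1.wav", "clip2.wav"], chunk_length_s=30, batch_size=16)
print([o["text"] for o in outputs])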
