
Empty sequence when using faster-whisper's transcribe on fine-tuned model #1212

mlouala-dev opened this issue Dec 22, 2024 · 4 comments


@mlouala-dev

Hi,
I'm trying to use faster-whisper with a fine-tune of the new Whisper turbo model, openai/whisper-large-v3-turbo.

Faster-whisper library

When I try to run inference on my fine-tuned model with faster-whisper, after converting the model with this command line:

ct2-transformers-converter --model "mlouala/whisper-diin-v3" --output_dir "whisper-din-v3" --force  --copy_files tokenizer_config.json preprocessor_config.json --quantization int8

Then running this script :

from faster_whisper import WhisperModel
model_size = "/home/dev/whisper-din-v3"
model = WhisperModel(model_size, device="cuda")

segments, info = model.transcribe('foo.wav', beam_size=5)
for segment in segments:
    print(dict(start=segment.start, end=segment.end, text=segment.text))

I tested multiple quantizations (int8, int8_float32, int16) as well as no quantization at all, but it always returns an empty list of segments.
Nonetheless, it correctly detects the language and the audio's duration, as you can see in the TranscriptionInfo:

TranscriptionInfo(language='fr', language_probability=0.8290529251098633, duration=8.2, duration_after_vad=8.2, all_language_probs=[....], transcription_options=TranscriptionOptions(beam_size=5, best_of=5, patience=1, length_penalty=1, repetition_penalty=1, no_repeat_ngram_size=0, log_prob_threshold=-1.0, no_speech_threshold=0.6, compression_ratio_threshold=2.4, condition_on_previous_text=True, prompt_reset_on_temperature=0.5, temperatures=[0.0, 0.2, 0.4, 0.6, 0.8, 1.0], initial_prompt=None, prefix=None, suppress_blank=True, suppress_tokens=(1, 2, 7, 8, 9, 10, 14, 25, 26, 27, 28, 29, 31, 58, 59, 60, 61, 62, 63, 90, 91, 92, 93, 359, 503, 522, 542, 873, 893, 902, 918, 922, 931, 1350, 1853, 1982, 2460, 2627, 3246, 3253, 3268, 3536, 3846, 3961, 4183, 4667, 6585, 6647, 7273, 9061, 9383, 10428, 10929, 11938, 12033, 12331, 12562, 13793, 14157, 14635, 15265, 15618, 16553, 16604, 18362, 18956, 20075, 21675, 22520, 26130, 26161, 26435, 28279, 29464, 31650, 32302, 32470, 36865, 42863, 47425, 49870, 50254, 50258, 50358, 50359, 50360, 50361), without_timestamps=False, max_initial_timestamp=1.0, word_timestamps=False, prepend_punctuations='"\'“¿([{-', append_punctuations='"\'.。,,!!??::”)]}、', multilingual=False, max_new_tokens=None, clip_timestamps=[0.0], hallucination_silence_threshold=None, hotwords=None), vad_options=None)

Also, when I run the base turbo model converted with ct2-transformers-converter, it works fine.
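For reference, one thing worth ruling out (a debugging sketch on my side, not verified against this model) is faster-whisper's post-decoding filters, which can silently drop segments whose confidence is too low even though decoding itself succeeds:

```python
# Debugging sketch (assumption, untested on this fine-tuned model):
# transcribe() drops segments that fail these thresholds, which could explain
# an empty segment list even though language detection works fine.
# Setting them to None disables the filtering entirely.
debug_options = dict(
    log_prob_threshold=None,           # keep segments with low avg log-prob
    no_speech_threshold=None,          # never classify a segment as silence
    compression_ratio_threshold=None,  # keep highly repetitive segments
)
# segments, info = model.transcribe("foo.wav", **debug_options)
print(sorted(debug_options))
```

If segments appear with these settings, the decoder is producing output that the default thresholds filter out.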

Genuine Transformers library

But my model works fine with this simple code using the genuine Transformers library:

from transformers import pipeline
pipe = pipeline(model="mlouala/whisper-diin-v3")

def transcribe(audio):
    text = pipe(audio)["text"]
    return text
print(transcribe('foo.wav'))

Any clues?

@Purfview (Contributor)

segments, info = model.transcribe('foo.wav', condition_on_previous_text=False)

@mlouala-dev (Author) commented Dec 22, 2024

Thank you for your answer @Purfview, I tried your code:

segments, info = model.transcribe('foo.wav', condition_on_previous_text=False)

It still returns an empty sequence. I tried with different audio files, which all transcribe correctly with the Transformers library. I also tried the original CTranslate2 library with the following code:

import ctranslate2
import librosa
import transformers

audio, _ = librosa.load('foo.wav', sr=16000, mono=True)
processor = transformers.WhisperProcessor.from_pretrained("mlouala/whisper-diin-v3")
inputs = processor(audio, return_tensors='np', sampling_rate=16000)

features = ctranslate2.StorageView.from_array(inputs.input_features)
model = ctranslate2.models.Whisper("./whisper-din-v3", compute_type='int8')

prompt = processor.tokenizer.convert_tokens_to_ids(
    [
        "<|startoftranscript|>",
        '<|fr|>',
        '<|transcribe|>',
        '<|notimestamps|>',
    ]
)

results = model.generate(features, [prompt])
transcription = processor.decode(results[0].sequences_ids[0])
print(transcription)

And it returns the transcription correctly, but inference is really slow.
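Regarding the slow inference: ctranslate2.models.Whisper loads on CPU by default, so the long runtime above may simply be CPU decoding. A sketch of the GPU variant (assuming a CUDA device is available on the machine):

```python
# ctranslate2.models.Whisper defaults to device="cpu"; pointing it at the GPU
# (assumption: CUDA is available) should bring inference time in line with
# the faster-whisper runs, which were loaded with device="cuda".
gpu_kwargs = dict(device="cuda", compute_type="int8")
# model = ctranslate2.models.Whisper("./whisper-din-v3", **gpu_kwargs)
print(gpu_kwargs)
```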

@Purfview (Contributor)

Try this:

segments, info = model.transcribe('foo.wav', condition_on_previous_text=False, without_timestamps=True)

@mlouala-dev (Author)
Hi @Purfview, thank you, but it still returns an empty sequence.
By the way, I originally had an issue loading the model, like #582, and first tried to solve it with this line:

model.feature_extractor.mel_filters = model.feature_extractor.get_mel_filters(model.feature_extractor.sampling_rate, model.feature_extractor.n_fft, n_mels=128)

But then I got the following error when trying to run inference:

Traceback (most recent call last):
  File "/home/dev/sandbox_stt.py", line 44, in <module>
    segments, info = model.transcribe(AUDIO, condition_on_previous_text=False, without_timestamps=True)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/miniconda3/lib/python3.12/site-packages/faster_whisper/transcribe.py", line 887, in transcribe
    ) = self.detect_language(
        ^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/miniconda3/lib/python3.12/site-packages/faster_whisper/transcribe.py", line 1764, in detect_language
    encoder_output = self.encode(
                     ^^^^^^^^^^^^
  File "/home/dev/miniconda3/lib/python3.12/site-packages/faster_whisper/transcribe.py", line 1343, in encode
    features = get_ctranslate2_storage(features)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/miniconda3/lib/python3.12/site-packages/faster_whisper/transcribe.py", line 1820, in get_ctranslate2_storage
    segment = ctranslate2.StorageView.from_array(segment)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: Unsupported type: <f8

I finally solved this error by adding this option to the ct2-transformers-converter command line: --copy_files tokenizer_config.json preprocessor_config.json.

I don't know if this additional information helps 🤷‍♂...
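For reference, the ValueError above comes from the dtype of the feature array: "<f8" is NumPy's code for little-endian float64, which CTranslate2's StorageView.from_array does not accept. Mel filters built via get_mel_filters are float64 by default, so the resulting features are too. A minimal sketch of the issue and the cast that avoids it (using a random array as a stand-in for real mel features):

```python
import numpy as np

# Stand-in for the mel features produced after overriding mel_filters:
# arrays built with plain NumPy default to float64 ("<f8"), which triggers
# "ValueError: Unsupported type: <f8" in ctranslate2.StorageView.from_array.
features = np.random.rand(1, 128, 3000)
print(features.dtype.str)  # "<f8"

# Casting to float32 (and making the array contiguous) avoids the error:
features = np.ascontiguousarray(features.astype(np.float32))
print(features.dtype.str)  # "<f4"
```

Copying the original preprocessor_config.json with --copy_files presumably sidesteps this by letting the feature extractor build its filters with the correct dtype in the first place.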
