
tokenizer decode with timestamps fails for extended vocabulary #35330

Open

bnestor opened this issue Dec 18, 2024 · 1 comment

bnestor commented Dec 18, 2024

System Info

python=3.10.13
transformers==4.44.1
torch==2.1.2

Who can help?

@sanchit-gandhi @ylacombe @eustlb @ArthurZ

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Decoding with timestamps produces unexpected results when the vocabulary is extended

>>> from transformers import WhisperTokenizer, AddedToken
>>> tokenizer = WhisperTokenizer.from_pretrained('openai/whisper-base', language="English", task="transcribe", predict_timestamps=True)
>>> extended_vocab = ['newword1']
>>> extended_vocab = [AddedToken(t, single_word=True, lstrip=True) for t in extended_vocab]
>>> tokenizer.add_tokens(extended_vocab)
1
>>> print(len(tokenizer))
51866
>>> print(tokenizer.convert_ids_to_tokens(51865))
newword1
>>> tokens = tokenizer('<|0.00|> newword1 <|0.22|>').input_ids
>>> tokens
[50258, 50259, 50359, 50364, 51865, 220, 50375, 50257]
>>> tokenizer.decode(tokens, skip_special_tokens=True)
'newword1 '
>>> tokenizer.decode(tokens, skip_special_tokens=False)
'<|startoftranscript|><|en|><|transcribe|>newword1 <|endoftext|>'
>>> tokenizer.decode(tokens, skip_special_tokens=False, decode_with_timestamps=True)
'<|startoftranscript|><|en|><|transcribe|><|0.00|><|30.02|> <|30.24|><|endoftext|>'
>>> tokens = tokenizer('<|0.00|> word <|0.22|>').input_ids # something in the vocabulary
>>> tokenizer.decode(tokens, skip_special_tokens=True)
' word '
>>> tokenizer.decode(tokens, skip_special_tokens=False)
'<|startoftranscript|><|en|><|transcribe|> word <|endoftext|>'
>>> tokenizer.decode(tokens, skip_special_tokens=False, decode_with_timestamps=True)
'<|startoftranscript|><|en|><|transcribe|><|0.00|> word <|0.22|><|endoftext|>'
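
What the wrong output above suggests (an assumption drawn from the ids, not checked against the tokenizer source): the timestamp-decoding path seems to treat every token id at or above the first timestamp id as a timestamp, so the id that add_tokens assigned to newword1 gets rendered as a timestamp string instead of text. With the ids from the reproduction:

>>> timestamp_begin = 50364                         # <|0.00|> in the token sequence above
>>> f"<|{(51865 - timestamp_begin) * 0.02:.2f}|>"   # 51865 is the id given to 'newword1'
'<|30.02|>'
>>> f"<|{(50375 - timestamp_begin) * 0.02:.2f}|>"   # 50375 is the genuine <|0.22|> token
'<|0.22|>'

which matches the stray <|30.02|> in the decoded string.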

The problem arises in

See issue #20225.

Expected behavior

I would expect the timestamps to remain consistent between tokenizing and decoding.

>>> tokens = tokenizer('<|0.00|> newword1 <|0.22|>').input_ids
>>> tokenizer.decode(tokens, skip_special_tokens=False, decode_with_timestamps=True)
'<|startoftranscript|><|en|><|transcribe|><|0.00|> newword1<|0.22|><|endoftext|>'
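
Until a fix lands, a minimal workaround sketch (an assumption, not the eventual library fix): decode the timestamps manually and only treat ids inside the original timestamp block as timestamps, so added-vocabulary ids above that block are decoded as text. The 1501-position / 0.02 s layout below is assumed from the multilingual Whisper vocabulary, and decode_with_timestamps_safe is a hypothetical helper, not a transformers API.

from transformers import WhisperTokenizer, AddedToken

tokenizer = WhisperTokenizer.from_pretrained(
    "openai/whisper-base", language="English", task="transcribe", predict_timestamps=True
)
tokenizer.add_tokens([AddedToken("newword1", single_word=True, lstrip=True)])

def decode_with_timestamps_safe(tokenizer, token_ids, time_precision=0.02):
    # Assumption: the timestamp tokens occupy the 1501 ids right after <|notimestamps|>,
    # i.e. <|0.00|> .. <|30.00|> in 0.02 s steps; everything else is decoded as text.
    timestamp_begin = tokenizer.convert_tokens_to_ids("<|notimestamps|>") + 1
    timestamp_end = timestamp_begin + 1500
    pieces, buffer = [], []
    for t in token_ids:
        if timestamp_begin <= t <= timestamp_end:
            if buffer:  # flush any pending text tokens before emitting the timestamp
                pieces.append(tokenizer.decode(buffer))
                buffer = []
            pieces.append(f"<|{(t - timestamp_begin) * time_precision:.2f}|>")
        else:
            buffer.append(t)
    if buffer:
        pieces.append(tokenizer.decode(buffer))
    return "".join(pieces)

tokens = tokenizer('<|0.00|> newword1 <|0.22|>').input_ids
print(decode_with_timestamps_safe(tokenizer, tokens))
# expected to print something along the lines of:
# <|startoftranscript|><|en|><|transcribe|><|0.00|>newword1 <|0.22|><|endoftext|>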
bnestor added the bug label Dec 18, 2024
eustlb (Contributor) commented Dec 19, 2024

Hey @bnestor, thanks a lot for raising this issue.
Indeed, the problem arises where you've spotted it. It's linked to #33082 and to the PR I already opened to fix it, #33512. I had to write tests for it before merging, but we then went through another bug-fixing effort (#34535, #34537, #34111) that caused the PR to stall, sorry for the delay! Nevertheless, it's the next item on the Whisper bug-fixing roadmap, so it should be solved quickly 🤗
