
tokenizer decode with timestamps fails for extended vocabulary #35330

Open

bnestor opened this issue Dec 18, 2024 · 1 comment

bnestor commented Dec 18, 2024

System Info

python=3.10.13
transformers==4.44.1
torch==2.1.2

Who can help?

@sanchit-gandhi @ylacombe @eustlb @ArthurZ

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Decoding with timestamps produces unexpected results when the vocabulary is extended

>>> from transformers import WhisperTokenizer, AddedToken
>>> tokenizer = WhisperTokenizer.from_pretrained('openai/whisper-base', language="English", task="transcribe", predict_timestamps=True)
>>> extended_vocab = ['newword1']
>>> extended_vocab = [AddedToken(t, single_word=True, lstrip=True) for t in extended_vocab]
>>> tokenizer.add_tokens(extended_vocab)
1
>>> print(len(tokenizer))
51866
>>> print(tokenizer.convert_ids_to_tokens(51865))
newword1
>>> tokens = tokenizer('<|0.00|> newword1 <|0.22|>').input_ids
>>> tokens
[50258, 50259, 50359, 50364, 51865, 220, 50375, 50257]
>>> tokenizer.decode(tokens, skip_special_tokens=True)
'newword1 '
>>> tokenizer.decode(tokens, skip_special_tokens=False)
'<|startoftranscript|><|en|><|transcribe|>newword1 <|endoftext|>'
>>> tokenizer.decode(tokens, skip_special_tokens=False, decode_with_timestamps=True)
'<|startoftranscript|><|en|><|transcribe|><|0.00|><|30.02|> <|30.24|><|endoftext|>'
>>> tokens = tokenizer('<|0.00|> word <|0.22|>').input_ids # something in the vocabulary
>>> tokenizer.decode(tokens, skip_special_tokens=True)
' word '
>>> tokenizer.decode(tokens, skip_special_tokens=False)
'<|startoftranscript|><|en|><|transcribe|> word <|endoftext|>'
>>> tokenizer.decode(tokens, skip_special_tokens=False, decode_with_timestamps=True)
'<|startoftranscript|><|en|><|transcribe|><|0.00|> word <|0.22|><|endoftext|>'
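
What the wrong output above suggests (an assumption drawn from the ids, not checked against the tokenizer source): the timestamp-decoding path seems to treat every token id at or above the first timestamp id as a timestamp, so the id that add_tokens assigned to newword1 gets rendered as a timestamp string instead of text. With the ids from the reproduction:

>>> timestamp_begin = 50364                         # <|0.00|> in the token sequence above
>>> f"<|{(51865 - timestamp_begin) * 0.02:.2f}|>"   # 51865 is the id given to 'newword1'
'<|30.02|>'
>>> f"<|{(50375 - timestamp_begin) * 0.02:.2f}|>"   # 50375 is the genuine <|0.22|> token
'<|0.22|>'

which matches the stray <|30.02|> in the decoded string.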

The problem arises in

See issue #20225.

Expected behavior

I would expect the timestamps to remain consistent between tokenizing and decoding.

>>> tokens = tokenizer('<|0.00|> newword1 <|0.22|>').input_ids
>>> tokenizer.decode(tokens, skip_special_tokens=False, decode_with_timestamps=True)
'<|startoftranscript|><|en|><|transcribe|><|0.00|> newword1<|0.22|><|endoftext|>'
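
Until a fix lands, a minimal workaround sketch (an assumption, not the eventual library fix): decode the timestamps manually and only treat ids inside the original timestamp block as timestamps, so added-vocabulary ids above that block are decoded as text. The 1501-position / 0.02 s layout below is assumed from the multilingual Whisper vocabulary, and decode_with_timestamps_safe is a hypothetical helper, not a transformers API.

from transformers import WhisperTokenizer, AddedToken

tokenizer = WhisperTokenizer.from_pretrained(
    "openai/whisper-base", language="English", task="transcribe", predict_timestamps=True
)
tokenizer.add_tokens([AddedToken("newword1", single_word=True, lstrip=True)])

def decode_with_timestamps_safe(tokenizer, token_ids, time_precision=0.02):
    # Assumption: the timestamp tokens occupy the 1501 ids right after <|notimestamps|>,
    # i.e. <|0.00|> .. <|30.00|> in 0.02 s steps; everything else is decoded as text.
    timestamp_begin = tokenizer.convert_tokens_to_ids("<|notimestamps|>") + 1
    timestamp_end = timestamp_begin + 1500
    pieces, buffer = [], []
    for t in token_ids:
        if timestamp_begin <= t <= timestamp_end:
            if buffer:  # flush any pending text tokens before emitting the timestamp
                pieces.append(tokenizer.decode(buffer))
                buffer = []
            pieces.append(f"<|{(t - timestamp_begin) * time_precision:.2f}|>")
        else:
            buffer.append(t)
    if buffer:
        pieces.append(tokenizer.decode(buffer))
    return "".join(pieces)

tokens = tokenizer('<|0.00|> newword1 <|0.22|>').input_ids
print(decode_with_timestamps_safe(tokenizer, tokens))
# expected to print something along the lines of:
# <|startoftranscript|><|en|><|transcribe|><|0.00|>newword1 <|0.22|><|endoftext|>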
bnestor added the bug label Dec 18, 2024
eustlb (Contributor) commented Dec 19, 2024

Hey @bnestor, thanks a lot for raising this issue.
Indeed, the problem arises where you've spotted it. It's linked to #33082 and to the PR I already opened to fix it, #33512. I had to write tests for it before merging, but we then went through another bug-fixing effort (#34535, #34537, #34111) that caused the PR to stall, sorry for the delay! Nevertheless, it's the next item on the Whisper bug-fixing roadmap, so it should be solved quickly 🤗
