utterance ref cannot be empty ? #98

KarelVesely84 · 2024-12-04T13:48:36Z

Hello,
is there a good reason why an utterance ref is required to be non-empty ?
https://github.com/jitsi/jiwer/blob/9db6e4649dfff1e91de5640e224ea51de01b0a50/jiwer/process.py#L158C1-L159C69

IMHO, I'd expect that it can be empty (sclite behavior).
This is a valid situation: if an utterance in the test set contains just silence, its reference is empty,
and the ASR system should produce an empty string and not hallucinate any symbol.

I hacked it accordingly here:
https://github.com/KarelVesely84/jiwer/tree/allow_empty_ref

Best regards
Karel Vesely

nikvaessen · 2024-12-04T14:03:02Z

My reasoning at the time was that evaluation datasets like test-clean of Librispeech do not have silent utterances, so it is better to fail fast and let the user know they made a mistake (like swapping the reference and hypothesis lists).

KarelVesely84 · 2024-12-04T14:04:38Z

Ok, would you be open to changing the behavior ?

nikvaessen · 2024-12-04T14:10:21Z

Yes, do you think a UserWarning is more appropriate? I think with systems like Whisper, it is valid to test empty reference strings...

KarelVesely84 · 2024-12-04T14:40:20Z

Yes, the UserWarning would be good.
It would warn the user in the log, and it would not stop the WER calculation.
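
For illustration, a minimal sketch of the warn-instead-of-raise behavior discussed above. The helper name `check_references` and the message text are hypothetical and are not jiwer's actual API; the sketch only shows how `warnings.warn` with `UserWarning` lets the caller continue.

```python
import warnings


def check_references(references):
    """Hypothetical helper (not part of jiwer): warn about empty references.

    Returns the indices of references that contain no words, and emits a
    single UserWarning instead of raising an exception.
    """
    # A reference counts as empty if it has no whitespace-separated words.
    empty = [i for i, ref in enumerate(references) if len(ref.split()) == 0]
    if empty:
        warnings.warn(
            f"{len(empty)} reference(s) are empty (indices: {empty}); "
            "check that references and hypotheses were not swapped.",
            UserWarning,
        )
    return empty
```

The warning ends up in the log, while the WER calculation itself can go on with the remaining utterances.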

nikvaessen · 2024-12-12T09:27:13Z

Do you know how sclite handles the edge-case where we only consider one utterance, with an empty reference? This leads to a division by 0.

KarelVesely84 · 2024-12-12T13:44:56Z

Not sure how sclite treats that case.

Anyway, this is unlikely to happen, as sclite is typically used with test sets containing thousands of utterances.
To have a credible WER for an ASR system, a certain number of words/utterances is necessary in the ref.

With the edge case of 1 utterance and 0 ref words, you are right that this leads to a division by zero.

So the WER should be Inf or NaN, I guess, in that case?
(so that it is mathematically OK, according to the definition of WER = (S + D + I) / #REF)
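
A rough sketch of how this could look at corpus level, using the WER definition above. The function names are hypothetical, this is not jiwer's or sclite's implementation, and returning Inf for errors with zero reference words and NaN for 0/0 is just the guess from this comment, not a confirmed convention.

```python
import math


def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance, i.e. the minimal S + D + I."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        cur = [i]
        for j, h in enumerate(hyp_words, start=1):
            cur.append(
                min(
                    prev[j] + 1,             # deletion
                    cur[j - 1] + 1,          # insertion
                    prev[j - 1] + (r != h),  # substitution or match
                )
            )
        prev = cur
    return prev[-1]


def corpus_wer(references, hypotheses):
    """Hypothetical corpus-level WER = (S + D + I) / #REF over all utterances."""
    errors = 0
    ref_words = 0
    for ref, hyp in zip(references, hypotheses):
        r, h = ref.split(), hyp.split()
        errors += edit_distance(r, h)
        ref_words += len(r)
    if ref_words == 0:
        # Assumed convention: 0/0 gives NaN, errors/0 gives Inf.
        return math.nan if errors == 0 else math.inf
    return errors / ref_words
```

With a realistic test set, an empty reference just adds its insertions to the numerator: `corpus_wer(["a b c", ""], ["a b c", "x"])` counts 1 error over 3 reference words. Only the degenerate case with no reference words at all hits the Inf/NaN branch.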

peregilk mentioned this issue Jan 7, 2025

Update run_pseudo_labelling.py huggingface/distil-whisper#158

Open
