Update run_pseudo_labelling.py #158

peregilk · 2024-10-28T22:04:33Z

Prevents an error in jiwer caused by empty predictions. For consistency, both predictions and labels are replaced with <|nocaptions|> if empty, so that they are still included in the WER calculation.
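A minimal sketch of the replacement described above, for readers who have not opened the diff. The helper name `replace_empty` and the constant `NO_CAPTIONS_TOKEN` are hypothetical and are not taken from the patch; only the `<|nocaptions|>` token and the `norm_pred_str` / `norm_label_str` variable names come from this thread.

```python
# Hypothetical helper: the name and the exact place it would be called in
# run_pseudo_labelling.py are assumptions, not the actual patch.
NO_CAPTIONS_TOKEN = "<|nocaptions|>"


def replace_empty(texts, placeholder=NO_CAPTIONS_TOKEN):
    """Return a new list where every empty string is swapped for the placeholder."""
    return [text if len(text) > 0 else placeholder for text in texts]


# Toy data standing in for the normalised predictions and labels.
norm_pred_str = ["hello world", "", "good morning"]
norm_label_str = ["hello world", "good evening", ""]

# Apply the same rule to both sides so that every pair stays in the evaluation.
norm_pred_str = replace_empty(norm_pred_str)
norm_label_str = replace_empty(norm_label_str)

print(norm_pred_str)
print(norm_label_str)
```

Treating both lists identically keeps them the same length, so each prediction still lines up with its label.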

sanchit-gandhi · 2025-01-07T12:18:36Z

training/run_pseudo_labelling.py

-        # filtering step to only evaluate the samples that correspond to non-zero normalized references:
-        norm_pred_str = [norm_pred_str[i] for i in range(len(norm_pred_str)) if len(norm_label_str[i]) > 0]
-        norm_label_str = [norm_label_str[i] for i in range(len(norm_label_str)) if len(norm_label_str[i]) > 0]


These lines only keep the norm_pred_str (hypothesis) and norm_label_str (reference) where the norm_label_str is not empty.
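As a rough illustration of what that filtering step does, here is an equivalent written with `zip`. The function name `drop_empty_references` is hypothetical; the behaviour is meant to mirror the two list comprehensions in the diff, not to replace them.

```python
def drop_empty_references(preds, labels):
    """Keep only the (prediction, label) pairs whose label is non-empty."""
    kept = [(pred, label) for pred, label in zip(preds, labels) if len(label) > 0]
    kept_preds = [pred for pred, _ in kept]
    kept_labels = [label for _, label in kept]
    return kept_preds, kept_labels


# Toy data: the second pair has an empty label, the third an empty prediction.
norm_pred_str = ["hello world", "spurious text", ""]
norm_label_str = ["hello world", "", "good morning"]

norm_pred_str, norm_label_str = drop_empty_references(norm_pred_str, norm_label_str)

print(norm_pred_str)
print(norm_label_str)
```

Note that the pair with the empty prediction survives the filter: only empty references are dropped.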

The other edge case is an empty hypothesis. In this case, for a reference of N words we have:

N deletions (one for each word in the reference)

0 substitutions

0 insertions

So the WER is (N + 0 + 0) / N = 1, and it is computed in an entirely valid way.
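The counting argument above can also be checked without any dependency. The sketch below is a plain word-level edit distance, not jiwer's implementation; the names `word_error_counts` and `toy_wer` are made up for this illustration.

```python
def word_error_counts(reference, hypothesis):
    """Return (total_edits, substitutions, deletions, insertions) at word level."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] holds the best counts for aligning ref[:i] with hyp[:j].
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)  # nothing to align with: delete every reference word
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)  # empty reference: every hypothesis word is an insertion
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # matching words cost nothing
                continue
            t, s, d, n = dp[i - 1][j - 1]
            substitution = (t + 1, s + 1, d, n)
            t, s, d, n = dp[i - 1][j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = dp[i][j - 1]
            insertion = (t + 1, s, d, n + 1)
            # Tuples compare on total edits first, so min() picks a cheapest path.
            dp[i][j] = min(substitution, deletion, insertion)
    return dp[len(ref)][len(hyp)]


def toy_wer(reference, hypothesis):
    """WER = (S + D + I) / N; raises ZeroDivisionError for an empty reference."""
    total, _, _, _ = word_error_counts(reference, hypothesis)
    return total / len(reference.split())


print(word_error_counts("hello world", ""))  # (2, 0, 2, 0): two deletions only
print(toy_wer("hello world", ""))            # 1.0
```

An empty reference is the case that has no well-defined WER here (division by zero), which is the case the filtering step guards against.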

You can see this with a toy example:

from jiwer import wer

reference = "hello world"
hypothesis = ""

error = wer(reference, hypothesis)
print(error)

Print Output:

1.0

So there shouldn't be a need for an additional check for an empty normalised hypothesis: these are valid in the WER calculation. Let me know if you have a minimal repro that rebuts this!

peregilk · 2025-01-07T17:26:14Z

It might be fixed now. Read details here: jitsi/jiwer#98

Currently AFK so I have not tested. Not sure if my patch is valid for newest jiwer.

Replacing the empty strings with <|nocaptions|> (pre-v3) or <|nospeech|> is the most effective option for reducing hallucinations, but that is a slightly different issue.

Update run_pseudo_labelling.py

2e35530
