@rjames-0

Hallucination fix from @ArneNx to prevent faster-whisper from regenerating text on leftover audio when it is set to return word-level timestamps. The mechanism introduces a retry_on_leftover_audio option which, when set to false, skips the processing of the leftover audio segment.

@MahmoudAshraf97
Collaborator

Hi, how is this solving hallucination and can you give examples?

@ArneNx

ArneNx commented Nov 28, 2025

We would generally see lots of hallucinations when we enabled word-level timestamps. The reason is that Whisper would get presented the leftover audio (that it already decided not to transcribe) without context again. This would frequently cause hallucinations since whisper tends to always output something when presented with audio, even if that audio doesn't contain any speech. I guess there may be use-cases where this retry mechanism makes sense, but at least from our experience it seems to hurt more than it helps.

@Purfview
Contributor

Purfview commented Nov 29, 2025

We would generally see lots of hallucinations when we enabled word-level timestamps.

Do you use VAD?

The reason is that Whisper would get presented the leftover audio (that it already decided not to transcribe) without context again.

This is the original Whisper behaviour. I don't remember now why it doesn't seek to the full segment boundary there; I think there must be a reason for that.

@MahmoudAshraf97 Do you know why?

@Purfview
Contributor

Purfview commented Nov 29, 2025

Just tested it. The PR has the opposite effect. It creates "hallucinations" at the boundaries: it transcribes the last sentences from previous segments again, and the timestamps at the start of segments leak into the previous segments too.

The PR looks invalid to me.

@rjames-0 @ArneNx
If you have problems with hallucinations then use VAD and/or hallucination_silence_threshold.
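
For readers who want to try that suggestion, here is a minimal sketch of enabling both options together with word-level timestamps in faster-whisper. The helper names `build_transcribe_options` and `transcribe_with_vad`, the model size, and the 2.0 second threshold are illustrative assumptions, not values recommended anywhere in this thread.

```python
def build_transcribe_options(use_vad=True, silence_threshold=2.0):
    """Collect keyword arguments for WhisperModel.transcribe().

    The helper itself is hypothetical; the keys are existing
    faster-whisper transcribe() parameters.
    """
    return {
        "word_timestamps": True,  # the setting under which the issue was reported
        "vad_filter": use_vad,  # drop non-speech audio before decoding
        # Only has an effect when word_timestamps is enabled.
        "hallucination_silence_threshold": silence_threshold,
    }


def transcribe_with_vad(audio_path, model_size="small"):
    """Transcribe a file with VAD and the silence threshold enabled."""
    # Imported lazily so the options helper works without the package installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size)
    segments, _info = model.transcribe(audio_path, **build_transcribe_options())
    # transcribe() returns a lazy generator; list() forces the decoding.
    return list(segments)
```

Calling `transcribe_with_vad("audio.wav")` would return the decoded segments; the threshold value likely needs tuning per dataset.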

@MahmoudAshraf97
Collaborator

We would generally see lots of hallucinations when we enabled word-level timestamps.

Do you use VAD?

The reason is that Whisper would get presented the leftover audio (that it already decided not to transcribe) without context again.

This is the original Whisper behaviour. I don't remember now why it doesn't seek to the full segment boundary there; I think there must be a reason for that.

@MahmoudAshraf97 Do you know why?

Sequential Whisper starts the next segment from the end of the previous one, so there may be some overlap between segments. The thing is, if word timestamps are used, the segment boundaries are updated using the start of the first word and the end of the last word. I have not verified this behavior with the reference implementation.

The problem mentioned in the PR is valid, but the solution is wrong imho; VAD is a much cleaner solution.
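
The seek behaviour discussed above can be illustrated with a toy model. This is a simplified sketch, not the code of faster-whisper or the reference implementation: the function `next_seek`, its arguments, and the way the `retry_on_leftover_audio` flag is wired in are assumptions made for illustration only.

```python
FRAMES_PER_SECOND = 100  # Whisper uses 10 ms mel frames


def next_seek(seek, segment_size, words, single_timestamp_ending,
              retry_on_leftover_audio=True):
    """Return the frame index at which the next window would start.

    seek: start of the current window, in frames
    segment_size: length of the current window, in frames
    words: list of (start, end) word times in seconds for this window
    single_timestamp_ending: True if decoding ended on a lone timestamp
    retry_on_leftover_audio: hypothetical stand-in for the PR's option
    """
    time_offset = seek / FRAMES_PER_SECOND
    full_boundary = seek + segment_size  # skip the whole window

    if not retry_on_leftover_audio or single_timestamp_ending or not words:
        return full_boundary

    last_word_end = words[-1][1]
    if last_word_end > time_offset:
        # Resume right after the last word, so the audio left over in
        # this window is presented to the model again.
        return round(last_word_end * FRAMES_PER_SECOND)
    return full_boundary
```

In this model, a 30 second window whose last word ends at 12.5 s makes the next window start at frame 1250 instead of 3000, which is the "leftover audio" that gets decoded again without context.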

@Purfview
Contributor

I have not verified this behavior with the reference implementation

Here it is: https://github.com/openai/whisper/blob/c0d2f624c09dc18e709e37c2ad90c039a4eb72a2/whisper/transcribe.py#L413-L416

Disabling that block creates quirks like this (a segment end is after the first line in the screenshot) [no VAD]:

[Screenshot: Capture]
