retry on leftover audio false mechanism #1395
base: master
Hi, how does this solve hallucination, and can you give examples?
We would generally see lots of hallucinations when we enabled word-level timestamps. The reason is that Whisper would be presented again, without context, with the leftover audio that it had already decided not to transcribe. This would frequently cause hallucinations, since Whisper tends to always output something when presented with audio, even if that audio doesn't contain any speech. I guess there may be use cases where this retry mechanism makes sense, but at least in our experience it seems to hurt more than it helps.
Do you use VAD?
This is the original Whisper behaviour. I don't remember now why it doesn't seek to the full segment boundary there; I think there must be a reason for that. @MahmoudAshraf97 Do you know why?
Just tested it. The PR has the opposite effect. It creates "hallucinations" at the boundaries: it transcribes the last sentences from previous segments, and the timestamps at the start of segments leak into the previous segments too. The PR looks invalid to me. @rjames-0 @ArneNx
Sequential Whisper starts the next segment from the end of the previous one. If there is some overlap between segments, the thing is that, when word timestamps are used, the segment boundaries are updated using the start of the first word and the end of the last word. I have not verified this behavior against the reference implementation. The problem mentioned in the PR is valid, but the solution is wrong imho; VAD is a much cleaner solution.
Hallucination fix from @ArneNx to prevent faster-whisper from regenerating text on leftover audio when it is set to return word-level timestamps. The mechanism introduces a retry_on_leftover_audio option which, when set to false, skips processing of the leftover audio segment.
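To make the discussion concrete, here is a minimal sketch of the seek decision being debated. It is not the actual faster-whisper code: the function next_seek and its arguments window_end and last_word_end are hypothetical names chosen for illustration, and only the retry_on_leftover_audio option name comes from the PR description. It assumes that, with word timestamps enabled, the decoder can either resume from the end of the last transcribed word (so the leftover audio is presented again) or jump to the end of the current window (so the leftover audio is skipped).

```python
from typing import Optional


def next_seek(
    window_end: float,
    last_word_end: Optional[float],
    retry_on_leftover_audio: bool = True,
) -> float:
    """Return the time (in seconds) at which the next window should start.

    Hypothetical helper for illustration only, not part of faster-whisper.
    """
    # Retry enabled: resume right after the last transcribed word, so any
    # audio between that word and the window end is decoded again.
    if (
        retry_on_leftover_audio
        and last_word_end is not None
        and last_word_end < window_end
    ):
        return last_word_end
    # Retry disabled (or no word timing available): jump to the window end,
    # which skips the leftover audio entirely.
    return window_end


# A 30-second window whose last word ended at 24.5 seconds.
retry_position = next_seek(30.0, 24.5, retry_on_leftover_audio=True)
skip_position = next_seek(30.0, 24.5, retry_on_leftover_audio=False)
```

In this sketch the 5.5 seconds between 24.5 and 30.0 are the leftover audio. With the retry enabled they are decoded again without the earlier context, which is where the reported hallucinations would arise; with it disabled they are never decoded, which is the trade-off the reviewers question.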