Replies: 1 comment
-
If you want more accurate word and sentence timestamps, you can use Batched Faster-Whisper. I have tested it.
-
I'm currently using a fine-tuned model that was converted to CT2 format. However, the word-level timestamps and segments produced by this model perform very badly.
As an alternative, I'm considering using a forced-alignment model, which performs optimally with audio chunks of approximately 8 seconds in length.
Instead of taking the original Whisper segments, is it possible to get segments based on VAD?
Since the "max_speech_duration_s" parameter is available, I'm wondering if there's a way to use it to get this VAD-based segmentation.
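To make the idea concrete, here is a rough sketch of the kind of grouping I have in mind. It is plain Python with no library calls. It assumes the VAD returns speech regions as dicts with "start" and "end" in samples at 16 kHz (that input format is an assumption on my side), and `group_vad_segments` is a hypothetical helper name, not something from faster-whisper. It merges neighboring regions while the merged chunk stays within about 8 seconds and hard-splits any single region longer than that.

```python
from typing import Dict, List

SAMPLING_RATE = 16_000  # assumed sample rate of the audio given to the VAD


def group_vad_segments(
    speech_timestamps: List[Dict[str, int]],
    max_chunk_s: float = 8.0,
    sampling_rate: int = SAMPLING_RATE,
) -> List[Dict[str, int]]:
    """Group VAD speech regions into chunks no longer than max_chunk_s.

    Hypothetical helper: regions are dicts with "start"/"end" in samples.
    """
    max_samples = int(max_chunk_s * sampling_rate)

    # Step 1: hard-split any region that is longer than the limit on its own.
    pieces: List[Dict[str, int]] = []
    for ts in speech_timestamps:
        start, end = ts["start"], ts["end"]
        while end - start > max_samples:
            pieces.append({"start": start, "end": start + max_samples})
            start += max_samples
        pieces.append({"start": start, "end": end})

    # Step 2: merge a piece into the previous chunk while the merged span
    # (including the silence in between) still fits within the limit.
    chunks: List[Dict[str, int]] = []
    for piece in pieces:
        if chunks and piece["end"] - chunks[-1]["start"] <= max_samples:
            chunks[-1]["end"] = piece["end"]
        else:
            chunks.append(dict(piece))
    return chunks


if __name__ == "__main__":
    # Made-up speech regions (in samples) purely for illustration.
    regions = [
        {"start": 0, "end": 32_000},
        {"start": 48_000, "end": 80_000},
        {"start": 96_000, "end": 144_000},
    ]
    for chunk in group_vad_segments(regions):
        print(chunk["start"] / SAMPLING_RATE, chunk["end"] / SAMPLING_RATE)
```

Each resulting chunk could then be cut from the audio and passed to the forced-alignment model. Note that merged chunks include the silence between the original regions, and if the VAD already caps regions through "max_speech_duration_s", the hard split in step 1 should rarely trigger.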