Confidence color-coding foundation -> confidence driven segment re-analysis? #1059
ghchris2021 started this conversation in Ideas
Replies: 0 comments
I'm very new to this software (thanks for making it!), so perhaps this isn't practical or interesting, but maybe it's worth mentioning / experimenting with.
I see there's already a "Confidence color-coding" option in main whisper.cpp, so the model must output some kind of per-token confidence metric.
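For reference, that metric appears to be the per-token probability exposed by the C API (whisper_full_get_token_p), which is what the --print-colors option in the main example maps to a color. A minimal sketch for dumping it after a whisper_full() run:

```cpp
#include "whisper.h"
#include <cstdio>

// Print each decoded token with its probability; this is the same
// per-token value the --print-colors color coding is derived from.
void print_token_confidence(struct whisper_context * ctx) {
    const int n_segments = whisper_full_n_segments(ctx);
    for (int i = 0; i < n_segments; ++i) {
        const int n_tokens = whisper_full_n_tokens(ctx, i);
        for (int j = 0; j < n_tokens; ++j) {
            printf("%s [p=%.2f]\n",
                whisper_full_get_token_text(ctx, i, j),
                whisper_full_get_token_p   (ctx, i, j));
        }
    }
}
```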
As a developer I know that a common problem with low-accuracy / low-fidelity encoding / decoding "quantizations" is that they can be "very sure" about something being "X" where a higher-fidelity analysis would see it as "Q" instead. So for this to be useful, the model would have to "know that it doesn't know", or at least produce some kind of noise / uncertainty metric in most cases where an accurate result can't be produced.
Anyway, the idea is that it might be possible to use a small / fast model (e.g. tiny, base, whatever) to transcribe "the easy stuff": the input spans that can be mapped to output (mostly) accurately and rapidly.
But track the areas where the low-fidelity model knows it is possibly making incorrect estimations (e.g. below some NN% confidence threshold), along with the timestamps and data span indices for that neighborhood of input.
Then, as far as possible, feed the decoded context from the low-fidelity decoder plus the uncertainly interpreted input spans to a high-fidelity model (e.g. medium, large, whatever) to get a localized "second opinion" / reinterpretation of the content the low-fidelity model most likely misinterpreted, and synthesize the final output by merging the low-fidelity transcription with the high-fidelity "corrections". Something like the sketch below.
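A rough sketch of how the two passes could be wired up with the existing C API; the 0.5 cutoff and taking the minimum token probability per segment are arbitrary illustrative choices, not anything whisper.cpp prescribes:

```cpp
#include "whisper.h"
#include <algorithm>
#include <cstdint>
#include <vector>

struct span { int64_t t0, t1; }; // segment bounds, 10 ms units

// Pass 1: after whisper_full() on the small model, flag segments whose
// least-confident token falls below the threshold (e.g. 0.5f).
std::vector<span> low_conf_segments(struct whisper_context * ctx_small, float threshold) {
    std::vector<span> out;
    for (int i = 0; i < whisper_full_n_segments(ctx_small); ++i) {
        float p_min = 1.0f;
        for (int j = 0; j < whisper_full_n_tokens(ctx_small, i); ++j) {
            p_min = std::min(p_min, whisper_full_get_token_p(ctx_small, i, j));
        }
        if (p_min < threshold) {
            out.push_back({ whisper_full_get_segment_t0(ctx_small, i),
                            whisper_full_get_segment_t1(ctx_small, i) });
        }
    }
    return out;
}

// Pass 2: re-decode only the flagged windows with the large model.
// whisper_full_params already has offset_ms / duration_ms, so the big
// model never has to touch the parts the small model was sure about.
void second_pass(struct whisper_context * ctx_large,
                 const float * pcm, int n_samples,
                 const std::vector<span> & spans) {
    for (const auto & s : spans) {
        whisper_full_params p = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
        p.offset_ms   = (int)(s.t0 * 10);            // 10 ms units -> ms
        p.duration_ms = (int)((s.t1 - s.t0) * 10);
        // p.initial_prompt could carry the small model's surrounding text
        // so the large model decodes the span with context
        whisper_full(ctx_large, p, pcm, n_samples);
        // ... then splice whisper_full_get_segment_text(ctx_large, ...)
        // back into the small model's transcript over the flagged span
    }
}
```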
Other "dynamic" processing could conceivably also be done based on the "first pass" accuracy estimation by a fast / small model e.g. whether to apply more gain to an input span or some kind of noise-reduction frequency domain filtering or whatever to clean-up / enhance some problematic areas of input before re-trying the transcription and possibly yielding an improved transcription / translation for particular input segments.
Fortunately the small models are small, and even the quantized large whisper model isn't THAT large, so one could probably optionally fit both (e.g. tiny/base/small plus medium or large) in CPU RAM at once, and maybe in VRAM too. The latency cost (model loading, RAM occupancy, ...) of doing a "second pass" over some input span with a different model might therefore not be that large, so maybe it could be cheap enough and interesting / useful enough to experiment with.
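For rough concreteness (F16 ggml file sizes as listed in the whisper.cpp README, quoted from memory, so treat as approximate): tiny ~75 MB, base ~142 MB, small ~466 MB, medium ~1.5 GB, large ~2.9 GB. So e.g. tiny + medium would be only ~1.6 GB resident at once, and quantized variants shrink that further.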