Is your feature request related to a problem?
I am working on an interface to evaluate recordings of pair programming sessions. The UI uses WebVTT for subtitles & it'd be really nice if the generated file used whisper.cpp's diarization to determine when speakers alternate & then used WebVTT's voice tags (<v>) to mark each change in speaker.
What's the solution you'd like?
Ideally, the program would identify the different voices in the video and mark them appropriately in the VTT. Then, I'd be able to specify the CSS (VTT supports Cascading Style Sheets) used for each voice.
The UI could be as simple as letting me pick a color for each voice, but raw CSS would be just as simple to collect & far more flexible. (One can specify gradient fills, different fonts, ligatures, backgrounds, or whatever.)
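To make the request concrete, here is a minimal sketch of the kind of output I have in mind. It assumes the transcription step already yields segments with a speaker label; the `Segment` type, the `build_vtt` function, and the "Speaker 1" style labels are all hypothetical names for illustration, not anything whisper.cpp provides. The voice tags and the `::cue(v[voice="..."])` selector inside a `STYLE` block are standard WebVTT.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float   # cue start, in seconds
    end: float     # cue end, in seconds
    speaker: str   # hypothetical speaker label, e.g. "Speaker 1"
    text: str      # transcribed text for this cue


def format_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp (hh:mm:ss.ttt)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def escape_cue_text(text: str) -> str:
    """Escape the characters that are special inside WebVTT cue text."""
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def build_vtt(segments, styles):
    """Build a WebVTT document with one voice span per segment.

    `styles` maps a speaker label to a raw CSS declaration string
    supplied by the user, e.g. {"Speaker 1": "color: gold;"}.
    """
    lines = ["WEBVTT", ""]
    if styles:
        # One STYLE block holding a ::cue rule per voice.
        lines.append("STYLE")
        for speaker, css in styles.items():
            lines.append(f'::cue(v[voice="{speaker}"]) {{ {css} }}')
        lines.append("")
    for seg in segments:
        lines.append(f"{format_timestamp(seg.start)} --> {format_timestamp(seg.end)}")
        lines.append(f"<v {seg.speaker}>{escape_cue_text(seg.text)}</v>")
        lines.append("")
    return "\n".join(lines)
```

Calling `build_vtt(segments, {"Speaker 1": "color: gold;"})` would give each voice its own CSS rule, so the UI only needs to collect one CSS string per detected speaker.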
What're alternatives you've considered?
Currently, I'm looking at having to specify voices & styles manually & that's just a lot of overhead I'd as soon avoid.
Is there additional context?
I'm uncertain whether whisper.cpp's diarization distinguishes between speakers or just recognizes when the speaker changes. I'll do some more research and post a comment on this request later.
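If it turns out that only speaker changes are reported, a two-person pairing session could still be labeled by alternating between two voices at each change. A rough sketch, assuming a hypothetical input where each segment carries a flag saying whether the speaker changes after it (the function name and input shape are my own, not whisper.cpp's output format):

```python
def label_alternating(turn_flags, speakers=("Speaker 1", "Speaker 2")):
    """Assign speaker labels by cycling through `speakers` at each turn.

    turn_flags[i] is True when the speaker changes after segment i.
    This only tracks who is speaking correctly when the speakers
    strictly take turns in the given order.
    """
    labels = []
    current = 0
    for changed_after in turn_flags:
        labels.append(speakers[current])
        if changed_after:
            current = (current + 1) % len(speakers)
    return labels
```

This would break down with more than two participants or a missed change, so real per-speaker identification would still be preferable.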