Whisper's transcription

I tested how Whisper's model handles long-form video that contains speech from only a single speaker. The results were pretty good.
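For anyone who wants to try a similar test, here is a minimal sketch. It assumes the open-source `openai-whisper` Python package with ffmpeg installed. The post does not say which tooling or model size was used, so the `base` model and the helper names `format_timestamp`, `segments_to_lines`, and `transcribe_video` are illustrative assumptions, not the setup behind the result above.

```python
def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as HH:MM:SS (fractions are dropped)."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def segments_to_lines(segments):
    """Turn Whisper-style segments (dicts with 'start', 'end', 'text')
    into timestamped transcript lines."""
    return [
        f"[{format_timestamp(s['start'])} - {format_timestamp(s['end'])}] "
        f"{s['text'].strip()}"
        for s in segments
    ]


def transcribe_video(path: str, model_name: str = "base"):
    """Transcribe one media file and return timestamped lines.

    Assumption: the openai-whisper package is installed and ffmpeg is on
    the PATH so the audio track of a video file can be decoded.
    """
    import whisper  # imported lazily so the helpers above work without it

    model = whisper.load_model(model_name)  # "base" is an assumed model size
    result = model.transcribe(path)         # processes long audio window by window
    return segments_to_lines(result["segments"])
```

Calling `transcribe_video("talk.mp4")` (a hypothetical file name) should return one line per segment, which makes it easy to spot-check a long recording against the video at a given timestamp.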