Language auto-detected correctly as Te (Telugu) - But the text was transcribed in a wrong language Ta (Tamil) #86

aravindbm · 2024-10-09T07:59:06Z

Hello
I tried to transcribe an audio file of an interview with a health professional. The recording mixes Telugu and English.
With the language set to auto-detection, it reported: Detected language 'te' with probability 0.933493
But the output transcription was in Tamil (another South Indian language).
When I set the language to Te (Telugu) explicitly, the output transcription was still in Tamil.

Kindly help me resolve this issue.
Thanks and Regards,
Dr Manoj Aravind,
Assistant Professor,
Community Medicine, Andhra Medical College, Visakhapatnam,
Andhra Pradesh, India.

kaixxx · 2024-10-09T09:23:45Z

Hello
We have two potential issues here:

First, whisper, the underlying AI model from OpenAI that I use, does not support Indian languages very well, see: https://qxf2.com/blog/testing-openai-whisper-support-for-indian-languages/ (Note, however, that this test uses the "medium" model. I use the "large" one, which has slightly better quality.)
Second, whisper does not support mixed-language content very well. Even if you get the first issue sorted, whisper will probably struggle with your mixed languages and start to translate the English passages of your interview into Telugu. The only solution to this would be to split up your interview and transcribe the different languages separately.
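
As a rough illustration of the "split and transcribe separately" idea, here is a minimal sketch that runs faster-whisper directly (outside noScribe) on parts that have already been cut by hand, one language per file. The helper names, the file names in the usage note, and the `[te]`/`[en]` labels are made up for this example; it is not how noScribe itself works internally.

```python
from typing import Callable, Iterable, List, Tuple

# One (audio_path, language_code) pair per manually cut part of the interview.
Part = Tuple[str, str]


def make_faster_whisper_transcriber(model_name: str = "large-v2") -> Callable[[str, str], str]:
    """Return a function that transcribes one file with a fixed language."""
    # Imported lazily so the rest of the sketch loads without faster-whisper installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_name)

    def transcribe(path: str, language: str) -> str:
        # Passing `language` skips auto-detection for this file;
        # task="transcribe" asks for a transcript rather than a translation.
        segments, _info = model.transcribe(path, language=language, task="transcribe")
        return " ".join(segment.text.strip() for segment in segments)

    return transcribe


def transcribe_parts(parts: Iterable[Part], transcribe: Callable[[str, str], str]) -> List[str]:
    """Transcribe each part in its own language and label the result with that language."""
    return [f"[{language}] {transcribe(path, language)}" for path, language in parts]
```

With two hypothetical files, `transcribe_parts([("part1_telugu.wav", "te"), ("part2_english.wav", "en")], make_faster_whisper_transcriber())` would return one labelled transcript per part, in the original order.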

To solve the first issue, you can try out this version of the large whisper AI model, which has been trained specifically to support Telugu: https://huggingface.co/vasista22/whisper-telugu-large-v2
In order to use it with noScribe, you first have to convert this model into the format for "faster-whisper", the particular implementation I use. Follow the instructions here: https://github.com/SYSTRAN/faster-whisper#model-conversion (section "Model conversion").
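
The conversion step could look roughly like the sketch below, which only wraps the `ct2-transformers-converter` command described in the faster-whisper README. Some assumptions: the output folder name is made up, `float16` is just the quantization used in the README example, and whether the Telugu model repository actually ships a `tokenizer.json` has not been checked here (drop it from `copy_files` if the converter complains). The converter needs the `transformers` and `torch` packages installed.

```python
import subprocess
from typing import List, Sequence

MODEL_ID = "vasista22/whisper-telugu-large-v2"  # the model linked above
OUTPUT_DIR = "whisper-telugu-large-v2-ct2"      # hypothetical output folder name


def build_conversion_command(
    model_id: str,
    output_dir: str,
    copy_files: Sequence[str] = ("tokenizer.json",),  # assumption, see lead-in
    quantization: str = "float16",
) -> List[str]:
    """Build the ct2-transformers-converter call from the faster-whisper README."""
    command = ["ct2-transformers-converter", "--model", model_id, "--output_dir", output_dir]
    if copy_files:
        command += ["--copy_files", *copy_files]
    command += ["--quantization", quantization]
    return command


def convert(model_id: str = MODEL_ID, output_dir: str = OUTPUT_DIR) -> None:
    """Run the conversion; downloads the model from Hugging Face on first use."""
    subprocess.run(build_conversion_command(model_id, output_dir), check=True)
```

Calling `convert()` should leave the converted model files in the output folder, ready for the next step.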
Now go to the folder of your noScribe installation (on Windows: "C:\Program Files (x86)\noScribe") and replace the contents of the subfolder "models\faster-whisper-large-v2" with the corresponding files from your converted model. From now on, if you select the "precise" quality in noScribe, the new model will be used, which will hopefully have better support for Telugu.
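
If you prefer to script the file replacement, a minimal sketch might look like this. The function name and the "-backup" folder are inventions for this example. It keeps a copy of the original model so you can switch back, then overwrites same-named files in the target folder. On Windows, writing to "C:\Program Files (x86)" will probably require administrator rights.

```python
import shutil
from pathlib import Path


def replace_model_files(converted_dir: Path, target_dir: Path) -> Path:
    """Back up target_dir once, then copy the converted model files into it."""
    if not converted_dir.is_dir():
        raise FileNotFoundError(f"Converted model folder not found: {converted_dir}")

    # Keep the original model next to the target folder (created only once).
    backup_dir = target_dir.with_name(target_dir.name + "-backup")
    if target_dir.exists() and not backup_dir.exists():
        shutil.copytree(target_dir, backup_dir)

    target_dir.mkdir(parents=True, exist_ok=True)
    for new_file in converted_dir.iterdir():
        if new_file.is_file():
            # Overwrites files with the same name, e.g. model.bin.
            shutil.copy2(new_file, target_dir / new_file.name)
    return backup_dir
```

For the default Windows installation, `target_dir` would be `Path(r"C:\Program Files (x86)\noScribe\models\faster-whisper-large-v2")` and `converted_dir` the output folder of your conversion.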
