Language auto-detected correctly as Te (Telugu) - But the text was transcribed in the wrong language, Ta (Tamil) #86

Open
aravindbm opened this issue Oct 9, 2024 · 1 comment

Comments

@aravindbm

Hello,
I tried to transcribe an audio file of an interview with a health professional, spoken in a mix of Telugu and English.
When I set the language to auto-detection, it displayed: Detected language 'te' with probability 0.933493
But the output transcription was in Tamil (another South Indian language).
When I set the language to Te (Telugu), the output transcription was still in Tamil.

Could you kindly help me resolve this issue?
Thanks and Regards,
Dr Manoj Aravind,
Assistant Professor,
Community Medicine, Andhra Medical College, Visakhapatnam,
Andhra Pradesh, India.

@kaixxx
Owner

kaixxx commented Oct 9, 2024

Hello
We have two potential issues here:

  • First, Whisper, the underlying AI model from OpenAI that I use, does not support Indian languages very well, see: https://qxf2.com/blog/testing-openai-whisper-support-for-indian-languages/ (Note, however, that this test uses the "medium" model. I use the "large" one, which gives slightly better quality.)
  • Second, Whisper does not handle mixed-language content very well. Even if you get the first issue sorted, Whisper will probably struggle with your mixed languages and start translating the English passages of your interview into Telugu. The only solution would be to split up your interview and transcribe the different languages separately.
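
Splitting the interview could be sketched roughly as follows. This is only an illustration under several assumptions: ffmpeg is installed, the language switches have been marked by hand as (start, end, language) tuples in seconds, and the function names and the output file naming are made up for this example. Nothing here is part of noScribe.

```python
import subprocess


def build_cut_commands(source, segments):
    """Return one ffmpeg command per (start, end, language) segment.

    Times are in seconds. The output names (hypothetical scheme) carry the
    segment index and the language code, so each piece can be transcribed
    with the matching language setting.
    """
    commands = []
    for index, (start, end, language) in enumerate(segments):
        if end <= start:
            raise ValueError(f"segment {index} ends before it starts")
        output = f"part{index:03d}_{language}.wav"
        commands.append([
            "ffmpeg", "-y",                     # overwrite existing output
            "-i", source,
            "-ss", str(start), "-to", str(end),  # cut window in seconds
            "-ac", "1", "-ar", "16000",          # mono, 16 kHz
            output,
        ])
    return commands


def cut(source, segments):
    """Run the cut commands; raises if ffmpeg reports an error."""
    for command in build_cut_commands(source, segments):
        subprocess.run(command, check=True)
```

For example, `cut("interview.wav", [(0, 42.5, "te"), (42.5, 60, "en")])` would write `part000_te.wav` and `part001_en.wav`, which can then be transcribed one by one with the language set explicitly.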

To solve the first issue, you can try this version of the large Whisper model, which has been trained specifically to support Telugu: https://huggingface.co/vasista22/whisper-telugu-large-v2
To use it with noScribe, you first have to convert the model into the format used by "faster-whisper", the particular Whisper implementation I use. Follow the instructions here: https://github.com/SYSTRAN/faster-whisper#model-conversion (section "Model conversion").
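
As a rough sketch, the conversion can be driven from Python by calling the `ct2-transformers-converter` command line tool that the linked section describes. Assumptions: the tool and its dependencies are already installed as described there, the output folder name is only a placeholder, and whether this particular model repository contains a `tokenizer.json` to copy is not verified here, so `copy_files` is left empty by default.

```python
import subprocess


def build_convert_command(model, output_dir, copy_files=(), quantization="float16"):
    """Build the ct2-transformers-converter call as an argument list."""
    command = [
        "ct2-transformers-converter",
        "--model", model,
        "--output_dir", output_dir,
    ]
    if copy_files:
        # Extra files to copy next to the converted weights, e.g. tokenizer.json.
        # Only list files that actually exist in the source model.
        command += ["--copy_files", *copy_files]
    command += ["--quantization", quantization]
    return command


def convert(model, output_dir, **options):
    """Run the conversion; downloads the model on first use and needs a lot of disk space."""
    subprocess.run(build_convert_command(model, output_dir, **options), check=True)
```

Calling `convert("vasista22/whisper-telugu-large-v2", "whisper-telugu-large-v2-ct2")` would then write the converted model into the (hypothetically named) folder `whisper-telugu-large-v2-ct2`.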
Now go to the folder of your noScribe installation (on Windows: "C:\Program Files (x86)\noScribe") and replace the contents of the subfolder "models\faster-whisper-large-v2" with the corresponding files from your converted model. From now on, when you select the "precise" quality in noScribe, the new model will be used, which will hopefully support Telugu better.
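
The folder swap can also be scripted. The following is a minimal sketch, not an official noScribe tool: the function name and the idea of keeping a backup are additions for illustration, and the only thing it checks is that the converted folder contains a `model.bin`, which CTranslate2 conversions produce.

```python
import shutil
from pathlib import Path


def install_converted_model(converted_dir, target_dir, backup_dir):
    """Replace the contents of target_dir with the files from converted_dir.

    The previous contents of target_dir are moved to backup_dir first, so
    the original model can be restored by moving them back.
    """
    converted = Path(converted_dir)
    target = Path(target_dir)
    backup = Path(backup_dir)

    if not (converted / "model.bin").is_file():
        raise FileNotFoundError(f"no model.bin found in {converted}")
    if backup.exists():
        raise FileExistsError(f"backup folder already exists: {backup}")

    if target.exists():
        shutil.move(str(target), str(backup))  # keep the original model
    shutil.copytree(converted, target)         # put the converted files in place
    return target
```

On Windows this might be called as `install_converted_model("whisper-telugu-large-v2-ct2", r"C:\Program Files (x86)\noScribe\models\faster-whisper-large-v2", r"C:\noScribe-model-backup")`, where the first and last paths are placeholders. Writing below "Program Files (x86)" usually requires administrator rights.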
