just thought #30

Closed
yenerismail opened this issue Jun 26, 2024 · 10 comments

Comments

@yenerismail

Hello,
I'm writing as an end user (my suggestions may be ridiculous because I have no software knowledge).
"The model used is Whisper-Small-244M with KV cache."
Can Whisper-Large-V3 be used?
Can the user make a choice? (such as tiny, base, small, medium, large)
CPUs and GPUs in mobile phones are advancing rapidly. For example, my phone has a Qualcomm Snapdragon 8 Gen 2 and an Adreno(TM) 740.
Can corrections be made during the conversation to prevent people from understanding and translating the wrong word? (Walkie Talkie Mode).
(Walkie Talkie Mode) Can it be adapted for a single language?

Is it possible to input voice for Conversation Mode? (Without keyboard feature)

@niedev
Owner

niedev commented Jun 26, 2024

Hello,

Don't worry, no suggestion is ridiculous.

Can Whisper-Large-V3 be used? Can the user make a choice? (such as tiny, base, small, medium, large) CPUs and GPUs in mobile phones are advancing rapidly. For example, my phone has a Qualcomm Snapdragon 8 Gen 2 and an Adreno(TM) 740.

The most limiting factor for integrating larger models is the amount of RAM on phones. Let's start from the assumption that the maximum amount of RAM usable by an application is usually half of the phone's RAM (the rest is consumed by the operating system and other apps).

To explain it in simple terms: an AI model must be loaded entirely into RAM to be executed. To calculate its minimum consumption in bytes (models usually consume more), just multiply the number of its parameters by 1 (normally by 4, but my models are quantized, so each parameter weighs 1 byte instead of 4). Whisper Large, for example, which has 1.5B parameters, would consume 1.5GB (so it would also be usable, but let's continue).
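A minimal sketch of that arithmetic (the parameter counts and byte sizes are the ones mentioned in this thread; real memory use is higher because of activations, the KV cache and runtime overhead):

```python
# Back-of-the-envelope lower bound: RAM in GB ≈ parameters (in billions) * bytes per parameter.
def min_ram_gb(params_billions: float, bytes_per_param: float) -> float:
    return params_billions * bytes_per_param

models = {
    "Whisper Small, 8-bit quantized": (0.244, 1),  # ~244M parameters, 1 byte each
    "Whisper Large, 8-bit quantized": (1.5, 1),    # ~1.5B parameters, 1 byte each
    "Whisper Large, fp32":            (1.5, 4),    # same model without quantization
}

for name, (params_b, bytes_pp) in models.items():
    print(f"{name}: at least ~{min_ram_gb(params_b, bytes_pp):.2f} GB")
```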

In the case of my app I have to keep both Whisper and the translation model (in this case NLLB) in RAM. For Whisper, the increase in quality from the small model onwards becomes gradually smaller; in fact, based on the data and my tests, the quality of Whisper Small is already very good. The side that needs the most improvement is the translation: unlike Whisper, translator models with more parameters have a significantly higher quality than NLLB.

Precisely for this reason, before the release of the app I tried Madlad, a 3B-parameter translator (4GB of RAM used, because to maintain the quality I had to leave some parameters at 4 bytes). Together with Whisper Small, the total RAM consumption was about 5GB (even Whisper Small consumes more than expected), and even on my phone with 12GB of RAM, being so close to the limit (6GB for a 12GB phone), the app sometimes crashed randomly.

So I would say that, at least for now, only those who have a phone with 16GB of RAM could enjoy a better experience than the current one (even if slower), and they are too few to justify the time needed to add other models. That said, once OnnxRuntime supports 0.5-byte (4-bit) quantization, I will probably be able to include Madlad among the options (and before that I could also add Whisper Base).
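Continuing the sketch above, a rough check of that combination against the half-of-phone-RAM rule of thumb (the 5GB total and the half-RAM assumption come from this thread; the 2GB "comfortable margin" threshold is just an illustrative choice):

```python
# Rule of thumb from above: an app can usually use about half of the phone's RAM.
WHISPER_SMALL_PLUS_MADLAD_GB = 5.0  # observed total mentioned above

for phone_ram_gb in (8, 12, 16):
    budget_gb = phone_ram_gb / 2
    margin_gb = budget_gb - WHISPER_SMALL_PLUS_MADLAD_GB
    if margin_gb < 0:
        status = "does not fit"
    elif margin_gb < 2:
        status = "fits, but too close to the limit"
    else:
        status = "fits comfortably"
    print(f"{phone_ram_gb}GB phone -> {budget_gb:.0f}GB budget, "
          f"margin {margin_gb:+.0f}GB: {status}")
```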

I have already gone on too long 🙃, so for execution speed I'll just tell you that I can only use the CPU, because to use the GPU I would have to use Android APIs (NNAPI) that are only supported by a few CPU models 😡 (my Snapdragon 8+ Gen 1 is not supported, for example).

Can corrections be made during the conversation to prevent people from understanding and translating the wrong word? (Walkie Talkie Mode).
(Walkie Talkie Mode) Can it be adapted for a single language?
Is it possible to input voice for Conversation Mode? (Without keyboard feature)

I didn't understand these questions; what do you mean?

@yenerismail
Author

Hello,
Firstly, thank you for your reply.
I live in Türkiye and my language is Turkish. For translation, I use Google Translate.
"Can corrections be made during the conversation to prevent people from understanding and translating the wrong word? (Walkie Talkie Mode)."
Example:
Spoken: "Mr. Ismail, shall I pour some tea?"
Translation: "Brother Esma, shall I pour some tea?"

I think this accuracy issue is due to Whisper.
I wanted to ask whether corrections could be made for errors like these.
I don't think there can be a permanent fix; this could just be an option, since accuracy rates vary for each language.

(Walkie Talkie Mode) Can it be adapted for a single language?
I thought that, since it is the same language, there would be no translation differences between the two people speaking.
One of the people speaking may be deaf or hard of hearing, so I suggested this with that in mind.

Enjoy your work,

@Kishlay-notabot

Thanks for explaining in such nice detail, @niedev. I'm not into AI, but it's fun to know!

@niedev
Owner

niedev commented Jun 27, 2024

Can it be adapted for a single language?

You can already do this by setting the same language for both languages in WalkieTalkie mode; I adapted WalkieTalkie mode to become practically a transcriber in that case.

Can corrections be made during the conversation to prevent people from understanding and translating the wrong word?

Turkish seems to have problems with translation as well. In this particular case, the language identification probably failed and the app translated English text into English, thinking it was Turkish. Solving this is complicated because the method I have found to improve language recognition hurts performance quite a bit, so I can implement this technique only once I optimize Whisper's speed even further (or maybe I will add an option to manually specify the spoken language in WalkieTalkie mode).
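For illustration only (RTranslator's ONNX pipeline is wired differently, and "recording.wav" is just a placeholder path): with the reference openai-whisper Python package, forcing the language instead of auto-detecting it looks roughly like this, which removes the misdetection failure mode at the cost of having to know the language upfront.

```python
# pip install openai-whisper
import whisper

model = whisper.load_model("small")

# Default: Whisper detects the language from the audio, which is the step that can fail.
auto = model.transcribe("recording.wav")
print("detected language:", auto["language"])

# Forcing the language skips detection entirely, so Turkish speech can no longer
# be misidentified as English (or vice versa).
forced = model.transcribe("recording.wav", language="tr")
print("transcription:", forced["text"])
```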

@data-man

Great project, thank you!

I hope https://github.com/ggerganov/whisper.cpp can be useful for you.

@niedev
Owner

niedev commented Jun 30, 2024

Thank you @data-man!
I already tried whisper.cpp during the development of RTranslator 2.0, but its inference speed was slower than OnnxRuntime's, so in the end I opted for the latter.

@data-man

Oh, I forgot about https://github.com/rhasspy/piper. :)

@niedev
Owner

niedev commented Jun 30, 2024

Oh, I forgot about https://github.com/rhasspy/piper. :)

Oh, I didn't know about these models; I'll take a look at them, thanks!

@yenerismail
Author

@niedev
Owner

niedev commented Jul 4, 2024

@yenerismail Thank you! I'll take a look at these projects.

niedev closed this as completed on Jul 4, 2024