
TTS is not able to handle time in replies #160

Open · nikito opened this issue Jun 9, 2023 · 11 comments
nikito (Contributor) commented Jun 9, 2023

For example, given "The time is 8:01 AM.", it will speak "The time is am", where "am" is the literal word "am" as in "I am". The time component itself is skipped entirely. I think this is related to the number-handling issue we saw before, mentioned in #148.

kristiankielhofner (Contributor) commented Jun 9, 2023

Yes, they are absolutely related. There's certainly some regex, etc., we could develop to pre-split that, but producing the output that's common in American English speech, going from "8:01 AM" to "eight o one a m", is tricky. Additionally, we'd need to account for languages/locales that don't use AM and PM, or represent them differently. Frankly, I don't know a good way to do that off the top of my head...
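For illustration, a minimal sketch of this kind of pre-splitting for en-US only, assuming the num2words library and an "o"-for-leading-zero convention (illustrative assumptions, not Willow code):

```python
# Minimal en-US time normalizer sketch: "8:01 AM" -> "eight o one a m".
# Assumes num2words (pip install num2words); locale handling is out of
# scope here, which is exactly the hard part noted above.
import re

from num2words import num2words

TIME_RE = re.compile(r"\b(\d{1,2}):(\d{2})\s*([AaPp])[Mm]\b")

def _spoken_time(match: re.Match) -> str:
    hour = num2words(int(match.group(1)))        # "8" -> "eight"
    minute = match.group(2)
    meridiem = f"{match.group(3).lower()} m"     # "AM" -> "a m"
    if minute == "00":
        mins = "o'clock"
    elif minute.startswith("0"):
        mins = f"o {num2words(int(minute))}"     # "01" -> "o one"
    else:
        mins = num2words(int(minute))            # "15" -> "fifteen"
    return f"{hour} {mins} {meridiem}"

def normalize_times(text: str) -> str:
    return TIME_RE.sub(_spoken_time, text)

print(normalize_times("The time is 8:01 AM."))
# The time is eight o one a m.
```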

nikito (Contributor, Author) commented Jun 9, 2023

Yeah agree, it is a tricky problem. It may also just be a limitation of the TTS in use. I know when I try this with Coqui or Piper they both handle numbers and these time values better (they still struggle with AM/PM, but they appear to say eight o' one and such with the syntax shown). But from a linguistic approach the regex or NLP for that would definitely get complex, probably require it's own library/module just to handle that 😮
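On the library point: general-purpose number verbalizers do exist. A quick look at num2words, mentioned here only as an example, not something Willow/WIS ships with:

```python
# num2words covers cardinals, ordinals, years, and a fair number of
# locales, which is roughly the scope a TTS pre-processor needs.
from num2words import num2words  # pip install num2words

print(num2words(8))                  # eight
print(num2words(31, to="ordinal"))   # thirty-first
print(num2words(2024, to="year"))    # twenty twenty-four
print(num2words(8, lang="de"))       # acht
```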

kristiankielhofner (Contributor) commented

Yeah, this certainly brings back the issue regarding alternate TTS engines. The problem there is that in the (ample) time I've spent with them, I'm increasingly convinced they're just not suitable for Willow as-is. As noted before, the dependencies are VERY messy and the performance is lackluster (we'd still have caching, so there's that).

At this point TTS is our biggest fundamental issue; these reports just keep flowing in compared to the rest of the Willow/WIS functionality. We certainly appreciate them, so keep them coming! My concern with Coqui, etc., is that we'd likely just swap one set of issues for an equivalent set of different ones. The open source models and frameworks for TTS generally seem to be very far behind their counterparts in the rest of the "model world", and none of them seem to meet our overall goal of providing a truly commercial-quality voice user interface.

So, as frustrating as it is currently, SpeechT5 is likely (in the end) still our best option, and it's clear that it will need significant pre-processing of text before it's handed to the processor itself. If you look at sentencepiece you can start to understand the fundamental challenges that all of the text models and architectures have...
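One quick way to see the tokenizer side of this, assuming the transformers library and the microsoft/speecht5_tts checkpoint (names taken from the Hugging Face hub, not from this thread):

```python
# Inspect how SpeechT5's sentencepiece tokenizer segments a reply.
from transformers import SpeechT5Processor

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
print(processor.tokenizer.tokenize("The time is 8:01 AM."))
# Digits don't appear in this tokenizer's vocabulary, so "8:01" falls
# through to unknown tokens and effectively vanishes from the speech,
# matching the behavior reported in this issue.
```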

nikito (Contributor, Author) commented Jun 9, 2023

If it's of any interest to you, for laughs I was able to get Willow to work with Coqui running as an independent server, and I get pretty good response times. I did have to modify the Willow code to add DEFAULT_ESP_WAV_DECODER_CONFIG(), as Coqui outputs WAV, but otherwise it works perfectly. Of course this isn't utilizing the nginx cache, but it's still very performant. 😄
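For anyone reproducing this setup, a minimal client against a standalone Coqui server could look like the following; the /api/tts endpoint and port 5002 are Coqui's tts-server defaults, and the host is an assumption:

```python
# Fetch synthesized speech from a standalone Coqui TTS server; the
# response body is WAV, which is why Willow needed a WAV decoder here.
import requests

resp = requests.get(
    "http://localhost:5002/api/tts",
    params={"text": "The time is eight o one a m."},
    timeout=30,
)
resp.raise_for_status()
with open("reply.wav", "wb") as f:
    f.write(resp.content)
```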

kristiankielhofner (Contributor) commented

That's great feedback!

Generally my goal (considering the level of effort required) is not only clean dependencies and code, but an actual performance improvement. Of course, comparable performance is the absolute minimum.

VITS-based models are clearly (to me at least) the future, and I've been working through Coqui and others to get VITS models working with ONNX, CUDA, and TRT (the hard one), which leads not only to easier packaging, dependency management, and cleaner distribution, but to higher performance as well.
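As a sketch of what that ONNX path looks like at inference time, assuming a VITS model already exported to ONNX with Piper-style input names (a given Coqui export may differ, so treat the names, path, and scales as assumptions):

```python
# Run a VITS ONNX export with onnxruntime, preferring CUDA when present.
import numpy as np
import onnxruntime

sess = onnxruntime.InferenceSession(
    "vits.onnx",  # hypothetical path to an exported model
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

phoneme_ids = np.array([[1, 5, 12, 9, 3]], dtype=np.int64)  # placeholder IDs
audio = sess.run(
    None,
    {
        "input": phoneme_ids,
        "input_lengths": np.array([phoneme_ids.shape[1]], dtype=np.int64),
        # [noise_scale, length_scale, noise_w] in Piper-style exports
        "scales": np.array([0.667, 1.0, 0.8], dtype=np.float32),
    },
)[0]
print(audio.shape)
```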

In terms of the WAV decoder on device, we're really trying to make FLAC the standard for audio decoding (ESP ADF doesn't currently support FLAC encoding), as opposed to building in additional libraries, bloating the firmware image, and doing weird parsing of Content-Type and other conditionals to determine the decoder.

nikito (Contributor, Author) commented Jun 9, 2023

Yeah, the only reason I added WAV is that the Coqui Docker image API unfortunately doesn't output FLAC. The model I'm playing with is the one we discussed before (the VITS model trained on the Jenny dataset). Agreed that if it can be exported to ONNX/CUDA/TRT, it would probably be much faster and easier to implement. 😃

skorokithakis commented

Sorry for hijacking this, but how do you get Willow to speak the response? The only thing I've managed to get it to speak is "ok".

skorokithakis commented
Ah, I see this only works with Home Assistant, is that correct? There's no way to play audio with the REST server?

hamishcunningham (Contributor) commented Aug 20, 2023 via email

ccsmart commented May 31, 2024

@nikito I would be interested in how Coqui would be configured for this. Maybe you have notes, or would be willing to create them?

On TTS latency I'm not very concerned, as it's only the audio feedback; STT and execution latency would, I'd guess, be unaffected.

On the current TTS, it seems some workaround was intended but may not actually be applied correctly, as I get these in the logs:

```
Got request for speaker CLB with text: The current time is 10:03 AM on May 31, 2024.
TTS: Text contains numbers, converting to words
TTS: Text after number substitution: ['The current time is 10:03 AM on May 31, 2024.']
```

So it seems the intention was to convert, e.g., "10" -> "ten" or "one zero", but it didn't happen...?
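One hypothetical way to end up with a log exactly like that: if the substitution step splits on a pattern that never actually matches, re.split() returns the whole string as a one-element list and nothing gets converted. A guess for illustration, not the actual WIS code:

```python
import re

text = "The current time is 10:03 AM on May 31, 2024."
# A pattern expecting space-delimited bare numbers never fires on
# "10:03", "31," or "2024.", so the string comes back untouched:
print(re.split(r" \d+ ", text))
# ['The current time is 10:03 AM on May 31, 2024.']
```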

nikito (Contributor, Author) commented May 31, 2024

Yeah, SpeechT5 isn't great with numbers and such, so in the split_arch branch we make Coqui the new default, with xtts as another option for those who want to tinker. There are no notes on it yet as it's still a development branch, but if you take a look in utils.sh there's a method added called build-xtts, and once you run it, it tells you how to turn on xtts. Otherwise, building and deploying from that branch will automatically use Coqui.
