The Echogarden speech toolset supports all of Piper's voices, with many additional features and enhancements #674
rotemdan started this conversation in Show and tell
Hi, I'm the developer of Echogarden, a cross-platform speech toolset that runs on the Node.js runtime (GPL-3 licensed).
It works as a command-line application or as a Node.js library. It's very easy to install, thanks to `npm`, and requires no compilation (a quick usage sketch follows the feature list below).

Echogarden has supported, and kept up to date with, all of Piper's ONNX models (currently a total of 123) since it was first published in late April 2023. It doesn't actually rely on Piper itself: it's an independent implementation that uses a custom WebAssembly port of eSpeak-ng and the Node.js binding for the ONNX runtime (`onnxruntime-node`) to load the raw `.onnx` models directly.

It adds many features and enhancements, some of which are not directly available in Piper:
- Heteronym disambiguation for words like `read`, `present` and `live` (US English only at the moment, about 30 - 50 words included). The model is rule-based, and uses surrounding words to decide on the most likely pronunciation of the target word.
- Rate and pitch modification via `sonic` and `rubberband` (WASM ports), in addition to the native pitch and time shift parameters of the models themselves.

(Interesting fact: almost all these features have been available since about August 2023, so they can't really be called "new".)
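For anyone who'd like to try this quickly, here's a rough sketch of installation and basic command-line use. The option names `--speed` and `--pitch` are illustrative shorthand; check the options documentation for the exact current set:

```bash
# Install globally via npm (no compilation step needed)
npm install echogarden -g

# Basic synthesis. The sentence contains the heteronym "read",
# which the rule-based disambiguation model resolves from the
# surrounding words:
echogarden speak "I read that book last year." speech.mp3

# Adjust rate and pitch (applied via the sonic/rubberband WASM
# ports, or via the models' own native parameters):
echogarden speak "A bit faster and higher." speech.mp3 --speed=1.2 --pitch=1.1
```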
It also supports 14 other synthesis engines (offline and online), including, for example, Google Translate, Microsoft Edge, and ElevenLabs, as well as many methods for speech recognition, forced alignment, voice isolation, speech translation, etc. That includes a custom, ONNX-based implementation of Whisper that took a very large amount of effort to develop.
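These operations follow the same command-line pattern as synthesis. As a sketch (argument order and output formats are worth double-checking against the documentation):

```bash
# Speech recognition: transcribe an audio file to text
echogarden transcribe recording.mp3 transcript.txt

# Forced alignment: align an existing transcript to the audio,
# producing timed subtitles
echogarden align recording.mp3 transcript.txt subtitles.srt
```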
With all that, the project currently has very low usage: only 207 stars so far, after a year and 8 months, despite a 25,000-line codebase with 23 auxiliary repositories (it had about 70 - 80 stars a year after the initial release).
That's likely because I have never personally publicized or announced it anywhere - not on forums, or on other repositories - and apparently neither did its users (for all I know).
Based on its issue tracker, there's a small group of dedicated users who seem to use it for alignment or speech recognition, but there's almost no mention of it being used for speech synthesis - which was actually one of the first areas originally implemented.
The low usage and visibility have reached the point where I've started to become concerned that the software will become outdated (effectively "vintage") before a meaningful number of people use it at all. The models would be superseded by newer ones, and some of the development effort wouldn't end up making the contribution it was intended to.
I noticed it has never been mentioned in Piper's discussions or issues, which I find very odd. I refrained from doing so myself due to my general aversion to all kinds of "marketing"- or "promotion"-like activities. I thought the quality of the software could speak for itself, and that users would spread the word organically, but apparently that didn't happen.
So, if your usage of Piper is mostly about running command lines, and you don't really need it as a Python library or a direct C++ dependency, you may want to consider trying the `vits` engine in Echogarden, which includes all of Piper's voices and is significantly easier to install and run. If you're looking to use custom models, they can be supported, but there hasn't been any visible demand for that so far.
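For example, to select a particular Piper voice through the `vits` engine (the voice identifier below is just one example; the full list is in the documentation):

```bash
# Select the vits engine and one of the Piper-derived voices
# (the voice identifier shown is an example)
echogarden speak "Trying one of Piper's voices through Echogarden." out.mp3 --engine=vits --voice=en_US-amy-medium
```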