Docs text to speech #671

Closed · wants to merge 1 commit
927 changes: 304 additions & 623 deletions ai/api-reference/gateway.openapi.yaml

Large diffs are not rendered by default.

21 changes: 21 additions & 0 deletions ai/api-reference/text-to-speech.mdx
@@ -0,0 +1,21 @@
---
openapi: post /text-to-speech
---

<Info>
  The default Gateway used in this guide is the public
  [Livepeer.cloud](https://www.livepeer.cloud/) Gateway. It is free to use but
  not intended for production applications. For production use, consider the
  [Livepeer Studio](https://livepeer.studio/) Gateway, which requires an API
  token. Alternatively, you can set up your own Gateway node or partner with
  one via the `ai-video` channel on [Discord](https://discord.gg/livepeer).
</Info>

<Note>
  Please note that the exact parameters, default values, and responses may vary
  between models. For more information on model-specific parameters, refer to
  the respective model documentation in the [text-to-speech
  pipeline](/ai/pipelines/text-to-speech). Not all parameters may be available
  for a given model.
</Note>
7 changes: 7 additions & 0 deletions ai/orchestrators/models-config.mdx
@@ -49,6 +49,13 @@ currently **recommended** models and their respective prices.
"SFAST": true,
"DEEPCACHE": false
}
},
{
"pipeline": "text-to-speech",
"model_id": "parler-tts/parler-tts-large-v1",
"price_per_unit": 11,
"pixels_per_unit": 1e2,
"currency": "USD"
}
]
```
7 changes: 7 additions & 0 deletions ai/pipelines/overview.mdx
@@ -82,4 +82,11 @@ pipelines:
The segment-anything-2 pipeline offers promptable visual segmentation for
images and videos.
</Card>
<Card
title="Text-to-Speech"
icon="message-dots"
href="/ai/pipelines/text-to-speech"
>
The text-to-speech pipeline generates high-quality, natural-sounding speech in the style of a given speaker (gender, pitch, speaking style, etc.).
</Card>
</CardGroup>
76 changes: 76 additions & 0 deletions ai/pipelines/text-to-speech.mdx
@@ -0,0 +1,76 @@
---
title: Text-to-Speech
---

## Overview

The text-to-speech endpoint in Livepeer utilizes [Parler-TTS](https://github.com/huggingface/parler-tts), specifically `parler-tts/parler-tts-large-v1`. This model can generate speech with customizable characteristics such as voice type, speaking style, and audio quality.

## Basic Usage Instructions

<Tip>
For a detailed understanding of the `text-to-speech` endpoint and to experiment
with the API, see the [Livepeer AI API
Reference](/ai/api-reference/text-to-speech).
</Tip>

To use the text-to-speech feature, submit a POST request to the `/text-to-speech` endpoint. Here's an example of how to structure your request:

```bash
curl -X POST "http://<GATEWAY_IP>/text-to-speech" \
-H "Content-Type: application/json" \
-d '{
"model_id": "parler-tts/parler-tts-large-v1",
"text_input": "A cool cat on the beach",
"description": "Jon's voice is monotone yet slightly fast in delivery, with a very close recording that almost has no background noise."
}'
```

### Request Parameters

- `model_id`: The ID of the text-to-speech model to use. Currently, this should be set to `"parler-tts/parler-tts-large-v1"`.
- `text_input`: The text you want to convert to speech.
- `description`: A description of the desired voice characteristics. This can include details about the speaker's voice, speaking style, and audio quality.
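For reference, the same request can be issued from Python using only the standard library. This is an illustrative sketch, not an official client: the `build_tts_payload` and `synthesize` helper names are hypothetical, and `<GATEWAY_IP>` must be replaced with a real Gateway address before the request will succeed.

```python
import json
import urllib.request

GATEWAY_URL = "http://<GATEWAY_IP>/text-to-speech"  # replace with your Gateway


def build_tts_payload(text, description,
                      model_id="parler-tts/parler-tts-large-v1"):
    """Assemble the JSON body expected by the /text-to-speech endpoint."""
    return {
        "model_id": model_id,
        "text_input": text,
        "description": description,
    }


def synthesize(text, description):
    """POST the request and return the raw response body."""
    body = json.dumps(build_tts_payload(text, description)).encode("utf-8")
    req = urllib.request.Request(
        GATEWAY_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```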

### Voice Customization

You can customize the generated voice by adjusting the `description` parameter. Some aspects you can control include:

- Speaker identity (e.g., "Jon's voice")
- Speaking style (e.g., "monotone", "expressive")
- Speaking speed (e.g., "slightly fast")
- Audio quality (e.g., "very close recording", "no background noise")

The checkpoint was trained on 34 speakers. The full list of available speakers includes: Laura, Gary, Jon, Lea, Karen, Rick, Brenda, David, Eileen, Jordan, Mike, Yann, Joy, James, Eric, Lauren, Rose, Will, Jason, Aaron, Naomie, Alisa, Patrick, Jerry, Tina, Jenna, Bill, Tom, Carol, Barbara, Rebecca, Anna, Bruce, and Emily.

However, the model performs better with certain speakers. A list of the top 20 speakers for each model variant, ranked by average speaker-similarity score, is available in the [Parler-TTS inference guide](https://github.com/huggingface/parler-tts/blob/main/INFERENCE.md#speaker-consistency).
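One way to keep voice descriptions consistent across requests is to assemble them from the controllable aspects above. The helper below is purely illustrative (`make_voice_description` is a hypothetical name; the endpoint only ever receives the final string):

```python
def make_voice_description(
    speaker,
    style="expressive",
    speed="at a moderate pace",
    quality="a very close recording with almost no background noise",
):
    """Compose a free-form voice description from the aspects
    Parler-TTS responds to: speaker identity, style, speed, quality."""
    return (
        f"{speaker}'s voice is {style} and delivered {speed}, "
        f"with {quality}."
    )
```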

## Limitations and Considerations

- The maximum length of the input text is limited. The default Parler-TTS [training configuration](https://github.com/huggingface/parler-tts/blob/main/training/README.md#3-training) caps audio at 30 seconds and input text at 600 characters, so long-form content must be split into smaller chunks.
- While the model supports various voice characteristics, the exact replication of a specific speaker's voice is not guaranteed.
- The quality of the generated speech can vary based on the complexity of the input text and the specificity of the voice description.
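Given the input-length limit above, long-form content needs client-side chunking before synthesis. The sketch below splits text on sentence boundaries while keeping each chunk under a limit; the 600-character default is taken from the Parler-TTS training configuration, and `chunk_text` is a hypothetical helper, not part of any API:

```python
import re


def chunk_text(text, max_chars=600):
    """Split text into chunks of at most max_chars characters,
    preferring sentence boundaries over hard cuts."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any single sentence that exceeds the limit on its own.
        while len(sentence) > max_chars:
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if len(current) + len(sentence) + 1 <= max_chars:
            current = f"{current} {sentence}".strip()
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be submitted as a separate `/text-to-speech` request and the resulting audio segments concatenated client-side.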

## Orchestrator Configuration

To configure your Orchestrator to serve the `text-to-speech` pipeline, refer to
the [Orchestrator Configuration](/ai/orchestrators/get-started) guide.

### System Requirements

The following system requirements are recommended for optimal performance:

- [NVIDIA GPU](https://developer.nvidia.com/cuda-gpus) with **at least 12GB** of
VRAM.

## API Reference

<Card
title="API Reference"
icon="rectangle-terminal"
href="/ai/api-reference/text-to-speech"
>
Explore the `text-to-speech` endpoint and experiment with the API in the
Livepeer AI API Reference.
</Card>
4 changes: 4 additions & 0 deletions api-reference/generate/text-to-speech.mdx
@@ -0,0 +1,4 @@
---
title: "Text to Speech"
openapi: "POST /api/beta/generate/text-to-speech"
---
4 changes: 3 additions & 1 deletion mint.json
@@ -538,6 +538,7 @@
"ai/pipelines/image-to-video",
"ai/pipelines/segment-anything-2",
"ai/pipelines/text-to-image",
"ai/pipelines/text-to-speech",
"ai/pipelines/upscale"
]
},
@@ -602,7 +603,8 @@
"ai/api-reference/image-to-image",
"ai/api-reference/image-to-video",
"ai/api-reference/segment-anything-2",
"ai/api-reference/upscale"
"ai/api-reference/upscale",
"ai/api-reference/text-to-speech"
]
}
]