Commit c99e885

Merge pull request coqui-ai#3373 from coqui-ai/add-doc-xtts
Add inference parameters
2 parents 4b35a1e + 7d1a6de commit c99e885

File tree

1 file changed: +20 -38 lines changed


docs/source/models/xtts.md

Lines changed: 20 additions & 38 deletions
@@ -81,42 +81,6 @@ tts.tts_to_file(text="It took me quite a long time to develop a voice, and now t
                 language="en")
 ```
 
-##### Streaming inference
-
-XTTS supports streaming inference. This is useful for real-time applications.
-
-```python
-import os
-import time
-import torch
-import torchaudio
-
-print("Loading model...")
-tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)
-model = tts.synthesizer.tts_model
-
-print("Computing speaker latents...")
-gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=["reference.wav"])
-
-print("Inference...")
-t0 = time.time()
-stream_generator = model.inference_stream(
-    "It took me quite a long time to develop a voice and now that I have it I am not going to be silent.",
-    "en",
-    gpt_cond_latent,
-    speaker_embedding
-)
-
-wav_chunks = []
-for i, chunk in enumerate(stream_generator):
-    if i == 0:
-        print(f"Time to first chunk: {time.time() - t0}")
-    print(f"Received chunk {i} of audio length {chunk.shape[-1]}")
-    wav_chunks.append(chunk)
-wav = torch.cat(wav_chunks, dim=0)
-torchaudio.save("xtts_streaming.wav", wav.squeeze().unsqueeze(0).cpu(), 24000)
-```
-
 #### 🐸TTS Command line
 
 ##### Single reference
@@ -150,14 +114,32 @@ or for all wav files in a directory you can use:
 
 To use the model API, you need to download the model files and pass config and model file paths manually.
 
-##### Calling manually
+#### Manual Inference
 
-If you want to be able to run with `use_deepspeed=True` and **enjoy the speedup**, you need to install deepspeed first.
+If you want to be able to call `load_checkpoint` with `use_deepspeed=True` and **enjoy the speedup**, you need to install deepspeed first.
 
 ```console
 pip install deepspeed==0.10.3
 ```
 
+##### Inference parameters
+
+- `text`: The text to be synthesized.
+- `language`: The language of the text to be synthesized.
+- `gpt_cond_latent`: The latent vector you get from `get_conditioning_latents()`. (You can cache it for faster inference with the same speaker.)
+- `speaker_embedding`: The speaker embedding you get from `get_conditioning_latents()`. (You can cache it for faster inference with the same speaker.)
+- `temperature`: The softmax temperature of the autoregressive model. Defaults to 0.65.
+- `length_penalty`: A length penalty applied to the autoregressive decoder. Higher settings cause the model to produce more terse outputs. Defaults to 1.0.
+- `repetition_penalty`: A penalty that prevents the autoregressive decoder from repeating itself during decoding. Can be used to reduce the incidence of long silences, "uhhhhhhs", etc. Defaults to 2.0.
+- `top_k`: Lower values mean the decoder produces more "likely" (aka boring) outputs. Defaults to 50.
+- `top_p`: Lower values mean the decoder produces more "likely" (aka boring) outputs. Defaults to 0.8.
+- `speed`: The speed rate of the generated audio. Defaults to 1.0. (Values far from 1.0 can produce artifacts.)
+- `enable_text_splitting`: Whether to split the text into sentences and generate audio for each sentence. It allows arbitrarily long input but might lose important context between sentences. Defaults to True.
+
+##### Inference
+
 ```python
 import os
 import torch
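
These parameters are accepted by the model's `inference()` call. Below is a minimal sketch of the manual-inference flow they plug into, assuming the XTTS config and checkpoint have already been downloaded; the `/path/to/xtts/` paths, the reference clip `reference.wav`, and the output name `xtts_manual.wav` are placeholders, and `use_deepspeed=True` can be dropped if deepspeed is not installed.

```python
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load the model from local config/checkpoint files (placeholder paths).
config = XttsConfig()
config.load_json("/path/to/xtts/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="/path/to/xtts/", use_deepspeed=True)
model.cuda()

# Compute the speaker conditioning once; both tensors can be cached (e.g. with
# torch.save) and reused for every later call with the same speaker.
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=["reference.wav"])

# Synthesize with a few of the parameters documented above.
out = model.inference(
    "It took me quite a long time to develop a voice, and now that I have it I am not going to be silent.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
    temperature=0.65,
    top_k=50,
    top_p=0.8,
    speed=1.0,
)

# The returned dict holds the generated waveform under "wav" (24 kHz).
torchaudio.save("xtts_manual.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)
```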
