A voice cloning and text-to-speech application that can generate speech in any voice.
- Frontend: React
- Backend: FastAPI
- Text-to-speech: Tortoise TTS
- Clone the repository:
git clone https://github.com/taeefnajib/vocazee.git
cd vocazee
- Build and start the containers:
docker compose up --build
Note: The first build will take some time as it downloads necessary AI models (>1GB). This is a one-time setup.
- Access the application:
- Frontend: http://localhost:3000
- Backend API: http://localhost:8000
- API Documentation: http://localhost:8000/docs
From the Web Interface:
- Go to http://localhost:3000
- Switch to the "Train Custom Voice" tab
- Enter a name for your voice
- Record a clear audio file of your voice reading the provided text
- Click "Train Voice"
- Wait for the training to complete (usually takes 1-2 minutes)
From Command Line (Advanced): Go to the server directory. Create and activate a virtual environment, then install dependencies by running pip install -r requirements.txt. Now follow these steps:

# 1. First, process your audio file
python generate_voice.py --input_file path/to/your/audio.wav --output_dir voices/your_voice_name

# 2. Generate voice embeddings
python save_embeddings.py --voice_dir voices/your_voice_name

# 3. Cache voice latents for faster generation
python cache_voice_latents.py --voice_dir voices/your_voice_name
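For scripting, the three command-line steps above can be wrapped in a small Python runner. This is a convenience sketch, not part of the repository; the script names and flags are taken verbatim from the steps above, and it must be run from the server directory:

```python
import subprocess

def build_pipeline(input_file: str, voice_name: str) -> list:
    """Build the three training commands, in order, for a new voice."""
    voice_dir = f"voices/{voice_name}"
    return [
        ["python", "generate_voice.py", "--input_file", input_file, "--output_dir", voice_dir],
        ["python", "save_embeddings.py", "--voice_dir", voice_dir],
        ["python", "cache_voice_latents.py", "--voice_dir", voice_dir],
    ]

def run_pipeline(input_file: str, voice_name: str) -> None:
    """Run the steps in order, stopping on the first failure."""
    for cmd in build_pipeline(input_file, voice_name):
        subprocess.run(cmd, check=True)

# Usage (from the server directory):
# run_pipeline("path/to/your/audio.wav", "your_voice_name")
```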
Tips for best results:
- Use high-quality audio with minimal background noise
- Record in a quiet environment
- Speak clearly and at a natural pace
- Aim for at least 120 seconds of audio
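The 120-second guideline can be checked before training using Python's standard wave module. A minimal sketch; the threshold mirrors the tip above (a recommendation, not something the app enforces), and the function names are illustrative:

```python
import wave

MIN_SECONDS = 120  # recommended minimum from the tips above

def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def long_enough(path: str, minimum: float = MIN_SECONDS) -> bool:
    """True if the recording meets the recommended minimum length."""
    return wav_duration_seconds(path) >= minimum
```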
From the Web Interface:
- Go to http://localhost:3000
- Select a trained voice from the dropdown
- Enter or paste the text you want to convert to speech
- Toggle "High Quality" if desired (slower but better quality)
- Click "Generate Speech"
- Once complete, use the audio player to listen or download the generated audio
Using the API directly:
curl -X POST http://localhost:8000/generate-speech \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Your text here",
    "voice_name": "your_voice_name",
    "high_quality": false
  }'
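The same call can be made from Python's standard library. A sketch assuming the JSON body shown in the curl example; the shape of the server's response is not documented here, so generate_speech simply returns the parsed JSON:

```python
import json
import urllib.request

API_URL = "http://localhost:8000/generate-speech"

def build_request(text: str, voice_name: str, high_quality: bool = False) -> urllib.request.Request:
    """Build a POST request equivalent to the curl command above."""
    body = json.dumps(
        {"text": text, "voice_name": voice_name, "high_quality": high_quality}
    ).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate_speech(text: str, voice_name: str, high_quality: bool = False) -> dict:
    """Send the request to a running server and return the parsed JSON response."""
    with urllib.request.urlopen(build_request(text, voice_name, high_quality)) as resp:
        return json.loads(resp.read())
```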
Each voice in the voices directory should have the following structure:
voices/
└── your_voice_name/
├── original.wav # Original audio file
    ├── chunks/            # Processed audio chunks
├── voice_latents.pth # Cached voice latents
└── embeddings.pt # Voice embeddings
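A quick sanity check of this layout can be scripted. A hedged sketch (missing_entries and EXPECTED are illustrative names, not part of the repository); it reports which expected entries are absent from a voice directory:

```python
from pathlib import Path

# Entries expected inside voices/<name>/, per the layout above
# (value indicates whether the entry is a directory)
EXPECTED = {
    "original.wav": False,       # original audio file
    "chunks": True,              # processed audio chunks
    "voice_latents.pth": False,  # cached voice latents
    "embeddings.pt": False,      # voice embeddings
}

def missing_entries(voice_dir: str) -> list:
    """Return the expected entries that are absent from a voice directory."""
    root = Path(voice_dir)
    missing = []
    for name, is_dir in EXPECTED.items():
        path = root / name
        if not (path.is_dir() if is_dir else path.is_file()):
            missing.append(name)
    return missing
```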
- POST /create-voice: Train a new voice
- GET /voices: List all available voices
- POST /generate-speech: Generate speech from text
- GET /audio/{generation_id}/{part}: Get generated audio file
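A thin client for the read-only endpoints above might look like the following sketch. The helper names are illustrative; the /voices response is assumed to be a JSON array, and the audio URL follows the path template in the list above:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"

def list_voices() -> list:
    """GET /voices -- list all trained voices (assumes a JSON array response)."""
    with urllib.request.urlopen(f"{BASE_URL}/voices") as resp:
        return json.loads(resp.read())

def audio_url(generation_id: str, part: int) -> str:
    """Build the URL for GET /audio/{generation_id}/{part}."""
    return f"{BASE_URL}/audio/{generation_id}/{part}"

def download_audio(generation_id: str, part: int, dest: str) -> None:
    """Save one part of a generated audio clip to a local file."""
    urllib.request.urlretrieve(audio_url(generation_id, part), dest)
```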
If the server is slow on first request:
- This is normal as models are being loaded
- Subsequent requests will be faster
If voice training fails:
- Ensure audio is clear and has minimal background noise
- Try recording a longer sample
- Check if the audio format is supported (WAV recommended)
If speech generation is stuck:
- Check server logs using docker logs vocazee-server-1
- Ensure the voice model exists and is properly trained
- Try with a shorter text first
MIT License