add google tts for all voice families #73

Open
wants to merge 1 commit into main
Conversation

mobarski

I've added support for Google's TTS voices (tested: Studio, Journey, WaveNet, Neural, Standard).
The new method also shows how to render each speaker as a separate audio track and how to combine both tracks into a single output. The tracks are also saved separately to facilitate workflows involving animated AI avatars.
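For readers who want the general idea of the track handling, here is a minimal, hypothetical sketch using pydub (not the PR's actual code; the file names and the simple back-to-back concatenation are assumptions, the real method may interleave or overlay turns):

```python
from pydub import AudioSegment

# Hypothetical per-speaker tracks rendered by the TTS backend.
speaker1 = AudioSegment.from_file("speaker1.mp3")
speaker2 = AudioSegment.from_file("speaker2.mp3")

# Keep the separate tracks for avatar-animation workflows...
speaker1.export("track_speaker1.mp3", format="mp3")
speaker2.export("track_speaker2.mp3", format="mp3")

# ...and also combine them into a single output. Here the tracks are
# simply appended back to back for illustration.
combined = speaker1 + speaker2
combined.export("combined.mp3", format="mp3")
```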

I haven't touched pyproject.toml (or requirements.txt), as Poetry had issues with the pyproject file.
Please feel free to adjust the code as needed to merge it, as I might have less time for FOSS projects in the next few days.

@souzatharsis
Owner
Please add a section to the config doc on how to set up Google Cloud in order to run the Google TTS model. I would benefit from that too, in order to test the PR :)

https://github.com/souzatharsis/podcastfy/blob/main/usage/config.md

@mobarski
Author

Configuring Google's TTS is a pain in the a**. I've tried to capture all the required steps here, but there is a nonzero chance that I've missed something. When you replicate these steps, please take notes; they may be needed to create proper documentation.

Enable the Text-to-Speech API

  1. Go to https://console.cloud.google.com/apis/dashboard
  2. Select your project (or create one by clicking on the project list and then on "New project")
  3. Click "+ ENABLE APIS AND SERVICES" at the top of the screen
  4. Enter "text-to-speech" into the search box
  5. Click on "Cloud Text-to-Speech API" and then on "ENABLE"
  6. You should end up here: https://console.cloud.google.com/apis/library/texttospeech.googleapis.com?project=...

Configure a billing account

  1. Click "..." to the left of the profile picture (top-right corner)
  2. Select "Payment method" and add a payment method
  3. Click "..." near the profile picture again
  4. Select "Billing account management"
  5. Enable billing for your project

Configure the API client

  1. Open the terminal
  2. Install the Google Cloud CLI tools; on Ubuntu it's `sudo snap install google-cloud-cli`
  3. Create the application credentials file by running `gcloud auth application-default login`
  4. The browser will open; log into your account
  5. In the terminal you should see where the credentials were saved, e.g. `Credentials saved to file: /home/mobarski/.config/gcloud/application_default_credentials.json`
  6. Set the environment variable `GOOGLE_APPLICATION_CREDENTIALS` to the path of that file
  7. Install the Python package: `pip3 install google-cloud-texttospeech` (a quick way to verify the setup is sketched below)
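Once the credentials file and the environment variable are in place, a minimal synthesis call is an easy sanity check. This is only a hedged sketch; the voice name and output path are arbitrary choices, not anything required by the PR:

```python
from google.cloud import texttospeech

# Relies on GOOGLE_APPLICATION_CREDENTIALS pointing at the file created by
# `gcloud auth application-default login` (steps 5 and 6 above).
client = texttospeech.TextToSpeechClient()

response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(text="Hello from Cloud Text-to-Speech."),
    voice=texttospeech.VoiceSelectionParams(
        language_code="en-US",
        name="en-US-Wavenet-D",  # any supported voice works here
    ),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    ),
)

# Write the returned audio bytes to disk.
with open("hello.mp3", "wb") as f:
    f.write(response.audio_content)
```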

@souzatharsis
Owner

Thank you so much for the detailed instructions!

@evandempsey

> Configuring Google's TTS is a pain in the a**. […]

You can get around all this by just enabling the Cloud Text-to-Speech API on the API key you are already using for Gemini and passing it in when you instantiate the client.

client = texttospeech.TextToSpeechClient(client_options={'api_key': os.environ['GOOGLE_API_KEY']})

@souzatharsis
Owner

souzatharsis commented Oct 24, 2024 via email

@souzatharsis
Owner

@evandempsey It didn't work...

"Requests to this API texttospeech.googleapis.com method google.cloud.texttospeech.v1.TextToSpeech.SynthesizeSpeech are blocked."

Where did you import it from?

from google.cloud import texttospeech

@evandempsey

evandempsey commented Nov 5, 2024

@souzatharsis Yes, that's the one.

You probably need to add the API permission to the key you're using on the Google Cloud console.

Go to https://console.cloud.google.com/apis/credentials, click on whatever key you're using for Gemini, then go down to API Restrictions and add the Cloud Text-to-Speech API.
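Putting the two comments together, a minimal sketch of the API-key route might look like the following. This assumes the key already allows the Cloud Text-to-Speech API under its restrictions; the voice and output names are arbitrary:

```python
import os

from google.cloud import texttospeech

# Reuse the Gemini API key instead of application-default credentials.
client = texttospeech.TextToSpeechClient(
    client_options={"api_key": os.environ["GOOGLE_API_KEY"]}
)

response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(text="Testing the API-key route."),
    voice=texttospeech.VoiceSelectionParams(
        language_code="en-US", name="en-US-Neural2-C"
    ),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    ),
)

with open("api_key_test.mp3", "wb") as f:
    f.write(response.audio_content)
```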

@souzatharsis
Owner

souzatharsis commented Nov 5, 2024 via email

@souzatharsis
Owner

I've managed to integrate podcastfy with Google's multispeaker model and I think we've found what NotebookLM is using...

I am curious about your feedback before we merge into main. @brumar @mobarski @evandempsey @lfnovo

Should we make this the default TTS model?
Only a Gemini API Key would be required for running it end-to-end.

  1. Transformer paper PDF: https://www.veed.io/view/eb65150f-ef2a-447c-8cb9-43674453ca8f?panel=share
  2. Website (www.open-notebook.ai): https://www.veed.io/view/4c514532-9311-41a6-8af6-9053e14f7a5b?panel=share
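For anyone who wants to try the multi-speaker model themselves, a rough sketch based on Google's multi-speaker dialogue documentation (the feature lives in the v1beta1 client) looks like this. It is not podcastfy's actual integration code, and the class names should be treated as assumptions if the beta API has changed:

```python
from google.cloud import texttospeech_v1beta1 as texttospeech

client = texttospeech.TextToSpeechClient()

# Two speaker turns; the multi-speaker voice distinguishes speakers by label.
markup = texttospeech.MultiSpeakerMarkup(
    turns=[
        texttospeech.MultiSpeakerMarkup.Turn(text="Have you read the paper?", speaker="R"),
        texttospeech.MultiSpeakerMarkup.Turn(text="I have, and it's fascinating.", speaker="S"),
    ]
)

response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(multi_speaker_markup=markup),
    voice=texttospeech.VoiceSelectionParams(
        language_code="en-US", name="en-US-Studio-MultiSpeaker"
    ),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    ),
)

with open("dialogue.mp3", "wb") as f:
    f.write(response.audio_content)
```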

@evandempsey

Ah, you think they're using this? https://cloud.google.com/text-to-speech/docs/create-dialogue-with-multispeakers

It sounds great, and a bit more natural than what I was able to achieve in my experiments by burning through ElevenLabs credits.

My concern about setting it as the default is the rather irritating GCloud setup you'd then be forcing on people. But it should definitely be an option.

Have you found out the maximum length of audio you can synthesize with this? It doesn't seem to be documented.

@souzatharsis
Owner

souzatharsis commented Nov 6, 2024 via email

@souzatharsis
Owner

OK, done.
I've added setup instructions: https://github.com/souzatharsis/podcastfy/blob/main/usage/config.md
I'd welcome feedback.
Thanks!

@souzatharsis
Owner

Thank you so much for the feedback!

Support for Google's Multispeaker and Journey models has been released in v0.4.0.
(It was quite a bit of trouble to get them working, since they have several limitations: 5000 bytes max per input and 1500 bytes max per turn.)
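To illustrate the kind of constraint this imposes, here is a rough, hypothetical helper for keeping each turn under the byte budget. It is not podcastfy's actual implementation; the limits are simply the ones mentioned above:

```python
MAX_REQUEST_BYTES = 5000  # max total input per synthesis request (as noted above)
MAX_TURN_BYTES = 1500     # max bytes per speaker turn (as noted above)

def split_turn(text: str, limit: int = MAX_TURN_BYTES) -> list[str]:
    """Split a turn into chunks whose UTF-8 size stays under the limit."""
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate.encode("utf-8")) > limit:
            if current:
                chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```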

All sample audio in the README has been updated to use the new TTS model. I've added some longform podcasts too.

I've updated the Python notebook describing the longform podcast + new Google TTS model work:

https://github.com/souzatharsis/podcastfy/blob/main/podcastfy.ipynb

Would love your feedback!
Do you think it's closer to NotebookLM's?
