
MetaVoice 1B TTS: New and Improved Artificial Intelligence Capabilities as well as Improved User Interface. #194

Open: wants to merge 2 commits into main.
Conversation

RahulVadisetty91

1. Summary:

This pull request brings several AI advancements to the MetaVoice-1B TTS model, along with significant changes to the user interface. New dynamic speech parameters, a top_p slider for speech stability and a guidance slider for speaker similarity, let users tune speech synthesis to their preference. Voice cloning gains stricter validation of uploaded voice samples and better handling of edge cases. The user interface is also improved, with a clearer voice-selection layout and more informative error messages backed by a better error-handling mechanism, so that users are pointed toward a fix whenever a problem occurs.

2. Related Issues:

These updates address areas of speech synthesis that needed further development, including voice-cloning accuracy and the interface. The dynamic speech parameters and improved error handling respond directly to user complaints and earlier testing.

3. Discussions:

Discussions covered the need to give users control over the generated speech, both its stability and its similarity to the target speaker. The necessity of validating voice samples to achieve high-quality cloning and the proposed interface improvements were discussed as well. The importance of accurate, detailed error messages for a better user experience was also underlined.

4. QA Instructions:

  • Adjust the top_p and guidance sliders and verify that they provide the intended control over speech stability and speaker similarity.
  • Test voice cloning with several audio files to confirm that the new validation rules are enforced and that cloning produces consistent results.
  • Review the user-interface changes: the voice-selection layout should adapt to the selected option, and any error messages displayed should be easy to understand.
  • Exercise the error-handling system by triggering different errors, including character-limit and file-format problems, and confirm that each produces a clear notification.

5. Merge Plan:

Once QA testing passes, the branch will be merged into main. The merge will be scheduled so as to minimise disruption to ongoing development, with particular attention to the new dynamic speech parameters and the voice-cloning improvements.

6. Motivation and Context:

These updates aim to improve the efficiency, scalability, and practicality of the MetaVoice-1B TTS model. The dynamic speech parameters give users fine-grained control over the speech output, making the model more versatile. The improved voice cloning and interface design take user complaints into account, reduce friction, and make text-to-speech conversion more effective overall.

7. Types of Changes:

  • New Feature: dynamic speech parameters: top_p for speech stability and guidance for speaker similarity.
  • Enhancement: stricter voice-cloning validation, particularly for edge cases.
  • UI Improvement: a reworked voice-selection layout and clearer error messages.
  • Error Handling: more informative error and warning messages so users know what action to take.

This commit introduces several key enhancements to the MetaVoice-1B text-to-speech (TTS) model, focusing on improving AI capabilities and user interaction:

Advanced Speech Parameters:

Added functionality for dynamic adjustment of speech stability and speaker similarity. Users can now fine-tune the top_p (speech stability) and guidance (speaker similarity) parameters through sliders, allowing for more personalized and controlled speech output.
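In most TTS pipelines, top_p refers to nucleus sampling over the model's token distribution, which is why a lower value yields more stable (less varied) speech. A minimal illustration of what the top_p parameter controls, assuming standard nucleus sampling (this is not MetaVoice's actual sampling code):

```python
def top_p_filter(probs: list[float], top_p: float) -> list[float]:
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p; zero out the rest and renormalise."""
    # Sort token indices by probability, highest first.
    indexed = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for i, p in indexed:
        kept.append((i, p))
        cum += p
        if cum >= top_p:  # nucleus reached; discard the long tail
            break
    total = sum(p for _, p in kept)
    filtered = [0.0] * len(probs)
    for i, p in kept:
        filtered[i] = p / total  # renormalise the surviving mass
    return filtered
```

With top_p = 0.7 and probabilities [0.5, 0.3, 0.15, 0.05], only the two most likely tokens survive (renormalised to 0.625 and 0.375), which is the mechanism behind "more stable" output at lower top_p.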

Enhanced Voice Cloning:

Improved handling of uploaded voice samples for cloning. The script now validates file size and duration, ensuring that uploaded samples are suitable for high-quality voice synthesis. Samples must be between 30 and 90 seconds long and under 50 MB for optimal performance.
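The 30-90 second and 50 MB rules above could be enforced with a small check like the following. The function name and the convention of returning a list of human-readable problems are illustrative assumptions, not the PR's exact code:

```python
MIN_DURATION_S = 30
MAX_DURATION_S = 90
MAX_SIZE_BYTES = 50 * 1024 * 1024  # 50 MB

def validate_voice_sample(duration_s: float, size_bytes: int) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    if not (MIN_DURATION_S <= duration_s <= MAX_DURATION_S):
        errors.append(
            f"Sample must be {MIN_DURATION_S}-{MAX_DURATION_S}s long, "
            f"got {duration_s:.1f}s."
        )
    if size_bytes > MAX_SIZE_BYTES:
        errors.append(f"Sample must be under 50 MB, got {size_bytes / 1e6:.1f} MB.")
    return errors
```

Returning all problems at once, rather than failing on the first, lets the UI show the user everything that needs fixing in a single error message.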

User Interface Improvements:

Updated the user interface to provide a more intuitive experience. Users can choose between preset voices and uploaded target voices, with automatic layout adjustments based on the selected option. The interface now features clear labels and better organization for ease of use.

Robust Error Handling:

Enhanced error handling to manage edge cases and provide informative feedback. The script includes comprehensive checks and error messages for input validation, such as handling text length limits and ensuring uploaded files meet the required criteria.

These updates aim to enhance the functionality, usability, and robustness of the MetaVoice-1B TTS model, delivering a more versatile and user-friendly text-to-speech solution.

Signed-off-by: Rahul Vadisetty <[email protected]>
Enhance TTS Model with Advanced Speech Parameters and Improved Voice Cloning
@FurkanGozukara

awesome

with cloning do you do training or zero shot?

if zero shot, how many seconds of source speaking video do you suggest?

@RahulVadisetty91
Author

> awesome
>
> with cloning do you do training or zero shot?
>
> if zero shot, how many seconds of source speaking video do you suggest?

Thank you for the feedback! Regarding cloning, we currently use zero-shot learning to clone voices. In our tests, a source video with around 5-10 seconds of clear, high-quality audio gives optimal results for speaker similarity. However, if needed, we can experiment with different durations to see how they affect quality.
