This project was created as part of the course 11785 Introduction to Deep Learning at Carnegie Mellon University.
Previous studies have demonstrated a significant statistical correlation between a person's facial structure and their voice. This correlation is attributed to both direct and indirect factors. Directly, the skeletal structure of the face determines the acoustic properties of the vocal tract, which is responsible for voice production. Indirectly, environmental factors that affect facial development may also affect the voice. Furthermore, demographic factors such as age, gender, and ethnicity have also been shown to affect both facial structure and voice.
Building on this research, our project proposes algorithms to generate a person's voice from images of their face. However, the relationship between faces and voices is not given explicitly and must be learned. To do this, we aim to extract the facial features that influence the voice and map them to their corresponding effects on voice quality (pitch, timbre). We then generate a potential voice from these predicted qualities. Further, we propose to apply style transfer to the generated voice so that it imitates the speaking style (pauses, accent) of the person in question while keeping the voice of the person whose voice we want to recreate.
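To make the idea of learning a face-to-voice mapping concrete, here is a minimal toy sketch of the mapping step only. It assumes two hypothetical numeric facial features, a normalized pitch value as the single voice quality, and a linear relationship fitted by gradient descent on synthetic data. None of these choices come from the project itself; a real system would more likely use learned image embeddings and a neural network, and the names `fit_linear_map` and `predict` are illustrative.

```python
import random


def fit_linear_map(features, targets, lr=0.5, epochs=2000):
    """Fit targets ~ w . features + b with batch gradient descent."""
    n, d = len(features), len(features[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, targets):
            # Prediction error for this sample under the current parameters.
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        # Step against the mean gradient of the squared error.
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Apply the fitted linear map to one feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


# Synthetic stand-in data (assumption): two facial features in [0, 1],
# e.g. a normalized jaw width and face length. These are hypothetical.
random.seed(0)
faces = [[random.random(), random.random()] for _ in range(200)]

# Invented ground-truth relation, used only to generate toy targets.
pitches = [0.8 - 0.5 * f[0] + 0.2 * f[1] for f in faces]

w, b = fit_linear_map(faces, pitches)
predicted_pitch = predict(w, b, [0.5, 0.5])
print(round(predicted_pitch, 3))
```

The same structure extends to several voice qualities by fitting one map per quality (for example, one for pitch and one for each timbre descriptor). The generation and style-transfer stages are not covered by this sketch.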