For commercial requests, please contact us at [email protected] or [email protected]. We have an HD model ready that can be used commercially.
This code is part of the paper: A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild published at ACM Multimedia 2020.
📑 Original Paper | 📰 Project Page | 🌀 Demo | ⚡ Live Testing | 📔 Colab Notebook |
---|---|---|---|---|
Paper | Project Page | Demo Video | Interactive Demo | Colab Notebook / Updated Colab Notebook |
- download the face detection model checkpoint:
!wget "https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth" -O "face_detection/detection/sfd/s3fd.pth"
- download the pretrained model checkpoint:
!wget "https://iiitaphyd-my.sharepoint.com/:u:/g/personal/radrabha_m_research_iiit_ac_in/Eb3LEzbfuKlJiR600lQWRxgBIY27JZg80f7V9jtMfbNDaQ?e=TBFBVW" -O "checkpoints/wave2lip.pth" !wget "https://iiitaphyd-my.sharepoint.com/:u:/g/personal/radrabha_m_research_iiit_ac_in/EdjI7bZlgApMqsVoEUUXpLsBxqXbn5z8VTmoxp55YNDcIA?e=n9ljGW" -O "checkpoints/wave2lip_gan.pth"
- run inference with Python:
python inference.py --checkpoint_path checkpoints/wav2lip_gan.pth --face face_data.mp4 --audio audio_data.wav --outfile final_result.mp4 --nosmooth
- Example results:
- Dog lip-sync result:
- Human lip-sync result:
- download model checkpoint:
!wget "https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth" -O "face_detection/detection/sfd/s3fd.pth" !wget "https://drive.google.com/file/d/154JgKpzCPW82qINcVieuPH3fZ2e0P812/view" -O "checkpoints/face_segmentation.pth"
- download the pretrained model checkpoint:
!wget "https://iiitaphyd-my.sharepoint.com/:u:/g/personal/radrabha_m_research_iiit_ac_in/Eb3LEzbfuKlJiR600lQWRxgBIY27JZg80f7V9jtMfbNDaQ?e=TBFBVW" -O "checkpoints/wave2lip.pth" !wget "https://iiitaphyd-my.sharepoint.com/:u:/g/personal/radrabha_m_research_iiit_ac_in/EdjI7bZlgApMqsVoEUUXpLsBxqXbn5z8VTmoxp55YNDcIA?e=n9ljGW" -O "checkpoints/wave2lip_gan.pth"
- run inference with Python:
python inference3.py --checkpoint_path checkpoints/wav2lip_gan.pth --segmentation_path "checkpoints/face_segmentation.pth" --face face_data.mp4 --audio audio_data.wav --outfile final_result.mp4
- Example results:
- Dog lip-sync result:
- Human lip-sync result:
Pretrained-model application inference3_makeup.py: sharpens the output and applies local face enhancement; reference: https://github.com/TencentARC/GFPGAN/
- download model checkpoint:
!wget "https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth" -O "face_detection/detection/sfd/s3fd.pth" !wget "https://github.com/TencentARC/GFPGAN/releases/download/v1.3.0/GFPGANv1.3.pth" -O "checkpoints/GFPGANv1.3.pth !wget "https://drive.google.com/file/d/154JgKpzCPW82qINcVieuPH3fZ2e0P812/view" -O "checkpoints/face_segmentation.pth" !wget "https://github.com/xinntao/facexlib/releases/download/v0.1.0/detection_Resnet50_Final.pth" -O "gfpgan/weights/detection_Resnet50_Final.pth" !wget "https://github.com/xinntao/facexlib/releases/download/v0.2.2/parsing_parsenet.pth" -O "gfpgan/weights/parsing_parsenet.pth" !wget "https://github.com/xinntao/Real-ESRGAN/releases/download/v0.2.1/RealESRGAN_x2plus.pt" -O "site-packages/weights/RealESRGAN_x2plus.pt"
- download the pretrained model checkpoint:
!wget "https://iiitaphyd-my.sharepoint.com/:u:/g/personal/radrabha_m_research_iiit_ac_in/Eb3LEzbfuKlJiR600lQWRxgBIY27JZg80f7V9jtMfbNDaQ?e=TBFBVW" -O "checkpoints/wave2lip.pth" !wget "https://iiitaphyd-my.sharepoint.com/:u:/g/personal/radrabha_m_research_iiit_ac_in/EdjI7bZlgApMqsVoEUUXpLsBxqXbn5z8VTmoxp55YNDcIA?e=n9ljGW" -O "checkpoints/wave2lip_gan.pth"
- run inference with Python:
python inference3_makeup.py --checkpoint_path checkpoints/wav2lip_gan.pth --segmentation_path "checkpoints/face_segmentation.pth" --gfpgan_path "checkpoints/GFPGANv1.3.pth" --face face_data.mp4 --audio audio_data.wav --outfile final_result.mp4 --bg_upsampler None
- Example results:
- Dog lip-sync result:
- Human lip-sync result:
Common training commands
- Training-data generation (preprocessing) command: python preprocess.py --ngpu 1 --batch_size 16 --data_root /home/guo/wave2lip/wave2lip_torch/Wav2Lip/data/original_data --preprocessed_root /home/guo/wave2lip/wave2lip_torch/Wav2Lip/data/preprocessed_root
- Training commands:
- python hq_wav2lip_train.py --data_root data/preprocessed_root/original_data --checkpoint_dir savedmodel --syncnet_checkpoint_path checkpoints/lipsync_expert.pth --checkpoint_path checkpoints/wav2lip_gan.pth --disc_checkpoint_path checkpoints/visual_quality_disc.pth
- python wav2lip_train.py --data_root data/preprocessed_root/original_data --checkpoint_dir savedmodel --syncnet_checkpoint_path checkpoints/lipsync_expert.pth --checkpoint_path checkpoints/wav2lip.pth
Model | Model class | Pretrained checkpoint argument | Description |
---|---|---|---|
1. Face detection model | SFDDetector | s3fd | No training required; loaded by default |
2. Wav2Lip model | Wav2Lip | checkpoint_path | Main model |
3. Expert discriminator model | SyncNet | syncnet_checkpoint_path | Expert Discriminator |
4. Visual quality discriminator model | Wav2Lip_disc_qual | disc_checkpoint_path | Visual Quality Discriminator |
- Weights of the visual quality disc have been updated in the readme!
- Lip-sync videos to any target speech with high accuracy 💯. Try our interactive demo.
- ✨ Works for any identity, voice, and language. Also works for CGI faces and synthetic voices.
- Complete training code, inference code, and pretrained models are available 💥
- Or, quick-start with the Google Colab Notebook: Link. Checkpoints and samples are available in a Google Drive folder as well. There is also a tutorial video on this, courtesy of What Make Art. Also, thanks to Eyal Gruss, there is a more accessible Google Colab notebook with more useful features. A tutorial Colab notebook is available at this link.
- 🔥 🔥 Several new, reliable evaluation benchmarks and metrics [evaluation/ folder of this repo] released. Instructions to calculate the metrics reported in the paper are also present.
python preprocess.py --ngpu 1 --data_root /home/guo/wave2lip/wave2lip_torch/Wav2Lip/data/original_data --preprocessed_root /home/guo/wave2lip/wave2lip_torch/Wav2Lip/data/preprocessed_root
python wav2lip_train.py --data_root ./data/preprocessed_root/original_data --checkpoint_dir ./savedmodel --syncnet_checkpoint_path ./checkpoints/lipsync_expert.pth
All results from this open-source code or our demo website should be used for research/academic/personal purposes only. As the models are trained on the LRS2 dataset, any form of commercial use is strictly prohibited. For commercial requests please contact us directly!
Python 3.6
- ffmpeg: sudo apt-get install ffmpeg
- Install necessary packages using pip install -r requirements.txt. Alternatively, instructions for using a docker image are provided here. Have a look at this comment and comment on the gist if you encounter any issues.
- Face detection pre-trained model should be downloaded to face_detection/detection/sfd/s3fd.pth. Alternative link if the above does not work.
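Putting the prerequisite steps together, a typical first-time setup could look like the sequence below. It simply combines commands already shown in this readme (ffmpeg, pip, and the s3fd face-detection download); the checkpoints directory is only an assumed location matching the download commands above.
sudo apt-get install ffmpeg
pip install -r requirements.txt
mkdir -p checkpoints
wget "https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth" -O "face_detection/detection/sfd/s3fd.pth"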
Model | Description | Link to the model |
---|---|---|
Wav2Lip | Highly accurate lip-sync | Link |
Wav2Lip + GAN | Slightly inferior lip-sync, but better visual quality | Link |
Expert Discriminator | Weights of the expert discriminator | Link |
Visual Quality Discriminator | Weights of the visual disc trained in a GAN setup | Link |
You can lip-sync any video to any audio:
python inference3.py --checkpoint_path <ckpt> --face <video.mp4> --audio <an-audio-source>
The result is saved (by default) in results/result_voice.mp4. You can specify it as an argument, similar to several other available options. The audio source can be any file supported by FFMPEG containing audio data: *.wav, *.mp3 or even a video file, from which the code will automatically extract the audio.
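If you would rather extract the audio track yourself before running inference, a plain FFMPEG call like the one below works; the file names are placeholders, and the 16 kHz mono settings are just a common choice rather than a requirement of this code.
ffmpeg -y -i speech_video.mp4 -vn -ac 1 -ar 16000 audio_data.wav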
- Experiment with the --pads argument to adjust the detected face bounding box. Often leads to improved results. You might need to increase the bottom padding to include the chin region, e.g. --pads 0 20 0 0.
- If you see the mouth position dislocated or some weird artifacts such as two mouths, it can be because of over-smoothing the face detections. Use the --nosmooth argument and give it another try.
- Experiment with the --resize_factor argument to get a lower-resolution video. Why? The models are trained on faces which were at a lower resolution. You might get better, visually pleasing results for 720p videos than for 1080p videos (in many cases, the latter works well too).
- The Wav2Lip model without GAN usually needs more experimenting with the above two to get the most ideal results, and sometimes can give you a better result as well. A combined example follows this list.
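For reference, a single invocation combining these tips might look like the following; the padding and resize values are illustrative and should be tuned per video.
python inference.py --checkpoint_path checkpoints/wav2lip_gan.pth --face face_data.mp4 --audio audio_data.wav --pads 0 20 0 0 --resize_factor 2 --nosmooth --outfile final_result.mp4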
Our models are trained on LRS2. See here for a few suggestions regarding training on other datasets.
data_root (mvlrs_v1)
├── main, pretrain (we use only main folder in this work)
| ├── list of folders
| │ ├── five-digit numbered video IDs ending with (.mp4)
Place the LRS2 filelists (train, val, test) .txt files in the filelists/ folder.
python preprocess.py --data_root data_root/main --preprocessed_root lrs2_preprocessed/
Additional options like batch_size and the number of GPUs to use in parallel can also be set.
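For example, a multi-GPU preprocessing run with a larger batch, using the --ngpu and --batch_size flags shown earlier in this readme (the values here are purely illustrative), could be:
python preprocess.py --ngpu 2 --batch_size 32 --data_root data_root/main --preprocessed_root lrs2_preprocessed/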
preprocessed_root (lrs2_preprocessed)
├── list of folders
| ├── Folders with five-digit numbered video IDs
| │ ├── *.jpg
| │ ├── audio.wav
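As a quick, purely illustrative sanity check that preprocessing produced this layout, you can count the extracted frames and audio files; the expected counts depend on your dataset.
find lrs2_preprocessed -name audio.wav | wc -l
find lrs2_preprocessed -name "*.jpg" | wc -l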
There are two major steps: (i) Train the expert lip-sync discriminator, (ii) Train the Wav2Lip model(s).
You can download the pre-trained weights if you want to skip this step. To train it:
python color_syncnet_train.py --data_root lrs2_preprocessed/ --checkpoint_dir <folder_to_save_checkpoints>
You can either train the model without the additional visual quality discriminator (< 1 day of training) or use the discriminator (~2 days). For the former, run:
python wav2lip_train.py --data_root lrs2_preprocessed/ --checkpoint_dir <folder_to_save_checkpoints> --syncnet_checkpoint_path <path_to_expert_disc_checkpoint>
To train with the visual quality discriminator, you should run hq_wav2lip_train.py instead. The arguments for both files are similar. In both cases, you can resume training as well. Look at python wav2lip_train.py --help for more details. You can also set additional, less commonly-used hyper-parameters at the bottom of the hparams.py file.
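As a concrete example of resuming the GAN-stage training, the invocation below reuses the flags from the training command earlier in this readme; all checkpoint paths are placeholders for your own files.
python hq_wav2lip_train.py --data_root lrs2_preprocessed/ --checkpoint_dir savedmodel --syncnet_checkpoint_path checkpoints/lipsync_expert.pth --checkpoint_path checkpoints/wav2lip_gan.pth --disc_checkpoint_path checkpoints/visual_quality_disc.pth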
Training on other datasets might require modifications to the code. Please read the following before you raise an issue:
- You might not get good results by training/fine-tuning on a few minutes of a single speaker. This is a separate research problem, to which we do not have a solution yet. Thus, we would most likely not be able to resolve your issue.
- You must train the expert discriminator for your own dataset before training Wav2Lip.
- If it is your own dataset downloaded from the web, in most cases it needs to be sync-corrected.
- Be mindful of the FPS of the videos in your dataset. Changing the FPS would need significant code changes (see the FPS re-encoding example below).
- The expert discriminator's eval loss should go down to ~0.25 and the Wav2Lip eval sync loss should go down to ~0.2 to get good results.
When raising an issue on this topic, please let us know that you are aware of all these points.
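If your clips are not already at the frame rate your configuration expects (LRS2 is 25 fps), the least invasive fix is usually to re-encode the videos rather than change the code. A simple FFMPEG re-encode, with placeholder file names:
ffmpeg -y -i raw_clip.mp4 -r 25 clip_25fps.mp4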
We have an HD model trained on a dataset allowing commercial usage. The size of the generated face will be 192 x 288 in our new model.
Please check the evaluation/ folder for the instructions.
This repository can only be used for personal/research/non-commercial purposes. However, for commercial requests, please contact us directly at [email protected] or [email protected]. We have an HD model trained on a dataset allowing commercial usage. The size of the generated face will be 192 x 288 in our new model. Please cite the following paper if you use this repository:
@inproceedings{10.1145/3394171.3413532,
author = {Prajwal, K R and Mukhopadhyay, Rudrabha and Namboodiri, Vinay P. and Jawahar, C.V.},
title = {A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild},
year = {2020},
isbn = {9781450379885},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3394171.3413532},
doi = {10.1145/3394171.3413532},
booktitle = {Proceedings of the 28th ACM International Conference on Multimedia},
pages = {484–492},
numpages = {9},
keywords = {lip sync, talking face generation, video generation},
location = {Seattle, WA, USA},
series = {MM '20}
}
Parts of the code structure are inspired by this TTS repository. We thank the author for this wonderful code. The code for Face Detection has been taken from the face_alignment repository. We thank the authors for releasing their code and models. We thank zabique for the tutorial Colab notebook.