This repository contains the code to manipulate a source video of a speaking person so that it mimics the speech from a given target video.
This code has been tested on Ubuntu with Python 3.7.
- First, create a virtual environment, e.g. using venv:
python -m venv venv
Activate the virtual environment with:
source venv/bin/activate
- After activating the virtual environment, run the following to install the requirements:
pip install --upgrade pip
pip install -r requirements.txt
- Create a directory named "Documents" inside the repository folder; it will contain the files needed to run the pipeline. Inside it, create a subdirectory named "pipeline_files". Download FLAME 2020 and move "generic_model.pkl" to "Documents/pipeline_files/flame_files". Download "landmark_embedding.npy" from DECA and move it to "flame_files" as well. Download the trained DeepSpeech model and move "output_graph.pb" to "Documents/pipeline_files/trained_models". Finally, download the trained tracker and audio2exp models and move "tracker.pt" and "a2e.pt" to "trained_models" as well.
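After these downloads, the layout under "Documents" should look like this (the "video" folder is where the source and target videos go, as described below):
Documents/
    pipeline_files/
        flame_files/
            generic_model.pkl
            landmark_embedding.npy
        trained_models/
            output_graph.pb
            tracker.pt
            a2e.pt
    video/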
- Training the source
To use a video as the source (i.e. the actor), the model must first be trained for that video. Training on the source video is initiated by running the following inside the repository directory.
source venv/bin/activate
./train_source.sh SOURCE_NAME
where SOURCE_NAME is the name (without extension) of the source video. The source video should be located in the "Documents/video" folder.
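For example, if the source video is "Documents/video/aa.mp4", run:
./train_source.sh aa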
- Creating the fake video (inference)
Generation of the fake video is initiated by running the following inside the repository directory.
source venv/bin/activate
./create_fake_video.sh SOURCE_NAME TARGET_NAME
where SOURCE_NAME and TARGET_NAME are the names (without extension) of the source and target videos, respectively. The videos should be located in the "Documents/video" folder. For example, if the source video is "aa.mp4" and the target video is "bb.mp4", you should run:
./create_fake_video.sh aa bb
The target can also be an audio file. In either case, the speech is extracted from the target video/audio and used to manipulate the source video.
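For example, assuming the target speech is provided as an audio file such as "bb.wav" placed in the "Documents/video" folder (the exact set of supported audio extensions is not listed here), the call is the same as for a video target:
./create_fake_video.sh aa bb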
When the run is complete, the resulting fake video will be inside the "Documents/video" folder.
In preprocess/ds_to_flame_params.py, there are two parameters, "jaw_gain" and "jaw_closure". jaw_gain determines how much the mouth opens for a given speech input; a higher gain leads to more mouth movement. jaw_closure determines how much the mouth should be closed during silence: by default the mouth keeps a slight offset (stays slightly open) during silence, and jaw_closure compensates for this. The current values seem to work well, but you can experiment with other values to see their effect.
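As a rough illustration of how these two parameters interact (a hypothetical sketch, not the actual code in preprocess/ds_to_flame_params.py; the variable names, example values, and formula are assumptions):
jaw_gain = 1.0      # assumed example value: scales how far the mouth opens with speech
jaw_closure = 0.05  # assumed example value: offset that pulls the mouth shut during silence

def adjust_jaw_opening(predicted_opening):
    # Scale the speech-driven jaw opening, subtract the closure offset,
    # and clamp at zero so the jaw never gets a negative opening.
    return max(0.0, jaw_gain * predicted_opening - jaw_closure)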