Commit f4092a8: further cleanup and doc updates.

guanlongzhao committed Sep 6, 2019 (1 parent: a975936)

Showing 7 changed files with 155 additions and 119 deletions.
61 changes: 50 additions & 11 deletions README.md
# Foreign Accent Conversion by Synthesizing Speech from Phonetic Posteriorgrams (accepted to Interspeech'19)

**The current version is runnable, but you probably won't figure out how; more docs are on the way.**

This branch hosts the code we used to prepare our Interspeech'19 paper titled "[Foreign Accent Conversion by Synthesizing Speech from Phonetic Posteriorgrams](https://psi.engr.tamu.edu/wp-content/uploads/2019/07/zhao2019interspeech.pdf)".

### Install

This project uses `conda` to manage all the dependencies; install [anaconda](https://anaconda.org/) first if you have not done so.

```bash
# Clone the repo
git clone https://github.com/guanlongzhao/fac-via-ppg.git
cd $PROJECT_ROOT_DIR

# Install dependencies
conda env create -f environment.yml

# Activate the installed environment
conda activate ppg-speech

# Compile protocol buffer
protoc -I=src/common --python_out=src/common src/common/data_utterance.proto

# Include src in your PYTHONPATH
export PYTHONPATH=$PROJECT_ROOT_DIR/src:$PYTHONPATH
```

If `conda` complains that some packages are missing, you can very likely find a similar version of each missing package in anaconda's archive.
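
For example, a hypothetical recovery for one missing package (the package name, version, and channel below are placeholders; substitute whatever `conda` reports):

```bash
# Search anaconda's archive for a close version of the missing package,
# then install that version into the environment explicitly.
conda search -c conda-forge some-missing-package
conda install -n ppg-speech -c conda-forge some-missing-package=1.0
```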

### Run unit tests

```bash
cd test

# Remember to make this script executable
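chmod +x run_coverage.sh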
./run_coverage.sh
```

This only does a few sanity checks; don't worry if the test coverage looks low :)

### Train PPG-to-Mel model
Change the default parameters in `src/common/hparams.py:create_hparams()` as needed. The training and validation data should be specified in text files; see `data/filelists` for examples.
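The exact filelist format is defined by the examples in `data/filelists`; as a purely hypothetical sketch (the paths are made up, and the real files may carry more fields per line), such a list is typically one utterance per line:

```
/data/arctic/ykwk/wav/arctic_a0001.wav
/data/arctic/ykwk/wav/arctic_a0002.wav
```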

```bash
cd src/script
python train_ppg2mel.py
```
The `FP16` mode will not work, unfortunately.

### Train WaveGlow model
Change the default parameters in `src/waveglow/config.json`. The training data should be specified in the same manner as for the PPG-to-Mel model.

```bash
cd src/script
python train_waveglow.py
```

### View training progress
You should find a dir named `log` inside each of your output dirs; that is the `LOG_DIR` to use below.
```bash
tensorboard --logdir=${LOG_DIR}
```
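
For example, assuming a training output dir of `./output` (a placeholder path), the call would be:

```bash
tensorboard --logdir=./output/log
```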

### Generate speech synthesis
Use `src/script/generate_synthesis.py`; you can find pre-trained models in the [Links](#Links) section.

```bash
generate_synthesis.py [-h] --ppg2mel_model PPG2MEL_MODEL
                      --waveglow_model WAVEGLOW_MODEL
                      --teacher_utterance_path TEACHER_UTTERANCE_PATH
                      --output_dir OUTPUT_DIR
```

### Links

- Syntheses and pre-trained models: [link](https://drive.google.com/file/d/1nye-CAGyz3diM5Q80s0iuBYgcIL_cqrs/view?usp=sharing)
- Training data (L2-ARCTIC recordings after noise removal): [link](https://drive.google.com/file/d/1WnBHAfjEKdFTBDv5D6DxRnlcvfiODBgy/view?usp=sharing)
- Demo: [link](https://guanlongzhao.github.io/demo/fac-via-ppg)
4 changes: 2 additions & 2 deletions environment.yml
name: ppg-speech
channels:
  - pytorch
  - pykaldi
...
dependencies:
  - pip:
    - textgrid==1.4
    - torch==1.0.0
prefix: /home/guanlong/anaconda2/envs/ppg-speech

22 changes: 14 additions & 8 deletions src/common/hparams.py
def create_hparams(**kwargs):
    ...
        "cudnn_enabled": True,
        "cudnn_benchmark": False,
        "output_directory": None,  # Directory to save checkpoints.
        # Directory to save tensorboard logs. Just keep it like this.
        "log_directory": 'log',
        "checkpoint_path": '',  # Path to a checkpoint file.
        "warm_start": False,  # Load the model only (warm start)
        "n_gpus": 1,  # Number of GPUs
        ...
        ################################
        # Data Parameters #
        ################################
        # Passed as a txt file, see data/filelists/training-set.txt for an
        # example.
        "training_files": '',
        # Passed as a txt file, see data/filelists/validation-set.txt for an
        # example.
        "validation_files": '',
        "is_full_ppg": True,  # Whether to use the full PPG or not.
        "is_append_f0": False,  # Currently only effective at sentence level
        "ppg_subsampling_factor": 1,  # Sub-sample the ppg & acoustic sequence.
        ...
        # |True |False |Please set cache path
        # |False |True |Overwrite the cache path
        # |False |False |Ignores the cache path
        "load_feats_from_disk": False,  # Remember to set the path.
        # Mutually exclusive with 'load_feats_from_disk', will overwrite
        # 'feats_cache_path' if set.
        "is_cache_feats": False,
        "feats_cache_path": '',

        ################################
        # Audio Parameters #
        ################################
        ...
        ################################
        # Model Parameters #
        ################################
        "n_symbols": 5816,
        "symbols_embedding_dim": 600,
    ...


def create_hparams_stage(**kwargs):
    """Create model hyperparameters. Parse nondefault from given string.
    These are the parameters used for our Interspeech 2019 submission.
    """
    hparams = {
        'attention_dim': 150,
        ...
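
For reference, a minimal usage sketch of these hyperparameters; it assumes `create_hparams(**kwargs)` applies keyword overrides on top of the defaults above (the output path is a placeholder):

```python
from common.hparams import create_hparams

# Assumed kwargs pass-through; the filelists follow the examples named in
# the comments above, and the output path is a placeholder.
hparams = create_hparams(
    training_files='data/filelists/training-set.txt',
    validation_files='data/filelists/validation-set.txt',
    output_directory='/tmp/ppg2mel-exp')
```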
51 changes: 51 additions & 0 deletions src/common/utils.py
def notch_filtering(wav, fs, w0, Q):
    ...
    wav = signal.lfilter(b, a, wav)
    return wav


def get_mel(wav, stft):
    audio = torch.FloatTensor(wav.astype(np.float32))
    audio_norm = audio / 32768
    audio_norm = audio_norm.unsqueeze(0)
    audio_norm = torch.autograd.Variable(audio_norm, requires_grad=False)
    # (1, n_mel_channels, T)
    acoustic_feats = stft.mel_spectrogram(audio_norm)
    return acoustic_feats


def waveglow_audio(mel, waveglow, sigma, is_cuda_output=False):
    mel = torch.autograd.Variable(mel.cuda())
    if not is_cuda_output:
        with torch.no_grad():
            audio = 32768 * waveglow.infer(mel, sigma=sigma)[0]
            audio = audio.cpu().numpy()
            audio = audio.astype('int16')
    else:
        with torch.no_grad():
            audio = waveglow.infer(mel, sigma=sigma).cuda()
    return audio


def get_inference(seq, model, is_clip=False):
    """Tacotron inference.

    Args:
        seq: T*D numpy array.
        model: Tacotron model.
        is_clip: Set to True to avoid the artifacts at the end.

    Returns:
        synthesized mels.
    """
    # (T, D) numpy -> (1, D, T) cpu tensor
    seq = torch.from_numpy(seq).float().transpose(0, 1).unsqueeze(0)
    # cpu tensor -> gpu tensor
    seq = to_gpu(seq)
    mel_outputs, mel_outputs_postnet, _, alignments = model.inference(seq)
    if is_clip:
        return mel_outputs_postnet[:, :, 10:(seq.size(2) - 10)]
    else:
        return mel_outputs_postnet


def load_waveglow_model(path):
    model = torch.load(path)['model']
    model = model.remove_weightnorm(model)
    model.cuda().eval()
    return model
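
A hypothetical copy-synthesis sketch chaining these helpers; the STFT construction, checkpoint path, and wav paths are placeholders, and a 16 kHz int16 mono recording is assumed:

```python
from scipy.io import wavfile
from common.layers import TacotronSTFT
from common.utils import get_mel, waveglow_audio, load_waveglow_model

fs, wav = wavfile.read('speech.wav')   # int16 samples, 16 kHz assumed
stft = TacotronSTFT(sampling_rate=fs)  # assumes default mel settings
mel = get_mel(wav, stft)               # (1, n_mel_channels, T) mel tensor
waveglow = load_waveglow_model('waveglow_checkpoint.pt')  # moves to GPU
audio = waveglow_audio(mel, waveglow, sigma=0.6)  # int16 numpy waveform
wavfile.write('resynthesis.wav', fs, audio)
```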
100 changes: 25 additions & 75 deletions src/script/generate_synthesis.py
...
# See the License for the specific language governing permissions and
# limitations under the License.

from common.hparams import create_hparams_stage
from common.layers import TacotronSTFT
from common import feat
from common.utils import waveglow_audio, get_inference, load_waveglow_model
from scipy.io import wavfile
import numpy as np
from script.train_ppg2mel import load_model
from waveglow.denoiser import Denoiser
from common.data_utils import get_ppg


import argparse
import logging
import os
import ppg
import torch


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description='Generate accent conversion speech using pre-trained '
                    'models.')
    parser.add_argument('--ppg2mel_model', type=str, required=True,
                        help='Path to the PPG-to-Mel model.')
    parser.add_argument('--waveglow_model', type=str, required=True,
                        help='Path to the WaveGlow model.')
    parser.add_argument('--teacher_utterance_path', type=str, required=True,
                        help='Path to a native speaker recording.')
    parser.add_argument('--output_dir', type=str, required=True,
                        help='Output dir, will save the audio and log info.')
    args = parser.parse_args()

    # Prepare dirs
    output_dir = args.output_dir
    if not os.path.isdir(output_dir):
        os.mkdir(output_dir)
    logging.basicConfig(filename=os.path.join(output_dir, 'debug.log'),
                        level=logging.DEBUG)
    logging.info('Output dir: %s', output_dir)

    # Parameters
    teacher_utt_path = args.teacher_utterance_path
    checkpoint_path = args.ppg2mel_model
    waveglow_path = args.waveglow_model
    is_clip = False  # Set to True to control the output length of AC.
    fs = 16000
    waveglow_sigma = 0.6
    ...
19 changes: 7 additions & 12 deletions src/script/train_ppg2mel.py
...

"""Modified from https://github.com/NVIDIA/tacotron2"""

import os
import time
import math
...
if __name__ == '__main__':
    hparams = create_hparams()

    if not hparams.output_directory:
        raise FileExistsError('Please specify the output dir.')
    else:
        if not os.path.exists(hparams.output_directory):
            os.mkdir(hparams.output_directory)

    # Record the hyper-parameters.
    hparams_snapshot_file = os.path.join(hparams.output_directory,
                                         'hparams.txt')
    with open(hparams_snapshot_file, 'w') as writer:
        pprint(hparams.__dict__, writer)
