This repository was developed as part of the research for my Master's thesis titled "Audio Deepfake Detection with the Aid of Authentic Reference Material," conducted at the University of Hagen. The thesis was supervised by Prof. Jörg Keller (University of Hagen) and Dr. Dominique Dresen (Federal Office for Information Security), with additional support from Matthias Neu (Federal Office for Information Security).
The primary objective of this repository is to facilitate the development of a deepfake detection model using the ECAPA-TDNN [1] architecture (see Fig. 1). The audio features are generated using either an MFCC extractor or the SSL model WavLM [2]. A key innovation of this thesis is the use of the triplet loss function for training the model, as opposed to the current state of the art, which uses ECAPA-TDNN with WavLM-Large but employs the AAM-Softmax loss function [3]. To build on standardized, tested code, this repository uses the ECAPA-TDNN implementation of SpeechBrain [4]; the code can be found in their GitHub repository at speechbrain/lobes/models/ECAPA_TDNN.py. Additionally, S3PRL [5] is used as the speech toolkit to leverage WavLM as a feature generator. To leverage additional upstream SSL models included in S3PRL, the abstract s3prl_ECAPA_TDNN class can be implemented; here, WavLM Base as well as WavLM Large are implemented.
Fig. 1 ECAPA-TDNN architecture including the SE-Res2Block [1].
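The combination of an S3PRL upstream with the SpeechBrain ECAPA-TDNN backend can be sketched as follows. This is a minimal illustration, not the repository's s3prl_ECAPA_TDNN class; the exact S3PRL interface and tensor shapes may differ between versions.

```python
import torch
from s3prl.nn import S3PRLUpstream
from speechbrain.lobes.models.ECAPA_TDNN import ECAPA_TDNN

# Frontend: WavLM Base upstream (hidden size 768); backend: 192-dim speaker embeddings.
upstream = S3PRLUpstream("wavlm_base")
ecapa = ECAPA_TDNN(input_size=768, lin_neurons=192)

wavs = torch.randn(2, 16000)                     # two one-second 16 kHz waveforms
wavs_len = torch.LongTensor([16000, 16000])
with torch.no_grad():
    hidden_states, _ = upstream(wavs, wavs_len)  # list of per-layer features (batch, frames, 768)
    embeddings = ecapa(hidden_states[-1])        # (batch, 1, 192) speaker embeddings
```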
To achieve better results in deepfake detection, the triplets are generated as anchor A (audio of speaker A), positive P (a different audio of speaker A), and deepfake D (a deepfake of speaker A). The loss function aims to minimize the Euclidean distance between the embeddings of the anchor and the positive generated by the ECAPA-TDNN model, while pushing the embedding of the deepfake away, in the same manner as FaceNet [6]. The loss is calculated using the following function:
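Assuming the squared Euclidean form used in FaceNet [6], with embedding function f and margin α, the loss for a triplet (A, P, D) reads:

$$
\mathcal{L}(A, P, D) = \max\left(\lVert f(A) - f(P) \rVert_2^2 - \lVert f(A) - f(D) \rVert_2^2 + \alpha,\ 0\right)
$$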
To create these triplets, a dataset is required that contains authentic audio as well as deepfake audio of the same speakers. The Federal Office for Information Security provided a dataset including authentic audio files from LibriTTS [8] and their corresponding deepfakes. These deepfakes are produced using both Text-to-Speech (TTS) and Voice Conversion (VC) methods. An extensive list of the deepfake methods used to generate the dataset can be found in Deepfake Methods.
This repository leverages the deepfake dataset provided by the Federal Office for Information Security. The dataset is accompanied by extraction code located in the extraction_utils
directory. To properly utilize this code, it is necessary to create a symbolic link in the ./source
folder.
This project requires CUDA-enabled graphics cards for execution. Ensure that CUDA version 12.1 or higher is installed on your system. You can verify your CUDA installation by running:
nvcc --version
If you do not have the required CUDA version, follow the installation instructions provided by NVIDIA to upgrade or install CUDA version 12.1 or higher.
- Download the project
git clone [email protected]:DigitalPhilosopher/SpeakerRecognition.git
- Create a virtual Python environment
python -m venv .venv
source .venv/bin/activate
- Install pip requirements
pip install -r requirements.txt
- Add a symbolic link or copy the data to the data directory
ln -s /path/to/BSI_DATASET data/BSI
ln -s /path/to/LibriSpeech data/LibriSpeech
This script trains a model using the provided deepfake dataset. To train a model, execute the script with the desired parameters; hyperparameters such as the learning rate, batch size, and number of epochs can be adjusted as required.
usage: TrainModel.py [-h] [--downsample_train DOWNSAMPLE_TRAIN] [--downsample_valid DOWNSAMPLE_VALID] [--downsample_test DOWNSAMPLE_TEST] --dataset DATASET
[--mfccs MFCCS] [--sample_rate SAMPLE_RATE] --frontend FRONTEND [--frozen FROZEN] [--embedding_size EMBEDDING_SIZE] [--device DEVICE]
[--batch_size BATCH_SIZE] [--epochs EPOCHS] [--validation_rate VALIDATION_RATE] [--margin MARGIN] [--norm NORM] [--learning_rate LEARNING_RATE]
[--weight_decay WEIGHT_DECAY] [--amsgrad AMSGRAD]
Training ECAPA-TDNN Model for Deepfake Speaker Verification
options:
-h, --help show this help message and exit
--downsample_train DOWNSAMPLE_TRAIN
Downsample training data by a factor (default: 0 - no downsampling)
--downsample_valid DOWNSAMPLE_VALID
Downsample validation data by a factor (default: 0 - no downsampling)
--downsample_test DOWNSAMPLE_TEST
Downsample test data by a factor (default: 0 - no downsampling)
--dataset DATASET Which dataset to use (LibriSpeech.genuine | VoxCeleb.genuine | BSI.genuine | BSI.deepfake)
--mfccs MFCCS Number of MFCC features to extract (default: 13)
--sample_rate SAMPLE_RATE
Sample rate for the audio data (default: 16000)
--frontend FRONTEND Which frontend model to use for feature extraction (mfcc, wavlm_base, wavlm_large)
--frozen FROZEN Whether the frontend model is jointly trained or frozen during training (1=frozen, 0=joint)
--embedding_size EMBEDDING_SIZE
Size of the embedding vector (default: 192)
--device DEVICE Which device to use (by default, uses CUDA if available)
--batch_size BATCH_SIZE
Batch size for training (default: 8)
--epochs EPOCHS Number of training epochs (default: 25)
--validation_rate VALIDATION_RATE
Validation rate, i.e., validate every N epochs (default: 5)
--margin MARGIN Margin for loss function (default: 1)
--norm NORM Normalization type (default: 2)
--learning_rate LEARNING_RATE
Learning rate for the optimizer (default: 0.001)
--weight_decay WEIGHT_DECAY
Weight decay to use for optimizing (default: 0.00001)
--amsgrad AMSGRAD Whether to use the AMSGrad variant of Adam optimizer (default: False)
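The --margin and --norm options presumably map onto the margin and the p-norm of a standard PyTorch triplet margin loss; below is a minimal sketch under that assumption (the embedding values are random placeholders).

```python
import torch
from torch import nn

# Assumed mapping of --margin and --norm onto PyTorch's triplet margin loss
# (not taken from the repository's code).
triplet_loss = nn.TripletMarginLoss(margin=1.0, p=2)

anchor = torch.randn(8, 192)    # embeddings of the anchor audio (genuine)
positive = torch.randn(8, 192)  # embeddings of other genuine audio from the same speakers
negative = torch.randn(8, 192)  # embeddings of deepfake audio of the same speakers
loss = triplet_loss(anchor, positive, negative)
```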
Examples
Below are some examples to help you get started with training the model using different configurations:
# Frontend: MFCC
python source/TrainModel.py --frontend mfcc --dataset BSI.genuine --batch_size 16 --epochs 20 --validation_rate 5 --margin 0.2 --mfccs 80 --downsample_valid 25 --downsample_test 50
python source/TrainModel.py --frontend mfcc --dataset BSI.deepfake --batch_size 16 --epochs 20 --validation_rate 5 --margin 0.2 --mfccs 80 --downsample_valid 25 --downsample_test 50
# Frontend: WavLM Base with frozen parameters
python source/TrainModel.py --frontend wavlm_base --dataset BSI.genuine --batch_size 8 --epochs 20 --validation_rate 5 --margin 0.2 --downsample_valid 25 --downsample_test 50
python source/TrainModel.py --frontend wavlm_base --dataset BSI.deepfake --batch_size 8 --epochs 20 --validation_rate 5 --margin 0.2 --downsample_valid 25 --downsample_test 50
# Frontend: WavLM Base with jointly trained parameters
python source/TrainModel.py --frontend wavlm_base --frozen 0 --dataset BSI.genuine --batch_size 8 --epochs 20 --validation_rate 5 --margin 0.2 --downsample_valid 25 --downsample_test 50
python source/TrainModel.py --frontend wavlm_base --frozen 0 --dataset BSI.deepfake --batch_size 8 --epochs 20 --validation_rate 5 --margin 0.2 --downsample_valid 25 --downsample_test 50
# Frontend: WavLM Large with frozen parameters
python source/TrainModel.py --frontend wavlm_large --dataset BSI.genuine --batch_size 8 --epochs 20 --validation_rate 5 --margin 0.2 --downsample_valid 25 --downsample_test 50
python source/TrainModel.py --frontend wavlm_large --dataset BSI.deepfake --batch_size 8 --epochs 20 --validation_rate 5 --margin 0.2 --downsample_valid 25 --downsample_test 50
# Frontend: WavLM Large with jointly trained parameters
python source/TrainModel.py --frontend wavlm_large --frozen 0 --dataset BSI.genuine --batch_size 4 --epochs 20 --validation_rate 5 --margin 0.2 --downsample_valid 25 --downsample_test 50
python source/TrainModel.py --frontend wavlm_large --frozen 0 --dataset BSI.deepfake --batch_size 4 --epochs 20 --validation_rate 5 --margin 0.2 --downsample_valid 25 --downsample_test 50
Display training results
To visualize and monitor training results, this project utilizes MLflow. MLflow provides a robust interface for tracking experiments, visualizing metrics, and managing model artifacts. To start the MLflow UI and navigate to the local interface, execute the following command in your terminal:
mlflow ui
After running the command, open your web browser and go to http://localhost:5000, the default address of the MLflow UI.
To perform inference using the ECAPA-TDNN Model for Deepfake Speaker Verification or Deepfake Detection, utilize the script for inference. This script allows you to compare a reference audio file with an audio file in question to determine if the latter is genuine or a deepfake.
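Conceptually, the decision reduces to comparing the Euclidean distance between the two embeddings against a threshold; the sketch below illustrates this idea (function and variable names are illustrative, not the repository's API).

```python
import torch

def is_genuine(reference_embedding: torch.Tensor,
               question_embedding: torch.Tensor,
               threshold: float) -> bool:
    # Euclidean distance between the two ECAPA-TDNN embeddings; the audio in
    # question is only accepted as genuine if the distance stays below the threshold.
    distance = torch.linalg.vector_norm(reference_embedding - question_embedding).item()
    return distance < threshold
```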
usage: Inference.py [-h] --reference_audio REFERENCE_AUDIO --audio_in_question AUDIO_IN_QUESTION [--threshold THRESHOLD] --dataset DATASET [--mfccs MFCCS]
[--sample_rate SAMPLE_RATE] --frontend FRONTEND [--frozen FROZEN] [--embedding_size EMBEDDING_SIZE] [--device DEVICE]
Inference of the ECAPA-TDNN Model for Deepfake Speaker Verification or Deepfake Detection
options:
-h, --help show this help message and exit
--reference_audio REFERENCE_AUDIO
Genuine reference audio of speaker
--audio_in_question AUDIO_IN_QUESTION
Audio in question, to be verified against the reference speaker
--threshold THRESHOLD
Threshold that must not be exceeded for the audio to be considered genuine
--dataset DATASET Which dataset to use (LibriSpeech.genuine | VoxCeleb.genuine | BSI.genuine | BSI.deepfake)
--mfccs MFCCS Number of MFCC features to extract (default: 13)
--sample_rate SAMPLE_RATE
Sample rate for the audio data (default: 16000)
--frontend FRONTEND Which frontend model to use for feature extraction (mfcc, wavlm_base, wavlm_large)
--frozen FROZEN Whether the frontend model is jointly trained or frozen during training (1=frozen, 0=joint)
--embedding_size EMBEDDING_SIZE
Size of the embedding vector (default: 192)
--device DEVICE Which device to use (by default, uses CUDA if available)
Examples
Below are some examples to help you get started with basic inference on the trained models using different configurations:
# Frontend: MFCC
python source/Inference.py --frontend mfcc --dataset BSI.genuine --mfccs 80 --reference_audio ../data/reference.wav --audio_in_question ../data/question.wav
python source/Inference.py --frontend mfcc --dataset BSI.deepfake --mfccs 80 --reference_audio ../data/reference.wav --audio_in_question ../data/question.wav
# Frontend: WavLM Base with frozen parameters
python source/Inference.py --frontend wavlm_base --dataset BSI.genuine --reference_audio ../data/reference.wav --audio_in_question ../data/question.wav
python source/Inference.py --frontend wavlm_base --dataset BSI.deepfake --reference_audio ../data/reference.wav --audio_in_question ../data/question.wav
# Frontend: WavLM Base with jointly trained parameters
python source/Inference.py --frontend wavlm_base --frozen 0 --dataset BSI.genuine --reference_audio ../data/reference.wav --audio_in_question ../data/question.wav
python source/Inference.py --frontend wavlm_base --frozen 0 --dataset BSI.deepfake --reference_audio ../data/reference.wav --audio_in_question ../data/question.wav
# Frontend: WavLM Large with frozen parameters
python source/Inference.py --frontend wavlm_large --dataset BSI.genuine --reference_audio ../data/reference.wav --audio_in_question ../data/question.wav
python source/Inference.py --frontend wavlm_large --dataset BSI.deepfake --reference_audio ../data/reference.wav --audio_in_question ../data/question.wav
# Frontend: WavLM Large with jointly trained parameters
python source/Inference.py --frontend wavlm_large --frozen 0 --dataset BSI.genuine --reference_audio ../data/reference.wav --audio_in_question ../data/question.wav
python source/Inference.py --frontend wavlm_large --frozen 0 --dataset BSI.deepfake --reference_audio ../data/reference.wav --audio_in_question ../data/question.wav
This script performs analytics on the trained models and saves the results to the ./data/analytics.csv
file. It calculates the Equal Error Rate (EER) and the minimum Detection Cost Function (minDCF), and determines the threshold at which the EER is reached, i.e., where the false acceptance and false rejection rates are equal.
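As a rough illustration, an EER and its corresponding threshold can be derived from trial scores as follows; this is a sketch using scikit-learn and not necessarily how the repository computes it.

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels: np.ndarray, scores: np.ndarray):
    # labels: 1 for genuine/target trials, 0 for deepfake/impostor trials
    # scores: higher values mean "more likely genuine"
    fpr, tpr, thresholds = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = int(np.nanargmin(np.abs(fnr - fpr)))  # operating point where FAR and FRR are closest
    eer = (fpr[idx] + fnr[idx]) / 2
    return eer, thresholds[idx]
```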
usage: Analytics.py [-h] [--train | --no-train] [--valid | --no-valid] [--test | --no-test] [--downsample_train DOWNSAMPLE_TRAIN]
[--downsample_valid DOWNSAMPLE_VALID] [--downsample_test DOWNSAMPLE_TEST] --dataset DATASET [--mfccs MFCCS] [--sample_rate SAMPLE_RATE]
--frontend FRONTEND [--frozen FROZEN] [--embedding_size EMBEDDING_SIZE] [--device DEVICE] [--batch_size BATCH_SIZE]
[--analyze_genuine | --no-analyze_genuine] [--analyze_deepfake | --no-analyze_deepfake]
Analytics of the ECAPA-TDNN Model for Deepfake Speaker Verification and Deepfake Detection
options:
-h, --help show this help message and exit
--train, --no-train Whether to generate analytics for the training set (default: False)
--valid, --no-valid Whether to generate analytics for the validation set (default: False)
--test, --no-test Whether to generate analytics for the test set (default: True)
--downsample_train DOWNSAMPLE_TRAIN
Downsample training data by a factor (default: 0 - no downsampling)
--downsample_valid DOWNSAMPLE_VALID
Downsample validation data by a factor (default: 0 - no downsampling)
--downsample_test DOWNSAMPLE_TEST
Downsample test data by a factor (default: 0 - no downsampling)
--dataset DATASET Which dataset to use (LibriSpeech.genuine | VoxCeleb.genuine | BSI.genuine | BSI.deepfake)
--mfccs MFCCS Number of MFCC features to extract (default: 13)
--sample_rate SAMPLE_RATE
Sample rate for the audio data (default: 16000)
--frontend FRONTEND Which frontend model to use for feature extraction (mfcc, wavlm_base, wavlm_large)
--frozen FROZEN Whether the frontend model is jointly trained or frozen during training (1=frozen, 0=joint)
--embedding_size EMBEDDING_SIZE
Size of the embedding vector (default: 192)
--device DEVICE Which device to use (by default, uses CUDA if available)
--batch_size BATCH_SIZE
Batch size for training (default: 8)
--analyze_genuine, --no-analyze_genuine
Whether to generate analytics for the genuine dataset. (default: True)
--analyze_deepfake, --no-analyze_deepfake
Whether to generate analytics for the deepfake dataset. (default: True)
Examples
Below are some examples to help you get started with starting the analytics on the trained models:
# Frontend: MFCC
python source/Analytics.py --frontend mfcc --dataset BSI.genuine --mfccs 80 --batch_size 16 --downsample_train 1000
python source/Analytics.py --frontend mfcc --dataset BSI.deepfake --mfccs 80 --batch_size 16 --downsample_train 1000
# Frontend: WavLM Base with frozen parameters
python source/Analytics.py --frontend wavlm_base --dataset BSI.genuine --batch_size 8 --downsample_train 1000
python source/Analytics.py --frontend wavlm_base --dataset BSI.deepfake --batch_size 8 --downsample_train 1000
# Frontend: WavLM Base with jointly trained parameters
python source/Analytics.py --frontend wavlm_base --frozen 0 --dataset BSI.genuine --batch_size 8 --downsample_train 1000
python source/Analytics.py --frontend wavlm_base --frozen 0 --dataset BSI.deepfake --batch_size 8 --downsample_train 1000
# Frontend: WavLM Large with frozen parameters
python source/Analytics.py --frontend wavlm_large --dataset BSI.genuine --batch_size 8 --downsample_train 1000
python source/Analytics.py --frontend wavlm_large --dataset BSI.deepfake --batch_size 8 --downsample_train 1000
# Frontend: WavLM Large with jointly trained parameters
python source/Analytics.py --frontend wavlm_large --frozen 0 --dataset BSI.genuine --batch_size 8 --downsample_train 1000
python source/Analytics.py --frontend wavlm_large --frozen 0 --dataset BSI.deepfake --batch_size 8 --downsample_train 1000
The experiments script can be used to run several different experiments at once. It reads a text file in which each line represents an experiment, checks how many GPUs are available, and runs experiments on all of them. To make sure every model is trained before analytics are run, it first runs all training scripts; only after these have finished does it run the inference and finally the analytics scripts. For testing, a lightweight set was added as an example; to train and analyze all models, the full set was added.
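The scheduling described above could look roughly like the following sketch; the file name and the round-robin GPU assignment are assumptions for illustration, not the repository's actual logic.

```python
import subprocess
import torch

# Read one experiment (a full command line) per row; blank lines are skipped.
with open("experiments.txt") as f:
    experiments = [line.strip() for line in f if line.strip()]

# Assign each experiment to a GPU round-robin and run them one after another.
num_gpus = max(torch.cuda.device_count(), 1)
for i, command in enumerate(experiments):
    gpu = i % num_gpus
    subprocess.run(f"CUDA_VISIBLE_DEVICES={gpu} {command}", shell=True, check=False)
```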
usage: Experiments.py [-h] [--experiments EXPERIMENTS]
Running several experiments on ECAPA-TDNN Model for Deepfake Speaker Verification
options:
-h, --help show this help message and exit
--experiments EXPERIMENTS
File of experiments to run
Example
python source/Experiments.py --experiments experiments.txt
Front-End | Triplet Mining | Dataset | Speaker Verification EER |
---|---|---|---|
MFCC | Deepfake | training | 0.727663 |
MFCC | Deepfake | validation | 0.721183 |
MFCC | Deepfake | test | 0.728241 |
MFCC | Genuine | training | 0.701033 |
MFCC | Genuine | validation | 0.701497 |
MFCC | Genuine | test | 0.707249 |
WavLM-Base/Frozen | Deepfake | training | 0.669932 |
WavLM-Base/Frozen | Deepfake | validation | 0.667618 |
WavLM-Base/Frozen | Deepfake | test | 0.671599 |
WavLM-Base/Frozen | Genuine | training | 0.711452 |
WavLM-Base/Frozen | Genuine | validation | 0.710852 |
WavLM-Base/Frozen | Genuine | test | 0.713652 |
WavLM-Base/Joint | Deepfake | training | 0.676053 |
WavLM-Base/Joint | Deepfake | validation | 0.6809 |
WavLM-Base/Joint | Deepfake | test | 0.679986 |
WavLM-Base/Joint | Genuine | training | 0.563248 |
WavLM-Base/Joint | Genuine | validation | 0.565732 |
WavLM-Base/Joint | Genuine | test | 0.553168 |
WavLM-Large/Frozen | Deepfake | training | 0.633382 |
WavLM-Large/Frozen | Deepfake | validation | 0.643552 |
WavLM-Large/Frozen | Deepfake | test | 0.64387 |
WavLM-Large/Frozen | Genuine | training | 0.636669 |
WavLM-Large/Frozen | Genuine | validation | 0.629064 |
WavLM-Large/Frozen | Genuine | test | 0.651273 |
- WavLM-Base/Joint front-end generally outperforms other front-ends in terms of EER for both deepfake and genuine datasets.
- Genuine datasets show slightly higher EER compared to deepfake datasets with WavLM.
Front-End | Triplet Mining | Dataset | Deepfake Detection EER |
---|---|---|---|
MFCC | Deepfake | training | 0.670413 |
MFCC | Deepfake | validation | 0.663193 |
MFCC | Deepfake | test | 0.672118 |
MFCC | Genuine | training | 0.538294 |
MFCC | Genuine | validation | 0.544037 |
MFCC | Genuine | test | 0.543519 |
WavLM-Base/Frozen | Deepfake | training | 0.739727 |
WavLM-Base/Frozen | Deepfake | validation | 0.739647 |
WavLM-Base/Frozen | Deepfake | test | 0.748872 |
WavLM-Base/Frozen | Genuine | training | 0.541659 |
WavLM-Base/Frozen | Genuine | validation | 0.55832 |
WavLM-Base/Frozen | Genuine | test | 0.536989 |
WavLM-Base/Joint | Deepfake | training | 0.719853 |
WavLM-Base/Joint | Deepfake | validation | 0.723511 |
WavLM-Base/Joint | Deepfake | test | 0.732477 |
WavLM-Base/Joint | Genuine | training | 0.524131 |
WavLM-Base/Joint | Genuine | validation | 0.531101 |
WavLM-Base/Joint | Genuine | test | 0.528507 |
WavLM-Large/Frozen | Deepfake | training | 0.660818 |
WavLM-Large/Frozen | Deepfake | validation | 0.669061 |
WavLM-Large/Frozen | Deepfake | test | 0.674733 |
WavLM-Large/Frozen | Genuine | training | 0.565944 |
WavLM-Large/Frozen | Genuine | validation | 0.572612 |
WavLM-Large/Frozen | Genuine | test | 0.575025 |
- WavLM-Base/Joint front-end shows better performance in detecting deepfakes with Genuine datasets.
- The Genuine dataset consistently performs better in deepfake detection across all front-ends, highlighting that genuine audio may provide more reliable features for detecting anomalies.
The choice of front-end and dataset significantly impacts the performance of both speaker verification and deepfake detection systems.
- WavLM-Base/Joint front-end performs better in speaker verification, especially with genuine datasets.
However, the evaluation of the trained models suggests that the dataset is not sophisticated enough to train a model for automatic speaker verification. Without working speaker verification, detecting deepfakes using feature embeddings of authentic audio samples is also not feasible.
There are some possible alterations that could result in better performance in speaker verification as well as deepfake detection:
- Train the model on different datasets
- VoxCeleb
- ASVSpoof 2014
- Full LibriSpeech Dataset
- Use the pretrained WavLM-Large/ECAPA-TDNN model from ESPnet-SPK and fine-tune it for deepfake detection using the BSI dataset
- Use pretrained models to fine-tune on
- single speaker verification and
- deepfake detection
- Change the trained model to use two embeddings at the same time. Instead of a mono audio track, this would use a dual audio input for the audio in question and the genuine audio. This would not be trained using triplet loss.
There were still some errors in the model generation and EER calculation. The loss was not calculated for batches.
The main reason is to test whether triplet loss is able to learn speaker verification.
Front-End | Triplet Mining | Dataset | Speaker Verification EER |
---|---|---|---|
MFCC | Genuine | training | 0.0815375 |
MFCC | Genuine | validation | 0.0965594 |
MFCC | Genuine | test | 0.110305 |
WavLM-Base/Frozen | Genuine | training | 0.0957287 |
WavLM-Base/Frozen | Genuine | validation | 0.145394 |
WavLM-Base/Frozen | Genuine | test | 0.139695 |
WavLM-Base/Joint | Genuine | training | 0.045797 |
WavLM-Base/Joint | Genuine | validation | 0.099889 |
WavLM-Base/Joint | Genuine | test | 0.0961832 |
WavLM-Large/Frozen | Genuine | training | 0.0760363 |
WavLM-Large/Frozen | Genuine | validation | 0.129856 |
WavLM-Large/Frozen | Genuine | test | 0.108015 |
WavLM-Large/Joint | Genuine | training | 0.0363362 |
WavLM-Large/Joint | Genuine | validation | 0.0836108 |
WavLM-Large/Joint | Genuine | test | 0.0675573 |
Next step is to train on the full LibriSpeech dataset as well as VoxCeleb (only WavLM Base Joint).
Front-End | Triplet Mining | Dataset | Speaker Verification EER |
---|---|---|---|
WavLM-Base/Joint | Genuine | training | 0.0355303 |
WavLM-Base/Joint | Genuine | validation | 0.0747318 |
WavLM-Base/Joint | Genuine | test | 0.0675573 |
Front-End | Triplet Mining | Dataset | Speaker Verification EER |
---|---|---|---|
WavLM-Base/Joint | Genuine | validation | 0.499964 |
WavLM-Base/Joint | Genuine | test | 0.496923 |
WavLM-Base/Joint | Genuine | training | 0.502123 |
Front-End | Triplet Mining | Dataset | Speaker Verification EER |
---|---|---|---|
WavLM-Base/Joint | Genuine | training | 0.099185 |
WavLM-Base/Joint | Genuine | validation | 0.119871 |
WavLM-Base/Joint | Genuine | test | 0.119846 |
Front-End | Triplet Mining | Dataset | Speaker Verification EER |
---|---|---|---|
WavLM-Base/Joint | Genuine | training | 0.0355303 |
WavLM-Base/Joint | Genuine | validation | 0.0747318 |
WavLM-Base/Joint | Genuine | test | 0.0675573 |
Front-End | Triplet Mining | Dataset | Speaker Verification EER |
---|---|---|---|
WavLM-Base/Joint | Genuine | training | 0.0107922 |
WavLM-Base/Joint | Genuine | validation | 0.0610433 |
WavLM-Base/Joint | Genuine | test | 0.0541985 |
Front-End | Triplet Mining | Dataset | Speaker Verification EER |
---|---|---|---|
WavLM-Base/Joint | Genuine | training | 0.0494411 |
WavLM-Base/Joint | Genuine | validation | 0.132445 |
WavLM-Base/Joint | Genuine | test | 0.108779 |
Front-End | Triplet Mining | Dataset | Speaker Verification EER |
---|---|---|---|
WavLM-Base/Joint | Genuine | training | 0.05057586702465077 |
WavLM-Base/Joint | Genuine | validation | 0.07528398725401961 |
WavLM-Base/Joint | Genuine | test | 0.07521101408382184 |
Front-End | Triplet Mining | Dataset | Speaker Verification EER |
---|---|---|---|
WavLM-Base/Joint | Genuine | training | 0.1445 |
WavLM-Base/Joint | Genuine | validation | 0.2795 |
WavLM-Base/Joint | Genuine | test | 0.2889 |
Front-End | Triplet Mining | Dataset | Speaker Verification EER |
---|---|---|---|
WavLM-Base/Joint | Deepfake | training | 0.0946 |
WavLM-Base/Joint | Deepfake | validation | 0.1800 |
WavLM-Base/Joint | Deepfake | test | 0.1820 |
Positive = Genuine from Same Speaker - Negative: Random Deepfake from Same Speaker and same Utterance
Front-End | Triplet Mining | Dataset | Speaker Verification EER |
---|---|---|---|
WavLM-Base/Joint | Deepfake | training | 0.1314 |
WavLM-Base/Joint | Deepfake | validation | 0.2473 |
WavLM-Base/Joint | Deepfake | test | 0.2503 |
Positive = Genuine from Same Speaker - Negative: Random Deepfake from Same Speaker (Hard mining per epoch to find hardest deepfake method)
Front-End | Triplet Mining | Dataset | Speaker Verification EER |
---|---|---|---|
WavLM-Base/Joint | Deepfake | training | 0.1130 |
WavLM-Base/Joint | Deepfake | validation | 0.2122 |
WavLM-Base/Joint | Deepfake | test | 0.2007 |
6. Fine Tuning, different margin: Positive = Genuine from Same Speaker - Negative: Random Deepfake from Same Speaker
Front-End | Triplet Mining | Dataset | Speaker Verification EER |
---|---|---|---|
WavLM-Base/Joint | Deepfake | training | 0.1603 |
WavLM-Base/Joint | Deepfake | validation | 0.2698 |
WavLM-Base/Joint | Deepfake | test | 0.2759 |
Front-End | Triplet Mining | Dataset | Speaker Verification EER |
---|---|---|---|
WavLM-Base/Joint | Deepfake | training | 0.0946 |
WavLM-Base/Joint | Deepfake | validation | 0.1800 |
WavLM-Base/Joint | Deepfake | test | 0.1820 |
Front-End | Triplet Mining | Dataset | Speaker Verification EER |
---|---|---|---|
WavLM-Base/Joint | Deepfake | training | 0.1629 |
WavLM-Base/Joint | Deepfake | validation | 0.2914 |
WavLM-Base/Joint | Deepfake | test | 0.2988 |
7. Fine Tuning, different learning rate: Positive = Genuine from Same Speaker - Negative: Random Deepfake from Same Speaker
Front-End | Triplet Mining | Dataset | Speaker Verification EER |
---|---|---|---|
WavLM-Base/Joint | Deepfake | training | 0.1419 |
WavLM-Base/Joint | Deepfake | validation | 0.2640 |
WavLM-Base/Joint | Deepfake | test | 0.2764 |
Front-End | Triplet Mining | Dataset | Speaker Verification EER |
---|---|---|---|
WavLM-Base/Joint | Deepfake | training | 0.1143 |
WavLM-Base/Joint | Deepfake | validation | 0.1976 |
WavLM-Base/Joint | Deepfake | test | 0.2069 |
Front-End | Triplet Mining | Dataset | Speaker Verification EER |
---|---|---|---|
WavLM-Base/Joint | Deepfake | training | 0.0802 |
WavLM-Base/Joint | Deepfake | validation | 0.1468 |
WavLM-Base/Joint | Deepfake | test | 0.1522 |
Front-End | Triplet Mining | Dataset | Speaker Verification EER |
---|---|---|---|
WavLM-Base/Joint | Deepfake | training | 0.0657 |
WavLM-Base/Joint | Deepfake | validation | 0.1386 |
WavLM-Base/Joint | Deepfake | test | 0.1405 |
Front-End | Triplet Mining | Dataset | Speaker Verification EER |
---|---|---|---|
WavLM-Base/Joint | Deepfake | training | 0.0723 |
WavLM-Base/Joint | Deepfake | validation | 0.1501 |
WavLM-Base/Joint | Deepfake | test | 0.1433 |
Front-End | Triplet Mining | Dataset | Speaker Verification EER |
---|---|---|---|
WavLM-Base/Joint | Deepfake | training | 0.0946 |
WavLM-Base/Joint | Deepfake | validation | 0.1800 |
WavLM-Base/Joint | Deepfake | test | 0.1820 |
Front-End | Triplet Mining | Dataset | Speaker Verification EER |
---|---|---|---|
WavLM-Base/Joint | Deepfake | training | 0.2339 |
WavLM-Base/Joint | Deepfake | validation | 0.4918 |
WavLM-Base/Joint | Deepfake | test | 0.4871 |
Front-End | Triplet Mining | Dataset | Speaker Verification EER |
---|---|---|---|
WavLM-Base/Joint | Deepfake | training | 0.0815 |
WavLM-Base/Joint | Deepfake | validation | 0.1620 |
WavLM-Base/Joint | Deepfake | test | 0.1613 |
@inproceedings{desplanques20_interspeech,
author={Brecht Desplanques and Jenthe Thienpondt and Kris Demuynck},
title={{ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification}},
year=2020,
booktitle={Proc. Interspeech 2020},
pages={3830--3834},
doi={10.21437/Interspeech.2020-2650},
issn={2308-457X}
}
@ARTICLE{9814838,
author={Chen, Sanyuan and Wang, Chengyi and Chen, Zhengyang and Wu, Yu and Liu, Shujie and Chen, Zhuo and Li, Jinyu and Kanda, Naoyuki and Yoshioka, Takuya and Xiao, Xiong and Wu, Jian and Zhou, Long and Ren, Shuo and Qian, Yanmin and Qian, Yao and Wu, Jian and Zeng, Michael and Yu, Xiangzhan and Wei, Furu},
journal={IEEE Journal of Selected Topics in Signal Processing},
title={WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing},
year={2022},
volume={16},
number={6},
pages={1505-1518},
keywords={Predictive models;Self-supervised learning;Speech processing;Speech recognition;Convolution;Benchmark testing;Self-supervised learning;speech pre-training},
doi={10.1109/JSTSP.2022.3188113}}
@misc{jung2024espnetspk,
title={ESPnet-SPK: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models},
author={Jee-weon Jung and Wangyou Zhang and Jiatong Shi and Zakaria Aldeneh and Takuya Higuchi and Barry-John Theobald and Ahmed Hussen Abdelaziz and Shinji Watanabe},
year={2024},
eprint={2401.17230},
archivePrefix={arXiv},
primaryClass={cs.SD}
}
@misc{ravanelli2021speechbrain,
title={SpeechBrain: A General-Purpose Speech Toolkit},
author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
year={2021},
eprint={2106.04624},
archivePrefix={arXiv},
primaryClass={eess.AS}
}
@inproceedings{yang21c_interspeech,
author={Shu-wen Yang and Po-Han Chi and Yung-Sung Chuang and Cheng-I Jeff Lai and Kushal Lakhotia and Yist Y. Lin and Andy T. Liu and Jiatong Shi and Xuankai Chang and Guan-Ting Lin and Tzu-Hsien Huang and Wei-Cheng Tseng and Ko-tik Lee and Da-Rong Liu and Zili Huang and Shuyan Dong and Shang-Wen Li and Shinji Watanabe and Abdelrahman Mohamed and Hung-yi Lee},
title={{SUPERB: Speech Processing Universal PERformance Benchmark}},
year=2021,
booktitle={Proc. Interspeech 2021},
pages={1194--1198},
doi={10.21437/Interspeech.2021-1775}
}
@INPROCEEDINGS{7298682,
author={Schroff, Florian and Kalenichenko, Dmitry and Philbin, James},
booktitle={2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
title={FaceNet: A unified embedding for face recognition and clustering},
year={2015},
volume={},
number={},
pages={815-823},
keywords={Face;Face recognition;Training;Accuracy;Artificial neural networks;Standards;Principal component analysis},
doi={10.1109/CVPR.2015.7298682}}
@inproceedings{inproceedings,
author = {Balntas, Vassileios and Riba, Edgar and Ponsa, Daniel and Mikolajczyk, Krystian},
year = {2016},
month = {01},
pages = {119.1-119.11},
title = {Learning local feature descriptors with triplets and shallow convolutional neural networks},
doi = {10.5244/C.30.119}
}
@misc{zen2019libritts,
title={LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech},
author={Heiga Zen and Viet Dang and Rob Clark and Yu Zhang and Ron J. Weiss and Ye Jia and Zhifeng Chen and Yonghui Wu},
year={2019},
eprint={1904.02882},
archivePrefix={arXiv},
primaryClass={cs.SD}
}