pip install -r requirements.txt
NOTE:
- You may need additional setup for phonemizer if you want to run the phoneme-related data preprocessing.
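The extra setup usually concerns phonemizer's backend rather than the Python package itself: its espeak backend relies on a system installation of espeak-ng (or espeak) that pip does not provide. Assuming that backend is the one used here (this README does not say), the sketch below is a rough check of whether the binary or shared library is visible on your machine. The helper name find_espeak is hypothetical and not part of this repo.

```python
import ctypes.util
import shutil


def find_espeak():
    """Report which espeak pieces are discoverable on this machine.

    Assumption: the preprocessing uses phonemizer's espeak backend.
    A value of None means that piece was not found.
    """
    return {
        # Command-line executable, if it is on PATH.
        "binary": shutil.which("espeak-ng") or shutil.which("espeak"),
        # Shared library, as located by the system's library search.
        "library": ctypes.util.find_library("espeak-ng")
        or ctypes.util.find_library("espeak"),
    }
```

If both values come back as None, installing espeak-ng through your system package manager is the usual fix; see the phonemizer documentation for platform-specific instructions.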
- Data Preparation
For SLURP, the preprocessed dataset is datasets/slurp/slurp_with_oracle_test.json. The preprocessed dataset without filtering or test-set separation is datasets/slurp/slurp.json.
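The layout of these JSON files is not described here, so it can help to inspect one before wiring it into your own code. The following is a minimal sketch that makes no assumption about the schema; summarize_json is a hypothetical helper, not a function from this repo.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Return the top-level type and size of a JSON file, schema-agnostic."""
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)
    summary = {"type": type(data).__name__}
    if isinstance(data, (list, dict)):
        summary["size"] = len(data)
    if isinstance(data, dict):
        # Show only a few keys so large files stay readable.
        summary["first_keys"] = list(data)[:5]
    return summary
```

For example, calling summarize_json("datasets/slurp/slurp.json") from the repo root shows whether the file is a list of examples or a dictionary of splits.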
The data preprocessing consists of several steps:
- Derive ASR hypotheses
- Generate phoneme sequences with phonemizer
- Preprocess the dataset (1st version)
  - The scripts in prepare_data can help you understand the process:
    - first run make_golden_dataset, which reads only from the data provided in the SLURP repo
    - then run make_dataset, which needs transcriptions from different systems
- Fine-tune roberta-base models on the 1st version dataset
- Collect the predictions and sub-sample the dataset with agreed pseudo labels
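The last step, sub-sampling by agreed pseudo labels, can be sketched as follows. This is one plausible reading, not the repo's actual code: it assumes "agreed" means that the fine-tuned models predict the same label for an example. The function name, the pseudo_label key, and the min_agree parameter are all hypothetical.

```python
from collections import Counter


def subsample_by_agreement(examples, predictions, min_agree=None):
    """Keep examples whose top predicted label is shared by enough models.

    examples: list of dicts (schema is an assumption).
    predictions: one list of labels per model, each aligned with examples.
    min_agree: models that must share the label; defaults to all of them.
    """
    if not predictions:
        raise ValueError("need predictions from at least one model")
    if any(len(p) != len(examples) for p in predictions):
        raise ValueError("every prediction list must align with examples")
    if min_agree is None:
        min_agree = len(predictions)

    kept = []
    for i, example in enumerate(examples):
        votes = Counter(model_preds[i] for model_preds in predictions)
        label, count = votes.most_common(1)[0]
        if count >= min_agree:
            # Copy the example and attach the agreed label.
            kept.append({**example, "pseudo_label": label})
    return kept
```

Requiring unanimous agreement (the default) yields a smaller but cleaner subset; lowering min_agree trades label quality for coverage.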
For ATIS/TREC6 from PhonemeBERT, you can simply clone their repo and unzip the dataset.
- Contrastive Pretraining
python contrastive_pretraining.py
- Fine-tuning
python finetune_on_slurp.py
or on the PhonemeBERT datasets:
python finetune_on_phonemebert.py
Both scripts include training and evaluation. Adjust the arguments as needed.
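For readers new to the contrastive pretraining step above, the general idea can be illustrated with a generic InfoNCE-style loss. This is a textbook sketch in plain Python, not a description of what contrastive_pretraining.py implements. It assumes each clean transcript embedding is paired with the embedding of its ASR hypothesis at the same index, and the temperature value is an arbitrary placeholder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def info_nce(clean, noisy, temperature=0.05):
    """Mean InfoNCE loss; clean[i] and noisy[i] form the positive pair.

    All other noisy embeddings in the batch act as negatives.
    """
    losses = []
    for i, anchor in enumerate(clean):
        logits = [cosine(anchor, cand) / temperature for cand in noisy]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

The loss is small when each clean embedding is closest to its own ASR counterpart and large when it is closer to another example's, which is the property that makes such an objective useful for robustness to recognition errors.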
Please cite the following paper:
@inproceedings{chang2022contrastive,
title={Contrastive Learning for Improving ASR Robustness in Spoken Language Understanding},
author={Chang, Ya-Hsin and Chen, Yun-Nung},
booktitle={The 23rd Annual Meeting of the International Speech Communication Association (INTERSPEECH)},
pages={3458-3462},
year={2022}
}