This repository is an extension of the paper Punctuation Restoration using Transformer Models for High- and Low-Resource Languages, accepted at the EMNLP 2020 workshop W-NUT.
The TedTalk datasets are provided in the data directory.
We fine-tune a Transformer-based language model (e.g., BERT) for the punctuation restoration task. The Transformer encoder is followed by a bidirectional LSTM and a linear layer that predicts the target punctuation token at each sequence position.
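The encoder-LSTM-classifier stack described above can be sketched in PyTorch as follows. This is an illustrative sketch, not the repository's actual model code: a small nn.TransformerEncoder stands in for the pretrained BERT encoder so the example runs without downloading weights, and the four-class label set is an assumption.

```python
import torch
import torch.nn as nn

# Assumed label set for illustration: no punctuation, comma, period, question mark.
NUM_CLASSES = 4

class PunctuationRestorer(nn.Module):
    """Transformer encoder -> bidirectional LSTM -> per-token linear classifier.

    Sketch only: a randomly initialized nn.TransformerEncoder replaces the
    pretrained model (e.g., bert-base-uncased) used in the repository.
    """

    def __init__(self, vocab_size=30522, hidden=128, lstm_hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lstm = nn.LSTM(hidden, lstm_hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_hidden, NUM_CLASSES)

    def forward(self, token_ids):
        x = self.encoder(self.embed(token_ids))  # (batch, seq, hidden)
        x, _ = self.lstm(x)                      # (batch, seq, 2 * lstm_hidden)
        return self.classifier(x)                # (batch, seq, NUM_CLASSES)

model = PunctuationRestorer()
logits = model(torch.randint(0, 30522, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 4])
```

One logit vector is produced per input token, so training reduces to per-token cross-entropy against the punctuation label that follows each word.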
Install PyTorch following the instructions on the PyTorch website. The remaining dependencies can be installed with the following command:
pip install -r requirements.txt
To train a punctuation restoration model with the optimal parameter settings for English, run the following command:
bash src/run-train.sh bert-base-uncased
You can run inference on an unprocessed text file to produce punctuated text using the inference module. Note that if the text already contains punctuation, it is removed before inference.
Example script for English:
bash src/run-inference.sh
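The punctuation cleanup mentioned above can be sketched as a small preprocessing step. The helper below is hypothetical (the repository's inference module performs an equivalent cleanup internally); the exact character set removed is an assumption.

```python
import re

def strip_punctuation(text):
    """Remove existing punctuation so inference starts from clean text.

    Hypothetical helper: removes an assumed set of punctuation characters
    and collapses the resulting whitespace.
    """
    cleaned = re.sub(r"[,.!?;:\"()\[\]-]", " ", text)
    return " ".join(cleaned.split())

print(strip_punctuation("Hello, world! How are you?"))
# Hello world How are you
```

The model then re-predicts punctuation for every token of the cleaned text.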
Trained models can be tested on processed data using the test
module to produce results.
For example, to test the best-performing English model, run the following command:
python run-test.py