This repository is an extension of the paper Punctuation Restoration using Transformer Models for High- and Low-Resource Languages, accepted at the EMNLP 2020 workshop W-NUT.
The TedTalk datasets are provided in the data directory.
We fine-tune a Transformer-based language model (e.g., BERT) for the punctuation restoration task. The Transformer encoder is followed by a bidirectional LSTM and a linear layer that predicts the target punctuation token at each sequence position.
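The encoder-LSTM-classifier stack described above can be sketched in PyTorch as follows. This is an illustrative sketch, not the repository's actual model code: a small nn.TransformerEncoder stands in for the pretrained BERT encoder so the example runs without downloading weights, and the four-class label set is an assumption.

```python
import torch
import torch.nn as nn

# Assumed label set for illustration: no punctuation, comma, period, question mark.
NUM_CLASSES = 4

class PunctuationRestorer(nn.Module):
    """Transformer encoder -> bidirectional LSTM -> per-token linear classifier.

    Sketch only: a randomly initialized nn.TransformerEncoder replaces the
    pretrained model (e.g., bert-base-uncased) used in the repository.
    """

    def __init__(self, vocab_size=30522, hidden=128, lstm_hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lstm = nn.LSTM(hidden, lstm_hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_hidden, NUM_CLASSES)

    def forward(self, token_ids):
        x = self.encoder(self.embed(token_ids))  # (batch, seq, hidden)
        x, _ = self.lstm(x)                      # (batch, seq, 2 * lstm_hidden)
        return self.classifier(x)                # (batch, seq, NUM_CLASSES)

model = PunctuationRestorer()
logits = model(torch.randint(0, 30522, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 4])
```

One logit vector is produced per input token, so training reduces to per-token cross-entropy against the punctuation label that follows each word.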
Install PyTorch following the instructions on the PyTorch website. The remaining dependencies can be installed with the following command:
pip install -r requirements.txt
To train a punctuation restoration model with the optimal parameter settings for English, run the following command:
bash src/run-train.sh bert-base-uncased
You can run inference on an unprocessed text file to produce punctuated text using the inference module. Note that if the text already contains punctuation, it is removed before inference.
Example script for English:
bash src/run-inference.sh
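The punctuation cleanup mentioned above can be sketched as a small preprocessing step. The helper below is hypothetical (the repository's inference module performs an equivalent cleanup internally); the exact character set removed is an assumption.

```python
import re

def strip_punctuation(text):
    """Remove existing punctuation so inference starts from clean text.

    Hypothetical helper: removes an assumed set of punctuation characters
    and collapses the resulting whitespace.
    """
    cleaned = re.sub(r"[,.!?;:\"()\[\]-]", " ", text)
    return " ".join(cleaned.split())

print(strip_punctuation("Hello, world! How are you?"))
# Hello world How are you
```

The model then re-predicts punctuation for every token of the cleaned text.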
Trained models can be tested on processed data using the test
module to produce results.
For example, to test the best-performing English model, run the following command:
python run-test.py