We experiment with different Transformer-based pre-trained language models for the terminology extraction task, namely XLNet, BERT, DistilBERT, and RoBERTa, combined with additional techniques: class weighting to reduce the significant class imbalance in the training data, and rule-based term expansion and filtering. Our experiments are conducted on the ACTER dataset, which covers 3 languages and 3 domains. The results are competitive on English and French, and the proposed approach outperforms the state of the art (SOTA) on Dutch.
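As a rough illustration of the class-weighting step, the sketch below computes balanced weights over B/I/O token labels and feeds them to a weighted cross-entropy loss. The label set, the `train_labels` variable, and the plain PyTorch loss are assumptions for illustration; the actual training runs through SimpleTransformers.

```python
import numpy as np
import torch
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical flat list of gold token labels from the training split (B/I/O term tagging).
train_labels = ["O", "O", "B", "I", "O", "B", "O", "O", "O", "O"]

label_list = ["O", "B", "I"]
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array(label_list),
                               y=np.array(train_labels))

# The weights up-weight the rare B/I classes in the token-level loss.
loss_fn = torch.nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float))
```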
The dataset structure, as well as the distribution of terms per domain and per language, is explored in `data.exploration.ipynb`.
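For reference, a minimal sketch of the kind of count performed in the notebook, assuming the gold term lists are stored one term per line; the file paths are hypothetical and should be adjusted to your local ACTER layout:

```python
from pathlib import Path

# Hypothetical locations of the English gold term lists per domain.
annotation_files = {
    "corp": Path("ACTER/en/corp/annotations/corp_en_terms.tsv"),
    "wind": Path("ACTER/en/wind/annotations/wind_en_terms.tsv"),
    "equi": Path("ACTER/en/equi/annotations/equi_en_terms.tsv"),
}

for domain, path in annotation_files.items():
    terms = [line.split("\t")[0] for line in path.read_text(encoding="utf-8").splitlines() if line.strip()]
    print(f"{domain}: {len(terms)} annotated terms")
```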
For each language, we examine several pretrained language models using SimpleTransformers, as listed in the following table (a short model-selection sketch follows the table).
Model | English dataset | French dataset | Dutch dataset |
---|---|---|---|
Multilingual BERT (uncased) | x | x | x |
Multilingual BERT (cased) | x | x | x |
Monolingual English BERT (uncased) | x | | |
Monolingual English BERT (cased) | x | | |
RoBERTa | x | | |
DistilBERT (uncased) | x | | |
DistilBERT (cased) | x | | |
Multilingual DistilBERT (cased) | x | | |
XLNet | x | | |
CamemBERT | | x | |
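As a rough sketch of how these models are swapped in, SimpleTransformers selects the architecture and checkpoint via `model_type` and `model_name`. The checkpoint names below are common Hugging Face identifiers and serve only as examples, not necessarily the exact ones we used.

```python
from simpletransformers.ner import NERModel

# Example (model_type, model_name) pairs per language; the checkpoint names are illustrative.
candidates = {
    "en": [("bert", "bert-base-uncased"),
           ("roberta", "roberta-base"),
           ("distilbert", "distilbert-base-cased"),
           ("xlnet", "xlnet-base-cased")],
    "fr": [("camembert", "camembert-base"),
           ("bert", "bert-base-multilingual-cased")],
    "nl": [("bert", "bert-base-multilingual-cased")],
}

labels = ["O", "B", "I"]  # token-level term tags

model_type, model_name = candidates["fr"][0]
model = NERModel(model_type, model_name, labels=labels, use_cuda=False)
```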
The workflow of our implementation:

The code inside `./core_model/` is an example of how we run CamemBERT on the French dataset; a minimal training sketch is given below. We recommend running it on Google Colab to take advantage of a GPU, in case your local machine does not have one.
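The sketch below outlines that workflow with SimpleTransformers; the DataFrame columns follow the library's NER input format, while the example sentence and hyperparameter values are illustrative assumptions rather than the exact configuration in `./core_model/`.

```python
import pandas as pd
from simpletransformers.ner import NERModel

# SimpleTransformers expects token-level training data with these three columns.
train_df = pd.DataFrame(
    [(0, "Les", "O"), (0, "éoliennes", "B"), (0, "offshore", "I"), (0, "produisent", "O")],
    columns=["sentence_id", "words", "labels"],
)

args = {
    "num_train_epochs": 5,          # illustrative values, not our exact configuration
    "train_batch_size": 16,
    "overwrite_output_dir": True,
}

model = NERModel("camembert", "camembert-base", labels=["O", "B", "I"], args=args, use_cuda=True)
model.train_model(train_df)

# Predict term tags for new sentences, then extract B/I spans as candidate terms.
predictions, _ = model.predict(["L'énergie éolienne offshore est en plein essor."])
print(predictions)
```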
All saved prediction results of the above pretrained models on the 3 languages are stored in the folder `./results/weighted_results/`.
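To compare a saved prediction file against a gold term list, one could use the standard exact-match precision/recall/F1 computation sketched below; the one-term-per-line format and the file names are assumptions, not a description of our actual result files.

```python
from pathlib import Path

def load_terms(path):
    # One term per line; assumed format, adjust to the actual result files.
    return {line.strip().lower() for line in Path(path).read_text(encoding="utf-8").splitlines() if line.strip()}

predicted = load_terms("results/weighted_results/example_predictions.txt")  # hypothetical file name
gold = load_terms("gold_terms.txt")                                          # hypothetical file name

true_positives = len(predicted & gold)
precision = true_positives / len(predicted) if predicted else 0.0
recall = true_positives / len(gold) if gold else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```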
- Shared Task on Automatic Term Extraction Using the Annotated Corpora for Term Extraction Research (ACTER) Dataset.
- TALN-LS2N System for Automatic Term Extraction.
- 🐮 TRAN Thi Hong Hanh 🐮
- Prof. Senja POLLAK
- Prof. Antoine DOUCET
- Prof. Matej MARTINC
- Prof. Andraž REPAR