Google Colab notebooks showing how to use this parser are available here:
- Naija spoken training from a pre-trained English model: link
- training from scratch on spoken Naija: link
- training from scratch on written English: link
- mock colab for testing that everything is fine: link
You need Python 3.11 installed on your machine, as well as poetry (link).
# install latest poetry
curl -sSL https://install.python-poetry.org | python3 -
# check installation
poetry --version
# in case you have multiple versions of python, make sure to specify version 3.11
poetry env use 3.11
On Linux:
git clone https://github.com/kirianguiller/BertForDeprel
cd BertForDeprel
# optional : if you want to have your venv in the project folder as .venv (recommended)
poetry config virtualenvs.in-project true
poetry install
poetry run pytest
Either provide the path to a model JSON config:
python /home/BertForDeprel/BertForDeprel/run.py train --conf /home/models/template.config.json --ftrain /home/parsing_project/conllus/train.conllu
or just give a --new_model_path and a --model_name parameter (default params will be loaded if no config or no CLI parameters are provided):
python /home/BertForDeprel/BertForDeprel/run.py train --new_model_path /home/models/ --model_name my_parser --ftrain /home/parsing_project/conllus/train.conllu
PS: here is an example of a valid config.json:
{
"new_model_path": "/home/user1/models/",
"max_epoch": 150,
"patience": 30,
"batch_size": 16,
"maxlen": 512,
"embedding_type": "xlm-roberta-large",
"adapter_config_type": ""
}
For predicting, you need to provide the --conf parameter, which is the path to the xxx.config.json file. You also need to provide the --inpath parameter, which is the path to a single conllu file or to a folder containing multiple conllu files. The output folder parameter --outpath (or -o) is optional.
python /home/BertForDeprel/BertForDeprel/run.py predict --conf /home/models/my_parser.config.json --inpath /home/parsing_project/to_predict/ --outpath /home/parsing_project/predicted/
- --conf, -c : path to the config json file (for training, it is optional if both --new_model_path and --model_name are provided)
- --batch_size : number of samples per batch (high incidence on total speed)
- --num_workers : number of workers for preparing the dataset (low incidence on total speed)
- --seed, -s : random seed (default = 42)
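For instance, a training command making these generic parameters explicit could look like the following (the paths and values are only illustrative):
python /home/BertForDeprel/BertForDeprel/run.py train --new_model_path /home/models/ --model_name my_parser --ftrain /home/parsing_project/conllus/train.conllu --batch_size 16 --num_workers 4 --seed 42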
The directory to store and load pretrained models is set via the environment variable TORCH_HOME.
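For example, to cache the pretrained embeddings in a custom location (illustrative path), export the variable before running the scripts:
export TORCH_HOME=/home/user1/.cache/torch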
- --new_model_path, -f : path to the parent folder of the model (optional if --conf is already provided)
- --embedding_type, -e : type of embedding (default: xlm-roberta-large)
- --max_epoch : maximum number of epochs (early stopping can shorten this number)
- --patience : number of epochs without improvement required to stop the training (early stopping)
- --ftrain : path to the train file or folder (files need the .conllu extension)
- --ftest : path to the test file or folder (files need the .conllu extension) (not required; if not provided, see --split_ratio)
- --split_ratio : ratio for splitting the ftrain dataset into train and test datasets (default: 0.8)
- --pretrained_path : path to the config of a pretrained model, used for fine-tuning a pretrained BertForDeprel model
- --overwrite_pretrain_classifiers : erase the pretrained classifier heads and recompute the annotation schema
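Putting several of these training parameters together, a possible command (with illustrative paths and values) is:
python /home/BertForDeprel/BertForDeprel/run.py train --new_model_path /home/models/ --model_name my_parser --ftrain /home/parsing_project/conllus/train.conllu --ftest /home/parsing_project/conllus/test.conllu --embedding_type xlm-roberta-large --max_epoch 150 --patience 30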
- --model_path, -m : path to the model (folder or file)
- --inpath, -i : path to the file or the folder containing the files to predict
- --outpath, -o : path to the folder that will contain the predicted files
- --suffix : optional (default = ""), suffix that will be added to the name of the predicted files (before the file extension)
- --overwrite : whether or not to overwrite a predicted conllu output file if it already exists
- --write_preds_in_misc : whether or not to write predictions in the conllu MISC column instead of in the corresponding UPOS, DEPREL and HEAD columns
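For example, a prediction run that adds a suffix to the output files and overwrites previous results could look like this (illustrative paths; --overwrite is assumed here to be a plain boolean switch):
python /home/BertForDeprel/BertForDeprel/run.py predict --conf /home/models/my_parser.config.json --inpath /home/parsing_project/to_predict/ --outpath /home/parsing_project/predicted/ --suffix .parsed --overwrite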
Each of the following parameters is a string that can take the values "NONE" | "EXISTING" | "ALL" (default: "NONE"):
- --keep_heads
- --keep_upos
- --keep_xpos
- --keep_deprels
- --keep_misc
- --keep_feats
- --keep_deps
- --keep_morph
- --keep_lemmas
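For instance, to keep the UPOS annotations already present in the input files while recomputing everything else (illustrative paths; the value is assumed to be passed right after the flag):
python /home/BertForDeprel/BertForDeprel/run.py predict --conf /home/models/my_parser.config.json --inpath /home/parsing_project/to_predict/ --outpath /home/parsing_project/predicted/ --keep_upos EXISTING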
You will need some conllu files for training the model and doing inference.
For training, you have the choice between:
- providing a single conllu file (--ftrain CLI parameter) with all your training and testing sentences (the train/test split ratio is 0.8 by default, but you can set it with the --split_ratio parameter)
- providing a train conllu file (--ftrain) and a test conllu file (--ftest)
- providing a train folder containing the .conllu files (--ftest can also be provided, as a file or a folder too); see the example command just below
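For example, the folder-based option could look like this (illustrative paths; train_folder/ is a hypothetical folder containing only training .conllu files):
python /home/BertForDeprel/BertForDeprel/run.py train --new_model_path /home/models/ --model_name my_parser --ftrain /home/parsing_project/conllus/train_folder/ --ftest /home/parsing_project/conllus/test.langA.conllu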
For inference, you have to provide an input file or folder (--inpath or -i). The model will infer parse trees for all sentences of all input conllu files, and the resulting conllu files will be written to the output folder (--outpath or -o).
For people who want to use the parser for language transfer (training on language A, then fine-tuning on language B), it is important to provide --path_folder_compute_annotation_schema with a folder that contains the gold conllu files of both languages A and B, so that the annotation schema (set of deprels, uposs, feats, lemma scripts, etc.) can be precomputed before the pretraining. The same annotation schema must be used for training, inference and fine-tuning.
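A possible pretraining command precomputing the schema (illustrative paths; the conllus/ folder is assumed to contain the gold conllu files of both languages) could be:
python /home/BertForDeprel/BertForDeprel/run.py train --new_model_path /home/models/ --model_name pretrained_langA --ftrain /home/parsing_project/conllus/train.langA.conllu --path_folder_compute_annotation_schema /home/parsing_project/conllus/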
Here is a folder structure example of how I store the different train/test/to_predict/predicted conllu files:
|- [NAME_FOLDER]/
| |- conllus/
| | - <train.langA.conllu>
| | - <test.langA.conllu>
| | - <train.langB.conllu>
| | - <test.langB.conllu>
| |- to_predict/
| | - <raw1.langB.conllu>
| | - <raw2.langB.conllu>
| | - <raw3.langB.conllu>
| |- predicted/
where <train.conllu> and <test.conllu> are respectively the train and test datasets. They can have any name you want, as you will have to indicate the path to these files in the running script.
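If you want to reproduce this layout on your machine (the folder names are only illustrative), you can create it with:
mkdir -p /home/parsing_project/conllus /home/parsing_project/to_predict /home/parsing_project/predicted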
WARNING: when training from a pretrained model, be sure to use the same annotation_schema.json for fine-tuning as the one that was used for pretraining. Otherwise, the training would break.
To fine-tune a pre-trained model, follow the same steps as for training a new model, but also provide the path to the config file of the previously trained model with --pretrained_path:
python /home/BertForDeprel/BertForDeprel/run.py train --new_model_path /home/models/ --model_name my_parser --ftrain /home/parsing_project/conllus/train.conllu --pretrained_path /home/models/pretrained_model.config.json
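If you also want to discard the pretrained classifier heads and recompute the annotation schema (see --overwrite_pretrain_classifiers above, assumed here to be a plain boolean flag), the same command can be extended:
python /home/BertForDeprel/BertForDeprel/run.py train --new_model_path /home/models/ --model_name my_parser --ftrain /home/parsing_project/conllus/train.conllu --pretrained_path /home/models/pretrained_model.config.json --overwrite_pretrain_classifiers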
- --gpu_ids 0 : run the training on the single GPU of id 0 (respectively, --gpu_ids 3 for running on the single GPU of id 3)
- --gpu_ids 0,1 : run the training on the multiple GPUs of ids 0 and 1
- --gpu_ids "-2" : run the training on all available GPUs
- --gpu_ids "-1" : run the training on CPU only
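For example, to launch the training on the two GPUs of ids 0 and 1, append the flag to any of the training commands above (illustrative paths):
python /home/BertForDeprel/BertForDeprel/run.py train --new_model_path /home/models/ --model_name my_parser --ftrain /home/parsing_project/conllus/train.conllu --gpu_ids 0,1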
You can find on this Gdrive repo all the pretrained models, Google Colab scripts for training and publicly available treebanks (.conllu files).
Among others, here are the most important pretrained models:
- English model trained from scratch on written English
- Naija model trained from scratch on spoken Naija
- Naija model fine-tuned on spoken Naija from a model pretrained on written English
- Add feats and gloss prediction
- Add lemma
- Add confidence threshold prediction (the model outputs nothing when the confidence is below a certain value)
- Add possibility of returning the confidence of the predictions (inside miscs)
- Support for active learning
- Tokenization
- Memory-efficient prediction service (only load one copy of XLM-Roberta for all languages)