Google Colab notebooks showing how to use this parser are available here:
- Naija spoken training from a pre-trained English model: link
- training from scratch on spoken Naija: link
- training from scratch on written English: link
- mock colab for testing that everything is fine: link
You need Python 3.11 installed on your machine, as well as poetry (link).
# install latest poetry
curl -sSL https://install.python-poetry.org | python3 -
# check installation
poetry --version
# in case you have multiple versions of python, make sure to specify version 3.11
poetry env use 3.11
On Linux:
git clone https://github.com/kirianguiller/BertForDeprel
cd BertForDeprel
# optional : if you want to have your venv in the project folder as .venv (recommended)
poetry config virtualenvs.in-project true
poetry install
poetry run pytest
Either provide the path to a model JSON config:
python /home/BertForDeprel/BertForDeprel/run.py train --conf /home/models/template.config.json --ftrain /home/parsing_project/conllus/train.conllu
or just give a --new_model_path and a --model_name parameter (default params will be loaded if no config or no CLI parameters are provided):
python /home/BertForDeprel/BertForDeprel/run.py train --new_model_path /home/models/ --model_name my_parser --ftrain /home/parsing_project/conllus/train.conllu
PS: here is an example of a valid config.json:
{
"new_model_path": "/home/user1/models/",
"max_epoch": 150,
"patience": 30,
"batch_size": 16,
"maxlen": 512,
"embedding_type": "xlm-roberta-large",
"adapter_config_type": ""
}
For predicting, you need to provide the --conf parameter, which is the path to the xxx.config.json file. You also need to provide the --inpath parameter, which is the path to a single conllu file or to a folder containing multiple conllu files. The output folder parameter --outpath (or -o) is optional.
python /home/BertForDeprel/BertForDeprel/run.py predict --conf /home/models/my_parser.config.json --inpath /home/parsing_project/to_predict/ --outpath /home/parsing_project/predicted/
- --conf, -c : path to the config json file (for training, it is optional if both --new_model_path and --model_name are provided)
- --batch_size : number of samples per batch (high incidence on total speed)
- --num_workers : number of workers for preparing the dataset (low incidence on total speed)
- --seed, -s : random seed (default = 42)
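For instance, a training command making these generic parameters explicit could look like the following (the paths and values are only illustrative):
python /home/BertForDeprel/BertForDeprel/run.py train --new_model_path /home/models/ --model_name my_parser --ftrain /home/parsing_project/conllus/train.conllu --batch_size 16 --num_workers 4 --seed 42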
The directory to store and load pretrained models is set via the environment variable TORCH_HOME.
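For example, to cache the pretrained embeddings in a custom location (illustrative path), export the variable before running the scripts:
export TORCH_HOME=/home/user1/.cache/torch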
- --new_model_path, -f : path to the parent folder of the model (optional if --conf is already provided)
- --embedding_type, -e : type of embedding (default: xlm-roberta-large)
- --max_epoch : maximum number of epochs (early stopping can shorten this number)
- --patience : number of epochs without improvement required to stop the training (early stopping)
- --ftrain : path to the train file or folder (files need the .conllu extension)
- --ftest : path to the test file or folder (files need the .conllu extension) (not required; if not provided, see --split_ratio)
- --split_ratio : ratio for splitting the ftrain dataset into train and test datasets (default: 0.8)
- --pretrained_path : path to the config of a pretrained model, used for fine-tuning a pretrained BertForDeprel model
- --overwrite_pretrain_classifiers : erase the pretrained classifier heads and recompute the annotation schema
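Putting several of these training parameters together, a possible command (with illustrative paths and values) is:
python /home/BertForDeprel/BertForDeprel/run.py train --new_model_path /home/models/ --model_name my_parser --ftrain /home/parsing_project/conllus/train.conllu --ftest /home/parsing_project/conllus/test.conllu --embedding_type xlm-roberta-large --max_epoch 150 --patience 30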
- --model_path, -m : path to the model (folder or file)
- --inpath, -i : path to the file or the folder containing the files to predict
- --outpath, -o : path to the folder that will contain the predicted files
- --suffix : optional (default = ""), suffix that will be added to the name of the predicted files (before the file extension)
- --overwrite : whether or not to overwrite a predicted conllu output file if it already exists
- --write_preds_in_misc : whether or not to write predictions in the conllu MISC column instead of in the corresponding UPOS, DEPREL and HEAD columns
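For example, a prediction run that adds a suffix to the output files and overwrites previous results could look like this (illustrative paths; --overwrite is assumed here to be a plain boolean switch):
python /home/BertForDeprel/BertForDeprel/run.py predict --conf /home/models/my_parser.config.json --inpath /home/parsing_project/to_predict/ --outpath /home/parsing_project/predicted/ --suffix .parsed --overwrite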
Each of the following parameters is a string that can take the values "NONE" | "EXISTING" | "ALL" (default: "NONE"):
- --keep_heads
- --keep_upos
- --keep_xpos
- --keep_deprels
- --keep_misc
- --keep_feats
- --keep_deps
- --keep_morph
- --keep_lemmas
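For instance, to keep the UPOS annotations already present in the input files while recomputing everything else (illustrative paths; the value is assumed to be passed right after the flag):
python /home/BertForDeprel/BertForDeprel/run.py predict --conf /home/models/my_parser.config.json --inpath /home/parsing_project/to_predict/ --outpath /home/parsing_project/predicted/ --keep_upos EXISTING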
You will need some conllu files for training the model and doing inference.
For training, you have the choice between:
- providing a single conllu file (--ftrain CLI parameter) with all your training and testing sentences (the train/test split ratio is 0.8 by default, but you can set it with the --split_ratio parameter)
- providing a train conllu file (--ftrain) and a test conllu file (--ftest)
- providing a train folder containing the .conllu files (--ftest can also be provided, as a file or a folder too); see the example command just below
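For example, the folder-based option could look like this (illustrative paths; train_folder/ is a hypothetical folder containing only training .conllu files):
python /home/BertForDeprel/BertForDeprel/run.py train --new_model_path /home/models/ --model_name my_parser --ftrain /home/parsing_project/conllus/train_folder/ --ftest /home/parsing_project/conllus/test.langA.conllu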
For inference, you have to provide an input file or folder (--inpath or -i). The model will infer parse trees for all sentences of all input conllu files, and the resulting conllu files will be written to the output folder (--outpath or -o).
For people who want to use the parser for language transfer (training on language A, then fine-tuning on language B), it is important to provide --path_folder_compute_annotation_schema with a folder that contains the gold conllu files of both languages A and B, so that the annotation schema (set of deprels, uposs, feats, lemma scripts, etc.) can be precomputed before the pretraining. The same annotation schema must be used for training, inference and fine-tuning.
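A possible pretraining command precomputing the schema (illustrative paths; the conllus/ folder is assumed to contain the gold conllu files of both languages) could be:
python /home/BertForDeprel/BertForDeprel/run.py train --new_model_path /home/models/ --model_name pretrained_langA --ftrain /home/parsing_project/conllus/train.langA.conllu --path_folder_compute_annotation_schema /home/parsing_project/conllus/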
Here is a folder structure example of how I store the different train/test/to_predict/predicted conllu files:
|- [NAME_FOLDER]/
| |- conllus/
| | - <train.langA.conllu>
| | - <test.langA.conllu>
| | - <train.langB.conllu>
| | - <test.langB.conllu>
| |- to_predict/
| | - <raw1.langB.conllu>
| | - <raw2.langB.conllu>
| | - <raw3.langB.conllu>
| |- predicted/
where <train.conllu> and <test.conllu> are respectively the train and test datasets. They can have any name you want, as you will have to indicate the path to these files in the running script.
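If you want to reproduce this layout on your machine (the folder names are only illustrative), you can create it with:
mkdir -p /home/parsing_project/conllus /home/parsing_project/to_predict /home/parsing_project/predicted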
WARNING: when training from a pretrained model, be sure to use the same annotation_schema.json for fine-tuning as the one that was used for pretraining. Otherwise, the training would break.
To fine-tune a pre-trained model, follow the same steps as for training a new model, but also provide the path to the config file of the previously trained model with --pretrained_path:
python /home/BertForDeprel/BertForDeprel/run.py train --new_model_path /home/models/ --model_name my_parser --ftrain /home/parsing_project/conllus/train.conllu --pretrained_path /home/models/pretrained_model.config.json
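If you also want to discard the pretrained classifier heads and recompute the annotation schema (see --overwrite_pretrain_classifiers above, assumed here to be a plain boolean flag), the same command can be extended:
python /home/BertForDeprel/BertForDeprel/run.py train --new_model_path /home/models/ --model_name my_parser --ftrain /home/parsing_project/conllus/train.conllu --pretrained_path /home/models/pretrained_model.config.json --overwrite_pretrain_classifiers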
- --gpu_ids 0 : run the training on the single GPU of id 0 (respectively, --gpu_ids 3 for running on the single GPU of id 3)
- --gpu_ids 0,1 : run the training on the multiple GPUs of ids 0 and 1
- --gpu_ids "-2" : run the training on all available GPUs
- --gpu_ids "-1" : run the training on CPU only
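For example, to launch the training on the two GPUs of ids 0 and 1, append the flag to any of the training commands above (illustrative paths):
python /home/BertForDeprel/BertForDeprel/run.py train --new_model_path /home/models/ --model_name my_parser --ftrain /home/parsing_project/conllus/train.conllu --gpu_ids 0,1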
You can find on this Gdrive repo all the pretrained models, Google Colab scripts for training and publicly available treebanks (.conllu files).
Among others, here are the most important pretrained models:
- English model trained from scratch on written English
- Naija model trained from scratch on spoken Naija
- Naija model fine-tuned on spoken Naija from a model pretrained on written English
- Add feats and gloss prediction
- Add lemma
- Add confidence threshold prediction (the model outputs nothing when the confidence is below a certain value)
- Add possibility of returning the confidence of the predictions (inside miscs)
- Support for active learning
- Tokenization
- Memory-efficient prediction service (only load one copy of XLM-Roberta for all languages)