BertForDeprel

Tutorial End-to-End

Google Colab notebooks showing how to use this parser are available here:

  • Naija spoken, training from a pre-trained English model : link
  • training from scratch on Naija spoken : link
  • training from scratch on written English : link
  • mock Colab for testing that everything works : link

Requirements

Python 3.11 installed on your machine, as well as Poetry (link):

# install latest poetry
curl -sSL https://install.python-poetry.org | python3 -

# check installation
poetry --version

# in case you have multiple versions of Python, make sure to specify version 3.11
poetry env use 3.11

Installation

On Linux

git clone https://github.com/kirianguiller/BertForDeprel
cd BertForDeprel
# optional: if you want your venv in the project folder as .venv (recommended)
poetry config virtualenvs.in-project true

poetry install
poetry run pytest

How to run

Train a model

Either provide the path to a model JSON config:

python /home/BertForDeprel/BertForDeprel/run.py train --conf /home/models/template.config.json   --ftrain /home/parsing_project/conllus/train.conllu

or just give the --new_model_path and --model_name parameters (default parameters will be loaded for anything not provided via the config or the CLI):

python /home/BertForDeprel/BertForDeprel/run.py train --new_model_path /home/models/ --model_name my_parser   --ftrain /home/parsing_project/conllus/train.conllu

PS: here is an example of a valid config.json:

{
    "new_model_path": "/home/user1/models/",
    "max_epoch": 150,
    "patience": 30,
    "batch_size": 16,
    "maxlen": 512,
    "embedding_type": "xlm-roberta-large",
    "adapter_config_type": ""
}
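
If you installed the project with Poetry as described above, the same commands can also be run inside the project virtual environment with poetry run. A minimal sketch, assuming the repository was cloned to /home/BertForDeprel as in the examples above:

cd /home/BertForDeprel
poetry run python BertForDeprel/run.py train --new_model_path /home/models/ --model_name my_parser   --ftrain /home/parsing_project/conllus/train.conllu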

Predicting on raw conllus

For predicting, you need to provide the --conf parameter, which is the path to the xxx.config.json file. You also need to provide the --inpath parameter, which is the path to a single conllu file or a folder containing multiple conllu files. The output folder parameter --outpath (or -o) is optional.

python /home/BertForDeprel/BertForDeprel/run.py predict --conf /home/models/my_parser.config.json   --inpath /home/parsing_project/to_predict/ --outpath /home/parsing_project/predicted/

Command line parameters

shared

  • --conf -c : path to the config JSON file (for training, it's optional if both --new_model_path and --model_name are provided)
  • --batch_size : number of samples per batch (high impact on total speed)
  • --num_workers : number of workers for dataset preparation (low impact on total speed)
  • --seed -s : random seed (default = 42)

The directory to store and load pretrained models is set via the environment variable TORCH_HOME.
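
For example, to cache the pretrained transformer weights under a custom directory before training (the path is illustrative):

export TORCH_HOME=/home/user1/torch_cache
python /home/BertForDeprel/BertForDeprel/run.py train --conf /home/models/template.config.json   --ftrain /home/parsing_project/conllus/train.conllu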

train

  • --new_model_path -f : path to the parent folder of the model (optional if --conf is already provided)
  • --embedding_type -e : type of embedding (default : xlm-roberta-large)
  • --max_epoch : maximum number of epochs (early stopping can shorten this number)
  • --patience : number of epochs without improvement before training stops (early stopping)
  • --ftrain : path to train file or folder (files need .conllu extension)
  • --ftest : path to test file or folder (files need .conllu extension) (not required; if not provided, see --split_ratio)
  • --split_ratio : ratio for splitting the ftrain dataset into train and test datasets (default : 0.8)
  • --pretrained_path : path to the config of a pretrained model, used for fine-tuning a pretrained BertForDeprel model
  • --overwrite_pretrain_classifiers : erase the pretrained classifier heads and recompute the annotation schema
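
A fuller training invocation combining several of these options might look like the following sketch (all paths and values are illustrative):

python /home/BertForDeprel/BertForDeprel/run.py train --new_model_path /home/models/ --model_name my_parser   --embedding_type xlm-roberta-large --max_epoch 150 --patience 30 --batch_size 16   --ftrain /home/parsing_project/conllus/train.conllu --ftest /home/parsing_project/conllus/test.conllu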

predict

  • --model_path -m : path to the model (folder or file)
  • --inpath -i : path to the file or the folder containing the files to predict
  • --outpath -o : path to the folder that will contain the predicted files
  • --suffix : optional (default = "") , suffix that will be added to the name of the predicted files (before the file extension)
  • --overwrite : whether or not to overwrite the predicted conllu files if they already exist
  • --write_preds_in_misc : whether or not to write the predictions in the conllu MISC column instead of in the corresponding columns for upos, deprel and head

keep (HEAD/UPOS/...) (optional)

Each of the following parameters is a string that can take the values "NONE" | "EXISTING" | "ALL" (default : "NONE") : --keep_heads ; --keep_upos ; --keep_xpos ; --keep_deprels ; --keep_misc ; --keep_feats ; --keep_deps ; --keep_morph ; --keep_lemmas
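
For example, a prediction run that keeps the existing UPOS annotations, adds a suffix to the output file names and overwrites previous outputs might look like this sketch (paths are illustrative and the exact flag forms may differ slightly):

python /home/BertForDeprel/BertForDeprel/run.py predict --model_path /home/models/my_parser/   --inpath /home/parsing_project/to_predict/ --outpath /home/parsing_project/predicted/   --suffix .parsed --overwrite --keep_upos EXISTING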

Prepare Dataset

You will need some conllu files for training the model and running inference.

data for training

For training, you have the choice between:

  • providing a single conllu file (--ftrain CLI parameter) with all your training and testing sentences (the train/test split ratio is 0.8 by default, but you can set it with the --split_ratio parameter)
  • providing a train conllu file (--ftrain) and a test conllu file (--ftest)
  • providing a train folder containing the .conllu files (--ftest can also be provided, as a file or a folder); see the example after this list
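
For example, to train from a folder of conllu files and let the parser hold out part of the data for testing (assuming, as the 0.8 default suggests, that the ratio denotes the train share; paths are illustrative):

python /home/BertForDeprel/BertForDeprel/run.py train --new_model_path /home/models/ --model_name my_parser   --ftrain /home/parsing_project/conllus/ --split_ratio 0.9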

data for inferences

For inference, you have to provide an input file or folder (--inpath or -i). The model will infer parse trees for all sentences in all conllu files, and the resulting conllu files will be written to the output folder (--outpath or -o).

annotation schema

For people who want to use the parser for language transfer (training on language A, then fine-tuning on language B), it is important to provide --path_folder_compute_annotation_schema with a folder that contains the gold conllu files of both language A and language B, so that the annotation schema (set of deprels, uposs, feats, lemma scripts, etc.) is precomputed before pretraining. The same annotation schema must be used for training, inference and fine-tuning.
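
For instance, the first (language A) training run of a transfer setup might precompute the schema over a folder holding the gold conllu files of both languages (a sketch; all paths are illustrative):

python /home/BertForDeprel/BertForDeprel/run.py train --new_model_path /home/models/ --model_name parser_langA   --ftrain /home/parsing_project/conllus/train.langA.conllu   --path_folder_compute_annotation_schema /home/parsing_project/conllus/

The fine-tuning run on language B then reuses this schema via --pretrained_path, as described below.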


Folder hierarchy example

Here is a folder structure example of how I am storing the different train/test/to_predicts/results conllus

|- [NAME_FOLDER]/
|   |- conllus/
|       | - <train.langA.conllu>
|       | - <test.langA.conllu>
|       | - <train.langB.conllu>
|       | - <test.langB.conllu>
|   |- to_predict/
|       | - <raw1.langB.conllu>
|       | - <raw2.langB.conllu>
|       | - <raw3.langB.conllu>
|   |- predicted/

where <train.conllu> and <test.conllu> are respectively the train and test datasets. They can have any name you want, as you will indicate the path to these files when running the script.

Finetuning a previously trained BertForDeprel model

WARNING: when training from a pretrained model, be sure to use the same annotation_schema.json for fine-tuning as the one that was used for pretraining. Otherwise, the training would break.

To fine-tune a pre-trained model, follow the same steps as for training a new model, but also provide the path to the config file of the previously trained model with --pretrained_path:

python /home/BertForDeprel/BertForDeprel/run.py train --new_model_path /home/models/ --model_name my_parser  --ftrain /home/parsing_project/conllus/train.conllu  --pretrained_path /home/models/pretrained_model.config.json

GPU/CPU training

  • --gpu_ids 0 : run the training on the single GPU with id 0 (likewise, --gpu_ids 3 for the single GPU with id 3)
  • --gpu_ids 0,1 : run the training on multiple GPUs, here the ones with ids 0 and 1
  • --gpu_ids "-2" : run the training on all available GPUs
  • --gpu_ids "-1" : run the training on CPU only

Pretrained Models (/!\ DEPRECATED /!\ TODO : update this section)

On this Gdrive repo you can find all the pretrained models, the Google Colab training scripts and publicly available treebanks (.conllu files).

Among others, here are the most important pretrained models :

Major TODOs

  • Add feats and gloss prediction
  • Add lemma
  • Add confidence threshold prediction (model outputting nothing when the confidence is below a certain value)
  • Add possibility of returning the confidence of the predictions (inside miscs)
  • Support for active learning
  • Tokenization
  • Memory-efficient prediction service (only load one copy of XLM-Roberta for all languages)

About

Framework for training dependency parsing models.
