A SyntaxDot model is trained using finetuning. Finetuning takes a pretrained (transformer-based) language model and refines its weights for syntax annotation.
To finetune a model, you need several ingredients:
- A pretrained language model, including its vocabulary and model configuration.
- Training and test data. The training data will be used to finetune the model's parameters. The test data is used to evaluate the model's performance during finetuning.
- A SyntaxDot configuration file.
SyntaxDot requires pretrained models in the libtorch OutputArchive format, with the expected parameter names. Python scripts to convert some commonly-used model types to the format expected by SyntaxDot are provided in the `../scripts` directory.
Alternatively, you can use an existing pretrained model that is already in the SyntaxDot format; the required parts for the well-performing multilingual XLM-RoBERTa base model are available for download.
The SyntaxDot configuration file contains the SyntaxDot-specific model settings in TOML format, such as the sequence labeling tasks and biaffine parsing. This is an example configuration file:
```toml
[input]
tokenizer = { xlm_roberta = { vocab = "xlm-roberta-base-sentencepiece.bpe.model" } }

[biaffine]
labels = "syntaxdot.biaffine"
head = { dims = 128, head_bias = true, dependent_bias = true }
relation = { dims = 128, head_bias = true, dependent_bias = true }

[labeler]
labels = "sticker.labels"
encoders = [
  { name = "pos-ud", encoder = { sequence = "upos" } },
  { name = "pos-extended", encoder = { sequence = "xpos" } },
  { name = "lemma", encoder = { lemma = "form" } },
  { name = "morph", encoder = { sequence = "feature_string" } },
]

[model]
pretrain_type = "xlm_roberta"
parameters = "epoch-74"
position_embeddings = "model"
pretrain_config = "xlm-roberta-base-config.json"
```
We will now discuss each section in more detail.
The `input` section describes the processing of the input tokens. Currently, this section only configures the tokenizer that should be used. For example,

```toml
[input]
tokenizer = { xlm_roberta = { vocab = "xlm-roberta-base-sentencepiece.bpe.model" } }
```

uses the XLM-RoBERTa tokenization model, which is stored in the file `xlm-roberta-base-sentencepiece.bpe.model`. The supported tokenizers are:

- `albert`: for ALBERT models
- `bert`: for BERT-based models
- `xlm_roberta`: for XLM-RoBERTa models
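If you are finetuning a BERT-based model instead, the `input` section selects the `bert` tokenizer. The snippet below is only a sketch: the vocabulary file name is hypothetical, and it assumes that the `bert` tokenizer is configured with a `vocab` key analogous to the `xlm_roberta` example above.

```toml
# Sketch only: the vocabulary file name is hypothetical, and the `vocab`
# key is assumed to mirror the xlm_roberta configuration shown above.
[input]
tokenizer = { bert = { vocab = "bert-base-cased-vocab.txt" } }
```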
The `biaffine` section configures a biaffine parser. If this section is absent, the model is a pure sequence labeling model and cannot perform dependency parsing. Here is an example `biaffine` section:

```toml
[biaffine]
labels = "syntaxdot.biaffine"
head = { dims = 128, head_bias = true, dependent_bias = true }
relation = { dims = 128, head_bias = true, dependent_bias = true }
```
The `labels` option sets the file name to which the dependency labels should be saved. This is only done once, during model preparation (see below).
The `head` and `relation` options configure the hidden representations of tokens as used for head prediction and relation prediction. In both cases, `dims` configures the dimensionality of the hidden representations. `head_bias` and `dependent_bias` configure whether a bias should be added to a token's representation as a head or as a dependent. Enabling these biases generally doesn't negatively affect performance, so it is safest to simply enable them.
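For illustration, this is what the same section would look like with both biases disabled; as noted above, keeping them enabled is usually the better choice:

```toml
# Sketch: the same biaffine section with the head/dependent biases disabled.
# Enabling the biases is generally the safer default.
[biaffine]
labels = "syntaxdot.biaffine"
head = { dims = 128, head_bias = false, dependent_bias = false }
relation = { dims = 128, head_bias = false, dependent_bias = false }
```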
The `labeler` section configures one or more sequence labelers. For example:
```toml
[labeler]
labels = "sticker.labels"
encoders = [
  { name = "pos-ud", encoder = { sequence = "upos" } },
  { name = "pos-extended", encoder = { sequence = "xpos" } },
  { name = "lemma", encoder = { lemma = "form" } },
  { name = "morph", encoder = { sequence = "feature_string" } },
]
```
The `labels` option specifies the file in which all labels for sequence annotation should be stored. `encoders` is a list that configures the different sequence annotation layers. Each layer should have a unique `name` and specify an encoder. We will now discuss the encoder types in more detail.
The sequence encoder is the simplest form of encoder: it simply takes values from a CoNLL-U layer. Basic sequence encoders have the following form:

```toml
{ name = "NAME", encoder = { sequence = "LAYER" } },
```
where `LAYER` is one of:

- `upos`: universal part-of-speech tag
- `xpos`: language-specific part-of-speech tag
- `feature_string`: the features field as a string

Note that `feature_string` is invariant to the feature ordering, since features are internally sorted in lexicographical order.
It is also possible to create annotation layers based on specific features or miscellaneous features. For example, the following sequence encoder uses the `TopoField` feature of the miscellaneous features, setting the default value to `null` if the feature is absent:

```toml
{ name = "tf", encoder = { sequence = { misc = { feature = "TopoField", default = "null" } } } }
```
Similarly, we could create an annotation layer using the `Number` morphological feature:

```toml
{ name = "number", encoder = { sequence = { feature = { feature = "Number", default = "null" } } } }
```
The lemma encoder transparently encodes lemmas as edit trees. The configuration of this encoder is simply as follows:

```toml
{ name = "lemma", encoder = { lemma = "form" } },
```
The only configuration is the back-off strategy, which is used when the predicted edit tree cannot be applied. Here the back-off strategy is to use the form as the lemma. Another possible value is `nothing`, which will not update the lemma at all in such a case.
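For example, a lemma encoder that uses the `nothing` back-off strategy, and thus leaves the lemma untouched when the predicted edit tree cannot be applied, would look as follows:

```toml
# Lemma encoder with the `nothing` back-off strategy: the lemma is left
# unchanged when the predicted edit tree cannot be applied.
{ name = "lemma", encoder = { lemma = "nothing" } },
```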
The final section of the SyntaxDot configuration file specifies the model. For example:
```toml
[model]
pretrain_type = "xlm_roberta"
parameters = "epoch-74"
position_embeddings = "model"
pretrain_config = "xlm-roberta-base-config.json"
```
`pretrain_type` is the type of pretrained model that is used. This is one of:

- `albert`
- `bert`
- `squeeze_albert`
- `squeeze_bert`
- `xlm_roberta`
The `parameters` option contains the file with model parameters. This can be any value during finetuning, but after finetuning it should be set to the parameter file of the best epoch.
The `position_embeddings` option configures the type of position embeddings that should be used. For finetuning existing models, this should always be set to `model`, which will use the position embeddings of the pretrained model.
Finally, `pretrain_config` is the filename of the pretrained model configuration.
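As a sketch, a `model` section for a BERT-based model could look as follows; the parameter and configuration file names are hypothetical placeholders:

```toml
# Sketch: model section for a BERT-based pretrained model. The file names
# are hypothetical; pretrain_config points at the pretrained model's JSON
# configuration, just like in the XLM-RoBERTa example above.
[model]
pretrain_type = "bert"
parameters = "epoch-74"
position_embeddings = "model"
pretrain_config = "bert-base-cased-config.json"
```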
With the configuration set up, a model can be trained. The first step is to prepare the label files, which is done with the `syntaxdot prepare` subcommand:
```bash
$ syntaxdot prepare syntaxdot.conf train.conllu
```
Then you can start finetuning the pretrained model. For example, to finetune a syntax model with XLM-RoBERTa base:
```bash
$ syntaxdot finetune \
    syntaxdot.conf \
    xlm-roberta-base.pt \
    train.conllu \
    test.conllu \
    --gpu 0 \
    --label-smoothing 0.03 \
    --maxlen 100 \
    --mixed-precision \
    --warmup 10000
```
SyntaxDot has more command-line options to configure finetuning, but this is the recommended set of options:

- `--gpu 0`: enables GPU training using GPU 0.
- `--label-smoothing 0.03`: enables label smoothing, distributing 0.03 of the 'probability mass' to incorrect labels. This is a form of regularization and generally results in better models.
- `--maxlen 100`: excludes sentences of more than 100 pieces. This can be used in conjunction with the `batch-size` option to control the amount of GPU memory needed during training.
- `--mixed-precision`: use mixed precision. This lowers GPU memory use and speeds up training considerably, but is only supported on NVIDIA GPUs with Tensor Cores (Volta/Turing GPUs or later generations).
- `--warmup 10000`: use 10,000 steps of learning rate warmup. This prevents early weight update steps from being too large.
After finetuning is done, SyntaxDot will report the best epoch. Don't forget to update the `parameters` option in your SyntaxDot configuration to use the parameters from the best epoch!
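For example, if the best epoch were reported as epoch 12 (a hypothetical number), the `model` section would be updated as follows:

```toml
# Sketch: point `parameters` at the parameter file of the best epoch.
# The epoch number is hypothetical.
[model]
pretrain_type = "xlm_roberta"
parameters = "epoch-12"
position_embeddings = "model"
pretrain_config = "xlm-roberta-base-config.json"
```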