Incorrect tagging by a trained model for Tibetan #13549
I tried to train a tagger for Tibetan. However, the result is not satisfactory. What is particularly striking is that the genitive, which is consistently tagged as ADP in the training dataset, is wrongly tagged as NOUN, AUX, etc., by the generated model. I hope the training (train.spacy: 10.3 MB) and validation (dev.spacy: 2.7 MB) datasets are large enough. So, I suspect that the cause of the incorrect tagging lies in the configuration. The following is the configuration file, which has not been processed by spacy init fill-config.
And the following is one of the logs.
I tried training with different learning rates (0.001, 0.005, 0.0005), but none of them improved the results.
Replies: 1 comment 2 replies
I just ran the "debug data" command and found that there are many misaligned tokens in both the training and validation datasets. Could this be related to the incorrect tagging?
I have finally identified the cause of the poor tagging by testing with other languages: the configuration file lists the pipeline as ["tok2vec", "tagger"], but it should be ["tok2vec", "morphologizer"]. The "tagger" component trains a model for XPOS, i.e., language-specific part-of-speech tags, while the "morphologizer" component is used for UPOS, i.e., universal part-of-speech tags.
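For anyone hitting the same thing, here is a rough sketch of how one might check which tag column a CoNLL-U file actually fills before choosing between the two components. It uses only the Python standard library and assumes the usual 10-column CoNLL-U layout (UPOS in column 4, XPOS in column 5). The function name `column_usage` and the three-token Tibetan sample are made up for illustration and are not taken from my dataset.

```python
from collections import Counter


def column_usage(conllu_text):
    """Count tokens whose UPOS (column 4) and XPOS (column 5) are filled in.

    Assumes tab-separated CoNLL-U with 10 columns per token line.
    """
    counts = Counter()
    for line in conllu_text.splitlines():
        # Skip blank sentence separators and "# ..." comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            counts["malformed"] += 1
            continue
        # Skip multiword token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        counts["tokens"] += 1
        if cols[3] != "_":
            counts["upos"] += 1
        if cols[4] != "_":
            counts["xpos"] += 1
    return counts


# Hypothetical sample: the genitive particle is annotated as ADP in UPOS,
# and the XPOS column is left empty ("_").
SAMPLE = "\n".join([
    "# text = བོད་ཀྱི་སྐད་",
    "1\tབོད་\tབོད་\tPROPN\t_\t_\t_\t_\t_\t_",
    "2\tཀྱི་\tཀྱི་\tADP\t_\t_\t_\t_\t_\t_",
    "3\tསྐད་\tསྐད་\tNOUN\t_\t_\t_\t_\t_\t_",
    "",
])

usage = column_usage(SAMPLE)
print(dict(usage))
```

If the count for `upos` is close to the token count and `xpos` is near zero, the annotation lives in the UPOS column, which by the distinction above points to "morphologizer" rather than "tagger".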
This is the simplest explanation for the issue, but there is another problem in our training dataset: the absence of MISC, the last column in the CoNLL-U file. I discovered this by modifying CoNLL-U files and training German and Chinese POS taggers from scratch: