Incorrect tagging by a trained model for Tibetan #13549

ykyogoku · 2024-06-28T09:11:08Z

I tried to train a tagger for Tibetan, but the results are unsatisfactory. Most strikingly, the genitive, which is consistently tagged as ADP in the training dataset, is wrongly tagged as NOUN, AUX, etc. by the trained model. I believe the training (train.spacy: 10.3 MB) and validation (dev.spacy: 2.7 MB) datasets are large enough, so I suspect that the cause of the incorrect tagging lies in the configuration. Below is the configuration file, which has not been processed by spacy init fill-config.

# This is an auto-generated partial config. To use it with 'spacy train'
# you can run spacy init fill-config to auto-fill all default settings:
# python -m spacy init fill-config ./base_config.cfg ./config.cfg
[paths]
train = corpus/train/train.spacy
dev = corpus/dev/dev.spacy
vectors = null

[system]
gpu_allocator = null

[nlp]
lang = "xx"
pipeline = ["tok2vec","tagger"]
batch_size = 1000

[nlp.tokenizer]
@tokenizers = "botok_tokenizer"

[components]

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE"]
rows = [5000, 1000, 2500, 2500]
include_static_vectors = false

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[components.tagger]
factory = "tagger"
label_smoothing = 0.05

[components.tagger.model]
@architectures = "spacy.Tagger.v2"
nO = null

[components.tagger.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}

[corpora]

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"

[training.optimizer]
@optimizers = "Adam.v1"
learn_rate = 0.0005

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001

[initialize]
vectors = ${paths.vectors}

The following is the log from one of the training runs.

python3 -m spacy train config/tibetan.cfg --output ./models --code src/functions.py
ℹ Saving to output directory: models
ℹ Using CPU

=========================== Initializing pipeline ===========================
Loading Trie... (4s.)
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'tagger']
ℹ Initial learn rate: 0.0005
E    #       LOSS TOK2VEC  LOSS TAGGER  TAG_ACC  SCORE 
---  ------  ------------  -----------  -------  ------
  0       0          0.00       102.95    33.25    0.33
  0     200        170.77     11445.30    79.19    0.79
  0     400        266.82      6373.48    84.11    0.84
  0     600        279.83      5712.99    85.63    0.86
  0     800        273.38      5285.70    86.62    0.87
  0    1000        284.61      5471.45    87.09    0.87
  0    1200        344.62      6753.86    87.49    0.87
  0    1400        433.04      8474.99    88.06    0.88
  0    1600        523.86     10420.14    88.40    0.88
  0    1800        631.81     12536.41    88.84    0.89
  0    2000        765.19     15424.33    89.14    0.89
  0    2200        923.66     19017.07    89.46    0.89
  0    2400       1093.63     22641.52    89.59    0.90
  0    2600       1167.86     24747.30    89.79    0.90
  0    2800       1148.50     24550.15    89.91    0.90
  1    3000       1121.65     24103.53    89.98    0.90
  1    3200       1092.52     22605.25    90.07    0.90
  1    3400       1099.02     22797.08    90.14    0.90
  1    3600       1099.56     23025.98    90.20    0.90
  1    3800       1064.88     22315.77    90.27    0.90
  1    4000       1054.56     22230.55    90.37    0.90
  1    4200       1061.39     22536.55    90.38    0.90
  1    4400       1062.39     22529.06    90.46    0.90
  2    4600       1021.08     20939.97    90.45    0.90
  2    4800       1048.66     21343.97    90.46    0.90
  2    5000       1068.01     21516.02    90.56    0.91
  2    5200       1051.87     21306.10    90.63    0.91
  2    5400       1075.92     21696.71    90.56    0.91
  2    5600       1061.37     21204.38    90.57    0.91
  2    5800       1045.43     21145.01    90.70    0.91
  3    6000       1030.56     20449.69    90.63    0.91
  3    6200       1052.11     20314.25    90.62    0.91
  3    6400       1047.11     20332.95    90.72    0.91
  3    6600       1061.07     20476.86    90.73    0.91
  3    6800       1072.91     20605.72    90.78    0.91
  3    7000       1069.63     20384.29    90.84    0.91
  3    7200       1068.42     20695.01    90.79    0.91
  4    7400       1039.46     19768.80    90.88    0.91
  4    7600       1050.79     19452.34    90.85    0.91
  4    7800       1068.46     19846.84    90.86    0.91
  4    8000       1061.76     19676.06    90.90    0.91
  4    8200       1082.13     19781.68    90.91    0.91
  4    8400       1092.04     20048.89    90.95    0.91
  4    8600       1085.01     20064.83    91.02    0.91
  5    8800       1061.45     19393.75    91.00    0.91
  5    9000       1065.15     18834.86    90.97    0.91
  5    9200       1070.84     18838.53    91.05    0.91
  5    9400       1089.75     19167.94    91.01    0.91
  5    9600       1116.44     19478.66    91.00    0.91
  5    9800       1106.50     19531.91    91.01    0.91
  5   10000       1099.75     19270.49    91.09    0.91
  6   10200       1110.22     19226.36    90.98    0.91
  6   10400       1093.48     18425.63    91.06    0.91
  6   10600       1135.64     18821.21    91.13    0.91
  6   10800       1112.82     18479.51    91.09    0.91
  6   11000       1149.21     19211.55    91.14    0.91
  6   11200       1131.47     18814.61    91.15    0.91
  6   11400       1131.53     18795.49    91.13    0.91
  7   11600       1131.57     18768.56    91.14    0.91
  7   11800       1120.43     17922.86    91.12    0.91
  7   12000       1137.09     17961.34    91.09    0.91
  7   12200       1160.18     18361.98    91.13    0.91
  7   12400       1172.90     18470.55    91.11    0.91
  7   12600       1184.59     18664.04    91.24    0.91
  7   12800       1172.41     18509.48    91.13    0.91
  8   13000       1164.20     18503.98    91.20    0.91
  8   13200       1163.64     17842.08    91.24    0.91
  8   13400       1164.33     17608.08    91.15    0.91
  8   13600       1187.11     17784.38    91.17    0.91
  8   13800       1216.53     18430.87    91.23    0.91
  8   14000       1214.77     18216.79    91.25    0.91
  8   14200       1197.03     18112.15    91.21    0.91
  8   14400       1209.07     18342.25    91.28    0.91
  9   14600       1172.99     17379.30    91.19    0.91
  9   14800       1178.92     17232.48    91.23    0.91
  9   15000       1205.65     17520.68    91.22    0.91
  9   15200       1221.52     17802.52    91.24    0.91
  9   15400       1228.07     18036.40    91.23    0.91
  9   15600       1223.70     17814.32    91.22    0.91
  9   15800       1212.91     17788.70    91.28    0.91
 10   16000       1190.99     17192.02    91.31    0.91
 10   16200       1209.97     16846.41    91.29    0.91
 10   16400       1249.26     17332.27    91.26    0.91
 10   16600       1238.09     17255.63    91.25    0.91
 10   16800       1257.71     17688.87    91.37    0.91
 10   17000       1250.54     17547.91    91.26    0.91
 10   17200       1253.67     17639.42    91.37    0.91
 11   17400       1223.65     17068.12    91.31    0.91
 11   17600       1231.69     16638.83    91.32    0.91
 11   17800       1277.55     17216.81    91.35    0.91
 11   18000       1281.86     17243.06    91.35    0.91
 11   18200       1279.03     17109.84    91.35    0.91
 11   18400       1309.51     17555.12    91.20    0.91
 11   18600       1278.97     17194.74    91.38    0.91
 12   18800       1256.72     16805.98    91.33    0.91
 12   19000       1260.25     16379.69    91.30    0.91
 12   19200       1295.40     16850.36    91.37    0.91
 12   19400       1300.79     16892.68    91.36    0.91
 12   19600       1312.61     17033.50    91.33    0.91
 12   19800       1313.49     17035.39    91.39    0.91
 12   20000       1301.66     16929.55    91.37    0.91
✔ Saved pipeline to output directory
models/model-last

I trained models with different learning rates (0.001, 0.005, 0.0005), but none of them improved the results.
Could you tell me what I can change to improve the tagging?

ykyogoku · 2024-06-29T10:11:31Z

ykyogoku
Jun 29, 2024
Author

I just ran the "debug data" command and found that there are many misaligned tokens in both the training and validation datasets. Could this be related to the incorrect tagging?

1329598 total word(s) in the data (23951 unique)
7918 misaligned tokens in the training data
2161 misaligned tokens in the dev data
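
As a rough illustration of what "misaligned" means here, the sketch below compares gold-standard tokens with a tokenizer's output over the same text and counts the tokenizer tokens whose character spans do not coincide with any gold token. It is a simplification under stated assumptions, not spaCy's actual alignment code; `char_spans`, `count_misaligned`, and the Latin placeholder tokens are hypothetical names and data chosen for illustration.

```python
def char_spans(tokens):
    """Return (start, end) character offsets of each token in the joined text."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def count_misaligned(gold_tokens, predicted_tokens):
    """Count predicted tokens whose span matches no gold token span exactly."""
    # Both sequences must cover the same text once whitespace is ignored.
    if "".join(gold_tokens) != "".join(predicted_tokens):
        raise ValueError("token sequences cover different text")
    gold = set(char_spans(gold_tokens))
    return sum(span not in gold for span in char_spans(predicted_tokens))


# Placeholder data: the tokenizer splits "abc" differently from the annotation.
gold = ["ab", "c", "de"]
predicted = ["a", "bc", "de"]
n_misaligned = count_misaligned(gold, predicted)
```

If tokens that do not line up with a gold token cannot take over that token's tag during training, a high misalignment count could plausibly contribute to poor tagging.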

2 replies

ykyogoku Aug 6, 2024
Author

I have finally identified the cause of the poor tagging through testing with other languages: the configuration file incorrectly lists the pipeline as ["tok2vec", "tagger"]. It should be set to ["tok2vec", "morphologizer"]. The "tagger" option is used to train a model for XPOS, i.e., language-specific part-of-speech tags, while the "morphologizer" is used for UPOS, i.e., universal part-of-speech tags.

This is the simplest explanation for the issue, but there is another problem in our training dataset: the absence of MISC, the last column of the CoNLL-U file. I discovered this by modifying CoNLL-U files and training German and Chinese POS taggers from scratch:

ID, FORM, LEMMA and UPOS: This is the same format as our Tibetan training dataset. The German model produces almost perfect results, while the Chinese model performs as poorly as our Tibetan model. Note that German does not require a custom tokenizer, whereas Chinese does because it has no spaces between words. (Training also worked for Korean, which has spaces between words, but not for Japanese or Chinese.) For Chinese, I used MicroTokenizer. (I initially wanted to test with Japanese, but none of the Japanese tokenizers worked with spaCy for some reason, so I switched to Chinese.)
ID, FORM, LEMMA, UPOS and MISC: Both the German and Chinese models produce good results.
ID, FORM, LEMMA, UPOS and XPOS: The German model performs well, while the Chinese model performs as poorly as the first Chinese model and our Tibetan model, as though the labels were randomly assigned.

It seems that a German POS tagger can be trained with just ID, FORM, LEMMA and UPOS, while training a Chinese POS tagger also requires MISC. I am not sure why only Chinese needs MISC; it may be because I integrated a third-party tokenizer for Chinese, just as I did for Tibetan. So, to train a Tibetan POS tagger specifically for UPOS, we need a training dataset that includes the MISC annotation.
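
To check which of the column combinations above a given file corresponds to, one could count the non-empty values per column. The following is a minimal sketch, assuming standard CoNLL-U token lines with ten tab-separated columns and `_` for empty values; `populated_columns` is a hypothetical helper, not part of spaCy.

```python
from collections import Counter

# The ten CoNLL-U columns, in file order.
CONLLU_COLUMNS = (
    "ID", "FORM", "LEMMA", "UPOS", "XPOS",
    "FEATS", "HEAD", "DEPREL", "DEPS", "MISC",
)


def populated_columns(conllu_text):
    """Count, per column, how many token lines carry a value other than '_'."""
    counts = Counter()
    for line in conllu_text.splitlines():
        # Skip sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_COLUMNS):
            continue  # not a well-formed token line
        for name, value in zip(CONLLU_COLUMNS, cols):
            if value != "_":
                counts[name] += 1
    return counts


# Placeholder sentence with only ID, FORM, LEMMA and UPOS filled in.
sample = (
    "# text = abc\n"
    "1\tab\tab\tNOUN\t_\t_\t_\t_\t_\t_\n"
    "2\tc\tc\tADP\t_\t_\t_\t_\t_\t_\n"
)
column_counts = populated_columns(sample)
```

A file for which `column_counts["MISC"]` is 0 is in the first format in the list above.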

Answer selected by ykyogoku

ykyogoku Aug 6, 2024
Author

I added SpaceAfter=No to the MISC column in the training and validation datasets, and the tagging finally worked!
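
A minimal sketch of such a change, assuming each token line has the ten tab-separated CoNLL-U columns and that no token in the corpus is followed by a space. `add_space_after_no` is a hypothetical helper written for illustration, not part of spaCy or of any CoNLL-U library.

```python
def add_space_after_no(conllu_text):
    """Add SpaceAfter=No to the MISC column of every ordinary token line."""
    out = []
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Leave comments, blank lines and malformed lines untouched.
        if line.startswith("#") or len(cols) != 10:
            out.append(line)
            continue
        # Leave multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1) untouched.
        if "-" in cols[0] or "." in cols[0]:
            out.append(line)
            continue
        misc = cols[9]
        if misc == "_":
            cols[9] = "SpaceAfter=No"
        elif "SpaceAfter=No" not in misc.split("|"):
            cols[9] = misc + "|SpaceAfter=No"
        out.append("\t".join(cols))
    # splitlines() drops the final newline, so restore it if it was present.
    return "\n".join(out) + ("\n" if conllu_text.endswith("\n") else "")


# Placeholder sentence: one token with an empty MISC, one with an existing value.
sample = (
    "# text = abc\n"
    "1\tab\tab\tNOUN\t_\t_\t_\t_\t_\t_\n"
    "2\tc\tc\tADP\t_\t_\t_\t_\t_\tTranslit=x\n"
    "\n"
)
converted = add_space_after_no(sample)
```

The function is idempotent, so running it twice on the same file does not duplicate the attribute. The train.spacy and dev.spacy files have to be regenerated from the modified CoNLL-U files before retraining.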
