Vocabulary construction #887

ZJaume · 2024-10-21T10:55:45Z

I was going to comment on #745, but I think this turns into a more general discussion about vocabulary building, although I'm not sure whether it counts as a meta issue.

Character coverage

I don't think there is a need to force 100% character coverage when training SentencePiece.
https://github.com/mozilla/firefox-translations-training/blob/2027f4e99b78d45ce73e44ed454c8527e03718f7/pipeline/train/spm-vocab.sh#L86
In fact, when byte fallback is enabled, the default character coverage (0.9995) should be better, because it increases the number of training instances that use the byte fallback tokens. That decreases the chance of a byte fallback token being poorly trained, and of the model hallucinating when that token appears in the input during inference. Also related to coverage is the size of the training data for the vocabulary. It doesn't need to be very large to cover most of the MT model training data. I think it only needs to be a representative sample.

So, for character coverage I think the default option is enough, and for the training size a random sample of 1 or 2 million sentence pairs should be enough. That way we increase the chances that the byte fallback tokens are well trained.

I think this applies to all languages, including CJK.
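A minimal sketch of how such a random sample could be drawn in one pass, assuming the parallel corpus is available as an iterable of (source, target) pairs. The function names and the seed are illustrative and not part of the pipeline.

```python
import random
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]


def sample_pairs(pairs: Iterable[Pair], k: int, seed: int = 0) -> List[Pair]:
    """Reservoir-sample up to k sentence pairs without loading the corpus."""
    rng = random.Random(seed)
    reservoir: List[Pair] = []
    for i, pair in enumerate(pairs):
        if i < k:
            # Fill the reservoir with the first k pairs.
            reservoir.append(pair)
        else:
            # Keep the new pair with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = pair
    return reservoir


def to_training_lines(sample: Iterable[Pair]) -> List[str]:
    """Flatten pairs into one list of lines for a shared vocabulary."""
    lines: List[str] = []
    for source, target in sample:
        lines.append(source)
        lines.append(target)
    return lines
```

Sampling pairs rather than individual lines keeps both languages equally represented in the vocabulary training file.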

Numbers

I would recommend using the split_digits option to free up all those vocabulary slots occupied by numbers that may only be common in the training set. I've been using this lately with good results.
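To get a rough idea of how many slots are at stake, one could count the multi-digit pieces in an existing vocabulary. This sketch assumes the usual SentencePiece .vocab layout of one piece per line followed by a tab and a score. The function name is hypothetical.

```python
from typing import Iterable

WORD_BOUNDARY = "\u2581"  # the "▁" marker SentencePiece puts before word-initial pieces


def count_multi_digit_pieces(vocab_lines: Iterable[str]) -> int:
    """Count pieces that are made only of ASCII digits and are longer than one digit."""
    count = 0
    for line in vocab_lines:
        piece = line.rstrip("\n").split("\t")[0]
        core = piece.lstrip(WORD_BOUNDARY)
        if len(core) > 1 and core.isascii() and core.isdigit():
            count += 1
    return count
```

With split_digits enabled, pieces of this kind should no longer be learned, because every digit becomes its own piece.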

User-defined tokens

Misc

It might be useful to add a few more auxiliary user-defined tokens like __misc1__, __misc2__, etc., in case we later need to implement new logic that requires special tokens. Then there's no need to retrain the whole model: the auxiliary tokens can be introduced through fine-tuning.

Sentence/paragraph separator

Also add a token like __sep__ or something similar, so that if we want to explore paragraph-level or document-level translation in the future, we can encode the newlines.
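A small sketch of how the reserved tokens could be generated and handed to spm_train through its user_defined_symbols option. The number of __misc__ slots is an arbitrary assumption, and the helper names are hypothetical.

```python
from typing import List, Sequence


def auxiliary_symbols(n_misc: int = 5, extra: Sequence[str] = ("__sep__",)) -> List[str]:
    """Build the list of reserved tokens: __misc1__ .. __miscN__ plus extras."""
    misc = [f"__misc{i}__" for i in range(1, n_misc + 1)]
    return misc + list(extra)


def as_flag(symbols: Sequence[str]) -> str:
    """Format the symbols as a comma-separated spm_train flag."""
    return "--user_defined_symbols=" + ",".join(symbols)
```

Each reserved token takes one vocabulary slot, so the count is a trade-off between future flexibility and vocabulary space.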

Backtranslation tagging

I've been using a BT special token to tag backtranslated data, but I have my doubts about whether this is useful.
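For reference, tagging could look roughly like this. The token name __bt__ and its position at the start of the source sentence are assumptions, since the comment does not say how the tag is applied.

```python
from typing import Iterable, Iterator, Tuple

BT_TOKEN = "__bt__"  # hypothetical name; it would also need to be a user-defined symbol

Pair = Tuple[str, str]


def tag_backtranslated(source: str, token: str = BT_TOKEN) -> str:
    """Prepend the tag to a synthetic (back-translated) source sentence."""
    return f"{token} {source}"


def mix_corpora(authentic: Iterable[Pair], backtranslated: Iterable[Pair]) -> Iterator[Pair]:
    """Yield (source, target) pairs, tagging only the back-translated ones."""
    for source, target in authentic:
        yield source, target
    for source, target in backtranslated:
        yield tag_backtranslated(source), target
```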

Split vocabularies

Languages that do not share a script (or even languages with the same script that are very distant) will benefit from separate vocabularies. Maybe using 64k, as mentioned in #747, has the same effect, but I have not experimented with that.
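One rough way to check whether a language pair shares a script is to measure how much the two sides overlap in alphabetic characters. This is only an illustrative heuristic, not something the pipeline does, and where to put the threshold for splitting vocabularies is left open.

```python
from typing import Iterable, Set


def alphabetic_chars(sentences: Iterable[str]) -> Set[str]:
    """Collect the lowercased alphabetic characters seen in the sentences."""
    return {ch.lower() for sentence in sentences for ch in sentence if ch.isalpha()}


def char_overlap(src_sentences: Iterable[str], trg_sentences: Iterable[str]) -> float:
    """Jaccard overlap between the alphabetic character sets of both sides."""
    src_chars = alphabetic_chars(src_sentences)
    trg_chars = alphabetic_chars(trg_sentences)
    if not src_chars or not trg_chars:
        return 0.0
    return len(src_chars & trg_chars) / len(src_chars | trg_chars)
```

A value near zero means the two sides hardly share any characters. The measure says nothing about how distant two languages with the same script are.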

NFKC Normalization

One thing that has been annoying me a lot, especially when dealing with technical in-domain data, is superscripts and subscripts being normalized:

...
2074	34	# ⁴ => 4
2075	35	# ⁵ => 5
2076	36	# ⁶ => 6
2077	37	# ⁷ => 7
2078	38	# ⁸ => 8
2079	39	# ⁹ => 9
207A	2B	# ⁺ => +
207B	2212	# ⁻ => −
207C	3D	# ⁼ => =
207C 338	2260	# ⁼̸ => ≠
207D	28	# ⁽ => (
207E	29	# ⁾ => )
207F	6E	# ⁿ => n
...

So I have a custom normalization file, built from the original, that omits these mappings.
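A sketch of how such a file could be derived, assuming the rule format shown in the excerpt: source code points in hex, a tab, target code points, a tab, and a comment. The function names are hypothetical, and the exact set of rules removed in the custom file mentioned above is not known. Here a rule is dropped when its source contains a code point from the Superscripts and Subscripts block (U+2070 to U+209F) or one of the Latin-1 superscripts ¹, ² and ³.

```python
from typing import Iterable, List

LATIN1_SUPERSCRIPTS = (0xB2, 0xB3, 0xB9)  # ², ³, ¹


def is_super_or_subscript(code_point: int) -> bool:
    """True for U+2070..U+209F and the three Latin-1 superscript digits."""
    return 0x2070 <= code_point <= 0x209F or code_point in LATIN1_SUPERSCRIPTS


def drop_script_rules(rule_lines: Iterable[str]) -> List[str]:
    """Keep every rule except those whose source has a super/subscript."""
    kept: List[str] = []
    for line in rule_lines:
        source_field = line.split("\t", 1)[0]
        try:
            source = [int(token, 16) for token in source_field.split()]
        except ValueError:
            # Not a rule line (for example a comment); keep it unchanged.
            kept.append(line)
            continue
        if any(is_super_or_subscript(cp) for cp in source):
            continue
        kept.append(line)
    return kept
```

The filtered rules can then be written to a TSV file and passed to spm_train with the normalization_rule_tsv option.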

gregtatum · 2024-10-21T13:30:51Z

Sentence/paragraph separator

This could be really nice when working on the inference engine. @nordzilla and I were looking at the strategy for how we chunk up a page for translation, and I think we would benefit from sending in larger chunks of text for translation at the same time so that they have more context on what's happening on a page. After the translation you would need to retain these separators to reconstruct the DOM.
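A minimal sketch of that round trip, assuming the model has been trained to copy a __sep__ token through to its output. The helper names are hypothetical, and the fallback when separators go missing is left to the caller.

```python
from typing import List, Optional, Sequence

SEP = "__sep__"  # assumed separator token, as proposed above


def join_chunks(chunks: Sequence[str], sep: str = SEP) -> str:
    """Join page chunks into one input so the model sees more context."""
    return f" {sep} ".join(chunk.strip() for chunk in chunks)


def split_chunks(translated: str, sep: str = SEP) -> List[str]:
    """Split a translated string back into chunks on the separator."""
    return [part.strip() for part in translated.split(sep)]


def realign(chunks: Sequence[str], translated: str, sep: str = SEP) -> Optional[List[str]]:
    """Return one translation per input chunk, or None if the counts differ."""
    parts = split_chunks(translated, sep)
    if len(parts) != len(chunks):
        # A separator was lost or duplicated, so chunks cannot be mapped back.
        return None
    return parts
```

Returning None instead of guessing lets the caller translate the chunks one by one when the separators do not survive translation.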

eu9ene · 2024-10-21T17:29:23Z

Split vocabularies

Quoting ZJaume: "Languages that do not share a script (or even languages with the same script that are very distant) will benefit from separate vocabularies. Maybe using 64k, as mentioned in #747, has the same effect, but I have not experimented with that."

We've been struggling with Baltic and Slavic languages. I wonder whether using a shared vocab for the languages in Cyrillic is at play here.

ZJaume · 2024-10-31T10:33:07Z

Most LLM vocabularies use BPE, and I remember that back when SentencePiece was becoming established, some papers argued that SP was worse than BPE for NMT. So an experiment with spm_train --model_type bpe is probably worth considering.
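Such an experiment could be set up by building two spm_train invocations that differ only in the model type. This is a sketch: the corpus path, model prefixes and vocabulary size are placeholders, and the other flags simply mirror the options discussed earlier in this thread.

```python
from typing import List


def spm_train_args(corpus: str, prefix: str, model_type: str, vocab_size: int = 32000) -> List[str]:
    """Build an spm_train command line for one vocabulary variant."""
    return [
        "spm_train",
        f"--input={corpus}",
        f"--model_prefix={prefix}",
        f"--vocab_size={vocab_size}",
        f"--model_type={model_type}",
        "--byte_fallback=true",
        "--split_digits=true",
    ]


# Two variants over the same sample; only --model_type and the prefix differ.
variants = {
    model_type: spm_train_args("sample.txt", f"vocab.{model_type}", model_type)
    for model_type in ("unigram", "bpe")
}
```

Each argument list can be passed to subprocess.run once spm_train is installed. Keeping everything else fixed makes any difference in translation quality easier to attribute to the segmentation algorithm.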

ZJaume mentioned this issue Oct 21, 2024

Currency translation for English to German is incorrect #870

Open

eu9ene added the quality Improving robustness and translation quality label Oct 21, 2024
