Vocabulary construction #887

Open
ZJaume opened this issue Oct 21, 2024 · 3 comments
Labels
quality Improving robustness and translation quality

Comments

@ZJaume
Collaborator

ZJaume commented Oct 21, 2024

I was going to comment on #745, but I think this translates into a more general discussion about vocabulary building, although I don't know whether this would be considered a meta issue.

Character coverage

I don't think there is a need to force 100% character coverage when training SentencePiece.
https://github.com/mozilla/firefox-translations-training/blob/2027f4e99b78d45ce73e44ed454c8527e03718f7/pipeline/train/spm-vocab.sh#L86
In fact, when byte fallback is enabled, the default character coverage should work better because it increases the number of training instances that use the byte fallback tokens. That lowers the chance of a byte fallback token being poorly trained, and of the model hallucinating when that token shows up in the input during inference. Also related to coverage is the size of the training data for the vocabulary. It doesn't need to be very large to cover most of the MT model training data; I think it only needs to be a representative sample.

So, for character coverage I think it is enough to use the default option, and for the training size, 1 or 2 million sentence pairs (a random sample) should be enough. That way we increase the chances of getting well-trained byte fallback tokens.

I think this applies to all languages, including CJK.
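
A minimal sketch of the setup described here, assuming the vocabulary is trained with the `spm_train` CLI. It draws a uniform random sample from a corpus of unknown length and builds a command that enables `--byte_fallback` while leaving `--character_coverage` unset, so SentencePiece falls back to its default of 0.9995. The file names, the vocabulary size, and the toy sample size are placeholders, not values from the pipeline.

```python
import random
import shlex


def reservoir_sample(lines, k, seed=0):
    """Return k items drawn uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            sample.append(line)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = line
    return sample


def spm_train_command(input_path, model_prefix, vocab_size=32000):
    """Build an spm_train command that relies on the default character coverage."""
    args = [
        "spm_train",
        f"--input={input_path}",
        f"--model_prefix={model_prefix}",
        f"--vocab_size={vocab_size}",  # placeholder value
        "--byte_fallback=true",
        # --character_coverage is deliberately omitted (default: 0.9995)
    ]
    return shlex.join(args)


# Toy usage: in practice k would be 1 or 2 million sentence pairs.
corpus = (f"sentence {i}" for i in range(10_000))
sample = reservoir_sample(corpus, k=100)
print(len(sample))
print(spm_train_command("sample.txt", "vocab"))
```

As an alternative to sampling beforehand, `spm_train` can subsample internally through `--input_sentence_size` together with `--shuffle_input_sentence`.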

Numbers

I would recommend using the split_digits option to free up all those vocabulary slots occupied by numbers that may only be common in the training set. I have been using this lately with good results.
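
A rough illustration of why this frees vocabulary slots. The helper below only mimics the effect of `--split_digits=true` on whitespace-separated text; it is not SentencePiece's implementation, and `number_types` is a hypothetical helper used for counting.

```python
import re


def split_digits(text):
    """Insert a space between adjacent digits so each digit stands alone."""
    return re.sub(r"(?<=\d)(?=\d)", " ", text)


def number_types(tokens):
    """Distinct purely numeric tokens in a token list."""
    return {t for t in tokens if t.isdigit()}


# Years 1990..2024: many distinct number strings, but only ten distinct digits.
text = " ".join(str(year) for year in range(1990, 2025))
before = number_types(text.split())
after = number_types(split_digits(text).split())
print(len(before), len(after))
```

Without splitting, every frequent number string is a candidate for its own vocabulary entry; with splitting, numbers can never need more than ten digit pieces.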

User-defined tokens

Misc

It might be useful to add a few more auxiliary user-defined tokens like __misc1__, __misc2__, etc., in case we later need to implement new logic that requires special tokens. That way there is no need to retrain the whole model; the auxiliary tokens can simply be put to use through fine-tuning.
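
A small sketch of how such placeholders could be reserved, assuming they are passed to `spm_train` through its comma-separated `--user_defined_symbols` flag. The number of placeholders is an arbitrary choice for illustration.

```python
def reserved_symbols(n_misc=3):
    """Generate placeholder tokens __misc1__ .. __miscN__ (N is an assumption)."""
    return [f"__misc{i}__" for i in range(1, n_misc + 1)]


def user_defined_symbols_flag(symbols):
    """Format the symbols as a single spm_train flag."""
    return "--user_defined_symbols=" + ",".join(symbols)


print(user_defined_symbols_flag(reserved_symbols()))
```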

Sentence/paragraph separator

Also add a token like __sep__ or something similar, so that if we want to explore paragraph-level or document-level translation in the future, we can encode the newlines.
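
One possible way to encode newlines with such a token, sketched under the assumption that `__sep__` is registered as a user-defined symbol and therefore survives tokenization as a single piece. The function names are hypothetical.

```python
SEP = "__sep__"  # assumed to be registered as a user-defined symbol


def encode_paragraph(text):
    """Replace newlines with the separator token before tokenization."""
    lines = [line.strip() for line in text.split("\n")]
    return f" {SEP} ".join(lines)


def decode_paragraph(text):
    """Turn separator tokens back into newlines after translation."""
    return "\n".join(part.strip() for part in text.split(SEP))


paragraph = "First line.\nSecond line."
encoded = encode_paragraph(paragraph)
print(encoded)
print(decode_paragraph(encoded) == paragraph)
```

Note that this version strips leading and trailing whitespace on each line, so the round trip is only exact for lines without such whitespace.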

Backtranslation tagging

I've been using a BT special token to tag backtranslated data, but I have my doubts about whether this is useful or not.
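
For reference, a sketch of what such tagging can look like, assuming the tag is prepended to the synthetic source side of each backtranslated pair. The token name `__bt__` and the function are hypothetical; the comment does not say which token or placement is actually used.

```python
BT_TAG = "__bt__"  # hypothetical token name


def tag_source(source, backtranslated):
    """Prefix the source sentence with the BT tag if it is synthetic."""
    return f"{BT_TAG} {source}" if backtranslated else source


pairs = [
    ("A human-written source.", "Its reference translation.", False),
    ("A machine-generated source.", "A human-written target.", True),
]
tagged = [(tag_source(src, is_bt), tgt) for src, tgt, is_bt in pairs]
for src, tgt in tagged:
    print(src, "|||", tgt)
```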

Split vocabularies

Languages that do not share scripts (or even languages with the same script that are very distant) will benefit from separate vocabularies. Maybe using 64k, as mentioned in #747, has the same effect, but I have not experimented with that.
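
A rough heuristic for the script part of this decision, assuming that the first word of a character's Unicode name is a usable proxy for its script. It says nothing about distant languages that share a script, which would need a different signal.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Guess the dominant script from the first word of each letter's Unicode name."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None


def share_script(sample_a, sample_b):
    """True when both samples have the same dominant script."""
    a, b = dominant_script(sample_a), dominant_script(sample_b)
    return a is not None and a == b


print(dominant_script("Hello world"), dominant_script("Привет мир"))
print(share_script("Hello world", "Привет мир"))
```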

NFKC Normalization

There's one thing that's been annoying me a lot, especially when dealing with technical in-domain data: superscripts and subscripts being normalized away.

...
2074	34	# ⁴ => 4
2075	35	# ⁵ => 5
2076	36	# ⁶ => 6
2077	37	# ⁷ => 7
2078	38	# ⁸ => 8
2079	39	# ⁹ => 9
207A	2B	# ⁺ => +
207B	2212	# ⁻ => −
207C	3D	# ⁼ => =
207C 338	2260	# ⁼̸ => ≠
207D	28	# ⁽ => (
207E	29	# ⁾ => )
207F	6E	# ⁿ => n
...

So I have a custom normalization file, built from the original, that omits these mappings.
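
A sketch of how such a file could be derived, assuming the rule file uses the tab-separated layout shown in the excerpt (source code points in hex, target code points, comment). It drops every rule whose source contains a character from the Superscripts and Subscripts block (U+2070 to U+209F) or one of ¹ ² ³; this is not the actual file used here. The result could then be passed to `spm_train` with `--normalization_rule_tsv`.

```python
def is_super_or_subscript(codepoint):
    """True for the Superscripts and Subscripts block and for ¹ ² ³."""
    return (
        0x2070 <= codepoint <= 0x209F
        or codepoint in (0x00B2, 0x00B3, 0x00B9)
    )


def filter_rules(lines):
    """Keep only the rules whose source has no superscript or subscript."""
    kept = []
    for line in lines:
        source_field = line.rstrip("\n").split("\t")[0]
        try:
            source = [int(cp, 16) for cp in source_field.split()]
        except ValueError:
            kept.append(line)  # not a rule line (e.g. "..."); keep unchanged
            continue
        if any(is_super_or_subscript(cp) for cp in source):
            continue
        kept.append(line)
    return kept


rules = [
    "2074\t34\t# superscript four => 4",
    "207C 338\t2260\t# superscript equals + combining solidus",
    "FF21\t41\t# fullwidth A => A",
]
print(filter_rules(rules))
```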

@gregtatum
Member

Sentence/paragraph separator

This could be really nice when working on the inference engine. @nordzilla and I were looking at the strategy for how we chunk up a page for translation, and I think we would benefit from sending in larger chunks of text for translation at the same time so that they have more context on what's happening on a page. After the translation you would need to retain these separators to reconstruct the DOM.
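
A sketch of that round trip, assuming the model was trained to pass a `__sep__` token through unchanged. `translate` stands in for the real engine call, and the fallback to per-chunk translation when separators are lost is an assumption of this example, not an existing behavior of the inference engine.

```python
SEP = "__sep__"  # assumed separator token known to the model


def translate_chunks(chunks, translate):
    """Translate several text chunks in one request and split the result back."""
    joined = f" {SEP} ".join(chunks)
    parts = [part.strip() for part in translate(joined).split(SEP)]
    if len(parts) != len(chunks):
        # Separators were lost or duplicated: translate each chunk on its own.
        parts = [translate(chunk) for chunk in chunks]
    return parts


def fake_translate(text):
    """Stand-in for a translation engine that preserves the separator."""
    return text.replace("hello", "hola")


print(translate_chunks(["hello world", "hello again"], fake_translate))
```

Each returned part maps back to the DOM node at the same index, which is why the count check matters.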

@eu9ene eu9ene added the quality Improving robustness and translation quality label Oct 21, 2024
@eu9ene
Collaborator

eu9ene commented Oct 21, 2024

> Split vocabularies
>
> The languages that do not share scripts (or even languages with the same script that are very distant) will benefit from separated vocabularies. Maybe using 64k, like mentioned in #747, does the same effect, but have not experimented with that.

We've been struggling with Baltic and Slavic languages. I wonder whether using a shared vocab for the languages in Cyrillic is at play here.

@ZJaume
Collaborator Author

ZJaume commented Oct 31, 2024

Most LLM vocabs use BPE, and I remember that back when SentencePiece was becoming established, some papers argued that SP was worse than BPE for NMT. So an experiment with spm_train --model_type bpe is probably worth considering.
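
A sketch of how the two runs of such an experiment could be set up, assuming everything except `--model_type` is held fixed. SentencePiece's default model type is `unigram`, so that is taken as the baseline here; the input path, model prefixes, and vocabulary size are placeholders.

```python
import shlex


def training_commands(input_path, vocab_size=32000):
    """Build one spm_train command per model type, identical otherwise."""
    commands = {}
    for model_type in ("unigram", "bpe"):
        commands[model_type] = shlex.join([
            "spm_train",
            f"--input={input_path}",
            f"--model_prefix=vocab.{model_type}",  # placeholder prefix
            f"--vocab_size={vocab_size}",          # placeholder value
            f"--model_type={model_type}",
        ])
    return commands


for name, command in training_commands("sample.txt").items():
    print(name, "->", command)
```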
