space/none mode potential issue with case_markup #176
Case markup is not really supported for the "none" and "space" tokenization modes.
Should we just raise an error in this case? Or could you describe what your expectation was when using case_markup with the space/none modes? |
I want to use the Correct me if I'm wrong: I then tried the following experiments:
These two approaches give different models & vocabs, which I think is caused by the way the Tokenizer ingests the file. Models learned with the Tokenizer use "tokens" ("phrases") rather than "sentences". |
Your understanding is correct. The behavior is the same as
So the recommendation would simply be: do not use SentencePiece scripts directly. The use case sounds reasonable. A similar issue came up in https://forum.opennmt.net/t/problem-in-tokenize-and-detokenize-during-translation/3954. The difficulty is that we need to lowercase the phrase before SentencePiece so that different casings result in the same segmentation. We would need to add some code to find the original casing after applying SentencePiece. I will look into it. Alternatively, you can try using mode "conservative" or "aggressive". SentencePiece will then be used as a subtokenizer, like BPE. |
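As a minimal sketch of that workaround (assuming an already trained SentencePiece model at a hypothetical path sp.model), the Tokenizer can load the model behind an aggressive pretokenization:
import pyonmttok

# Hypothetical model path. With mode "aggressive", the Tokenizer pretokenizes
# the text first and then applies the SentencePiece model to each token,
# much like it would apply a BPE model.
tokenizer = pyonmttok.Tokenizer(
    "aggressive",
    sp_model_path="sp.model",
    case_markup=True,
    joiner_annotate=True,
)
tokens, _ = tokenizer.tokenize("Hello World")
print(tokens)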
I'm also trying to make sentencepiece work with |
OK, this actually seems to work. I lowercased, created a sentencepiece model and vocab with Maybe this could be adapted in code, so when sentencepiece is used with a mode other than @guillaumekln , could you please clarify what happens and in what order when creating a sentencepiece model and vocab with |
Actually this is not possible with To set a different tokenization, you could use the
The insertion of case markup tokens does not happen in this learning phase. They are added during tokenization after applying the SentencePiece model. |
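To illustrate when the markup is injected, here is a small sketch with no subword model at all: the case markup tokens come from the Tokenizer itself at tokenization time, not from the learned model.
import pyonmttok

# No SentencePiece/BPE model is loaded here; this only shows that the
# case markup tokens are produced by the Tokenizer during tokenization.
tokenizer = pyonmttok.Tokenizer("aggressive", case_markup=True, joiner_annotate=True)
tokens, _ = tokenizer.tokenize("Hello World")
print(tokens)  # markup tokens appear alongside the lowercased tokens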
I see, so essentially it's like running the spm_train directly and
My attempt was to avoid a separate preprocessing step and have everything ready with
Yes, and if the SentencePiece model already contains the case markup tokens as user-defined symbols, then sentencepiece ignores them when it decodes, so the case can be restored correctly and the translated text seems (mostly) fine. But some inconsistencies remain, due to case splitting that creates tokens/subwords unseen by sentencepiece. |
Yes. I added support for pre-tokenization in the PR linked above.
When using |
That's fantastic, thanks! |
I thought I should leave some feedback on this:
|
You generated the vocabulary with |
Yes, the vocab is built with |
Some more feedback: I updated pyonmttok and OpenNMT-tf and tried to build a new vocab with sentencepiece and |
To summarize what was done in the latest update, there are now 2 modes when generating the SentencePiece vocabulary: When no pretokenizer is set:
When a pretokenizer is set:
Are the user-defined symbols in the training data? As said above, the training data is retokenized with SentencePiece so the symbols should appear in the tokenized data to be included in the vocabulary.
You should still be able to use another tokenization mode such as aggressive. Is there an error or bug? |
I should get a better grasp of it, so I could use your help. First, here is the command:
onmt-build-vocab --tokenizer_config ../../../Tokenization/lower_tokenization.yml --size 32000 --sentencepiece user_defined_symbols="⦅D01⦆,⦅D02⦆,⦅D03⦆,⦅D04⦆,⦅D05⦆,⦅mrk_case_modifier_C⦆,⦅mrk_case_modifier_L⦆,⦅mrk_case_modifier_U⦆,⦅mrk_case_modifier_M⦆,⦅mrk_case_modifier_N⦆,⦅mrk_begin_case_region_C⦆,⦅mrk_begin_case_region_L⦆,⦅mrk_begin_case_region_U⦆,⦅mrk_begin_case_region_M⦆,⦅mrk_begin_case_region_N⦆,⦅mrk_end_case_region_C⦆,⦅mrk_end_case_region_L⦆,⦅mrk_end_case_region_U⦆,⦅mrk_end_case_region_M⦆,⦅mrk_end_case_region_N⦆" character_coverage=1 input_sentence_size=10000000 num_threads=16 --size_multiple 8 --save_vocab vocab/base corpus.combined
Here is my
So, with this configuration, I think I'm using "Mode 1" and all options are ignored; the sp model and vocab are built, but the user-defined symbols are not added to the vocab, which confuses me. These symbols are not included in the corpus, but that is not a problem when using sentencepiece directly to create a model and vocab -- it adds the user-defined symbols even when they are not present in the training corpus. When I change
tokenizer = tokenizers.make_tokenizer(args.tokenizer_config)
File "/home/panos/venv36/lib/python3.6/site-packages/opennmt/tokenizers/tokenizer.py", line 322, in make_tokenizer
tokenizer = tokenizer_class(**tokenizer_params)
File "/home/panos/venv36/lib/python3.6/site-packages/opennmt/tokenizers/opennmt_tokenizer.py", line 23, in __init__
self._tokenizer = pyonmttok.Tokenizer(**kwargs)
TypeError: __init__(): incompatible constructor arguments. The following argument types are supported:
1. pyonmttok._ext.Tokenizer(tokenizer: pyonmttok._ext.Tokenizer)
2. pyonmttok._ext.Tokenizer(mode: str, *, bpe_model_path: str = '', bpe_vocab_path: str = '', bpe_vocab_threshold: int = 50, bpe_dropout: float = 0, vocabulary_path: str = '', vocabulary_threshold: int = 0, sp_model_path: str = '', sp_nbest_size: int = 0, sp_alpha: float = 0.1, joiner: str = '■', joiner_annotate: bool = False, joiner_new: bool = False, spacer_annotate: bool = False, spacer_new: bool = False, case_feature: bool = False, case_markup: bool = False, soft_case_regions: bool = False, no_substitution: bool = False, preserve_placeholders: bool = False, preserve_segmented_tokens: bool = False, segment_case: bool = False, segment_numbers: bool = False, segment_alphabet_change: bool = False, support_prior_joiners: bool = False, segment_alphabet: object = None)
Invoked with: kwargs: mode='aggresive', case_markup=True, spacer_annotate=True, soft_case_region=True, preserve_placeholders=True, preserve_segmented_tokens=True
If I get it correctly, "Mode 2" requires using any other mode except Thanks for your patience and your help. |
Sorry for the confusion but when I said "When a pretokenizer is set", it's whenever the option
There is a typo in your config: it should be |
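For reference, here is a corrected version of those arguments, written as the equivalent pyonmttok call rather than the original YAML config (which is not reproduced in this thread); the spellings are inferred from the constructor signature shown in the traceback above.
import pyonmttok

# Corrected spellings: "aggressive" (not "aggresive") and
# soft_case_regions (not soft_case_region).
tokenizer = pyonmttok.Tokenizer(
    "aggressive",
    case_markup=True,
    spacer_annotate=True,
    soft_case_regions=True,
    preserve_placeholders=True,
    preserve_segmented_tokens=True,
)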
I'm following this thread with a lot of interest; many thanks @guillaumekln and @panosk. So, if I understand correctly, it should be possible to pretokenise raw data using the aggressive mode, then create SP vocabs from that pretokenised data, then use the converted vocabs to segment text for training and inference with the OpenNMT tokeniser. I also understand this can be done manually or via the script. However, I suppose that for the aggressive mode to work as expected when tokenising/detokenising, one should apply joiner annotation; otherwise, I see many possible ambiguity cases when detokenising. On the other hand, if an SP model is used, the tokens are generated with the spacer annotation by default, which is incompatible with the joiner annotation according to the doc. Am I right? Or does applying the aggressive mode not need joiner annotation at all, and is it therefore fully compatible with using SP vocab models? Otherwise, could this be solved by applying different parameters when pretokenising for vocab creation and pretokenising for training/inference? |
Hi @dmar1n , @guillaumekln , |
@dmar1n
When you use SentencePiece via the OpenNMT Tokenizer, the spacers are removed internally and converted into metadata so that we can later decide if we want to inject joiners or spacers. From the user perspective, using a pretokenization with SentencePiece should be the same as using a pretokenization with BPE.
This extra step is needed because the internal SentencePiece vocabulary is invalid when using a pretokenization. The basic example is when you want to use joiner annotation with SentencePiece: the SentencePiece internal vocabulary will contain spacers, but the external vocabulary should include joiners. This is why we need to get the vocabulary from the training data, and not from the SentencePiece internal representation. But I'm not sure I understand the use case of user-defined symbols with 0 frequency. If they are not in the tokenized training data, why should they appear in the vocabulary? |
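A rough illustration of that point (not the actual onmt-build-vocab implementation, and with hypothetical file paths): when a pretokenizer is set, the vocabulary is collected from the retokenized training data, so it contains whatever annotation the Tokenizer injected (joiners here), not SentencePiece's internal spacers.
from collections import Counter

import pyonmttok

# Hypothetical paths; pretokenize + subtokenize the training data first.
tokenizer = pyonmttok.Tokenizer(
    "aggressive", sp_model_path="sp.model", joiner_annotate=True, case_markup=True
)
tokenizer.tokenize_file("train.txt", "train.txt.tok", num_threads=4)

# Count the tokens actually written to disk and keep the most frequent ones.
counts = Counter()
with open("train.txt.tok", encoding="utf-8") as f:
    for line in f:
        counts.update(line.split())

with open("vocab.txt", "w", encoding="utf-8") as f:
    for token, _ in counts.most_common(32000):
        f.write(token + "\n")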
Thanks for the explanations @guillaumekln , I see.
I'm adding these symbols later for training the NMT model and for inference, at least that was the case when I was using sentencepiece directly --I may have to adapt it now, no big deal. |
After a few tests, I can confirm that the user-defined symbols must be included in the vocab. Apart from any custom symbols (which can be included in the corpus for training the sp model), the major problem is with the case markup symbols, which cannot be included in the training corpus beforehand but should be in the vocab anyway; otherwise casing doesn't work and there are countless
Just to make sure that I'm not doing anything wrong on my part, after creating the sp model and vocab, I used the same tokenization .yml config for the actual NMT training, with the extra option |
To complete @panosk comments, I have also run some tests with the same idea (applying aggressive mode with case markup as pretok and SentencePiece as vocab model). I first tried manually by building the SentencePiece model on pretokenised text (which already included special symbols). This sort of worked (no errors), but I had the same problem as @panosk: the predictions had many With the script, I managed to reduce the amount of On the other hand, I wonder if this is somehow an inevitable side effect of using pretokenised data with the aggressive mode, and then maybe the Concretely, I'm creating the vocabs with the script and the following
When you tokenise the data for training, do you pretokenise using the OpenNMT tokeniser? This should add the case markup symbols to the training data. At least, this worked for me. |
The case markup symbols should be included in the vocabulary. I just tried building the following dummy vocabulary to make sure it works:
|
I confirm the case markup tokens are included in the vocabulary. These are the first lines of my target vocab:
And indeed, the predictions include the symbols. Here is an example of prediction with
In this sentence, |
Well... I was using a lowercased version of my corpus with I also wonder if the increased amount of |
But this is really strange, because I'm editing this post, as the example I gave was not exact. Here is a real case:
The source vocab has |
When training the SentencePiece model, do you set the
That's only true for plain SentencePiece. When using a pretokenization with either SentencePiece or BPE, I'm just not sure why the
I understand the initial goal of this issue is to train case-insensitive SentencePiece models. We might need to think of a different approach that does not involve a full pretokenization. |
Yes, but with a value in the order of millions. Otherwise, the data is monolingual, of good quality, and deduplicated.
Actually, the example had the
After a number of tests, I can confirm what @panosk pointed out: the issue seems to be linked to a non-alphabetic character preceding the token, such as apostrophes, parentheses, etc. To give you another, more representative example (at 17k steps):
In this case, |
Just to note that when using a pretokenization,
Maybe using |
You are right, but I was careful with that. So while with a normal sentencepiece training at sentence level I set a limit of 10M sentences, with pretokenization I set a limit of 300M (tokens), which should be enough -- at least that's a safe high limit for 64GB of RAM.
That's a good idea, I'll try it asap! |
Thanks a lot for the hints, @guillaumekln. I was indeed using a value of 10M. I will remove that argument and limit the initial corpus beforehand to 10M lines. Regarding the joiner annotation, this was my initial idea when I first joined the thread. Unfortunately, when using joiner annotation, I got an incompatibility error with SentencePiece models. I will try again, though. |
Here are some updates. I have tried joiner annotation. The vocabs are correctly created (there are the expected joiners and no spacers). But when I tokenise the training data, I get the following error:
If I then change the config to have spacer annotation (using the vocabs correctly created with the joiners), I get extremely segmented data, which is normal given that the vocab does not have any spacer. |
I see. At this point, why not use BPE? Since managing case with SentencePiece currently requires a pretokenization (this could be improved in the future), it seems there is little benefit over BPE. From experience, the following BPE tokenization should work well in many cases:
pyonmttok.Tokenizer(
"aggressive",
bpe_model_path=...,
vocabulary_path=...,
joiner_annotate=True,
case_markup=True,
soft_case_regions=True,
preserve_placeholders=True,
preserve_segmented_tokens=True,
segment_case=True,
segment_numbers=True,
segment_alphabet_change=True,
) |
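As a quick sanity check of those options (a standalone sketch: bpe_model_path and vocabulary_path are left out so the snippet runs without any model files), tokenization followed by detokenization should restore the original casing and spacing:
import pyonmttok

tokenizer = pyonmttok.Tokenizer(
    "aggressive",
    joiner_annotate=True,
    case_markup=True,
    soft_case_regions=True,
    preserve_placeholders=True,
    preserve_segmented_tokens=True,
    segment_case=True,
    segment_numbers=True,
    segment_alphabet_change=True,
)

# Case information travels as markup tokens, so detokenization is expected
# to reproduce the original string.
tokens, _ = tokenizer.tokenize("Hello WORLD, OpenNMT!")
print(tokens)
print(tokenizer.detokenize(tokens))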
Thanks for the config sample! I see there are options in that configuration that I was not specifying in my tests. And for clarification, I have been using BPE as a subword model via SentencePiece all the time. I referred to SentencePiece just as the library used to subtokenise, which I configure via the option
Update: I think I understand better now. So, the simplest way to proceed would be to create a BPE model, or a BPE-based tokeniser using the Python wrapper, with the required OpenNMT tokeniser options. This should indeed simplify the process a lot. I will try this approach and let you know. Many thanks again for your help! |
Yes, I meant using the BPE implementation in the Tokenizer. The BPE training is not integrated in |
@guillaumekln , I know this gets a bit off-topic, but could you please verify the steps below for using BPE? I've been using sentencepiece since forever and all my code is adapted to it, but I really need case handling, so I'll test BPE extensively.
Thanks in advance! |
I recommend training the BPE model with the Tokenizer directly. It will take care of many details and ensure consistency. Here's a basic workflow:
import pyonmttok
tokenizer = pyonmttok.Tokenizer(
"aggressive",
joiner_annotate=True,
case_markup=True,
soft_case_regions=True,
preserve_placeholders=True,
preserve_segmented_tokens=True,
segment_case=True,
segment_numbers=True,
segment_alphabet_change=True,
)
learner = pyonmttok.BPELearner(tokenizer=tokenizer, symbols=32000)
learner.ingest_file("train.txt")
tokenizer = learner.learn("bpe.model")
tokenizer.tokenize_file("train.txt", "train.txt.tok", num_threads=4)
Then build the vocabulary from
onmt-build-vocab --save_vocab bpe.vocab train.txt.tok
(Note:
Finally you can either train directly on
Let's try not to diverge too much from the initial issue. For further discussion about BPE, please consider opening a topic on the forum. |
Thanks a lot!
Absolutely! |
I followed the approach suggested to build vocabs and tokenise training data. Until here, everything works like a charm. After 15k training steps, though, there are still many
As you can see, each parenthesis has its joiner attached, while the numbers have spaces around them; unfortunately, everything indicates that these orphaned joiners are systematically replaced with
I replicated the proposed settings/workflow line by line, but maybe I missed an important option here? Otherwise, it shouldn't be difficult to fix this issue in a postprocessing step, but I guess it would be better to find the root cause first. I will look at it and let you know if I find anything relevant. |
Hi @dmar1n , If you followed the steps for using BPE directly in the tokenizer with no sentencepiece involvement, I can confirm that it works like a charm and I get 0
As @guillaumekln noted, we are getting off track from the initial issue, so feel free to post your last comment in the forum and we can continue there. |
Thanks, @panosk, it's good to know that it works for you. I confirm I followed the exact workflow and options suggested. Also note that the issue remains the same for me; that is, not being able to use case markup in any configuration with subword tokenisation. Anyway, I will give it another try and post the issue in the forum if it is still unresolved. Thanks both for your help!
Just a quick update: the suggested solution did work eventually. I think it was a problem with the installed versions. With the latest versions, it works great. Thanks again! |
To get back to the initial issue and request,
So I'm not sure it is possible to effectively implement this outside of SentencePiece. If you have any ideas, please let me know. |
When using case_markup in space/none mode, unexpected behavior happens:
As you can see, .detokenize cannot rebuild the original text. The same behavior exists for space.
Modes conservative and aggressive do not suffer from this issue, but their result is not consistent with the output obtained without case_markup, as they split the text to insert the markup placeholders.