French tokenization - inconsistent application of exceptions in FR_BASE_EXCEPTIONS & other unexpected tokenization #8920
Hi, rule-based French tokenization of hyphens (without any information beyond the token forms at the tokenization stage) is definitely very difficult to do well. I think the misunderstanding here is that the rules from spacy/lang/fr/tokenizer_exceptions.py, lines 10 to 12 (at a1e9f19), […]. You can see if adding them improves the performance for your task; however, you would probably want to retrain the model from scratch if the differences are large, because the pipeline will not perform well on tokens and token sequences it has never seen before. You'd have to compare the performance on your task to see how well it works to use the existing […]. Something like the statistical tokenizer from […]
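As a concrete illustration, custom hyphen exceptions can be added at runtime through the tokenizer's special-case API (a minimal sketch; `add_special_case` is spaCy's documented API, but the entry shown is only an example, not the rules referenced above):

```python
import spacy
from spacy.attrs import ORTH

nlp = spacy.load("fr_core_news_sm")

# A special case maps an exact string to the tokens it should produce.
# The ORTH values must concatenate back to the original string, so a
# single entry keeps the hyphenated form together as one token.
nlp.tokenizer.add_special_case("monte-plat", [{ORTH: "monte-plat"}])

doc = nlp("Regarde le monte-plat là-bas.")
print([t.text for t in doc])
# e.g. ['Regarde', 'le', 'monte-plat', 'là', '-', 'bas', '.']
```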
Two possibly related issues:
It's always possible that I have fundamentally misunderstood something about how spaCy works; apologies if that's the case. (A minimal reproduction sketch follows the test outputs below.)
Expected output of test_1: ['Regarde', 'le', 'monte-plat', 'là-bas', '.']
Actual output of test_1: ['Regarde', 'le', 'monte', '-', 'plat', 'là', '-', 'bas', '.']
Expected output of test_2: ["C'", 'est', 'peut-être', 'un', 'chat', '.'] (the tokenization exception should keep 'peut-être' as one token)
Actual output of test_2: ["C'", 'est', 'peut-être', 'un', 'chat', '.'] (success)
Expected output of test_3: ["C'", 'est', 'Peut-être', 'un', 'chat', '.']
Actual output of test_3: ["C'", 'est', 'Peut', '-', 'être', 'un', 'chat', '.'] (the tokenization exception is not applied when the case differs)
Expected output of test_4: ['Peut-être', 'est', '-', 'ce', 'un', 'chat', '.']
Actual output of test_4: ['Peut', '-', 'être', 'est', '-ce', 'un', 'chat', '.'] (the tokenizer does not split 'est-ce' into 'est', '-', 'ce' on the hyphen, which here marks inversion)
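For reference, the tests above correspond to something like the following (a minimal sketch; the original script isn't shown, so the helper name and the fr_core_news_sm pipeline are assumptions):

```python
import spacy

nlp = spacy.load("fr_core_news_sm")  # assumed pipeline; any French model shows the same splits

def run_test(name, text):
    # Hypothetical helper standing in for the original (unshown) test script.
    print(name, [t.text for t in nlp(text)])

run_test("test_1", "Regarde le monte-plat là-bas.")
run_test("test_2", "C'est peut-être un chat.")
run_test("test_3", "C'est Peut-être un chat.")
run_test("test_4", "Peut-être est-ce un chat.")
```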
This happens with several of the exceptions from spacy/lang/fr/_tokenizer_exceptions_list.py that I tried, but it is not consistent across cases. Sometimes, text containing 'Anne-marie' rather than 'Anne-Marie' is tokenized as ['Anne', '-', 'marie'].
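One way to see which rule fires for a given string is `tokenizer.explain()`, which is part of spaCy's public API (a debugging sketch, assuming 'Anne-Marie' really is in the exception list as the behaviour suggests):

```python
import spacy

nlp = spacy.blank("fr")  # the rules live in the language data, no trained model needed

# tokenizer.explain() reports which rule produced each token:
# SPECIAL-n entries come from the exception list, INFIX/PREFIX/SUFFIX
# from the punctuation regexes.
for text in ("Anne-Marie", "Anne-marie"):
    print(text, nlp.tokenizer.explain(text))
```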
Additionally, the exceptions that contain spaces, such as a number of city names ('Les Ormes-sur-Voulize' etc.), are tokenized as two tokens rather than one: 'Les' (DET) and 'Ormes-sur-Voulize' (PROPN). Perhaps this is to be expected if the text is split on whitespace before the tokenization exceptions are applied, in which case any tokenization exception containing a space may not work as expected, i.e. 'Les Ormes-sur-Voulize' can never come out as one token.
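If whitespace splitting is indeed the cause, one workaround would be to merge the span back together after tokenization (a sketch I put together, not spaCy's built-in handling of these exceptions):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("fr_core_news_sm")

# Match the multi-word name on the already-tokenized Doc, then merge it
# back into a single token with the retokenizer.
matcher = PhraseMatcher(nlp.vocab)
matcher.add("CITY", [nlp.make_doc("Les Ormes-sur-Voulize")])

doc = nlp("Ils habitent Les Ormes-sur-Voulize.")
with doc.retokenize() as retokenizer:
    for _, start, end in matcher(doc):
        retokenizer.merge(doc[start:end])

print([t.text for t in doc])
```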
The consequences of these issues are particularly devastating for any statistics on part-of-speech types, since these tokenization patterns lead to misleading tagging. For example, 'peut-être' tokenized as ['peut', '-', 'être'] often results in all three tokens (including '-') being assigned a .pos_ of 'ADV', so any script that counts adverbs in a document will return inflated numbers; and tokenizing 'est-ce' as ['est', '-ce'] results in '-ce' being lemmatized as '-ce' rather than 'ce', etc.
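To make the inflated counts concrete (a sketch; the exact tags depend on the model and version):

```python
from collections import Counter
import spacy

nlp = spacy.load("fr_core_news_sm")

doc = nlp("Peut-être est-ce un chat.")
print([(t.text, t.pos_) for t in doc])

# If 'Peut-être' is split, 'Peut', '-' and 'être' can each come back
# tagged ADV, so a naive adverb count is inflated.
print("ADV count:", Counter(t.pos_ for t in doc)["ADV"])
```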
Your Environment