French tokenization - inconsistent application of exceptions in FR_BASE_EXCEPTIONS & other unexpected tokenization #8920
Hi, rule-based French tokenization of hyphens (without any information beyond the token forms at the tokenization stage) is definitely very difficult to do well. I think the misunderstanding here is that the rules from spacy/lang/fr/tokenizer_exceptions.py, lines 10 to 12 (at a1e9f19), […]. You can see if adding them improves the performance for your task; however, you would probably want to retrain the model from scratch if the differences are large, because the pipeline will not perform well on tokens and token sequences it has never seen before. You'd have to compare the performance on your task to see how well it works to use the existing […]. Something like the statistical tokenizer from […]
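As a concrete illustration, custom hyphen exceptions can be added at runtime through the tokenizer's special-case API (a minimal sketch; `add_special_case` is spaCy's documented API, but the entry shown is only an example, not the rules referenced above):

```python
import spacy
from spacy.attrs import ORTH

nlp = spacy.load("fr_core_news_sm")

# A special case maps an exact string to the tokens it should produce.
# The ORTH values must concatenate back to the original string, so a
# single entry keeps the hyphenated form together as one token.
nlp.tokenizer.add_special_case("monte-plat", [{ORTH: "monte-plat"}])

doc = nlp("Regarde le monte-plat là-bas.")
print([t.text for t in doc])
# e.g. ['Regarde', 'le', 'monte-plat', 'là', '-', 'bas', '.']
```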
Two possibly related issues:
It's always possible that I have fundamentally misunderstood something about how spaCy works; apologies if that's the case. (A minimal reproduction sketch follows the test outputs below.)
Expected output of test_1: ['Regarde', 'le', 'monte-plat', 'là-bas', '.']
Actual output of test_1: ['Regarde', 'le', 'monte', '-', 'plat', 'là', '-', 'bas', '.']
Expected output of test_2: ["C'", 'est', 'peut-être', 'un', 'chat', '.'] (the tokenization exception should keep 'peut-être' as one token)
Actual output of test_2: ["C'", 'est', 'peut-être', 'un', 'chat', '.'] (success)
Expected output of test_3: ["C'", 'est', 'Peut-être', 'un', 'chat', '.']
Actual output of test_3: ["C'", 'est', 'Peut', '-', 'être', 'un', 'chat', '.'] (the tokenization exception is not applied when the case differs)
Expected output of test_4: ['Peut-être', 'est', '-', 'ce', 'un', 'chat', '.']
Actual output of test_4: ['Peut', '-', 'être', 'est', '-ce', 'un', 'chat', '.'] (the tokenizer does not split 'est-ce' into 'est', '-', 'ce' on the hyphen, which here marks inversion)
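For reference, the tests above correspond to something like the following (a minimal sketch; the original script isn't shown, so the helper name and the fr_core_news_sm pipeline are assumptions):

```python
import spacy

nlp = spacy.load("fr_core_news_sm")  # assumed pipeline; any French model shows the same splits

def run_test(name, text):
    # Hypothetical helper standing in for the original (unshown) test script.
    print(name, [t.text for t in nlp(text)])

run_test("test_1", "Regarde le monte-plat là-bas.")
run_test("test_2", "C'est peut-être un chat.")
run_test("test_3", "C'est Peut-être un chat.")
run_test("test_4", "Peut-être est-ce un chat.")
```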
This happens with several of the exceptions from spacy/lang/fr/_tokenizer_exceptions_list.py that I tried, but it is not consistent across cases. Sometimes, text containing 'Anne-marie' rather than 'Anne-Marie' is tokenized as ['Anne', '-', 'marie'].
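One way to see which rule fires for a given string is `tokenizer.explain()`, which is part of spaCy's public API (a debugging sketch, assuming 'Anne-Marie' really is in the exception list as the behaviour suggests):

```python
import spacy

nlp = spacy.blank("fr")  # the rules live in the language data, no trained model needed

# tokenizer.explain() reports which rule produced each token:
# SPECIAL-n entries come from the exception list, INFIX/PREFIX/SUFFIX
# from the punctuation regexes.
for text in ("Anne-Marie", "Anne-marie"):
    print(text, nlp.tokenizer.explain(text))
```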
Additionally, the exceptions that contain spaces, such as a number of city names ('Les Ormes-sur-Voulize' etc.), are tokenized as two tokens rather than one: 'Les' (DET) and 'Ormes-sur-Voulize' (PROPN). Perhaps this is to be expected if the text is split on whitespace before the tokenization exceptions are applied, in which case any tokenization exception containing a space may not work as expected, i.e. 'Les Ormes-sur-Voulize' can never come out as one token.
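If whitespace splitting is indeed the cause, one workaround would be to merge the span back together after tokenization (a sketch I put together, not spaCy's built-in handling of these exceptions):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("fr_core_news_sm")

# Match the multi-word name on the already-tokenized Doc, then merge it
# back into a single token with the retokenizer.
matcher = PhraseMatcher(nlp.vocab)
matcher.add("CITY", [nlp.make_doc("Les Ormes-sur-Voulize")])

doc = nlp("Ils habitent Les Ormes-sur-Voulize.")
with doc.retokenize() as retokenizer:
    for _, start, end in matcher(doc):
        retokenizer.merge(doc[start:end])

print([t.text for t in doc])
```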
The consequences of these issues are particularly devastating for any statistics on part-of-speech types, since these tokenization patterns lead to misleading tagging. For example, 'peut-être' tokenized as ['peut', '-', 'être'] often results in all three tokens (including '-') being assigned a .pos_ of 'ADV', so any script that counts adverbs in a document will return inflated numbers; and tokenizing 'est-ce' as ['est', '-ce'] results in '-ce' being lemmatized as '-ce' rather than 'ce', etc.
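To make the inflated counts concrete (a sketch; the exact tags depend on the model and version):

```python
from collections import Counter
import spacy

nlp = spacy.load("fr_core_news_sm")

doc = nlp("Peut-être est-ce un chat.")
print([(t.text, t.pos_) for t in doc])

# If 'Peut-être' is split, 'Peut', '-' and 'être' can each come back
# tagged ADV, so a naive adverb count is inflated.
print("ADV count:", Counter(t.pos_ for t in doc)["ADV"])
```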
Your Environment