nlp(u'BUSINESS')[0].lemma_ == 'busines' #2900

danielvarga · 2018-11-05T13:26:55Z

I wonder if this is just some ad hoc machine learning weirdness, or the symptom of something more significant.

>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> nlp(u'business')[0].lemma_
u'business'
>>> nlp(u'BUSINESS')[0].lemma_
u'busines'

## Info about spaCy

* **Python version:** 2.7.10
* **Platform:** Darwin-17.6.0-x86_64-i386-64bit
* **spaCy version:** 2.0.11
* **Models:** en_core_web_sm

DuyguA · 2018-11-14T19:56:32Z

Hello,
Lemmatization is done via lookup. I checked the lemmatizer file, and "business" is in the regular nouns list as expected. I have a rough idea of what causes this bug; can you please also test the words "KINDNESS", "KIDS" and "BALAS" and report the results? (I'm travelling and shell-less 😬)

danielvarga · 2018-11-15T00:20:15Z

The all-caps versions are lemmatized to kindness, kids, balas.
The lower case versions are lemmatized to kindness, kid, bala.
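For anyone who wants to repeat this comparison, here is a small helper, offered as a sketch rather than an official recipe. It assumes `nlp` is an already loaded spaCy pipeline (for example the result of `spacy.load('en_core_web_sm')`); the function name `compare_case_variants` is made up for this example. It records `tag_` next to `lemma_`, since the predicted tag may help narrow down the cause.

```python
def compare_case_variants(nlp, words):
    """Return (text, tag_, lemma_) for the lower- and upper-case form of each word.

    `nlp` is assumed to be a loaded spaCy pipeline; each variant is
    processed on its own, as in the original report.
    """
    rows = []
    for word in words:
        for variant in (word.lower(), word.upper()):
            token = nlp(variant)[0]  # first (and only) token of the one-word doc
            rows.append((token.text, token.tag_, token.lemma_))
    return rows


# Usage, once a pipeline is loaded:
# for row in compare_case_variants(nlp, ['business', 'kindness', 'kids', 'balas']):
#     print(row)
```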

giannisdaras · 2018-11-17T16:57:09Z

I think I figured out what's going on.

First of all, if you load an English language model, lemmatization is not based on lookup. It uses a rule-based strategy that depends on the POS tag. See #2668.

Both 'business' and 'BUSINESS' are identified as nouns, so both are supposed to go through the rule-based lemmatization for nouns.

Rule-based lemmatization works by converting the word to lowercase and then applying rules, chosen by POS tag, that rewrite suffixes to arrive at a lemma. The catch is that the rules are applied even if the word is already a lemma, which leads to serious mistakes. To prevent this, spaCy checks whether a word is a base form before applying the rules.
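To make the mechanism concrete, here is a simplified sketch of suffix-rule lemmatization. It is not spaCy's actual code: `NOUN_RULES`, `KNOWN_WORDS` and `apply_noun_rules` are hypothetical names with toy data, chosen only to show how blindly stripping a suffix turns a word that is already a lemma into something wrong.

```python
# Simplified sketch of suffix-rule lemmatization; NOT spaCy's actual code.
# NOUN_RULES and KNOWN_WORDS are hypothetical, illustrative data.
NOUN_RULES = [("ies", "y"), ("ses", "s"), ("s", "")]
KNOWN_WORDS = {"kid", "business", "story"}


def apply_noun_rules(word):
    """Lowercase the word and return candidate lemmas from the suffix rules."""
    string = word.lower()
    known, unknown = [], []
    for old, new in NOUN_RULES:
        if string.endswith(old):
            form = string[: len(string) - len(old)] + new
            if not form:
                continue
            (known if form in KNOWN_WORDS else unknown).append(form)
    # Prefer candidates found in the word list, then unknown candidates,
    # and fall back to the lowercased word if no rule matched.
    return known or unknown or [string]


print(apply_noun_rules("KIDS"))      # ['kid']
print(apply_noun_rules("BUSINESS"))  # ['busines']
```

In this toy version the only rule that matches "business" is the bare "s" rule, so the sketch returns "busines" even though "business" itself is in the word list.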

But first, what is a base form? This comes from issue #435. The idea is that the lemma of a lemma is the lemma itself. To explain a little more: uninflected words do not need lemmatization, so we just return the word itself. That means we need a way to check whether a word is uninflected. In spaCy this is done with the is_base_form function.

To decide whether a word is itself a lemma, spaCy looks at the word's morphological features and checks whether they match the features a lemma would have for the given POS tag.
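A minimal sketch of what such a check could look like, assuming features are passed as a plain dict. The function name `looks_like_base_form` and the feature values `"sing"`, `"plur"` and `"inf"` are assumptions for illustration; the real is_base_form handles more cases than this.

```python
# Hypothetical, simplified stand-in for a base-form check; not spaCy's code.
def looks_like_base_form(pos, features):
    """Return True when the features describe an uninflected form."""
    if pos == "noun":
        return features.get("Number") == "sing"
    if pos == "verb":
        return features.get("VerbForm") == "inf"
    # With no rule for this part of speech, assume the word is inflected.
    return False


print(looks_like_base_form("noun", {"Number": "sing"}))  # True
print(looks_like_base_form("noun", {"Number": "plur"}))  # False
```

The check never looks at the word's spelling, only at the features the model predicted for it.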

Now, the problem:

business gets tag_ NN, while BUSINESS gets tag_ NNS. As a result, business is identified as a base form, while BUSINESS is not and is therefore lemmatized by the rules, which produces the wrong result.

In case you are wondering what those tags mean: NN is a singular noun and NNS is a plural noun.

Of course, this is a bug, and it happens for other words as well.
For example, the word patness behaves the same way.
I can't find a quick fix for it, because morphological features come from models trained on data, so there is always a risk of error. One thing I think we should consider is collecting a list of lemmas for each language and modifying is_base_form slightly to consult that list too. We can get the list easily by taking the first column of the lookup tables and converting it to a set.
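Roughly, the idea could be sketched like this. Everything here is hypothetical: `KNOWN_LEMMAS` is a toy stand-in for the set built from the lookup tables, and `is_base_form_with_lemma_list` and `lemmatize_noun` are made-up names, not spaCy APIs. The single "strip a trailing s" rule stands in for the full noun rules.

```python
# Hypothetical sketch of the proposal; none of these names are spaCy APIs.
KNOWN_LEMMAS = {"business", "kindness", "kid", "bala"}  # toy stand-in for the lookup-derived set


def is_base_form_with_lemma_list(word, tag):
    """Treat a word as a base form if the tagger says singular noun (NN)
    or the lowercased word is itself in the known-lemma set."""
    return tag == "NN" or word.lower() in KNOWN_LEMMAS


def lemmatize_noun(word, tag):
    """Return the lowercased word for base forms, else apply one toy rule."""
    string = word.lower()
    if is_base_form_with_lemma_list(word, tag):
        return string
    # Single toy rule: strip a trailing "s" from presumed plurals.
    return string[:-1] if string.endswith("s") else string


print(lemmatize_noun("BUSINESS", "NNS"))  # business
print(lemmatize_noun("KIDS", "NNS"))      # kid
```

One trade-off to keep in mind: a word that is both a lemma and the plural of another word would never be reduced under this check, because the list lookup overrides the tagger.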

@ines, @honnibal what are your views on that?

honnibal · 2018-12-16T16:49:52Z

Ultimately I think we do want the lemmatizer to "trust" the attributes returned by the tagger. So, I think this is a tagger error, rather than a lemmatizer error. Given that the tagger has said "BUSINESS" is a plural noun, "BUSINES" is kind of the best guess you could hope for as a lemma for this hypothetical plural. The real problem is that the tagger has made such an unreasonable guess.

ines · 2018-12-16T16:51:00Z

Merging this with #3052. We've now added a master thread for incorrect predictions and related reports – see the issue for more details.

lock · 2019-01-15T17:58:50Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

ines added the bug, feat / lemmatizer, and lang / en labels on Nov 5, 2018

giannisdaras mentioned this issue on Nov 29, 2018 in "Low quality of Swedish tokenization and lemmatization" #2578 (closed)

honnibal removed the bug label on Dec 16, 2018

honnibal added the perf / accuracy label on Dec 16, 2018

ines closed this as completed Dec 16, 2018

lock bot locked as resolved and limited conversation to collaborators Jan 15, 2019
