nlp(u'BUSINESS')[0].lemma_ == 'busines' #2900

Closed
danielvarga opened this issue Nov 5, 2018 · 6 comments
Labels
feat / lemmatizer (Feature: Rule-based and lookup lemmatization), lang / en (English language data and models), perf / accuracy (Performance: accuracy)

Comments

@danielvarga

I wonder if this is just some ad hoc machine learning weirdness, or the symptom of something more significant.

>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> nlp(u'business')[0].lemma_
u'business'
>>> nlp(u'BUSINESS')[0].lemma_
u'busines'

## Info about spaCy

* **Python version:** 2.7.10
* **Platform:** Darwin-17.6.0-x86_64-i386-64bit
* **spaCy version:** 2.0.11
* **Models:** en_core_web_sm

@ines added the bug (Bugs and behaviour differing from documentation), feat / lemmatizer (Feature: Rule-based and lookup lemmatization), and lang / en (English language data and models) labels on Nov 5, 2018
@DuyguA
Contributor

DuyguA commented Nov 14, 2018

Hello,
Lemmatization is via lookup. I checked the lemmatizer file, and "business" is in the regular nouns list as expected. I have a rough idea about this bug. Can you please also test the words "KINDNESS", "KIDS" and "BALAS" and report the results? (I'm travelling and shell-less 😬)

@danielvarga
Author

The all-caps versions are lemmatized to kindness, kids, balas.
The lower case versions are lemmatized to kindness, kid, bala.

@giannisdaras
Contributor

giannisdaras commented Nov 17, 2018

I think I figured out what's going on.

First of all, if you load an English language model, lemmatization is not based on lookup. It uses a rule-based strategy driven by the POS tag. See #2668.

Both 'business' and 'BUSINESS' are identified as nouns, so both are supposed to follow the rule-based lemmatization for nouns.

Rule-based lemmatization works by lowercasing the word and then applying rules, chosen by POS tag, that rewrite suffixes to arrive at a lemma. The catch is that the rules are applied even if the word is already a lemma, which leads to serious mistakes. To avoid this, spaCy checks whether a word is a base form before it applies the rules.
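
The suffix-rule step can be illustrated with a toy sketch. The rule list `NOUN_RULES` and the function `apply_noun_rules` are hypothetical stand-ins, not spaCy's actual data or API. The only point is that stripping a plural suffix from a word that is already a lemma produces a wrong form.

```python
# Toy illustration only: NOUN_RULES and apply_noun_rules are hypothetical,
# not spaCy's real rule table or API.
NOUN_RULES = [("ies", "y"), ("ches", "ch"), ("shes", "sh"), ("s", "")]


def apply_noun_rules(word):
    """Lowercase the word and apply the first matching suffix rule."""
    string = word.lower()
    for old, new in NOUN_RULES:
        if string.endswith(old):
            return string[: len(string) - len(old)] + new
    return string


print(apply_noun_rules("KIDS"))      # kid
print(apply_noun_rules("BUSINESS"))  # busines
```

Applied blindly, the same rule that correctly turns "kids" into "kid" also turns "business" into "busines".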

But first, what is a base form? This comes from issue #435. The idea is that the lemma of a lemma is the lemma itself. To explain a bit more: uninflected words do not need lemmatization, so we just return the word itself. That means we need a way to check whether a word is uninflected, which spaCy does with the is_base_form function.

To decide whether a word is itself a lemma, spaCy looks at the word's morphological features and checks whether they match those of a lemma for the given POS tag.

Now, the problem:

business gets tag_ NN, while BUSINESS gets tag_ NNS. As a result, business is identified as a base form, while BUSINESS is not, so it gets lemmatized by the rules, which produces the wrong result.
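
A minimal sketch of how the tag changes the outcome. The names `is_base_form_sketch` and `lemmatize_noun` are illustrative, the check is reduced to comparing the tag string (the real check looks at morphological features), and a single "s" rule stands in for the full rule table.

```python
# Hypothetical sketch: is_base_form_sketch and lemmatize_noun are illustrative
# names, and the single "s" rule stands in for a full rule table.
def is_base_form_sketch(tag):
    """Treat a singular-noun tag (NN) as uninflected; NNS is plural."""
    return tag == "NN"


def lemmatize_noun(word, tag):
    string = word.lower()
    if is_base_form_sketch(tag):
        return string  # base form: skip the suffix rules entirely
    if string.endswith("s"):
        return string[:-1]  # plural rule: strip the trailing "s"
    return string


print(lemmatize_noun("business", "NN"))   # business
print(lemmatize_noun("BUSINESS", "NNS"))  # busines
```

The same word yields two different lemmas purely because of the tag it was given.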

In case you are wondering what those tags mean: NN is a singular noun and NNS is a plural noun.

Of course, this is a bug and happens for other words as well.
For example, the word patness behaves in the same way.
I can't find a quick fix for this, because morphological features come from models trained on data, so there is always a risk of error. One thing we should consider is collecting a list of lemmas for each language and modifying is_base_form slightly to consult that list too. We can get the list easily by taking the first column of the lookup tables and converting it to a set.
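
A rough sketch of that proposal, under stated assumptions: `KNOWN_LEMMAS` is a tiny hand-written stand-in for a per-language lemma list, and `is_base_form_with_list` and `lemmatize_noun_with_list` are hypothetical names rather than spaCy functions.

```python
# Hypothetical sketch of the proposal: KNOWN_LEMMAS is a tiny hand-written
# stand-in for a per-language lemma list, not real spaCy data.
KNOWN_LEMMAS = {"business", "kindness", "patness", "kid"}


def is_base_form_with_list(word, tag):
    """Accept a word as a base form if the tag says so or it is a known lemma."""
    return tag == "NN" or word.lower() in KNOWN_LEMMAS


def lemmatize_noun_with_list(word, tag):
    string = word.lower()
    if is_base_form_with_list(string, tag):
        return string  # known lemma or singular tag: leave unchanged
    if string.endswith("s"):
        return string[:-1]  # otherwise fall back to the plural rule
    return string


print(lemmatize_noun_with_list("BUSINESS", "NNS"))  # business
print(lemmatize_noun_with_list("KIDS", "NNS"))      # kid
```

One trade-off of this design: a form that is both a lemma in its own right and an inflection of another word would no longer be reduced.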

@ines, @honnibal what are your views on that?

@honnibal removed the bug (Bugs and behaviour differing from documentation) label on Dec 16, 2018
@honnibal
Member

Ultimately I think we do want the lemmatizer to "trust" the attributes returned by the tagger. So, I think this is a tagger error, rather than a lemmatizer error. Given that the tagger has said "BUSINESS" is a plural noun, "BUSINES" is kind of the best guess you could hope for as a lemma for this hypothetical plural... The real problem is that the tagger has made such an unreasonable guess.

@honnibal added the perf / accuracy (Performance: accuracy) label on Dec 16, 2018
@ines
Member

ines commented Dec 16, 2018

Merging this with #3052. We've now added a master thread for incorrect predictions and related reports – see the issue for more details.

@ines ines closed this as completed Dec 16, 2018
@lock

lock bot commented Jan 15, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Jan 15, 2019

5 participants