nlp(u'BUSINESS')[0].lemma_ == 'busines' #2900
Comments
Hello,
The all-caps versions are lemmatized to kindness, kids, balas.
I think I figured out what's going on. First of all, if you load an English language model, lemmatization is not based on lookup. Instead, there is a rule-based lemmatization strategy that depends on the POS tag; see #2668. Both 'business' and 'BUSINESS' are identified as nouns, so both are supposed to follow the rule-based lemmatization for nouns. Rule-based lemmatization works by converting the word to lowercase and then applying suffix-rewriting rules, selected by POS tag, to try to match a known lemma. The catch is that the rules are applied even if the word is already a lemma, which leads to serious mistakes. To avoid this, spaCy checks whether a word is a base form before it applies the rules. But first, what is a base form? This comes from issue #435. The idea is that the lemma of a lemma is the lemma itself. To explain a little more: uninflected words do not need lemmatization, so we just return the word itself. So we need a way to check whether the word is uninflected. spaCy does this with the is_base_form function: to decide whether a word is already a lemma, it checks whether the word's morphological features match those of a lemma for the given POS tag. Now, the problem:
In case you are wondering what those tags mean: NN means singular noun and NNS means plural noun. Of course, this is a bug, and it happens for other words as well.
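The mechanism described in this comment can be illustrated with a toy lemmatizer. This is only a sketch under stated assumptions, not spaCy's actual code: the names `NOUN_RULES`, `KNOWN_LEMMAS`, and `toy_lemmatize` are hypothetical, the suffix rules are a small WordNet-style sample, and `is_base_form` is reduced to "the tag is NN", which is far simpler than the real morphological check.

```python
# Toy sketch of rule-based noun lemmatization with a base-form check.
# All names and data here are illustrative assumptions, not spaCy internals.

# (old_suffix, new_suffix) pairs, tried in order.
NOUN_RULES = [
    ("s", ""), ("ses", "s"), ("ves", "f"), ("xes", "x"), ("zes", "z"),
    ("ches", "ch"), ("shes", "sh"), ("men", "man"), ("ies", "y"),
]

# Tiny stand-in for the index of known noun lemmas.
KNOWN_LEMMAS = {"business", "kid", "kindness"}


def is_base_form(tag):
    """Assumption: a singular noun (NN) is treated as already uninflected."""
    return tag == "NN"


def toy_lemmatize(word, tag):
    """Return one lemma for a noun, trusting the tag it is given."""
    string = word.lower()
    if is_base_form(tag):
        # The lemma of a lemma is the lemma itself.
        return string
    in_vocab, out_of_vocab = [], []
    for old, new in NOUN_RULES:
        if string.endswith(old):
            candidate = string[: len(string) - len(old)] + new
            if not candidate:
                continue
            if candidate in KNOWN_LEMMAS:
                in_vocab.append(candidate)
            else:
                out_of_vocab.append(candidate)
    # Prefer candidates that are known lemmas, then fall back to
    # out-of-vocabulary candidates, then to the lowercased word.
    candidates = in_vocab or out_of_vocab or [string]
    return candidates[0]


print(toy_lemmatize("BUSINESS", "NN"))     # business
print(toy_lemmatize("BUSINESS", "NNS"))    # busines
print(toy_lemmatize("businesses", "NNS"))  # business
```

With the tag NN the word is returned unchanged apart from lowercasing. With the tag NNS the base-form shortcut is skipped, the "s" rule strips the final letter, and because no candidate is a known lemma the out-of-vocabulary form "busines" is returned.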
Ultimately I think we do want the lemmatizer to "trust" the attributes returned by the tagger. So I think this is a tagger error rather than a lemmatizer error. Given that the tagger has said "BUSINESS" is a plural noun, "BUSINES" is kind of the best guess you could hope for as a lemma for this hypothetical plural... The real problem is that the tagger has made such an unreasonable guess.
Merging this with #3052. We've now added a master thread for incorrect predictions and related reports – see the issue for more details.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
I wonder if this is just some ad hoc machine learning weirdness, or the symptom of something more significant.