'POS tagging' output is not correct #13310
Replies: 4 comments
-
Hi! The accuracy of the tagger is not 100%, so you will definitely find cases that are incorrect. Nevertheless, our benchmarks put its accuracy above 97% on the OntoNotes 5.0 corpus. Are you giving it full grammatical sentences to tag? Can you give some examples of systematic errors, for both lowercase and capitalized words, showing the full sentences in which they are used and wrongly tagged?
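One way to tell systematic errors from one-off mistakes is to tally, per word, which tags it received across many sentences. The following is a minimal sketch only, not part of spaCy: the helper names `tag_distribution` and `mostly_tagged_as` are hypothetical, and the sample data is hand-written rather than real tagger output.

```python
from collections import Counter, defaultdict


def tag_distribution(tagged_sentences):
    """Map each lowercased word to a Counter of the coarse tags it received."""
    dist = defaultdict(Counter)
    for sentence in tagged_sentences:
        for text, pos in sentence:
            dist[text.lower()][pos] += 1
    return dist


def mostly_tagged_as(dist, pos, min_count=2, min_share=0.5):
    """Return words seen at least `min_count` times whose share of `pos` is >= `min_share`."""
    hits = []
    for word, counts in dist.items():
        total = sum(counts.values())
        if total >= min_count and counts[pos] / total >= min_share:
            hits.append(word)
    return sorted(hits)


# Hand-written (text, pos) pairs standing in for tagger output; NOT real spaCy results.
sample = [
    [("Water", "PROPN"), ("is", "AUX"), ("wet", "ADJ")],
    [("Add", "VERB"), ("Water", "PROPN"), ("slowly", "ADV")],
    [("Drink", "VERB"), ("water", "NOUN")],
]

dist = tag_distribution(sample)
suspects = mostly_tagged_as(dist, "PROPN")
print(suspects)
```

With a loaded pipeline `nlp`, the input pairs could be built as `[(t.text, t.pos_) for t in nlp(sentence)]` for each sentence. The thresholds `min_count` and `min_share` are arbitrary defaults chosen for illustration.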
-
Yes, I am giving full grammatical sentences to tag. I cannot share the sentences because they are confidential business data. But I have given examples above, and here are a few more examples of incorrect tagging: This is what I am printing above
-
I've noticed the same issue.
The only noun (according to spaCy) is "Student". Others are detected as proper nouns. It's not correct imo. In this variant:
The first word is still marked as So capitalization affinity is way too strong in spaCy.
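To check how much casing alone drives the tags, the same sentence can be tagged in its original and lowercased form and the two results compared token by token. This is a rough sketch under assumptions: `case_sensitive_tokens` is a hypothetical helper, the pairs below are hand-written for illustration and are not real spaCy output, and it assumes both variants tokenize to the same number of tokens.

```python
def case_sensitive_tokens(original, lowercased):
    """Return (text, tag_original, tag_lowercased) for tokens whose tag changes with casing."""
    if len(original) != len(lowercased):
        # Lowercasing can in principle change tokenization; this sketch does not realign.
        raise ValueError("tokenization differs between variants; align tokens first")
    changed = []
    for (text, pos_a), (_, pos_b) in zip(original, lowercased):
        if pos_a != pos_b:
            changed.append((text, pos_a, pos_b))
    return changed


# Hand-written (text, pos) pairs, illustrative only; NOT real tagger output.
title_case = [("Dry", "PROPN"), ("Wheat", "PROPN"), ("stores", "VERB"), ("well", "ADV")]
lower_case = [("dry", "ADJ"), ("wheat", "NOUN"), ("stores", "VERB"), ("well", "ADV")]

flips = case_sensitive_tokens(title_case, lower_case)
for text, before, after in flips:
    print(f"{text}: {before} -> {after}")
```

With a loaded pipeline `nlp`, the two inputs could be built as `[(t.text, t.pos_) for t in nlp(sentence)]` and `[(t.text, t.pos_) for t in nlp(sentence.lower())]`. Tokens that flip from `PROPN` to another tag when lowercased are the ones where capitalization is likely deciding the tag.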
-
Another case of improper POS tagging:
-
How to reproduce the behaviour
Hi, I tried the four English pipelines (en_core_web_sm, en_core_web_md, en_core_web_lg, en_core_web_trf) for POS tagging. I know that spaCy is case sensitive, but why are words like 'Water', 'Wheat', 'Cereal', 'Dry', 'Information', 'Research', etc. tagged as proper nouns (NNP, PROPN)?
These words are in title case. Even lowercase words like oil, dry, nutritional, law, and express are tagged as NNP.
I thought maybe the 'md' model is not large enough to recognize them, but even larger models like 'lg' and 'trf' are giving poor results. Am I doing something wrong? Can you please help me?
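import spacy

# Load the transformer-based English pipeline
nlp_trf = spacy.load("en_core_web_trf")
# Placeholder sentence; the original input text was not shared
text = "Water and Wheat are used in Cereal."
doc = nlp_trf(text)
# Print each token with its lemma, fine-grained tag, and coarse POS
for token in doc:
    print(token.text, token.lemma_, token.tag_, token.pos_)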
Thanks,
Aditya
Your Environment