'POS tagging' output is not correct #13310

adityakadrekar16 · 2024-01-17T00:38:03Z

How to reproduce the behaviour

Hi, I tried the four English pipelines (en_core_web_sm, en_core_web_md, en_core_web_lg, en_core_web_trf) for POS tagging. I know that spaCy is case-sensitive, but why are words like 'Water', 'Wheat', 'Cereal', 'Dry', 'Information', 'Research', etc. tagged as proper nouns (NNP, PROPN)?
These words are in title case. Even lowercase words like oil, dry, nutritional, law, and express are tagged as NNP.

I thought the 'md' model might not be large enough to recognize them, but even the larger 'lg' and 'trf' models give poor results. Am I doing something wrong? Can you please help me?

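import spacy

# Load the transformer-based English pipeline
nlp_trf = spacy.load("en_core_web_trf")
# `text` is the input document (not included in this post)
doc = nlp_trf(text)
for token in doc:
    # Surface form, lemma, fine-grained tag, coarse-grained POS
    print(token.text, token.lemma_, token.tag_, token.pos_)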

Thanks,
Aditya

Your Environment

Operating System: macOS Ventura 13.6 (Mac M1 Pro)
Python Version Used: 3.10.13
spaCy Version Used: 3.6.1
Environment Information:

svlandeg · 2024-02-06T14:56:18Z

Maintainer

Hi! The accuracy of the tagger is not 100%, so you will definitely find cases that are incorrect. Nevertheless, our benchmarks put its accuracy at above 97% on the OntoNotes 5.0 corpus.

Are you giving it full grammatical sentences to tag? Can you give some examples of systematic errors of both the lowercase and uppercase words, where they are used and wrongly tagged in full sentences?
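
A minimal sketch of how such examples could be collected, assuming the standard spaCy token attributes (`tag_`, `pos_`, `sent`). The names `Row`, `tag_rows` and `suspicious_propn` are made up for illustration, and the "all-lowercase token tagged PROPN" filter is only a heuristic for finding candidates, not a definition of a tagging error:

```python
from typing import Iterable, List, NamedTuple


class Row(NamedTuple):
    """One token with its tags and the sentence it came from (hypothetical helper type)."""
    text: str
    tag: str
    pos: str
    sentence: str


def tag_rows(text: str, model: str = "en_core_web_trf") -> List[Row]:
    """Run a spaCy pipeline over `text` and return one Row per token."""
    import spacy  # imported lazily so the filter below works without spaCy installed

    nlp = spacy.load(model)
    doc = nlp(text)
    return [
        Row(token.text, token.tag_, token.pos_, token.sent.text.strip())
        for token in doc
    ]


def suspicious_propn(rows: Iterable[Row]) -> List[Row]:
    """Keep alphabetic, all-lowercase tokens that were tagged PROPN.

    Assumption: a lowercase token is rarely a proper noun in edited English
    text, so these rows are worth checking by hand in their sentence.
    """
    return [
        row for row in rows
        if row.pos == "PROPN" and row.text.isalpha() and row.text.islower()
    ]
```

Printing `row.sentence` for each row returned by `suspicious_propn(tag_rows(text))` gives the token together with its context, which is the kind of example asked for here.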

adityakadrekar16 · 2024-02-13T01:46:24Z

Author

Yes, I am giving it full grammatical sentences to tag. I cannot share the sentences because they are confidential business data. But I have given examples above, and here are a few more examples of incorrect tagging:
Wheat NNP PROPN
Water NNP PROPN
energy NNP PROPN
pea NNP PROPN
protein NNP PROPN
express NNP PROPN

This is the code that prints the output above:
doc = nlp(pdf_text)
for token in doc:
    print(token.text, token.tag_, token.pos_)

ivan-kleshnin · 2024-10-07T05:36:27Z

I've noticed the same issue.

Lawyer. Lecturer. Researcher. Student.

The only noun (according to spaCy) is "Student". The others are tagged as proper nouns. That's not correct imo.

In this variant:

Lawyer, lecturer, researcher, student.

The first word is still marked as PROPN, as if it were a surname or something.

So the capitalization affinity is way too strong in spaCy.
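
A rough way to check this on any input is to tag the same text twice, once as written and once with only its first character lowercased, and compare the POS of the first word. The sketch below assumes a tagger is any callable that returns `(token_text, pos)` pairs. `spacy_tagger`, `first_word_pos` and `compare_capitalization` are hypothetical helper names, not part of spaCy:

```python
from typing import Callable, List, Optional, Tuple

# A tagger maps a text to (token_text, coarse_pos) pairs.
Tagger = Callable[[str], List[Tuple[str, str]]]


def spacy_tagger(model: str = "en_core_web_sm") -> Tagger:
    """Wrap a spaCy pipeline as a Tagger (spaCy is imported lazily)."""
    import spacy

    nlp = spacy.load(model)
    return lambda text: [(token.text, token.pos_) for token in nlp(text)]


def first_word_pos(tagger: Tagger, text: str) -> Optional[str]:
    """Return the POS of the first non-whitespace token, or None if there is none."""
    for token_text, pos in tagger(text):
        if token_text.strip():
            return pos
    return None


def compare_capitalization(tagger: Tagger, text: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (POS of first word as written, POS with the first character lowercased)."""
    lowered = text[:1].lower() + text[1:]
    return first_word_pos(tagger, text), first_word_pos(tagger, lowered)
```

For example, `compare_capitalization(spacy_tagger(), "Lawyer, lecturer, researcher, student.")` returns the two tags side by side. If they differ, the capital letter alone changed the prediction for that word.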

ivan-kleshnin · 2024-10-07T09:06:13Z

Another case of improper POS tagging:

Undergraduate at UC Berkeley -- "Undergraduate" is NOUN, correct
Undergraduate studying Software Engineering at UC Berkeley -- "Undergraduate" is ADJ, should be NOUN as well
