Inaccurate German POS tags at beginning of sentence #2166

Closed
danthe96 opened this issue Mar 29, 2018 · 3 comments
Labels
feat / tagger Feature: Part-of-speech tagger lang / de German language data and models perf / accuracy Performance: accuracy

Comments

@danthe96

The German POS model shows a very strong tendency to tag the first word of a sentence as NOUN, even when the correct class seems obvious. There are some cases where I wouldn't expect the model to perform well, such as verbs that share a form with a noun, but more straightforward cases are misclassified as well. Anecdotally, of all sentences in our OpenSubtitles2018 dataset whose first word was tagged NOUN, roughly 60% are mislabeled. Examples and my environment are included below.

This issue may be related to how the original training set was capitalized, especially at the beginning of sentences, but I'm sure an active contributor would know more about that. Another factor could be that the German model is trained (in part?) on Wikipedia data, which usually lacks the kind of direct speech found in OpenSubtitles, e.g. second-person singular verb forms, interjections, etc.
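One rough way to probe the capitalization hypothesis is to tag each sentence twice, once as written and once with its first character lowercased, and compare the tag assigned to the first token. The sketch below is only an illustration: `spacy_tagger`, `lower_first`, and `first_tag_by_casing` are hypothetical helper names, not part of spaCy, and the tagger is passed in as a plain callable so the comparison logic can be checked without a model installed. Lowercasing changes the input (German nouns are capitalized), so a differing tag is a hint about sensitivity to casing, not proof of the cause.

```python
from typing import Callable, List, Optional, Tuple

Tagged = List[Tuple[str, str]]  # (coarse POS tag, token text) pairs


def spacy_tagger(model_name: str = "de") -> Callable[[str], Tagged]:
    """Wrap a spaCy pipeline as a text -> [(pos, text), ...] callable.

    Assumes spaCy and the named model are installed; 'de' is the model
    name used in this report.
    """
    import spacy  # imported lazily so the helpers below work without spaCy

    nlp = spacy.load(model_name)

    def tag(text: str) -> Tagged:
        return [(token.pos_, token.text) for token in nlp(text)]

    return tag


def lower_first(text: str) -> str:
    """Lowercase only the first character of the text."""
    return text[:1].lower() + text[1:]


def first_tag_by_casing(
    tag_fn: Callable[[str], Tagged], sentence: str
) -> Tuple[Optional[str], Optional[str]]:
    """Return (tag as written, tag with first character lowercased)."""

    def first_tag(text: str) -> Optional[str]:
        tagged = tag_fn(text)
        return tagged[0][0] if tagged else None

    return first_tag(sentence), first_tag(lower_first(sentence))
```

With a model installed, `first_tag_by_casing(spacy_tagger(), 'Fährt sonst noch jemand damit?')` would return the two first-token tags for that sentence; if they differ for many sentences, sentence-initial capitalization is likely influencing the prediction.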

Examples

All examples are from the OpenSubtitles2018 dataset. I can provide many more if requested. Note that the issue is not specific to any one class: there is a general bias toward NOUN regardless of the true class.

"Fährt sonst noch jemand damit?" - Fährt should be VERB.

NOUN	Fährt
ADV	sonst
ADV	noch
PRON	jemand
ADV	damit
PUNCT	?

"Schlauer Mord, schlaue Entsorgungsmethode." - Schlauer should be ADJ.

NOUN	Schlauer
NOUN	Mord
PUNCT	,
VERB	schlaue
NOUN	Entsorgungsmethode
PUNCT	.

"Jemanden, der nicht hierher gehört, ein Wesen einer höheren lntelligenzstufe als der unseren?" - Jemanden should be PRON.

NOUN	Jemanden
PUNCT	,
PRON	der
PART	nicht
ADV	hierher
VERB	gehört
PUNCT	,
DET	ein
NOUN	Wesen
DET	einer
ADJ	höheren
VERB	lntelligenzstufe
CONJ	als
DET	der
PRON	unseren
PUNCT	?

"Hey, Schätzchen, wo sind deine Federn hin?" - Hey should be INTJ (I believe?).

NOUN	Hey
PUNCT	,
NOUN	Schätzchen
PUNCT	,
ADV	wo
AUX	sind
DET	deine
NOUN	Federn
PART	hin
PUNCT	?

"Antek hat früher bei einem alten guten Anwalt gearbeitet." - Antek should be PROPN.

NOUN	Antek
AUX	hat
ADJ	früher
ADP	bei
DET	einem
ADJ	alten
ADJ	guten
NOUN	Anwalt
VERB	gearbeitet
PUNCT	.

Some code to copy-paste if you would like to quickly reproduce the issue:

import spacy

# 'de' is the name under which the German model is installed here (spaCy 2.0.x)
nlp = spacy.load('de')

examples = [
    'Fährt sonst noch jemand damit?',
    'Schlauer Mord, schlaue Entsorgungsmethode.',
    'Jemanden, der nicht hierher gehört, ein Wesen einer höheren lntelligenzstufe als der unseren?',
    'Hey, Schätzchen, wo sind deine Federn hin?',
    'Antek hat früher bei einem alten guten Anwalt gearbeitet.'
]

# Print one "POS<TAB>token" line per token, with a blank line after each sentence
for example in examples:
    doc = nlp(example)
    for token in doc:
        print(f'{token.pos_}\t{token.text}')
    print()
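
To summarize results like these without reading the output by eye, the first-token predictions can be tallied against the expected tags. The sketch below hard-codes the five first tokens from this report together with the tags I expect and the tags listed above; the names `REPORTED`, `confusion`, and `error_rate` are hypothetical helpers, not spaCy functions, and INTJ for "Hey" is my own guess at the right tag.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Row = Tuple[str, str, str]  # (first token, expected tag, predicted tag)

# Values copied from the examples in this report (spaCy 2.0.7, 'de' model).
# The expected tag for "Hey" is the reporter's best guess, not a gold label.
REPORTED: List[Row] = [
    ("Fährt", "VERB", "NOUN"),
    ("Schlauer", "ADJ", "NOUN"),
    ("Jemanden", "PRON", "NOUN"),
    ("Hey", "INTJ", "NOUN"),
    ("Antek", "PROPN", "NOUN"),
]


def confusion(rows: Iterable[Row]) -> Counter:
    """Count (expected, predicted) pairs for mislabeled first tokens only."""
    return Counter(
        (expected, predicted)
        for _, expected, predicted in rows
        if expected != predicted
    )


def error_rate(rows: Iterable[Row]) -> float:
    """Fraction of rows whose predicted tag differs from the expected tag."""
    rows = list(rows)
    if not rows:
        return 0.0
    wrong = sum(1 for _, expected, predicted in rows if expected != predicted)
    return wrong / len(rows)


if __name__ == "__main__":
    for (expected, predicted), count in sorted(confusion(REPORTED).items()):
        print(f"{expected} -> {predicted}: {count}")
    print(f"first-token error rate: {error_rate(REPORTED):.0%}")
```

The same two functions could be fed rows built from a larger hand-labeled sample to check whether the roughly 60% figure mentioned above holds up.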

Your Environment

  • spaCy version: 2.0.7
  • Platform: Darwin-16.7.0-x86_64-i386-64bit
  • Python version: 3.6.3
  • Models: de, en, en_core_web_sm
@honnibal honnibal added performance lang / de German language data and models feat / tagger Feature: Part-of-speech tagger labels Mar 30, 2018
@honnibal
Member

Thanks, will look into this!

@ines
Member

ines commented Dec 14, 2018

Merging this with #3052. We've now added a master thread for incorrect predictions and related reports – see the issue for more details.

@ines ines closed this as completed Dec 14, 2018
@lock

lock bot commented Jan 13, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Jan 13, 2019