Inaccurate German POS tags at beginning of sentence #2166

danthe96 · 2018-03-29T21:26:21Z

The German POS model has a very strong tendency to classify the first word of a sentence as NOUN, even when the correct class seems obvious. There are some cases where I wouldn't expect the model to perform well, such as when a verb has the same form as a noun, but the more straightforward ones are misclassified as well. Anecdotally, out of all sentences in our OpenSubtitles2018 dataset where the first word has been tagged as NOUN, roughly 60% are mislabeled. I've included examples and my environment below.

This issue may be related to how the original training set was capitalized, especially at the beginning of sentences, but I'm sure an active contributor would know more about that. Another factor could be that the German model is trained (in part?) on Wikipedia data, which usually doesn't contain the kind of direct speech you see in OpenSubtitles, e.g. second-person singular verb forms, interjections, etc.
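One way to probe the capitalization hypothesis would be to tag each sentence twice, once as written and once with its first character lowercased, and compare the tag assigned to the first token. The sketch below is only an illustration under assumptions: `first_token_tags` and `tag_fn` are hypothetical names, and `tag_fn` stands for any callable that maps a sentence to a list of `(pos, text)` pairs, for example a thin wrapper around `nlp`.

```python
def first_token_tags(tag_fn, sentences):
    """Return (sentence, tag_as_written, tag_with_lowercased_first_char) triples.

    tag_fn is a hypothetical callable: str -> list of (pos, text) pairs.
    """
    results = []
    for sentence in sentences:
        if not sentence:
            continue  # nothing to tag
        # Lowercase only the first character; the rest stays untouched.
        lowered = sentence[0].lower() + sentence[1:]
        original_tags = tag_fn(sentence)
        lowered_tags = tag_fn(lowered)
        if not original_tags or not lowered_tags:
            continue  # the tagger returned no tokens
        results.append((sentence, original_tags[0][0], lowered_tags[0][0]))
    return results


# Assumed usage with spaCy (needs the German model installed):
#   import spacy
#   nlp = spacy.load('de')
#   tag_fn = lambda text: [(t.pos_, t.text) for t in nlp(text)]
#   for row in first_token_tags(tag_fn, ['Fährt sonst noch jemand damit?']):
#       print(row)
```

If the first-token tag changes for non-nouns once the capital letter is removed, that would point toward capitalization being a strong cue for NOUN. This is a diagnostic only, not a fix, and lowercasing a real noun produces nonstandard German, so results for nouns should be read with care.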

Examples

All examples are from the OpenSubtitles2018 dataset. I can provide many more if requested. Note that the issue is not specific to any one class: there is a general bias toward NOUN regardless of the true class.

"Fährt sonst noch jemand damit?" - Fährt should be VERB.

NOUN	Fährt
ADV	sonst
ADV	noch
PRON	jemand
ADV	damit
PUNCT	?

"Schlauer Mord, schlaue Entsorgungsmethode." - Schlauer should be ADJ.

NOUN	Schlauer
NOUN	Mord
PUNCT	,
VERB	schlaue
NOUN	Entsorgungsmethode
PUNCT	.

"Jemanden, der nicht hierher gehört, ein Wesen einer höheren lntelligenzstufe als der unseren?" - Jemanden should be PRON.

NOUN	Jemanden
PUNCT	,
PRON	der
PART	nicht
ADV	hierher
VERB	gehört
PUNCT	,
DET	ein
NOUN	Wesen
DET	einer
ADJ	höheren
VERB	lntelligenzstufe
CONJ	als
DET	der
PRON	unseren
PUNCT	?

"Hey, Schätzchen, wo sind deine Federn hin?" - Hey should be INTJ (I believe?).

NOUN	Hey
PUNCT	,
NOUN	Schätzchen
PUNCT	,
ADV	wo
AUX	sind
DET	deine
NOUN	Federn
PART	hin
PUNCT	?

"Antek hat früher bei einem alten guten Anwalt gearbeitet.." - Antek should be PROPN.

NOUN	Antek
AUX	hat
ADJ	früher
ADP	bei
DET	einem
ADJ	alten
ADJ	guten
NOUN	Anwalt
VERB	gearbeitet
PUNCT	.

Some code to copy-paste if you would like to quickly reproduce the issue:

import spacy

# Load the German model via its shortcut name
nlp = spacy.load('de')

examples = [
    'Fährt sonst noch jemand damit?',
    'Schlauer Mord, schlaue Entsorgungsmethode.',
    'Jemanden, der nicht hierher gehört, ein Wesen einer höheren lntelligenzstufe als der unseren?',
    'Hey, Schätzchen, wo sind deine Federn hin?',
    'Antek hat früher bei einem alten guten Anwalt gearbeitet.'
]

# Print one "POS<TAB>token" line per token, with a blank line between sentences
for example in examples:
    doc = nlp(example)
    for token in doc:
        print(f'{token.pos_}\t{token.text}')
    print()
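To make the bias easier to see at a glance, the first-token results can be tallied. The sketch below does not call spaCy; it hard-codes the predicted tags from the output listed in the examples together with the tags suggested there as correct (the INTJ for "Hey" is the tentative one). The names `first_tokens` and `summarize` are made up for this illustration.

```python
from collections import Counter

# (first token, predicted tag, expected tag), copied from the examples above
first_tokens = [
    ('Fährt', 'NOUN', 'VERB'),
    ('Schlauer', 'NOUN', 'ADJ'),
    ('Jemanden', 'NOUN', 'PRON'),
    ('Hey', 'NOUN', 'INTJ'),
    ('Antek', 'NOUN', 'PROPN'),
]


def summarize(rows):
    """Count predicted tags and collect the rows where prediction != expectation."""
    predicted = Counter(pred for _, pred, _ in rows)
    wrong = [(tok, pred, exp) for tok, pred, exp in rows if pred != exp]
    return predicted, wrong


predicted, wrong = summarize(first_tokens)
print(predicted)
print(f'{len(wrong)} of {len(first_tokens)} first tokens mislabeled')
```

The same `summarize` helper could be fed with rows collected from a larger sample, provided expected tags are available for it.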

Your Environment

spaCy version: 2.0.7
Platform: Darwin-16.7.0-x86_64-i386-64bit
Python version: 3.6.3
Models: de, en, en_core_web_sm


honnibal · 2018-03-30T09:08:33Z

Thanks, will look into this!

ines · 2018-12-14T11:26:41Z

Merging this with #3052. We've now added a master thread for incorrect predictions and related reports – see the issue for more details.

lock · 2019-01-13T16:59:10Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

honnibal added the performance, lang / de (German language data and models), and feat / tagger (Feature: Part-of-speech tagger) labels Mar 30, 2018

karelin mentioned this issue Jun 19, 2018: Incorrect lemmatization for German at beginning of sentence #2465 (Closed)

ines added the perf / accuracy (Performance: accuracy) label and removed the performance label Aug 15, 2018

ines closed this as completed Dec 14, 2018

lock bot locked as resolved and limited conversation to collaborators Jan 13, 2019
