Inaccurate German POS tags at beginning of sentence #2166
Labels:
feat / tagger (Feature: Part-of-speech tagger)
lang / de (German language data and models)
perf / accuracy (Performance: accuracy)
The German POS model seems to have an extremely high tendency to classify the first word of a sentence as `NOUN`, even when the correct class seems obvious. There are some cases where I wouldn't expect the model to perform well, such as when a verb has the same form as a noun, but the more straightforward ones are misclassified as well. Anecdotally, out of all sentences in our OpenSubtitles2018 dataset where the first word has been tagged as `NOUN`, roughly 60% are mislabeled. I've included examples and my environment below.

This issue may be related to how the original training set was capitalized, especially at the beginning of a sentence, but I'm sure an active contributor would know more about that. Another factor could be that the German model is trained (in part?) on Wikipedia data, which usually doesn't contain the kind of direct speech you see in OpenSubtitles, e.g. second-person singular verb forms, interjections, etc.
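One rough way to probe the capitalization hypothesis is to tag each sentence twice, once as written and once with its first character lowercased, and compare the tag of the first token. The sketch below is only an illustration: `capitalization_probe` and `demo` are hypothetical helper names, and the model name `de_core_news_sm` is an assumption, not something stated in this report. Lowercasing is only informative for words that are not nouns, since German nouns are capitalized everywhere.

```python
from typing import Callable, Tuple

# Any callable that maps a sentence to the coarse POS tag of its first token.
TagFirst = Callable[[str], str]


def capitalization_probe(tag_first: TagFirst, sentence: str) -> Tuple[str, str]:
    """Return (tag as written, tag with the first character lowercased)."""
    lowered = sentence[:1].lower() + sentence[1:]
    return tag_first(sentence), tag_first(lowered)


def demo(model_name: str = "de_core_news_sm") -> None:
    """Run the probe with spaCy; the model name is an assumption."""
    import spacy  # imported lazily so the helper above has no hard dependency

    nlp = spacy.load(model_name)
    as_written, lowered = capitalization_probe(
        lambda text: nlp(text)[0].pos_,
        "Fährt sonst noch jemand damit?",
    )
    print("as written:", as_written, "| lowercased:", lowered)
```

Calling `demo()` requires spaCy and a German model to be installed. If the two tags differ for clearly non-nominal words, that would be consistent with a capitalization effect, though it would not prove one.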
Examples
All examples are from the OpenSubtitles2018 dataset. I can provide many more if requested. Note that the issue is not specific to any one class: there is a general bias toward `NOUN` regardless of the true class.

- "Fährt sonst noch jemand damit?" - `Fährt` should be `VERB`.
- "Schlauer Mord, schlaue Entsorgungsmethode." - `Schlauer` should be `ADJ`.
- "Jemanden, der nicht hierher gehört, ein Wesen einer höheren lntelligenzstufe als der unseren?" - `Jemanden` should be `PRON`.
- "Hey, Schätzchen, wo sind deine Federn hin?" - `Hey` should be `INTJ` (I believe?).
- "Antek hat früher bei einem alten guten Anwalt gearbeitet.." - `Antek` should be `PROPN`.

Some code to copy-paste if you would like to quickly reproduce the issue:
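A minimal sketch of such a check, using the five sentences and expected tags listed above. The helper names (`mismatched_first_tokens`, `demo`) are hypothetical and the model name `de_core_news_sm` is an assumption; substitute whichever German model you have installed.

```python
from typing import Callable, List, Tuple

# (sentence, expected coarse POS tag of the first token), copied from the examples above
EXAMPLES: List[Tuple[str, str]] = [
    ("Fährt sonst noch jemand damit?", "VERB"),
    ("Schlauer Mord, schlaue Entsorgungsmethode.", "ADJ"),
    ("Jemanden, der nicht hierher gehört, ein Wesen einer höheren "
     "lntelligenzstufe als der unseren?", "PRON"),
    ("Hey, Schätzchen, wo sind deine Federn hin?", "INTJ"),
    ("Antek hat früher bei einem alten guten Anwalt gearbeitet..", "PROPN"),
]


def mismatched_first_tokens(
    tag_first: Callable[[str], str],
    examples: List[Tuple[str, str]],
) -> List[Tuple[str, str, str]]:
    """Return (sentence, predicted, expected) for every first-token mismatch."""
    mismatches = []
    for sentence, expected in examples:
        predicted = tag_first(sentence)
        if predicted != expected:
            mismatches.append((sentence, predicted, expected))
    return mismatches


def demo(model_name: str = "de_core_news_sm") -> None:
    """Print the mismatches found by a spaCy model; the model name is an assumption."""
    import spacy  # imported lazily so the helper above has no hard dependency

    nlp = spacy.load(model_name)
    for sentence, predicted, expected in mismatched_first_tokens(
        lambda text: nlp(text)[0].pos_, EXAMPLES
    ):
        print(f"{sentence!r}: got {predicted}, expected {expected}")
```

Calling `demo()` needs spaCy and a German model installed; results will depend on the spaCy and model versions in use.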
Your Environment