import spacy

nlp = spacy.load('en_core_web_lg')
for t in nlp("It's Jess' n' Sam's car. No, it's just Jess'."):
    print(f'{t.text:8}{t.lemma_:8}{t.pos_:8}{t.tag_:8}')
Output:
It      it      PRON    PRP
's      be      AUX     VBZ     <--- correct
Jess    Jess    PROPN   NNP
'       '       PART    POS     <--- correct
n       n       CCONJ   CC
'       '       PUNCT   ''      <--- incorrect
Sam     Sam     PROPN   NNP
's      's      PART    POS     <--- correct
car     car     NOUN    NN
.       .       PUNCT   .
No      no      INTJ    UH
,       ,       PUNCT   ,
it      it      PRON    PRP
's      be      AUX     VBZ     <--- correct
just    just    ADV     RB
Jess    Jess    PROPN   NNP
'       '       PUNCT   ''      <--- incorrect
.       .       PUNCT   .
For context, in post-processing I want to merge contractions and possessives into single words. The incorrect annotations above are indistinguishable from the case where ' is used merely as a single quotation mark:
for t in nlp("I like 'apples'."):
    print(f'{t.text:8}{t.lemma_:8}{t.pos_:8}{t.tag_:8}')
Output:
I       I       PRON    PRP
like    like    VERB    VBP
'       '       PUNCT   ``
apples  apple   NOUN    NNS
'       '       PUNCT   ''      <--- indistinguishable
.       .       PUNCT   .
So I can't distinguish the apostrophes that I want to merge from the single quotation marks that I don't want to merge.
Given my goal, it probably makes more sense to have a new type of annotation that tells you whether tokens are part of the same word, since some languages may have multi-token words that are not separated by apostrophes.
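To make the merging goal concrete, here is a minimal sketch of the post-processing step, assuming the input is a list of (text, tag) pairs like the columns printed above. The helper names `attaches_to_previous` and `merge_words` are hypothetical, not part of spaCy, and the clitic rule is a simplification. It shows why the tag matters: a bare ' is only merged when it is tagged POS.

```python
from typing import List, Tuple

# Each row is (text, tag), mirroring the text and tag_ columns printed above.
Row = Tuple[str, str]


def attaches_to_previous(text: str, tag: str) -> bool:
    """Return True if this token should be glued onto the previous word."""
    if text == "'":
        # A bare apostrophe is only merged when it is tagged as a possessive.
        return tag == "POS"
    # Simplifying assumption: clitics such as 's, 're, 'll and n't
    # belong to the previous word.
    return text.startswith("'") or text == "n't"


def merge_words(rows: List[Row]) -> List[str]:
    """Merge contraction and possessive tokens into the preceding word."""
    words: List[str] = []
    for text, tag in rows:
        if words and attaches_to_previous(text, tag):
            words[-1] += text
        else:
            words.append(text)
    return words


# Rows copied from the first sentence of the en_core_web_lg output.
rows = [("It", "PRP"), ("'s", "VBZ"), ("Jess", "NNP"), ("'", "POS"),
        ("n", "CC"), ("'", "''"), ("Sam", "NNP"), ("'s", "POS"),
        ("car", "NN"), (".", ".")]
print(merge_words(rows))
# ["It's", "Jess'", 'n', "'", "Sam's", 'car', '.']
```

With the tags as predicted, "Jess'" is merged correctly, but the apostrophe after "n" is left as a separate token because it was tagged as a closing quote.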
For abbreviations like n' it might be better to have a rule-based exception. For the possessive vs. quote cases, the distinction is present in the annotation scheme for the training corpus, but the cases are likely ambiguous enough, and training examples rare enough, that trained pipelines like en_core_web_lg are going to make a fair number of mistakes.
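One possible shape for such a rule, sketched as a post-hoc check rather than a change to the pipeline: flag any ' that directly follows a standalone "n" with no whitespace in between. The function name `abbreviation_apostrophes` is hypothetical, and the stub objects only imitate the `text` and `whitespace_` attributes of spaCy tokens so the example runs without a model. spaCy's tokenizer also supports special cases through `add_special_case`, which may be a cleaner place for this kind of exception.

```python
from types import SimpleNamespace
from typing import List, Sequence


def abbreviation_apostrophes(tokens: Sequence) -> List[int]:
    """Return indices of apostrophes that close the abbreviation n'.

    Each token only needs `text` and `whitespace_` attributes.
    """
    hits = []
    for i in range(1, len(tokens)):
        prev, tok = tokens[i - 1], tokens[i]
        # The apostrophe must directly follow a standalone "n".
        if tok.text == "'" and prev.text.lower() == "n" and prev.whitespace_ == "":
            hits.append(i)
    return hits


def stub(text: str, whitespace: str = " ") -> SimpleNamespace:
    """Stand-in for a spaCy token (an assumption for this sketch)."""
    return SimpleNamespace(text=text, whitespace_=whitespace)


# Tokens for "Jess' n' Sam", as split in the output above.
tokens = [stub("Jess", ""), stub("'"), stub("n", ""), stub("'"), stub("Sam", "")]
print(abbreviation_apostrophes(tokens))
# [3]
```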
The general recommendation is that you can improve performance by training or fine-tuning a model with more of these kinds of examples (#3052). Beyond that, I'd recommend looking at the dependency parse along with the POS tags to distinguish these cases, and considering en_core_web_trf, which seems to perform a bit better, at least on these examples. Obviously you'd need to evaluate this carefully on your own data.
For example:
en_core_web_lg
It      it      PRON    PRP     nsubj
's      be      AUX     VBZ     ROOT
Jess    Jess    PROPN   NNP     attr
'       '       PART    POS     case
n       n       CCONJ   CC      cc
'       '       PUNCT   ''      punct
Sam     Sam     PROPN   NNP     poss
's      's      PART    POS     case
car     car     NOUN    NN      attr
.       .       PUNCT   .       punct
No      no      INTJ    UH      intj
,       ,       PUNCT   ,       punct
it      it      PRON    PRP     nsubj
's      be      AUX     VBZ     ROOT
just    just    ADV     RB      advmod
Jess    Jess    PROPN   NNP     attr
'       '       PUNCT   ''      punct
.       .       PUNCT   .       punct

I       I       PRON    PRP     nsubj
like    like    VERB    VBP     ROOT
'       '       PUNCT   ``      punct
apples  apple   NOUN    NNS     dobj
'       '       PUNCT   ''      punct
.       .       PUNCT   .       punct
en_core_web_trf
It      it      PRON    PRP     nsubj
's      be      AUX     VBZ     ROOT
Jess    Jess    PROPN   NNP     poss
'       '       PART    POS     case
n       n       CCONJ   CC      cc
'       '       CCONJ   CC      cc
Sam     Sam     PROPN   NNP     conj
's      's      PART    POS     case
car     car     NOUN    NN      attr
.       .       PUNCT   .       punct
No      no      INTJ    UH      intj
,       ,       PUNCT   ,       punct
it      it      PRON    PRP     nsubj
's      be      AUX     VBZ     ccomp
just    just    ADV     RB      advmod
Jess    Jess    PROPN   NNP     attr
'       '       PART    POS     case
.       .       PART    POS     case

I       I       PRON    PRP     nsubj
like    like    VERB    VBP     ROOT
'       '       PUNCT   ``      punct
apples  apple   NOUN    NNS     dobj
'       '       PUNCT   .       punct
.       .       PUNCT   .       punct
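Combining the tag and dependency columns as suggested could look roughly like the sketch below. The function `apostrophe_role` and its three labels are hypothetical, and the stub objects stand in for spaCy tokens by exposing only `text`, `tag_` and `dep_`. The rule treats POS or case as evidence for a possessive, a quote tag with a punct dependency as a quotation mark, and anything else as unclear, so it is only as good as the model's predictions.

```python
from types import SimpleNamespace

# Penn Treebank tags for opening and closing quotation marks.
QUOTE_TAGS = {"``", "''"}


def apostrophe_role(token) -> str:
    """Classify a ' token from its fine-grained tag and dependency label."""
    if token.text != "'":
        return "not-apostrophe"
    if token.tag_ == "POS" or token.dep_ == "case":
        return "possessive"
    if token.tag_ in QUOTE_TAGS and token.dep_ == "punct":
        return "quote"
    # Anything else (e.g. tagged CC or .) needs a closer look.
    return "unclear"


def stub(text: str, tag: str, dep: str) -> SimpleNamespace:
    """Stand-in for a spaCy token (an assumption for this sketch)."""
    return SimpleNamespace(text=text, tag_=tag, dep_=dep)


# Apostrophe rows copied from the en_core_web_trf output above.
for tok in [stub("'", "POS", "case"), stub("'", "CC", "cc"),
            stub("'", "``", "punct"), stub("'", ".", "punct")]:
    print(tok.tag_, tok.dep_, apostrophe_role(tok))
```

On the en_core_web_lg rows, the same function would label the apostrophe after the final "Jess" as a quote, since that model tags it '' with a punct dependency.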