Apostrophes: It's Jess' n' Sam's car. #12468

ryanheise · 2023-03-27T07:18:07Z

How to reproduce the behaviour

import spacy

nlp = spacy.load('en_core_web_lg')
for t in list(nlp("It's Jess' n' Sam's car. No, it's just Jess'.")):
    print(f'{t.text:8} {t.lemma_:8} {t.pos_:8} {t.tag_:8}')

Output:

It       it       PRON     PRP
's       be       AUX      VBZ       <--- correct
Jess     Jess     PROPN    NNP
'        '        PART     POS       <--- correct
n        n        CCONJ    CC
'        '        PUNCT    ''        <--- incorrect
Sam      Sam      PROPN    NNP
's       's       PART     POS       <--- correct
car      car      NOUN     NN
.        .        PUNCT    .
No       no       INTJ     UH
,        ,        PUNCT    ,
it       it       PRON     PRP
's       be       AUX      VBZ       <--- correct
just     just     ADV      RB
Jess     Jess     PROPN    NNP
'        '        PUNCT    ''        <--- incorrect
.        .        PUNCT    .

For context, in post-processing I want to merge contractions and possessives into a single word. The incorrect annotations above are indistinguishable from the case where ' is used merely as a single quotation mark:

for t in list(nlp("I like 'apples'.")):
    print(f'{t.text:8} {t.lemma_:8} {t.pos_:8} {t.tag_:8}')

Output:

I        I        PRON     PRP
like     like     VERB     VBP
'        '        PUNCT    ``
apples   apple    NOUN     NNS
'        '        PUNCT    ''        <--- indistinguishable
.        .        PUNCT    .

So I can't distinguish the apostrophes that I want to merge from the single quotation marks that I don't want to merge.

Given my goal, it probably makes more sense to have a new type of annotation that tells you whether tokens are part of the same word, since some languages may have multi-token words that are not separated by apostrophes.
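To make the post-processing goal concrete, here is a minimal sketch of the kind of merge described above. It assumes a token only needs `text`, `tag_` and `whitespace_` attributes (spaCy tokens have these); the `Tok` class, the `CLITIC_TAGS` set and the `merge_clitics` helper are hypothetical names made up for illustration, and the sample tags are copied from the en_core_web_lg output above rather than produced by running a pipeline.

```python
from typing import NamedTuple


class Tok(NamedTuple):
    """Hypothetical stand-in for a spaCy token."""
    text: str
    tag_: str
    whitespace_: str  # trailing whitespace, "" if the next token is attached


# Assumption: tags that mark a clitic belonging to the preceding word.
# POS is the possessive marker; the verb/modal/adverb tags cover
# contractions such as 's (is), 'll (will) and n't (not).
CLITIC_TAGS = {"POS", "VBZ", "VBP", "VBD", "VB", "MD", "RB"}


def merge_clitics(tokens):
    """Return word strings, gluing clitics onto the word before them."""
    words = []
    attached = False  # True when the previous token had no trailing space
    for tok in tokens:
        looks_like_clitic = tok.text.startswith("'") or tok.text == "n't"
        if words and attached and looks_like_clitic and tok.tag_ in CLITIC_TAGS:
            words[-1] += tok.text
        else:
            words.append(tok.text)
        attached = tok.whitespace_ == ""
    return words


# Tags taken from the en_core_web_lg output for the first sentence.
sample = [
    Tok("It", "PRP", ""), Tok("'s", "VBZ", " "),
    Tok("Jess", "NNP", ""), Tok("'", "POS", " "),
    Tok("n", "CC", ""), Tok("'", "''", " "),
    Tok("Sam", "NNP", ""), Tok("'s", "POS", " "),
    Tok("car", "NN", ""), Tok(".", ".", ""),
]
words = merge_clitics(sample)
print(words)
```

With these tags the apostrophe after `n` stays a separate token, because its `''` tag looks exactly like a closing quotation mark. That is the case the merge cannot get right from the tags alone.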

Your Environment

spaCy version: 3.5.0
Platform: Linux-6.2.6-arch1-1-x86_64-with-glibc2.37
Python version: 3.10.10
Pipelines: en_core_web_lg (3.5.0)


adrianeboyd · 2023-04-20T06:36:10Z

For abbreviations like n' it might be better to have a rule-based exception. For the possessive vs. quote cases, this distinction is present in the annotation scheme for the training corpus, but it's likely that this is ambiguous enough and training examples are rare enough that the trained pipelines like en_core_web_lg are going to make a fair number of mistakes.
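As a rough illustration of what a rule-based exception for abbreviations like n' could look like, the sketch below rejoins a bare apostrophe with the preceding piece when the result is in a small exception list. This is only a post-processing sketch in plain Python: the `APOSTROPHE_ABBREVIATIONS` list and the `rejoin_abbreviations` helper are hypothetical, and inside spaCy itself the tokenizer exceptions or the attribute ruler may well be the more natural place for such a rule.

```python
# Hypothetical exception list: abbreviations that end in an apostrophe.
APOSTROPHE_ABBREVIATIONS = {"n'", "ol'", "lil'"}


def rejoin_abbreviations(pieces):
    """Rejoin a bare apostrophe with the piece before it when the
    combination is a known abbreviation.

    pieces: list of (text, trailing_whitespace) pairs in document order.
    """
    out = []
    for text, space in pieces:
        if out:
            prev_text, prev_space = out[-1]
            candidate = prev_text + text
            if (
                text == "'"
                and prev_space == ""  # apostrophe directly attached
                and candidate.lower() in APOSTROPHE_ABBREVIATIONS
            ):
                # Replace the previous piece with the joined abbreviation.
                out[-1] = (candidate, space)
                continue
        out.append((text, space))
    return out


pieces = [
    ("Jess", ""), ("'", " "), ("n", ""), ("'", " "),
    ("Sam", ""), ("'s", " "), ("car", ""), (".", ""),
]
rejoined = [text for text, _ in rejoin_abbreviations(pieces)]
print(rejoined)
```

Because the lookup only fires for listed abbreviations, the apostrophe after `Jess` is left alone and still has to be resolved as possessive or quote by other means.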

Aside from the general recommendation that you can improve the performance by training or fine-tuning a model with more of these kinds of examples (#3052), I'd recommend looking at the dependency parse along with the POS tags to distinguish these cases and consider using en_core_web_trf, which at least for these cases seems to perform a bit better. Obviously you'd need to evaluate this carefully for your data.

For example:

en_core_web_lg

It       it       PRON     PRP      nsubj   
's       be       AUX      VBZ      ROOT    
Jess     Jess     PROPN    NNP      attr    
'        '        PART     POS      case    
n        n        CCONJ    CC       cc      
'        '        PUNCT    ''       punct   
Sam      Sam      PROPN    NNP      poss    
's       's       PART     POS      case    
car      car      NOUN     NN       attr    
.        .        PUNCT    .        punct   
No       no       INTJ     UH       intj    
,        ,        PUNCT    ,        punct   
it       it       PRON     PRP      nsubj   
's       be       AUX      VBZ      ROOT    
just     just     ADV      RB       advmod  
Jess     Jess     PROPN    NNP      attr    
'        '        PUNCT    ''       punct   
.        .        PUNCT    .        punct   
I        I        PRON     PRP      nsubj   
like     like     VERB     VBP      ROOT    
'        '        PUNCT    ``       punct   
apples   apple    NOUN     NNS      dobj    
'        '        PUNCT    ''       punct   
.        .        PUNCT    .        punct   

en_core_web_trf

It       it       PRON     PRP      nsubj   
's       be       AUX      VBZ      ROOT    
Jess     Jess     PROPN    NNP      poss    
'        '        PART     POS      case    
n        n        CCONJ    CC       cc      
'        '        CCONJ    CC       cc      
Sam      Sam      PROPN    NNP      conj    
's       's       PART     POS      case    
car      car      NOUN     NN       attr    
.        .        PUNCT    .        punct   
No       no       INTJ     UH       intj    
,        ,        PUNCT    ,        punct   
it       it       PRON     PRP      nsubj   
's       be       AUX      VBZ      ccomp   
just     just     ADV      RB       advmod  
Jess     Jess     PROPN    NNP      attr    
'        '        PART     POS      case    
.        .        PART     POS      case    
I        I        PRON     PRP      nsubj   
like     like     VERB     VBP      ROOT    
'        '        PUNCT    ``       punct   
apples   apple    NOUN     NNS      dobj    
'        '        PUNCT    .        punct   
.        .        PUNCT    .        punct
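One possible way to combine the tag and dependency columns shown above is a small classifier like the one below. It is a sketch under assumptions, not a tested recipe: `Tok` is a hypothetical stand-in exposing the same `text`, `tag_` and `dep_` attributes as a spaCy token, `classify_apostrophe` is a made-up helper, and the sample rows are copied from the en_core_web_trf output rather than produced by running the pipeline.

```python
from typing import NamedTuple


class Tok(NamedTuple):
    """Hypothetical stand-in for a spaCy token."""
    text: str
    tag_: str
    dep_: str


def classify_apostrophe(tok):
    """Label a token containing an apostrophe using tag and dependency."""
    if "'" not in tok.text:
        return "other"
    # Either signal alone is accepted as evidence of a possessive marker.
    if tok.tag_ == "POS" or tok.dep_ == "case":
        return "possessive"
    # Opening (``) and closing ('') quote tags attached as punctuation.
    if tok.tag_ in {"``", "''"} and tok.dep_ == "punct":
        return "quote"
    # Anything else is left for manual review or further rules.
    return "unclear"


# The five bare apostrophe rows from the en_core_web_trf output, in order.
trf_rows = [
    Tok("'", "POS", "case"),   # after Jess (first sentence)
    Tok("'", "CC", "cc"),      # after n
    Tok("'", "POS", "case"),   # after Jess (second sentence)
    Tok("'", "``", "punct"),   # before apples
    Tok("'", ".", "punct"),    # after apples
]
labels = [classify_apostrophe(t) for t in trf_rows]
print(labels)
```

The "unclear" bucket is deliberate: rows such as the `CC`/`cc` apostrophe after `n` or the closing quote tagged `.` do not fit either pattern, which is why the results need to be evaluated on your own data.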

adrianeboyd · 2023-04-20T06:36:25Z

Let me move this to the discussion board...

shadeMe added the labels feat / tagger (Feature: Part-of-speech tagger) and lang / en (English language data and models) Apr 3, 2023

explosion locked and limited conversation to collaborators Apr 20, 2023

adrianeboyd converted this issue into discussion #12552 Apr 20, 2023
