This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Apostrophes: It's Jess' n' Sam's car. #12468

Closed
ryanheise opened this issue Mar 27, 2023 · 2 comments
Labels
feat / tagger Feature: Part-of-speech tagger lang / en English language data and models

Comments

@ryanheise

How to reproduce the behaviour

import spacy

nlp = spacy.load('en_core_web_lg')
for t in nlp("It's Jess' n' Sam's car. No, it's just Jess'."):
    print(f'{t.text:8} {t.lemma_:8} {t.pos_:8} {t.tag_:8}')

Output:

It       it       PRON     PRP
's       be       AUX      VBZ       <--- correct
Jess     Jess     PROPN    NNP
'        '        PART     POS       <--- correct
n        n        CCONJ    CC
'        '        PUNCT    ''        <--- incorrect
Sam      Sam      PROPN    NNP
's       's       PART     POS       <--- correct
car      car      NOUN     NN
.        .        PUNCT    .
No       no       INTJ     UH
,        ,        PUNCT    ,
it       it       PRON     PRP
's       be       AUX      VBZ       <--- correct
just     just     ADV      RB
Jess     Jess     PROPN    NNP
'        '        PUNCT    ''        <--- incorrect
.        .        PUNCT    .

For context, in post-processing I want to merge contractions and possessives into a single word. The incorrect annotations above are indistinguishable from the case where ' is used merely as a single quotation mark:

for t in nlp("I like 'apples'."):
    print(f'{t.text:8} {t.lemma_:8} {t.pos_:8} {t.tag_:8}')

Output:

I        I        PRON     PRP
like     like     VERB     VBP
'        '        PUNCT    ``
apples   apple    NOUN     NNS
'        '        PUNCT    ''        <--- indistinguishable
.        .        PUNCT    .

So I can't distinguish the apostrophes that I want to merge from the single quotation marks that I don't want to merge.

Given my goal, it probably makes more sense to have a new type of annotation that tells you whether tokens are part of the same word, since some languages may have multi-token words that are not separated by apostrophes.
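To make the goal concrete, here is a minimal sketch of the kind of merge step I have in mind. It does not call spaCy; `Tok` is a hypothetical stand-in for the fields I would read from each token (`t.text`, `t.tag_`, and whether `t.whitespace_` is non-empty), and the clitic rule is only my assumption about what should attach to the previous token.

```python
from typing import NamedTuple


class Tok(NamedTuple):
    """Hypothetical stand-in for the spaCy token fields used here."""
    text: str          # t.text
    tag: str           # t.tag_
    space_after: bool  # bool(t.whitespace_)


def is_clitic(tok: Tok) -> bool:
    """True for tokens that should attach to the preceding token."""
    if tok.tag == "POS":            # possessive marker: 's or a bare '
        return True
    if tok.text.lower() == "n't":   # negation contraction
        return True
    # Contractions such as 's, 're, 'll; a bare ' is excluded on purpose,
    # because without a POS tag it cannot be told apart from a quote.
    return tok.text.startswith("'") and len(tok.text) > 1


def merge_clitics(tokens: list[Tok]) -> list[str]:
    """Join each clitic onto the previous token when no space separates them."""
    words: list[str] = []
    prev = None
    for tok in tokens:
        if prev is not None and not prev.space_after and is_clitic(tok):
            words[-1] += tok.text
        else:
            words.append(tok.text)
        prev = tok
    return words


# Tags copied from the en_core_web_lg output for the first sentence above.
SENTENCE = [
    Tok("It", "PRP", False), Tok("'s", "VBZ", True),
    Tok("Jess", "NNP", False), Tok("'", "POS", True),
    Tok("n", "CC", False), Tok("'", "''", True),
    Tok("Sam", "NNP", False), Tok("'s", "POS", True),
    Tok("car", "NN", False), Tok(".", ".", False),
]

print(merge_clitics(SENTENCE))
# ["It's", "Jess'", 'n', "'", "Sam's", 'car', '.']
```

With the tags as predicted, "It's", "Jess'" and "Sam's" are merged, but n' stays split because its apostrophe carries the same '' tag as a closing quotation mark.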

Your Environment

  • spaCy version: 3.5.0
  • Platform: Linux-6.2.6-arch1-1-x86_64-with-glibc2.37
  • Python version: 3.10.10
  • Pipelines: en_core_web_lg (3.5.0)
@shadeMe shadeMe added feat / tagger Feature: Part-of-speech tagger lang / en English language data and models labels Apr 3, 2023
@adrianeboyd
Contributor

For abbreviations like n' it might be better to have a rule-based exception. For the possessive vs. quote cases, this distinction is present in the annotation scheme for the training corpus, but the cases are ambiguous enough and the training examples rare enough that trained pipelines like en_core_web_lg are likely to make a fair number of mistakes.
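As a rough illustration of what a rule-based exception could look like when applied in post-processing, here is a small sketch in plain Python. It is not spaCy's own exception mechanism: the `APOSTROPHE_ABBREVIATIONS` set and the `(text, space_after)` pairs are assumptions made for the example, and the list would need to be filled in for real data.

```python
# Hypothetical exception list: abbreviations that end in an apostrophe.
APOSTROPHE_ABBREVIATIONS = {"n'"}


def abbreviation_apostrophes(tokens, exceptions=APOSTROPHE_ABBREVIATIONS):
    """Return indices of bare apostrophes that close a listed abbreviation.

    `tokens` is a list of (text, space_after) pairs, where space_after
    says whether whitespace followed the token in the original text.
    """
    hits = []
    for i in range(1, len(tokens)):
        prev_text, prev_space = tokens[i - 1]
        text, _ = tokens[i]
        if (
            text == "'"
            and not prev_space
            and (prev_text + text).lower() in exceptions
        ):
            hits.append(i)
    return hits


# Tokenization of "It's Jess' n' Sam's car." as shown in the report.
TOKENS = [
    ("It", False), ("'s", True), ("Jess", False), ("'", True),
    ("n", False), ("'", True), ("Sam", False), ("'s", True),
    ("car", False), (".", False),
]

print(abbreviation_apostrophes(TOKENS))  # [5]
```

Because the lookup uses only the surface text, it does not depend on the tagger, so the apostrophe in n' is caught even when the model labels it as a quotation mark.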

Aside from the general recommendation that you can improve performance by training or fine-tuning a model with more of these kinds of examples (#3052), I'd recommend looking at the dependency parse along with the POS tags to distinguish these cases. Also consider using en_core_web_trf, which seems to perform a bit better, at least for these cases. Obviously you'd need to evaluate this carefully for your data.

For example:

en_core_web_lg

It       it       PRON     PRP      nsubj   
's       be       AUX      VBZ      ROOT    
Jess     Jess     PROPN    NNP      attr    
'        '        PART     POS      case    
n        n        CCONJ    CC       cc      
'        '        PUNCT    ''       punct   
Sam      Sam      PROPN    NNP      poss    
's       's       PART     POS      case    
car      car      NOUN     NN       attr    
.        .        PUNCT    .        punct   
No       no       INTJ     UH       intj    
,        ,        PUNCT    ,        punct   
it       it       PRON     PRP      nsubj   
's       be       AUX      VBZ      ROOT    
just     just     ADV      RB       advmod  
Jess     Jess     PROPN    NNP      attr    
'        '        PUNCT    ''       punct   
.        .        PUNCT    .        punct   
I        I        PRON     PRP      nsubj   
like     like     VERB     VBP      ROOT    
'        '        PUNCT    ``       punct   
apples   apple    NOUN     NNS      dobj    
'        '        PUNCT    ''       punct   
.        .        PUNCT    .        punct   

en_core_web_trf

It       it       PRON     PRP      nsubj   
's       be       AUX      VBZ      ROOT    
Jess     Jess     PROPN    NNP      poss    
'        '        PART     POS      case    
n        n        CCONJ    CC       cc      
'        '        CCONJ    CC       cc      
Sam      Sam      PROPN    NNP      conj    
's       's       PART     POS      case    
car      car      NOUN     NN       attr    
.        .        PUNCT    .        punct   
No       no       INTJ     UH       intj    
,        ,        PUNCT    ,        punct   
it       it       PRON     PRP      nsubj   
's       be       AUX      VBZ      ccomp   
just     just     ADV      RB       advmod  
Jess     Jess     PROPN    NNP      attr    
'        '        PART     POS      case    
.        .        PART     POS      case    
I        I        PRON     PRP      nsubj   
like     like     VERB     VBP      ROOT    
'        '        PUNCT    ``       punct   
apples   apple    NOUN     NNS      dobj    
'        '        PUNCT    .        punct   
.        .        PUNCT    .        punct
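
One way to combine the two signals is sketched below. The rule itself is only an assumption for illustration (treat a bare apostrophe as possessive when it is tagged POS or attached as case, and as a quote when it is attached as punct), and the input pairs are copied from the en_core_web_trf output above rather than produced by running the pipeline.

```python
def classify_apostrophe(tag: str, dep: str) -> str:
    """Classify a bare ' token from its fine-grained tag and dependency label."""
    if tag == "POS" or dep == "case":
        return "possessive"
    if dep == "punct":
        return "quote"
    return "other"


# (tag_, dep_) pairs for each bare ' token in the en_core_web_trf output above.
TRF_APOSTROPHES = [
    ("POS", "case"),   # Jess' (first sentence)
    ("CC", "cc"),      # the apostrophe in n'
    ("POS", "case"),   # just Jess'
    ("``", "punct"),   # opening quote before apples
    (".", "punct"),    # closing quote after apples
]

print([classify_apostrophe(tag, dep) for tag, dep in TRF_APOSTROPHES])
# ['possessive', 'other', 'possessive', 'quote', 'quote']
```

The closing quote after apples is still recognized even though its tag is ".", because the punct label carries the information. Applied to the en_core_web_lg rows, the same rule would still label the apostrophes in n' and in the final Jess' as quotes, since both the tag and the dependency label are wrong there.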

@adrianeboyd
Contributor

Let me move this to the discussion board...

@explosion explosion locked and limited conversation to collaborators Apr 20, 2023
@adrianeboyd adrianeboyd converted this issue into discussion #12552 Apr 20, 2023

3 participants