Spaces impacting tag/pos #13680

Open
lsmith77 opened this issue Oct 28, 2024 · 1 comment

lsmith77 commented Oct 28, 2024

How to reproduce the behaviour

Notice the double space in front of "sourire" in the first case vs. the single space in the second case:

Les publics avec un  sourire chaleureux  et

https://demos.explosion.ai/displacy?text=Les%20publics%20avec%20un%20%20sourire%20chaleureux%20%20et&model=fr_core_news_sm

vs.

Les publics avec un sourire chaleureux  et

https://demos.explosion.ai/displacy?text=Les%20publics%20avec%20un%20sourire%20chaleureux%20%20et&model=fr_core_news_sm

Your Environment

  • Operating System:
  • Python Version Used: 3.12
  • spaCy Version Used: v3.5 (displacy) but also in v3.7
  • Environment Information:

Semi-related: Is there any guidance on how to modify the tokenizer so that a double space is placed into whitespace_ (i.e. whitespace_ would hold both spaces) and does not lead to a SPACE token? I did take note of #1707, though putting the additional spaces into whitespace_ seems more logical to me.
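
One possible workaround, sketched here as an assumption rather than as anything spaCy provides: collapse runs of spaces before the text reaches the tokenizer, and keep an offset map so character positions can still be traced back to the original string. The helper name `collapse_spaces` is hypothetical, and the sketch only handles plain spaces (tabs and newlines are left untouched).

```python
def collapse_spaces(text):
    """Collapse each run of plain spaces into a single space.

    Returns the normalized text plus a list that maps every character
    index in the normalized text back to its index in the original text.
    (Hypothetical helper for illustration; not part of spaCy.)
    """
    out = []
    offsets = []
    prev_space = False
    for i, ch in enumerate(text):
        if ch == " ":
            if prev_space:
                # Second or later space in a run: drop it.
                continue
            prev_space = True
        else:
            prev_space = False
        out.append(ch)
        offsets.append(i)
    return "".join(out), offsets


text = "Les publics avec un  sourire chaleureux  et"
normalized, offsets = collapse_spaces(text)

# Position of "sourire" in the normalized text, mapped back to the original.
start = normalized.index("sourire")
original_start = offsets[start]
print(repr(normalized))
print(start, original_start)
```

The normalized string could then be passed to a loaded pipeline (for example `nlp(normalized)`, where `nlp` is assumed to be something like `fr_core_news_sm`), and `offsets[token.idx]` would give each token's start in the original text. This avoids SPACE tokens instead of changing how the tokenizer fills `whitespace_`, so it is a different approach from the one asked about above.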

Research

a) Maybe related #621
b) Semi-related https://stephantul.github.io/spacy/2019/05/01/tokenizationspacy/
c) Semi-related #9978

smal8 commented Nov 12, 2024

Maybe we could use infixes or suffixes?
