token differs from token.string, in some cases token.string contains a trailing space while token does not #5504

leomrocha · 2020-05-25T20:42:03Z


How to reproduce the behaviour

The text and length obtained from a token in a sentence differ depending on how the token is accessed. token.string sometimes includes a trailing space, and its length reflects that, but using the token directly (str(token) or len(token)) does not include the space.

Expected behaviour: both should return the same value. Whether the token's text and length include the trailing space is not the issue; only consistency between the two methods matters.

Example code:

In [2]: import en_core_web_md
In [3]: nlp = en_core_web_md.load()
In [4]: sen = "Say this to him,\n    He's beat from his best ward.\n  "
In [5]: doc = nlp(sen)
In [6]: sentences = doc.sents
In [7]: sentences = list(doc.sents)

In [8]: sentences                                                                                                                                             
Out[8]: 
[Say this to him,
     He's beat from his best ward.
   ]
In [9]: for t in sentences[0]: 
   ...:     print(len(t), len(t.string), len(str(t)), '|{}|'.format(t), '|{}|'.format(t.string)) 
   ...:                                                                                                                                                       
3 4 3 |Say| |Say |
4 5 4 |this| |this |
2 3 2 |to| |to |
3 3 3 |him| |him|
1 1 1 |,| |,|
5 5 5 |
    | |
    |
2 2 2 |He| |He|
2 3 2 |'s| |'s |
4 5 4 |beat| |beat |
4 5 4 |from| |from |
3 4 3 |his| |his |
4 5 4 |best| |best |
4 4 4 |ward| |ward|
1 1 1 |.| |.|
3 3 3 |
  | |
  |

Your Environment

Operating System: Ubuntu 20.04
Python Version Used: 3.8
spaCy Version Used: 2.2.4
Environment Information: ipython and jupyter notebooks

Answered by adrianeboyd


adrianeboyd · 2020-05-26T07:54:19Z


Token.string was deprecated a while ago in favor of Token.text_with_ws, probably in part due to this confusion. I'd recommend using Token.text and Token.text_with_ws instead. Token.text_with_ws includes the trailing whitespace if the token is followed by a space in the original text.
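
To illustrate the distinction, here is a minimal pure-Python sketch. It does not use spaCy: `ToyToken` and `toy_tokenize` are hypothetical names invented for this example, and the regex tokenizer is a deliberate simplification. It only mimics the idea that a token stores its text separately from the single space that may follow it, so the "with whitespace" form is at most one character longer, as in the output reported in the question.

```python
import re
from dataclasses import dataclass


@dataclass
class ToyToken:
    """Hypothetical stand-in for a token; not spaCy's Token class."""
    text: str        # the token's own characters, never padded
    whitespace: str  # "" or " ": the single space that followed it, if any

    @property
    def text_with_ws(self) -> str:
        # Token text plus the trailing space from the original string, if present.
        return self.text + self.whitespace

    def __len__(self) -> int:
        # Length counts only the token's own characters.
        return len(self.text)

    def __str__(self) -> str:
        return self.text


def toy_tokenize(sentence: str) -> list:
    # Simplified rule: a run of word characters or one punctuation mark,
    # optionally followed by exactly one space.
    pattern = re.compile(r"(\w+|[^\w\s])( ?)")
    return [ToyToken(m.group(1), m.group(2)) for m in pattern.finditer(sentence)]


sentence = "beat from his best ward."
tokens = toy_tokenize(sentence)

for t in tokens:
    print(len(t), len(t.text_with_ws), "|{}|".format(t), "|{}|".format(t.text_with_ws))

# Joining the "with whitespace" forms reproduces the input exactly.
rebuilt = "".join(t.text_with_ws for t in tokens)
print(rebuilt == sentence)
```

In spaCy itself the corresponding attributes are Token.text and Token.text_with_ws, so the loop from the question can print `len(t.text)` and `len(t.text_with_ws)` instead of relying on `t.string`.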
