token differs from token.string, in some cases token.string contains a trailing space while token does not #5504
How to reproduce the behaviour
The length and text of a token in a sentence differ depending on how the string is obtained. token.string sometimes includes a trailing space, and its length reflects that, whereas using the token directly (for example str(token) or len(token)) does not include the space.
Expected behaviour: both should return the same value. Whether the token's text and length include the trailing space is not the issue; only the consistency between the two methods matters.
Example code:
In [2]: import en_core_web_md
In [3]: nlp = en_core_web_md.load()
In [4]: sen = "Say this to him,\n He's beat from his best ward.\n "
In [5]: doc = nlp(sen)
In [6]: sentences = doc.sents
In [7]: sentences = list(doc.sents)
In [8]: sentences
Out[8]:
[Say this to him,
He's beat from his best ward.
]
In [9]: for t in sentences[0]:
   ...:     print(len(t), len(t.string), len(str(t)), '|{}|'.format(t), '|{}|'.format(t.string))
   ...:
3 4 3 |Say| |Say |
4 5 4 |this| |this |
2 3 2 |to| |to |
3 3 3 |him| |him|
1 1 1 |,| |,|
5 5 5 |
| |
|
2 2 2 |He| |He|
2 3 2 |'s| |'s |
4 5 4 |beat| |beat |
4 5 4 |from| |from |
3 4 3 |his| |his |
4 5 4 |best| |best |
4 4 4 |ward| |ward|
1 1 1 |.| |.|
3 3 3 |
| |
|
Your Environment
Replies: 1 comment
Token.string was deprecated a while ago in favor of Token.text_with_ws, probably in part due to this confusion. I'd recommend using Token.text and Token.text_with_ws instead. Token.text_with_ws includes the trailing whitespace if the token is followed by a space in the original text.
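
As a rough illustration of the difference, here is a minimal sketch. It assumes spaCy is installed and uses a blank English pipeline (tokenizer only) rather than the en_core_web_md model from the report; the sample sentence and the describe helper are made up for this example.

```python
import spacy

# Assumption: a blank pipeline is enough here because only the tokenizer
# matters for text vs. text_with_ws (the report used en_core_web_md).
nlp = spacy.blank("en")
doc = nlp("Say this to him, now.")


def describe(token):
    """Hypothetical helper: return (text, text_with_ws, whitespace_) for a token."""
    return token.text, token.text_with_ws, token.whitespace_


rows = [describe(t) for t in doc]
for text, with_ws, ws in rows:
    # Show both forms between pipes, as in the original report.
    print(f"|{text}| |{with_ws}| len={len(text)}/{len(with_ws)}")

# text_with_ws is the token text plus its trailing whitespace (if any).
assert all(with_ws == text + ws for text, with_ws, ws in rows)

# Joining text_with_ws over all tokens reproduces the original string.
assert "".join(t.text_with_ws for t in doc) == doc.text
```

In short, use Token.text (or str(token)) when you want the bare token and Token.text_with_ws when you need to rebuild the original text from its tokens.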