Incorrect detection of sentence boundaries, if last sentence missing eos symbol for trf model #13356

koder-ua · 2024-02-25T20:50:50Z

How to reproduce the behaviour

In [69]: len(list(spacy.load("en_core_web_trf")("The first sentence. The second sentence. The last one").sents))
Out[69]: 1   # wrong, expected 3

In [70]: len(list(spacy.load("en_core_web_trf")("The first sentence. The second sentence. The last one.").sents))
Out[70]: 3

In [71]: len(list(spacy.load("en_core_web_sm")("The first sentence. The second sentence. The last one").sents))
Out[71]: 3

In [72]: len(list(spacy.load("en_core_web_sm")("The first sentence. The second sentence. The last one.").sents))
Out[72]: 3

Your Environment

Operating System: Mac OS X 10.3
Python Version Used: 3.11
spaCy Version Used: 3.7.4
Environment Information:

en_core_web_trf.__version__  >> '3.7.3'
en_core_web_sm.__version__ >> '3.7.1'

svlandeg (Maintainer) · 2024-02-27T13:47:04Z

Hi!

In this pretrained pipeline, sentence segmentation is actually done by the parser, and the model was trained mostly on texts with correct punctuation. So unfortunately this type of occasional error is unavoidable.

If you'd like more predictable behaviour, you can use the sentencizer instead, a simpler rule-based component that splits sentences on punctuation such as ., ! or ?.
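
As a rough illustration of the sentencizer suggestion, here is a minimal sketch that runs the rule-based component on a blank English pipeline, so no trained model is needed. The helper name `count_sentences` is made up for this example, and using `spacy.blank("en")` rather than `en_core_web_trf` is an assumption made to keep the snippet self-contained.

```python
import spacy


def count_sentences(text: str) -> int:
    """Count sentences using only spaCy's rule-based sentencizer.

    Hypothetical helper for illustration; not part of spaCy itself.
    """
    # Blank English pipeline: tokenizer only, no trained components.
    nlp = spacy.blank("en")
    # Built-in rule-based component that marks sentence starts after
    # sentence-final punctuation such as ".", "!" or "?".
    nlp.add_pipe("sentencizer")
    doc = nlp(text)
    return len(list(doc.sents))


# The two inputs from the report: without and with a final period.
print(count_sentences("The first sentence. The second sentence. The last one"))
print(count_sentences("The first sentence. The second sentence. The last one."))
```

Both calls should report 3 sentences, since the text after the last period is treated as a final sentence whether or not it ends in punctuation. To keep the other components of a trained pipeline, one option is to load it and add the component before the parser with `nlp.add_pipe("sentencizer", before="parser")`, since the parser respects boundaries that are already set. That variant was not tested here against `en_core_web_trf`.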
