spaCy's sentence segmentation incorrectly splits long or complex sentences.
In two examples I encountered, spaCy split one long sentence after a comma, and another long sentence after a closing parenthesis ')'.
I found incorrect splitting in other similar sentences too.
The two examples and steps to reproduce are described below.
Steps/Code to Reproduce
import spacy

# Load the large English model; NER is not needed for sentence segmentation.
nlp = spacy.load('en_core_web_lg', disable=['ner'])
texts = [
    # Triple quotes are needed because the text contains both ' and " characters.
    '''Definitely encourage you to continue making big bets in 2018. The new project seems like a great opportunity for us to invest in an area where the org needs better tooling. It says alot when you made the internal team swap to address this bet. It's easy to think "lets do it", but committing to it by moving parts that existing stakeholders were previously happy with (I hope) is a big bet in itself. I've developed the opinion that with our current team size and the kind of requests I've seen come down the pipeline, if every stakeholder is perfectly happy, it's likely we are not really taking those big bets.''',
    "Continue helping us push back on smaller (lower impact) requests to keep time, not only for big bets, but also tech debt, better documentation, internal improvements and tooling. There is delicate balance needed to keep helping our partners in the short term, while working for the long term objectives of XYZ. So far, you've been a big help with this."
]
for n, text in enumerate(texts):
    doc = nlp(text)
    print('Doc ', n, ':', sep='')
    # doc.sents yields one span per predicted sentence.
    for i, sentence in enumerate(doc.sents):
        print(i, sentence, sep=': ')
Expected Results
Doc 0:
0: Definitely encourage you to continue making big bets in 2018.
1: The new project seems like a great opportunity for us to invest in an area where the org needs better tooling.
2: It says alot when you made the internal team swap to address this bet.
3: It's easy to think "lets do it", but committing to it by moving parts that existing stakeholders were previously happy with (I hope) is a big bet in itself.
4: I've developed the opinion that with our current team size and the kind of requests I've seen come down the pipeline, if every stakeholder is perfectly happy, it's likely we are not really taking those big bets.
Doc 1:
0: Continue helping us push back on smaller (lower impact) requests to keep time, not only for big bets, but also tech debt, better documentation, internal improvements and tooling.
1: There is delicate balance needed to keep helping our partners in the short term, while working for the long term objectives of XYZ.
2: So far, you've been a big help with this.
Actual Results
Doc 0:
0: Definitely encourage you to continue making big bets in 2018.
1: The new project seems like a great opportunity for us to invest in an area where the org needs better tooling.
2: It says alot when you made the internal team swap to address this bet.
3: It's easy to think "lets do it", but committing to it by moving parts that existing stakeholders were previously happy with (I hope) is a big bet in itself.
4: I've developed the opinion that with our current team size and the kind of requests I've seen come down the pipeline,
5: if every stakeholder is perfectly happy, it's likely we are not really taking those big bets.
Doc 1:
0: Continue helping us push back on smaller (lower impact)
1: requests to keep time, not only for big bets, but also tech debt, better documentation, internal improvements and tooling.
2: There is delicate balance needed to keep helping our partners in the short term, while working for the long term objectives of XYZ.
3: So far, you've been a big help with this.
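To find splits like these in a larger batch of documents, a small check along the following lines may help. It is only a heuristic sketch, not part of spaCy: the helper name `suspicious_splits` is hypothetical, and it assumes that a well-formed sentence ends in '.', '!' or '?', which will not hold for every text.

```python
def suspicious_splits(sentences, terminals=(".", "!", "?")):
    """Return (index, text) pairs for sentences that do not end in
    terminal punctuation, which often indicates a mid-sentence split."""
    flagged = []
    for i, sent in enumerate(sentences):
        text = sent.strip()
        # str.endswith accepts a tuple of candidate suffixes.
        if text and not text.endswith(terminals):
            flagged.append((i, text))
    return flagged


# Sentences as reported under "Actual Results" for Doc 1.
doc1_actual = [
    "Continue helping us push back on smaller (lower impact)",
    "requests to keep time, not only for big bets, but also tech debt, "
    "better documentation, internal improvements and tooling.",
    "There is delicate balance needed to keep helping our partners in the "
    "short term, while working for the long term objectives of XYZ.",
    "So far, you've been a big help with this.",
]

for index, text in suspicious_splits(doc1_actual):
    print(index, text, sep=": ")
```

With a spaCy `doc`, the same helper could be called as `suspicious_splits(sent.text for sent in doc.sents)`.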
By default, spaCy uses the parser to set sentence boundaries. This is usually more accurate – however, depending on the data, it also means that it's affected by wrong predictions in the dependency parse. See here for details on how to customise the sentence segmentation and how to use a rule-based component instead: https://spacy.io/usage/linguistic-features#section-sbd
I'm also merging this with #3052. We've now added a master thread for incorrect predictions and related reports – see the issue for more details.
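For readers who want to try the rule-based route mentioned above, here is a minimal sketch of a custom boundary rule. It is an illustration under stated assumptions, not spaCy's own implementation: the names `sentence_starts` and `set_rule_based_boundaries` are hypothetical, and the rule is deliberately naive (a new sentence starts only after '.', '!' or '?', so closing quotes or abbreviations are not handled). The second function relies on spaCy's writable `Token.is_sent_start` attribute.

```python
TERMINALS = {".", "!", "?"}


def sentence_starts(tokens):
    """Given a list of token strings, return one boolean per token:
    True if the token starts a sentence under the naive rule."""
    starts = []
    previous_was_terminal = True  # the first token always starts a sentence
    for tok in tokens:
        starts.append(previous_was_terminal)
        previous_was_terminal = tok in TERMINALS
    return starts


def set_rule_based_boundaries(doc):
    """Pipeline-style component: mark sentence starts on each token.
    Works on any iterable of objects with a .text attribute and a
    writable .is_sent_start attribute (spaCy tokens have both)."""
    flags = sentence_starts([token.text for token in doc])
    for token, flag in zip(doc, flags):
        token.is_sent_start = flag
    return doc


# Token list only, so the rule can be checked without loading a model.
tokens = ["smaller", "(", "lower", "impact", ")", "requests", ",",
          "not", "only", "for", "big", "bets", ".", "So", "far", "."]
starts = sentence_starts(tokens)
print([t for t, s in zip(tokens, starts) if s])
```

In spaCy 2.x a function like this would typically be registered ahead of the parser, for example with `nlp.add_pipe(set_rule_based_boundaries, before='parser')`, so that the parser respects the preset boundaries; newer spaCy versions register components differently, so check the documentation for the installed version.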
My Environment
Windows-10-10.0.17134-SP0
Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
NumPy 1.12.1
SciPy 1.1.0
Scikit-Learn 0.19.1
Spacy 2.0.11