Fix concatenating sentence parts separated with newlines #506

AndyTheFactory · 2023-10-24T18:28:38Z

Issue by dhgelling
Fri Feb 12 13:22:28 2021
Originally opened as codelucas/newspaper#873

The text content of newspapers seems to be returned as paragraphs separated by two newlines. When doing NLP on this, the tokenizer sometimes thinks a sentence spans two paragraphs and returns a sentence that looks like 'first sentence\n\nsecond sentence'. This part of the code then concatenates the words "sentence" and "second", returning 'first sentencesecond sentence'. This change is meant to fix that by replacing any run of multiple spaces or newlines with one space.
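A minimal sketch of the problem and the fix described above. The function names are hypothetical, and the stripping regex is an assumption about how the words end up glued together, not a copy of the library's code:

```python
import re


def strip_special_chars(text: str) -> str:
    # Assumed behaviour of the affected code: drop every character that is
    # not a word character or a plain space. Newlines are dropped as well,
    # so the words on either side of a paragraph break get joined.
    return re.sub(r"[^\w ]", "", text)


def collapse_whitespace(text: str) -> str:
    # The fix described above: replace any run of spaces or newlines
    # with a single space before further processing.
    return re.sub(r"[ \n]+", " ", text)


sentence = "first sentence\n\nsecond sentence"

buggy = strip_special_chars(sentence)
fixed = strip_special_chars(collapse_whitespace(sentence))

print(repr(buggy))  # 'first sentencesecond sentence'
print(repr(fixed))  # 'first sentence second sentence'
```

Collapsing the whitespace first means the later stripping step never sees a newline between two words.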

Another solution would be to first split on double newlines and process each paragraph separately before concatenating again, but this change seemed like it would alter the least.
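For comparison, a rough sketch of that alternative. `process_by_paragraph` and `strip_special_chars` are hypothetical names, and rejoining the paragraphs with a single space is an assumption:

```python
import re
from typing import Callable


def strip_special_chars(text: str) -> str:
    # Hypothetical per-paragraph step: keep only word characters and spaces.
    return re.sub(r"[^\w ]", "", text)


def process_by_paragraph(text: str, process: Callable[[str], str]) -> str:
    # Split on double newlines so no sentence can span two paragraphs.
    paragraphs = text.split("\n\n")
    # Process each non-empty paragraph on its own, then rejoin with a space.
    return " ".join(process(p) for p in paragraphs if p.strip())


result = process_by_paragraph(
    "first sentence\n\nsecond sentence", strip_special_chars
)
print(repr(result))  # 'first sentence second sentence'
```

This keeps paragraph boundaries explicit, at the cost of touching more of the processing code than the whitespace replacement does.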

dhgelling included the following code: https://github.com/codelucas/newspaper/pull/873/commits

AndyTheFactory · 2023-11-01T21:11:05Z

changed

AndyTheFactory added the "undecided yet" and "PR-verify" (Has a PR, must be checked) labels Oct 30, 2023

AndyTheFactory added this to the Release 0.9.1 milestone Oct 30, 2023

AndyTheFactory closed this as completed Nov 1, 2023

AndyTheFactory added a commit that referenced this issue Nov 8, 2023

fix(parse): replace \n with space in sentence split (Issue #506)

3ccb87c
