Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sentences split on newlines #177

Open
wants to merge 6 commits into
base: master
Choose a base branch
from
Open

Conversation

Thomzoy
Copy link
Collaborator

@Thomzoy Thomzoy commented Dec 19, 2022

For the moment, the eds.sentences pipe splits sentences on a newline (\n) token iff it is followed by a capitalized token, i.e. a token with an uppercase initial, and with subsequent letter being lowercase. This can be problematic, (see #176)

Description

This PR adds a new split_on_newlines parameter to the pipe.

  • If split_on_newlines="with_capitalized", only newlines which subsequent token is capitalized will split sentences.
  • If split_on_newlines="with_uppercase", only newlines which subsequent token starts with an uppercase letter will split sentences.
  • If split_on_newlines=False, newlines will never split sentences.

Checklist

  • If this PR is a bug fix, the bug is documented in the test suite.
  • Changes were documented in the changelog (pending section).
  • If necessary, changes were made to the documentation (eg new pipeline).

@Thomzoy Thomzoy self-assigned this Dec 19, 2022
changelog.md Outdated Show resolved Hide resolved
@codecov-commenter
Copy link

Codecov Report

Base: 94.05% // Head: 94.05% // Increases project coverage by +0.00% 🎉

Coverage data is based on head (dcaed0f) compared to base (2cdd30c).
Patch coverage: 100.00% of modified lines in pull request are covered.

❗ Current head dcaed0f differs from pull request most recent head 8d56402. Consider uploading reports for the commit 8d56402 to get more accurate results

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #177   +/-   ##
=======================================
  Coverage   94.05%   94.05%           
=======================================
  Files         172      172           
  Lines        5078     5079    +1     
=======================================
+ Hits         4776     4777    +1     
  Misses        302      302           
Impacted Files Coverage Δ
edsnlp/pipelines/core/sentences/factory.py 100.00% <ø> (ø)
edsnlp/pipelines/core/sentences/terms.py 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@Thomzoy Thomzoy changed the title Sentences split on lnewlines Sentences split on newlines Dec 19, 2022
Comment on lines +121 to +124
if seen_period and Lexeme.c_check_flag(token.lex, IS_DIGIT):
continue
seen_newline = False
seen_period = False
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a correction. Before, a text like "Mesure du H.3 négatif" produced two sentences, "Mesure du H.3" and "négatif" since seen_newlines and seen_period were not reset.

@aricohen93
Copy link
Collaborator

@Thomzoy

An additional case :

if next token is \n

@Thomzoy
Copy link
Collaborator Author

Thomzoy commented Dec 21, 2022

@Thomzoy

An additional case :

if next token is \n

Do you mean that we should split into two sentences a text like this: "Une première phrase\n\net une deuxième." (two linebreaks with no punctuation and no uppercase)

@aricohen93
Copy link
Collaborator

Yes, at least 2 consecutive breaks should split the text into phrases I think, no ??

Copy link

sonarcloud bot commented Jul 1, 2024

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants