Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve safety of NER build #127

Merged
merged 2 commits into from
Feb 6, 2024
Merged

Improve safety of NER build #127

merged 2 commits into from
Feb 6, 2024

Conversation

cthoyt
Copy link
Member

@cthoyt cthoyt commented Jan 2, 2024

This PR updates the NER index build to skip entries whose norm terms are none or contain an empty string. This shouldn't happen in practice, but depending on the pipeline used to construct terms, this might happen. A few alternatives to consider:

  1. Have the Term class raise errors when None is given for norm_text
  2. Raise an error in the NER index build

@bgyori
Copy link
Member

bgyori commented Jan 2, 2024

This is actually tricky to solve perfectly. We already throw a ValueError if a Term's text evaluates to False in the constructor (https://github.com/gyorilab/gilda/blob/master/gilda/term.py#L53-L54). So I think checking norm_text the same way could make sense. Still, depending on how the terms are loaded, this could be inconvenient. We could also check for norm_text.strip() I suppose since a norm_text consisting of just spaces doesn't make sense either.

This PR updates the NER index build to skip entries whose norm terms are none or contain an empty string. This shouldn't happen in practice, but depending on the pipeline used to construct terms, this might happen. A few alternatives to consider:

1. Have the Term class raise errors when None is given for norm_text
2. Raise an error in the NER index build
@bgyori bgyori force-pushed the safer-prefix-index branch from eabc938 to 7b72202 Compare February 6, 2024 01:22
@bgyori bgyori merged commit 24cfa02 into master Feb 6, 2024
4 checks passed
@bgyori bgyori deleted the safer-prefix-index branch February 6, 2024 01:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants