
Experiment: switch to link by log odds window #229

Draft · wants to merge 11 commits into main
Conversation

@bamader (Collaborator) commented Feb 27, 2025

Description

Experimental branch showing how the guts of the code might change switching from a windowed belongingness to a windowed log odds, with Dan's suggestion to create the whole thing as a medical test result with interpretation layered on top (i.e. normalization pushed as far up the pipeline as possible so all values the user deals with are between 0 and 1). Code is by no means PR ready or finalized (e.g. tests not handled, updates to schemas.Prediction and schemas.LinkResult not fully changed, etc.), but wanted to share the ideas.

NOTE: I think we should de-couple the log odds work from missing fields or changes to blocking to avoid scope creep and pushing this feature farther out. I think this milestone should just be about replacing belongingness with log-odds.
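To illustrate the "push normalization as far up the pipeline as possible" idea, here is a minimal sketch; the function and parameter names are illustrative, not names from this branch:

```python
def normalize_score(earned_points: float, max_possible_points: float) -> float:
    """Scale a raw log-odds point total into the 0-1 range, so every
    value the user deals with reads like a test-result probability."""
    if max_possible_points <= 0:
        raise ValueError("max_possible_points must be positive")
    return earned_points / max_possible_points

# e.g. a record that earned 6.5 of a possible 10 log-odds points scores 0.65
```

Interpretation (match grades, thresholds) would then be layered on top of this normalized value, as the description suggests.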

Related Issues

[Link any related issues or tasks from your project management system.]

Additional Notes

[Add any additional context or notes that reviewers should know about.]

<--------------------- REMOVE THE LINES BELOW BEFORE MERGING --------------------->

Checklist

Please review and complete the following checklist before submitting your pull request:

  • I have ensured that the pull request is of a manageable size, allowing it to be reviewed within a single session.
  • I have reviewed my changes to ensure they are clear, concise, and well-documented.
  • I have updated the documentation, if applicable.
  • I have added or updated test cases to cover my changes, if applicable.
  • I have minimized the number of reviewers to include only those essential for the review.

Checklist for Reviewers

Please review and complete the following checklist during the review process:

  • The code follows best practices and conventions.
  • The changes implement the desired functionality or fix the reported issue.
  • The tests cover the new changes and pass successfully.
  • Any potential edge cases or error scenarios have been considered.

@bamader bamader marked this pull request as draft February 27, 2025 14:46

codecov bot commented Feb 28, 2025

Codecov Report

Attention: Patch coverage is 95.40230% with 4 lines in your changes missing coverage. Please review.

Project coverage is 97.72%. Comparing base (d54af84) to head (b0e2931).
Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
src/recordlinker/linking/link.py 93.65% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #229      +/-   ##
==========================================
+ Coverage   97.69%   97.72%   +0.02%     
==========================================
  Files          32       32              
  Lines        1651     1714      +63     
==========================================
+ Hits         1613     1675      +62     
- Misses         38       39       +1     


@bamader (Collaborator, Author) commented Feb 28, 2025

@ericbuckley @m-goggins Updated code to make all tests pass (there's still one set of three tests involving belongingness that I'm not quite sure what to do with, but I'll continue noodling).

I also ran the algorithm tests using the certain hierarchy I established, and we doubled our performance! Where we were getting 33% last time (two out of the six cases), we now get 4 out of the 6 cases correct: we correctly match 3 out of the 5 match cases, and correctly flag the invalid birthdate.

The cases we miss are (1) the record with first and last name switched (there's nothing we can do about that, since first and last are blocking fields in one pass [they fail exact matching] and evaluation fields in another [where they earn 0 points]); and (2) a test case where someone named Tho-mas fails to match a cluster that contains 2 patients identical except for their first names, Thomas and ThoMas.

I believe that if we implemented basic string normalization on incoming names (e.g. standardize casing, remove numbers and punctuation) we would catch this case. We don't even have to do it at the persistence level, so the data is preserved exactly as supplied; we could apply it on the fly during record evaluation, just like we proposed with skip values. This would let us still correctly model names that genuinely contain punctuation (e.g. the Arabic surname Al'Charif, also sometimes written Al-Charif) without penalizing one of those representations over the other. I believe this is a bug we should fix, but wanted to drop my findings here before I actually took off for the day.
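The on-the-fly name normalization proposed above could be sketched as follows (a sketch only, not code from this branch; the function name is made up):

```python
import re

def normalize_name(name: str) -> str:
    """Standardize casing and strip digits/punctuation for comparison only;
    the stored record keeps the name exactly as supplied."""
    return re.sub(r"[^a-z]", "", name.lower())

# 'Tho-mas', 'Thomas', and 'ThoMas' all normalize to 'thomas', and
# "Al'Charif" / 'Al-Charif' both normalize to 'alcharif', so neither
# written form is penalized relative to the other.
```

Applying this only during record evaluation (not at persistence) preserves the supplied data while making the comparisons punctuation- and case-insensitive.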

@ericbuckley ericbuckley added this to the v25.5.0 milestone Feb 28, 2025
@bamader (Collaborator, Author) commented Mar 5, 2025

Updating to mark all code here complete. All tests have been adjusted to match the new algorithm and everything passes. This is, from my view, "merge ready" code.

@bamader bamader requested a review from ericbuckley March 5, 2025 14:04
Comment on lines +66 to +83
1. If both grades (previously seen and newly processed) are equal,
updating is easy: just take the result with the higher RMS.
2. If the existing grade is certain but the new grade is not, we
*don't* update: being above the Certain Match Threshold is a stricter
inequality than being within the Possible Match Window, so we don't
want to overwrite with less information (Example: suppose the DIBBs
algorithm passes were switched. Suppose Cluster A had an RMS of 0.918.
This would be a match grade of 'certain'. If Pass 1 ran after Pass 2,
and Cluster A scored an RMS of 0.922, that would grade as 'possible'.
But despite the higher RMS, Cluster A actually accumulated more points
and stronger separation already, so we don't want to downgrade.)
3. If the new grade is certain but the existing grade is not, we
*always* update. Being a 'certain' match is a stronger statement and
thus more worth saving. Consider the example above with the passes
as normal. It would be better to save the Pass 2 RMS of 0.918 that
graded as 'Certain' than it would be to keep the Pass 1 RMS of 0.922
that only graded 'Possible,' since the user's previous profiling found
these matches higher quality.
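For reference, the three update rules above reduce to a small decision function. This is a minimal sketch with illustrative names, not the branch's actual API:

```python
def should_update(existing_grade: str, existing_rms: float,
                  new_grade: str, new_rms: float) -> bool:
    """Decide whether a newly processed result replaces the stored one."""
    if existing_grade == new_grade:
        # Rule 1: equal grades -> keep whichever result has the higher RMS.
        return new_rms > existing_rms
    if existing_grade == "certain":
        # Rule 2: never overwrite a 'certain' result with a weaker grade.
        return False
    # Rule 3: a new 'certain' grade always replaces a 'possible' one.
    return new_grade == "certain"

# The docstring's example: a stored 'certain' at RMS 0.918 is kept even
# when a later pass scores RMS 0.922 but only grades 'possible'.
```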
I think it's fine to keep the examples in here for now, but we might want to move them to a consolidated place eventually, maybe in design.md when it gets updated in #142

cmt: float
grade: str

def _do_update(self, earned_points, rms, mmt, cmt, grade):

What about update_score_tracking_row?

self.cmt = cmt
self.mmt = mmt

def handle_update(self, earned_points, rms, mmt, cmt, grade):

What about check_for_score_tracking_updates?

return rule_result


def grade_result(rms: float, mmt: float, cmt: float) -> str:

How about assign_match_grade or grade_rms_result or grade-rms?
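From the signature, the grading logic is presumably a two-threshold comparison along these lines (a guess at the body from the surrounding discussion, not the diff's actual implementation; the 'no_match' label is assumed):

```python
def grade_result(rms: float, mmt: float, cmt: float) -> str:
    """Grade a relative match score (rms) against the minimum match
    threshold (mmt) and the certain match threshold (cmt)."""
    if rms >= cmt:
        return "certain"
    if rms >= mmt:
        return "possible"
    return "no_match"
```

With the thresholds 0.8 and 0.925 discussed below, an RMS of 0.918 would grade 'possible' and 0.93 would grade 'certain'.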

certain_results = [x for x in results if x.grade == 'certain']
# re-assign the results array since we already have the higher-priority
# 'certain' grades if we need them
results = [x for x in results if x.grade == 'possible']

Suggested change
results = [x for x in results if x.grade == 'possible']
possible_results = [x for x in results if x.grade == 'possible']

result_counts["above_lower_bound"] = len(results)
if not results:

if not results and not certain_results:

Suggested change
if not results and not certain_results:
if not possible_results and not certain_results:

Comment on lines +26 to +27
0.8,
0.925

Can you remind me where these numbers come from?

)

assert link.compare(rec, pat, algorithm_pass) is True
assert link.compare(rec, pat, algorithm_pass) == 0.35

Suggested change
assert link.compare(rec, pat, algorithm_pass) == 0.35
assert link.compare(rec, pat, algorithm_pass) == algorithm_pass.kwargs["log_odds"]["IDENTIFIER"]


nit: this makes it clearer that the resulting score is expecting to be equal to the log-odds and not normalized to 1.0. Might consider for other tests as well if you think it's a good idea.

)

# should pass as MR is the same for both
assert link.compare(rec, pat, algorithm_pass) is True
assert link.compare(rec, pat, algorithm_pass) == 0.35

Hm. this is beyond the scope of this PR but I'm wondering if we should be treating all identifiers equally. It makes sense to me to give full points if the patient agrees with any identifiers of the same type in a record since you can have multiple MRNs, driver's licenses, etc. However, SSN should not be different because it's a national ID. I'm wondering if we should treat SSN differently.
