
Experiment: switch to link by log odds window #229

Draft · wants to merge 11 commits into main
Conversation

@bamader (Collaborator) commented Feb 27, 2025

Description

Experimental branch showing how the guts of the code might change switching from a windowed belongingness to a windowed log odds, with Dan's suggestion to create the whole thing as a medical test result with interpretation layered on top (i.e. normalization pushed as far up the pipeline as possible so all values the user deals with are between 0 and 1). Code is by no means PR ready or finalized (e.g. tests not handled, updates to schemas.Prediction and schemas.LinkResult not fully changed, etc.), but wanted to share the ideas.

NOTE: I think we should de-couple the log odds work from missing fields or changes to blocking to avoid scope creep and pushing this feature farther out. I think this milestone should just be about replacing belongingness with log-odds.
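To illustrate the "push normalization as far up the pipeline as possible" idea, here is a minimal sketch; the function and parameter names are illustrative, not names from this branch:

```python
def normalize_score(earned_points: float, max_possible_points: float) -> float:
    """Scale a raw log-odds point total into the 0-1 range, so every
    value the user deals with reads like a test-result probability."""
    if max_possible_points <= 0:
        raise ValueError("max_possible_points must be positive")
    return earned_points / max_possible_points

# e.g. a record that earned 6.5 of a possible 10 log-odds points scores 0.65
```

Interpretation (match grades, thresholds) would then be layered on top of this normalized value, as the description suggests.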

Related Issues

[Link any related issues or tasks from your project management system.]

Additional Notes

[Add any additional context or notes that reviewers should know about.]

<--------------------- REMOVE THE LINES BELOW BEFORE MERGING --------------------->

Checklist

Please review and complete the following checklist before submitting your pull request:

  • I have ensured that the pull request is of a manageable size, allowing it to be reviewed within a single session.
  • I have reviewed my changes to ensure they are clear, concise, and well-documented.
  • I have updated the documentation, if applicable.
  • I have added or updated test cases to cover my changes, if applicable.
  • I have minimized the number of reviewers to include only those essential for the review.

Checklist for Reviewers

Please review and complete the following checklist during the review process:

  • The code follows best practices and conventions.
  • The changes implement the desired functionality or fix the reported issue.
  • The tests cover the new changes and pass successfully.
  • Any potential edge cases or error scenarios have been considered.

@bamader bamader marked this pull request as draft February 27, 2025 14:46

codecov bot commented Feb 28, 2025

Codecov Report

Attention: Patch coverage is 95.40230% with 4 lines in your changes missing coverage. Please review.

Project coverage is 97.72%. Comparing base (d54af84) to head (b0e2931).
Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
src/recordlinker/linking/link.py 93.65% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #229      +/-   ##
==========================================
+ Coverage   97.69%   97.72%   +0.02%     
==========================================
  Files          32       32              
  Lines        1651     1714      +63     
==========================================
+ Hits         1613     1675      +62     
- Misses         38       39       +1     


@bamader (Collaborator, Author) commented Feb 28, 2025

@ericbuckley @m-goggins Updated code to make all tests pass (there's still one set of three tests involving belongingness that I'm not quite sure what to do with, but I'll continue noodling).

I also ran the algorithm tests using the certain hierarchy I established, and we doubled our performance! Where we were getting 33% last time (two out of the six cases), we now get 4 out of the 6 cases correct: we correctly match 3 out of the 5 match cases, and correctly flag the invalid birthdate.

The cases we miss are (1) the record with first and last name switched (there's nothing we can do about that, since first and last are blocking fields in one pass [they fail exact matching] and evaluation fields in another [where they earn 0 points]); and (2) a test case where someone named Tho-mas fails to match a cluster that contains 2 patients identical except for their first names, Thomas and ThoMas.

I believe that if we implemented basic string normalization on incoming names (e.g. standardize casing, remove numbers and punctuation) we would catch this case. We don't even have to do it at the persistence level, so the data is preserved exactly as supplied; we could apply it on the fly during record evaluation, just like we proposed with skip values. This would let us still correctly model names that genuinely contain punctuation (e.g. the Arabic surname Al'Charif, also sometimes written Al-Charif) without penalizing one of those representations over the other. I believe this is a bug we should fix, but wanted to drop my findings here before I actually took off for the day.
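The on-the-fly name normalization proposed above could be sketched as follows (a sketch only, not code from this branch; the function name is made up):

```python
import re

def normalize_name(name: str) -> str:
    """Standardize casing and strip digits/punctuation for comparison only;
    the stored record keeps the name exactly as supplied."""
    return re.sub(r"[^a-z]", "", name.lower())

# 'Tho-mas', 'Thomas', and 'ThoMas' all normalize to 'thomas', and
# "Al'Charif" / 'Al-Charif' both normalize to 'alcharif', so neither
# written form is penalized relative to the other.
```

Applying this only during record evaluation (not at persistence) preserves the supplied data while making the comparisons punctuation- and case-insensitive.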

@ericbuckley ericbuckley added this to the v25.5.0 milestone Feb 28, 2025
@bamader (Collaborator, Author) commented Mar 5, 2025

Updating to mark all code here complete. All tests have been adjusted to match the new algorithm and everything passes. This is, from my view, "merge ready" code.

@bamader bamader requested a review from ericbuckley March 5, 2025 14:04
Comment on lines +66 to +83
1. If both grades (previously seen and newly processed) are equal,
updating is easy: just take the result with the higher RMS.
2. If the existing grade is certain but the new grade is not, we
*don't* update: being above the Certain Match Threshold is a stricter
inequality than being within the Possible Match Window, so we don't
want to overwrite with less information (Example: suppose the DIBBs
algorithm passes were switched. Suppose Cluster A had an RMS of 0.918.
This would be a match grade of 'certain'. If Pass 1 ran after Pass 2,
and Cluster A scored an RMS of 0.922, that would grade as 'possible'.
But despite the higher RMS, Cluster A actually accumulated more points
and stronger separation already, so we don't want to downgrade.)
3. If the new grade is certain but the existing grade is not, we
*always* update. Being a 'certain' match is a stronger statement and
thus more worth saving. Consider the example above with the passes
as normal. It would be better to save the Pass 2 RMS of 0.918 that
graded as 'Certain' than it would be to keep the Pass 1 RMS of 0.922
that only graded 'Possible,' since the user's previous profiling found
these matches higher quality.
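For reference, the three update rules above reduce to a small decision function. This is a minimal sketch with illustrative names, not the branch's actual API:

```python
def should_update(existing_grade: str, existing_rms: float,
                  new_grade: str, new_rms: float) -> bool:
    """Decide whether a newly processed result replaces the stored one."""
    if existing_grade == new_grade:
        # Rule 1: equal grades -> keep whichever result has the higher RMS.
        return new_rms > existing_rms
    if existing_grade == "certain":
        # Rule 2: never overwrite a 'certain' result with a weaker grade.
        return False
    # Rule 3: a new 'certain' grade always replaces a 'possible' one.
    return new_grade == "certain"

# The docstring's example: a stored 'certain' at RMS 0.918 is kept even
# when a later pass scores RMS 0.922 but only grades 'possible'.
```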
I think it's fine to keep the examples in here for now, but we might want to move them to a consolidated place eventually, maybe in design.md when it gets updated in #142

cmt: float
grade: str

def _do_update(self, earned_points, rms, mmt, cmt, grade):

What about update_score_tracking_row?

self.cmt = cmt
self.mmt = mmt

def handle_update(self, earned_points, rms, mmt, cmt, grade):

What about check_for_score_tracking_updates?

return rule_result


def grade_result(rms: float, mmt: float, cmt: float) -> str:

How about assign_match_grade or grade_rms_result or grade-rms?
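From the signature, the grading logic is presumably a two-threshold comparison along these lines (a guess at the body from the surrounding discussion, not the diff's actual implementation; the 'no_match' label is assumed):

```python
def grade_result(rms: float, mmt: float, cmt: float) -> str:
    """Grade a relative match score (rms) against the minimum match
    threshold (mmt) and the certain match threshold (cmt)."""
    if rms >= cmt:
        return "certain"
    if rms >= mmt:
        return "possible"
    return "no_match"
```

With the thresholds 0.8 and 0.925 discussed below, an RMS of 0.918 would grade 'possible' and 0.93 would grade 'certain'.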

certain_results = [x for x in results if x.grade == 'certain']
# re-assign the results array since we already have the higher-priority
# 'certain' grades if we need them
results = [x for x in results if x.grade == 'possible']

Suggested change
results = [x for x in results if x.grade == 'possible']
possible_results = [x for x in results if x.grade == 'possible']

result_counts["above_lower_bound"] = len(results)
if not results:

if not results and not certain_results:

Suggested change
if not results and not certain_results:
if not possible_results and not certain_results:

Comment on lines +26 to +27
0.8,
0.925

Can you remind me where these numbers come from?

)

assert link.compare(rec, pat, algorithm_pass) is True
assert link.compare(rec, pat, algorithm_pass) == 0.35

Suggested change
assert link.compare(rec, pat, algorithm_pass) == 0.35
assert link.compare(rec, pat, algorithm_pass) == algorithm_pass.kwargs["log_odds"]["IDENTIFIER"]


nit: this makes it clearer that the resulting score is expecting to be equal to the log-odds and not normalized to 1.0. Might consider for other tests as well if you think it's a good idea.

)

# should pass as MR is the same for both
assert link.compare(rec, pat, algorithm_pass) is True
assert link.compare(rec, pat, algorithm_pass) == 0.35

Hm. this is beyond the scope of this PR but I'm wondering if we should be treating all identifiers equally. It makes sense to me to give full points if the patient agrees with any identifiers of the same type in a record since you can have multiple MRNs, driver's licenses, etc. However, SSN should not be different because it's a national ID. I'm wondering if we should treat SSN differently.
