Merge branch 'main' into feat/add-nbs-test-suite

CDCgov · Dec 5, 2024 · 8e1575d
2 parents 79556bb + 7de3b70
commit 8e1575d
Showing 28 changed files with 819 additions and 1,419 deletions.
diff --git a/docs/site/configuration.md b/docs/site/configuration.md
@@ -65,7 +65,7 @@ Each setting can be configured as follows:
 
 :   URI for the Splunk HTTP Event Collector (HEC) endpoint. When set, logs will be sent to
     the configured Splunk instance for analysis. The format is
-    `splunkhec://<token>@<host>:<port>?index=<index>&proto=<protocol>&ssl_verify=<verify>&source=<source>`
+    `splunkhec://<token>@<host>:<port>?index=<index>&proto=<protocol>&source=<source>`
 
     **Docker Default**: `""`
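
A minimal sketch of how the documented URI format breaks down into its parts, using only the Python standard library. The helper name `parse_splunk_uri` and the example token, host and index values are hypothetical; this is not Record Linker's own parsing code.

```python
from urllib.parse import parse_qs, urlsplit


def parse_splunk_uri(uri: str) -> dict:
    """Split a splunkhec:// URI into its documented parts (illustrative helper)."""
    parts = urlsplit(uri)
    if parts.scheme != "splunkhec":
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    # parse_qs returns a list of values per key; keep the first value of each
    query = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {
        "token": parts.username,   # the part before "@"
        "host": parts.hostname,
        "port": parts.port,        # int, or None when no port is given
        "index": query.get("index"),
        "proto": query.get("proto"),
        "source": query.get("source"),
    }


# Hypothetical values, following the <token>@<host>:<port>?index=...&proto=...&source=... shape
example = parse_splunk_uri(
    "splunkhec://my-token@splunk.example.com:8088?index=main&proto=https&source=recordlinker"
)
```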
 

diff --git a/docs/site/reference.md b/docs/site/reference.md
@@ -119,12 +119,12 @@ patient data and used during query retrieval. The following blocking key types a
 These are the functions that can be used to evaluate the matching results as a collection, thus
 determining if the incoming payload is a match or not to an existing Patient record.
 
-`func:recordlinker.linking.matchers.eval_perfect_match`
+`func:recordlinker.linking.matchers.rule_match`
 
 :   Determines whether a given set of feature comparisons represent a 'perfect' match
     (i.e. all features that were compared match in whatever criteria was specified).
 
-`func:recordlinker.linking.matchers.eval_log_odds_cutoff`
+`func:recordlinker.linking.matchers.rule_probabilistic_match`
 
 :   Determines whether a given set of feature comparisons matches enough to be the
     result of a true patient link instead of just random chance. This is represented
@@ -143,22 +143,22 @@ matching is designed to compare one list of values to another list of values.  F
 incoming record could have a GIVEN_NAME of ["John", "Dean"] and we could be comparing them to an
 existing Patient with the GIVEN_NAME of ["John", "D"].
 
-`func:recordlinker.linking.matchers.feature_match_any`
+`func:recordlinker.linking.matchers.compare_match_any`
 
 :   Determines if any of the features are a direct match.
 
-`func:recordlinker.linking.matchers.feature_match_all`
+`func:recordlinker.linking.matchers.compare_match_all`
 
 :   Determines if all of the features are a direct match.
 
-`func:recordlinker.linking.matchers.feature_match_fuzzy_string`
+`func:recordlinker.linking.matchers.compare_fuzzy_match`
 
 :   Determines if the features are a fuzzy match based on a string comparison.
     JaroWinkler, Levenshtein and Damerau-Levenshtein are supported, with JaroWinkler as the default.
     Use the `kwargs` parameter to specify the desired algorithm and thresholds.
     Example: `{"kwargs": {"similarity_measure": "levenshtein", "thresholds": {"FIRST_NAME": 0.8}}}`
 
-`func:recordlinker.linking.matchers.feature_match_log_odds_fuzzy_compare`
+`func:recordlinker.linking.matchers.compare_probabilistic_fuzzy_match`
 
 :   Similar to the above function, but uses a log-odds ratio to determine if the features are a match 
     probabilistically. This is useful when wanting to more robustly compare features by incorporating

diff --git a/rfc/000-template.md b/rfc/000-template.md
@@ -0,0 +1,51 @@
+# RFC: [Title of Proposal]
+
+- **Status**: [Draft, Under Review, Accepted, Rejected]
+- **Author(s)**: [Name(s) or GitHub handle(s)]
+- **Creation Date**: [YYYY-MM-DD]
+- **Last Updated**: [YYYY-MM-DD]
+- **RFC ID**: [Optional, e.g., RFC-001]
+
+---
+
+## Summary
+
+**(Replace this text)** Briefly explain what this RFC proposes. The summary should be a few sentences that capture the purpose and scope of the change.
+
+## Motivation
+
+**(Replace this text)** Explain why this proposal is necessary or valuable. Describe the problem or limitations that this RFC aims to address, and what benefits it brings to the project. Highlight any relevant background information or previous discussions.
+
+## Prior Art
+
+**(Replace this text)** Describe similar solutions, patterns, or standards in other projects, libraries, or frameworks that influenced this proposal. Explain how this RFC builds on or differs from prior art, and why this specific approach is recommended for this project.
+
+## Proposal
+
+**(Replace this text)** Describe the details of the proposal. Include technical specifications, design choices, and any relevant diagrams or examples. The goal is to provide enough detail for reviewers to understand exactly what is being proposed.
+
+## Alternatives Considered
+
+**(Replace this text)** List any alternative solutions that were considered and explain why they were not chosen. This helps reviewers understand the trade-offs and decision-making process.
+
+## Risks and Drawbacks
+
+**(Replace this text)** Outline any potential risks, drawbacks, or negative implications of implementing this proposal. Consider the impact on users, maintainers, performance, and other factors.
+
+## Implementation Plan
+
+**(Replace this text)** Provide an overview of the steps required to implement the proposal, if it is accepted. Include any necessary code changes, documentation updates, migration plans, etc.
+
+## Unresolved Questions
+
+**(Replace this text)** List any open questions that still need to be addressed. This section can be used to highlight uncertainties and gather feedback from reviewers.
+
+## Future Possibilities
+
+**(Optional; replace or remove this text)** Describe any future improvements or extensions that could be considered after implementing this proposal.
+
+---
+
+## Footnotes or References
+
+**(Replace this text)** Include links to relevant documents, issues, discussions, or additional resources.
diff --git a/rfc/001-identifier-triplets.md b/rfc/001-identifier-triplets.md
@@ -0,0 +1,201 @@
+# RFC: Identifier Triplets
+
+- **Status**: Accepted
+- **Author(s)**: @ericbuckley
+- **Creation Date**: 2024-11-09
+- **Last Updated**: 2024-12-02
+- **RFC ID**: RFC-001
+
+---
+
+## Summary
+
+There are numerous identifiers that are used across healthcare documents to uniquely identify patients.
+Implementing each of these manually is time-consuming, error-prone and can lead to inconsistencies.
+This RFC proposes a standard set of identifier triplets that can be used to uniquely identify patients
+across healthcare documents and a process for how we can compare them in Record Linker.
+
+## Motivation
+
+NBS currently supports over a dozen different identifiers for patients, while Record Linker only has support
+for three (MRN, SSN and Driver's License). This makes it difficult to configure Record Linker to run comparisons
+on all fields that customers may want to link on.  Additionally, there are differences in how we currently handle
+comparisons on MRN, SSN and Driver's License, which can confuse users configuring business rules and lead to
+inconsistent results.  Standardizing the way identifiers are compared will make it easier for users to configure
+Record Linker and will lead to more consistent results.
+
+## Prior Art
+
+[FHIR currently uses a 6-tuple](http://hl7.org/fhir/R4/datatypes.html#Identifier) of identifier use, type,
+system, value, period and assigner to uniquely identify patients.  While all of these fields are optional, some
+are almost always used in practice (eg value and type).  NBS currently uses a 3-tuple system of type, authority
+and value.  Between the two, type and value are directly comparable, while assigner and authority offer similar
+functionality.
+
+## Background
+
+Field (aka feature) comparisons happen in two phases of the Record Linker process: blocking and evaluation.
+Blocking is the process of quickly reducing the number of existing documents to compare against, so the evaluation
+phase can be more efficient.  Evaluation is the process of comparing the remaining documents in detail to determine
+if they are the same or different.  Blocking is meant to be fuzzy, as it's just for performance; evaluation is meant
+for detailed comparisons that determine whether two documents are the same or different.
+
+Blocking has been optimized to join against a separate index table (ie the blocking values table) to quickly reduce
+the number of documents to compare against.  This table has 3 main columns to facilitate this: the patient id, the
+blocking key and the value.  The blocking key is essentially the name of the field we are comparing against,
+although it is not a direct mapping.  For example, the `LAST_NAME` blocking field only contains the first 4 chars of
+the last name.  The important thing to note is we have one fixed blocking key column (eg `LAST_NAME`, `SEX`,
+`IDENTIFIER`) and one variable value column (`Smit`, `M`, `123-45-6789`).
+
+Evaluation, the process of comparing documents in code, has more flexibility in that we can create different functions
+for comparing different fields.  The general process though has been to compare a list of string values for each
+field.  While many fields only have 1 value (eg `SEX`, `RACE`, `BIRTH_DATE`), some fields have multiple values
+(eg `ADDRESS`, `FIRST_NAME`, `LAST_NAME`).  Thus it was important to implement the evaluation functions to
+accept a list of strings for both the incoming document and the existing document when comparing these fields.
+The important thing to note here is that if we are comparing a field like `IDENTIFIER`, we are likely limited to
+just comparing 1 string value per identifier listed in a document (we are assuming more than 1 identifier can
+be listed in a document).
+
+## Proposal
+
+The proposal is to use a 3-tuple of type, authority and value to specify patient identifiers and evaluate them.
+
+- Input: A list of identifier objects, each with a type, authority and value.
+    example: `[{type: 'DL', authority: 'CA', value: 'A123456'}, {type: 'SS', authority: '', value: '123-45-6789'}]`
+- Business Rules: Algorithm configuration will accept the `IDENTIFIER` field with a type suffix, to specify
+    the type of identifier to compare.
+    example: `IDENTIFIER:DL`, `IDENTIFIER:SS`
+- Blocking: The blocking key will be `IDENTIFIER` and values will be inserted for every identifier specified in
+    the document using the type, the first two characters of the authority and the last four characters of the value.
+    example: `DL:CA:3456`, `SS::6789`
+- Evaluation: The evaluation function will accept a list of identifier strings, containing all 3 parts of the
+    tuple, and compare them using the specified evaluation function.
+    example: `['DL:CA:A123456', 'SS::123-45-6789']`
+
+### Input
+
+The input to the Record Linker process will be a list of identifier objects, each with a type, authority and value.
+The type and value attributes are required, while authority is optional.
+
+```json
+{
+    "identifiers": [
+        {
+            "type": "DL",
+            "authority": "CA",
+            "value": "A123456"
+        },
+        {
+            "type": "SS",
+            "authority": "",
+            "value": "123-45-6789"
+        }
+    ]
+}
+```
+
+The type attribute will be limited to codes defined by the
+[HL7 identifierType code system](https://terminology.hl7.org/6.0.2/CodeSystem-v2-0203.html). This 
+includes roughly 100 different types of identifiers that are all coded using a 2-7 character value.
+For example, `DL` is the code for Driver's License, `SS` is the code for Social Security Number, `MR`
+is the code for Medical Record Number.
+
+The authority attribute will be a free-form string that can be used to specify the issuing authority
+of the identifier.  For example, `CA` could be used to specify that the Driver's License was issued
+by the state of California.  There is no standard code system for this attribute, so it will be up to
+the user to specify a value that makes sense for their data.
+
+The value attribute will be a free-form string that contains the actual value of the identifier.  For
+example, `A123456` could be the value of a Driver's License, `123-45-6789` could be the value of a
+Social Security Number.
+
+
+### Business Rules
+
+When specifying a blocking key or evaluation field in an Algorithm configuration, identifier matches
+can be specified in two different forms.  The first form, which is applicable to blocking keys
+and evaluation fields, is to specify `IDENTIFIER` indicating it will match on any like identifier.
+The second form, which is **only applicable to evaluation fields**, is to specify `IDENTIFIER:<type>`
+indicating it will match on a specific type of identifier.
+
+```json
+{
+    "blocking_keys": [
+        "IDENTIFIER",
+        "BIRTH_DATE"
+    ],
+    "evaluators": {
+        "IDENTIFIER": "func:recordlinker.linking.matchers.feature_match_exact"
+    }
+}
+```
+
+```json
+{
+    "blocking_keys": [
+        "IDENTIFIER",
+        "BIRTH_DATE"
+    ],
+    "evaluators": {
+        "IDENTIFIER:SS": "func:recordlinker.linking.matchers.feature_match_exact"
+    }
+}
+```
+
+### Blocking
+
+Blocking keys are an important part of the linkage process, but only from a performance perspective.
+We use these values to efficiently index the documents, and pull out a subset of documents to compare
+in detail. For them to work efficiently, we have some limitations to the size of the values we can
+store in the blocking table.  Currently, that is limited to 20 characters which is a bit arbitrary,
+but the idea is to keep the values small so we can index them efficiently.  If we keep that limit, we
+need to use identifier values that are guaranteed to be less than that limit.
+
+Previous research has shown that the last **4 characters of an identifier value** are often the most
+unique and can be used to block on.  However, we should likely also include elements of the type and
+authority, if we want to provide some assurance that we are not blocking on a different identifier
+altogether. For that, we recommend storing the **entire type** (which is limited to 7 characters)
+and the **first 2 characters of the authority** (which is free-form).
+
+| patient_id | blocking_key | value      |
+|------------|--------------|------------|
+| 1          | IDENTIFIER   | DL:CA:3456 |
+| 1          | IDENTIFIER   | SS::6789   |
+
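A minimal sketch of how the blocking value described above could be assembled from an identifier triplet. The function name `blocking_value` and the dict-based input are assumptions for illustration, not the project's actual implementation.

```python
def blocking_value(identifier: dict) -> str:
    """Build the compact blocking-table value for one identifier triplet."""
    id_type = identifier["type"]                   # full type code (2-7 characters)
    authority = identifier.get("authority") or ""  # optional, free-form
    value = identifier["value"]
    # entire type + first 2 chars of authority + last 4 chars of value
    return f"{id_type}:{authority[:2]}:{value[-4:]}"


identifiers = [
    {"type": "DL", "authority": "CA", "value": "A123456"},
    {"type": "SS", "authority": "", "value": "123-45-6789"},
]

# One blocking row per identifier: (patient_id, blocking_key, value)
rows = [(1, "IDENTIFIER", blocking_value(i)) for i in identifiers]
```

Under these rules the longest possible value is 7 + 1 + 2 + 1 + 4 = 15 characters, which stays within the current 20-character limit.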
+### Evaluation
+
+As indicated in the business rules section, evaluation on identifiers can happen in two ways. The
+first way is to evaluate on any identifier, meaning that the evaluation step will result in a
+match if any two identifiers between the documents are a match.  The second way is to evaluate on
+a specific type of identifier, meaning that the evaluation step will result in a match if the 
+specified type of identifier between the documents is a match.
+
+The evaluation functions will be comparing all 3 parts of the identifier tuple (type, authority,
+value) when determining if two identifiers are a match.  The only difference between the two is which types
+of identifiers are compared.
+
+```python
+# `values` is the list of identifier strings passed to the evaluation function for `feature`
+if feature == 'IDENTIFIER':
+    assert values == ['DL:CA:A123456', 'SS::123-45-6789']
+if feature == 'IDENTIFIER:SS':
+    assert values == ['SS::123-45-6789']
+```
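
A runnable sketch of the selection logic implied by the snippet above, assuming identifiers arrive as dicts with `type`, `authority` and `value` keys. The helper names `values_for_feature` and `any_identifier_match` are hypothetical and only illustrate the two evaluation modes.

```python
def values_for_feature(identifiers: list, feature: str) -> list:
    """Return 'type:authority:value' strings for the identifiers a feature selects."""
    name, _, id_type = feature.partition(":")
    if name != "IDENTIFIER":
        raise ValueError(f"not an identifier feature: {feature!r}")
    return [
        f"{i['type']}:{i.get('authority') or ''}:{i['value']}"
        for i in identifiers
        # no type suffix selects every identifier; a suffix selects one type
        if not id_type or i["type"] == id_type
    ]


def any_identifier_match(incoming: list, existing: list) -> bool:
    """Exact comparison: true if the two lists share at least one full triplet string."""
    return bool(set(incoming) & set(existing))


identifiers = [
    {"type": "DL", "authority": "CA", "value": "A123456"},
    {"type": "SS", "authority": "", "value": "123-45-6789"},
]

all_values = values_for_feature(identifiers, "IDENTIFIER")
ss_values = values_for_feature(identifiers, "IDENTIFIER:SS")
```

Because all three parts are joined into one string, an exact comparison treats `DL:CA:A123456` and `DL:NV:A123456` as different identifiers.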
+
+## Alternatives Considered
+
+The main alternative to a generic identifier triplet is to continue creating specific fields (eg 
+`MRN`, `SSN`, `drivers_license`) for each identifier type.  This would require more configuration
+and would not be as flexible as the proposed solution. However, this does allow for customizations
+when blocking or evaluating on specific identifier types.  For example, we know that SSN will never
+require an authority, so we can make a slight reduction in the blocking value size knowing that is
+never required.  In the case of Driver's License, we know that the authority will always be a state,
+so we could implement custom normalization logic attempting to standardize the state values (eg `CA`
+vs `California`).  This is more work long-term, and likely more confusing for users as each identifier
+field has slightly different behavior, but does allow for maximum flexibility.
+
+## Risks and Drawbacks
+
+- Variations in authority values could lead to false negatives in comparisons (eg `CA` vs `California`).
+- Variations in value formats could lead to false negatives in comparisons (eg `123-45-6789` vs `123456789`).
+- Blocking values can't be specified per type, so if we want to block on identifiers we need to block on all.
diff --git a/src/recordlinker/assets/initial_algorithms.json b/src/recordlinker/assets/initial_algorithms.json
@@ -15,14 +15,14 @@
                 "evaluators": [
                     {
                         "feature": "FIRST_NAME",
-                        "func": "func:recordlinker.linking.matchers.feature_match_fuzzy_string"
+                        "func": "func:recordlinker.linking.matchers.compare_fuzzy_match"
                     },
                     {
                         "feature": "LAST_NAME",
-                        "func": "func:recordlinker.linking.matchers.feature_match_exact"
+                        "func": "func:recordlinker.linking.matchers.compare_match_all"
                     }
                 ],
-                "rule": "func:recordlinker.linking.matchers.eval_perfect_match",
+                "rule": "func:recordlinker.linking.matchers.rule_match",
                 "kwargs": {
                     "thresholds": {
                         "FIRST_NAME": 0.9,
@@ -44,14 +44,14 @@
                 "evaluators": [
                     {
                         "feature": "ADDRESS",
-                        "func": "func:recordlinker.linking.matchers.feature_match_fuzzy_string"
+                        "func": "func:recordlinker.linking.matchers.compare_fuzzy_match"
                     },
                     {
                         "feature": "BIRTHDATE",
-                        "func": "func:recordlinker.linking.matchers.feature_match_exact"
+                        "func": "func:recordlinker.linking.matchers.compare_match_all"
                     }
                 ],
-                "rule": "func:recordlinker.linking.matchers.eval_perfect_match",
+                "rule": "func:recordlinker.linking.matchers.rule_match",
                 "kwargs": {
                     "thresholds": {
                         "FIRST_NAME": 0.9,
@@ -81,14 +81,14 @@
                 "evaluators": [
                     {
                         "feature": "FIRST_NAME",
-                        "func": "func:recordlinker.linking.matchers.feature_match_log_odds_fuzzy_compare"
+                        "func": "func:recordlinker.linking.matchers.compare_probabilistic_fuzzy_match"
                     },
                     {
                         "feature": "LAST_NAME",
-                        "func": "func:recordlinker.linking.matchers.feature_match_log_odds_fuzzy_compare"
+                        "func": "func:recordlinker.linking.matchers.compare_probabilistic_fuzzy_match"
                     }
                 ],
-                "rule": "func:recordlinker.linking.matchers.eval_log_odds_cutoff",
+                "rule": "func:recordlinker.linking.matchers.rule_probabilistic_match",
                 "kwargs": {
                     "similarity_measure": "JaroWinkler",
                     "thresholds": {
@@ -123,14 +123,14 @@
                 "evaluators": [
                     {
                         "feature": "ADDRESS",
-                        "func": "func:recordlinker.linking.matchers.feature_match_log_odds_fuzzy_compare"
+                        "func": "func:recordlinker.linking.matchers.compare_probabilistic_fuzzy_match"
                     },
                     {
                         "feature": "BIRTHDATE",
-                        "func": "func:recordlinker.linking.matchers.feature_match_log_odds_fuzzy_compare"
+                        "func": "func:recordlinker.linking.matchers.compare_probabilistic_fuzzy_match"
                     }
                 ],
-                "rule": "func:recordlinker.linking.matchers.eval_log_odds_cutoff",
+                "rule": "func:recordlinker.linking.matchers.rule_probabilistic_match",
                 "kwargs": {
                     "similarity_measure": "JaroWinkler",
                     "thresholds": {

diff --git a/src/recordlinker/database/mpi_service.py b/src/recordlinker/database/mpi_service.py
@@ -236,3 +236,15 @@ def reset_mpi(session: orm.Session, commit: bool = True):
     session.query(models.Person).delete()
     if commit:
         session.commit()
+
+def delete_patient(session: orm.Session, obj: models.Patient, commit: bool = False) -> None:
+    """
+    Deletes a Patient from the database
+
+    :param session: The database session
+    :param obj: The Patient to delete
+    :param commit: Commit the transaction
+    """
+    session.delete(obj)
+    if commit:
+        session.commit()