Merge branch 'main' into feat/add-nbs-test-suite
ericbuckley committed Dec 5, 2024
2 parents 79556bb + 7de3b70 commit 8e1575d
Showing 28 changed files with 819 additions and 1,419 deletions.
2 changes: 1 addition & 1 deletion docs/site/configuration.md
@@ -65,7 +65,7 @@ Each setting can be configured as follows:

: URI for the Splunk HTTP Event Collector (HEC) endpoint. When set, logs will be sent to
the configured Splunk instance for analysis. The format is
`splunkhec://<token>@<host>:<port>?index=<index>&proto=<protocol>&ssl_verify=<verify>&source=<source>`
`splunkhec://<token>@<host>:<port>?index=<index>&proto=<protocol>&source=<source>`

**Docker Default**: `""`
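
As a rough illustration of the new URI format (without `ssl_verify`), the sketch below splits such a URI into its parts using only the Python standard library. The helper name `parse_splunk_uri` and the sample token, host and index values are assumptions made for this example; this is not Record Linker's actual parsing code.

```python
from urllib.parse import parse_qs, urlsplit


def parse_splunk_uri(uri: str) -> dict:
    """Split a splunkhec:// URI into its parts (hypothetical helper)."""
    parts = urlsplit(uri)
    if parts.scheme != "splunkhec":
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    # parse_qs returns a list per key; keep the first value of each.
    query = {key: vals[0] for key, vals in parse_qs(parts.query).items()}
    return {
        "token": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "index": query.get("index"),
        "proto": query.get("proto"),
        "source": query.get("source"),
    }


# Sample values below are made up for illustration.
config = parse_splunk_uri(
    "splunkhec://my-token@localhost:8088?index=idx&proto=https&source=recordlinker"
)
```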

12 changes: 6 additions & 6 deletions docs/site/reference.md
@@ -119,12 +119,12 @@ patient data and used during query retrieval. The following blocking key types a
These are the functions that can be used to evaluate the matching results as a collection, thus
determining if the incoming payload is a match or not to an existing Patient record.

`func:recordlinker.linking.matchers.eval_perfect_match`
`func:recordlinker.linking.matchers.rule_match`

: Determines whether a given set of feature comparisons represent a 'perfect' match
(i.e. all features that were compared match in whatever criteria was specified).

`func:recordlinker.linking.matchers.eval_log_odds_cutoff`
`func:recordlinker.linking.matchers.rule_probabilistic_match`

: Determines whether a given set of feature comparisons matches enough to be the
result of a true patient link instead of just random chance. This is represented
@@ -143,22 +143,22 @@ matching is designed to compare one list of values to another list of values. F
incoming record could have a GIVEN_NAME of ["John", "Dean"] and we could be comparing them to an
existing Patient with the GIVEN_NAME of ["John", "D"].

`func:recordlinker.linking.matchers.feature_match_any`
`func:recordlinker.linking.matchers.compare_match_any`

: Determines if any of the features are a direct match.

`func:recordlinker.linking.matchers.feature_match_all`
`func:recordlinker.linking.matchers.compare_match_all`

: Determines if all of the features are a direct match.
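
To make the difference between the "any" and "all" comparisons concrete, here is a minimal sketch using the `GIVEN_NAME` example above. The function names `match_any` and `match_all` are hypothetical, and the "all" semantics (both lists contain the same set of values) are an assumption; the library functions may behave differently.

```python
def match_any(incoming: list[str], existing: list[str]) -> bool:
    """True if at least one value appears in both lists."""
    return bool(set(incoming) & set(existing))


def match_all(incoming: list[str], existing: list[str]) -> bool:
    """True if both lists hold exactly the same set of values (assumed semantics)."""
    return set(incoming) == set(existing)


incoming = ["John", "Dean"]
existing = ["John", "D"]
any_result = match_any(incoming, existing)  # "John" is shared
all_result = match_all(incoming, existing)  # "Dean" and "D" differ
```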

`func:recordlinker.linking.matchers.feature_match_fuzzy_string`
`func:recordlinker.linking.matchers.compare_fuzzy_match`

: Determines if the features are a fuzzy match based on a string comparison.
JaroWinkler, Levenshtein and Damerau-Levenshtein are supported, with JaroWinkler as the default.
Use the `kwargs` parameter to specify the desired algorithm and thresholds.
Example: `{"kwargs": {"similarity_measure": "levenshtein", "thresholds": {"FIRST_NAME": 0.8}}}`
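
The sketch below shows, in plain Python, how a threshold from `kwargs` could be applied to a normalized Levenshtein similarity. It is only an illustration of the idea: the helper names are hypothetical, and the normalization (1 minus distance divided by the longer length) is an assumption, not necessarily what the library computes.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalize the distance to a 0..1 score, where 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def fuzzy_match(a: str, b: str, threshold: float) -> bool:
    """Treat two strings as a match when their similarity meets the threshold."""
    return similarity(a, b) >= threshold


kwargs = {"similarity_measure": "levenshtein", "thresholds": {"FIRST_NAME": 0.8}}
threshold = kwargs["thresholds"]["FIRST_NAME"]
```

With this normalization, "Jonathan" and "Jonathon" differ by one edit over eight characters and pass the 0.8 threshold, while "John" and "Jon" differ by one edit over four characters and do not.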

`func:recordlinker.linking.matchers.feature_match_log_odds_fuzzy_compare`
`func:recordlinker.linking.matchers.compare_probabilistic_fuzzy_match`

: Similar to the above function, but uses a log-odds ratio to determine if the features are a match
probabilistically. This is useful when wanting to more robustly compare features by incorporating
51 changes: 51 additions & 0 deletions rfc/000-template.md
@@ -0,0 +1,51 @@
# RFC: [Title of Proposal]

- **Status**: [Draft, Under Review, Accepted, Rejected]
- **Author(s)**: [Name(s) or GitHub handle(s)]
- **Creation Date**: [YYYY-MM-DD]
- **Last Updated**: [YYYY-MM-DD]
- **RFC ID**: [Optional, e.g., RFC-001]

---

## Summary

**(Replace this text)** Briefly explain what this RFC proposes. The summary should be a few sentences that capture the purpose and scope of the change.

## Motivation

**(Replace this text)** Explain why this proposal is necessary or valuable. Describe the problem or limitations that this RFC aims to address, and what benefits it brings to the project. Highlight any relevant background information or previous discussions.

## Prior Art

**(Replace this text)** Describe similar solutions, patterns, or standards in other projects, libraries, or frameworks that influenced this proposal. Explain how this RFC builds on or differs from prior art, and why this specific approach is recommended for this project.

## Proposal

**(Replace this text)** Describe the details of the proposal. Include technical specifications, design choices, and any relevant diagrams or examples. The goal is to provide enough detail for reviewers to understand exactly what is being proposed.

## Alternatives Considered

**(Replace this text)** List any alternative solutions that were considered and explain why they were not chosen. This helps reviewers understand the trade-offs and decision-making process.

## Risks and Drawbacks

**(Replace this text)** Outline any potential risks, drawbacks, or negative implications of implementing this proposal. Consider the impact on users, maintainers, performance, and other factors.

## Implementation Plan

**(Replace this text)** Provide an overview of the steps required to implement the proposal, if it is accepted. Include any necessary code changes, documentation updates, migration plans, etc.

## Unresolved Questions

**(Replace this text)** List any open questions that still need to be addressed. This section can be used to highlight uncertainties and gather feedback from reviewers.

## Future Possibilities

**(Optional; replace or remove this text)** Describe any future improvements or extensions that could be considered after implementing this proposal.

---

## Footnotes or References

**(Replace this text)** Include links to relevant documents, issues, discussions, or additional resources.
201 changes: 201 additions & 0 deletions rfc/001-identifier-triplets.md
@@ -0,0 +1,201 @@
# RFC: Identifier Triplets

- **Status**: Accepted
- **Author(s)**: @ericbuckley
- **Creation Date**: 2024-11-09
- **Last Updated**: 2024-12-02
- **RFC ID**: RFC-001

---

## Summary

There are numerous identifiers that are used across healthcare documents to uniquely identify patients.
Implementing each of these manually is time-consuming, error-prone and can lead to inconsistencies.
This RFC proposes a standard set of identifier triplets that can be used to uniquely identify patients
across healthcare documents and a process for how we can compare them in Record Linker.

## Motivation

NBS currently supports over a dozen different identifiers for patients, while Record Linker only has support
for three (MRN, SSN and Driver's License). This makes it difficult to configure Record Linker to run comparisons
on all fields that customers may want to link on. Additionally, there are differences in how we currently handle
comparisons on MRN, SSN and Driver's License, which can confuse users configuring business rules and lead to
inconsistent results. Standardizing the way identifiers are compared will make it easier for users to configure
Record Linker and will lead to more consistent results.

## Prior Art

[FHIR currently uses a 6-tuple](http://hl7.org/fhir/R4/datatypes.html#Identifier) of identifier use, type,
system, value, period and assigner to uniquely identify patients. While all of these fields are optional, some
are almost always used in practice (eg value and type). NBS currently uses a 3-tuple system of type, authority
and value. Between the two, type and value are directly comparable, while assigner and authority offer similar
functionality.

## Background

Field (aka feature) comparisons happen in two phases of the Record Linker process: blocking and evaluation.
Blocking is the process of quickly reducing the number of existing documents to compare against, so the evaluation
phase can be more efficient. Evaluation is the process of comparing the remaining documents in detail to determine
if they are the same or different. Blocking is meant to be fuzzy, as it's just for performance; evaluation is meant
for detailed comparisons to really determine if two documents are the same or different.

Blocking has been optimized to join against a separate index table (ie the blocking values table) to quickly reduce
the number of documents to compare against. This table has 3 main columns to facilitate this: the patient id, the
blocking key and the value. The blocking key is essentially the name of the field we are comparing against;
however, it is not a direct mapping. For example, the `LAST_NAME` blocking field only contains the first 4 characters of
the last name. The important thing to note is we have one fixed blocking key column (eg `LAST_NAME`, `SEX`,
`IDENTIFIER`) and one variable value column (`Smit`, `M`, `123-45-6789`).

Evaluation, the process of comparing documents in code, has more flexibility in that we can create different functions
for comparing different fields. The general process though has been to compare a list of string values for each
field. While many fields only have 1 value (eg `SEX`, `RACE`, `BIRTH_DATE`), some fields have multiple values
(eg `ADDRESS`, `FIRST_NAME`, `LAST_NAME`). Thus it was important to implement the evaluation functions to
accept a list of strings for both the incoming document and the existing document when comparing these fields.
The important thing to note here is that if we are comparing a field like `IDENTIFIER`, we are likely limited to
just comparing 1 string value per identifier listed in a document (we are assuming more than 1 identifier can
be listed in a document).

## Proposal

The proposal is to use a 3-tuple of type, authority and value to specify patient identifiers and evaluate them.

- Input: A list of identifier objects, each with a type, authority and value.
example: `[{type: 'DL', authority: 'CA', value: 'A123456'}, {type: 'SS', authority: '', value: '123-45-6789'}]`
- Business Rules: Algorithm configuration will accept the `IDENTIFIER` field with a type suffix, to specify
the type of identifier to compare.
example: `IDENTIFIER:DL`, `IDENTIFIER:SS`
- Blocking: The blocking key will be `IDENTIFIER` and values will be inserted for every identifier specified in
the document using the type, the first two characters of the authority and the last four characters of the value.
example: `DL:CA:3456`, `SS::6789`
- Evaluation: The evaluation function will accept a list of identifier strings, containing all 3 parts of the
tuple, and compare them using the specified evaluation function.
example: `['DL:CA:A123456', 'SS::123-45-6789']`

### Input

The input to the Record Linker process will be a list of identifier objects, each with a type, authority and value.
The type and value attributes are required, while authority is optional.

```json
{
  "identifiers": [
    {
      "type": "DL",
      "authority": "CA",
      "value": "A123456"
    },
    {
      "type": "SS",
      "authority": "",
      "value": "123-45-6789"
    }
  ]
}
```

The type attribute will be limited to codes defined by the
[HL7 identifierType code system](https://terminology.hl7.org/6.0.2/CodeSystem-v2-0203.html). This
includes roughly 100 different types of identifiers that are all coded using a 2-7 character value.
For example, `DL` is the code for Driver's License, `SS` is the code for Social Security Number, `MR`
is the code for Medical Record Number.

The authority attribute will be a free-form string that can be used to specify the issuing authority
of the identifier. For example, `CA` could be used to specify that the Driver's License was issued
by the state of California. There is no standard code system for this attribute, so it will be up to
the user to specify a value that makes sense for their data.

The value attribute will be a free-form string that contains the actual value of the identifier. For
example, `A123456` could be the value of a Driver's License, `123-45-6789` could be the value of a
Social Security Number.
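
A minimal sketch of how the triplet described above could be modeled and serialized is shown below. The `Identifier` class and its `to_string` method are hypothetical names chosen for this example, not part of the proposal; the only things taken from the RFC are the three attributes, the rule that type and value are required, and the `type:authority:value` string form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identifier:
    """Hypothetical container for one identifier triplet."""

    type: str            # HL7 identifierType code, e.g. "DL", "SS", "MR"
    value: str           # required, free-form
    authority: str = ""  # optional, free-form

    def __post_init__(self) -> None:
        if not self.type or not self.value:
            raise ValueError("type and value are required")

    def to_string(self) -> str:
        """Serialize as type:authority:value."""
        return f"{self.type}:{self.authority}:{self.value}"


payload = {
    "identifiers": [
        {"type": "DL", "authority": "CA", "value": "A123456"},
        {"type": "SS", "authority": "", "value": "123-45-6789"},
    ]
}
identifiers = [Identifier(**item) for item in payload["identifiers"]]
strings = [identifier.to_string() for identifier in identifiers]
```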


### Business Rules

When specifying a blocking key or evaluation field in an Algorithm configuration, identifier matches
can be specified in two different forms. The first form, which is applicable to blocking keys
and evaluation fields, is to specify `IDENTIFIER` indicating it will match on any like identifier.
The second form, which is **only applicable to evaluation fields**, is to specify `IDENTIFIER:<type>`
indicating it will match on a specific type of identifier.

```json
{
  "blocking_keys": [
    "IDENTIFIER",
    "BIRTH_DATE"
  ],
  "evaluators": {
    "IDENTIFIER": "func:recordlinker.linking.matchers.feature_match_exact"
  }
}
```

```json
{
  "blocking_keys": [
    "IDENTIFIER",
    "BIRTH_DATE"
  ],
  "evaluators": {
    "IDENTIFIER:SS": "func:recordlinker.linking.matchers.feature_match_exact"
  }
}
```

### Blocking

Blocking keys are an important part of the linkage process, but only from a performance perspective.
We use these values to efficiently index the documents, and pull out a subset of documents to compare
in detail. For them to work efficiently, we have some limitations to the size of the values we can
store in the blocking table. Currently, that is limited to 20 characters which is a bit arbitrary,
but the idea is to keep the values small so we can index them efficiently. If we keep that limit, we
need to use identifier values that are guaranteed to be less than that limit.

Previous research has shown that the last **4 characters of an identifier value** are often the most
unique and can be used to block on. However, we should likely also include elements of the type and
authority, if we want to provide some assurance that we are not blocking on a different identifier
altogether. For that, we recommend storing the **entire type** (which is limited to 7 characters)
and the **first 2 characters of the authority** (which is free-form).

| patient_id | blocking_key | value |
|------------|--------------|------------|
| 1 | IDENTIFIER | DL:CA:3456 |
| 1 | IDENTIFIER | SS::6789 |
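
The rows in the table above can be derived with a few lines of Python. This is a sketch under the assumptions stated in this section (entire type, first 2 characters of the authority, last 4 characters of the value, 20-character limit); the function name `blocking_value` is hypothetical.

```python
MAX_BLOCKING_VALUE_LENGTH = 20  # current limit described in this section


def blocking_value(id_type: str, authority: str, value: str) -> str:
    """Build type:authority[:2]:value[-4:] for the blocking table."""
    result = f"{id_type}:{authority[:2]}:{value[-4:]}"
    # 7 (type) + 1 + 2 (authority) + 1 + 4 (value) = 15 characters at most
    assert len(result) <= MAX_BLOCKING_VALUE_LENGTH
    return result


rows = [
    (1, "IDENTIFIER", blocking_value("DL", "CA", "A123456")),
    (1, "IDENTIFIER", blocking_value("SS", "", "123-45-6789")),
]
```

Because the longest possible value is 15 characters, this scheme stays under the 20-character limit for any type code of up to 7 characters.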

### Evaluation

As indicated in the business rules section, evaluation on identifiers can happen in two ways. The
first way is to evaluate on any identifier, meaning that the evaluation step will result in a
match if any two identifiers between the documents are a match. The second way is to evaluate on
a specific type of identifier, meaning that the evaluation step will result in a match if the
specified type of identifier between the documents is a match.

The evaluation functions will be comparing all 3 parts of the identifier tuple (type, authority,
value) when determining if two identifiers are a match. The only difference is which types
of identifiers we are going to compare.

```python

# `values` is the list of identifier strings passed to the evaluation function for `feature`
if feature == 'IDENTIFIER':
    assert values == ['DL:CA:A123456', 'SS::123-45-6789']
if feature == 'IDENTIFIER:SS':
    assert values == ['SS::123-45-6789']
```
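
A self-contained version of the same idea might look like the sketch below. The helper names `values_for_feature` and `identifiers_match` are hypothetical, and the matching rule (any shared full triplet string counts as a match) is an assumption based on the description above rather than the actual implementation.

```python
def values_for_feature(feature: str, identifiers: list[str]) -> list[str]:
    """Select the identifier strings relevant to an evaluation feature."""
    name, _, id_type = feature.partition(":")
    if name != "IDENTIFIER":
        raise ValueError(f"not an identifier feature: {feature}")
    if not id_type:
        return list(identifiers)
    # Keep only identifiers whose type (the part before the first colon) matches.
    return [i for i in identifiers if i.split(":", 1)[0] == id_type]


def identifiers_match(feature: str, incoming: list[str], existing: list[str]) -> bool:
    """Match if any full triplet string is shared between the two documents."""
    left = set(values_for_feature(feature, incoming))
    right = set(values_for_feature(feature, existing))
    return bool(left & right)


incoming = ["DL:CA:A123456", "SS::123-45-6789"]
existing = ["DL:NV:B999999", "SS::123-45-6789"]
```

In this example the two documents share a Social Security Number but not a Driver's License, so `IDENTIFIER` and `IDENTIFIER:SS` would match while `IDENTIFIER:DL` would not.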

## Alternatives Considered

The main alternative to a generic identifier triplet is to continue creating specific fields (eg
`MRN`, `SSN`, `drivers_license`) for each identifier type. This would require more configuration
and would not be as flexible as the proposed solution. However, this does allow for customizations
when blocking or evaluating on specific identifier types. For example, we know that SSN will never
require an authority, so we can make a slight reduction in the blocking value size knowing that is
never required. In the case of Driver's License, we know that the authority will always be a state,
so we could implement custom normalization logic attempting to standardize the state values (eg `CA`
vs `California`). This is more work long-term, and likely more confusing for users as each identifier
field has slightly different behavior, but does allow for maximum flexibility.

## Risks and Drawbacks

- Variations in authority values could lead to false negatives in comparisons (eg `CA` vs `California`).
- Variations in value formats could lead to false negatives in comparisons (eg `123-45-6789` vs `123456789`).
- Blocking values can't be specified per type, so if we want to block on identifiers we need to block on all.
24 changes: 12 additions & 12 deletions src/recordlinker/assets/initial_algorithms.json
@@ -15,14 +15,14 @@
"evaluators": [
{
"feature": "FIRST_NAME",
"func": "func:recordlinker.linking.matchers.feature_match_fuzzy_string"
"func": "func:recordlinker.linking.matchers.compare_fuzzy_match"
},
{
"feature": "LAST_NAME",
"func": "func:recordlinker.linking.matchers.feature_match_exact"
"func": "func:recordlinker.linking.matchers.compare_match_all"
}
],
"rule": "func:recordlinker.linking.matchers.eval_perfect_match",
"rule": "func:recordlinker.linking.matchers.rule_match",
"kwargs": {
"thresholds": {
"FIRST_NAME": 0.9,
@@ -44,14 +44,14 @@
"evaluators": [
{
"feature": "ADDRESS",
"func": "func:recordlinker.linking.matchers.feature_match_fuzzy_string"
"func": "func:recordlinker.linking.matchers.compare_fuzzy_match"
},
{
"feature": "BIRTHDATE",
"func": "func:recordlinker.linking.matchers.feature_match_exact"
"func": "func:recordlinker.linking.matchers.compare_match_all"
}
],
"rule": "func:recordlinker.linking.matchers.eval_perfect_match",
"rule": "func:recordlinker.linking.matchers.rule_match",
"kwargs": {
"thresholds": {
"FIRST_NAME": 0.9,
@@ -81,14 +81,14 @@
"evaluators": [
{
"feature": "FIRST_NAME",
"func": "func:recordlinker.linking.matchers.feature_match_log_odds_fuzzy_compare"
"func": "func:recordlinker.linking.matchers.compare_probabilistic_fuzzy_match"
},
{
"feature": "LAST_NAME",
"func": "func:recordlinker.linking.matchers.feature_match_log_odds_fuzzy_compare"
"func": "func:recordlinker.linking.matchers.compare_probabilistic_fuzzy_match"
}
],
"rule": "func:recordlinker.linking.matchers.eval_log_odds_cutoff",
"rule": "func:recordlinker.linking.matchers.rule_probabilistic_match",
"kwargs": {
"similarity_measure": "JaroWinkler",
"thresholds": {
@@ -123,14 +123,14 @@
"evaluators": [
{
"feature": "ADDRESS",
"func": "func:recordlinker.linking.matchers.feature_match_log_odds_fuzzy_compare"
"func": "func:recordlinker.linking.matchers.compare_probabilistic_fuzzy_match"
},
{
"feature": "BIRTHDATE",
"func": "func:recordlinker.linking.matchers.feature_match_log_odds_fuzzy_compare"
"func": "func:recordlinker.linking.matchers.compare_probabilistic_fuzzy_match"
}
],
"rule": "func:recordlinker.linking.matchers.eval_log_odds_cutoff",
"rule": "func:recordlinker.linking.matchers.rule_probabilistic_match",
"kwargs": {
"similarity_measure": "JaroWinkler",
"thresholds": {
12 changes: 12 additions & 0 deletions src/recordlinker/database/mpi_service.py
@@ -236,3 +236,15 @@ def reset_mpi(session: orm.Session, commit: bool = True):
    session.query(models.Person).delete()
    if commit:
        session.commit()


def delete_patient(session: orm.Session, obj: models.Patient, commit: bool = False) -> None:
    """
    Deletes a Patient from the database

    :param session: The database session
    :param obj: The Patient to delete
    :param commit: Commit the transaction
    """
    session.delete(obj)
    if commit:
        session.commit()