NBS Test Cases #123

Merged · 4 commits · Dec 11, 2024
3 changes: 3 additions & 0 deletions .gitignore
@@ -73,3 +73,6 @@ __pycache__/

# Databases
*.sqlite3

# Test result files
output.csv
22 changes: 22 additions & 0 deletions compose.yml
@@ -76,3 +76,25 @@ services:
depends_on:
api:
condition: service_healthy

algo-test-runner:
build:
context: tests/algorithm
dockerfile: Dockerfile.algo
env_file:
- tests/algorithm/algo.env
environment:
DB_URI: "postgresql+psycopg2://postgres:pw@db:5432/postgres"
API_URL: "http://api:8080"
volumes:
- ./tests/algorithm/scripts:/app/scripts
- ./tests/algorithm/data:/app/data
- ./tests/algorithm/results:/app/results
- ./tests/algorithm/configurations:/app/configurations
depends_on:
db:
condition: service_healthy
api:
condition: service_healthy
profiles:
- algo-test
12 changes: 12 additions & 0 deletions tests/algorithm/Dockerfile.algo
@@ -0,0 +1,12 @@
# Use the official Python 3.12 slim image as the base
FROM python:3.12-slim

# Set the working directory
WORKDIR /app

# Copy the scripts and data directories into the image
COPY scripts /app/scripts
COPY data /app/data

# Install Python dependencies
RUN pip install --no-cache-dir requests
104 changes: 104 additions & 0 deletions tests/algorithm/README.md
@@ -0,0 +1,104 @@
# Record Linkage Algorithm Testing

This directory contains a test harness for evaluating the match accuracy of the RecordLinker algorithm.

## Prerequisites

Before getting started, ensure you have the following installed:

- [Docker](https://docs.docker.com/engine/install/)
- [Docker Compose](https://docs.docker.com/compose/install/)

## Directory Structure

- `/`: Contains the `algo.env` environment file and the `Dockerfile.algo` used to build the test-runner image
- `configurations/`: Contains the configuration `.json` file that will be used for the test
- `data/`: Contains the data `.csv` files used for the algorithm test (seed file and test file)
- `results/`: Contains the results `.csv` file after running the test
- `scripts/`: Contains the scripts to run the test

## Setup

1. Build the Docker images:

```bash
docker compose --profile algo-test build
```

2. Add seed and test data files

You can use the sample data files provided in the `data` directory or add your own.
Each input file should be a CSV with the same column headers as shown in the sample files:

`/data/sample_seed_data.csv`

`/data/sample_test_data.csv`


3. Configure environment variables

Edit the environment variables in `/algo.env` as needed for your test run.

4. Edit the algorithm configuration file

Edit `/configurations/algorithm_configuration.json` to tune the algorithm parameters.

## Running Algorithm Tests

1. Run the test

```bash
docker compose run --rm algo-test-runner python scripts/run_test.py
```

2. Analyze the results

The results of the algorithm tests will be available in the `results/output.csv` file.

The results will be in a CSV formatted file with the following columns:
`Test Case #`, `Expected Result`, `Match Result`, `Details`
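As a quick first look at `output.csv`, a short script can tally how many test cases landed on each match result before you dig into the `Details` column. This is a sketch that assumes only the column names listed above:

```python
import csv


def summarize_results(path: str) -> dict:
    """Count how many rows in the results CSV carry each Match Result value."""
    counts: dict = {}
    with open(path, newline="", encoding="utf-8") as fobj:
        for row in csv.DictReader(fobj):
            result = row.get("Match Result", "")
            counts[result] = counts.get(result, 0) + 1
    return counts
```

Calling `summarize_results("results/output.csv")` after a run gives a per-result count in one line per outcome.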

## Rerunning Algorithm Tests

After you've run the algorithm tests, you may want to rerun the tests with different seed data, test data, or configurations.

Edit the CSV files and/or the configuration file as needed, then run the following commands to rerun the tests.

1. Reset the MPI (Master Patient Index) database

```bash
docker compose run --rm algo-test-runner python scripts/reset_db.py
```

2. Run the tests

```bash
docker compose run --rm algo-test-runner python scripts/run_test.py
```

## Environment Variables

1. `env_file`: The attributes that should be tuned for a particular algorithm test are located in the `algo.env` file.

2. `environment`: The attributes that should remain static across algorithm tests are set directly in the `compose.yml` file.

### Algorithm Test Parameters

The following environment variables can be tuned in the `algo.env` file:

- `SEED_FILE`: The file containing person data to seed the MPI with
- `TEST_FILE`: The file containing patient data to test the algorithm with
- `ALGORITHM_CONFIGURATION`: The file containing the algorithm configuration JSON
- `ALGORITHM_NAME`: The name of the algorithm to use (either the `label` from your `ALGORITHM_CONFIGURATION` file, or one of the built-in `dibbs-basic` or `dibbs-enhanced` algorithms)


## Cleanup

After you've finished running algorithm tests and analyzing the results, you can stop and remove the Docker containers by running:

```bash
docker compose --profile algo-test down
```
4 changes: 4 additions & 0 deletions tests/algorithm/algo.env
@@ -0,0 +1,4 @@
SEED_FILE="data/sample_seed_data.csv"
TEST_FILE="data/sample_test_data.csv"
ALGORITHM_CONFIGURATION="configurations/algorithm_configuration.json"
ALGORITHM_NAME="test-config"
66 changes: 66 additions & 0 deletions tests/algorithm/configurations/algorithm_configuration.json
@@ -0,0 +1,66 @@
{
"label": "test-config",
"description": "test algorithm configuration",
"is_default": false,
"include_multiple_matches": true,
"belongingness_ratio": [0.75, 0.9],
"passes": [
{
"blocking_keys": [
"BIRTHDATE"
],
"evaluators": [
{
"feature": "FIRST_NAME",
"func": "func:recordlinker.linking.matchers.feature_match_fuzzy_string"
},
{
"feature": "LAST_NAME",
"func": "func:recordlinker.linking.matchers.feature_match_exact"
}
],
"rule": "func:recordlinker.linking.matchers.eval_perfect_match",
"cluster_ratio": 0.9,
"kwargs": {
"thresholds": {
"FIRST_NAME": 0.9,
"LAST_NAME": 0.9,
"BIRTHDATE": 0.95,
"ADDRESS": 0.9,
"CITY": 0.92,
"ZIP": 0.95
}
}
},
{
"blocking_keys": [
"ZIP",
"FIRST_NAME",
"LAST_NAME",
"SEX"
],
"evaluators": [
{
"feature": "ADDRESS",
"func": "func:recordlinker.linking.matchers.feature_match_fuzzy_string"
},
{
"feature": "BIRTHDATE",
"func": "func:recordlinker.linking.matchers.feature_match_exact"
}
],
"rule": "func:recordlinker.linking.matchers.eval_perfect_match",
"cluster_ratio": 0.9,
"kwargs": {
"thresholds": {
"FIRST_NAME": 0.9,
"LAST_NAME": 0.9,
"BIRTHDATE": 0.95,
"ADDRESS": 0.9,
"CITY": 0.92,
"ZIP": 0.95
}
}
}
]
}
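Before a test run, a configuration file like the one above can be given a rough pre-flight check. The sketch below derives its required key names from this sample file only, not from a published schema, so treat the key sets as assumptions:

```python
import json

# Keys present in every configuration / pass of the sample file above
# (an assumption based on that sample, not an exhaustive schema).
REQUIRED_TOP_LEVEL = {"label", "passes"}
REQUIRED_PASS_KEYS = {"blocking_keys", "evaluators", "rule"}


def validate_config(path: str) -> list:
    """Return a list of human-readable problems found in a config file."""
    problems = []
    with open(path, "rb") as fobj:
        config = json.load(fobj)
    for key in sorted(REQUIRED_TOP_LEVEL - config.keys()):
        problems.append(f"missing top-level key: {key}")
    for idx, blocking_pass in enumerate(config.get("passes", [])):
        for key in sorted(REQUIRED_PASS_KEYS - blocking_pass.keys()):
            problems.append(f"pass {idx}: missing key: {key}")
    return problems
```

An empty return value means every expected key was found; anything else names the missing key and the pass it belongs to.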
6 changes: 6 additions & 0 deletions tests/algorithm/data/sample_seed_data.csv
@@ -0,0 +1,6 @@
Match Id,ID,BIRTHDATE,FIRST,LAST,SUFFIX,MAIDEN,RACE,ETHNICITY,GENDER,ADDRESS,CITY,STATE,COUNTY,ZIP,SSN,
1,3020167,1951-06-02,Linda,Nash,Sr,Gutierrez,Asian,Hispanic,F,968 Gonzalez Mount,South Emilybury,GU,North Kennethburgh County,93236,675-79-1449,
2,9488697,1942-08-03,Jose,Singleton,Sr,Ingram,Asian,Hispanic,M,631 Fowler Causeway,Port Williamfurt,IN,Wardburgh County,90637,587-60-3668,
3,1805504,1963-01-29,Ryan,Lawrence,IV,Armstrong,Black,Non-Hispanic,M,5256 Lisa Light,Port Monica,GA,South Christine County,51813,371-33-0433,
4,1792678,1950-08-10,Thomas,Brady,II,Cobb,White,Unknown,M,944 Hayes Port,Jonesville,FM,Jonesview County,6015,272-78-9905,
5,1332302,1972-08-26,Angie,Murphy,Sr,Mcmahon,Black,Non-Hispanic,F,60015 Edward Vista Suite 518,Lake Andreaview,UT,North Rodney County,46540,740-16-5170,
7 changes: 7 additions & 0 deletions tests/algorithm/data/sample_test_data.csv
@@ -0,0 +1,7 @@
Test Case #,Match Id,ID,BIRTHDATE,FIRST,LAST,SUFFIX,MAIDEN,RACE,ETHNICITY,GENDER,ADDRESS,CITY,STATE,COUNTY,ZIP,SSN,Expected Result
1,1,3020167,1951-06-02,Linda,Nash,Jr,Gutierrez,Asian,Hispanic,F,968 Gonzalez Mount,South Emilybury,GU,North Kennethburgh County,93236,675-79-1449,Should be a Match
2,2,9488697,1942-08-03,Singleton,Jose,Sr,Ingram,Asian,Hispanic,M,631 Fowler Causeway,Port Williamfurt,IN,Wardburgh County,90637,587-60-3668,Should be a Match
3,3,1805504,1963-01-29,Ryan,Law-rence,IV,Armstrong,Black,Non-Hispanic,M,5256 Lisa Light,Port Monica,GA,South Christine County,51813,371-33-0433,Should be a Match
4,4,1792678,1950-08-10,Tho-mas,Brady,II,Cobb,White,Unknown,M,944 Hayes Port,Jonesville,FM,Jonesview County,6015,272-78-9905,Should be a Match
5,4,1792678,1950-08-10,ThoMas,Brady,II,Cobb,White,Unknown,M,944 Hayes Port,Jonesville,FM,Jonesview County,6015,272-78-9905,Should be a Match
6,0,1792679,1950-18-10,ThoMas,Brady,II,Cobb,White,Unknown,M,944 Hayes Port,Jonesville,FM,Jonesview County,6015,272-78-9905,Should fail
43 changes: 43 additions & 0 deletions tests/algorithm/scripts/helpers.py
@@ -0,0 +1,43 @@
import json


def dict_to_pii(record_data: dict) -> dict:
    """Convert a CSV row dict to a pii_record for the seeding API."""
pii_record = {
"external_id": record_data.get('ID', None),
"birth_date": record_data.get("BIRTHDATE", None),
"sex": record_data.get("GENDER", None),
"address": [
{
"line": [record_data.get("ADDRESS", None)],
"city": record_data.get("CITY", None),
"state": record_data.get("STATE", None),
"county": record_data.get("COUNTY", None),
"postal_code": str(record_data.get("ZIP", ""))
}
],
"name": [
{
"given": [record_data.get("FIRST", None)],
"family": record_data.get("LAST", None),
"suffix": [record_data.get("SUFFIX", None)]
}
],
"ssn": record_data.get("SSN", None),
"race": record_data.get("RACE", None)
}

return pii_record


def load_json(file_path: str) -> dict | None:
"""
Load JSON data from a file.
"""
with open(file_path, "rb") as fobj:
try:
content = json.load(fobj)
return content
except json.JSONDecodeError as exc:
print(f"Error loading JSON file: {exc}")
return None
18 changes: 18 additions & 0 deletions tests/algorithm/scripts/reset_db.py
@@ -0,0 +1,18 @@
import os

import requests


def reset_db(api_url):
print("Resetting the database...")
try:
response = requests.delete(f"{api_url}/seed")
response.raise_for_status() # Raise an error for bad status codes
print("Database reset successfully")
except requests.exceptions.RequestException as e:
print(f"Failed to reset the database: {e}")


if __name__ == "__main__":
api_url = os.getenv("API_URL")
reset_db(api_url)
33 changes: 33 additions & 0 deletions tests/algorithm/scripts/run_test.py
@@ -0,0 +1,33 @@
#!/usr/bin/env python3

import os

from helpers import load_json
from seed_db import seed_database
from send_test_records import send_test_records
from set_configuration import add_configuration
from set_configuration import check_if_config_already_exists
from set_configuration import update_configuration


def main():
# Get the environment variables
api_url = os.getenv("API_URL")
algorithm_name = os.getenv("ALGORITHM_NAME")
algorithm_config_file = os.getenv("ALGORITHM_CONFIGURATION")
seed_csv = os.getenv("SEED_FILE")
test_csv = os.getenv("TEST_FILE")

# setup the algorithm configuration
algorithm_config = load_json(algorithm_config_file)
if check_if_config_already_exists(algorithm_config, api_url):
update_configuration(algorithm_config, api_url)
else:
add_configuration(algorithm_config, api_url)

seed_database(seed_csv, api_url)

send_test_records(test_csv, algorithm_name, api_url)

if __name__ == "__main__":
main()
43 changes: 43 additions & 0 deletions tests/algorithm/scripts/seed_db.py
@@ -0,0 +1,43 @@
import csv

import requests
from helpers import dict_to_pii


def seed_database(csv_file, api_url):
MAX_CLUSTERS = 100
cluster_group = []

print("Seeding the database...")

# Read the CSV file using the csv module
with open(csv_file, mode='r', newline='', encoding='utf-8') as file:
reader = csv.DictReader(file)

for row in reader:
record_data = {k: ("" if v in [None, "NaN"] else v) for k, v in row.items()}

# convert dict to a pii_record
pii_record = dict_to_pii(record_data)

# nesting for the seeding api request
cluster = {"records": [pii_record]}
cluster_group.append(cluster)

if len(cluster_group) == MAX_CLUSTERS:
send_clusters_to_api(cluster_group, api_url)
cluster_group = []

if cluster_group:
send_clusters_to_api(cluster_group, api_url)

print("Finished seeding the database.")


def send_clusters_to_api(cluster_group, api_url):
"""Helper function to send a batch of clusters to the API."""
try:
response = requests.post(f"{api_url}/seed", json={"clusters": cluster_group})
response.raise_for_status() # Raise an error for bad status codes
except requests.exceptions.RequestException as e:
print(f"Failed to post batch: {e}")
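To make the nesting that `send_clusters_to_api` builds concrete, the body posted to the `/seed` endpoint has the shape below. The field values are illustrative and truncated, not real sample data:

```python
# One pii_record per cluster, wrapped exactly as seed_db.py does it;
# the record fields here are illustrative and truncated.
pii_record = {
    "external_id": "3020167",
    "birth_date": "1951-06-02",
    "sex": "F",
}
payload = {"clusters": [{"records": [pii_record]}]}
# requests.post(f"{api_url}/seed", json=payload) would send this body.
```

Each cluster in the batch holds its own `records` list, which is why the script appends `{"records": [pii_record]}` rather than the bare record.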