feat: added algorithm updates
cbrinson-rise8 committed Dec 11, 2024
1 parent 5dc914e commit 4000764
Showing 17 changed files with 192 additions and 1,289 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -73,3 +73,6 @@ __pycache__/

# Databases
*.sqlite3

# Test result files
output.csv
2 changes: 1 addition & 1 deletion tests/algorithm/Dockerfile.algo
@@ -9,4 +9,4 @@ COPY scripts /app/scripts
COPY data /app/data

# Install Python dependencies
RUN pip install --no-cache-dir pandas requests
RUN pip install --no-cache-dir requests
53 changes: 40 additions & 13 deletions tests/algorithm/README.md
@@ -1,6 +1,6 @@
# Record Linkage Algorithm Testing

This repository contains a project to test the effectiveness of the RecordLinker algorithm.
This repository contains a project to evaluate the match accuracy performance of the RecordLinker algorithm.

## Prerequisites

@@ -12,48 +12,75 @@ Before getting started, ensure you have the following installed:
## Directory Structure

- `/`: Contains the `.env` file and `Dockerfile` to build
- `configurations/`: Contains the configuration file for the algorithm tests
- `data/`: Contains the data `.csv` files used for the algorithm tests (seed file and test file)
- `results/`: Contains the results of the algorithm tests
- `scripts/`: Contains the scripts to run the algorithm tests
- `configurations/`: Contains the configuration `.json` file that will be used for the test
- `data/`: Contains the data `.csv` files used for the algorithm test (seed file and test file)
- `results/`: Contains the results `.csv` file after running the test
- `scripts/`: Contains the scripts to run the test

## Steup
## Setup

1. Build the Docker images:

```bash
docker compose --profile algo-test build
```

2. Configure environment variables
2. Add seed and test data files
You can use the sample data files provided in the `data` directory or add your own data files.
The format of the input files should be a CSV file with the same column headers as shown in the sample files.

`/data/sample_seed_data.csv`

`/data/sample_test_data.csv`


3. Configure environment variables

`/algo.env`

Edit the environment variables in the file

3. Edit the algorithm configuration file
4. Edit the algorithm configuration file

`/configurations/algorithm_configuration.json`

Edit the configuration file to tune the algorithm parameters

## Running Algorithm Tests

1. Run the tests
1. Run the test

```bash
docker compose --profile algo-test run --rm algo-test-runner python scripts/run_test.py
docker compose run --rm algo-test-runner scripts/run_test.py
```

2. Analyze the results

The results of the algorithm tests will be available in the `results/output.csv` file.

The results will be in a csv formatted file with each test case number, the expected result, and the actual response from the algorithm.
The results will be in a CSV formatted file with the following columns:
`Test Case #`, `Expected Result`, `Match Result`, `Details`
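
One way to summarize such a results file is to count rows per (`Expected Result`, `Match Result`) pair. The sketch below assumes only the four column names listed above; the function name `tally_results` and the example `Match Result` values are hypothetical and may differ from what `run_test.py` actually writes.

```python
import csv
import io
from collections import Counter


def tally_results(csv_file):
    """Count rows for each (Expected Result, Match Result) pair."""
    reader = csv.DictReader(csv_file)
    return Counter((row["Expected Result"], row["Match Result"]) for row in reader)


# Hypothetical rows for illustration only; the "Match Result" values are
# made up and are not taken from a real run.
example = io.StringIO(
    "Test Case #,Expected Result,Match Result,Details\n"
    "1,Should be a Match,Matched,\n"
    "2,Should be a Match,Matched,\n"
    "3,Should fail,Failed,\n"
)
for (expected, actual), count in sorted(tally_results(example).items()):
    print(f"{expected} -> {actual}: {count}")
```

Because the tally keys on whatever strings appear in the file, it needs no knowledge of the exact labels the test script emits.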

## Rerunning Algorithm Tests

After you've run the algorithm tests, you may want to rerun the tests with different seed data, test data, or configurations.
Edit the csv files and/or the configuration file as needed and then run the following commands to rerun the tests.
1. Reset the mpi database
```bash
docker compose run --rm algo-test-runner python scripts/reset_db.py
```
2. Run the tests
```bash
docker compose run --rm algo-test-runner scripts/run_test.py
```
## Environment Variables
1. `env_file`: The attributes that should be tuned for your particular algorithm test,
1. `env file`: The attributes that should be tuned for your particular algorithm test,
are located in the `algo_test.env` file.
2. `environment`: The attributes that should likely remain static for all algorithm tests are located directly in the `compose.yml` file.
@@ -74,4 +101,4 @@ After you've finished running algorithm tests and analyzing the results, you can

```bash
docker compose --profile algo-test down
```
4 changes: 2 additions & 2 deletions tests/algorithm/algo.env
@@ -1,4 +1,4 @@
SEED_FILE="data/seed_data.csv"
TEST_FILE="data/test_data.csv"
SEED_FILE="data/sample_seed_data.csv"
TEST_FILE="data/sample_test_data.csv"
ALGORITHM_CONFIGURATION="configurations/algorithm_configuration.json"
ALGORITHM_NAME="test-config"
3 changes: 1 addition & 2 deletions tests/algorithm/configurations/algorithm_configuration.json
@@ -7,8 +7,7 @@
"passes": [
{
"blocking_keys": [
"BIRTHDATE",
"SEX"
"BIRTHDATE"
],
"evaluators": [
{
6 changes: 6 additions & 0 deletions tests/algorithm/data/sample_seed_data.csv
@@ -0,0 +1,6 @@
Match Id,ID,BIRTHDATE,FIRST,LAST,SUFFIX,MAIDEN,RACE,ETHNICITY,GENDER,ADDRESS,CITY,STATE,COUNTY,ZIP,SSN,
1,3020167,1951-06-02,Linda,Nash,Sr,Gutierrez,Asian,Hispanic,F,968 Gonzalez Mount,South Emilybury,GU,North Kennethburgh County,93236,675-79-1449,
2,9488697,1942-08-03,Jose,Singleton,Sr,Ingram,Asian,Hispanic,M,631 Fowler Causeway,Port Williamfurt,IN,Wardburgh County,90637,587-60-3668,
3,1805504,1963-01-29,Ryan,Lawrence,IV,Armstrong,Black,Non-Hispanic,M,5256 Lisa Light,Port Monica,GA,South Christine County,51813,371-33-0433,
4,1792678,1950-08-10,Thomas,Brady,II,Cobb,White,Unknown,M,944 Hayes Port,Jonesville,FM,Jonesview County,6015,272-78-9905,
5,1332302,1972-08-26,Angie,Murphy,Sr,Mcmahon,Black,Non-Hispanic,F,60015 Edward Vista Suite 518,Lake Andreaview,UT,North Rodney County,46540,740-16-5170,
7 changes: 7 additions & 0 deletions tests/algorithm/data/sample_test_data.csv
@@ -0,0 +1,7 @@
Test Case #,Match Id,ID,BIRTHDATE,FIRST,LAST,SUFFIX,MAIDEN,RACE,ETHNICITY,GENDER,ADDRESS,CITY,STATE,COUNTY,ZIP,SSN,Expected Result
1,1,3020167,1951-06-02,Linda,Nash,Jr,Gutierrez,Asian,Hispanic,F,968 Gonzalez Mount,South Emilybury,GU,North Kennethburgh County,93236,675-79-1449,Should be a Match
2,2,9488697,1942-08-03,Singleton,Jose,Sr,Ingram,Asian,Hispanic,M,631 Fowler Causeway,Port Williamfurt,IN,Wardburgh County,90637,587-60-3668,Should be a Match
3,3,1805504,1963-01-29,Ryan,Law-rence,IV,Armstrong,Black,Non-Hispanic,M,5256 Lisa Light,Port Monica,GA,South Christine County,51813,371-33-0433,Should be a Match
4,4,1792678,1950-08-10,Tho-mas,Brady,II,Cobb,White,Unknown,M,944 Hayes Port,Jonesville,FM,Jonesview County,6015,272-78-9905,Should be a Match
5,4,1792678,1950-08-10,ThoMas,Brady,II,Cobb,White,Unknown,M,944 Hayes Port,Jonesville,FM,Jonesview County,6015,272-78-9905,Should be a Match
6,0,1792679,1950-18-10,ThoMas,Brady,II,Cobb,White,Unknown,M,944 Hayes Port,Jonesville,FM,Jonesview County,6015,272-78-9905,Should fail