Data Visibility MVP Tech spec
Currently, VA.gov activity data, including disability benefits claim submission data, is functionally inaccessible to Benefits Portfolio product teams, with the exception of a handful of engineers who have command-line access to query the production Postgres database in vets-api. OCTO wants to develop a safer, more accessible, and more user-friendly way for teams to access this data.
Thus, as an MVP, VRO as a platform will be responsible for safely and securely providing VRO partner teams within the Benefits portfolio with the claims data submitted via 526EZ forms through va.gov.
To make this happen, the VRO team is responsible for coordinating this effort via collaboration across the Benefits Portfolio, in particular with the Disability Benefits Experience team(s), who are familiar with the va.gov Postgres database and the needs of engineers working on va.gov benefits products.
- Disability benefits claim submission data is only available via the Rails console in production.
- Teams cannot use any BI/dashboarding tools to view metrics.
- Focused on the 526EZ form benefits claims submission data.
- A data dump from the production vets.gov Postgres DB happens daily into an S3 bucket through another process.
- Data at rest is decrypted before being dumped into the bucket.
- The S3 bucket is already set up, encrypted, and secured via SSE-KMS or other AWS-provided options.
- Benefits claims data is initially available via SQL from the VRO Postgres DB.
- Utilize a Kubernetes cron job to run a Python script daily.
- Use Pandas or another Python dataframe library to read the CSV file, sanitize the data, filter out any unwanted data, and standardize datetimes if necessary (a rough sketch of this step follows this list).
- Keep track of processed CSV file names (S3 object keys) in a database.
- Store the processed claims data in the database using transactions (see the transactional-insert sketch below).
- Provide a retry mechanism for errors that occur during the cron job.
- Send a Slack notification when a dump has been processed or has failed (see the retry/notification sketch below).
- Create a Datadog dashboard to monitor the cron jobs.
- Generate a fake-data CSV file without any PII to simulate daily dumps.
- To emulate S3 bucket functionality locally, use LocalStack (see the local-development sketch below).
- Rather than using docker-compose.yaml files for the container, use the Kubernetes deployment files used for the LHDI environment locally and leverage minikube to run the container locally; this can be integrated into the existing Gradle tasks. An added step is that VRO will need to install minikube.
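As a rough illustration of the ingest step the daily Kubernetes cron job would invoke, the sketch below reads one dump from S3 with boto3 and sanitizes it with pandas. The bucket name, key prefix, and column names (`form_type`, `created_at`) are hypothetical placeholders, not the real vets.gov schema.

```python
import io

import boto3
import pandas as pd

# Hypothetical bucket and prefix; the real names come from the LHDI environment config.
BUCKET = "vro-claims-dumps"
PREFIX = "daily/"


def read_daily_dump(key: str) -> pd.DataFrame:
    """Download one CSV dump from S3 and load it into a dataframe."""
    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    return pd.read_csv(io.BytesIO(body))


def sanitize(df: pd.DataFrame) -> pd.DataFrame:
    """Drop unwanted rows and standardize datetimes (column names are assumptions)."""
    # Keep only 526EZ submissions; 'form_type' is a placeholder column name.
    df = df[df["form_type"] == "21-526EZ"]
    # Standardize the submission timestamp to UTC; 'created_at' is a placeholder.
    df["created_at"] = pd.to_datetime(df["created_at"], utc=True, errors="coerce")
    # Drop rows whose timestamp could not be parsed.
    return df.dropna(subset=["created_at"])
```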
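A minimal sketch of the bookkeeping and storage step, assuming psycopg2 and hypothetical `processed_files` and `claims` tables; the real table and column names would come from the VRO schema. Recording the processed S3 key and inserting the rows in the same transaction keeps the tracking table and claims data consistent.

```python
import pandas as pd
import psycopg2
import psycopg2.extras


def store_dump(conn, key: str, df: pd.DataFrame) -> None:
    """Record the processed S3 key and insert the claims rows in a single transaction."""
    with conn:  # commits on success, rolls back on any exception
        with conn.cursor() as cur:
            # Skip files we have already processed ('processed_files' is a placeholder table).
            cur.execute("SELECT 1 FROM processed_files WHERE s3_key = %s", (key,))
            if cur.fetchone():
                return
            cur.execute("INSERT INTO processed_files (s3_key) VALUES (%s)", (key,))
            # Bulk-insert the sanitized rows ('claims' and its columns are placeholders).
            rows = list(
                df[["claim_id", "form_type", "created_at"]].itertuples(index=False, name=None)
            )
            psycopg2.extras.execute_values(
                cur,
                "INSERT INTO claims (claim_id, form_type, created_at) VALUES %s",
                rows,
            )
```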
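One possible shape for the retry and Slack-notification behavior, assuming a simple in-process retry loop and a Slack incoming webhook; the CronJob's own backoff settings could replace or complement this. The webhook URL is a placeholder and would be injected from a secret.

```python
import time

import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder; real URL comes from a secret


def notify_slack(message: str) -> None:
    """Post a status message to the team channel via an incoming webhook."""
    requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)


def run_with_retries(job, attempts: int = 3, backoff_seconds: int = 60) -> None:
    """Retry the ingest job a few times before reporting failure to Slack."""
    for attempt in range(1, attempts + 1):
        try:
            job()
            notify_slack("Daily claims dump processed successfully.")
            return
        except Exception as exc:  # report the failure on the final attempt
            if attempt == attempts:
                notify_slack(f"Daily claims dump failed after {attempts} attempts: {exc}")
                raise
            time.sleep(backoff_seconds * attempt)
```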
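For local development, a sketch of generating a PII-free fake dump and uploading it to LocalStack's S3 emulation; the endpoint, dummy credentials, bucket name, and key are all placeholders, and the columns mirror the hypothetical schema used above.

```python
import boto3
import pandas as pd


def upload_fake_dump(endpoint_url: str = "http://localhost:4566") -> None:
    """Generate a small PII-free CSV and upload it to a LocalStack-emulated S3 bucket."""
    # LocalStack's default edge port is 4566; credentials can be dummy values.
    s3 = boto3.client(
        "s3",
        endpoint_url=endpoint_url,
        aws_access_key_id="test",
        aws_secret_access_key="test",
        region_name="us-east-1",
    )
    s3.create_bucket(Bucket="vro-claims-dumps")  # placeholder bucket name

    # Fake, PII-free claims rows to simulate a daily dump.
    df = pd.DataFrame(
        {
            "claim_id": [1001, 1002, 1003],
            "form_type": ["21-526EZ"] * 3,
            "created_at": pd.date_range("2023-01-01", periods=3, freq="D"),
        }
    )
    s3.put_object(
        Bucket="vro-claims-dumps",
        Key="daily/fake-dump.csv",
        Body=df.to_csv(index=False).encode("utf-8"),
    )
```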
```mermaid
graph TD
    subgraph LHDI AWS
        subgraph S3 Bucket
            csv-files
        end
        DB[("(Platform)\nDB")]
    end
    csv-files <-.-> cron-job
    local-stack-or-minio <-.-> cron-job
    subgraph VRO
        subgraph local-env[Local Env]
            local-stack-or-minio[localstack to emulate S3]
            local-db[("(Local)\nDB")]
        end
        subgraph Kubernetes
            cron-job[Cron Job written in Python] -.->|Benefits Claims data| DB
            subgraph cron-job
                pandas-with-python[Python with Pandas]
            end
        end
    end
    pandas-with-python -.->|Errors and logs| DataDog
    pandas-with-python -.->|Errors and Success Messages| Slack
    pandas-with-python -.->|Store processed cron-job transaction history| DB
    DB <-.-> data-visualization[Data Visualization tool]

    style DB fill:#aea,stroke-width:4px
    style cron-job fill:#AAF,stroke-width:2px,stroke:#777
    style local-stack-or-minio fill:#AAA,stroke-width:2px,stroke:#777
    style data-visualization fill:#BFD,stroke-width:2px,stroke:#777
```
- How do we handle storage of PII data because of possible ATO restrictions?
- What exactly does current data look like? This can help us design exception handling and job-retry mechanisms.
- Have a backup mechanism in place for the data in case of any failures or data loss.