Skip to content
This repository has been archived by the owner on Jun 9, 2022. It is now read-only.
/ data-extractor Public archive

Simple data extractor/shuffler

License

Notifications You must be signed in to change notification settings

hmcts/data-extractor

Repository files navigation

data-extractor

Archive notice

This repository is archive now, SDP ingestion has moved to Azure data factory https://github.com/hmcts/mi-data-ingestion

Simple data extractor/shuffler

Runs a sql query against a database and saves the results in an azure storage container.

It can export the results as one of:

  • csv
  • json
  • jsonlines

It obtains its configuration from the following environment variables:

  • ETL_DB_URL: jdbc connection url (e.g. "jdbc:postgresql://localhost:5432")
  • ETL_DB_USER_FILE: file containing the db username relative to "/mnt/secrets" (e.g. data-extractor/aat-ccd-user)
  • ETL_DB_PASSWORD_FILE: file containing the db password relative to "/mnt/secrets" (e.g. data-extractor/aat-ccd-pwd)
  • ETL_MSI_CLIENT_ID: pod identity client id to get credentials from keyvault and write access to blob storage.
  • ETL_CONNECTIONSTRING: This is an alternative authentication system in case manage identity is not setup in the cluster.
  • ETL_ACCOUNT: Azure storage account where output should be saved (e.g. "devstoreaccount1")
  • EXTRACTION_CASETYPES: Configuration list to extract each case type. This list can contain the following values:
    • CONTAINER: Azure storage container where output should be saved (e.g. "testcontainer")
    • TYPE: output file type. One of: jsonlines, csv, json (default "jsonlines")
    • PREFIX: prefix for the output file (e.g. "test01").
    • CASETYPE: CCD case type to extract

The 2 values: ETL_DB_USER_FILE and ETL_DB_PASSWORD_FILE are useful if the username and password are retrieved from Azure keyvault and exposed as flexvolumes. The same username and password can alternatively be passed as environment variables (ETL_DB_USER and ETL_DB_PASSWORD).

The output file obtained contains all the records generated by the query. File naming follows this convention: <prefix>-<datetime>.<type>

Helm chart

The easiest way to run a job is by using the included helm chart which is based on chart-job. This can be done running the following command: helm install hmcts/data-extractor-job --name data-extractor-job-001 --namespace mi -f job-values.yaml --wait where job-values.yaml is:

job:
  image: hmcts.azurecr.io/hmcts/data-extractor-job:prod-f888e665
  aadIdentityName: mi
  keyVaults:
    "data-extractor":
      resourceGroup: data-extractor 
      secrets:
        - ccdro-user
        - ccdro-password
        - appinsights-instrumentationkey
  labels:
    app.kubernetes.io/instance : data-extractor-job-001
    app.kubernetes.io/name: data-extractor-job
  environment:
    ETL_DB_URL: jdbc:postgresql://ccd-data-store-api-postgres-db-aat.postgres.database.azure.com:5432/ccd_data_store
    ETL_DB_USER_FILE: data-extractor/ccdro-user
    ETL_DB_PASSWORD_FILE: data-extractor/ccdro-password
    ETL_SQL: >
      SELECT id, created_date, event_id, summary, description, user_id, case_data_id,
      case_type_id, case_type_version, state_id, user_first_name, user_last_name,
      event_name, state_name, security_classification
      FROM case_event
      WHERE created_date >= (current_date-1 + time '00:00')
      AND created_date < (current_date + time '00:00')
      ORDER BY created_date ASC;
    ETL_ACCOUNT: midatastg
    ETL_MSI_CLIENT_ID: 1461ff03-675c-423c-95a4-fb50d31254ff
    EXTRACTION_CASETYPES_0__CONTAINER: "test-container"
    EXTRACTION_CASETYPES_0__CASETYPE: "Caveat"
    EXTRACTION_CASETYPES_0__TYPE: "jsonlines"
    EXTRACTION_CASETYPES_0__PREFIX: "CCD-TEST"
global:
  job:
    kind: Job
  subscriptionId: "1c4f0704-a29e-403d-b719-b90c34ef14c9"
  tenantId: "531ff96d-0ae9-462a-8d2d-bec7c0b42082"
  environment: aat

For an example of how to run this using flux please see: flux github repo

About

Simple data extractor/shuffler

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published