btcooper22/ace_project

Connected Bradford scripts (subfolder = cBradford)

Extraction scripts

These scripts use the 50k subset of the "Connected Bradford/Yorkshire" dataset (henceforth cBradford) to construct queries that extract relevant data from the full dataset. Each script therefore generates an SQL script (stored alongside it in the queries folder), which was then run against the full dataset. All scripts run on cBradford return results only for patients in the ACE dataset. Results are stored in .csv files in data/cBradford and are not included in the git repository for data privacy reasons.

comorbidities.R

This script searches the cBradford "condition" concepts for terms matching conditions known to be co-morbid with asthma. Conditions are classified at an upper and a lower level, where the lower level generally uses synonyms, acronyms, or sub-groupings to ensure all relevant concepts are found. The query returns all occurrences of each concept, along with person_id and date of diagnosis.
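The two-level matching described above could look something like the following minimal sketch. It is not the code in comorbidities.R: the term map, the toy concept table and its column names, and the helper match_concepts are all hypothetical and exist only to illustrate the idea.

```r
# Hypothetical two-level term map: upper-level condition -> lower-level search terms.
# The terms below are illustrative only, not the repository's actual list.
term_map <- list(
  eczema = c("eczema", "atopic dermatitis"),
  reflux = c("gastro-oesophageal reflux", "GORD", "GERD")
)

# Toy stand-in for the cBradford concept table (column names are assumptions).
concepts <- data.frame(
  concept_id   = 1:5,
  concept_name = c("Atopic dermatitis", "Infantile eczema", "GORD",
                   "Fracture of wrist", "Gastro-oesophageal reflux disease"),
  stringsAsFactors = FALSE
)

# Return the concepts matching any lower-level term, labelled by upper level.
match_concepts <- function(concepts, term_map) {
  hits <- lapply(names(term_map), function(upper) {
    pattern <- paste(term_map[[upper]], collapse = "|")
    found <- grepl(pattern, concepts$concept_name, ignore.case = TRUE)
    if (!any(found)) return(NULL)
    data.frame(upper = upper, concepts[found, ], stringsAsFactors = FALSE)
  })
  do.call(rbind, hits)
}

matched <- match_concepts(concepts, term_map)
```

The matched concept IDs would then feed the SQL query that pulls every occurrence, with person_id and date, from the full dataset.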

demographics.R

This script extracts data on gender and ethnicity from cBradford.

prescriptions.R

This script searches the cBradford "drug" concepts for terms matching several types of common medication. Specific drug names are grouped into major categories: fast-acting bronchodilators, long-acting bronchodilators, antihistamines, prednisolone, and other steroids. The query returns all occurrences of each concept, along with person_id and date of prescription.

visits.R

This script searches the cBradford "visit" concepts for all those applying to ACE patients. This returns four main categories of visit (GP, ER, hospital as inpatient, hospital as outpatient) along with the date of the visit.

build_cbradford_variables.R

This script uses the concept occurrence data generated by the previous scripts (and their accompanying SQL queries) to generate "variables of interest" which correlate with hospitalisation. It loads the ACE data through cBradford, then loads each of the outputs from above in turn. For each major variable type within these, it binarises and/or splits the variable by timeframe (as appropriate) to generate binary (TRUE/FALSE) variables. A univariate binomial GLM is then fitted for each variable, and variables showing a significant link with hospitalisation are recorded in .csv files in data/new_features (excluded from the git repository).
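As a rough illustration of the binarise-then-screen step, the sketch below fits one binomial GLM per candidate variable on simulated data. The column names, the binarisation threshold, the helper univariate_p, and the 0.05 significance cut-off are all assumptions, not details taken from build_cbradford_variables.R.

```r
set.seed(1)

# Simulated stand-in data; column names are hypothetical.
n <- 400
n_gp_visits <- rpois(n, lambda = 2)        # e.g. GP visits in some timeframe
frequent_gp <- n_gp_visits >= 2            # binarised TRUE/FALSE variable
hospitalised <- rbinom(n, 1, plogis(-1 + 1.2 * frequent_gp))
dat <- data.frame(hospitalised, frequent_gp,
                  noise = rbinom(n, 1, 0.5) == 1)

# Fit one binomial GLM per candidate and extract the predictor's p-value.
univariate_p <- function(var, data) {
  fit <- glm(reformulate(var, response = "hospitalised"),
             family = binomial, data = data)
  coef(summary(fit))[2, "Pr(>|z|)"]
}

candidates <- c("frequent_gp", "noise")
p_values <- vapply(candidates, univariate_p, numeric(1), data = dat)

# Keep variables below an assumed 0.05 threshold.
significant <- names(p_values)[p_values < 0.05]
```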

Spatial data scripts (subfolder = spatial)

These scripts extract spatial data (aggregated to the LSOA level), link it to patient records (also at the LSOA level), and generate binary variables.

build_imd_maps.R

This script takes Indices of Multiple Deprivation (IMD) data at the LSOA level from the Ministry of Housing, Communities & Local Government (in spatial/raw/Indices_of_Multiple_Deprivation_(IMD)_2019.csv). These indices are then aggregated to the MSOA and postcode levels using spatial/raw/PCD_OA_LSOA_MSOA_LAD_AUG19_UK_LU.csv.gz (obtained from the ONS website, and provided compressed due to file size), and to ward level using spatial/raw/Lower_Layer_Super_Output_Area_(2011)_to_Ward_(2015)_Lookup_in_England_and_Wales.csv, also obtained from the ONS. The aggregated IMD variables are output to separate .csv files in spatial/area_stats/bradford_imd. The script also produces a single file that translates between postcode, LSOA, MSOA and ward, written to spatial/area_stats/postcode_ONS_translation.csv.
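Aggregating from LSOA to a coarser geography through a lookup table might be sketched as below. The area codes, scores, and column names are invented, and the unweighted mean is an assumption: the README does not say how build_imd_maps.R combines LSOA values.

```r
# Toy LSOA-level scores and an LSOA-to-MSOA lookup (codes and values invented).
imd <- data.frame(
  lsoa = c("L1", "L2", "L3", "L4"),
  imd_score = c(10, 30, 20, 40)
)
lookup <- data.frame(
  lsoa = c("L1", "L2", "L3", "L4"),
  msoa = c("M1", "M1", "M2", "M2")
)

# Attach each LSOA's MSOA, then average the score within each MSOA.
# An unweighted mean is an assumption; population weighting is another option.
joined <- merge(imd, lookup, by = "lsoa")
imd_msoa <- aggregate(imd_score ~ msoa, data = joined, FUN = mean)
```

The same pattern would apply to ward-level aggregation by swapping in the LSOA-to-ward lookup.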

build_air_maps.R

This script builds maps of air pollution variables at different spatial scales (postcode, LSOA, MSOA, ward and postcode district), and places the data for them in .csv format in spatial/area_stats/bradford_air. Air pollution data are found in spatial/raw and were originally obtained from the DEFRA website. Also provided is the file spatial/raw/postcode_gridref.csv, which translates Bradford postcodes to grid references (originally built using this tool), as air pollution data are only provided using grid references.

investigate_air_imd.R

This script links the air pollution and IMD variables to the ACE data, and examines variables of interest. ACE data come from two sources: cBradford using BigQuery, and the file data/ace_data_LSOA.xslx, which gives the addresses of ACE patients at the LSOA level. This second file is excluded from the git repository, but was provided by Attia Gilani from the ACE team. Like a number of scripts which build new "variables of interest", it loads in the "final results file" (data/ace_data_cooper_final.csv, excluded from git) to cross-tabulate the variables of interest for patients in the final population. New variables are generated by binarisation, and written to data/new_features/air_imd.csv (excluded from git).
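Binarisation followed by cross-tabulation could be sketched as follows. The data are invented, and the cut-off (the three most deprived IMD deciles) is an assumed example rather than the threshold used in investigate_air_imd.R.

```r
# Toy patient-level data; names and values are illustrative.
patients <- data.frame(
  imd_decile = c(1, 2, 2, 5, 8, 9, 3, 10),
  hospitalised = c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE)
)

# Binarise: flag patients in the three most deprived deciles (assumed cut-off).
patients$most_deprived <- patients$imd_decile <= 3

# Cross-tabulate the new binary variable against the outcome.
xtab <- table(most_deprived = patients$most_deprived,
              hospitalised = patients$hospitalised)
```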

build_gp_distances.R

This script measures the distance from the centroid of each LSOA to each patient's GP surgery, as well as to the closest hospital. Patient information is loaded from the ACE dataset in cBradford (with LSOAs provided from data/ace_data_LSOA.xslx). LSOAs are translated to latitude and longitude using spatial/raw/postcode_gridref.csv and spatial/area_stats/postcode_ONS_translation.csv, and co-ordinates for each surgery, obtained from Google Maps, are found in spatial/raw/surgery_coords.csv. The "final results file" (data/ace_data_cooper_final.csv, excluded from git) is also used for cross-tabulation. Resulting variables are binarised, and written to data/new_features/distance.csv (excluded from git).
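One way to compute such distances from latitude and longitude is the haversine (great-circle) formula, sketched below. The README does not state which distance measure build_gp_distances.R uses, so the formula, the helper haversine_km, and the coordinates are all assumptions for illustration.

```r
# Great-circle distance in km between points given in decimal degrees.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Illustrative coordinates only (not taken from the repository's data).
centroid <- c(lat = 53.80, lon = -1.75)
surgeries <- data.frame(
  name = c("A", "B"),
  lat = c(53.80, 53.85),
  lon = c(-1.75, -1.70),
  stringsAsFactors = FALSE
)
surgeries$dist_km <- haversine_km(centroid[["lat"]], centroid[["lon"]],
                                  surgeries$lat, surgeries$lon)

# The closest site is the one with the smallest distance.
nearest <- surgeries$name[which.min(surgeries$dist_km)]
```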

Data preparation scripts (subfolder = prep)

join_datasets.R

This script links data on cBradford with data in spreadsheets provided by the ACE team. This is necessary because the spreadsheets contain a number of physiological variables that are known to be related to hospitalisation but are not yet available on the cBradford platform. Spreadsheet data came from three files: data/ace_data_orig.xlsx, data/ace_data_extra.xlsx and data/brand_new_data.xlsx, all of which are excluded from git. As the person_id variable which links the ACE data on cBradford to the wider platform is not used in the spreadsheets provided by the ACE team, other variables were used to link unique entries. These variables are first cleaned to ensure formatting matches between the spreadsheet and cBradford versions, then the datasets are joined. Identifying variables were patient age, address, GP surgery, whether hospitalisation was required, "number of bed days saved", ethnicity, referral source, referral date and referral time. The combined dataset is then written to data/ace_data_linked.csv (excluded from git).
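Linking on a composite key after cleaning might look like the sketch below. Only three of the identifying variables are shown, and every column name, value, and the helper clean_text are hypothetical; join_datasets.R may clean and join differently.

```r
# Toy records; all column names and values are hypothetical.
sheet <- data.frame(
  age = c(5, 7),
  gp_surgery = c(" Ridge Medical ", "THE WILLOWS"),
  referral_date = c("01/02/2019", "15/03/2019"),
  peak_flow = c(180, 210),
  stringsAsFactors = FALSE
)
platform <- data.frame(
  person_id = c(101, 102),
  age = c(7, 5),
  gp_surgery = c("the willows", "ridge medical"),
  referral_date = c("2019-03-15", "2019-02-01"),
  stringsAsFactors = FALSE
)

# Normalise text and dates so equivalent values compare equal.
clean_text <- function(x) tolower(trimws(x))
sheet$gp_surgery <- clean_text(sheet$gp_surgery)
platform$gp_surgery <- clean_text(platform$gp_surgery)
sheet$referral_date <- as.Date(sheet$referral_date, format = "%d/%m/%Y")
platform$referral_date <- as.Date(platform$referral_date)

# Join on the composite key; unmatched rows are dropped by default.
keys <- c("age", "gp_surgery", "referral_date")
linked <- merge(platform, sheet, by = keys)
```

Using more identifying variables in the key reduces the chance that two different patients share the same combination of values.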

import_data.Rmd

This script runs Sam Relins' data preprocessing routines on the newly-linked data to ensure consistency with previous work. It requires the data/ace_data_linked.csv file generated by join_datasets.R, as well as data/ace_data_extra.csv to ensure consistency of column names. Both of these files are excluded from git. The full combined and preprocessed data is then written to the "final results file" at data/ace_data_cooper_final.csv (excluded from git).

Analysis scripts (subfolder = analysis)

These two scripts, bootstrap_aggregate_original.Rmd and bootstrap_aggregate_additional.Rmd, perform the analysis itself. In each script, after lasso is used to determine the model structure, three main functions are defined and run iteratively. Data is partitioned, with the training partition used in the "inner loop" to generate distributions of coefficients. The validation partition is then used to assess model predictive performance. Partitioning, fitting and validation together form the "outer loop", which is then repeated. Two main parameters control execution of the loops: nboot and ncycles, both set to 1,000. nboot is the number of iterations of the inner loop, and increasing this increases the robustness of the model coefficient estimates. ncycles is the number of iterations of the outer loop, and increasing this increases the robustness of the model performance metrics.
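The nested loop structure can be sketched on simulated data as below. This is a simplified illustration, not the scripts' implementation: the lasso step is skipped and the model structure is fixed, the 80/20 split, averaging of coefficients, and accuracy metric are assumptions, and the helpers bootstrap_coefs and run_cycle are hypothetical. Only nboot and ncycles are names from the scripts, and they are set far below 1,000 here so the sketch runs quickly.

```r
set.seed(42)

# Simulated data; the real scripts use data/ace_data_cooper_final.csv.
n <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(0.5 + dat$x1 - 0.5 * dat$x2))

nboot <- 20    # inner-loop iterations (the scripts use 1,000)
ncycles <- 5   # outer-loop iterations (the scripts use 1,000)

# Inner loop: refit on bootstrap resamples of the training partition.
bootstrap_coefs <- function(train, nboot) {
  t(replicate(nboot, {
    resample <- train[sample(nrow(train), replace = TRUE), ]
    coef(glm(y ~ x1 + x2, family = binomial, data = resample))
  }))
}

# Outer loop body: new partition, aggregate coefficients, score on validation data.
run_cycle <- function(dat, nboot, train_frac = 0.8) {
  idx <- sample(nrow(dat), size = floor(train_frac * nrow(dat)))
  coefs <- bootstrap_coefs(dat[idx, ], nboot)
  valid <- dat[-idx, ]
  linpred <- cbind(1, valid$x1, valid$x2) %*% colMeans(coefs)
  pred <- plogis(linpred) > 0.5
  list(coefs = coefs, accuracy = mean(pred == (valid$y == 1)))
}

results <- lapply(seq_len(ncycles), function(i) run_cycle(dat, nboot))
accuracy <- vapply(results, function(r) r$accuracy, numeric(1))
```

Each element of results holds an nboot-row matrix of coefficients for one cycle, and accuracy collects one performance value per cycle, mirroring the two kinds of distribution the scripts return.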

Each script returns full distributions for all model coefficients and performance metrics, and writes these to an R data structure (.RDS) file, stored in analysis/boostrap_aggregate_original.RDS and analysis/boostrap_aggregate_additional.RDS respectively. WARNING: These scripts are slow. Each one takes 3-4 hours to run on a 16-core CPU with 32GB RAM (i.e., a high-performance machine). The inner loop is parallelised, and will use n_cores - 1 if fewer than 16 cores are available. The two scripts differ only in that the lasso algorithm is given the additional variables in bootstrap_aggregate_additional.Rmd, and thus the model structure is different. Both scripts require data/ace_data_cooper_final.csv, which is excluded from git.

Plotting/tabulating scripts (subfolder = plots_tables)

These are fairly straightforward: plot_results.R generates the main results figures and tables, and appendix_tables.R generates cross-tabulations of variables of interest for the appendix.

Functions (subfolder = functions)

Two additional functions are used: inverse_logit.R converts log-odds to a probability, and load_connected_yorkshire.R loads all the tables for the cBradford 50k dataset into two objects - CY for the main data, and CY-V for the vocabulary tables.
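For reference, converting log-odds to a probability uses p = 1 / (1 + exp(-x)). The sketch below shows the formula under a hypothetical name (inverse_logit_sketch) so as not to suggest it is the exact contents of inverse_logit.R; base R's plogis() computes the same quantity.

```r
# Convert log-odds to a probability: p = 1 / (1 + exp(-x)).
inverse_logit_sketch <- function(log_odds) {
  1 / (1 + exp(-log_odds))
}

# Log-odds of 0 correspond to a probability of 0.5.
p_even <- inverse_logit_sketch(0)
```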

Data not provided

The following files and folders are required to run the scripts above, but cannot be provided publicly as they contain confidential patient data.

  • data/new_features/
  • data/cBradford/
  • data/ace_data_LSOA.xslx
  • data/ace_data_cooper_final.csv
  • data/ace_data_orig.xlsx
  • data/ace_data_extra.xlsx
  • data/brand_new_data.xlsx
  • data/ace_data_linked.csv
  • data/ace_data_extra.csv

They are all stored securely as a .zip file on the servers of Bradford Teaching Hospitals NHS Foundation Trust; contact John Birkinshaw, senior database manager at Connected Yorkshire.
