Merge pull request #12 from clinicalml/docs
omop-learn v2.0 documentation
abuendia committed Jun 17, 2023
2 parents b6357d0 + 3cf4251 commit ad2a61e
Showing 3 changed files with 85 additions and 102 deletions.
14 changes: 12 additions & 2 deletions README.md

## Installation

Dependencies for `omop-learn` are managed through an [environment.yml](./environment.yml) file. Run the following from the current directory to create the conda environment necessary to run the package:

```shell
conda env create -f environment.yml
conda activate omop-learn
pip install .
```

This installs the dependencies and the `omop-learn` package itself into a conda environment named `omop-learn`.

## Documentation

For a more detailed summary of `omop-learn`'s data collection pipeline, and for function-level documentation, please see the full [documentation](https://clinicalml.github.io/omop-learn/) for this repo, which also describes the process of creating one's own cohorts, predictive tasks, and features.

## Contributors and Acknowledgements

`omop-learn` was written by Rohan Kodialam and Jake Marcus, with additional contributions by Rebecca Boiarsky, Justin Lim, Ike Lage, Shannon Hwang, Hunter Lang, Christina Ji, Irene Chen, and Alejandro Buendia.

This package was developed as part of a collaboration with Independence Blue Cross and would not have been possible without the advice and support of Aaron Smith-McLallen, Ravi Chawla, Kyle Armstrong, Luogang Wei, Neil Dixit, and Jim Denyer.
73 changes: 73 additions & 0 deletions docs/index.md

## Introductory Tutorial

`omop-learn` allows [OMOP-standard (CDM v5 and v6)](https://github.com/OHDSI/CommonDataModel/wiki) medical data like claims and EHR information to be processed efficiently for predictive tasks. The library allows users to precisely define cohorts of interest, patient-level time series features, and target variables of interest. Relevant data is automatically extracted and surfaced in formats suitable for most machine learning algorithms, and the (often extreme) sparsity of patient-level data is fully taken into account to provide maximum performance.

The library provides several benefits for modeling, both in terms of ease of use and performance:
* All that needs to be specified are cohort and outcome definitions, which can often be done using simple SQL queries.
* Our fast data ingestion and transformation pipelines allow for easy and efficient tuning of algorithms. We have seen significant improvements in out-of-sample performance of predictors after hyperparameter tuning that would take days with simple SQL queries but minutes with `omop-learn`.
* We modularize the data extraction and modeling processes, allowing users to use new models as they become available with very little modification to the code. Tools ranging from simple regression to deep neural net models can easily be substituted in a plug-and-play manner.
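The plug-and-play substitution described above rests on the `sklearn` estimator contract: any object exposing `fit(X, y)` and `predict(X)` can be dropped in. As a self-contained illustration (a toy majority-class baseline, not part of `omop-learn`), such an estimator can be as small as:

```python
class MajorityClassBaseline:
    """Toy sklearn-style estimator: always predicts the most frequent training label."""

    def fit(self, X, y):
        # X is accepted for interface compatibility but unused by this baseline
        counts = {}
        for label in y:
            counts[label] = counts.get(label, 0) + 1
        self.majority_ = max(counts, key=counts.get)
        return self  # sklearn convention: fit returns self

    def predict(self, X):
        return [self.majority_ for _ in X]


model = MajorityClassBaseline().fit([[0], [1], [2]], [0, 0, 1])
print(model.predict([[3], [4]]))  # → [0, 0]
```

Any real `sklearn` classifier (logistic regression, gradient boosting, and so on) satisfies the same two-method interface, which is what makes model substitution a one-line change.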

`omop-learn` serves as a modern python alternative to the [PatientLevelPrediction R library](https://github.com/OHDSI/PatientLevelPrediction). We allow seamless integration of many Python-based machine learning and data science libraries by supporting generic `sklearn`-style classifiers. Our new data storage paradigm also allows for more on-the-fly feature engineering as compared to previous libraries.

In this tutorial, we walk through the process of using `omop-learn` for an end-of-life prediction task for Medicare patients with clear applications to improving palliative care. The code used can also be found in the [example notebook](https://github.com/clinicalml/omop-learn/blob/master/examples/eol/sard_eol.ipynb), and can be run on your own data as you explore `omop-learn`. The control flow diagram below also links to relevant sections of the library documentation.
<center>
<div class="mxgraph" style="max-width:100%;border:1px solid transparent;" data-mxgraph="{&quot;highlight&quot;:&quot;#006633&quot;,&quot;lightbox&quot;:false,&quot;nav&quot;:false,&quot;resize&quot;:true,&quot;xml&quot;:&quot;&lt;mxfile host=\&quot;www.draw.io\&quot; modified=\&quot;2020-01-27T20:09:03.888Z\&quot; agent=\&quot;Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36\&quot; etag=\&quot;8otv9sNdF-oivO5T6t3e\&quot; version=\&quot;12.5.8\&quot; type=\&quot;device\&quot;&gt;&lt;diagram id=\&quot;C5RBs43oDa-KdzZeNtuy\&quot; name=\&quot;Page-1\&quot;&gt;7Zpbc5s4FIB/jR/DgABfHuM47nY33XinbdI+dWRQQImMqJAbe3/9HoG44xindnrZeDItOhIS0vnOhWMP7IvV5o3AcfiO+4QNkOlvBvZsgJDlIDRQf6a/zSQjx80EgaC+HlQK3tN/iRaaWrqmPklqAyXnTNK4LvR4FBFP1mRYCP5YH3bHWX3VGAekJXjvYdaW3lJfhpl0jEal/A9CgzBf2RpOsp4VzgfrnSQh9vljRWRfDuwLwbnMrlabC8LU4eXncvt2e8uuHoZv/vwn+Yo/Tv/68PfNWTbZ/JBbii0IEslnT42+boPxTXI7f7uerS7DL7d8yM9y7X7DbK0PTG9WbvMThGlAWdCYPoZUkvcx9lTPI/ACslCuGLQsuMRJnGnwjm4IrDoNGE4S3enxFfX0dSIFfyAXnHGRLmGbpumY46InV5Sr5hDYp7Dvyui79AN9dzySc7yiTLF5Q4SPI6zFGkTLgbbeIhGSbBpQ7DlRq1Az2AfhKyLFFu7Ts7hDbQraNIY5KY8laJarZWEVslyINdxBMXepQLjQOkybHxMirpf36oCRyfCSsFxdQwbTTpdwEagLy4ArvFK6iZZJnO4/GwIzFqNm5I5GBKa64CEXak4c+fDv9VqCpkj2eIxGD9kqoZTKXs/V06G5Bx0UTGzFjIDKcL00KAfxQhCfepLy6IouBRbbdDPI9tO1vnh6JbSPd7ty6hriCo6CryNf4ZWitAfJJgpVIuusqd7Cws0GsDtAO4SiDgB3gmUNzRpYyOkAy+kAa3Q4V9CsoHWA6zBbnuP63fUCJDMsccuJqM0rZK4UuQueUIUJdEmuNIYZDVTLg4MkYOXTFPAp9h6CVN1dxp/PeK7vXXIp+arThzzhb6oa93ESpmCZWU+snn21CVRgBMKTkUEhSiWGrza4EwrW2GGxpyaLFUyz7XLhE5E/ZcQjssf9ncqv2aiO36RNn+O04XNtw3YP5q8fbJbVDks+xHndBL8S8oBHmF2W0mndU1TtuRx/xRWAqfCeSLnVCsJryesqggMV208ajrTxWTUMN2/ONtXO2bbkqCSMRP65ymkUrox7D5loTlm+Shep8GlRbR8a/dRpPYMROHG+Fh55QjW2TuqwCIh8YtyomzlBGJb0W/3hThEXUa+4eBFynpDCi0GumsZIxtIFjh8U7wiWawEJci0s7j7oXycsnsAvjX+BsOjsT6jrqqqopcMDdXqqqvqsPm4FWnp8O8m2mxreEXt1QNrvdnf7sCo7ZENl6k6NievqduZRJ/ZIt0uXqhrbSmNBBAUVqrha+txPdYf8ueaPu73zqZ2j+ULOUd+64DSSFYOZ1A3GQg1DyJ5f31V9d2xMNHQalmc2Jso22JooNapiP88
P/+6rUR1kVFbNoNynzemZic0RTWf3W2AP07FPYzojZJiobj3OxJiMxxPX1B/3ecZkN43JflljspwfnksXvr/q+a2emB7m4X/7/HvU006s7zWU73qBG7U8eJFUdxcM8vfvkGxwoF6kp3El5mtpkQag/TnvMYqRP7rg2EyA8/ftagKMOhLg8eEJcE9f0qN+nGuSrtJSfVUnzRpOVg3aVSsqKjw9qkPpYud5QVoZcKs6rZ9n1niN8yOU1nng9cwnwgBUQJqWfNA8rf+og6KJB6eDI8LXSXoIc+R0yOFvbMRR8Jvg1yhLWuNRCz9n3KYvlx2fvslrJPuNIpnVN+Wz3NOEst61JNsYKDfbLh8tBIkF9wiYejWuHalmFFdm31susv739SJ3VPdX9uhF60W9aXJ20VQUI7Pv5j4ITCP4P/+SfkFjwtJv8Z5iTIOVxbESuF2YgTdZqr3hRGVaMOgKpC4sThKZXswEWJgwaLyNln0wdF8xrGPomj9h2RK1v87bm8gdI0/rzgC7c7VaOlfN3grWVX52xjKYNfHpOJWzRWQtMDuLiHzk4uFMDe2WniHbuI+DFyRiiDry+C4iCuHRUyn0E33B9oulRc/JAU+eSrk9U6nKL5mOWT4bug2f13wD7V15bk5kHatYBs3yt1vZ8PIXcPblfw==&lt;/diagram&gt;&lt;/mxfile&gt;&quot;}"></div>
<script type="text/javascript" src="https://www.draw.io/js/viewer.min.js"></script>
We define our end-of-life task as follows:

> For each patient who is over the age of 70 at prediction time, and is enrolled in an insurance plan for which we have claims data available for 95% of the days of calendar year 2016, and is alive as of March 31, 2017: predict if the patient will die during the interval of time between April 1, 2017 and September 30, 2017 using data including the drugs prescribed, procedures performed, conditions diagnosed and the medical specialties of the clinicians who cared for the patient during 2016.
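The window arithmetic in this definition (data through the end of 2016, a three-month gap, then a six-month outcome window) can be checked with a small stdlib helper. The function below is purely illustrative and not part of `omop-learn`:

```python
from datetime import date

def add_months(d, months):
    # Shift a date forward by whole months (illustrative helper; day is kept as-is,
    # which is safe here because all window boundaries fall on the 1st)
    month_index = d.month - 1 + months
    year, month = d.year + month_index // 12, month_index % 12 + 1
    return date(year, month, d.day)

training_end = date(2017, 1, 1)              # data collection covers all of 2016
outcome_start = add_months(training_end, 3)  # 3-month gap -> April 1, 2017
outcome_end = add_months(training_end, 9)    # + 6-month window -> Oct 1, 2017 (exclusive)
print(outcome_start, outcome_end)  # → 2017-04-01 2017-10-01
```

The exclusive end date of October 1, 2017 corresponds to the inclusive September 30, 2017 bound in the task statement.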
`omop-learn` splits the conversion of this natural language specification of a task to code into two natural steps. First, we define a **cohort** of patients, each of which has an outcome. Second, we generate **features** for each of these patients. These two steps are kept independent of each other, allowing different cohorts or feature sets to very quickly be tested and evaluated. We explain how cohorts and features are initialized through the example of the end-of-life problem.

#### 1.1 Data Backend Initialization
`omop-learn` supports a collection of data [backend](https://github.com/clinicalml/omop-learn/tree/master/src/omop_learn/backends) engines depending on where the source OMOP tables are stored: PostgreSQL, Google BigQuery, and Apache Spark. The `PostgresBackend`, `BigQueryBackend`, and `SparkBackend` classes inherit from parent class `OMOPDatasetBackend` which defines the set of methods to interface with the data storage as well as the feature creation.

Configuration parameters used to initialize the backend are surfaced through `.env` files, for example [`bigquery.env`](https://github.com/clinicalml/omop-learn/blob/master/bigquery.env), which stores the name of the project in Google BigQuery, schemas to write the cohort to, as well as local directories to store feature data and trained models. The backend can then simply be created in Python as:

```python
import os
from dotenv import load_dotenv
# Config and BigQueryBackend are provided by the omop-learn package

load_dotenv("bigquery.env")

config = Config({
    "project_name": os.getenv("PROJECT_NAME"),
    "cdm_schema": os.getenv("CDM_SCHEMA"),
    "aux_cdm_schema": os.getenv("AUX_CDM_SCHEMA"),
    "prefix_schema": os.getenv("PREFIX_SCHEMA"),
    "datasets_dir": os.getenv("OMOP_DATASETS_DIR"),
    "models_dir": os.getenv("OMOP_MODELS_DIR"),
})

# Set up the database backend
backend = BigQueryBackend(config)
```

#### 1.2 <a name="define_cohort"></a> Cohort Initialization
OMOP's [`PERSON`](https://github.com/OHDSI/CommonDataModel/wiki/PERSON) table is the starting point for cohort creation, and is filtered via SQL query. Note that these SQL queries can be written with variable parameters which can be adjusted for different analyses. These parameters are implemented as [Python templates](https://www.python.org/dev/peps/pep-3101/). In this example, we leave dates as parameters to show how cohort creation can be flexible.
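For instance, a parameter token such as `{ training_start_date }` in the query text can be substituted before execution. The helper below is an illustrative sketch of that substitution, not `omop-learn`'s internal mechanism:

```python
import re

def fill_params(sql_template, params):
    # Replace "{ name }" tokens (tolerating inner spaces) with their values
    return re.sub(r"\{\s*(\w+)\s*\}", lambda m: str(params[m.group(1)]), sql_template)

sql = "select * from cdm.observation_period where observation_period_start_date >= date '{ training_start_date }'"
print(fill_params(sql, {"training_start_date": "2016-01-01"}))
# → select * from cdm.observation_period where observation_period_start_date >= date '2016-01-01'
```

Keeping dates as template parameters means the same cohort SQL can be re-run for different training windows without editing the query itself.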

We first want to establish when patients were enrolled in insurance plans which we have access to. We do so using OMOP's `OBSERVATION_PERIOD` table. Our SQL logic finds the number of days within our data collection period (all of 2016, in this case) that a patient was enrolled in a particular plan:
```sql
death_training_elig_counts as (
    ...
from cdm.observation_period
)
```
Note that the dates are left as template strings that can be filled later on. Next, we want to filter for patients who are enrolled for 95% of the days in our data collection period. Note that we must be careful to include patients who used multiple different insurance plans over the course of the year by aggregating the intermediate table `death_training_elig_counts` which is specified above. Thus, we first aggregate and then collect the `person_id` field for patients with sufficient coverage over the data collection period:
```sql
death_trainingwindow_elig_perc as (
select
        ...
)
```
Finally, we can create the cohort:
```sql
    ...
or d.death_datetime >= (date '{ training_end_date }' + interval '{ gap }')
)
```
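The 95%-coverage aggregation that the SQL above performs can be mirrored in plain Python for intuition. This stdlib sketch (not part of `omop-learn`) sums each patient's enrolled days across plans and keeps the patients covering at least 95% of 2016's 366 days:

```python
from collections import defaultdict

# (person_id, days enrolled in one plan during 2016); a patient may hold several plans
enrollment_days = [(1, 200), (1, 160), (2, 100), (3, 366)]

totals = defaultdict(int)
for person_id, days in enrollment_days:
    totals[person_id] += days  # aggregate across plans, as in death_training_elig_counts

eligible = sorted(p for p, d in totals.items() if d >= 0.95 * 366)
print(eligible)  # → [1, 3]
```

Patient 1 qualifies only because their two plans are summed first, which is exactly why the SQL aggregates the intermediate table before filtering.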
The full cohort creation SQL query can be found [here](https://github.com/clinicalml/omop-learn/blob/master/examples/eol/postgres_sql/gen_EOL_cohort.sql).
Note the following key fields in the resulting table:

Field | Meaning
------------ | -------------
`example_id` | A unique identifier for each example in the dataset. While in the case of end-of-life each patient will occur as a positive example at most once, this is not the case for all possible prediction tasks, and thus this field offers more flexibility than using the patient ID alone.
`y` | A column indicating the outcome of interest. Currently, `omop-learn` supports binary outcomes.
`person_id` | A column indicating the ID of the patient.
`start_date` and `end_date` | Columns indicating the beginning and end of the time periods to be used for data collection for this patient. This will be used downstream for feature generation.

We are now ready to build a cohort. We construct a [`Cohort`](https://github.com/clinicalml/omop-learn/blob/master/src/omop_learn/data/cohort.py) object by passing an open file handle to the defining SQL script, the relevant data backend, and the set of cohort params.
```python
cohort_params = {
    "cohort_table_name": "eol_cohort",
    "schema_name": config.prefix_schema,
    "cdm_schema": config.cdm_schema,
    "aux_data_schema": config.aux_cdm_schema,
    "training_start_date": "2016-01-01",
    "training_end_date": "2017-01-01",
    "gap": "3 months",
    "outcome_window": "6 months",
}

sql_dir = "examples/eol/postgres_sql"
with open(f"{sql_dir}/gen_EOL_cohort.sql", "r") as sql_file:
    cohort = Cohort.from_sql_file(sql_file, backend, params=cohort_params)
```

#### <a name="define_features"></a> 1.3 Feature Initialization
With a cohort now fully in place, we are ready to associate features with each patient in the cohort. These features will be used downstream to predict outcomes.

The OMOP Standardized Clinical Data tables offer several natural features for a patient, including histories of condition occurrences, procedures, and more. `omop-learn` includes SQL scripts to collect time series of these common features automatically for any cohort, allowing a user to very quickly set up a feature set. To do so, we first initialize a `FeatureGenerator` object with a database indicating where feature data is to be found. Similar to the `CohortGenerator`, this does not actually create a feature set -- that is only done once all parameters are specified. We next select the pre-defined features of choice, and finally select a cohort for which data is to be collected:
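This deferred-execution pattern (store *how* to build features first, materialize only on demand) can be sketched in a self-contained way; the class and method names below are illustrative stand-ins, not `omop-learn`'s exact API:

```python
class FeatureGeneratorSketch:
    """Illustrative stand-in: records how to build features, materializes later."""

    def __init__(self, db_name):
        self.db_name = db_name
        self.features = []
        self.cohort = None

    def add_features(self, names):
        self.features.extend(names)  # nothing is computed yet
        return self

    def set_cohort(self, cohort_name):
        self.cohort = cohort_name
        return self

    def build(self):
        # Only at this point would SQL actually run against the database;
        # here we just describe the work that would be done
        return f"collect {sorted(self.features)} for cohort '{self.cohort}' from {self.db_name}"


gen = FeatureGeneratorSketch("cdm")
gen.add_features(["conditions", "procedures"]).set_cohort("eol_cohort")
print(gen.build())
# → collect ['conditions', 'procedures'] for cohort 'eol_cohort' from cdm
```

Keeping feature definitions lazy like this is what lets the same specification be pointed at different cohorts or databases without recomputation.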