From bf3234a34ab102a9c34adcbac12bf6fba9c07d66 Mon Sep 17 00:00:00 2001
From: Alejandro Buendia
Date: Sat, 17 Jun 2023 07:53:51 -0400
Subject: [PATCH] Update to docs

---
 docs/index.md | 40 +++++++++++++++++-----------------------
 1 file changed, 17 insertions(+), 23 deletions(-)

diff --git a/docs/index.md b/docs/index.md
index 10bec2f..11358bb 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -207,51 +207,45 @@ dataset = OMOPDataset.from_prebuilt(config.datasets_dir)
 window_days = [30, 180, 365, 730, 1500, 5000, 10000]
 windowed_dataset = dataset.to_windowed(window_days)
 windowed_dataset.split()
-)
 ```

 The `to_windowed()` function takes in the raw sparse tensor of features, filters several times to collect data from the past `d` days for each `d` in `window_lengths`, then sums along the time axis to find the total count of the number of times each code was assigned to a patient over the last `d` days. These count matrices are then concatenated to build a final feature set of windowed count features. Note that unlike a pure SQL implementation of this kind of feature, `omop-learn` can quickly rerun the analysis for a different set of windows; this ability to tune parameters allows use of a validation set to determine optimal values and thus significantly increase model performance. Note that we can also easily split the windowed data into train, validation, and test sets by calling the method `split()` on the windowed dataset in evaluating model performance.

 ## Files
-We review the subdirectories in the source package.
+We review the subdirectories in the source package for [`omop-learn`](https://github.com/clinicalml/omop-learn/tree/master/src/omop_learn).

-### backends
-The set of backends interfaces with the data storage
+### `backends`
+The set of backends interfaces with the data storage and the compute engine to run feature extraction. We support PostgreSQL, Google BigQuery, and Apache Spark. The defining methods are inherited from `OMOPDatasetBackend`. Note that backend feature creation leverages Python's `multiprocessing` library to extract features, parallelized by OMOP `person_id`.

-### Utils
+### `data`

-#### dbutils.py
+Data methods include the `Cohort`, `Feature`, and `ConceptTokenizer` classes. Cohorts and features can be initialized using the previously reviewed code snippets.

-dbutils.py provides tools for interacting with a postgres database into which a set of OMOP compliant tables have been loaded. The Database object can be instantiated using a standard postgres connection string, and can then be used (via 'query', 'execute' and 'fast_query') to run arbitrary SQL code and return results in Pandas dataframes.
+The `ConceptTokenizer` class offers a compact representation for storing the set of relevant OMOP concepts by providing a mapping from concept index to name. The class also includes a set of special tokens (beginning of sequence, end of sequence, separator, pad, and unknown) for use in language modeling applications.

-#### PopulateAux.py
-PopulateAux.py allows for the definition of custom tables that do not exist in the OMOP framework, but are required over multiple models by the user. These can be instantiated and kept in an auxiliary schema, and used persistently as needed.
+### `hf`
+Utilities for interfacing with [Hugging Face libraries](https://huggingface.co/) are provided. These include a mapping from the `OMOPDataset` object to datasets ingestible by Hugging Face models.
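+
+For intuition, the following is a minimal sketch of this kind of mapping; the field names and values are hypothetical, and this is not the exact `omop-learn` interface:
+
+```python
+# Illustrative sketch only: field names are hypothetical, not omop-learn's schema.
+from datasets import Dataset
+
+records = {
+    "person_id": [1, 2],
+    "concept_ids": [[201826, 4329847], [313217]],  # example OMOP concept IDs
+    "label": [0, 1],
+}
+hf_dataset = Dataset.from_dict(records)
+print(hf_dataset[0])  # {'person_id': 1, 'concept_ids': [201826, 4329847], 'label': 0}
+```
+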
-### Generators
-This directory contains the implementation of classes to store and instantiate Cohorts of patients and sets of Features that can be used for prediction tasks.
+### `models`

-#### CohortGenerator.py
+The files `transformer.py` and `visit_transformer.py` provide the modeling methods used to create the SARD architecture [Kodialam et al. 2021]. The methods in `transformer.py` define transformer blocks and multi-head attention in the standard way. The methods in `visit_transformer.py` define a transformer-based architecture over visits, each consisting of a set of OMOP concepts.

-Cohorts are defined by giving the schema in which the cohort table will be materialized, a unique cohort name, and a SQL script that uses OMOP standard tables (and/or user defined auxiliary tables) to generate the cohort itself.
+### `sparse`

-An example script can be found in /sql/Cohorts. As in that script, cohort definitions should give at minimum a unique example ID, a person ID corresponding to the patient's unique identifier in the rest of the OMOP database, and an outcome column (here denoted by 'y') indicating the outcome of interest for this particular patient.
+The classes in `sparse` allow for end-to-end modeling over the created feature representation using sparse tensors in COO format. `data.py` defines the previously reviewed `OMOPDatasetSparse` and `OMOPDatasetWindowed` classes, where `OMOPDatasetWindowed` aggregates features over multiple time windows. `models.py` defines a wrapper over the `sklearn` `LogisticRegression` estimator that integrates tightly with the `OMOPDatasetWindowed` class.

-#### FeatureGenerator.py
+### `torch`

-The FeatureGenerator file defines two objects: Features and FeatureSets. Features are defined by a SQL script and a set of keyword arguments that can be used to modify the SQL script just before it is run through Python's 'format' functionality. Several SQL scripts are already pre-implemented and can be seen in /sql/Features. At present, PredictionLibrary supports time-series of binary features. Thus, feature SQL scripts should generate tables with at least the following columns:
-- A person ID to join with the cohort and identify which patient this feature is associated with
-- A feature name, which often will be generated by joining with OMOP's concept table to get a human-readable description of a OMOP concept
-- A timestamp value
+The classes in `data.py` define a wrapper around the `OMOPDataset` object for use with PyTorch tensors. Similar to the classes in `hf`, these allow for quick modeling with `torch` code. `models.py` gives some example models that can ingest `OMOPDatasetTorch` objects, including an alternate implementation of the `VisitTransformer`.

-FeatureSet objects simply collect a list of Feature objects. When the 'build' function is called, the FeatureSet will run all SQL associated with each Feature and insert the resulting rows into a highly data-efficient three-dimensional sparse tensor representation, with the three axes of this tensor representing distinct patients, distinct features, and distinct timestamps respectively. The tensor can then be accessed directly and manipulated as needed for any chosen modelling approach.
+### `utils`
+A variety of `utils` are provided which support both data ingestion and modeling. `config.py` defines a simple configuration object for use in constructing the backend, while the methods in `date_utils.py` convert between Unix timestamps and datetime objects.
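+
+For example, such conversions can be written with Python's standard library alone (a minimal sketch; the exact names and signatures in `date_utils.py` may differ):
+
+```python
+# Minimal sketch using only the standard library; omop-learn's date_utils.py
+# may expose different names and signatures.
+from datetime import datetime, timezone
+
+def unix_to_datetime(ts: float) -> datetime:
+    """Convert a Unix timestamp in seconds to a timezone-aware datetime."""
+    return datetime.fromtimestamp(ts, tz=timezone.utc)
+
+def datetime_to_unix(dt: datetime) -> float:
+    """Convert a datetime back to a Unix timestamp in seconds."""
+    return dt.timestamp()
+
+assert datetime_to_unix(unix_to_datetime(0.0)) == 0.0
+```
+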
-### End of Life Linear Model Example.ipynb and End of Life Deep Model Example.ipynb
-
-These notebooks walk through all the functionality of the library through the example of building a relatively simple yet performant end-of-life prediction model from OMOP data loaded from IBC. Use these files as a tutorial and as a way to see the correct way to call the functions in the library.
+`embedding_utils.py` defines a `gensim` word embedding model used in the end-of-life example notebook.
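+
+As an illustrative sketch, a model of this kind can be trained over per-patient sequences of concept codes; the sequences and parameters below are hypothetical, not taken from the notebook:
+
+```python
+# Illustrative sketch: train gensim Word2Vec embeddings over per-patient
+# sequences of OMOP concept codes. The sequences below are hypothetical.
+from gensim.models import Word2Vec
+
+concept_sequences = [
+    ["201826", "4329847", "313217"],  # one patient's concept history, in time order
+    ["313217", "201826"],
+]
+model = Word2Vec(sentences=concept_sequences, vector_size=64, window=5, min_count=1)
+embedding = model.wv["201826"]  # 64-dimensional vector for this concept
+```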