Update docs for table builders (#121)
* Update docs for table builders

* PR feedback

* Updated sql style guide section
dogversioning authored Sep 12, 2023
1 parent 36323e4 commit ee6afd8
Showing 6 changed files with 262 additions and 17 deletions.
9 changes: 9 additions & 0 deletions cumulus_library/studies/template/manifest.toml
@@ -28,6 +28,15 @@ export_list = [
"template__count_influenza_test_month",
]

# For generating counts tables in a more standardized manner, we have a class in the
# main library you can extend that will handle most of the logic of assembling
# queries for you. We use this pattern for generating the core tables, as well as
# other studies authored inside BCH. These will always be run after any other
# SQL queries have been generated.
# [counts_builder_config]
# file_names = [
# "count.py"
# ]

# For most use cases, this should not be required, but if you need to programmatically
# build tables, you can provide a list of files implementing BaseTableBuilder.
2 changes: 1 addition & 1 deletion docs/core-study-details.md
@@ -1,7 +1,7 @@
---
title: Core Study Details
parent: Library
nav_order: 4
nav_order: 5
# audience: clinical researchers, IRB reviewers
# type: reference
---
203 changes: 203 additions & 0 deletions docs/creating-sql-with-python.md
@@ -0,0 +1,203 @@
---
title: Creating SQL with Python
parent: Library
nav_order: 4
# audience: clinical researcher or engineer familiar with project
# type: tutorial
---

# Creating SQL with python

Before jumping into this doc, take a look at
[Creating Studies](creating-studies.md).
If you're just working with `core` tables related to the US Core FHIR profiles, you
may not be interested in this, or only need to look at the
[Working with TableBuilders](#working-with-tablebuilders)
and the
[Generating count tables](#generating-counts-tables)
sections.

## Why would I even need to think about this?

There are three main reasons why you would need to use Python to generate SQL:
- You would like to make use of the
[helper class we've built](#generating-counts-tables)
for ease of creating count tables in a structured manner.
- You have a dataset you'd like to
[load into a table from a static file](#adding-a-static-dataset),
separate from the ETL tables.
- The gnarly one: you are working against the raw FHIR resource tables, and are
trying to access
[nested data](#querying-nested-data) in Athena.
- We infer datatypes in the ETL based on the presence of data once we get past
the top level elements, and so the structure may vary depending on the
implementation, either at the EHR level or at the FHIR interface level.


We've got examples of all three of these cases in this repo, and we'll reference
those as examples as we go.

## Utilities

There are two main bits of infrastructure we use for programmatic tables:
the TableBuilder class, and the collection of template SQL.

### Working with TableBuilders

We have a base
[TableBuilder class](../cumulus_library/base_table_builder.py)
that
all the above use cases leverage. At a high level, here's what it provides:

- A `prepare_queries` function, which is where you put your custom logic. It
should create an array of queries in `self.queries`. The CLI will pass in a cursor
object and database/schema name, so if you need to interrogate the dataset to decide
how to structure your queries, you can.
- An `execute_queries` function, which will run `prepare_queries` and then apply
those queries to the database. You shouldn't need to touch this function -
just be aware this is how your queries actually get run.
- A `write_queries` function, which will write your queries from `prepare_queries`
to disk. If you are creating multiple queries in one go, calling `comment_queries`
before `write_queries` will insert some spacing elements for readability.
- A `display_text` string, which is what will be shown with a progress bar when your
queries are being executed.

You can either extend this class directly (like `builder_*.py` files in
`cumulus_library/studies/core`) or create a specific class to add reusable functions
for a repeated use case (like in `cumulus_library/schema/counts.py`).
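
As a rough illustration of the pattern described above, here is a minimal, self-contained sketch. `SketchTableBuilder`, `RecordingCursor`, `PatientSummaryBuilder`, and the `my_study__patient_summary` table are all hypothetical stand-ins written for this example; the real base class lives in `cumulus_library/base_table_builder.py`, and its exact signatures may differ.

```python
class SketchTableBuilder:
    """Stand-in for the library's base class, so this sketch runs on its own."""

    display_text = "Building custom tables..."

    def __init__(self):
        self.queries = []

    def prepare_queries(self, cursor, schema):
        raise NotImplementedError

    def execute_queries(self, cursor, schema):
        # Build the queries first, then apply each one to the database.
        self.prepare_queries(cursor, schema)
        for query in self.queries:
            cursor.execute(query)


class RecordingCursor:
    """Fake cursor that records SQL instead of talking to Athena."""

    def __init__(self, row_count):
        self.row_count = row_count
        self.executed = []

    def execute(self, query):
        self.executed.append(query)

    def fetchone(self):
        return (self.row_count,)


class PatientSummaryBuilder(SketchTableBuilder):
    display_text = "Creating patient summary table..."

    def prepare_queries(self, cursor, schema):
        # Interrogate the dataset before deciding how to structure queries.
        cursor.execute(f"SELECT count(*) FROM {schema}.patient")
        if cursor.fetchone()[0] == 0:
            return  # nothing to summarize, so emit no queries
        self.queries.append(
            f"CREATE TABLE {schema}.my_study__patient_summary AS "
            f"SELECT gender, count(*) AS cnt FROM {schema}.patient GROUP BY gender"
        )


cursor = RecordingCursor(row_count=3)
builder = PatientSummaryBuilder()
builder.execute_queries(cursor, "my_schema")
print(builder.queries[0])
```

The point of the split is that only `prepare_queries` carries study-specific logic; everything about running the queries stays in the base class.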

TableBuilder SQL generally should go through a template SQL generator, so that
your SQL has been validated. If you're just working on counts, you don't need
to worry about this detail, but otherwise, the following section talks about
our templating mechanism.

### Working with template SQL

If you are only worried about building counts tables, skip this section -
we've got enough wrappers that you shouldn't need to worry about this
level of detail.

For validating SQL, we are using
[Jinja templates](https://jinja.palletsprojects.com/en/3.1.x/)
to create validated SQL in a repeatable manner. We don't expect you to write these
templates - instead, using the
[template function library](../cumulus_library/template_sql/templates.py)
you can provide a series of arguments to these templates that will allow you to
generate standard types of SQL tables, as well as using templates targeted for
bespoke operations.

When you're thinking about a query that you'd need to create, first check the
template function library to see if something already exists. Basic CRUD
should be covered, as well as unnestings for some common FHIR objects.
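
To make the idea of a template function concrete, here is a small sketch of the general shape: a fixed SQL skeleton, plus a function that checks its arguments before filling it in. It uses the standard library's `string.Template` as a stand-in for Jinja so that it runs without extra dependencies. The function name `sketch_ctas_from_table` and the table and column names are hypothetical and are not part of the template function library.

```python
import re
from string import Template

# Fixed SQL skeleton; only the placeholders vary between calls.
CTAS_TEMPLATE = Template(
    "CREATE TABLE $schema.$table AS\n"
    "SELECT $columns\n"
    "FROM $schema.$source;"
)

# Only plain identifiers are allowed into the template.
IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def sketch_ctas_from_table(schema, table, source, columns):
    """Hypothetical template function: validate the arguments, then render."""
    for name in [schema, table, source, *columns]:
        if not IDENTIFIER.match(name):
            raise ValueError(f"Not a plain SQL identifier: {name!r}")
    return CTAS_TEMPLATE.substitute(
        schema=schema,
        table=table,
        source=source,
        columns=", ".join(columns),
    )


sql = sketch_ctas_from_table(
    "my_schema", "my_study__visits", "core__encounter", ["id", "status"]
)
print(sql)
```

Because callers only supply arguments, the SQL structure itself is written and reviewed once, which is the benefit the templating approach is after.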

## Use cases

### Generating counts tables
Studies repeatedly need powerset counts tables, built against a filtered resource,
to describe a particular clinical population. Since this is so common, we created
a class just for it, and we use it in all studies the Cumulus team authors
directly.

The [CountsBuilder class](../cumulus_library/schema/counts.py)
provides a number of convenience methods that are available for use (this covers
mechanics of generation). You can see examples of usage in the
[Core counts builder](../cumulus_library/studies/core/count_core.py)
(which is where the business logic of your study lives).

- `get_table_name` will scan the study's `manifest.toml` and auto prepend a table
name with whatever the study prefix is.
- `get_where_clauses` will format a string, or an array, of where clauses in a
manner that the table constructors will expect.
- `count_[condition,document,encounter,observation,patient]` will take a target table
name, a source table, and an array of columns, and produce the appropriate powerset
table to count that resource. You can optionally provide a list of where statements
for filtering, or can change the minimum bin size used to include data
- The `count_*` functions pass through to `get_count_query` - if you have a use
case we're not covering, you can use this interface directly. We'd love to hear
about it - we'd consider covering it and/or take PRs for new features
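
If the term "powerset counts" is unfamiliar, the following sketch shows the idea in plain Python: count rows for every subset of the chosen columns, then drop bins below a minimum size. This is only an illustration of the concept, not how `CountsBuilder` is implemented. The function name, the `min_bin` parameter, and the sample rows are assumptions made for the example, and it counts rows rather than distinct subjects.

```python
from collections import Counter
from itertools import combinations


def powerset_counts(rows, columns, min_bin=1):
    """Count rows for every subset of `columns`.

    In each result key, None marks a column that was rolled up (not grouped on).
    """
    counts = Counter()
    for size in range(len(columns) + 1):
        for subset in combinations(columns, size):
            for row in rows:
                key = tuple(row[col] if col in subset else None for col in columns)
                counts[key] += 1
    # Drop bins smaller than the minimum bin size.
    return {key: cnt for key, cnt in counts.items() if cnt >= min_bin}


rows = [
    {"gender": "female", "age_group": "adult"},
    {"gender": "female", "age_group": "child"},
    {"gender": "male", "age_group": "adult"},
]
all_bins = powerset_counts(rows, ["gender", "age_group"])
large_bins = powerset_counts(rows, ["gender", "age_group"], min_bin=2)
for key, cnt in sorted(large_bins.items(), key=str):
    print(key, cnt)
```

With two columns there are four groupings (neither, each one alone, and both together), which is why the output grows quickly as columns are added.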

As a convenience, if you include an `if __name__ == "__main__":` clause like you
see in `count_core.py`, you can generate the builder's output by running the file
directly with python, which is a nice way to get example SQL output for inclusion
in GitHub. This is where the
[count core sql output](../cumulus_library/studies/core/count_core.sql)
came from.

Add your count generator file to the `counts_builder_config` section of your
`manifest.toml` to include it in your build invocations.

### Adding a static dataset

*NOTE* - we have an
[open issue](https://github.com/smart-on-fhir/cumulus-library/issues/58)
to develop a faster methodology for adding new datasets.

Occasionally you will have a dataset from a third party that is useful for working
with your dataset. In the vocab study (requiring a license to use), we
[add coding system data](../cumulus_library/studies/vocab/vocab_icd_builder.py)
from flat files to athena. If you need to do this, you should extend the base
TableBuilder class, and your `prepare_queries` function should do the following,
leveraging the
[template function library](../cumulus_library/template_sql/templates.py):
- Use the `get_ctas_query` function to get a CREATE TABLE AS statement to
instantiate your table in Athena
- Since Athena SQL queries are limited in size to 262144 bytes, if you have
a large dataset, break it up into smaller chunks
- Use the `get_insert_into` function to add the data from each chunk to
the table you just created.
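
The chunking step can be sketched as follows. This is a standalone illustration, not the library's code: `chunk_rows`, the statement prefix, and the table name are hypothetical. It treats every value as a string, and a single row larger than the limit would still be emitted as its own oversized chunk.

```python
MAX_QUERY_BYTES = 262144  # Athena query size limit quoted above


def chunk_rows(rows, prefix, max_bytes=MAX_QUERY_BYTES):
    """Group rendered rows so each INSERT statement stays within max_bytes."""
    prefix_bytes = len(prefix.encode("utf-8"))
    chunks, current, size = [], [], prefix_bytes
    for row in rows:
        # Render every value as a quoted string, escaping single quotes.
        values = ", ".join("'" + str(v).replace("'", "''") + "'" for v in row)
        rendered = f"({values})"
        row_bytes = len(rendered.encode("utf-8")) + 2  # +2 for the ", " separator
        if current and size + row_bytes > max_bytes:
            # Adding this row would overflow, so close out the current chunk.
            chunks.append(prefix + ", ".join(current))
            current, size = [], prefix_bytes
        current.append(rendered)
        size += row_bytes
    if current:
        chunks.append(prefix + ", ".join(current))
    return chunks


prefix = "INSERT INTO my_study__codes (code, display) VALUES "
rows = [(f"C{i:03d}", f"Display {i}") for i in range(100)]
statements = chunk_rows(rows, prefix, max_bytes=500)
print(len(statements), "statements; first:", statements[0][:80])
```

Measuring the encoded byte length rather than the character count matters when the dataset contains non-ASCII text.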

Add the dataset uploader to the `table_builder_config` section of your
`manifest.toml` to include it in your build - this will make this data
available for downstream queries

### Querying nested data

Are you trying to access data from deep within raw FHIR tables? I'm so sorry.
Here's an example of how this can get fussy with code systems:

A FHIR coding element may be an array, or it may be a singleton, or it may
be a singleton wrapped in an array. It may be fully populated, or partially populated,
or completely absent. There may be one code per record, or multiple codes per record,
and you may only be interested in a subset of these codes.

This means you may have differing schemas in Athena from one site's data to another
(especially if they come from different EHR systems, where implementation details
may differ). In order to handle this, you need to create a standard output
representation that accounts for all the different permutations you have, and
conform data to match that. The
[encounter coding](../cumulus_library/studies/core/builder_encounter_coding.py)
and
[condition codeableConcept](../cumulus_library/studies/core/builder_condition_codeableconcept.py)
builders both jump through hoops to try and get this data into flat tables for
downstream use.

This is a pretty open ended design problem, but based on our experience, your
`prepare_queries` implementation should attempt the following steps:
- Check if your table has any data at all
- If it does, inspect the table schema to see if the field you're interested in
is populated with the schema elements you're expecting
- If yes, it's safe to grab them
- If no, you will need to manually initialize them to an appropriate null value
- If you are dealing with deeply nested objects, you may need to repeat the above
more than once
- Write a jinja template that handles the conditionally present data, and a
template function to invoke that template
- Test this against data samples from as many different EHR vendors as you can
- Be prepared to need to update this when you hit a condition you didn't expect
- Create a distinct table that has an ID for joining back to the original
- Perform this join as appropriate to create a table with unnested data
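
The "conform everything to one output shape" idea from the steps above can be sketched in plain Python. In practice this logic would live in SQL generated from a Jinja template, so treat this only as an analogy: `flatten_codings` and the sample records are hypothetical, and the real builders handle more cases than this.

```python
def flatten_codings(record_id, coding, wanted_systems=None):
    """Return (id, system, code, display) rows for one record's coding element.

    `coding` may be None, a single dict, or a list of dicts, and any field
    inside a coding may be missing.
    """
    if coding is None:
        candidates = []
    elif isinstance(coding, dict):
        candidates = [coding]  # singleton: treat it as a one-item array
    else:
        candidates = [item for item in coding if item]  # drop empty entries
    rows = []
    for item in candidates:
        system = item.get("system")
        if wanted_systems is not None and system not in wanted_systems:
            continue
        # Missing fields are initialized to None, the null value for this sketch.
        rows.append((record_id, system, item.get("code"), item.get("display")))
    return rows


records = {
    "enc-1": {"system": "http://example.org/sys", "code": "A"},
    "enc-2": [{"code": "B"}],
    "enc-3": None,
}
flat = [row for rid, coding in records.items() for row in flatten_codings(rid, coding)]
for row in flat:
    print(row)
```

Every output row carries the record ID, which is what allows the flat table to be joined back to the original resource.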

You may find it useful to use the `--builder [filename]` sub argument of the cli
build command to run just your builder for iteration. The
[Sample bulk FHIR datasets](https://github.com/smart-on-fhir/sample-bulk-fhir-datasets)
can provide an additional testbed database above and beyond whatever you produce
in house.

Add this builder to the `table_builder_config` section of your
`manifest.toml` - this will make this data available for downstream queries.

Good luck! If you think you're dealing with a pretty common case, you can reach
out to us on the
[discussion forum](https://github.com/smart-on-fhir/cumulus/discussions)
and we may be able to provide an implementation for you, or provide assistance
if you're dealing with a particular edge case.
61 changes: 47 additions & 14 deletions docs/creating-studies.md
@@ -13,8 +13,8 @@ aggregations in support of ongoing projects.

## Setup

If you are going to be creating new studies, we strongly recommend adding an
environment variable, `CUMULUS_LIBRARY_PATH`, pointing to the folder in which
If you are going to be creating new studies, we recommend, but do not require, adding
an environment variable, `CUMULUS_LIBRARY_PATH`, pointing to the folder in which
you'll be working on study development. `cumulus-library` will look in each
subdirectory of that folder for manifest files, so you can run several studies
at once.
@@ -24,15 +24,21 @@ to any build/export call to tell it where to look for your work.

## Creating a new study

The easiest way to get started with a new study is to use `cumulus-library` to
create a manifest for you. You can do this with by running:
There are two ways to get started with a new study:

1. Use `cumulus-library` to create a manifest for you. You can do this by running:
```bash
cumulus-library create ./path/to/your/study/dir
```
We'll create that folder if it doesn't already exist. We recommend you use a name
relevant to your study (we'll use `my_study` forthis document). The folder name is
the same thing you will use as a target with `cumulus_library` to run your study's
queries.
We'll create that folder if it doesn't already exist.

2. Fork the
[Cumulus library template repo](https://github.com/smart-on-fhir/cumulus-library-template),
rename your fork, and clone it directly from GitHub.

We recommend you use a name relevant to your study (we'll use `my_study` for this
document). The folder name is the same thing you will use as a target with
`cumulus_library` to run your study's queries.

Once you've made a new study, the `manifest.toml` file is the place you let cumulus
library know how you want your study to be run against the remote database. The
@@ -68,14 +74,25 @@ Talking about what these three sections do:
counts to reduce exposure of limited datasets, and so we recommend only exporting
count tables.

There are other hooks you can use in the manifest for more advanced control over
how you can generate sql. See [Creating SQL with python](creating-sql-with-python.md)
for more information.

We recommend creating a git repo per study, to help version your study data, which
you can do in the same directory as the manifest file.
you can do in the same directory as the manifest file. If you've forked your study from
the template, you've already checked this step off.

### Writing SQL queries

Most users have a workflow that looks like this:
- Write queries in the [AWS Athena console](https://aws.amazon.com/athena/) while
you are exploring the data
- We recommend trying to keep your studies pointed at the `core` tables. The
base FHIR resource named tables contain a lot of nested data that can be tricky
to write cross-EHR queries against, and so you'll save yourself some headaches
if everything you need is available via those resources. If it isn't, make sure
you look at the [Creating SQL with python](creating-sql-with-python.md) guide
for information about safely extracting datasets from those tables.
- Move queries to a file as you finalize them
- Build your study with the CLI to make sure your queries load correctly.

@@ -115,9 +132,13 @@ styling.
they have a small number of members**, i.e. less than 10.

**Recommended**
- You may want to select a SQL style guide as a reference.
- You may want to select a SQL style guide as a reference. Mozilla provides a
[SQL style guide](https://docs.telemetry.mozilla.org/concepts/sql_style.html),
which our sqlfluff config enforces. If you have a different style you'd like
to use, you can update the `.sqlfluff` config to allow this. For example,
[Gitlab's data team](https://about.gitlab.com/handbook/business-technology/data-team/platform/sql-style-guide/)
has an example of this, though there are other choices.
has a style guide that is more centered around DBT, but is more prescriptive
around formatting.
- Don't implicitly reference columns tables. Either use the full table name,
or give the table an alias, and use that any time you are referencing a column.
- Don't use the * wildcard in your final tables. Explicitly list the columns
@@ -127,16 +148,28 @@ styling.
to find other problems if you lightly adhere to this from the start.
- Aggregate count tables should have the first word after the study prefix be
`count`, and otherwise the word `count` should not be used.

**Metadata tables**
- Creating a table called `my_study__meta_date` with two columns, `min date` and
`max date`, and populating it with the start and end date of your study, will
allow other Cumulus tools to detect study date ranges, and otherwise bakes the
study date range into your SQL for future reference.
- Creating a `my_study__meta_version` with one column, `data_package_version`, and
giving it an integer value as shown in this snippet:
```sql
CREATE TABLE my_study__meta_version AS
SELECT 1 AS data_package_version;
```
allows you to signal versions for use in segregating data upstream, like in the
Cumulus aggregator - just increment it when you will need third parties to rerun
your study from scratch due to a change in your counts output. If this is not
set, the version will implicitly be set to zero.

## Sharing studies

If you want to share your study as part of a publication, you'll need to open a PR -
after cloning this repository, make a branch, and add your study config to the
`cumulus_library/studies/` directory, and then just open a PR.
If you want to share your study as an official Cumulus study, please let us know
via the [discussion forum](https://github.com/smart-on-fhir/cumulus/discussions) -
we can talk more about what makes sense for your use case.

If you write a paper using the Cumulus library, please
[cite the project](https://smarthealthit.org/cumulus-a-universal-sidecar-for-a-smart-learning-healthcare-system/)
2 changes: 1 addition & 1 deletion docs/sharing-data.md
@@ -1,7 +1,7 @@
---
title: Data Sharing
parent: Library
nav_order: 5
nav_order: 6
# audience: IT security or clinical researcher with low to medium familiarity with project
# type: explanation
---
2 changes: 1 addition & 1 deletion docs/study-list.md
@@ -1,7 +1,7 @@
---
title: Cumulus studies
parent: Library
nav_order: 6
nav_order: 7
# audience: Clinical researchers interested in publications
# type: reference
---