This repository contains Dagster code that moves data from an Ed-Fi API into Google Cloud Storage and then materializes it into various data marts in Google BigQuery. This is a common ELT architecture: raw data is extracted from a source system and stored in a data lake (extract and load), and transformation then turns that data into datasets built for use in analytics.
Note: The Ed-Fi Data Standard is well defined and used across many organizations, which gives vendors a common API surface to build clients against. Analytics is different. Analytics is more personal. This code is meant to provide a strong foundation for your analytics layer: you should modify the dbt data models so that your dims and facts represent your context. This is not plug-and-play software. You are expected to learn Python and SQL and to modify this code.
Let's start by running Dagster locally on your machine. This is referred to as your dev environment: it's where you make changes to the code and iterate quickly. Once you have a version of the code that's working well, it gets run in production, and that's what your reports and dashboards are built from.
This Dagster code has been tested on a Mac and on a Windows PC running Ubuntu via WSL2. Setup guides have been created for Mac and Windows. These guides are intentionally opinionated to ensure you're set up with a development workflow that doesn't hinder your ability to write code and iterate quickly.
The dagster and dbt repositories in this GitHub organization are meant to be used together. Click the Use this template button on both repositories to make a copy into your personal GitHub and clone both to your local machine.
├── code_folder
│   ├── dagster
│   │   └── ...
│   ├── dbt
│   │   └── ...
│   └── ...
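If you haven't cloned them yet, a minimal sketch of the commands (the URLs and repo names below are placeholders for your own template copies):

# clone your template copies side by side under a shared parent folder
mkdir -p code_folder;
git clone https://github.com/<your-username>/<your-dagster-repo>.git code_folder/dagster;
git clone https://github.com/<your-username>/<your-dbt-repo>.git code_folder/dbt;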
At the root of both the dagster and dbt repos is a .env-sample file. Copy the file to create a .env file. In the sections below, you will be asked to complete specific values in the .env files.
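For example, from the root of each repo:

# create your local .env from the provided sample (run in both repos)
cp .env-sample .env;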
Let's take a detour and configure your Google Cloud environment now.
Create a Google Cloud Platform (GCP) project and set the following variables in your .env file:
- GOOGLE_CLOUD_PROJECT to the Google Cloud project ID
- EDFI_BASE_URL to your Ed-Fi API base URL
- EDFI_API_KEY to your Ed-Fi API key
- EDFI_API_SECRET to your Ed-Fi API secret
- DBT_PROFILES_DIR to the base path of your dbt folder
- DBT_PROJECT_DIR to the base path of your dbt folder
- PYTHONPATH to the project folder in your dagster folder
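Filled in, a dagster .env might look something like this (every value below is an illustrative placeholder; substitute your own):

# example .env — all values are placeholders
GOOGLE_CLOUD_PROJECT=my-edfi-project
EDFI_BASE_URL=https://your-edfi-host/v5.3/api
EDFI_API_KEY=your-api-key
EDFI_API_SECRET=your-api-secret
DBT_PROFILES_DIR=/home/you/code_folder/dbt
DBT_PROJECT_DIR=/home/you/code_folder/dbt
# the project folder inside your dagster repo (folder name may differ)
PYTHONPATH=/home/you/code_folder/dagster/project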
Note: If your machine is set up correctly, the .env file should automatically be read into your environment when you cd into the dagster directory. To verify, cd into the dagster folder and run echo $GOOGLE_CLOUD_PROJECT. If your Google Cloud project ID does not print to the terminal, stop and troubleshoot.
Run the commands below to enable the APIs necessary for this repository.
gcloud config set project $GOOGLE_CLOUD_PROJECT;
gcloud config set compute/region us-central1;
gcloud services enable artifactregistry.googleapis.com;
gcloud services enable cloudbuild.googleapis.com;
gcloud services enable compute.googleapis.com;
gcloud services enable container.googleapis.com;
gcloud services enable servicenetworking.googleapis.com;
gcloud services enable sqladmin.googleapis.com;
gcloud services enable iamcredentials.googleapis.com;
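If you want to confirm the APIs were enabled, you can list the enabled services (an optional sanity check):

# optional: verify the services above now appear as enabled
gcloud services list --enabled | grep -E 'artifactregistry|cloudbuild|compute|container|servicenetworking|sqladmin|iamcredentials';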
Authentication with the GCP project happens through a service account. The commands below will create a service account and download a JSON key. This service account will also be used via Workload Identity when deployed in production on GKE.
gcloud config set project $GOOGLE_CLOUD_PROJECT;
gcloud iam service-accounts create dagster \
--display-name="dagster";
export SA_EMAIL=$(gcloud iam service-accounts list --format='value(email)' \
  --filter='displayName:dagster');
gcloud projects add-iam-policy-binding $GOOGLE_CLOUD_PROJECT \
--member serviceAccount:$SA_EMAIL \
--role roles/bigquery.jobUser;
gcloud projects add-iam-policy-binding $GOOGLE_CLOUD_PROJECT \
--member serviceAccount:$SA_EMAIL \
--role roles/bigquery.user;
gcloud projects add-iam-policy-binding $GOOGLE_CLOUD_PROJECT \
--member serviceAccount:$SA_EMAIL \
--role roles/bigquery.dataEditor;
gcloud projects add-iam-policy-binding $GOOGLE_CLOUD_PROJECT \
--member serviceAccount:$SA_EMAIL \
--role roles/storage.admin;
gcloud iam service-accounts keys create service.json \
--iam-account=$SA_EMAIL;
Store the downloaded JSON file in the root of the repository. Set the GOOGLE_APPLICATION_CREDENTIALS variable in your .env file to point to your service account JSON file.
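For example, assuming the key file kept the name service.json from the command above (use an absolute path if you won't always launch Dagster from the repo root):

# in your dagster .env
GOOGLE_APPLICATION_CREDENTIALS=./service.json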
Google Cloud Storage will be used to house the JSON data retrieved from the target Ed-Fi API. Create two buckets:
gsutil mb "gs://dagster-dev-${GOOGLE_CLOUD_PROJECT}";
gsutil mb "gs://dagster-prod-${GOOGLE_CLOUD_PROJECT}";
Set the GCS_BUCKET_DEV and GCS_BUCKET_PROD variables to the newly created bucket names.
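Following the naming pattern from the commands above (replace my-edfi-project with your actual project ID):

# in your dagster .env
GCS_BUCKET_DEV=dagster-dev-my-edfi-project
GCS_BUCKET_PROD=dagster-prod-my-edfi-project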
You should now have all variables completed in the .env files in both your dagster and dbt folders.
The setup guides recommend using Poetry to manage your Python virtual environments.
# do this in both your dagster and dbt folders
poetry env use 3.9;
poetry install;
Note: If your machine is set up correctly, when you cd into the dagster folder, your Python virtual environment should automatically be activated and your .env file should automatically be read into your environment. If that does not happen, stop and troubleshoot.
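A quick sanity check for both, run from inside the dagster folder:

# python should resolve to the Poetry virtualenv, not the system install
which python;
# your project ID should print, confirming the .env was loaded
echo $GOOGLE_CLOUD_PROJECT;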
You should now be able to launch dagit via the command below.
dagit -w workspace.yaml;
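If dagit starts cleanly, its UI should be reachable in your browser, by default at http://localhost:3000.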