[Issue 2887] Add new columns to EtlDb and bump schema version #2931
Conversation
… so tests run *after* db is initialized
@widal001 @coilysiren please review
LGTM! Everything runs as expected locally and I see the data in my local Postgres!
I left one comment about the general pattern for inserting the data, but created a separate ticket and added it to our improvements epic since the pattern is used across the inserts in this module.
- DB migration
- Loading the data
- Querying the resulting data
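For anyone reproducing that last check, a minimal query sketch along these lines would surface the loaded data in a local Postgres. The connection URL and every column name other than gh_sprint.project_id are assumptions for illustration, not taken from this PR:

```python
# Minimal local spot-check sketch. Only the gh_project table and the new
# gh_sprint.project_id column come from this PR; the connection URL and the
# id/ghid columns are illustrative assumptions.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://app:secret@localhost:5432/analytics")

with engine.connect() as conn:
    rows = conn.execute(
        text(
            "SELECT p.ghid, COUNT(s.id) AS sprint_count "
            "FROM gh_project p "
            "LEFT JOIN gh_sprint s ON s.project_id = p.id "
            "GROUP BY p.ghid "
            "ORDER BY p.ghid"
        )
    ).all()

for ghid, sprint_count in rows:
    print(f"{ghid}: {sprint_count} sprints")
```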
for ghid in dataset.get_project_ghids():
    project_df = dataset.get_project(ghid)
    result[ghid], _ = model.sync_project(project_df)
File this away for future improvements, but just wanted to highlight that instead of:
- Filtering for the unique set of project github IDs
- Iterating through that list of IDs
- Filtering the dataset again to get project data for a given ID
A more common pattern would be to select the distinct set of values we want to insert using pandas' drop_duplicates function, then iterate through that de-duplicated data and insert each row or (preferably) do a bulk upsert; a rough sketch follows this comment.
The difference between these patterns is trivial when we're inserting 3 project values, but it seems we're using the same pattern in most places.
The current approach also limits our options for bulk operations, which would make it easier to wrap the DML in transaction blocks and roll back all changes if a statement fails, avoiding partial updates of tables during a batch process.
I think it's good that the current insert pattern follows the others, but I created this ticket for us to revisit that insert pattern and added it to our improvements/tech debt epic.
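A minimal sketch of the suggested pattern, assuming the dataset is available as a pandas DataFrame with hypothetical project_ghid/project_name columns and that gh_project has a unique constraint on a ghid column (both assumptions, not taken from this PR):

```python
# Sketch of the suggested pattern: de-duplicate once with pandas, then do a
# single bulk upsert inside one transaction. Table/column names beyond
# gh_project are assumptions, as is the unique constraint on ghid.
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://app:secret@localhost:5432/analytics")

def sync_projects(dataset_df: pd.DataFrame) -> None:
    # Select the distinct set of project values once, instead of re-filtering
    # the full dataset for each project ghid.
    projects = dataset_df[["project_ghid", "project_name"]].drop_duplicates()

    upsert = text(
        "INSERT INTO gh_project (ghid, name) "
        "VALUES (:ghid, :name) "
        "ON CONFLICT (ghid) DO UPDATE SET name = EXCLUDED.name"
    )

    # engine.begin() wraps the whole batch in one transaction, so a failure
    # rolls back every row and avoids a partially updated table.
    with engine.begin() as conn:
        conn.execute(
            upsert,
            [
                {"ghid": row.project_ghid, "name": row.project_name}
                for row in projects.itertuples(index=False)
            ],
        )
```

The list-of-dicts form executes as a single executemany-style batch, and engine.begin() provides the rollback-on-failure behavior described above.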
Summary
Fixes #2887
Time to review: 5 mins
Changes proposed
- Add new facts and dimensions to EtlDb to support certain dashboards in Metabase: new tables gh_deliverable_history and gh_project, and the table gh_sprint is altered to add a new column project_id
- Modify EtlDataset to support the new columns
- Modify analytics/integrations/etldb main and models to use the new schema during transform and load
- Move logic out of etldb main and into model classes to make main more readable
Context for reviewers
Certain dashboards in Metabase required facts and dimensions that EtlDb did not yet have. This PR adds the needed fields, modifies transform/load to use them, and bumps the schema version to 4.
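For illustration only, the version-4 schema changes described above amount to roughly the following DDL. The exact definitions live in the migration itself; apart from the gh_project and gh_deliverable_history table names and the gh_sprint.project_id column, every column and type below is an assumption:

```python
# Rough illustration of the schema changes behind the version bump to 4.
# Only the table names gh_project, gh_deliverable_history, and the new
# gh_sprint.project_id column come from this PR; all other columns and
# types here are assumptions, not the actual migration.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://app:secret@localhost:5432/analytics")

DDL = [
    # New project dimension (columns assumed)
    "CREATE TABLE IF NOT EXISTS gh_project ("
    " id SERIAL PRIMARY KEY,"
    " ghid TEXT UNIQUE NOT NULL,"
    " name TEXT)",
    # New deliverable-history fact (columns assumed)
    "CREATE TABLE IF NOT EXISTS gh_deliverable_history ("
    " id SERIAL PRIMARY KEY,"
    " deliverable_id INTEGER,"
    " d_effective DATE)",
    # New column linking sprints to projects
    "ALTER TABLE gh_sprint ADD COLUMN IF NOT EXISTS project_id INTEGER",
]

# Apply all statements in one transaction so a failure leaves the schema unchanged.
with engine.begin() as conn:
    for stmt in DDL:
        conn.execute(text(stmt))
```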
Additional information