Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Issue 2887] Add new columns to EtlDb and bump schema version #2931

Merged
merged 37 commits into from
Nov 21, 2024

Conversation

DavidDudas-Intuitial
Copy link
Collaborator

@DavidDudas-Intuitial DavidDudas-Intuitial commented Nov 19, 2024

Summary

Fixes #2887

Time to review: 5 mins

Changes proposed

What was added, updated, or removed in this PR.

Add new facts and dimensions to EtlDb to support certain dashboards in Metabase:

  • Slowly changing dimension "deliverable status" is now supported via the new table gh_deliverable_history
  • Sprint-to-project relationship is now supported via the new table gh_project, and the table gh_sprint is altered to add a new column project_id
  • Updated EtlDataset to support the new columns
  • Updated analytics/integrations/etldb main and models to use the new schema during transform and load
  • Moved exception handling out of etldb main and into model classes to make main more readable
  • Updated tests and mock file used for tests

Context for reviewers

Testing instructions, background context, more in-depth details of the implementation, and anything else you'd like to call out or ask reviewers. Explain how the changes were verified.

Certain dashboards in Metabase required certain facts and dimensions in EtlDb. This PR adds the fields that are needed, modifies transform/load to use them, and bumps the schema version to 4.

Additional information

Screenshots, GIF demos, code examples or output to help show the changes working as expected.

db-init-in-CI

@DavidDudas-Intuitial DavidDudas-Intuitial marked this pull request as draft November 19, 2024 18:40
@DavidDudas-Intuitial
Copy link
Collaborator Author

@widal001 @coilysiren please review

Copy link
Collaborator

@widal001 widal001 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Everything runs as expected locally and I see the data in my local Postgres!

I left one comment about the general pattern for inserting the data, but created a separate ticket and added it to our improvements epic since the pattern is used across the inserts in this module.

DB migration

Screenshot 2024-11-21 at 9 16 17 AM

Loading the data

Screenshot 2024-11-21 at 9 18 25 AM

Querying the resulting data

Screenshot 2024-11-21 at 10 01 59 AM Screenshot 2024-11-21 at 9 23 12 AM

Comment on lines +155 to +157
for ghid in dataset.get_project_ghids():
project_df = dataset.get_project(ghid)
result[ghid], _ = model.sync_project(project_df)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

File this away for future improvements, but just wanted to highlight that instead of:

  1. Filtering for the unique set of project github IDs
  2. Iterating through that list of IDs
  3. Filtering the dataset again to get project data for a given ID

A more common pattern would be to select the distinct set of values we want to insert using pandas drop_duplicates function and then iterate through the list of data, inserting each row or (preferably) doing a bulk upsert.

The difference between these patterns is trivial when we're inserting 3 project values, but it seems we're using the same pattern in most places:

Plus this current approach limits the options we have for bulk operations, which would make it easier to wrap DML into transaction blocks and rollback all changes if the statement fails, avoiding partial updates of tables during a batch process.

I think it's good the current insert pattern follows the others, but I created this ticket for us to revisit that insert pattern and added it to our improvements/tech debt epic

@DavidDudas-Intuitial DavidDudas-Intuitial merged commit 0fbbae2 into main Nov 21, 2024
1 check passed
@DavidDudas-Intuitial DavidDudas-Intuitial deleted the issue-2887-new-etldb-columns branch November 21, 2024 17:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add new columns to analytics database to enable sprint burndown charts
3 participants