[Issue 2887] Add new columns to EtlDb and bump schema version #2931
Conversation
… so tests run *after* db is initialized
@widal001 @coilysiren please review
LGTM! Everything runs as expected locally and I see the data in my local Postgres!
I left one comment about the general pattern for inserting the data, but created a separate ticket and added it to our improvements epic since the pattern is used across the inserts in this module.
- DB migration
- Loading the data
- Querying the resulting data
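For anyone reproducing that last check, a minimal query sketch along these lines would surface the loaded data in a local Postgres. The connection URL and every column name other than gh_sprint.project_id are assumptions for illustration, not taken from this PR:

```python
# Minimal local spot-check sketch. Only the gh_project table and the new
# gh_sprint.project_id column come from this PR; the connection URL and the
# id/ghid columns are illustrative assumptions.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://app:secret@localhost:5432/analytics")

with engine.connect() as conn:
    rows = conn.execute(
        text(
            "SELECT p.ghid, COUNT(s.id) AS sprint_count "
            "FROM gh_project p "
            "LEFT JOIN gh_sprint s ON s.project_id = p.id "
            "GROUP BY p.ghid "
            "ORDER BY p.ghid"
        )
    ).all()

for ghid, sprint_count in rows:
    print(f"{ghid}: {sprint_count} sprints")
```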
for ghid in dataset.get_project_ghids():
    project_df = dataset.get_project(ghid)
    result[ghid], _ = model.sync_project(project_df)
File this away for future improvements, but just wanted to highlight that instead of:
- Filtering for the unique set of project github IDs
- Iterating through that list of IDs
- Filtering the dataset again to get project data for a given ID
A more common pattern would be to select the distinct set of values we want to insert using pandas' drop_duplicates function, then iterate through that de-duplicated data and insert each row or (preferably) do a bulk upsert; a rough sketch follows this comment.
The difference between these patterns is trivial when we're inserting 3 project values, but it seems we're using the same pattern in most places.
The current approach also limits our options for bulk operations, which would make it easier to wrap the DML in transaction blocks and roll back all changes if a statement fails, avoiding partial updates of tables during a batch process.
I think it's good that the current insert pattern follows the others, but I created this ticket for us to revisit that insert pattern and added it to our improvements/tech debt epic.
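A minimal sketch of the suggested pattern, assuming the dataset is available as a pandas DataFrame with hypothetical project_ghid/project_name columns and that gh_project has a unique constraint on a ghid column (both assumptions, not taken from this PR):

```python
# Sketch of the suggested pattern: de-duplicate once with pandas, then do a
# single bulk upsert inside one transaction. Table/column names beyond
# gh_project are assumptions, as is the unique constraint on ghid.
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://app:secret@localhost:5432/analytics")

def sync_projects(dataset_df: pd.DataFrame) -> None:
    # Select the distinct set of project values once, instead of re-filtering
    # the full dataset for each project ghid.
    projects = dataset_df[["project_ghid", "project_name"]].drop_duplicates()

    upsert = text(
        "INSERT INTO gh_project (ghid, name) "
        "VALUES (:ghid, :name) "
        "ON CONFLICT (ghid) DO UPDATE SET name = EXCLUDED.name"
    )

    # engine.begin() wraps the whole batch in one transaction, so a failure
    # rolls back every row and avoids a partially updated table.
    with engine.begin() as conn:
        conn.execute(
            upsert,
            [
                {"ghid": row.project_ghid, "name": row.project_name}
                for row in projects.itertuples(index=False)
            ],
        )
```

The list-of-dicts form executes as a single executemany-style batch, and engine.begin() provides the rollback-on-failure behavior described above.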
Summary
Fixes #2887
Time to review: 5 mins
Changes proposed
- Add new facts and dimensions to EtlDb to support certain dashboards in Metabase: new tables gh_deliverable_history and gh_project, and the table gh_sprint is altered to add a new column project_id
- Modify EtlDataset to support the new columns
- Modify analytics/integrations/etldb main and models to use the new schema during transform and load
- Move logic out of etldb main and into model classes to make main more readable
Context for reviewers
Certain dashboards in Metabase required facts and dimensions that EtlDb did not yet have. This PR adds the needed fields, modifies transform/load to use them, and bumps the schema version to 4.
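For illustration only, the version-4 schema changes described above amount to roughly the following DDL. The exact definitions live in the migration itself; apart from the gh_project and gh_deliverable_history table names and the gh_sprint.project_id column, every column and type below is an assumption:

```python
# Rough illustration of the schema changes behind the version bump to 4.
# Only the table names gh_project, gh_deliverable_history, and the new
# gh_sprint.project_id column come from this PR; all other columns and
# types here are assumptions, not the actual migration.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://app:secret@localhost:5432/analytics")

DDL = [
    # New project dimension (columns assumed)
    "CREATE TABLE IF NOT EXISTS gh_project ("
    " id SERIAL PRIMARY KEY,"
    " ghid TEXT UNIQUE NOT NULL,"
    " name TEXT)",
    # New deliverable-history fact (columns assumed)
    "CREATE TABLE IF NOT EXISTS gh_deliverable_history ("
    " id SERIAL PRIMARY KEY,"
    " deliverable_id INTEGER,"
    " d_effective DATE)",
    # New column linking sprints to projects
    "ALTER TABLE gh_sprint ADD COLUMN IF NOT EXISTS project_id INTEGER",
]

# Apply all statements in one transaction so a failure leaves the schema unchanged.
with engine.begin() as conn:
    for stmt in DDL:
        conn.execute(text(stmt))
```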
Additional information