Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor insert pattern for GitHub transform and load #2968

Open
2 tasks
widal001 opened this issue Nov 21, 2024 · 0 comments
Open
2 tasks

Refactor insert pattern for GitHub transform and load #2968

widal001 opened this issue Nov 21, 2024 · 0 comments

Comments

@widal001
Copy link
Collaborator

Summary

Currently, the pattern adopting for upserting data in etldb/main.py is to:

  1. Filter for the unique set of Github IDs
  2. Iterate through that list of IDs
  3. Then filter the dataset again to get data for a given ID

See the following sections for examples:

A more common pattern would be to select the distinct set of values you want to insert using pandas DataFrame.drop_duplicates() method and then iterate through the list of data inserting each row or preferably doing a bulk upsert.

In addition to requiring multiple passes through the dataset with $On^k$ complexity (where $n</code> is the size of the source DataFrame and $<code>k$ is the number of unique IDs for a given table) vs a single pass to retrieve all data we want to insert, this pattern also makes it harder to adopt bulk upserts.

Bulk transactions are helpful because they allow us to combine DML into transaction blocks more easily and prevent instances in which tables are only partially updated during a batch data load process.

Acceptance criteria

  • We've reduced the number of passes through the dataframe needed to extract the unique set of values to insert into the database
  • The data is loaded using bulk DML statements rather than separate insert/update statements for each row of data.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

No branches or pull requests

1 participant