Audit Alongside Your Refactor #1306
patkearns10 announced in Archive
Link to a draft I wrote 6 months ago. Happy to start fresh with an outline if that's preferable!
Alternate longer titles:
Please answer the following questions to get the discussion started.
What is the main problem you are solving?
Refactoring shouldn't be scary, and testing shouldn't be a daunting task you avoid or save for the very end.
What is your solution? This should help form your core thesis.
What does this process look like? Instead of editing the code in line, you intentionally create a duplicate model file, which allows you to run both models in development at the same time and to refer back to the original code without navigating to other tools, branches, or points in history. After that, it's easy to compare the data outputs using the dbt audit_helper package, with the assurance that you're not seeing variances caused by differing environments or run times.
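For the comparison step, the audit_helper package has to be installed in the project first. A minimal sketch of the `packages.yml` entry (the version range shown is illustrative; check the package hub for the current one):

```yaml
# packages.yml
packages:
  - package: dbt-labs/audit_helper
    version: [">=0.9.0", "<1.0.0"]
```

Then run `dbt deps` to install it.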
Why should the reader care about this problem?
As a subject matter expert in my previous role, I was comfortable: I knew all the tables, fields, and business logic like the back of my hand. If I merged some code and broke something in today's pull request, I would fix it in the next one, no biggie. Maybe no one would even notice!
I was aware of the mental overhead: cognitively juggling the file I was working in, keeping track of my changes, and keeping tabs on whether they would cause issues down the road. I would edit 20 files, commit the changes, open a PR, and only then start running tests to determine whether the changes actually worked, by making sure my branch's output table matched the production version 100%.
This was all well and fine until I became a consultant working in others' projects, where I realized my process needed an upgrade.
Why is your solution the right one? This should help form your specific target audience.
It is one solution in a sea of possible solutions, but one that will let you sleep at night and instill trust in your stakeholders.
Can you list the steps of your solution for the reader here? This should help you form the overall narrative arc and sketch out an example use case to illustrate it.
1. Duplicate the model you're going to refactor. For example, duplicate `dim_orders` and name the duplicate `dim_orders__control` (this naming convention lets the two files show up next to each other in the file tree). The SQL code should be exactly the same in both files at this stage, but we'll be using the `dim_orders.sql` file to make our changes.
2. Add a new file in the analysis folder called `analysis/compare__template.sql` with the audit_helper SQL. The analysis folder allows you to store queries that use dbt functionality like Jinja, but doesn't actually build anything in your warehouse. This file will only be used while we're developing, so we can interactively check results. It's also optional: you could use a new statement tab instead, but if your refactor stretches over several days you may want to save the file temporarily.
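As a sketch of what that analysis file might contain, using audit_helper's `compare_relations` macro (the `order_id` primary key is an assumption for illustration):

```sql
-- analysis/compare__template.sql
-- Minimal sketch: compares the control model against the refactored model.
-- Assumes order_id uniquely identifies a row in dim_orders.
{{ audit_helper.compare_relations(
    a_relation=ref('dim_orders__control'),
    b_relation=ref('dim_orders'),
    primary_key='order_id'
) }}
```

By default the macro summarizes how many rows appear in both relations versus in only one, so a clean refactor shows a single row at 100%.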
3. Run a control test (`dbt run` and "Preview" `analysis/compare__template.sql`) before you make any edits! This is very important. You can initially run `dbt run -s +compare__template`. By adding the `+` selector, you're doing the same thing as `dbt run -s +dim_orders__control +dim_orders`. You want to run all parent models initially to ensure everything is similarly up to date. Then click "Preview" in `analysis/compare__template.sql` to confirm everything matches 100%. For any code that has window functions, this step verifies that the code is reproducible and brings any non-deterministic values to your attention. Non-deterministic ordering happens when the columns you declare in a window function's `order by` clause aren't specific enough for the database to assign an idempotent value. For example, say we have an event table and want to order events by how recently they happened. If we order by the date field (instead of the more specific timestamp field) and multiple events happen on the same date, then each time we run the model the order within each date partition will be assigned differently, without us changing anything!
4. Make your changes to the code in the `dim_orders` model.
5. Test your changes using the auditing code from `compare__template.sql`. After editing `dim_orders`, for all subsequent runs you can run `dbt run -s 1+compare__template`, which looks at the analysis model (`compare__template`) and finds its first-level upstream dependencies (`dim_orders__control` and `dim_orders`) to run, which are the two models you're testing. It's important to run both the control file and the newly refactored file at the same time to ensure parity between the two tables' contents.

Are there any resources that helped inspire or inform your idea?
https://discourse.getdbt.com/t/how-to-not-lose-your-mind-when-auditing-data/445
https://discourse.getdbt.com/t/how-to-not-lose-your-mind-when-auditing-data-part-ii/612
I actually learned of these after I wrote my rough draft; everything here comes from conversations with @christineberger and 6 months of auditing Webflow's DAGs.
I feel like the articles above didn't get enough traction; I wasn't aware they existed until I started at dbt Labs.
Are there other existing solutions that solve the problem, and if so, how is this solution better or different? If so please share any links here.
Not sure!