Audit Alongside Your Refactor #1306
patkearns10 announced in Archive
Link to a draft I wrote 6 months ago. Happy to start fresh with an outline if that's preferable!
Alternate longer titles:
Please answer the following questions to get the discussion started.
What is the main problem you are solving?
Refactoring shouldn't be scary, and testing shouldn't be a daunting task you avoid or save for the very end.
What is your solution? This should help form your core thesis.
What does this process look like? Instead of editing the code in line, you intentionally create a duplicate model file, which allows you to run both models in development at the same time and to refer back to the original code without navigating to other tools, branches, or points in history. After that, it's easy to compare the data outputs using the dbt audit_helper package, with the assurance that you're not seeing variances caused by differing environments or run times.
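For the comparison step, the audit_helper package has to be installed in the project first. A minimal sketch of the `packages.yml` entry (the version range shown is illustrative; check the package hub for the current one):

```yaml
# packages.yml
packages:
  - package: dbt-labs/audit_helper
    version: [">=0.9.0", "<1.0.0"]
```

Then run `dbt deps` to install it.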
Why should the reader care about this problem?
As a subject matter expert in my previous role, I was comfortable: I knew all the tables, fields, and business logic like the back of my hand. If I merged some code and broke something in today's pull request, I would fix it in the next one, no biggie. Maybe no one would even notice!
I was aware of the mental overhead: cognitively juggling the file I was working in, keeping track of my changes, and keeping tabs on whether they would cause issues down the road. I would edit 20 files, commit the changes, open a PR, and only then start running tests to determine whether the changes actually worked, by making sure my branch's output table matched the production version 100%.
This was all well and fine until I became a consultant working in others' projects, where I realized my process needed an upgrade.
Why is your solution the right one? This should help form your specific target audience.
It is one solution in a sea of possible solutions, but one that will let you sleep at night and instill trust in your stakeholders.
Can you list the steps of your solution for the reader here? This should help you form the overall narrative arc and sketch out an example use case to illustrate it.
1. Duplicate the model you're going to refactor. For example, duplicate `dim_orders` and name the duplicate `dim_orders__control` (this naming convention lets the two files show up next to each other in the file tree). The SQL code should be exactly the same in both files at this stage, but we'll be using the `dim_orders.sql` file to make our changes.
2. Add a new file in the analysis folder called `analysis/compare__template.sql` with the audit_helper SQL. The analysis folder allows you to store queries that use dbt functionality like Jinja, but doesn't actually build anything in your warehouse. This file will only be used while we're developing, so we can interactively check results. It's also optional: you could use a new statement tab instead, but if your refactor stretches over several days you may want to save the file temporarily.
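As a sketch of what that analysis file might contain, using audit_helper's `compare_relations` macro (the `order_id` primary key is an assumption for illustration):

```sql
-- analysis/compare__template.sql
-- Minimal sketch: compares the control model against the refactored model.
-- Assumes order_id uniquely identifies a row in dim_orders.
{{ audit_helper.compare_relations(
    a_relation=ref('dim_orders__control'),
    b_relation=ref('dim_orders'),
    primary_key='order_id'
) }}
```

By default the macro summarizes how many rows appear in both relations versus in only one, so a clean refactor shows a single row at 100%.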
3. Run a control test (`dbt run` and "Preview" `analysis/compare__template.sql`) before you make any edits! This is very important. You can initially run `dbt run -s +compare__template`. By adding the `+` selector, you're doing the same thing as `dbt run -s +dim_orders__control +dim_orders`. You want to run all parent models initially to ensure everything is similarly up to date. Then click "Preview" in `analysis/compare__template.sql` to confirm everything matches 100%. For any code that has window functions, this step verifies that the code is reproducible and brings any non-deterministic values to your attention. Non-deterministic ordering happens when the columns you declare in a window function's `order by` clause aren't specific enough for the database to assign an idempotent value. For example, say we have an event table and want to order events by how recently they happened. If we order by the date field (instead of the more specific timestamp field) and multiple events happen on the same date, then each time we run the model the order within each date partition will be assigned differently, without us changing anything!
4. Make your changes to the code in the `dim_orders` model.
5. Test your changes using the auditing code from `compare__template.sql`. After editing `dim_orders`, for all subsequent runs you can run `dbt run -s 1+compare__template`, which looks at the analysis model (`compare__template`) and finds its first-level upstream dependencies (`dim_orders__control` and `dim_orders`) to run, which are the two models you're testing. It's important to run both the control file and the newly refactored file at the same time to ensure parity between the two tables' contents.

Are there any resources that helped inspire or inform your idea?
https://discourse.getdbt.com/t/how-to-not-lose-your-mind-when-auditing-data/445
https://discourse.getdbt.com/t/how-to-not-lose-your-mind-when-auditing-data-part-ii/612
I actually learned of these after I wrote my rough draft; everything here comes from conversations with @christineberger and 6 months of auditing Webflow's DAGs.
I feel like the articles above didn't get enough traction; I wasn't aware they existed until I started at dbt Labs.
Are there other existing solutions that solve the problem, and if so, how is this solution better or different? If so please share any links here.
Not sure!