[Feature] Optimize incremental 'insert_overwrite' strategy #527

Open
3 tasks done
AxelThevenot opened this issue Nov 21, 2024 · 14 comments · May be fixed by dbt-labs/dbt-bigquery#1410 or dbt-labs/dbt-bigquery#1415
Labels
feature:incremental Issues related to incremental materializations pkg:dbt-bigquery Issue affects dbt-bigquery type:enhancement New feature request

Comments

@AxelThevenot

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt-bigquery functionality, rather than a Big Idea better suited to a discussion

Describe the feature

The MERGE statement is not cost-optimal in BigQuery.
That is fine for the merge strategy with unique keys, since row-level matching is exactly what we need there.

But for the insert_overwrite strategy, where we replace rows at the partition level, there is a better solution, and here is why:

  • A DELETE or INSERT statement is cheaper than a MERGE statement.
  • Incremental tables are typically the most expensive tables in real-world projects.
  • A DELETE statement in BigQuery is free when it targets whole partitions.

This has been tested at Carrefour, the company I work for.

  • On the replacement of the MERGE statement alone, it reduces the cost by 50.4% and the elapsed time by 35.2%.
  • On the overall incremental procedure, it reduces the cost by 26.1% and the elapsed time by 23.1%.

The DELETE and INSERT are wrapped in a transaction, so no rows are lost if an error occurs.
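
Roughly, the generated SQL I have in mind replaces the MERGE with something like the sketch below. This is illustrative only, not the exact SQL the macro would produce; the table and column names are made up, and the partition array mirrors how the current dynamic insert_overwrite already computes the partitions to replace:

declare dbt_partitions_to_replace array<date> default (
    select array_agg(distinct date(date_time))
    from `my_project.my_dataset.my_model__dbt_tmp`
);

begin transaction;

-- Drop the partitions that are about to be replaced. When the WHERE clause
-- only targets whole partitions and meets BigQuery's conditions, this DELETE
-- is a metadata-only (free) operation.
delete from `my_project.my_dataset.my_model`
where date(date_time) in unnest(dbt_partitions_to_replace);

-- Re-insert the freshly computed rows for those partitions.
insert into `my_project.my_dataset.my_model`
select * from `my_project.my_dataset.my_model__dbt_tmp`;

commit transaction;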

Describe alternatives you've considered

I have considered creating an additional delete+insert incremental strategy instead.
But overriding the existing insert_overwrite is more convenient, since partition-level replacement is exactly what that strategy is for and the features stay the same.

Who will this benefit?

Everyone using the insert_overwrite strategy, without any changes on their side.

Are you interested in contributing this feature?

Yes

Anything else?

No response

@philBatardiere

Hey @AxelThevenot, did you try copying partitions? https://docs.getdbt.com/reference/resource-configs/bigquery-configs#copying-partitions
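
For reference, that is enabled through the partition_by config; the sketch below is only an illustration and the field names are made up:

{{
    config(
        materialized="incremental",
        incremental_strategy="insert_overwrite",
        partition_by={
            "field": "date_time",
            "data_type": "datetime",
            "granularity": "day",
            "copy_partitions": true
        }
    )
}}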

@AxelThevenot
Author

Hi,

Yes, this is also something I've tried, but:

  • It runs sequentially rather than in parallel, so it is really slow compared to this method.
  • I prefer to have the entire view of what is executed in my BigQuery procedure, instead of having the temp table creation on one side and a BigQuery copy job on the other :)

I also thought about static partition definitions, but my use cases require a dynamic partition overwrite :)

@philBatardiere

Yes, agreed: if your incremental refreshes many partitions back, it can take time (I usually refresh around 3 partitions back, so the process isn't that long).

For debugging purposes, I have seen some benefit in having the tmp table, since it can be restored with the time travel feature. But it is clearly worse for auditing purposes; it would be nice to have labels related to the dbt run id in the copy config, by the way 😀.

I'll keep your feature in mind in case of future need.

Thanks!

@AxelThevenot
Author

For debugging purposes, I have seen some benefit in having the tmp table, since it can be restored with the time travel feature. But it is clearly worse for auditing purposes; it would be nice to have labels related to the dbt run id in the copy config, by the way 😀.

Oh yes, it would be nice to have those labels.

So you won't validate my pull request for now, from what I understand?

In any case, feel free to reach out to me if you want help or anything else:
https://www.linkedin.com/in/axel-thevenot/

@philBatardiere

Unfortunately I don't yet have the authority to do it :) But I hope someone will.

@AxelThevenot
Author

AxelThevenot commented Nov 25, 2024

OK, let's hope someone with that authority will go through this pull request/issue.

I can create an additional delete+insert strategy instead of overriding insert_overwrite, even if it makes less sense to me 👍

@amychen1776
Contributor

@AxelThevenot, I don't think I quite understand your justification for not creating a new delete+insert incremental strategy. Changing a commonly used incremental strategy can cause unintended results, especially since it seems like your intent is to remove the merge statement from the insert_overwrite strategy (or am I misunderstanding?)

@amychen1776 amychen1776 added feature:incremental Issues related to incremental materializations and removed triage:product In Product's queue labels Nov 26, 2024
@borjavb
Contributor

borjavb commented Nov 27, 2024

This is great stuff! But I think we should create a delete+insert operation instead of changing the already widely used insert_overwrite, as it could have unexpected consequences for all existing users (+1 to @amychen1776).

For example, the DELETE operation is only free under certain conditions stated in the documentation, so you won't always get the slot-free operation, unfortunately. I've seen cases where the cost of the DELETE + INSERT ends up higher in aggregate than the whole MERGE. It's rare, but it can happen, and rolling out this change in such a fundamental macro could backfire. At the end of the day, it is all about testing which strategy best suits the use case.

For example, tables with granular partitioning do not benefit from free DELETEs 😢:
[screenshot omitted]

Using delete+insert will definitely be a performance boost in most cases (especially with daily partitions), but users have to be aware of how to get the best out of it. Better to be clear through documentation and through new methods rather than changing the current one.

@AxelThevenot
Author

AxelThevenot commented Nov 27, 2024

@amychen1776 and @borjavb, I get your points. I will go for a delete+insert strategy if that is more convenient for you and avoids edge cases for specific users.

I will make the change as soon as possible and tell you when it's done :)

But delete+insert works at the unique-key level in other adapters, and I want it at the partition level.
Should I reuse that name even though it would not do the same thing, or should I give it a new name? (And if so, what name?)

@borjavb
Contributor

borjavb commented Nov 27, 2024

Been thinking about this and... how do we feel about an operator/function configuration within insert_overwrite?

The way I've hijacked the insert_overwrite strategy in the past to implement this was through the config block, and it has always worked really well. Check this draft PR for the gist behind the idea.

Technically we have an incremental strategy that can work with different underlying functions while keeping the same semantics of replacing a whole partition. So we can define an incremental_strategy plus the operator/function we want to use, either a merge or a delete+insert, and get the best of both worlds:

  1. If it's not defined, we default to merge, so we keep backwards compatibility
  2. If the user selects the new delete+insert underlying function, we swap the macro that works under the hood

We already have custom configurations for merge like merge_update_columns and merge_exclude_columns, so this doesn't break with how we do things with other incremental strategies.

So the config block would look something like this:

{{
    config(
        materialized="incremental",
        incremental_strategy="insert_overwrite",
        insert_overwrite_function='delete+insert',
        cluster_by="id",
        partition_by={
            "field": "date_time",
            "data_type": "datetime",
            "granularity": "day"
        },
        partitions=partitions_to_replace,
    )
}}

Although I'm torn on whether we should express incremental_strategy as a dict... given that we haven't done that for the other merge configurations, it now feels weird to add it as a dictionary? Not sure.

incremental_strategy={
   strategy = "insert_overwrite",
   operation = "delete+insert"
}

But that would just be syntactic sugar over how we expose the config. Either way, I think this lets the user decide which operator is most performant for them, while keeping full backwards compatibility.

What do you think, @amychen1776 and @AxelThevenot?

@AxelThevenot
Author

@borjavb this is a great idea! I will go for it 😄

@jtcohen6
Contributor

@AxelThevenot @borjavb Very cool to see the level of interest here :)

But delete+insert is at the unique-key-level in other adapters and I want a partition-level
Any idea of if I reuse this name even if this is not doing the same or if I give a new name ? (and what name?)

This is a totally fair point!

FWIW... The unique_key specified for other adapters does not actually have to be unique, and many users have a configuration like this to achieve the outcome you're after here:

{{ config(
   incremental_strategy = "insert_overwrite",
   unique_key = "my_partition_by_column_name",  -- not actually unique!
   ...
) }}

I know that nomenclature is misleading (Grace called it out a few months ago), but it would be consistent with how users have gotten this working on other DWHs/adapters. (I say that while also acknowledging that non-unique keys can lead to poor query performance; there's a suggested performance optimization to switch from delete ... using to delete ... where: #150 (comment), #151.)
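
For concreteness, the two DELETE shapes being compared there look roughly like this; this is generic SQL for warehouses whose DELETE supports a USING clause, and the table and column names are illustrative:

-- delete ... using: joins the target against the new (staged) rows
delete from analytics.my_model as tgt
using analytics.my_model__dbt_tmp as src
where tgt.my_partition_by_column_name = src.my_partition_by_column_name;

-- delete ... where: filters on the distinct keys found in the staged rows
delete from analytics.my_model
where my_partition_by_column_name in (
    select distinct my_partition_by_column_name
    from analytics.my_model__dbt_tmp
);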


To offer one other perspective:

  • We're rolling out a brand-new microbatch incremental strategy in dbt Core v1.9 (https://docs.getdbt.com/docs/build/incremental-microbatch)
  • This strategy has parallelism built in, with the capability within dbt for per-batch concurrency + retry
  • Our stated goal: "To perform this operation, dbt will use the most efficient atomic mechanism for "full batch" replacement that is available on each data platform." Right now, that is a merge with a constant-false predicate, the same mechanism insert_overwrite uses, but with the option to use the copy_partitions API instead (as Borja noted).
  • Given that microbatch is brand-new: should we make it use a delete + insert transaction right from the start? Or should we add this as another opt-in config for the insert_overwrite strategy (one that would also be implicitly available to the microbatch strategy)? It's probably too late in the game to be asking, because dbt Core v1.9 is planned to go live in early December, but I figured I'd ask anyway :)
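
For anyone who has not tried it yet, a minimal microbatch model config on BigQuery looks roughly like the sketch below, based on the linked docs; the field names and dates are illustrative:

{{
    config(
        materialized="incremental",
        incremental_strategy="microbatch",
        event_time="date_time",
        begin="2024-01-01",
        batch_size="day",
        partition_by={
            "field": "date_time",
            "data_type": "datetime",
            "granularity": "day"
        }
    )
}}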

@borjavb
Contributor

borjavb commented Nov 29, 2024

Hello @jtcohen6 ❤️ !
Oh, the microbatch logic is a great point and maybe I’ll diverge a bit, but I think it’s all related:

Given that microbatch is brand-new: should we make it use a delete + insert transaction right from the start? Or should we add this as another opt-in config for the insert_overwrite strategy (one that would also be implicitly available to the microbatch strategy)? It's probably too late in the game to be asking, because dbt Core v1.9 is planned to go live in early December, but I figured I'd ask anyway :)

Unfortunately, I don’t think we can. With the future requirement of microbatch potentially running in parallel, this delete+insert would lock us in a sequential run: we are wrapping the operations within a transaction, and transactions in BQ cannot run concurrently :_(. I mean, we could potentially remove the transaction block, but we would lose the atomicity of the delete+insert operation which sounds scary and would potentially open a can of worms. 

[screenshot omitted]

On the other hand, thinking about how microbatch will work for that future requirement: MERGE offers limited parallelism too. BigQuery only allows 2 concurrent MERGE operations over the same table; everything else goes into a queue of PENDING jobs (up to 20, after which they start failing).

So if we really want to offer pure parallelism, we probably need insert_overwrite to rely on the copy_partitions strategy.

And to add even more complexity to everything..., the delete + insert is not fully optimised (yet?) for hourly partitions (the delete is not free there). So it feels like there's no silver bullet for all cases 🤦 😢


Given all of the above, and considering this issue, I would go with the latter option you suggested: add the delete+insert logic within insert_overwrite as an opt-in operation (even for microbatch), so users have full control over the different options and trade-offs: more parallelism versus less parallelism but potentially cheaper queries.
Plus, this being opt-in, we don't break past behaviours (who knows what's being built out there assuming the merge does not lock the table...).

This feels like a proper v2 of your old post @jtcohen6 !

@mikealfare mikealfare added the pkg:dbt-bigquery Issue affects dbt-bigquery label Jan 14, 2025
@mikealfare mikealfare transferred this issue from dbt-labs/dbt-bigquery Jan 14, 2025
@borjavb
Contributor

borjavb commented Jan 16, 2025

So I think we should add a new incremental_substrategy as part of the insert_overwrite strategy.

In my experience, there is no silver-bullet strategy with the best performance in every scenario. merge, delete+insert and copy_partitions have different pros and cons, so we should let the user decide which one is the best option.

I've been running multiple queries over the different scenarios, and the matrix below is a fair comparison of all the options we have to insert/overwrite a full partition, for both static and dynamic partitions. I mainly focused on resources, parallelism and storage costs.

[image omitted: comparison matrix of merge, delete+insert and copy_partitions]

In terms of resources, MERGE is one of the worst options, whereas for static partitions DELETE+INSERT and copy_partitions are fairly equal and faster (if we're being picky, the INSERT can add a sub-second overhead versus copy_partitions, but it's roughly the same time the copy itself takes to run).

copy_partitions has the extra overhead of having to store the tmp data when using dynamic partitions, which can result in considerable storage costs over time and offset some of the slot/wall-time gains. Still, it is a big winner when using dynamic partitions.

And then there is parallelism: although dbt doesn't orchestrate it itself right now, users can run multiple dbt DAGs at the same time, so depending on the option that parallelism is more or less limited.

I've drafted a PR here (https://github.com/dbt-labs/dbt-bigquery/pull/1415/files), where incremental_substrategy can take three different operators, {"merge", "delete+insert", "copy_partitions"}, with merge as the default:

{{
    config(
        materialized="incremental",
        incremental_strategy="insert_overwrite",
        incremental_substrategy='delete+insert',
    )
}}

What do you think? @amychen1776, @AxelThevenot, @jtcohen6?

Note: I now have an open bug with Google about the free delete not working for hourly partitions; it's definitely unexpected behaviour, so eventually we should expect free deletes on any partition granularity.
