[Feature] Optimize incremental 'insert_overwrite' strategy #527

Open
3 tasks done
AxelThevenot opened this issue Nov 21, 2024 · 14 comments · May be fixed by dbt-labs/dbt-bigquery#1410 or dbt-labs/dbt-bigquery#1415
Labels
feature:incremental Issues related to incremental materializations pkg:dbt-bigquery Issue affects dbt-bigquery type:enhancement New feature request

Comments

@AxelThevenot

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt-bigquery functionality, rather than a Big Idea better suited to a discussion

Describe the feature

The MERGE statement is not cost-optimal in BigQuery.
That is fine for the merge strategy with unique keys, since row-level matching is exactly what we need there.

But for the insert_overwrite strategy, where we replace rows at the partition level, there is a better solution, and here is why:

  • A DELETE or INSERT statement is cheaper than a MERGE statement.
  • Incremental tables are typically the most expensive tables in real-world projects.
  • A DELETE statement in BigQuery is free when it targets whole partitions.

This has been tested at Carrefour, the company I work for.

  • On the replacement of the MERGE statement alone, it reduces the cost by 50.4% and the elapsed time by 35.2%.
  • On the overall incremental procedure, it reduces the cost by 26.1% and the elapsed time by 23.1%.

The DELETE and INSERT are wrapped in a transaction, so no rows are lost if an error occurs.
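
Roughly, the generated SQL I have in mind replaces the MERGE with something like the sketch below. This is illustrative only, not the exact SQL the macro would produce; the table and column names are made up, and the partition array mirrors how the current dynamic insert_overwrite already computes the partitions to replace:

declare dbt_partitions_to_replace array<date> default (
    select array_agg(distinct date(date_time))
    from `my_project.my_dataset.my_model__dbt_tmp`
);

begin transaction;

-- Drop the partitions that are about to be replaced. When the WHERE clause
-- only targets whole partitions and meets BigQuery's conditions, this DELETE
-- is a metadata-only (free) operation.
delete from `my_project.my_dataset.my_model`
where date(date_time) in unnest(dbt_partitions_to_replace);

-- Re-insert the freshly computed rows for those partitions.
insert into `my_project.my_dataset.my_model`
select * from `my_project.my_dataset.my_model__dbt_tmp`;

commit transaction;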

Describe alternatives you've considered

I have considered creating an additional delete+insert incremental strategy instead.
But overriding the existing insert_overwrite is more convenient, since partition-level replacement is exactly what that strategy is for and the features stay the same.

Who will this benefit?

Everyone using the insert_overwrite strategy, without any changes on their side.

Are you interested in contributing this feature?

Yes

Anything else?

No response

@philBatardiere

Hey @AxelThevenot, did you try copying partitions? https://docs.getdbt.com/reference/resource-configs/bigquery-configs#copying-partitions
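
For reference, that is enabled through the partition_by config; the sketch below is only an illustration and the field names are made up:

{{
    config(
        materialized="incremental",
        incremental_strategy="insert_overwrite",
        partition_by={
            "field": "date_time",
            "data_type": "datetime",
            "granularity": "day",
            "copy_partitions": true
        }
    )
}}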

@AxelThevenot
Author

Hi,

Yes, this is also something I've tried, but:

  • It runs sequentially rather than in parallel, so it is really slow compared to this method.
  • I prefer to have the entire view of what is executed in my BigQuery procedure, instead of having the temp table creation on one side and a BigQuery copy job on the other :)

I also thought about static partition definitions, but my use cases require a dynamic partition overwrite :)

@philBatardiere

Yes, agreed: if your incremental refreshes many partitions back, it can take time (I usually refresh around 3 partitions back, so the process isn't that long).

For debugging purposes, I have seen some benefit in having the tmp table, since it can be restored with the time travel feature. But it is clearly worse for auditing purposes; it would be nice to have labels related to the dbt run id in the copy config, by the way 😀.

I'll keep your feature in mind in case of future need.

Thanks!

@AxelThevenot
Author

For debugging purposes, I have seen some benefit in having the tmp table, since it can be restored with the time travel feature. But it is clearly worse for auditing purposes; it would be nice to have labels related to the dbt run id in the copy config, by the way 😀.

Oh yes, it would be nice to have those labels.

So you won't validate my pull request for now, from what I understand?

In any case, feel free to reach out to me if you want help or anything else:
https://www.linkedin.com/in/axel-thevenot/

@philBatardiere

Unfortunately I don't yet have the authority to do it :) But I hope someone will.

@AxelThevenot
Author

AxelThevenot commented Nov 25, 2024

OK, let's hope someone with that authority will go through this pull request/issue.

I can create an additional delete+insert strategy instead of overriding insert_overwrite, even if it makes less sense to me 👍

@amychen1776
Contributor

@AxelThevenot, I don't think I quite understand your justification for not creating a new delete+insert incremental strategy. Changing a commonly used incremental strategy can cause unintended results, especially since it seems like your intent is to remove the merge statement from the insert_overwrite strategy (or am I misunderstanding?)

@amychen1776 amychen1776 added feature:incremental Issues related to incremental materializations and removed triage:product In Product's queue labels Nov 26, 2024
@borjavb
Contributor

borjavb commented Nov 27, 2024

This is great stuff! But I think we should create a delete+insert operation instead of changing the already widely used insert_overwrite, as it could have unexpected consequences for all existing users (+1 to @amychen1776).

For example, the DELETE operation is only free under certain conditions stated in the documentation, so you won't always get the slot-free operation, unfortunately. I've seen cases where the cost of the DELETE + INSERT ends up higher in aggregate than the whole MERGE. It's rare, but it can happen, and rolling out this change in such a fundamental macro could backfire. At the end of the day, it is all about testing which strategy best suits the use case.

For example, tables with granular partitioning do not benefit from free DELETEs 😢:
[screenshot omitted]

Using delete+insert will definitely be a performance boost in most cases (especially with daily partitions), but users have to be aware of how to get the best out of it. Better to be clear through documentation and through new methods rather than changing the current one.

@AxelThevenot
Author

AxelThevenot commented Nov 27, 2024

@amychen1776 and @borjavb, I get your points. I will go for a delete+insert strategy if that is more convenient for you and avoids edge cases for specific users.

I will make the change as soon as possible and tell you when it's done :)

But delete+insert works at the unique-key level in other adapters, and I want it at the partition level.
Should I reuse that name even though it would not do the same thing, or should I give it a new name? (And if so, what name?)

@borjavb
Contributor

borjavb commented Nov 27, 2024

Been thinking about this and... how do we feel about an operator/function configuration within insert_overwrite?

The way I've hijacked the insert_overwrite strategy in the past to implement this was through the config block, and it has always worked really well. Check this draft PR for the gist behind the idea.

Technically we have an incremental strategy that can work with different underlying functions while keeping the same semantics of replacing a whole partition. So we can define an incremental_strategy plus the operator/function we want to use, either a merge or a delete+insert, and get the best of both worlds:

  1. If it's not defined, we default to merge, so we keep backwards compatibility
  2. If the user selects the new delete+insert underlying function, we swap the macro that works under the hood

We already have custom configurations for merge like merge_update_columns and merge_exclude_columns, so this doesn't break with how we do things with other incremental strategies.

So the config block would look something like this:

{{
    config(
        materialized="incremental",
        incremental_strategy="insert_overwrite",
        insert_overwrite_function='delete+insert',
        cluster_by="id",
        partition_by={
            "field": "date_time",
            "data_type": "datetime",
            "granularity": "day"
        },
        partitions=partitions_to_replace,
    )
}}

Although I'm torn on whether we should express incremental_strategy as a dict... given that we haven't done that for the other merge configurations, it now feels weird to add it as a dictionary? Not sure.

incremental_strategy={
   strategy = "insert_overwrite",
   operation = "delete+insert"
}

But that would just be syntactic sugar over how we expose the config. Either way, I think this lets the user decide which operator is most performant for them, while keeping full backwards compatibility.

What do you think, @amychen1776 and @AxelThevenot?

@AxelThevenot
Author

@borjavb this is a great idea! I will go for it 😄

@jtcohen6
Contributor

@AxelThevenot @borjavb Very cool to see the level of interest here :)

But delete+insert is at the unique-key-level in other adapters and I want a partition-level
Any idea of if I reuse this name even if this is not doing the same or if I give a new name ? (and what name?)

This is a totally fair point!

FWIW... The unique_key specified for other adapters does not actually have to be unique, and many users have a configuration like this to achieve the outcome you're after here:

{{ config(
   incremental_strategy = "insert_overwrite",
   unique_key = "my_partition_by_column_name",  -- not actually unique!
   ...
) }}

I know that nomenclature is misleading (Grace called it out a few months ago), but it would be consistent with how users have gotten this working on other DWHs/adapters. (I say that while also acknowledging that non-unique keys can lead to poor query performance; there's a suggested performance optimization to switch from delete ... using to delete ... where: #150 (comment), #151.)
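
For concreteness, the two DELETE shapes being compared there look roughly like this; this is generic SQL for warehouses whose DELETE supports a USING clause, and the table and column names are illustrative:

-- delete ... using: joins the target against the new (staged) rows
delete from analytics.my_model as tgt
using analytics.my_model__dbt_tmp as src
where tgt.my_partition_by_column_name = src.my_partition_by_column_name;

-- delete ... where: filters on the distinct keys found in the staged rows
delete from analytics.my_model
where my_partition_by_column_name in (
    select distinct my_partition_by_column_name
    from analytics.my_model__dbt_tmp
);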


To offer one other perspective:

  • We're rolling out a brand-new microbatch incremental strategy in dbt Core v1.9 (https://docs.getdbt.com/docs/build/incremental-microbatch)
  • This strategy has parallelism built in, with the capability within dbt for per-batch concurrency + retry
  • Our stated goal: "To perform this operation, dbt will use the most efficient atomic mechanism for "full batch" replacement that is available on each data platform." Right now, that is a merge with a constant-false predicate, the same mechanism insert_overwrite uses, but with the option to use the copy_partitions API instead (as Borja noted).
  • Given that microbatch is brand-new: should we make it use a delete + insert transaction right from the start? Or should we add this as another opt-in config for the insert_overwrite strategy (one that would also be implicitly available to the microbatch strategy)? It's probably too late in the game to be asking, because dbt Core v1.9 is planned to go live in early December, but I figured I'd ask anyway :)
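
For anyone who has not tried it yet, a minimal microbatch model config on BigQuery looks roughly like the sketch below, based on the linked docs; the field names and dates are illustrative:

{{
    config(
        materialized="incremental",
        incremental_strategy="microbatch",
        event_time="date_time",
        begin="2024-01-01",
        batch_size="day",
        partition_by={
            "field": "date_time",
            "data_type": "datetime",
            "granularity": "day"
        }
    )
}}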

@borjavb
Contributor

borjavb commented Nov 29, 2024

Hello @jtcohen6 ❤️ !
Oh, the microbatch logic is a great point and maybe I’ll diverge a bit, but I think it’s all related:

Given that microbatch is brand-new: should we make it use a delete + insert transaction right from the start? Or should we add this as another opt-in config for the insert_overwrite strategy (one that would also be implicitly available to the microbatch strategy)? It's probably too late in the game to be asking, because dbt Core v1.9 is planned to go live in early December, but I figured I'd ask anyway :)

Unfortunately, I don’t think we can. With the future requirement of microbatch potentially running in parallel, this delete+insert would lock us in a sequential run: we are wrapping the operations within a transaction, and transactions in BQ cannot run concurrently :_(. I mean, we could potentially remove the transaction block, but we would lose the atomicity of the delete+insert operation which sounds scary and would potentially open a can of worms. 

[screenshot omitted]

On the other hand, thinking about how microbatch will work for that future requirement: MERGE offers limited parallelism too. BigQuery only allows 2 concurrent MERGE operations over the same table; everything else goes into a queue of PENDING jobs (up to 20, after which they start failing).

So if we really want to offer pure parallelism, we probably need insert_overwrite to rely on the copy_partitions strategy.

And to add even more complexity to everything..., the delete + insert is not fully optimised (yet?) for hourly partitions (the delete is not free there). So it feels like there's no silver bullet for all cases 🤦 😢


Given all of the above, and considering this issue, I would go with the latter option you suggested: add the delete+insert logic within insert_overwrite as an opt-in operation (even for microbatch), so users have full control over the different options and trade-offs: more parallelism versus less parallelism but potentially cheaper queries.
Plus, this being opt-in, we don't break past behaviours (who knows what's being built out there assuming the merge does not lock the table...).

This feels like a proper v2 of your old post @jtcohen6 !

@mikealfare mikealfare added the pkg:dbt-bigquery Issue affects dbt-bigquery label Jan 14, 2025
@mikealfare mikealfare transferred this issue from dbt-labs/dbt-bigquery Jan 14, 2025
@borjavb
Contributor

borjavb commented Jan 16, 2025

So I think we should add a new incremental_substrategy as part of the insert_overwrite strategy.

In my experience, there is no silver-bullet strategy with the best performance in every scenario. merge, delete+insert and copy_partitions have different pros and cons, so we should let the user decide which one is the best option.

I've been running multiple queries over the different scenarios, and the matrix below is a fair comparison of all the options we have to insert/overwrite a full partition, for both static and dynamic partitions. I mainly focused on resources, parallelism and storage costs.

[image omitted: comparison matrix of merge, delete+insert and copy_partitions]

In terms of resources, MERGE is one of the worst options, whereas for static partitions DELETE+INSERT and copy_partitions are fairly equal and faster (if we're being picky, the INSERT can add a sub-second overhead versus copy_partitions, but it's roughly the same time the copy itself takes to run).

copy_partitions has the extra overhead of having to store the tmp data when using dynamic partitions, which can result in considerable storage costs over time and offset some of the slot/wall-time gains. Still, it is a big winner when using dynamic partitions.

And then there is parallelism: although dbt doesn't orchestrate it itself right now, users can run multiple dbt DAGs at the same time, so depending on the option that parallelism is more or less limited.

I've drafted a PR here (https://github.com/dbt-labs/dbt-bigquery/pull/1415/files), where incremental_substrategy can take three different operators, {"merge", "delete+insert", "copy_partitions"}, with merge as the default:

{{
    config(
        materialized="incremental",
        incremental_strategy="insert_overwrite",
        incremental_substrategy='delete+insert',
    )
}}

What do you think? @amychen1776, @AxelThevenot, @jtcohen6?

Note: I now have an open bug with Google about the free delete not working for hourly partitions; it's definitely unexpected behaviour, so eventually we should expect free deletes on any partition granularity.
