-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature] Optimize incremental 'insert_overwrite' strategy #527
[Feature] Optimize incremental 'insert_overwrite' strategy #527
Comments
Hi, yes this is also something i've tried but:
I also though about the static partition definition but uses cases requires to have a dynamic partition overwrite :) |
Yes agreed if your incremental refresh a lot of partitions back then it can takes time (usually i refresh around 3 partitions back so the process isn’t so long). For debuging purposes, I have seen some benefits having the tmp table by restoring it with the time travel feature. But its clearly worst for auditing purposes, it should be nice to have labels related to the dbt run id in the copy config btw 😀. I keep your feature in mind in case of future need. Thanks! |
Oh yes it could be nice to have those labels So you won't valide my pull request for now from what I understand ? In any case, feel free to reach me if you want some help or anything else |
Unfortunately I don't have yet the authority to do it :) But I hope someone will do it |
Ok, let's pray someone with this authority will go through this pull request/issue I can create an additional |
@AxelThevenot, I don't think I quite understand your justification for not creating a new |
This is great stuff! But I think we should create a For example, the For example granular partitioned tables do not benefit from free DELETEs 😢 : Using a |
@amychen1776 and @borjavb I got your points, I will go for a I will make the change as soon as possible and tell you when it's done :) But |
Been thinking about this and... how do we feel about an operator/function configuration within The way I've hijacked the Technically we have an incremental strategy that can work with different functions, but apply the same semantics of replacing a whole partition. So basically we can define an
We already have custom configurations for merge like So the config block would look like something like this:
Although I'm torn if we should add the
But this would just be some syntax sugar of how we want to expose the config. But I think with this we can let the user decide which operator is the most performant, and keep full backwards compatibility. What do you think, @amychen1776 and @AxelThevenot? |
@borjavb this is a great idea ! I will go for it 😄 |
@AxelThevenot @borjavb Very cool to see the level of interest here :)
This is a totally fair point! FWIW... The {{ config(
incremental_strategy = "insert_overwrite",
unique_key = "my_partition_by_column_name", -- not actually unique!
...
) }} I know that nomenclature is misleading (Grace called it out a few months ago), but it would be consistent with how users have gotten this working on other DWHs/adapters. (I say that, at the same time acknowledging that non-unique keys can lead to poor query performance; there's a suggested performance optimization to switch from To offer one other perspective:
|
Hello @jtcohen6 ❤️ !
Unfortunately, I don’t think we can. With the future requirement of On the other hand, thinking of how So if we really want to offer pure parallelism, we probably need the And to add even more complexity to everything..., the Given all of the above, and considering this issue, I would go with the latter option you suggested: add This feels like a proper v2 of your old post @jtcohen6 ! |
So I think we should add a new From my experience, there’s no silver I've been running multiple queries over the different scenarios, and the next matrix is a fair comparison among all the options we've got to insert/overwrite a full partition, for both static and dynamic partitions. I mainly focused on resources, parallelism and storage costs.
The copy partitions has the extra overhead of having to store the tmp data if using dynamic partitions, which can result in considerable storage costs over time, which can cast some shade on any slot/wall time performance. Still, a big winner if using dynamic partitions. And then the parallelism, although dbt doesn't support it right now, users can run multiple dbt dags at the same time, so depending on the option, that parallelism is limited. I've draft a PR here (https://github.com/dbt-labs/dbt-bigquery/pull/1415/files), so
What do you think? @amychen1776, @AxelThevenot, @jtcohen6? note: got an open bug with google now about the delete not working with hourly partitions, it's definitely unexpected behaviour, so we should expect free deletes on any partition. |
Is this your first time submitting a feature request?
Describe the feature
The
MERGE
statement is sub-optimized in BigQuery.This is of course ok for unique keys as this is what we are looking for.
But for the
insert_overwrite
strategy where we are looking to rows at the partition-level there is a better solution and here is why:DELETE
orINSERT
statement is cheapest than aMERGE
statement.DELETE
statement in BigQuery is free at the partition-level.MERGE
statement it reduces the cost by 50.4% and the elapsed time by 35.2%Describe alternatives you've considered
I have considered an alternative to create an additional
delete+insert
incremental strategy.But overriding the existing
insert_overwrite
is the more convenient as this is exactly what we are looking for and the features are the same.Who will this benefit?
Of course, everyone without any changes.
Are you interested in contributing this feature?
Yes
Anything else?
No response
The text was updated successfully, but these errors were encountered: