You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Certain dbt models that are combinations of views that have not themselves been materialized cause a lot of strain on the query planner and result in many stages getting made. The stage depth in starburst is currently set to 150, which means a developer or user could quickly run into this problem if they are joining or selecting from a composite data product.
Some type of mechanism to identify nodes that create large explosions in the query plan would be helpful for developers as a heuristic on when to think about materialization.
How this could be implemented:
Traverse the DBT query plan, and take the rendered queries and run EXPLAIN (TYPE DISTRIBUTED, FORMAT JSON) {query}
Use this to build a graph of stages, find inflection points of stage accumulation or when near limit.
The reason I add this here is that this is an issue specific using the features of dbt with starburst. I don't know if it can be solved in this package however.
Describe alternatives you've considered
There is a simple way to estimate model stage complexity, but it is a very bad estimate because it assumes all models are equally dense.
here is an excerpt from my justfile.
# list number of model dependencies for each model in a folder
estimate-model-complexity directory:
#!/usr/bin/env bash
declare -A lines
for file in "{{directory}}"/*.sql; do
filename=$(basename "$file" .sql)
output=$({{dbt}} list -m +$filename)
line_count=$(echo "$output" | wc -l)
line_count=$((line_count - 4))
lines["$filename"]=$line_count
done
for key in "${!lines[@]}"; do
echo $key: ${lines[$key]}
done | sort -rn -t: -k2
Who will benefit?
Anyone trying to leverage the modular nature of DBT in starburst views who is trying to figure out the best views to materialize.
Are you willing to submit PR?
Yes I am willing to submit a PR!
The text was updated successfully, but these errors were encountered:
There are some assumptions that it makes about the plan output that might not hold true in practice (the bit about "virtual" stages, for one), but it's been helpful for us in directionally identifying nodes in the dbt graph that contribute to sudden explosions in complexity.
If we could get the direct number of stages from the query planner, and/or more insight into how the plan is generated that would be helpful.
I'm also trying to think of ways to show this to users easier, not sure what makes the most sense. For now, a rendered graph seems to work. We are thinking about embedding this in CI tooling or potentially running this periodically to help as a diagnostic tool.
Describe the feature
Certain dbt models that are combinations of views that have not themselves been materialized cause a lot of strain on the query planner and result in many stages getting made. The stage depth in starburst is currently set to 150, which means a developer or user could quickly run into this problem if they are joining or selecting from a composite data product.
Some type of mechanism to identify nodes that create large explosions in the query plan would be helpful for developers as a heuristic on when to think about materialization.
How this could be implemented:
The reason I add this here is that this is an issue specific using the features of dbt with starburst. I don't know if it can be solved in this package however.
Describe alternatives you've considered
There is a simple way to estimate model stage complexity, but it is a very bad estimate because it assumes all models are equally dense.
here is an excerpt from my justfile.
Who will benefit?
Anyone trying to leverage the modular nature of DBT in starburst views who is trying to figure out the best views to materialize.
Are you willing to submit PR?
The text was updated successfully, but these errors were encountered: