-
Notifications
You must be signed in to change notification settings - Fork 1.5k
fix: stack overflow for substrait functions with large argument lists that translate to DataFusion binary operators #16031
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
According to the contributor guide, mentioning @alamb to trigger the CI because you were the most frequent reviewer on the substrait topic recently (sorry if the direct ping is too noisy -- tell me if I should proceed otherwise). Also pinging @gabotechs as the latest committer on this file. |
Is it possible to add the test that generate equivalent json? |
datafusion/substrait/src/logical_plan/consumer/expr/scalar_function.rs
Outdated
Show resolved
Hide resolved
@jayzhan211 sure, I can even try to build the substrait plan programatically. This will be more verbose, but more explicit. I'll keep you posted! |
@jayzhan211 After looking into it, generating the plan programatically may become very unwieldy (very verbose). Is your concern about the size of the file checked into git, or the fact that it's hard to read because it's so large? |
My idea is that we can generate such parameters for the code we want to test, we don't need end to end test from json. I take a quick look, if we enter this function, I think it is possible to generate a deep nested async fn consume_expression(
&self,
expr: &Expression,
input_schema: &DFSchema,
) -> datafusion::common::Result<Expr> {
from_substrait_rex(self, expr, input_schema).await
} If that doesn't work, I think we don't even need to keep such test since the change to me is quite trivial (pulling function out is common to fix the overflow issue). |
Oh right it makes sense, I'll try to produce a test based on |
@jayzhan211 and @gabotechs thank you for your guidance, I updated the PR:
Ready for a next round of feedback if any! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯 very smart usage of std::vec::Drain
+1
I don't think it's strictly necessary, but leaving here some further testing ideas that I explored on my own:
not strictly necessary unit tests
#[cfg(test)]
mod tests {
use super::*;
use insta::assert_snapshot;
fn n(n: i64) -> Expr {
Expr::Literal(ScalarValue::Int64(Some(n)))
}
#[test]
fn arg_list_to_binary_op_tree_1_arg() {
let expr = arg_list_to_binary_op_tree(Operator::Or, vec![n(1)]);
assert_snapshot!(expr.to_string(), @"Int64(1)");
}
#[test]
fn arg_list_to_binary_op_tree_2_args() {
let expr = arg_list_to_binary_op_tree(Operator::Or, vec![n(1), n(2)]);
assert_snapshot!(expr.to_string(), @"Int64(1) OR Int64(2)");
}
#[test]
fn arg_list_to_binary_op_tree_3_args() {
let expr = arg_list_to_binary_op_tree(Operator::Or, vec![n(1), n(2), n(3)]);
assert_snapshot!(expr.to_string(), @"Int64(1) OR Int64(2) OR Int64(3)");
}
#[test]
fn arg_list_to_binary_op_tree_4_args() {
let expr = arg_list_to_binary_op_tree(Operator::Or, vec![n(1), n(2), n(3), n(4)]);
assert_snapshot!(expr.to_string(), @"Int64(1) OR Int64(2) OR Int64(3) OR Int64(4)");
}
}
datafusion/substrait/src/logical_plan/consumer/expr/scalar_function.rs
Outdated
Show resolved
Hide resolved
datafusion/substrait/src/logical_plan/consumer/expr/scalar_function.rs
Outdated
Show resolved
Hide resolved
datafusion/substrait/src/logical_plan/consumer/expr/scalar_function.rs
Outdated
Show resolved
Hide resolved
datafusion/substrait/src/logical_plan/consumer/expr/scalar_function.rs
Outdated
Show resolved
Hide resolved
datafusion/substrait/src/logical_plan/consumer/expr/scalar_function.rs
Outdated
Show resolved
Hide resolved
When transforming a substrait function call to DataFusion logical plan, if the substrait function maps to a DataFusion binary operator, but has more than 2 arguments, it is mapped to a tree of BinaryExpr. This BinaryExpr tree is not balanced, and its depth is the number of arguments: Op / \ arg1 Op / \ arg2 ... / \ argN Op Since many functions manipulating the logical plan are recursive, it means that N arguments result in an O(N) recursion, leading to stack overflows for large N (1000 for example). Transforming these function calls into a balanced tree mitigates the issue: .__ Op __. / \ Op Op / \ / \ ... ... ... ... / \ / \ / \ / \ arg1 ... argN The recursion depth is now O(log2(N)), meaning that 1000 arguments results in a depth of ~10, and it would take 2^1000 arguments to reach a depth of 1000, which is a vastly unreasonable amount of data. Therefore, it's not possible to use this flaw anymore to trigger stack overflows in processes running DataFusion.
29942f6
to
b693229
Compare
Thanks for the second round of feedback! This is now ready for a re-review. |
Courtesy of @gabotechs
b693229
to
1fc8552
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @fmonjalet I think it is LGTM to be now, love the tests
datafusion/substrait/src/logical_plan/consumer/expr/scalar_function.rs
Outdated
Show resolved
Hide resolved
I plan to merge this PR to main once the CI has passed |
Thanks again everyone |
Thanks everyone for the review and the merge! |
… that translate to DataFusion binary operators (apache#16031) * Add substrait consumer test causing a stack overflow * Mitigate stack overflow for substrait binary op with large arg list When transforming a substrait function call to DataFusion logical plan, if the substrait function maps to a DataFusion binary operator, but has more than 2 arguments, it is mapped to a tree of BinaryExpr. This BinaryExpr tree is not balanced, and its depth is the number of arguments: Op / \ arg1 Op / \ arg2 ... / \ argN Op Since many functions manipulating the logical plan are recursive, it means that N arguments result in an O(N) recursion, leading to stack overflows for large N (1000 for example). Transforming these function calls into a balanced tree mitigates the issue: .__ Op __. / \ Op Op / \ / \ ... ... ... ... / \ / \ / \ / \ arg1 ... argN The recursion depth is now O(log2(N)), meaning that 1000 arguments results in a depth of ~10, and it would take 2^1000 arguments to reach a depth of 1000, which is a vastly unreasonable amount of data. Therefore, it's not possible to use this flaw anymore to trigger stack overflows in processes running DataFusion. * arg_list_to_binary_op_tree: avoid cloning Expr * cargo fmt * from_scalar_function: improve error handling * Move test_binary_op_large_argument_list test to scalar_function module * arg_list_to_binary_op_tree: add more unit tests Courtesy of @gabotechs * substrait consumer scalar_function tests: more explicit function name --------- Co-authored-by: Andrew Lamb <[email protected]> (cherry picked from commit 2eaac22)
Which issue does this PR close?
Rationale for this change
Some substrait plans can cause DataFusion to trigger stack overflow, crashing the hosting process and preventing to run them. This stack overflow is caused by very large recursion levels. See issue for more details.
This kind of issue may also occur in the SQL --> Logical plan path, but I did not investigate it yet. The "ideal" fix would be not to rely on recursion for most operations on plans, but this sounds like an unreasonably large and complex code change for now.
This is a fairly small change, the high diff count is because of the added sample substrait plan, that has to be large to reproduce the issue.==> The PR has been updated to avoid the large test file.
What changes are included in this PR?
When transforming a substrait function call to DataFusion logical plan,
if the substrait function maps to a DataFusion binary operator, but has
more than 2 arguments, it is mapped to a tree of BinaryExpr. This
BinaryExpr tree is not balanced, and its depth is the number of
arguments:
Since many functions manipulating the logical plan are recursive, it
means that N arguments result in an
O(N)
recursion, leading to stackoverflows for large N (1000 for example).
Transforming these function calls into a balanced tree mitigates the
issue:
The recursion depth is now
O(log2(N))
, meaning that 1000 argumentsresults in a depth of ~10, and it would take 2^1000 arguments to reach a
depth of 1000, which is a vastly unreasonable amount of data.
Therefore, it's not possible to use this flaw anymore to trigger stack
overflows in processes running DataFusion.
The new balanced tree preserves the order of execution (
arg1
evaluatedbefore
arg2
) and has the same order of magnitude of operator nodes.The main change is that there is now at least
O(log2(N))
operators evaluated,whereas before the minimum was 1, if the first operator exited after evaluating
only its left side.
Are these changes tested?
Yes
Are there any user-facing changes?
No, the change to the logical plan is semantically equivalent and should not be exposed to users.