Skip to content

fix: stack overflow for substrait functions with large argument lists that translate to DataFusion binary operators #16031

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
May 19, 2025

Conversation

fmonjalet
Copy link
Contributor

@fmonjalet fmonjalet commented May 12, 2025

Which issue does this PR close?

Rationale for this change

Some substrait plans can cause DataFusion to trigger stack overflow, crashing the hosting process and preventing to run them. This stack overflow is caused by very large recursion levels. See issue for more details.

This kind of issue may also occur in the SQL --> Logical plan path, but I did not investigate it yet. The "ideal" fix would be not to rely on recursion for most operations on plans, but this sounds like an unreasonably large and complex code change for now.

This is a fairly small change, the high diff count is because of the added sample substrait plan, that has to be large to reproduce the issue.
==> The PR has been updated to avoid the large test file.

What changes are included in this PR?

When transforming a substrait function call to DataFusion logical plan,
if the substrait function maps to a DataFusion binary operator, but has
more than 2 arguments, it is mapped to a tree of BinaryExpr. This
BinaryExpr tree is not balanced, and its depth is the number of
arguments:

       Op
      /  \
    arg1  Op
         /  \
       arg2  ...
             /  \
           argN  Op

Since many functions manipulating the logical plan are recursive, it
means that N arguments result in an O(N) recursion, leading to stack
overflows for large N (1000 for example).

Transforming these function calls into a balanced tree mitigates the
issue:

             .__ Op __.
            /          \
          Op            Op
         /  \          /  \
       ...  ...      ...  ...
      /  \  /  \    /  \  /  \
    arg1        ...          argN

The recursion depth is now O(log2(N)), meaning that 1000 arguments
results in a depth of ~10, and it would take 2^1000 arguments to reach a
depth of 1000, which is a vastly unreasonable amount of data.

Therefore, it's not possible to use this flaw anymore to trigger stack
overflows in processes running DataFusion.

The new balanced tree preserves the order of execution (arg1 evaluated
before arg2) and has the same order of magnitude of operator nodes.
The main change is that there is now at least O(log2(N)) operators evaluated,
whereas before the minimum was 1, if the first operator exited after evaluating
only its left side.

Are these changes tested?

Yes

Are there any user-facing changes?

No, the change to the logical plan is semantically equivalent and should not be exposed to users.

@github-actions github-actions bot added the substrait Changes to the substrait crate label May 12, 2025
@fmonjalet
Copy link
Contributor Author

According to the contributor guide, mentioning @alamb to trigger the CI because you were the most frequent reviewer on the substrait topic recently (sorry if the direct ping is too noisy -- tell me if I should proceed otherwise).

Also pinging @gabotechs as the latest committer on this file.

@jayzhan211
Copy link
Contributor

Is it possible to add the test that generate equivalent json?

@fmonjalet
Copy link
Contributor Author

fmonjalet commented May 12, 2025

Is it possible to add the test that generate equivalent json?

@jayzhan211 sure, I can even try to build the substrait plan programatically. This will be more verbose, but more explicit. I'll keep you posted!

@fmonjalet
Copy link
Contributor Author

fmonjalet commented May 13, 2025

@jayzhan211 After looking into it, generating the plan programatically may become very unwieldy (very verbose). Is your concern about the size of the file checked into git, or the fact that it's hard to read because it's so large?
If it's the former, I can put in two files and put together some string formatting. If it's the former, this won't help as much, but will still be more readable for humans.

@jayzhan211
Copy link
Contributor

jayzhan211 commented May 13, 2025

@jayzhan211 After looking into it, generating the plan programatically may become very unwieldy (very verbose). Is your concern about the size of the file checked into git, or the fact that it's hard to read because it's so large? If it's the former, I can put in two files and put together some string formatting. If it's the former, this won't help as much, but will still be more readable for humans.

My idea is that we can generate such parameters for the code we want to test, we don't need end to end test from json.

I take a quick look, if we enter this function, I think it is possible to generate a deep nested Expression that easily cause stack overflow.

    async fn consume_expression(
        &self,
        expr: &Expression,
        input_schema: &DFSchema,
    ) -> datafusion::common::Result<Expr> {
        from_substrait_rex(self, expr, input_schema).await
    }

If that doesn't work, I think we don't even need to keep such test since the change to me is quite trivial (pulling function out is common to fix the overflow issue).

@fmonjalet
Copy link
Contributor Author

Oh right it makes sense, I'll try to produce a test based on consume_expression then. Thanks!

@fmonjalet
Copy link
Contributor Author

@jayzhan211 and @gabotechs thank you for your guidance, I updated the PR:

  • Expressions are not cloned anymore.
  • The test has been scoped to a much more unit test style, where we build a substrait ScalarFunction call and translate it to a DataFusion expression. I manually checked that reverting the fix made the test fail. It requires a larger argument list to trigger the issue because the expressions are simpler than in the original complex plan I submitted.

Ready for a next round of feedback if any!

Copy link
Contributor

@gabotechs gabotechs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯 very smart usage of std::vec::Drain +1

I don't think it's strictly necessary, but leaving here some further testing ideas that I explored on my own:

not strictly necessary unit tests
#[cfg(test)]
mod tests {
    use super::*;
    use insta::assert_snapshot;

    fn n(n: i64) -> Expr {
        Expr::Literal(ScalarValue::Int64(Some(n)))
    }

    #[test]
    fn arg_list_to_binary_op_tree_1_arg() {
        let expr = arg_list_to_binary_op_tree(Operator::Or, vec![n(1)]);
        assert_snapshot!(expr.to_string(), @"Int64(1)");
    }

    #[test]
    fn arg_list_to_binary_op_tree_2_args() {
        let expr = arg_list_to_binary_op_tree(Operator::Or, vec![n(1), n(2)]);
        assert_snapshot!(expr.to_string(), @"Int64(1) OR Int64(2)");
    }

    #[test]
    fn arg_list_to_binary_op_tree_3_args() {
        let expr = arg_list_to_binary_op_tree(Operator::Or, vec![n(1), n(2), n(3)]);
        assert_snapshot!(expr.to_string(), @"Int64(1) OR Int64(2) OR Int64(3)");
    }

    #[test]
    fn arg_list_to_binary_op_tree_4_args() {
        let expr = arg_list_to_binary_op_tree(Operator::Or, vec![n(1), n(2), n(3), n(4)]);
        assert_snapshot!(expr.to_string(), @"Int64(1) OR Int64(2) OR Int64(3) OR Int64(4)");
    }
}

fmonjalet added 4 commits May 14, 2025 08:36

Verified

This commit was signed with the committer’s verified signature.
fmonjalet Florent Monjalet

Verified

This commit was signed with the committer’s verified signature.
fmonjalet Florent Monjalet
When transforming a substrait function call to DataFusion logical plan,
if the substrait function maps to a DataFusion binary operator, but has
more than 2 arguments, it is mapped to a tree of BinaryExpr. This
BinaryExpr tree is not balanced, and its depth is the number of
arguments:

       Op
      /  \
    arg1  Op
         /  \
       arg2  ...
             /  \
           argN  Op

Since many functions manipulating the logical plan are recursive, it
means that N arguments result in an O(N) recursion, leading to stack
overflows for large N (1000 for example).

Transforming these function calls into a balanced tree mitigates the
issue:

             .__ Op __.
            /          \
          Op            Op
         /  \          /  \
       ...  ...      ...  ...
      /  \  /  \    /  \  /  \
    arg1        ...          argN

The recursion depth is now O(log2(N)), meaning that 1000 arguments
results in a depth of ~10, and it would take 2^1000 arguments to reach a
depth of 1000, which is a vastly unreasonable amount of data.

Therefore, it's not possible to use this flaw anymore to trigger stack
overflows in processes running DataFusion.

Verified

This commit was signed with the committer’s verified signature.
fmonjalet Florent Monjalet

Verified

This commit was signed with the committer’s verified signature.
fmonjalet Florent Monjalet
@fmonjalet fmonjalet force-pushed the fix/substrait-stack-overflow branch from 29942f6 to b693229 Compare May 14, 2025 06:51
@fmonjalet
Copy link
Contributor Author

fmonjalet commented May 14, 2025

Thanks for the second round of feedback! This is now ready for a re-review.
All the comments should have been taken into account. I force pushed to remove the first commit that added the large file, so that PR passes the CI (otherwise the large file is still in the history and the CI check fails), but all new changes have been added as additional commits for easier review, starting with this one.

fmonjalet added 3 commits May 14, 2025 08:55

Verified

This commit was signed with the committer’s verified signature.
fmonjalet Florent Monjalet

Verified

This commit was signed with the committer’s verified signature.
fmonjalet Florent Monjalet

Verified

This commit was signed with the committer’s verified signature.
fmonjalet Florent Monjalet
Courtesy of @gabotechs
@fmonjalet fmonjalet force-pushed the fix/substrait-stack-overflow branch from b693229 to 1fc8552 Compare May 14, 2025 06:56
Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @fmonjalet I think it is LGTM to be now, love the tests

Verified

This commit was signed with the committer’s verified signature.
fmonjalet Florent Monjalet

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
@alamb
Copy link
Contributor

alamb commented May 19, 2025

I plan to merge this PR to main once the CI has passed

@alamb alamb merged commit 2eaac22 into apache:main May 19, 2025
27 checks passed
@alamb
Copy link
Contributor

alamb commented May 19, 2025

Thanks again everyone

@fmonjalet
Copy link
Contributor Author

Thanks everyone for the review and the merge!

fmonjalet added a commit to DataDog/datafusion that referenced this pull request May 20, 2025

Verified

This commit was signed with the committer’s verified signature.
fmonjalet Florent Monjalet
… that translate to DataFusion binary operators (apache#16031)

* Add substrait consumer test causing a stack overflow

* Mitigate stack overflow for substrait binary op with large arg list

When transforming a substrait function call to DataFusion logical plan,
if the substrait function maps to a DataFusion binary operator, but has
more than 2 arguments, it is mapped to a tree of BinaryExpr. This
BinaryExpr tree is not balanced, and its depth is the number of
arguments:

       Op
      /  \
    arg1  Op
         /  \
       arg2  ...
             /  \
           argN  Op

Since many functions manipulating the logical plan are recursive, it
means that N arguments result in an O(N) recursion, leading to stack
overflows for large N (1000 for example).

Transforming these function calls into a balanced tree mitigates the
issue:

             .__ Op __.
            /          \
          Op            Op
         /  \          /  \
       ...  ...      ...  ...
      /  \  /  \    /  \  /  \
    arg1        ...          argN

The recursion depth is now O(log2(N)), meaning that 1000 arguments
results in a depth of ~10, and it would take 2^1000 arguments to reach a
depth of 1000, which is a vastly unreasonable amount of data.

Therefore, it's not possible to use this flaw anymore to trigger stack
overflows in processes running DataFusion.

* arg_list_to_binary_op_tree: avoid cloning Expr

* cargo fmt

* from_scalar_function: improve error handling

* Move test_binary_op_large_argument_list test to scalar_function module

* arg_list_to_binary_op_tree: add more unit tests

Courtesy of @gabotechs

* substrait consumer scalar_function tests: more explicit function name

---------

Co-authored-by: Andrew Lamb <[email protected]>
(cherry picked from commit 2eaac22)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
substrait Changes to the substrait crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Stack overflow for substrait functions with large argument lists that translate to DataFusion binary operators
6 participants