fix: stack overflow for substrait functions with large argument lists that translate to DataFusion binary operators #16031

fmonjalet · 2025-05-12T12:00:01Z

Which issue does this PR close?

Closes Stack overflow for substrait functions with large argument lists that translate to DataFusion binary operators #16030

Rationale for this change

Some substrait plans can cause DataFusion to trigger stack overflow, crashing the hosting process and preventing to run them. This stack overflow is caused by very large recursion levels. See issue for more details.

This kind of issue may also occur in the SQL --> Logical plan path, but I did not investigate it yet. The "ideal" fix would be not to rely on recursion for most operations on plans, but this sounds like an unreasonably large and complex code change for now.

This is a fairly small change, the high diff count is because of the added sample substrait plan, that has to be large to reproduce the issue.
==> The PR has been updated to avoid the large test file.

What changes are included in this PR?

When transforming a substrait function call to DataFusion logical plan,
if the substrait function maps to a DataFusion binary operator, but has
more than 2 arguments, it is mapped to a tree of BinaryExpr. This
BinaryExpr tree is not balanced, and its depth is the number of
arguments:

       Op
      /  \
    arg1  Op
         /  \
       arg2  ...
             /  \
           argN  Op

Since many functions manipulating the logical plan are recursive, it
means that N arguments result in an O(N) recursion, leading to stack
overflows for large N (1000 for example).

Transforming these function calls into a balanced tree mitigates the
issue:

             .__ Op __.
            /          \
          Op            Op
         /  \          /  \
       ...  ...      ...  ...
      /  \  /  \    /  \  /  \
    arg1        ...          argN

The recursion depth is now O(log2(N)), meaning that 1000 arguments
results in a depth of ~10, and it would take 2^1000 arguments to reach a
depth of 1000, which is a vastly unreasonable amount of data.

Therefore, it's not possible to use this flaw anymore to trigger stack
overflows in processes running DataFusion.

The new balanced tree preserves the order of execution (arg1 evaluated
before arg2) and has the same order of magnitude of operator nodes.
The main change is that there is now at least O(log2(N)) operators evaluated,
whereas before the minimum was 1, if the first operator exited after evaluating
only its left side.

Are these changes tested?

Yes

Are there any user-facing changes?

No, the change to the logical plan is semantically equivalent and should not be exposed to users.

fmonjalet · 2025-05-12T12:13:43Z

According to the contributor guide, mentioning @alamb to trigger the CI because you were the most frequent reviewer on the substrait topic recently (sorry if the direct ping is too noisy -- tell me if I should proceed otherwise).

Also pinging @gabotechs as the latest committer on this file.

jayzhan211 · 2025-05-12T12:19:54Z

Is it possible to add the test that generate equivalent json?

datafusion/substrait/src/logical_plan/consumer/expr/scalar_function.rs

fmonjalet · 2025-05-12T16:43:12Z

Is it possible to add the test that generate equivalent json?

@jayzhan211 sure, I can even try to build the substrait plan programatically. This will be more verbose, but more explicit. I'll keep you posted!

fmonjalet · 2025-05-13T14:51:50Z

@jayzhan211 After looking into it, generating the plan programatically may become very unwieldy (very verbose). Is your concern about the size of the file checked into git, or the fact that it's hard to read because it's so large?
If it's the former, I can put in two files and put together some string formatting. If it's the former, this won't help as much, but will still be more readable for humans.

jayzhan211 · 2025-05-13T15:07:19Z

@jayzhan211 After looking into it, generating the plan programatically may become very unwieldy (very verbose). Is your concern about the size of the file checked into git, or the fact that it's hard to read because it's so large? If it's the former, I can put in two files and put together some string formatting. If it's the former, this won't help as much, but will still be more readable for humans.

My idea is that we can generate such parameters for the code we want to test, we don't need end to end test from json.

I take a quick look, if we enter this function, I think it is possible to generate a deep nested Expression that easily cause stack overflow.

    async fn consume_expression(
        &self,
        expr: &Expression,
        input_schema: &DFSchema,
    ) -> datafusion::common::Result<Expr> {
        from_substrait_rex(self, expr, input_schema).await
    }

If that doesn't work, I think we don't even need to keep such test since the change to me is quite trivial (pulling function out is common to fix the overflow issue).

fmonjalet · 2025-05-13T16:20:03Z

Oh right it makes sense, I'll try to produce a test based on consume_expression then. Thanks!

fmonjalet · 2025-05-13T17:01:24Z

@jayzhan211 and @gabotechs thank you for your guidance, I updated the PR:

Expressions are not cloned anymore.
The test has been scoped to a much more unit test style, where we build a substrait ScalarFunction call and translate it to a DataFusion expression. I manually checked that reverting the fix made the test fail. It requires a larger argument list to trigger the issue because the expressions are simpler than in the original complex plan I submitted.

Ready for a next round of feedback if any!

gabotechs

💯 very smart usage of std::vec::Drain +1

I don't think it's strictly necessary, but leaving here some further testing ideas that I explored on my own:

not strictly necessary unit tests

#[cfg(test)]
mod tests {
    use super::*;
    use insta::assert_snapshot;

    fn n(n: i64) -> Expr {
        Expr::Literal(ScalarValue::Int64(Some(n)))
    }

    #[test]
    fn arg_list_to_binary_op_tree_1_arg() {
        let expr = arg_list_to_binary_op_tree(Operator::Or, vec![n(1)]);
        assert_snapshot!(expr.to_string(), @"Int64(1)");
    }

    #[test]
    fn arg_list_to_binary_op_tree_2_args() {
        let expr = arg_list_to_binary_op_tree(Operator::Or, vec![n(1), n(2)]);
        assert_snapshot!(expr.to_string(), @"Int64(1) OR Int64(2)");
    }

    #[test]
    fn arg_list_to_binary_op_tree_3_args() {
        let expr = arg_list_to_binary_op_tree(Operator::Or, vec![n(1), n(2), n(3)]);
        assert_snapshot!(expr.to_string(), @"Int64(1) OR Int64(2) OR Int64(3)");
    }

    #[test]
    fn arg_list_to_binary_op_tree_4_args() {
        let expr = arg_list_to_binary_op_tree(Operator::Or, vec![n(1), n(2), n(3), n(4)]);
        assert_snapshot!(expr.to_string(), @"Int64(1) OR Int64(2) OR Int64(3) OR Int64(4)");
    }
}

datafusion/substrait/src/logical_plan/consumer/expr/scalar_function.rs

datafusion/substrait/src/logical_plan/consumer/utils.rs

datafusion/substrait/src/logical_plan/consumer/expr/scalar_function.rs

When transforming a substrait function call to DataFusion logical plan, if the substrait function maps to a DataFusion binary operator, but has more than 2 arguments, it is mapped to a tree of BinaryExpr. This BinaryExpr tree is not balanced, and its depth is the number of arguments: Op / \ arg1 Op / \ arg2 ... / \ argN Op Since many functions manipulating the logical plan are recursive, it means that N arguments result in an O(N) recursion, leading to stack overflows for large N (1000 for example). Transforming these function calls into a balanced tree mitigates the issue: .__ Op __. / \ Op Op / \ / \ ... ... ... ... / \ / \ / \ / \ arg1 ... argN The recursion depth is now O(log2(N)), meaning that 1000 arguments results in a depth of ~10, and it would take 2^1000 arguments to reach a depth of 1000, which is a vastly unreasonable amount of data. Therefore, it's not possible to use this flaw anymore to trigger stack overflows in processes running DataFusion.

fmonjalet · 2025-05-14T06:54:32Z

Thanks for the second round of feedback! This is now ready for a re-review.
All the comments should have been taken into account. I force pushed to remove the first commit that added the large file, so that PR passes the CI (otherwise the large file is still in the history and the CI check fails), but all new changes have been added as additional commits for easier review, starting with this one.

@gabotechs

Courtesy of @gabotechs

comphead

Thanks @fmonjalet I think it is LGTM to be now, love the tests

datafusion/substrait/src/logical_plan/consumer/expr/scalar_function.rs

alamb · 2025-05-19T17:27:59Z

I plan to merge this PR to main once the CI has passed

alamb · 2025-05-19T18:02:41Z

Thanks again everyone

fmonjalet · 2025-05-20T07:01:40Z

Thanks everyone for the review and the merge!

@gabotechs

… that translate to DataFusion binary operators (apache#16031) * Add substrait consumer test causing a stack overflow * Mitigate stack overflow for substrait binary op with large arg list When transforming a substrait function call to DataFusion logical plan, if the substrait function maps to a DataFusion binary operator, but has more than 2 arguments, it is mapped to a tree of BinaryExpr. This BinaryExpr tree is not balanced, and its depth is the number of arguments: Op / \ arg1 Op / \ arg2 ... / \ argN Op Since many functions manipulating the logical plan are recursive, it means that N arguments result in an O(N) recursion, leading to stack overflows for large N (1000 for example). Transforming these function calls into a balanced tree mitigates the issue: .__ Op __. / \ Op Op / \ / \ ... ... ... ... / \ / \ / \ / \ arg1 ... argN The recursion depth is now O(log2(N)), meaning that 1000 arguments results in a depth of ~10, and it would take 2^1000 arguments to reach a depth of 1000, which is a vastly unreasonable amount of data. Therefore, it's not possible to use this flaw anymore to trigger stack overflows in processes running DataFusion. * arg_list_to_binary_op_tree: avoid cloning Expr * cargo fmt * from_scalar_function: improve error handling * Move test_binary_op_large_argument_list test to scalar_function module * arg_list_to_binary_op_tree: add more unit tests Courtesy of @gabotechs * substrait consumer scalar_function tests: more explicit function name --------- Co-authored-by: Andrew Lamb <[email protected]> (cherry picked from commit 2eaac22)

github-actions bot added the substrait label May 12, 2025

gabotechs reviewed May 12, 2025

View reviewed changes

datafusion/substrait/src/logical_plan/consumer/expr/scalar_function.rs Outdated Show resolved Hide resolved

gabotechs approved these changes May 13, 2025

View reviewed changes

datafusion/substrait/src/logical_plan/consumer/expr/scalar_function.rs Outdated Show resolved Hide resolved

datafusion/substrait/src/logical_plan/consumer/utils.rs Outdated Show resolved Hide resolved

datafusion/substrait/src/logical_plan/consumer/utils.rs Outdated Show resolved Hide resolved

comphead reviewed May 13, 2025

View reviewed changes

datafusion/substrait/src/logical_plan/consumer/expr/scalar_function.rs Outdated Show resolved Hide resolved

comphead reviewed May 13, 2025

View reviewed changes

datafusion/substrait/src/logical_plan/consumer/expr/scalar_function.rs Outdated Show resolved Hide resolved

comphead reviewed May 13, 2025

View reviewed changes

datafusion/substrait/src/logical_plan/consumer/expr/scalar_function.rs Outdated Show resolved Hide resolved

comphead reviewed May 13, 2025

View reviewed changes

datafusion/substrait/src/logical_plan/consumer/expr/scalar_function.rs Outdated Show resolved Hide resolved

fmonjalet added 4 commits May 14, 2025 08:36

Add substrait consumer test causing a stack overflow

Verified

This commit was signed with the committer’s verified signature.

fmonjalet Florent Monjalet

SSH Key Fingerprint: efNK9CSJXYe+xMqTg9lVDtJ4h/vYQKDb/WDObL01NwQ
Verified
Learn about vigilant mode

7ba043d

arg_list_to_binary_op_tree: avoid cloning Expr

Verified

This commit was signed with the committer’s verified signature.

fmonjalet Florent Monjalet

SSH Key Fingerprint: efNK9CSJXYe+xMqTg9lVDtJ4h/vYQKDb/WDObL01NwQ
Verified
Learn about vigilant mode

2830c51

cargo fmt

Verified

This commit was signed with the committer’s verified signature.

fmonjalet Florent Monjalet

SSH Key Fingerprint: efNK9CSJXYe+xMqTg9lVDtJ4h/vYQKDb/WDObL01NwQ
Verified
Learn about vigilant mode

e5fdb88

fmonjalet force-pushed the fix/substrait-stack-overflow branch from 29942f6 to b693229 Compare May 14, 2025 06:51

fmonjalet added 3 commits May 14, 2025 08:55

from_scalar_function: improve error handling

Verified

This commit was signed with the committer’s verified signature.

fmonjalet Florent Monjalet

SSH Key Fingerprint: efNK9CSJXYe+xMqTg9lVDtJ4h/vYQKDb/WDObL01NwQ
Verified
Learn about vigilant mode

ce19e9f

Move test_binary_op_large_argument_list test to scalar_function module

Verified

This commit was signed with the committer’s verified signature.

fmonjalet Florent Monjalet

SSH Key Fingerprint: efNK9CSJXYe+xMqTg9lVDtJ4h/vYQKDb/WDObL01NwQ
Verified
Learn about vigilant mode

205d751

fmonjalet force-pushed the fix/substrait-stack-overflow branch from b693229 to 1fc8552 Compare May 14, 2025 06:56

gabotechs approved these changes May 14, 2025

View reviewed changes

gabotechs mentioned this pull request May 14, 2025

[Branch 47] Cherry pick consumer refactor DataDog/datafusion#23

Merged

comphead approved these changes May 14, 2025

View reviewed changes

jayzhan211 reviewed May 15, 2025

View reviewed changes

datafusion/substrait/src/logical_plan/consumer/expr/scalar_function.rs Outdated Show resolved Hide resolved

jayzhan211 approved these changes May 15, 2025

View reviewed changes

Merge branch 'main' into fix/substrait-stack-overflow

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.

GPG key ID: B5690EEEBB952194

Verified
Learn about vigilant mode

Loading
Loading status checks…

3477ca5

alamb merged commit 2eaac22 into apache:main May 19, 2025
27 checks passed

fmonjalet mentioned this pull request May 20, 2025

[branch-47] Fix stack overflow for large binary operations DataDog/datafusion#25

Merged

fix: stack overflow for substrait functions with large argument lists that translate to DataFusion binary operators #16031

fix: stack overflow for substrait functions with large argument lists that translate to DataFusion binary operators #16031

Conversation

fmonjalet commented May 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

fmonjalet commented May 12, 2025

Uh oh!

jayzhan211 commented May 12, 2025

Uh oh!

Uh oh!

fmonjalet commented May 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fmonjalet commented May 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jayzhan211 commented May 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fmonjalet commented May 13, 2025

Uh oh!

fmonjalet commented May 13, 2025

Uh oh!

gabotechs left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

fmonjalet commented May 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

comphead left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

alamb commented May 19, 2025

Uh oh!

Uh oh!

alamb commented May 19, 2025

Uh oh!

fmonjalet commented May 20, 2025

Uh oh!

Uh oh!

fmonjalet commented May 12, 2025 •

edited

Loading

fmonjalet commented May 12, 2025 •

edited

Loading

fmonjalet commented May 13, 2025 •

edited

Loading

jayzhan211 commented May 13, 2025 •

edited

Loading

fmonjalet commented May 14, 2025 •

edited

Loading

comphead left a comment •

edited

Loading