Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] [FOLLOW-UP] [Hybrid/C2C] Validate predicate push down and filtering #11892

Open
res-life opened this issue Dec 19, 2024 · 2 comments
Open
Labels
? - Needs Triage Need team to review and classify feature request New feature or request

Comments

@res-life
Copy link
Collaborator

res-life commented Dec 19, 2024

Is your feature request related to a problem? Please describe.
It's from #11720 (comment)

Can we add some tests to validate that predicate push down and
filtering is working correctly? It would be nice to have

  1. simple filters
  2. complex filters that are not supported by normal parquet predicate push down. (like the ors at the top level instead of ands)
  3. filters that have operators in them that velox does not support, but spark rapids does.

Describe the solution you'd like
First test predicate push down and filtering.
Then add support/fix for the failed cases.

Additional context
It's related to Hybrid/C2C feature.

@res-life res-life added ? - Needs Triage Need team to review and classify feature request New feature or request labels Dec 19, 2024
@res-life
Copy link
Collaborator Author

@thirtiseven I remember you have a fix related to this, please clarify it, thanks.

@thirtiseven
Copy link
Collaborator

In 11720, Scan followed by a Filter will lead to a case that all conditions being pushed down to the Scan but still remaining in the Filter at the same time, so the filter conditions are evaluated twice. Typically the second evaluation is quite fast so it won't be a big problem (but we still want to remove it). But if there are some conditions that are not supported by Velox or Rapids it will cause some problems.

So for this issue, we need to check the following cases:

  • If a filter condition is not supported by either Velox or Rapids, we should fallback to the CPU somewhere (I think it should be handled by the Filter and not pushed down).
  • If a filter condition is only supported by Velox, we should push it down to the Scan. For current code it will lead to a unnecessary fallback while the values are already filtered by the Scan.
  • If a filter condition is only supported by Rapids, we should not push it down to the Scan and let Rapids handle it. In the current code I believe it will lead to some kind of exception because we don't do fallback on the Velox side.
  • If all conditions are pushed down to the Scan, we should just remove the GpuFilter node to avoid the double evaluation.

We already have a POC version of this in customer enviroment, but it was quite hack and have some hardcode. We can find a better solution to check if Velox supports a condition and make the plan change a rule.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
? - Needs Triage Need team to review and classify feature request New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants