This repository has been archived by the owner on Jun 14, 2024. It is now read-only.

[WIP] PartitionEliminationFilterIndexRule implementation #390

Draft
wants to merge 3 commits into
base: master

Conversation

apoorvedave1
Contributor

What is the context for this pull request?

This PR introduces a FilterIndexRule for PartitionEliminationNonCoveringIndex. Rule algorithm:

  1. Identify whether an index contains indexed columns that can improve point lookups or range queries on the data source.
  2. Query the index using Spark to identify the list of data files that could satisfy the query.
  3. Redirect the original query to this subset of data files instead of the complete list of files.
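
The three steps above can be modeled with plain Scala collections. This is only a minimal in-memory sketch, not the rule's actual implementation: `IndexEntry`, `isApplicable`, `candidateFiles`, and `filesToScan` are hypothetical names chosen for illustration, and the real rule operates on Spark logical plans rather than on sequences.

```scala
// Hypothetical in-memory model of the three-step rule.
// None of these names come from this PR or from Hyperspace.
object PartitionEliminationSketch {
  // One index entry: a value of the indexed column and the data file that holds it.
  final case class IndexEntry(key: Int, dataFile: String)

  // Step 1: the index helps only if the filter column is one of its indexed columns.
  def isApplicable(indexedColumns: Set[String], filterColumn: String): Boolean =
    indexedColumns.contains(filterColumn)

  // Step 2: query the index for the data files that may contain matching rows.
  def candidateFiles(index: Seq[IndexEntry], predicate: Int => Boolean): Set[String] =
    index.filter(e => predicate(e.key)).map(_.dataFile).toSet

  // Step 3: scan only the candidate files; fall back to every file
  // when the index is not applicable to the filter.
  def filesToScan(
      allFiles: Seq[String],
      indexedColumns: Set[String],
      index: Seq[IndexEntry],
      filterColumn: String,
      predicate: Int => Boolean): Seq[String] =
    if (isApplicable(indexedColumns, filterColumn)) {
      val candidates = candidateFiles(index, predicate)
      allFiles.filter(candidates.contains)
    } else {
      allFiles
    }
}
```

The predicate covers both point lookups (`_ == 5`) and range queries (`_ >= 9`), so the same pruning path serves both query shapes in this model.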
  • Tracking Issue: If you expect any subjective discussions around this pull request, please consider opening a tracking issue and link to the PR. Write N/A, if this pull request is self-contained.
  • Parent Issue: Link to the issue that captures the overall plan. Write N/A, if this is a stand-alone pull request with a tracking issue OR self-contained pull request.
  • Dependencies: Links to issues you depend on for this pull request to work. Write N/A, if no dependencies.
    • Issue 1
    • Issue 2

What changes were proposed in this pull request?

Does this PR introduce any user-facing change?

No

How was this patch tested?


val filteredDf =
  spark.read
    .parquet(index.content.files.map(_.toString): _*)
Collaborator

I thought we would build an index of (value => file ID list) instead of doing a full scan of the covering index.
Could you measure the performance of the optimize phase with the 1TB TPC-H dataset?
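
For illustration, a (value => file ID list) lookup could look roughly like the following. It is a sketch under assumptions, not code from this PR: the object and method names are hypothetical, and the index entries are modeled as in-memory `(value, fileId)` pairs.

```scala
// Hypothetical sketch of a (value => file ID list) lookup structure.
object ValueToFilesSketch {
  // Build the lookup once: indexed value -> sorted, de-duplicated file IDs.
  def build(entries: Seq[(Int, Long)]): Map[Int, Seq[Long]] =
    entries.groupBy(_._1).map { case (value, pairs) =>
      value -> pairs.map(_._2).distinct.sorted
    }

  // A point lookup is a single map access; no scan over the index entries is needed.
  def lookup(valueToFiles: Map[Int, Seq[Long]], value: Int): Seq[Long] =
    valueToFiles.getOrElse(value, Seq.empty)
}
```

The trade-off in this model is that the map is built ahead of time, so the optimize phase pays for one lookup per queried value rather than a pass over all index data.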

Contributor Author

This would not work if we want to merge the covering and partition elimination indexes.

Collaborator

@sezruby sezruby Apr 8, 2021

We cannot deliver this rule without proper performance validation, as it seems expensive. Could you run it on the TPC-H 1TB 100k chunk dataset and share the results?

  1. Without PEFilterIndexRule, for comparison: explain time and query execution time
  2. With PEFilterIndexRule: explain time and query execution time
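
One way to collect the two numbers for each configuration is a small timing harness like the one below. This is a sketch only: `timed`, `measure`, and `Measurement` are hypothetical helpers, and a single wall-clock run is a rough signal rather than a proper benchmark.

```scala
// Hypothetical timing helpers for comparing runs with and without the rule.
object TimingSketch {
  // Runs `body` once and returns its result with the elapsed wall-clock time in milliseconds.
  def timed[A](body: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1e6)
  }

  final case class Measurement(label: String, explainMs: Double, executionMs: Double)

  // Times the planning step and the execution step separately.
  def measure[P, R](label: String)(plan: => P)(execute: P => R): Measurement = {
    val (planned, explainMs) = timed(plan)
    val (_, executionMs) = timed(execute(planned))
    Measurement(label, explainMs, executionMs)
  }
}
```

With Spark, the `plan` argument could, for example, force `df.queryExecution.optimizedPlan` and the `execute` argument could call `df.collect()`; that pairing is an assumption about how to split the two phases, not something specified in this thread.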

2 participants