[GLUTEN-8623][CH] Support File meta and row index for parquet #8624
LGTM
What changes were proposed in this pull request?
(Fixes: #8623)
This PR adds file meta support for all supported formats and row index support for Parquet.
Supporting File Meta
To support file meta, FileReaderWrapper is renamed to BaseReader and gains a new member, DB::Columns addVirtualColumn(DB::Chunk dataChunk, size_t rowNum = 0) const, which is called in XXFileReader::pull. NormalFileReader::pull is responsible for reading real data from the file, and ConstColumnsFileReader::pull is responsible for generating n rows of metadata when there is no need to read real data. After the data is read from the file, the file meta is added.
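The division of work between the two readers can be sketched as follows. This is only an illustration, not the code in this PR: Column, Columns, FileMeta and the *Sketch classes are hypothetical stand-ins for DB::Chunk, DB::Columns and the real readers, and the choice of file path and file size as meta columns is an assumption made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for DB::Columns: one vector of strings per column.
using Column = std::vector<std::string>;
using Columns = std::vector<Column>;

// Hypothetical per-file metadata; the fields are illustrative only.
struct FileMeta
{
    std::string file_path;
    std::uint64_t file_size;
};

class BaseReaderSketch
{
public:
    explicit BaseReaderSketch(FileMeta meta_) : meta(std::move(meta_)) {}
    virtual ~BaseReaderSketch() = default;
    virtual Columns pull() = 0;

protected:
    // Plays the role of addVirtualColumn: appends constant file meta columns.
    // When there are no data columns, row_num decides how many rows to emit.
    Columns addVirtualColumn(Columns data, std::size_t row_num = 0) const
    {
        const std::size_t rows = data.empty() ? row_num : data.front().size();
        data.emplace_back(rows, meta.file_path);
        data.emplace_back(rows, std::to_string(meta.file_size));
        return data;
    }

private:
    FileMeta meta;
};

// Reads real data (here: data handed in up front), then adds file meta.
class NormalFileReaderSketch : public BaseReaderSketch
{
public:
    NormalFileReaderSketch(FileMeta meta_, Columns file_data_)
        : BaseReaderSketch(std::move(meta_)), file_data(std::move(file_data_)) {}
    Columns pull() override { return addVirtualColumn(file_data); }

private:
    Columns file_data;
};

// Reads nothing from the file; only generates `rows` rows of file meta.
class ConstColumnsFileReaderSketch : public BaseReaderSketch
{
public:
    ConstColumnsFileReaderSketch(FileMeta meta_, std::size_t rows_)
        : BaseReaderSketch(std::move(meta_)), rows(rows_) {}
    Columns pull() override { return addVirtualColumn(Columns{}, rows); }

private:
    std::size_t rows;
};
```

In this sketch both readers share one helper, so the meta columns are always appended last and always have the same row count as the data (or the requested row count when there is no data).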
Supporting Row Index for Parquet
To support the row index for Parquet, I refactored FormatFile::InputFormat and created a new child class, ParquetInputFormat. ParquetInputFormat::generate does the same as XXFileReader::pull: it reads the real data from the Parquet file first and then adds the row index.
How was this patch tested?
Spark 3.5 tests are added.
see https://opencicd.kyligence.com/blue/rest/organizations/jenkins/pipelines/gluten/pipelines/gluten-ci/runs/14511/nodes/151/steps/205/log/?start=0
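For reviewers, the read-then-add-row-index step described for ParquetInputFormat::generate can be pictured with the minimal sketch below. It is not the code in this PR: ChunkSketch and ParquetInputFormatSketch are hypothetical names, the file is modelled as an in-memory vector, and the sketch assumes rows are read contiguously from the start of the file, so it does not cover skipped row groups or filtered rows.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a chunk: one data column plus the row index column.
struct ChunkSketch
{
    std::vector<std::int64_t> values;    // real data read from the file
    std::vector<std::int64_t> row_index; // position of each row within the file
};

class ParquetInputFormatSketch
{
public:
    ParquetInputFormatSketch(std::vector<std::int64_t> file_values_, std::size_t batch_size_)
        : file_values(std::move(file_values_)), batch_size(batch_size_) {}

    // Step 1: read the next batch of real data.
    // Step 2: append the row index, continuing from where the last batch ended.
    // Returns an empty chunk once the file is exhausted.
    ChunkSketch generate()
    {
        ChunkSketch chunk;
        const std::size_t end = std::min(next_row + batch_size, file_values.size());
        chunk.values.assign(
            file_values.begin() + static_cast<std::ptrdiff_t>(next_row),
            file_values.begin() + static_cast<std::ptrdiff_t>(end));
        for (std::size_t i = next_row; i < end; ++i)
            chunk.row_index.push_back(static_cast<std::int64_t>(i));
        next_row = end;
        return chunk;
    }

private:
    std::vector<std::int64_t> file_values;
    std::size_t batch_size;
    std::size_t next_row = 0; // running offset of the next unread row
};
```

The point of the sketch is that the row index is derived from reader state (the running offset) rather than from the data itself, which is why it has to be added inside the format's generate step instead of in the generic reader.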