Describe the bug, including details regarding any error messages, version, and platform.
Issue Title
parquet-go: ClickHouse fails to read generated Parquet files (No more data to read)
Description
When using github.com/xitongsys/parquet-go (v1.6.2) to write Parquet files, ClickHouse cannot read them and throws:
DB::Exception: apache::thrift::transport::TTransportException: No more data to read: read stage: ColumnData
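What works vs what doesn't
- Works: parquet-tools cat can read all data
- Fails: ClickHouse cannot read the same file and throws the exception above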
Root cause analysis
parquet-go does not write the ColumnIndex and OffsetIndex metadata correctly. ClickHouse relies on these indexes for predicate pushdown; when they are missing or corrupted, it throws "No more data to read" because it expects index entries that do not exist.
parquet-tools inspect test.parquet shows no column_index or offset_index information in the output, confirming the indexes are missing.
Related issues
Proposed fix
Merge the fix from #547 or incorporate changes from the RudderLabs fork that:
- Correctly calculate CompressedPageSize (include header size)
- Exclude dictionary pages from ColumnIndex arrays
- Properly write OffsetIndex metadata
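
To make the first two points concrete, here is a minimal, stdlib-only sketch of how OffsetIndex page locations could be derived. It is not the code from #547 or the RudderLabs fork: the `page` and `pageLocation` types and the `buildPageLocations` function are hypothetical names used for illustration only. It assumes pages are laid out back to back within a column chunk, counts the page header in each page's size, and skips dictionary pages so that only data pages get an index entry.

```go
package main

import "fmt"

// page is a hypothetical in-memory description of one page as written to disk.
type page struct {
	isDictionary   bool
	headerSize     int32 // length of the serialized page header, in bytes
	compressedSize int32 // length of the compressed page body, in bytes
	numRows        int64 // rows stored in the page (0 for a dictionary page)
}

// pageLocation mirrors the fields of a Parquet OffsetIndex PageLocation entry.
type pageLocation struct {
	Offset             int64
	CompressedPageSize int32
	FirstRowIndex      int64
}

// buildPageLocations returns one entry per data page. chunkStart is the file
// offset of the first page in the column chunk (an assumption of this sketch).
func buildPageLocations(chunkStart int64, pages []page) []pageLocation {
	locs := make([]pageLocation, 0, len(pages))
	offset := chunkStart
	var firstRow int64
	for _, p := range pages {
		// The size recorded in the index includes the page header.
		total := p.headerSize + p.compressedSize
		if !p.isDictionary {
			locs = append(locs, pageLocation{
				Offset:             offset,
				CompressedPageSize: total,
				FirstRowIndex:      firstRow,
			})
			firstRow += p.numRows
		}
		// Dictionary pages still occupy space, so the offset always advances.
		offset += int64(total)
	}
	return locs
}

func main() {
	pages := []page{
		{isDictionary: true, headerSize: 20, compressedSize: 100},
		{headerSize: 30, compressedSize: 500, numRows: 1000},
		{headerSize: 30, compressedSize: 400, numRows: 800},
	}
	for _, l := range buildPageLocations(4, pages) {
		fmt.Printf("%+v\n", l)
	}
}
```

With the sample input, the dictionary page produces no entry but still shifts the offset of the first data page, which is the kind of mismatch a reader would trip over if header sizes or dictionary pages were handled incorrectly.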
Workaround for users
- In ClickHouse:
SET input_format_parquet_use_native_reader = 0;
- Or replace the dependency in go.mod:
replace github.com/xitongsys/parquet-go => github.com/rudderlabs/parquet-go v0.0.3
Environment
- parquet-go version: v1.6.2
- ClickHouse version: 23.x / 24.x
Component(s)
Parquet