
parquet-go: ClickHouse fails to read generated Parquet files (No more data to read) #742

@harrylee2015

Describe the bug, including details regarding any error messages, version, and platform.

Description

Parquet files written with github.com/xitongsys/parquet-go (v1.6.2) cannot be read by ClickHouse, which throws:

DB::Exception: apache::thrift::transport::TTransportException: No more data to read: read stage: ColumnData

What works vs what doesn't

Root cause analysis

The issue is that parquet-go does not properly write the ColumnIndex and OffsetIndex metadata. ClickHouse relies on these indexes for predicate pushdown. When the indexes are missing or corrupted, ClickHouse throws "No more data to read" because it expects index entries that do not exist.

parquet-tools inspect test.parquet shows no column_index or offset_index information in the output, confirming the indexes are missing.

Related issues

Proposed fix

Merge the fix from #547 or incorporate changes from the RudderLabs fork that:

  1. Correctly calculate CompressedPageSize (include header size)
  2. Exclude dictionary pages from ColumnIndex arrays
  3. Properly write OffsetIndex metadata

Workaround for users

  1. In ClickHouse: SET input_format_parquet_use_native_reader = 0;
  2. Or replace dependency: replace github.com/xitongsys/parquet-go => github.com/rudderlabs/parquet-go v0.0.3

Environment

  • parquet-go version: v1.6.2
  • ClickHouse version: 23.x / 24.x

Component(s)

Parquet

Labels: Type: bug