PARQUET-2139: fix file_offset field in ColumnChunk metadata #1369

etseidl · 2024-06-03T17:40:49Z

Fixes the referenced issue wherein the file_offset field of the ColumnChunk object is improperly set to the offset of the first page in the column chunk. Because parquet-java does not write a copy of ColumnMetaData after the column chunk, this PR simply sets the value of file_offset to 0.

Make sure you have checked all steps below.

Jira

My PR addresses the following Parquet Jira issues and references
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
- https://issues.apache.org/jira/browse/PARQUET-2139
- In case you are adding a dependency, check if the license complies with
  the ASF 3rd Party License Policy.

Tests

My PR adds the following unit tests OR does not need testing for this extremely good reason:
The footer metadata lacks the file_offset field, so unit testing is difficult. Manual inspection of generated files confirms the desired output.

Commits

My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines
from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters (not including Jira issue reference)
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

Style

My contribution adheres to the code style guidelines and Spotless passes.
- To apply the necessary changes, run mvn spotless:apply -Pvector-plugins

Documentation

In case of new functionality, my PR adds documentation that describes how to use it.
- All the public functions and the classes in the PR contain Javadoc that explain what it does

wgtmac · 2024-06-04T01:40:05Z

parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java

-          new ColumnChunk(columnMetaData.getFirstDataPageOffset()); // verify this is the right offset
+      // There is no ColumnMetaData written after the chunk data, so set the ColumnChunk
+      // file_offset to 0
+      ColumnChunk columnChunk = new ColumnChunk(0);


This is something that @etseidl and I have discussed in https://issues.apache.org/jira/browse/PARQUET-2139. The best fix is to write ColumnMetaData at the end of each column chunk (which is currently not done) and store the correct offset here. However, it has been wrong since day 1 and would take some effort to make right. Since we have not seen any issues around this in all these years, I'm inclined to deprecate this field together with the v3 discussion. Therefore I'm fine with setting an invalid value here (0 or -1). WDYT? @gszadovszky @julienledem

etseidl · Jun 4, 2024

I also brought this up on the mailing list.

gszadovszky · Jun 4, 2024

@wgtmac, I agree with writing an invalid value here (0 is as invalid as -1 because of the magic bytes at the beginning of the file) and removing the field for v3.
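To illustrate the point about the magic bytes: a Parquet file starts with a 4-byte magic number, so no page or metadata structure can begin at offset 0. The following is a minimal sketch of that reasoning. `OffsetSanity` and `couldPointAtData` are hypothetical names, not part of parquet-java, and the check ignores the footer and trailing magic for simplicity.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration only; not a parquet-java class. */
final class OffsetSanity {
    // A Parquet file with a plaintext footer begins with the 4-byte magic "PAR1".
    static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    /**
     * Returns true if a structure could plausibly start at the given offset.
     * Offsets inside the leading magic (0..3) and negative offsets are rejected,
     * which is why 0 and -1 are equally invalid as a file_offset.
     */
    static boolean couldPointAtData(long offset, long fileLength) {
        return offset >= MAGIC.length && offset < fileLength;
    }
}
```

Under this simplified check, both 0 and -1 are rejected, so either can serve as a "not set" marker.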

julienledem · Jun 5, 2024

I don't think we intended to write the ColumnMetaData at the end of the column chunk though. Is it something that is ambiguous in the spec?

julienledem · Jun 5, 2024

(I followed up on the mailing list on the thread above)

wgtmac · Jun 5, 2024

parquet-cpp actually writes ColumnMetaData right after the last page of the column and stores its offset in the file_offset field: https://github.com/apache/arrow/blob/6800be9331d88024bf550c77865a06c592a22699/cpp/src/parquet/metadata.cc#L1473-L1478
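The difference between the two writer behaviors discussed above can be sketched with a toy model. This is not the parquet-cpp or parquet-java code: `ChunkLayoutSketch`, `Written`, and `writeChunk` are hypothetical names, the byte arrays stand in for real page data and a Thrift-serialized ColumnMetaData, and only a single column chunk is written.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Toy model of the two writer behaviors; all names here are hypothetical. */
final class ChunkLayoutSketch {
    /** The bytes written so far and the value a writer would record in file_offset. */
    static final class Written {
        final byte[] fileBytes;
        final long fileOffset;

        Written(byte[] fileBytes, long fileOffset) {
            this.fileBytes = fileBytes;
            this.fileOffset = fileOffset;
        }
    }

    static Written writeChunk(byte[] pages, byte[] columnMetaData, boolean trailingMetaData) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] magic = "PAR1".getBytes(StandardCharsets.US_ASCII);
        out.write(magic, 0, magic.length); // leading magic bytes
        out.write(pages, 0, pages.length); // page data of the column chunk
        if (!trailingMetaData) {
            // No copy of ColumnMetaData follows the pages, so there is
            // nothing for file_offset to point at; record 0 as in this PR.
            return new Written(out.toByteArray(), 0L);
        }
        // Position right after the last page, where the metadata copy starts.
        long metaDataOffset = out.size();
        out.write(columnMetaData, 0, columnMetaData.length);
        return new Written(out.toByteArray(), metaDataOffset);
    }
}
```

With 10 bytes of pages and a 3-byte metadata stand-in, the trailing-metadata variant records 14 (4 magic bytes plus 10 page bytes), while the other variant records 0.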

julienledem · 2024-06-05T01:18:19Z

I don't remember all the context, but if this is completely wrong, I'd rather deprecate the field and document it should not be used rather than setting the value to zero.
Setting to zero has a few issues:

- it doesn't properly communicate that the field should not be used and can be confusing
- it might break implementations that have been using this to find the first page. Since setting it to zero doesn't improve the situation, as it is merely a different wrong value, I'd rather we don't change the behavior until the field has been removed at the end of the deprecation cycle.

What do other implementations put in this field? (if no other implementation sets this, then this might be a different story)

etseidl · 2024-06-05T04:39:56Z

> I'd rather deprecate the field and document it should not be used rather than setting the value to zero.

I agree with deprecating, but I'm less sanguine about leaving an incorrect value in parquet-java, especially given the fact that arrow-cpp (and arrow-rs I believe) populate this field correctly. Having such a big difference between major implementations is IMO more confusing than stating the field should be set to 0 (or -1) if there is no second copy of the ColumnMetaData.

> it might break implementations that have been using this to find the first page.

Implementations that do this will break anyway if they try to read a file produced by arrow, so I don't know how big of a concern this is.

That said, if the consensus is to just leave this be, that's fine too...we'd just have to make note of differing interpretations in the format documents.
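For readers worried about the compatibility concern above, a reader does not need file_offset to find the first page: the ColumnMetaData struct carries data_page_offset and an optional dictionary_page_offset. A minimal sketch of that lookup follows. `FirstPageLocator` and `ColumnOffsets` are hypothetical stand-ins for the Thrift-generated classes, and the guard conditions are an assumption modeled on common reader practice rather than a quote of any implementation.

```java
/** Hypothetical sketch; the real fields live in the Thrift ColumnMetaData struct. */
final class FirstPageLocator {
    /** Stand-in for the two page offset fields of ColumnMetaData. */
    static final class ColumnOffsets {
        final long dataPageOffset;       // data_page_offset (required)
        final Long dictionaryPageOffset; // dictionary_page_offset (optional, null if absent)

        ColumnOffsets(long dataPageOffset, Long dictionaryPageOffset) {
            this.dataPageOffset = dataPageOffset;
            this.dictionaryPageOffset = dictionaryPageOffset;
        }
    }

    /** Returns the offset of the first page of the chunk without consulting file_offset. */
    static long firstPageOffset(ColumnOffsets c) {
        Long dict = c.dictionaryPageOffset;
        // Use the dictionary page only if it is set, positive, and precedes the data pages.
        if (dict != null && dict > 0 && dict < c.dataPageOffset) {
            return dict;
        }
        return c.dataPageOffset;
    }
}
```

A reader written this way behaves the same whether a writer stores the first page offset, the trailing metadata offset, or 0 in file_offset.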

PARQUET-2139: set metadata offset to 0 since it is not written at all

f7fb556

wgtmac reviewed Jun 4, 2024

etseidl mentioned this pull request Jun 25, 2024

PARQUET-2139: Deprecate ColumnChunk::file_offset field apache/parquet-format#440


