Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-2139: fix file_offset field in ColumnChunk metadata #1369

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

etseidl
Copy link

@etseidl etseidl commented Jun 3, 2024

Fixes the referenced issue wherein the file_offset field of the ColumnChunk object is improperly set to the offset of the first page in the column chunk. Because parquet-java does not write a copy of ColumnMetaData after the column chunk, this PR simply sets the value of file_offset to 0.

Make sure you have checked all steps below.

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    The footer metadata lacks the file_offset field, so unit testing is difficult. Manual inspection of generated files confirms the desired output.

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines
    from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Style

  • My contribution adheres to the code style guidelines and Spotless passes.
    • To apply the necessary changes, run mvn spotless:apply -Pvector-plugins

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

new ColumnChunk(columnMetaData.getFirstDataPageOffset()); // verify this is the right offset
// There is no ColumnMetaData written after the chunk data, so set the ColumnChunk
// file_offset to 0
ColumnChunk columnChunk = new ColumnChunk(0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something that @etseidl and me have discussed in https://issues.apache.org/jira/browse/PARQUET-2139. The best fix is to write ColumnMetaData at the end of each column chunk (currently it does not) and store the correct offset here. However, it has been wrong since day 1 and takes some effort to make it right. Since we have not seen any issue around this these years, I'm inclined to deprecate this field together with the v3 discussion. Therefore I'm fine with setting an invalid value here (0 or -1). WDYT? @gszadovszky @julienledem

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also brought this up on the mailing list.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wgtmac, I agree to write invalid value here (0 is as invalid as -1 because of the magic bytes at the beginning of the file) and remove the field for v3.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we intended to write the ColumnMetaData at the end of the column chunk though. Is it something that is ambiguous in the spec?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I followed up on the mailing list on the thread above)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parquet-cpp actually writes ColumnMetaData right after the last page of the column and stores it into file_offset field: https://github.com/apache/arrow/blob/6800be9331d88024bf550c77865a06c592a22699/cpp/src/parquet/metadata.cc#L1473-L1478

@julienledem
Copy link
Member

julienledem commented Jun 5, 2024

I don't remember all the context, but if this is completely wrong, I'd rather deprecate the field and document it should not be used rather than setting the value to zero.
Setting to zero has a few issues:

  • it doesn't properly communicate that the field should not be used and can be confusing
  • it might break implementations that have been using this to find the first page. Since setting it to zero doesn't improve the situation as it is merely a different wrong value I'd rather we don't change the behavior until the field has been removed at the end of the deprecation cycle.

What do other implementations put in this field? (if no other implementation sets this, then this might be a different story)

@etseidl
Copy link
Author

etseidl commented Jun 5, 2024

I'd rather deprecate the field and document it should not be used rather than setting the value to zero.

I agree with deprecating, but I'm less sanguine about leaving an incorrect value in parquet-java, especially given the fact that arrow-cpp (and arrow-rs I believe) populate this field correctly. Having such a big difference between major implementations is IMO more confusing than stating the field should be set to 0 (or -1) if there is no second copy of the ColumnMetaData.

it might break implementations that have been using this to find the first page.

Implementations that do this will break anyway if they try to read a file produced by arrow, so I don't know how big of a concern this is.

That said, if the consensus is to just leave this be, that's fine too...we'd just have to make note of differing interpretations in the format documents.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants