toParquetMetadata method in ParquetMetadataConverter does not set dictionary page offset bit #2901

Open
asfimport opened this issue May 1, 2024 · 0 comments

The toParquetMetadata method in ParquetMetadataConverter converts org.apache.parquet.hadoop.metadata.ParquetMetadata to org.apache.parquet.format.FileMetaData, but the conversion does not set the dictionary page offset bit in FileMetaData.

When a FileMetaData object is serialized into the footer and later deserialized, the dictionary page offset is lost because the dictionary page offset bit was never set.
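
To illustrate the round trip described above, here is a minimal sketch of an optional field guarded by an "is set" flag. The class OptionalFieldSketch, its Meta type, and the map-based write/read helpers are hypothetical stand-ins, not the Thrift-generated FileMetaData code. The only assumption is that an optional field is written when its flag is set and is otherwise absent after reading.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of an optional field with an "is set" flag.
// This is NOT the Thrift-generated class; names are for illustration only.
class OptionalFieldSketch {
    static final String FIELD = "dictionary_page_offset";

    static final class Meta {
        private long dictionaryPageOffset;       // raw value, defaults to 0
        private boolean dictionaryPageOffsetSet; // the "is set" bit

        void setDictionaryPageOffset(long value) {
            dictionaryPageOffset = value;
            dictionaryPageOffsetSet = true;
        }

        boolean isSetDictionaryPageOffset() {
            return dictionaryPageOffsetSet;
        }

        long getDictionaryPageOffset() {
            return dictionaryPageOffset;
        }
    }

    // Assumption: the writer emits the optional field only when its bit is set.
    static Map<String, Long> write(Meta meta) {
        Map<String, Long> wire = new LinkedHashMap<>();
        if (meta.isSetDictionaryPageOffset()) {
            wire.put(FIELD, meta.getDictionaryPageOffset());
        }
        return wire;
    }

    // The reader can only restore what was actually written.
    static Meta read(Map<String, Long> wire) {
        Meta meta = new Meta();
        Long value = wire.get(FIELD);
        if (value != null) {
            meta.setDictionaryPageOffset(value);
        }
        return meta;
    }

    public static void main(String[] args) {
        // The converter never called the setter, so nothing reaches the wire.
        Meta skipped = read(write(new Meta()));
        System.out.println("setter skipped, is set: " + skipped.isSetDictionaryPageOffset());

        // When the setter is called, the offset survives the round trip.
        Meta original = new Meta();
        original.setDictionaryPageOffset(4L);
        Meta restored = read(write(original));
        System.out.println("setter called, offset: " + restored.getDictionaryPageOffset());
    }
}
```

In this model the reader has no way to recover an offset that the writer skipped, which is why the decision made at conversion time matters.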

PARQUET-1850 attempted to fix this, but the fix was only partial.

It calls setDictionary_page_offset only if getEncodingStats() returns a non-null value that reports dictionary pages:

if (columnMetaData.getEncodingStats() != null
    && columnMetaData.getEncodingStats().hasDictionaryPages()) {
  metaData.setDictionary_page_offset(columnMetaData.getDictionaryPageOffset());
}

However, it should also call setDictionary_page_offset when getEncodingStats() returns null but the encodings show that a dictionary is used.

It should use the existing implementation in ColumnChunkMetaData below:

public boolean hasDictionaryPage() {
  EncodingStats stats = getEncodingStats();
  if (stats != null) {
    return stats.hasDictionaryPages() && stats.hasDictionaryEncodedPages();
  }

  Set<Encoding> encodings = getEncodings();
  return (encodings.contains(PLAIN_DICTIONARY) || encodings.contains(RLE_DICTIONARY));
}

So the new change in ParquetMetadataConverter should look like:

if (columnMetaData.hasDictionaryPage()) {
  metaData.setDictionary_page_offset(columnMetaData.getDictionaryPageOffset());
}
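
The difference between the current check and the proposed one can be sketched in isolation. The types below (DictionaryOffsetSketch, ColumnMeta, and the simplified EncodingStats and Encoding) are hypothetical stand-ins written for this example, not the real parquet-mr classes. A null return value models an offset bit that was never set.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical stand-ins for the types named in this issue; not the real parquet-mr classes.
class DictionaryOffsetSketch {
    enum Encoding { PLAIN, RLE, PLAIN_DICTIONARY, RLE_DICTIONARY }

    // Simplified stand-in: only the two flags the checks read.
    static final class EncodingStats {
        final boolean dictionaryPages;
        final boolean dictionaryEncodedPages;

        EncodingStats(boolean dictionaryPages, boolean dictionaryEncodedPages) {
            this.dictionaryPages = dictionaryPages;
            this.dictionaryEncodedPages = dictionaryEncodedPages;
        }
    }

    static final class ColumnMeta {
        final EncodingStats stats; // may be null, e.g. when no stats were recorded
        final Set<Encoding> encodings;
        final long dictionaryPageOffset;

        ColumnMeta(EncodingStats stats, Set<Encoding> encodings, long dictionaryPageOffset) {
            this.stats = stats;
            this.encodings = encodings;
            this.dictionaryPageOffset = dictionaryPageOffset;
        }

        // Mirrors the hasDictionaryPage() logic quoted in this issue.
        boolean hasDictionaryPage() {
            if (stats != null) {
                return stats.dictionaryPages && stats.dictionaryEncodedPages;
            }
            return encodings.contains(Encoding.PLAIN_DICTIONARY)
                || encodings.contains(Encoding.RLE_DICTIONARY);
        }
    }

    // Current behaviour: the offset is set only when encoding stats are present.
    static Long offsetWithCurrentCheck(ColumnMeta column) {
        if (column.stats != null && column.stats.dictionaryPages) {
            return column.dictionaryPageOffset;
        }
        return null; // bit never set
    }

    // Proposed behaviour: fall back to the encodings when stats are absent.
    static Long offsetWithProposedCheck(ColumnMeta column) {
        if (column.hasDictionaryPage()) {
            return column.dictionaryPageOffset;
        }
        return null; // bit never set
    }

    public static void main(String[] args) {
        // A dictionary-encoded column chunk with no encoding stats.
        ColumnMeta noStats =
            new ColumnMeta(null, EnumSet.of(Encoding.PLAIN_DICTIONARY, Encoding.RLE), 4L);
        System.out.println("current:  " + offsetWithCurrentCheck(noStats));  // prints "current:  null"
        System.out.println("proposed: " + offsetWithProposedCheck(noStats)); // prints "proposed: 4"
    }
}
```

Note that when stats are present, the proposed check is slightly stricter than the current one in this sketch, because hasDictionaryPage() also requires dictionary-encoded data pages.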

Reporter: Abhishek Dixit

PRs and other links:

Note: This issue was originally created as PARQUET-2464. Please see the migration documentation for further details.
