Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-2464: Fix dictionaryPageOffset flag setting in toParquetMetadata method #1340

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

abhishekd0907
Copy link

Issue

toParquetMetadata method converts org.apache.parquet.hadoop.metadata.ParquetMetadata to org.apache.parquet.format.FileMetaData but this does not set the dictionary page offset bit in FileMetaData.

When a FileMetaData object is serialized while writing to the footer and then deserialized, the dictionary offset is lost as the dictionary page offset bit was never set.

PARQUET-1850  tried to fix this but it did only a partial fix.

It sets setDictionary_page_offset only if getEncodingStats are present

if (columnMetaData.getEncodingStats() != null
&& columnMetaData.getEncodingStats().hasDictionaryPages())
{ metaData.setDictionary_page_offset(columnMetaData.getDictionaryPageOffset()); } 

Fix

However, it should setDictionary_page_offset even when getEncodingStats are not present but encodings are present.

It should use the implementation in ColumnChunkMetatdata below:

public boolean hasDictionaryPage() {
EncodingStats stats = getEncodingStats();
if (stats != null) { 
return stats.hasDictionaryPages() && stats.hasDictionaryEncodedPages(); 
}

Set<Encoding> encodings = getEncodings();
return (encodings.contains(PLAIN_DICTIONARY) || encodings.contains(RLE_DICTIONARY));
} 

So new change in ParquetMetadataCOnvertor should be like:

if (columnMetaData.hasDictionaryPage()) { metaData.setDictionary_page_offset(columnMetaData.getDictionaryPageOffset()); }

Test

Added a new UT for this scenario. All existing UTs should also pass.

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Thanks for the fix and adding tests!

@wgtmac
Copy link
Member

wgtmac commented May 6, 2024

Error:  Failed to execute goal com.diffplug.spotless:spotless-maven-plugin:2.30.0:check (default) on project parquet-hadoop: The following files had format violations:
Error:      src/test/java/org/apache/parquet/format/converter/TestParquetMetadataConverter.java
Error:          @@ -1277,9 +1277,8 @@
Error:           ····return·createParquetMetaData(dicEncoding,·dataEncoding,·true);
Error:           ··}
Error:           
Error:          -
Error:          -··private·static·ParquetMetadata·createParquetMetaData(Encoding·dicEncoding,·Encoding·dataEncoding,
Error:          -·······················································boolean·includeDicStats)·{
Error:          +··private·static·ParquetMetadata·createParquetMetaData(
Error:          +······Encoding·dicEncoding,·Encoding·dataEncoding,·boolean·includeDicStats)·{
Error:           ····MessageType·schema·=·parseMessageType("message·schema·{·optional·int32·col·(INT_32);·}");
Error:           ····org.apache.parquet.hadoop.metadata.FileMetaData·fileMetaData·=
Error:           ········new·org.apache.parquet.hadoop.metadata.FileMetaData(schema,·new·HashMap<String,·String>(),·null);
Error:  Run 'mvn spotless:apply' to fix these violations.
Error:  -> [Help 1]

Please make the CI happy.

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems there are some test failures. Please fix them accordingly.

@abhishekd0907 abhishekd0907 marked this pull request as draft May 7, 2024 08:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants