PARQUET-2464: Fix dictionaryPageOffset flag setting in toParquetMetadata method #1340

abhishekd0907 · 2024-05-02T06:54:52Z

Issue

The toParquetMetadata method converts org.apache.parquet.hadoop.metadata.ParquetMetadata to org.apache.parquet.format.FileMetaData, but in some cases it does not set the dictionary page offset bit in FileMetaData.

When a FileMetaData object is serialized into the footer and later deserialized, the dictionary page offset is lost because its bit was never set.
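To illustrate why an unset bit loses the value on a round trip, here is a minimal sketch. It does not use the real Thrift-generated classes: `Meta`, `write`, and `read` are hypothetical stand-ins, and the map stands in for the serialized footer. It assumes only that an optional field is written when its "is set" flag is true, which is how Thrift-generated structs treat optional fields.

```java
import java.util.HashMap;
import java.util.Map;

class OptionalFieldRoundTripSketch {
  // Hypothetical stand-in for a generated struct with one optional long field.
  static final class Meta {
    private long dictionaryPageOffset;
    private boolean dictionaryPageOffsetSet; // plays the role of the "is set" bit

    void setDictionaryPageOffset(long value) {
      dictionaryPageOffset = value;
      dictionaryPageOffsetSet = true;
    }

    boolean isSetDictionaryPageOffset() {
      return dictionaryPageOffsetSet;
    }

    long getDictionaryPageOffset() {
      return dictionaryPageOffset;
    }
  }

  // "Serialize": the optional field is emitted only when its bit is set.
  static Map<String, Long> write(Meta meta) {
    Map<String, Long> out = new HashMap<>();
    if (meta.isSetDictionaryPageOffset()) {
      out.put("dictionary_page_offset", meta.getDictionaryPageOffset());
    }
    return out;
  }

  // "Deserialize": the bit is set again only if the field was present.
  static Meta read(Map<String, Long> in) {
    Meta meta = new Meta();
    Long value = in.get("dictionary_page_offset");
    if (value != null) {
      meta.setDictionaryPageOffset(value);
    }
    return meta;
  }

  public static void main(String[] args) {
    Meta neverSet = new Meta(); // the converter skipped the setter
    System.out.println(read(write(neverSet)).isSetDictionaryPageOffset()); // false

    Meta set = new Meta();
    set.setDictionaryPageOffset(4L);
    System.out.println(read(write(set)).getDictionaryPageOffset()); // 4
  }
}
```

In this sketch, a reader of the deserialized object cannot distinguish "no dictionary page" from "the writer did not call the setter", which is the failure mode described above.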

PARQUET-1850 tried to fix this, but the fix was only partial.

It calls setDictionary_page_offset only if encoding stats are present:

    if (columnMetaData.getEncodingStats() != null
        && columnMetaData.getEncodingStats().hasDictionaryPages()) {
      metaData.setDictionary_page_offset(columnMetaData.getDictionaryPageOffset());
    }

Fix

However, it should call setDictionary_page_offset even when encoding stats are absent, as long as the encodings are present.

It should use the existing implementation in ColumnChunkMetaData, shown below:

    public boolean hasDictionaryPage() {
      EncodingStats stats = getEncodingStats();
      if (stats != null) {
        return stats.hasDictionaryPages() && stats.hasDictionaryEncodedPages();
      }

      Set<Encoding> encodings = getEncodings();
      return (encodings.contains(PLAIN_DICTIONARY) || encodings.contains(RLE_DICTIONARY));
    }

So the new check in ParquetMetadataConverter should look like:

    if (columnMetaData.hasDictionaryPage()) {
      metaData.setDictionary_page_offset(columnMetaData.getDictionaryPageOffset());
    }
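The difference between the two conditions can be seen in a small self-contained sketch. It does not depend on parquet-java: `Encoding`, `Stats`, and `Column` are hypothetical stand-ins for the real `Encoding`, `EncodingStats`, and `ColumnChunkMetaData` types, and only the decision logic quoted above is reproduced.

```java
import java.util.EnumSet;
import java.util.Set;

class DictionaryOffsetSketch {
  // Hypothetical stand-in for org.apache.parquet.column.Encoding.
  enum Encoding { PLAIN, RLE, PLAIN_DICTIONARY, RLE_DICTIONARY }

  // Hypothetical stand-in for EncodingStats, reduced to the two flags used here.
  static final class Stats {
    final boolean hasDictionaryPages;
    final boolean hasDictionaryEncodedPages;

    Stats(boolean hasDictionaryPages, boolean hasDictionaryEncodedPages) {
      this.hasDictionaryPages = hasDictionaryPages;
      this.hasDictionaryEncodedPages = hasDictionaryEncodedPages;
    }
  }

  // Hypothetical stand-in for ColumnChunkMetaData.
  static final class Column {
    final Stats stats; // null when the writer recorded no encoding stats
    final Set<Encoding> encodings;

    Column(Stats stats, Set<Encoding> encodings) {
      this.stats = stats;
      this.encodings = encodings;
    }

    // Mirrors the hasDictionaryPage() logic quoted in the description.
    boolean hasDictionaryPage() {
      if (stats != null) {
        return stats.hasDictionaryPages && stats.hasDictionaryEncodedPages;
      }
      return encodings.contains(Encoding.PLAIN_DICTIONARY)
          || encodings.contains(Encoding.RLE_DICTIONARY);
    }
  }

  // Condition from the partial fix: requires encoding stats to be present.
  static boolean oldCheck(Column column) {
    return column.stats != null && column.stats.hasDictionaryPages;
  }

  // Proposed condition: falls back to the encodings when stats are missing.
  static boolean newCheck(Column column) {
    return column.hasDictionaryPage();
  }

  public static void main(String[] args) {
    Column noStats = new Column(null, EnumSet.of(Encoding.PLAIN_DICTIONARY, Encoding.RLE));
    System.out.println(oldCheck(noStats)); // false: the offset would not be set
    System.out.println(newCheck(noStats)); // true: the offset would be set
  }
}
```

Note that in this sketch the two conditions also differ when stats are present: the proposed check additionally requires dictionary-encoded pages, so a column with `new Stats(true, false)` passes `oldCheck` but not `newCheck`.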

Test

Added a new UT for this scenario. All existing UTs should also pass.

wgtmac

+1

Thanks for the fix and adding tests!

wgtmac · 2024-05-06T02:05:36Z

Error:  Failed to execute goal com.diffplug.spotless:spotless-maven-plugin:2.30.0:check (default) on project parquet-hadoop: The following files had format violations:
Error:      src/test/java/org/apache/parquet/format/converter/TestParquetMetadataConverter.java
Error:          @@ -1277,9 +1277,8 @@
Error:           ····return·createParquetMetaData(dicEncoding,·dataEncoding,·true);
Error:           ··}
Error:           
Error:          -
Error:          -··private·static·ParquetMetadata·createParquetMetaData(Encoding·dicEncoding,·Encoding·dataEncoding,
Error:          -·······················································boolean·includeDicStats)·{
Error:          +··private·static·ParquetMetadata·createParquetMetaData(
Error:          +······Encoding·dicEncoding,·Encoding·dataEncoding,·boolean·includeDicStats)·{
Error:           ····MessageType·schema·=·parseMessageType("message·schema·{·optional·int32·col·(INT_32);·}");
Error:           ····org.apache.parquet.hadoop.metadata.FileMetaData·fileMetaData·=
Error:           ········new·org.apache.parquet.hadoop.metadata.FileMetaData(schema,·new·HashMap<String,·String>(),·null);
Error:  Run 'mvn spotless:apply' to fix these violations.
Error:  -> [Help 1]

Please make the CI happy.

wgtmac

It seems there are some test failures. Please fix them accordingly.

update test and fix

061ba3d

wgtmac approved these changes May 6, 2024

fix styling error

f71202f

wgtmac requested changes May 7, 2024

abhishekd0907 marked this pull request as draft May 7, 2024 08:17
