[Format] Encoding spec incorrect for dictionary fallback #404

asfimport · 2023-01-03T14:23:16Z

The spec for DICTIONARY_ENCODING states that:

If the dictionary grows too big, whether in size or number of distinct values, the encoding will fall back to the plain encoding.

https://github.com/apache/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8

However, the parquet-mr implementation was deliberately changed to a different fallback mechanism in https://issues.apache.org/jira/browse/PARQUET-52

I'm assuming the parquet-mr implementation is authoritative here. But then the spec is incorrect and should be fixed to reflect expected behavior.

Reporter: Antoine Pitrou / @pitrou

Note: This issue was originally created as PARQUET-2221. Please see the migration documentation for further details.

asfimport · 2023-01-03T14:27:06Z

Antoine Pitrou / @pitrou:
cc @julienledem @piyushnarang @rdblue @isnotinvain

asfimport · 2023-01-04T02:46:58Z

Gang Wu / @wgtmac:
IMHO, the spec is authoritative for reader implementations, so that they can correctly read Parquet files created by different writers. But it is the writer implementer's choice to fall back to any standard encoding. This is what video coding standards do (e.g. H.264/AVC and H.265/HEVC).

What's more, if fallback happens, the writer implementation can even rewrite the dictionary page and the dictionary-encoded data pages in the fallback encoding and discard the dictionary-encoded pages, just like Apache ORC does. Mixing dictionary encoding and non-dictionary encoding in the same column chunk makes features like dictionary reading and predicate pushdown much more complicated to implement.

cc [~[email protected]]
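
The two writer strategies discussed above can be illustrated with a toy sketch. This is not parquet-mr code or any real Parquet API: `Page`, `ColumnChunk`, `write_chunk`, `max_dict_entries` and `rewrite_on_fallback` are hypothetical names, the encoding names are used only as labels, and no real byte-level encoding is performed. It assumes a writer that dictionary-encodes pages until the dictionary would exceed a size limit, and then either leaves the earlier pages as they are (mixed chunk) or rewrites them in the fallback encoding (uniform chunk).

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    encoding: str  # label only: "RLE_DICTIONARY" or "PLAIN"
    values: list   # dictionary indices, or raw values for PLAIN


@dataclass
class ColumnChunk:
    dictionary: list = field(default_factory=list)
    pages: list = field(default_factory=list)


def write_chunk(values, page_size=4, max_dict_entries=3, rewrite_on_fallback=False):
    """Toy column-chunk writer (hypothetical, for illustration only)."""
    chunk = ColumnChunk()
    index_of = {}      # value -> dictionary index
    fell_back = False

    for start in range(0, len(values), page_size):
        batch = values[start:start + page_size]
        if not fell_back:
            # Collect values in this page that the dictionary does not hold yet.
            new_values = []
            for v in batch:
                if v not in index_of and v not in new_values:
                    new_values.append(v)
            if len(index_of) + len(new_values) > max_dict_entries:
                # Dictionary would grow too big: stop dictionary-encoding.
                fell_back = True
            else:
                for v in new_values:
                    index_of[v] = len(chunk.dictionary)
                    chunk.dictionary.append(v)
                indices = [index_of[v] for v in batch]
                chunk.pages.append(Page("RLE_DICTIONARY", indices))
                continue
        # Fallback path: store the raw values.
        chunk.pages.append(Page("PLAIN", list(batch)))

    if fell_back and rewrite_on_fallback:
        # Decode earlier dictionary pages and drop the dictionary,
        # so that the whole chunk uses a single encoding.
        for page in chunk.pages:
            if page.encoding == "RLE_DICTIONARY":
                page.values = [chunk.dictionary[i] for i in page.values]
                page.encoding = "PLAIN"
        chunk.dictionary = []

    return chunk
```

With `rewrite_on_fallback=False` the chunk ends up with dictionary-encoded pages followed by plain pages, which a reader must be prepared to handle; with `rewrite_on_fallback=True` every page uses the same encoding, at the cost of re-encoding the pages already buffered.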

asfimport · 2023-11-21T19:24:15Z

Micah Kornfield / @emkornfield:
I agree with @wgtmac here. I think we should probably use language like we have in previous cases, such as "for maximum compatibility", but then say that any mix of page encodings is valid as long as the ordering is valid.

In terms of mixing dictionary encodings with others, it does make things a little bit harder, but I don't think we should disallow it (though we could point out the potential benefits of unified encoding).

wgtmac removed Component: Format labels Jul 4, 2024
