### Delta-length byte array: (DELTA_LENGTH_BYTE_ARRAY = 6)
Supported Types: BYTE_ARRAY
This encoding is always preferred over PLAIN for byte array columns.
### Describe the enhancement requested
The current default for V1 pages is PLAIN encoding, which interleaves each string's length with its data. This makes skipping N values inefficient, because the encoding does not allow random access. It is also slow to decode: interleaving lengths with data prevents efficient batched implementations and forces most implementations to copy the data into the usual string representation of separate offsets and data.
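To make the skipping cost concrete, here is a minimal sketch of the PLAIN layout for BYTE_ARRAY values (a 4-byte little-endian length followed by the bytes). The class and method names (`PlainByteArraySketch`, `encode`, `skip`, `readAt`) are hypothetical and are not taken from any Parquet implementation.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not parquet-java code.
class PlainByteArraySketch {
    // Encode values as PLAIN BYTE_ARRAY: 4-byte little-endian length, then the bytes.
    static ByteBuffer encode(String... values) {
        int size = 0;
        for (String v : values) {
            size += 4 + v.getBytes(StandardCharsets.UTF_8).length;
        }
        ByteBuffer buf = ByteBuffer.allocate(size).order(ByteOrder.LITTLE_ENDIAN);
        for (String v : values) {
            byte[] bytes = v.getBytes(StandardCharsets.UTF_8);
            buf.putInt(bytes.length).put(bytes);
        }
        buf.flip();
        return buf;
    }

    // Skipping n values requires reading n length prefixes one by one,
    // because the position of a value depends on the lengths of all earlier values.
    static int skip(ByteBuffer page, int n) {
        int pos = page.position();
        for (int i = 0; i < n; i++) {
            int len = page.getInt(pos); // absolute read of the length prefix
            pos += 4 + len;
        }
        return pos;
    }

    // Reading a value copies its bytes out of the interleaved page.
    static String readAt(ByteBuffer page, int pos) {
        int len = page.getInt(pos);
        byte[] out = new byte[len];
        for (int i = 0; i < len; i++) {
            out[i] = page.get(pos + 4 + i);
        }
        return new String(out, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ByteBuffer page = encode("ab", "cde", "f");
        int pos = skip(page, 2); // walks two length prefixes
        System.out.println(readAt(page, pos));
    }
}
```

The loop in `skip` is the point: there is no way to jump to value N without visiting the N length prefixes before it.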
DELTA_LENGTH_BYTE_ARRAY has none of these problems, because it keeps the lengths (from which offsets follow directly) separate from the data. The parquet-format spec also appears to recommend it:
https://github.com/apache/parquet-format/blob/c70281359087dfaee8bd43bed9748675f4aabe11/Encodings.md?plain=1#L299
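As a rough sketch of why separating lengths from data helps, the example below assumes the lengths have already been decoded into an `int[]` (in the real encoding they are stored with DELTA_BINARY_PACKED ahead of the concatenated bytes, a step omitted here). The names `LengthSeparatedSketch`, `toOffsets`, and `valueAt` are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not parquet-java code.
class LengthSeparatedSketch {
    // One prefix-sum pass turns the decoded lengths into offsets.
    static int[] toOffsets(int[] lengths) {
        int[] offsets = new int[lengths.length + 1];
        for (int i = 0; i < lengths.length; i++) {
            offsets[i + 1] = offsets[i] + lengths[i];
        }
        return offsets;
    }

    // Any value can be sliced directly from the concatenated data;
    // skipping n values is just starting at offsets[n].
    static String valueAt(byte[] data, int[] offsets, int i) {
        return new String(data, offsets[i], offsets[i + 1] - offsets[i], StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        int[] lengths = {2, 3, 1};                               // "ab", "cde", "f"
        byte[] data = "abcdef".getBytes(StandardCharsets.UTF_8); // concatenated values
        int[] offsets = toOffsets(lengths);
        System.out.println(valueAt(data, offsets, 2));
    }
}
```

The offsets array and the untouched data buffer already match the separate offsets-and-data representation that most readers use for strings, so no per-value copy is needed to build it.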
V2 pages use DELTA_BYTE_ARRAY as the default encoding. This is an improvement over PLAIN, but its added complexity makes it slower to decode than DELTA_LENGTH_BYTE_ARRAY, in exchange for potentially lower storage requirements.
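The extra decoding work comes from the prefix sharing that DELTA_BYTE_ARRAY relies on: each value is stored as a prefix length into the previous value plus a suffix. The sketch below is a simplified illustration that assumes the prefix lengths and suffixes are already decoded and works on ASCII `String`s rather than raw bytes; `PrefixSuffixSketch` and `decode` are hypothetical names.

```java
import java.util.Arrays;

// Hypothetical illustration only; not parquet-java code.
class PrefixSuffixSketch {
    // Rebuild each value from the first prefixLengths[i] characters of the
    // previous value plus its own suffix. Value i cannot be produced
    // without first producing value i - 1.
    static String[] decode(int[] prefixLengths, String[] suffixes) {
        String[] out = new String[suffixes.length];
        String previous = "";
        for (int i = 0; i < suffixes.length; i++) {
            out[i] = previous.substring(0, prefixLengths[i]) + suffixes[i];
            previous = out[i];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] prefixLengths = {0, 2, 0, 3};
        String[] suffixes = {"axis", "le", "babble", "yhood"};
        System.out.println(Arrays.toString(decode(prefixLengths, suffixes)));
    }
}
```

Shared prefixes are stored only once, which is where the potential storage saving comes from, while the dependency on the previous value is what makes decoding inherently sequential.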
JMH benchmarks in Trino's parquet reader at `io.trino.parquet.reader.BenchmarkBinaryColumnReader` showed that DELTA_LENGTH_BYTE_ARRAY can be decoded at over 5X, and DELTA_BYTE_ARRAY at over 2X, the speed of decoding PLAIN encoding.
Given the spec's recommendation above and the significant performance difference, the reference implementation here should be updated to use DELTA_LENGTH_BYTE_ARRAY by default.
### Component(s)
Core