PARQUET-2249: Introduce IEEE 754 total order #221

Open

wants to merge 1 commit into master
Conversation

@JFinis (Contributor) commented Nov 22, 2023

This commit adds a new column order IEEE754TotalOrder, which can be used for floating point types (FLOAT, DOUBLE, FLOAT16).

The advantage of the new order is a well-defined ordering between -0, +0, and the various possible bit patterns of NaNs. Thus, every possible bit pattern of a floating-point value now has a well-defined order, so there is no possibility of two implementations applying different orders when the new column order is used.
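For illustration, the IEEE 754 totalOrder predicate can be implemented by mapping the raw sign-magnitude bit pattern of a double to an integer whose signed comparison yields the total order. The following is a minimal sketch in Java, not part of this PR; the helper name is invented:

```java
// Sketch of an IEEE 754 total-order comparator for doubles.
// Resulting order: -NaN < -Inf < ... < -0 < +0 < ... < +Inf < +NaN,
// with NaNs further ordered by their payload bits.
static int totalOrderCompare(double a, double b) {
    long x = Double.doubleToRawLongBits(a);
    long y = Double.doubleToRawLongBits(b);
    // The raw bits are in sign-magnitude form. For negative values
    // (sign bit set), flip the magnitude bits so that bit patterns
    // increase monotonically with the total order; the results can
    // then be compared as signed two's-complement longs.
    x ^= (x >> 63) & 0x7fffffffffffffffL;
    y ^= (y >> 63) & 0x7fffffffffffffffL;
    return Long.compare(x, y);
}
```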

With the default column order, there were many problems w.r.t. NaN values which led to reading engines not being able to use statistics of floating point columns for scan pruning, even in cases where no NaNs were in the data set. The problems are discussed in detail in the next section.

This solution to the problem is the result of the extended discussion in #196, which ended with the consensus that IEEE 754 total ordering is the best approach to solve the problem in a simple manner without introducing special fields for floating point columns (such as nan_counts, which was proposed in that PR). Please refer to the discussion in that PR for all the details on why this solution was chosen over various design alternatives.

Note that this solution is fully backward compatible and should break neither old readers nor old writers, as a new column order is added. Legacy writers can simply continue writing the default type-defined order instead of the new one. Legacy readers should already ignore statistics on columns whose column order they do not understand, and will therefore simply not use the statistics of columns ordered with the new order.

The remainder of this message explains in detail what the problems are and how the proposed solution fixes them.

Problem Description

Currently, the way NaN values are to be handled in statistics inhibits most scan pruning once NaN values are present in DOUBLE or FLOAT columns. Concretely, the following problems exist:

Statistics don't tell whether NaNs are present

As NaN values are not to be incorporated in min/max bounds, a reader cannot know whether NaN values are present. This might not seem too problematic, as most queries will not filter for NaNs. However, NaN is ordered in most database systems. For example, Postgres, DB2, and Oracle treat NaN as greater than any other value, while MSSQL and MySQL treat it as less than any other value. An overview of what different systems do can be found here. The gist is that systems differ in their NaN semantics, and most of them do order NaNs, either less than or greater than all other values.

For example, if the semantics of the reading query engine mandate that NaN is to be treated as greater than all other values, the predicate x > 1.0 should include NaN values. If a page has max = 0.0, the engine cannot skip the page, as it might contain NaNs, which would need to be included in the query result.

Likewise, the predicate x < 1.0 should include NaN if the reading engine treats NaN as less than all other values. Again, a page with min = 2.0 could not be skipped by the reader in this case.
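To make this concrete, here is a hypothetical sketch (names invented for illustration) of what a reader with NaN-greater-than-all semantics is forced to do today under the default ordering:

```java
// Hypothetical pruning check for the predicate "x > c" in an engine that
// orders NaN above all other values, reading statistics written under the
// default (type-defined) order, where NaNs are excluded from min/max.
static boolean canSkipPage(double statMax, double c) {
    // Even if statMax <= c, the page may still contain NaNs, which this
    // engine's semantics require "x > c" to retain. The statistics carry
    // no NaN information, so skipping is never provably safe.
    return false;
}
```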

Thus, even if a user doesn't query for NaN explicitly, they might use other predicates that need to filter or retain NaNs under the semantics of the reading engine, so the fact that we currently can't know whether a page or row group contains NaN is a bigger problem than it might seem at first sight.

Currently, any predicate that needs to retain NaNs cannot use min and max bounds in Parquet and therefore cannot be used for scan pruning at all. As stated, that covers many seemingly innocuous greater-than or less-than predicates in most database systems. It would be nice if Parquet enabled scan pruning in these cases, regardless of whether the reader and writer agree on whether NaN is smaller than, greater than, or incomparable to all other values.

Note that the problem exists especially when the Parquet file doesn't include any NaNs: this is not only a problem in the edge case where NaNs are present; it is a problem in the far more common case where they are not.

Handling NaNs in a ColumnIndex

There is currently no well-defined way to write a spec-conforming ColumnIndex once a page contains only NaN (and possibly null) values. NaN values should not be included in min/max bounds, but if a page contains only NaN values, there is no other value to put into the min/max bounds. However, bounds in a ColumnIndex are non-optional, so something has to be put there. The spec does not describe what engines should do in this case. Parquet-mr takes the safe route and does not write a column index once NaNs are present. But this is a huge pessimization: a single page containing NaNs prevents writing a column index for the column chunk containing that page, so even pages in that chunk that don't contain NaNs will not be indexed.

It would be nice if there were a defined way of writing the ColumnIndex when NaNs (and especially only-NaN pages) are present.

Handling only-NaN pages & column chunks

Note: Hereinafter, the term only-NaN refers to a page or column chunk whose only non-null values are NaNs. E.g., an only-NaN page may contain a mixture of null values and NaNs, or only NaNs, but no non-null non-NaN values.

The Statistics objects stored in page headers and in the file footer have a similar, albeit smaller, problem: min_value and max_value are optional there, so it is easier to not include NaNs in the min/max in the case of an only-NaN page or column chunk: simply omit these optional fields. However, this brings a semantic ambiguity with it, as it is now unclear whether the min/max values weren't written because there were only NaNs, or simply because the writing engine decided to omit them for some other reason, which the spec allows as the fields are optional.

Consequently, a reader cannot know whether missing min_value and max_value means "only NaNs, you can skip this page if you are looking for only non-NaN values" or "no stats written, you have to read this page as it is undefined what values it contains".

It would be nice if we could handle NaNs in a way that would allow scan pruning for these only-NaN pages.

Solution

IEEE 754 total order solves all the mentioned problems. As NaNs now have a defined place in the ordering, they can be incorporated into min and max bounds. In fact, in contrast to the default ordering, they no longer need any special casing, so all the remarks on how readers and writers should special-case NaNs and -0/+0 no longer apply to the new ordering.

As NaNs are incorporated into min and max, a reader can now tell from the statistics whether NaNs are present. A reading engine just has to map its own NaN semantics to the NaN semantics of the total ordering. For example, if the reading engine treats all NaNs (including -NaNs) as greater than all other values, then for a predicate x > 5.0 (which should include NaNs) it must not prune any pages or row groups whose min or max is (+/-)NaN.
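As a hypothetical sketch of the resulting reader logic (names invented for illustration): since -NaN is the total-order minimum and +NaN the maximum, any NaN in a page necessarily surfaces as a NaN min or max, and a NaN-free page can be pruned safely:

```java
// Hypothetical reader-side pruning for "x > c" in an engine that orders
// NaN above all other values, with statistics written under the new
// IEEE754TotalOrder column order.
static boolean canSkipPage(double statMin, double statMax, double c) {
    // Any NaN in the page shows up as a NaN min or a NaN max.
    if (Double.isNaN(statMin) || Double.isNaN(statMax)) {
        return false; // page may contain NaNs, which "x > c" retains
    }
    return statMax <= c; // NaN-free page: no value can satisfy x > c
}
```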

Only-NaN pages can now also be included in the column index, as they are no longer a special case.

In conclusion, all mentioned problems are solved by using IEEE 754 total ordering.

Make sure you have checked all steps below.

Jira

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explains what they do

@etseidl (Contributor) left a comment:
Thanks @JFinis! First read looks pretty good to me. I just have a few small suggestions.

@@ -288,7 +288,7 @@ struct MapType {} // see LogicalTypes.md
 struct ListType {} // see LogicalTypes.md
 struct EnumType {} // allowed for BINARY, must be encoded with UTF-8
 struct DateType {} // allowed for INT32
-struct Float16Type {} // allowed for FIXED[2], must encoded raw FLOAT16 bytes
+struct Float16Type {} // allowed for FIXED[2], must encoded raw FLOAT16 bytes (see LogicalTypes.md)
Contributor:
'must encode' or 'must be encoded as'?

* it is recommended that writers use IEEE_754_TOTAL_ORDER
* for these types.
*
* If TYPE_ORDER is used for floating point types, then the following
Contributor:
This line threw me (at least while using my phone 😉...on my computer I can see TYPE_ORDER below). Maybe this could instead say "If this ordering is used for floating..." or "If this type-defined ordering..."

@tustvold (Contributor) left a comment:
Looks great to me, thank you for writing this up

* BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY - Lexicographic unsigned byte-wise
comparison.

Column Index, and Data Page). These statistics are according to a sort order,
Member:
Sorry for the delay. What about splitting this PR into two separate ones? One is for the styling change (including the relocation here) and the other is purely for the new order.
