Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs #406

asfimport · 2023-02-19T12:09:23Z

Currently, the specification of ColumnIndex in parquet.thrift is inconsistent, leading to cases where it is impossible to create a parquet file that is conforming to the spec.

The problem is with double/float columns if a page contains only NaN values. The spec mentions that NaN values should not be included in min/max bounds, so a page consisting of only NaN values has no defined min/max bound. To quote the spec:


   *     When writing statistics the following rules should be followed:
   *     - NaNs should not be written to min or max statistics fields.

However, the comments in the ColumnIndex on the null_pages member states the following:


struct ColumnIndex {
  /**
   * A list of Boolean values to determine the validity of the corresponding
   * min and max values. If true, a page contains only null values, and writers
   * have to set the corresponding entries in min_values and max_values to
   * byte[0], so that all lists have the same length. If false, the
   * corresponding entries in min_values and max_values must be valid.
   */
  1: required list<bool> null_pages

For a page with only NaNs, we now have a problem. The page definitly does not only contain null values, so null_pages should be false for this page. However, in this case the spec requires valid min/max values in min_values and max_values for this page. As the only value in the page is NaN, the only valid min/max value we could enter here is NaN, but as mentioned before, NaNs should never be written to min/max values.

Thus, no writer can currently create a parquet file that conforms to this specification as soon as there is a only-NaN column and column indexes are to be written.

I see three possible solutions:

A page consisting only of NaNs (or a mixture of NaNs and nulls) has it's null_pages entry set to {}true{}.
A page consisting of only NaNs (or a mixture of NaNs and nulls) has byte[0] as min/max, even though the null_pages entry is set to {}false{}.
A page consisting of only NaNs (or a mixture of NaNs and nulls) does have NaN as min & max in the column index.

None of the solutions is perfect. But I guess solution 3. is the best of them. It gives us valid min/max bounds, makes null_pages compatible with this, and gives us a way to determine only-Nan pages (min=max=NaN).

As a general note: I would say that it is a shortcoming that Parquet doesn't track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't have this inconsistency. In a future version, NaN counts could be introduced, but that doesn't help for backward compatibility, so we do need a solution for now.

Any of the solutions is better than the current situation where engines writing such a page cannot write a conforming parquet file and will randomly pick any of the solutions.

Thus, my suggestion would be to update parquet.thrift to use solution 3. I.e., rewrite the comments saying that NaNs shouldn't be included in min/max bounds by adding a clause stating that "if a page contains only NaNs or a mixture of NaNs and NULLs, then NaN should be written as min & max".

Reporter: Jan Finis / @JFinis

PRs and other links:

GitHub Pull Request #221

_{Note: This issue was originally created as PARQUET-2249. Please see the migration documentation for further details.}

The text was updated successfully, but these errors were encountered:

asfimport · 2023-02-20T05:22:00Z

Gang Wu / @wgtmac:
When a page only contains NaN, the page statistics do not set min_value and max_value. However, it is not a null page. IMHO, the ColumnIndex of this column chunk should be dropped because at least one page does not provide valid min/max values.

asfimport · 2023-02-20T09:48:39Z

Jan Finis / @JFinis:
@wgtmac True, not writing a column index in this case is also a solution. Note though that this is a pessimization for pages not containing NaN in the same column chunk. It would be a shame if a single NaN makes a whole column chunk non-indexable. It might be a good interim solution, but it's not too satisfying.

The whole topic of NaN handling in Parquet currently seems to be lacking and somewhat inconsistent, making columns with NaNs mostly unusable for scan pruning. Maybe there should be a redefinition of the semantics in a new version, so that columns with NaNs can be used for indexing as other columns. As mentioned, Iceberg has solved this problem by providing NaN counts.

asfimport · 2023-02-20T13:29:50Z

Xuwei Fu / @mapleFU:
The problem seem to be that, float point is so widely used, but they are "partial order". Seems that iceberg provides NaN counts. And min-max is un-related to NaN.

In Sorting, iceberg forces that:

Sorting floating-point numbers should produce the following behavior: -NaN < -Infinity < -value < -0 < 0 < value < Infinity < {}NaN{}. This aligns with the implementation of Java floating-point types comparisons.

I think (1) is bad, because NaN is never equal to NULL.

IEEE754 (https://ieeexplore.ieee.org/document/8766229) and C++ standard support some "totalOrder", but I think regard it as totalOrder is strange, so min-max as byte[0] is wierd for me. I think iceberg style looks great

If I'm wrong, please correct me

asfimport · 2023-02-20T13:46:52Z

Jan Finis / @JFinis:
I would be willing to propose a fixing commit for this, but I'm not part of ASF and the whole process, yet, so I don't know exactly how to get that going. I could start a PR on the parquet-format github repo. Is that the right point to suggest changes to the spec/parquet.thrift?

Sidepoint: Note that NaN being larger than all other values is also:

The semantic that SQL has for NaN
What parquet-mr seems to be doing right now. At least, I have found parquet files that have NaN written as max_value in row group statistics.

However, treating NaN as something extra by maintaining NaN counts will allow incorporating any NaN semantics the query engine wishes to use.

asfimport · 2023-02-20T14:14:46Z

Gang Wu / @wgtmac:
As of today, there are many different parquet implementations (java, cpp, rust, etc). Directly making a PR to parquet-format repo is a good start to reach the maintainers of different implementations. However, the tricky thing is that we have to deal with legacy readers and files.

asfimport · 2023-02-20T17:27:59Z

Xuwei Fu / @mapleFU:
I guess NaN is not always larger than all values.

Postgres, DB2 and Oracle put NULL higher than any other values
MSSQL and MySQL going the other way.

SQL may have null_first for this case. Still I think your idea is ok.

asfimport · 2023-02-20T17:32:23Z

Xuwei Fu / @mapleFU:
I guess maybe we can take a look at:

asfimport · 2023-02-21T11:16:54Z

Jan Finis / @JFinis:
@wgtmac my proposal would be backward compatible. It would only add optional fields, so that legacy readers not implementing the logic could still read the file and legacy writers could also still write files not using this without any code changes.

@mapleFU I agree. Treating NaN as larger or smaller than any other value doesn't fit the semantics of all engines. Therefore, my fix for this would enable both types of semantics (and a third semantics, where NaN should be treated as neither larger nor smaller) to work.

I guess I'll just create a pull request. We can then start the discussion based on this. Sounds good?

asfimport · 2023-02-21T14:21:52Z

Gang Wu / @wgtmac:
@JFinis Yes, please submit a PR when you are ready. Thanks.

asfimport · 2023-03-22T11:42:05Z

Jan Finis / @JFinis:
@wgtmac @mapleFU I have created the pull request. Have a look at it :).

Should I advertise the pull request somewhere else?

wgtmac removed Component: Format labels Jul 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs #406

Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs #406

asfimport commented Feb 19, 2023

asfimport commented Feb 20, 2023

asfimport commented Feb 20, 2023

asfimport commented Feb 20, 2023

asfimport commented Feb 20, 2023

asfimport commented Feb 20, 2023

asfimport commented Feb 20, 2023

asfimport commented Feb 20, 2023

asfimport commented Feb 21, 2023

asfimport commented Feb 21, 2023

asfimport commented Mar 22, 2023

Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs #406

Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs #406

Comments

asfimport commented Feb 19, 2023

PRs and other links:

asfimport commented Feb 20, 2023

asfimport commented Feb 20, 2023

asfimport commented Feb 20, 2023

asfimport commented Feb 20, 2023

asfimport commented Feb 20, 2023

asfimport commented Feb 20, 2023

asfimport commented Feb 20, 2023

asfimport commented Feb 21, 2023

asfimport commented Feb 21, 2023

asfimport commented Mar 22, 2023