Deepak Majeti / @majetideepak:
The Parquet format is being extended with many new features such as indexes, correct statistics, etc. Compatibility across the various writers (parquet-mr, parquet-cpp, Impala, etc.) is very important for the community to trust and depend on the Parquet file format. We should discuss this JIRA in our next sync and start working toward improving compatibility.
Also, it seems the parquet-testing repository already contains example Parquet files written with a variety of features, so that may be enough to close this issue.
We should build a corpus of Parquet files that client implementations can use for validation. In addition to the input files, it should contain a description or a verbatim copy of the data in each file, so that readers can validate their results.
As a starting point we can look at the old parquet-compatibility repo and Impala's test data, in particular the Parquet files it contains.
Impala also has a tool to generate Parquet files from JSON files: https://github.com/apache/incubator-impala/blob/master/testdata/src/main/java/org/apache/impala/datagenerator/JsonToParquetConverter.java
Arrow has a similar tool: https://github.com/apache/arrow/blob/master/integration/integration_test.py
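To make the idea concrete, below is a minimal sketch of what a validation harness over such a corpus could look like. It assumes each Parquet file is paired with a JSON file holding a verbatim copy of its data; the file names, the `{"columns": ...}` manifest shape, and the injected reader callback are all hypothetical, not something the corpus or any of the tools above define:

```python
import json

# Hypothetical corpus layout: each Parquet file is paired with a JSON file
# holding a verbatim copy of its rows, e.g.
#   corpus/nulls.parquet
#   corpus/nulls.expected.json   -> {"columns": {"a": [1, null, 3]}}

def validate_reader(read_parquet, parquet_path, expected_json):
    """Compare a reader's output against the expected column data.

    read_parquet: callable mapping a file path to {column_name: [values]}.
    expected_json: JSON string with the verbatim expected data.
    Returns the list of column names whose values differ.
    """
    expected = json.loads(expected_json)["columns"]
    actual = read_parquet(parquet_path)
    mismatches = []
    for name, values in expected.items():
        if actual.get(name) != values:
            mismatches.append(name)
    return mismatches  # empty list: the reader reproduces the corpus file

# Stand-in reader used only for illustration; a real harness would plug in
# an implementation under test (parquet-cpp, parquet-mr, Impala, ...).
def fake_reader(path):
    return {"a": [1, None, 3]}

expected = '{"columns": {"a": [1, null, 3]}}'
print(validate_reader(fake_reader, "corpus/nulls.parquet", expected))  # -> []
```

Because the reader is passed in as a callback, the same expected-data files could be reused to check every implementation against one another, which is the cross-writer compatibility goal discussed above.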
Reporter: Lars Volker / @lekv
Related issues:
Note: This issue was originally created as PARQUET-1118. Please see the migration documentation for further details.