Deepak Majeti / @majetideepak:
The Parquet format is being extended with many new features such as indexes, correct statistics, etc. Compatibility across the various writers (parquet-mr, parquet-cpp, Impala, etc.) is very important for the community to trust and depend on the Parquet file format. We should discuss this JIRA in our next sync and start working toward improving compatibility.
Also, it seems the parquet-testing repository already contains example Parquet files written with a variety of features, so that may be enough to close this issue.
We should build a corpus of Parquet files that client implementations can use for validation. In addition to the input files, it should contain a description or a verbatim copy of the data in each file, so that readers can validate their results.
As a starting point we can look at the old parquet-compatibility repo and Impala's test data, in particular the Parquet files it contains.
Impala also has a tool to generate Parquet files from JSON files: https://github.com/apache/incubator-impala/blob/master/testdata/src/main/java/org/apache/impala/datagenerator/JsonToParquetConverter.java
Arrow has a similar tool: https://github.com/apache/arrow/blob/master/integration/integration_test.py
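To make the idea concrete, below is a minimal sketch of what a validation harness over such a corpus could look like. It assumes each Parquet file is paired with a JSON file holding a verbatim copy of its data; the file names, the `{"columns": ...}` manifest shape, and the injected reader callback are all hypothetical, not something the corpus or any of the tools above define:

```python
import json

# Hypothetical corpus layout: each Parquet file is paired with a JSON file
# holding a verbatim copy of its rows, e.g.
#   corpus/nulls.parquet
#   corpus/nulls.expected.json   -> {"columns": {"a": [1, null, 3]}}

def validate_reader(read_parquet, parquet_path, expected_json):
    """Compare a reader's output against the expected column data.

    read_parquet: callable mapping a file path to {column_name: [values]}.
    expected_json: JSON string with the verbatim expected data.
    Returns the list of column names whose values differ.
    """
    expected = json.loads(expected_json)["columns"]
    actual = read_parquet(parquet_path)
    mismatches = []
    for name, values in expected.items():
        if actual.get(name) != values:
            mismatches.append(name)
    return mismatches  # empty list: the reader reproduces the corpus file

# Stand-in reader used only for illustration; a real harness would plug in
# an implementation under test (parquet-cpp, parquet-mr, Impala, ...).
def fake_reader(path):
    return {"a": [1, None, 3]}

expected = '{"columns": {"a": [1, null, 3]}}'
print(validate_reader(fake_reader, "corpus/nulls.parquet", expected))  # -> []
```

Because the reader is passed in as a callback, the same expected-data files could be reused to check every implementation against one another, which is the cross-writer compatibility goal discussed above.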
Reporter: Lars Volker / @lekv
Related issues:
Note: This issue was originally created as PARQUET-1118. Please see the migration documentation for further details.