Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Primitive REPEATED fields not contained in LIST annotated groups aren't read as lists by record reader #6648

Closed
zeevm opened this issue Oct 29, 2024 · 13 comments · Fixed by #6649
Labels
bug parquet Changes to the parquet crate

Comments

@zeevm
Copy link
Contributor

zeevm commented Oct 29, 2024

Describe the bug
Primitive REPEATED fields not contained in LIST annotated groups should be read as lists according to the format but aren't.

To Reproduce

Expected behavior

Additional context

@zeevm
Copy link
Contributor Author

zeevm commented Oct 30, 2024

repeated_no_list.parquet.gz
The attached file has two REPEATED primitive fields that aren't read correctly with the current implementation:
{Int32: 0, String: "foo"}
{Int32: 1, String: "zero"}
{Int32: 2, String: "one"}
{Int32: 3, String: "two"}

The fix yields the proper result:
{Int32: [0, 1, 2, 3], String: ["foo", "zero", "one", "two"]}
{Int32: [], String: ["three"]}
{Int32: [4], String: ["four"]}
{Int32: [5, 6, 7, 8], String: ["five", "six", "seven", "eight"]}

@etseidl
Copy link
Contributor

etseidl commented Oct 30, 2024

TIL, thanks! One question, however. The spec actual says A repeated field that is neither contained by a LIST- or MAP-annotated group nor annotated by LIST or MAP should be interpreted as a required list of required elements where the element type is the type of the field. So is it allowable for the max def level to be 1 here? If the elements are required, shouldn't max_def be 0? TBH I'm not sure why the spec is worded that way...nulls are clearly detectable so I'd think the required vs optional could be deduced from the max def level. In fact, arrow-cpp/pyarrow seems to read the given file just fine (although the arrow schema seems do indicate all elements are not null).

>>> df = pq.read_table('repeated_no_list.parquet')
>>> df
pyarrow.Table
Int32: list<Int32: int32 not null> not null
  child 0, Int32: int32 not null
String: list<String: string not null> not null
  child 0, String: string not null
----
Int32: [[[0,1,2,3],[],[4],[5,6,7,8]]]
String: [[["foo","zero","one","two"],["three"],["four"],["five","six","seven","eight"]]]

Then again, lists confuse me. So even if required, is it that a REPEATED field has 0 to many entries?

@mapleFU
Copy link
Member

mapleFU commented Oct 30, 2024

FYI, We currently meet a similiar problem and issue a discussion here[1], also refer to [2]

[1] https://lists.apache.org/thread/s6b25j3x26009v054yqjov0f1z49ctqj
[2] apache/parquet-format#466

@wgtmac
Copy link
Member

wgtmac commented Oct 30, 2024

This is not a normal LIST type specified by the Parquet spec where a LIST annotation on a group type is required. Should we disable writing lists as this in parquet-rs? We should follow the three level list encoding.

@etseidl
Copy link
Contributor

etseidl commented Oct 30, 2024

But IIUC this form is permitted by the spec. It just can't be mixed with the LIST annotation.

@tustvold tustvold changed the title Primitive REPEATED fields not contained in LIST annotated groups aren't read as lists Primitive REPEATED fields not contained in LIST annotated groups aren't read as lists by record reader Oct 30, 2024
@tustvold
Copy link
Contributor

tustvold commented Oct 30, 2024

I believe this is specific to the record reader, i.e. the non arrow reader, and so have updated the title to reflect this. Let me know if I am mistaken

#2394 is also related

@zeevm
Copy link
Contributor Author

zeevm commented Oct 30, 2024

@etseidl

So even if required, is it that a REPEATED field has 0 to many entries?

I think so, it aligns with the definition of REPEATED in the thrift:

" /** The field is repeated and can contain 0 or more values */
REPEATED = 2;"

@zeevm
Copy link
Contributor Author

zeevm commented Oct 30, 2024

@tustvold

I believe this is specific to the record reader, i.e. the non arrow reader, and so have updated the title to reflect this. Let me know if I am mistaken

#2394 is also related

Correct, this is only for the record reader. TBH I totally forgot I've opened the same issue you've mentioned from two years ago :)

Every time we encounter such a file with one of our customers I'm looking into it again, finally decided to just fix it.

@zeevm
Copy link
Contributor Author

zeevm commented Oct 30, 2024

@wgtmac

This is not a normal LIST type specified by the Parquet spec where a LIST annotation on a group type is required. Should we disable writing lists as this in parquet-rs? We should follow the three level list encoding.

This is allowed by the format, other writers sometimes generate such files (e.g. the customer file that tripped our code was created with parquet-mr)

@wgtmac
Copy link
Member

wgtmac commented Oct 30, 2024

Thanks @zeevm! I think the provided file is worth adding to https://github.com/apache/parquet-testing as a good example for interoperability test.

@wgtmac
Copy link
Member

wgtmac commented Oct 30, 2024

It seems that parquet-cli (backed by parquet-mr) cannot read it:

> parquet-cli meta repeated_no_list.parquet

File path:  repeated_no_list.parquet
Created by: parquet-rs version 53.2.0
Properties: (none)
Schema:
message schema {
  repeated int32 Int32;
  repeated binary String (STRING);
}


Row group 0:  count: 4  65.75 B records  start: 4  total(compressed): 263 B total(uncompressed):263 B
--------------------------------------------------------------------------------
        type      encodings count     avg size   nulls   min / max
Int32   INT32     _ RR_     10        10.90 B    1       "0" / "8"
String  BINARY    _ RR_     10        15.40 B    0       "eight" / "zero"

> parquet-cli cat repeated_no_list.parquet
Unknown error
java.lang.RuntimeException: Failed on record 0 in file repeated_no_list.parquet
	at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:89)
	at org.apache.parquet.cli.Main.run(Main.java:163)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
	at org.apache.parquet.cli.Main.main(Main.java:191)
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/Users/gangwu/Downloads/repeated_no_list.parquet
	at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:280)
	at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
	at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:140)
	at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:356)
	at org.apache.parquet.cli.BaseCommand$1$1.<init>(BaseCommand.java:337)
	at org.apache.parquet.cli.BaseCommand$1.iterator(BaseCommand.java:335)
	at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:76)
	... 3 more
Caused by: org.apache.parquet.io.ParquetDecodingException: The requested schema is not compatible with the file schema. incompatible types: required group Int32 (LIST) {
  repeated int32 array;
} != repeated int32 Int32
	at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:104)
	at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:81)
	at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:57)
	at org.apache.parquet.schema.MessageType.accept(MessageType.java:52)
	at org.apache.parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:167)
	at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:155)
	at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:245)
	... 9 more

@etseidl
Copy link
Contributor

etseidl commented Oct 30, 2024

It seems that parquet-cli (backed by parquet-mr) cannot read it:

Strange. An old parquet-tools 1.11.1 jar I have can read it, however.

row group 0 
--------------------------------------------------------------------------------
Int32:   INT32 UNCOMPRESSED DO:4 FPO:54 SZ:109/109/1.00 VC:10 ENC:PLAIN,RLE,RLE_DICTIONARY ST:[min: 0, max: 8, num_nulls: 1]
String:  BINARY UNCOMPRESSED DO:113 FPO:208 SZ:154/154/1.00 VC:10 ENC:PLAIN,RLE,RLE_DICTIONARY ST:[min: eight, max: zero, num_nulls: 0]

    Int32 TV=10 RL=1 DL=1 DS:  9 DE:PLAIN
    ----------------------------------------------------------------------------
    page 0:                     DLE:RLE RLE:RLE VLE:RLE_DICTIONARY ST:[min: 0, max: 8, num_nulls: 1] CRC:[none] SZ:24 VC:10

    String TV=10 RL=1 DL=1 DS: 10 DE:PLAIN
    ----------------------------------------------------------------------------
    page 0:                     DLE:RLE RLE:RLE VLE:RLE_DICTIONARY ST:[min: eight, max: zero, num_nulls: 0] CRC:[none] SZ:23 VC:10

INT32 Int32 
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 10 *** 
value 1:  R:0 D:1 V:0
value 2:  R:1 D:1 V:1
value 3:  R:1 D:1 V:2
value 4:  R:1 D:1 V:3
value 5:  R:0 D:0 V:<null>
value 6:  R:0 D:1 V:4
value 7:  R:0 D:1 V:5
value 8:  R:1 D:1 V:6
value 9:  R:1 D:1 V:7
value 10: R:1 D:1 V:8

BINARY String 
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 10 *** 
value 1:  R:0 D:1 V:foo
value 2:  R:1 D:1 V:zero
value 3:  R:1 D:1 V:one
value 4:  R:1 D:1 V:two
value 5:  R:0 D:1 V:three
value 6:  R:0 D:1 V:four
value 7:  R:0 D:1 V:five
value 8:  R:1 D:1 V:six
value 9:  R:1 D:1 V:seven
value 10: R:1 D:1 V:eight

@alamb alamb added the parquet Changes to the parquet crate label Nov 16, 2024
@alamb
Copy link
Contributor

alamb commented Nov 16, 2024

label_issue.py automatically added labels {'parquet'} from #6649

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants