Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-465: Clarify backward-compatibility rules on LIST type #466

Open
wants to merge 18 commits into
base: master
Choose a base branch
from

Conversation

wgtmac
Copy link
Member

@wgtmac wgtmac commented Oct 30, 2024

Rationale for this change

The C++ reader of parquet-cpp is having a hard time to read Parquet file written by parquet-java with parquet.avro.write-old-list-structure=true and schema below:

optional group a (LIST) {
  repeated group array (LIST) {
    repeated int32 array;
  }
}

See apache/arrow#43994 and apache/arrow#43995

What changes are included in this PR?

Clarify the rules for various list types.

  1. Unannotated repeated fields.
repeated int32 num_list;

repeated group my_list {
  required int32 num;
  optional binary str (STRING);
}
  1. LIST-annotated 2-level structures.
<list-repetition> group <name> (LIST) {
  repeated <element-type> <element-name>;
}

list-repetition: can be repeated, required, and optional.

Do these changes have PoC implementations?

Not required

Closes #465

@wgtmac wgtmac force-pushed the old-list-structure branch 3 times, most recently from abc3d1d to d9579bc Compare October 30, 2024 03:23
@mapleFU
Copy link
Member

mapleFU commented Oct 30, 2024

BTW, can you also point out the Java code in this pr?

LogicalTypes.md Outdated Show resolved Hide resolved
@mapleFU
Copy link
Member

mapleFU commented Oct 30, 2024

Other LGTM but I think it worths issue a disscussion...

@wgtmac
Copy link
Member Author

wgtmac commented Oct 30, 2024

I have sent a discussion thread to the dev ML. It would be good if you can take a look. Thanks! @emkornfield @pitrou @gszadovszky @rdblue @etseidl @clairemcginty

LogicalTypes.md Outdated Show resolved Hide resolved
Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a nice clarification to me. One little (ignorable) nit.

I'm also wondering if we should add an example above for the non-LIST annotated repeating fields, or should that be a new PR?

LogicalTypes.md Outdated Show resolved Hide resolved
LogicalTypes.md Outdated Show resolved Hide resolved
@wgtmac

This comment was marked as resolved.

LogicalTypes.md Outdated Show resolved Hide resolved
LogicalTypes.md Outdated Show resolved Hide resolved
LogicalTypes.md Outdated Show resolved Hide resolved
LogicalTypes.md Outdated Show resolved Hide resolved
Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying! This is so tricky especially rule 3 and rule 4 in 2-level lists...

LogicalTypes.md Show resolved Hide resolved
LogicalTypes.md Outdated Show resolved Hide resolved
LogicalTypes.md Outdated Show resolved Hide resolved
LogicalTypes.md Outdated Show resolved Hide resolved
LogicalTypes.md Outdated Show resolved Hide resolved
Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for being patient with me Gang 😄. And thanks for taking on this needed clarification.

LogicalTypes.md Outdated Show resolved Hide resolved
LogicalTypes.md Outdated Show resolved Hide resolved
LogicalTypes.md Outdated Show resolved Hide resolved
LogicalTypes.md Outdated Show resolved Hide resolved
LogicalTypes.md Outdated Show resolved Hide resolved
@wgtmac
Copy link
Member Author

wgtmac commented Nov 17, 2024

@etseidl I think we are on the same page that current rule 3 has an issue that it does not forbid the inner-most layer to be repeated. However, I feel that it is not a good idea to add would otherwise be a valid 3-level structure as outlined above to rule 3. The reason is that a normal 3-level structure can simply ignore its synthetic middle layer but the middle layer produces a OneTuple in the rule 3, which may also confuse readers. In addition, it is not that straight-forward. So I agree with @rdblue that it would be good to insert a new rule for the new case. IMO the new rule should be placed above the current rule 3 so we are pretty sure that the inner layer cannot be repeated in the remaining cases (including the rule 3). The only caveat is that the new rule 5 seems to be unreachable.

@etseidl
Copy link
Contributor

etseidl commented Nov 18, 2024

Ok, so the new Rule 3 is in effect saying that if the child of the repeated middle group is repeated, then because 1-level and LIST/MAP annotated schemas cannot mix, the middle group cannot be the type so the child is the type. The side consequence of this is that the middle group has to have a LIST or MAP annotation to allow the 3rd level to be repeated.

So is it true then that the repeated keyword may only be used for the child of a LIST or MAP annotated field when those annotations are in use? And if that's true, then a 2-level structure cannot be repeated as stated above?

In contrast to 3-level structure, the repetition of 2-level structure can be `optional`, `required`, or `repeated`.

@wgtmac
Copy link
Member Author

wgtmac commented Nov 19, 2024

Ok, so the new Rule 3 is in effect saying that if the child of the repeated middle group is repeated, then because 1-level and LIST/MAP annotated schemas cannot mix, the middle group cannot be the type so the child is the type. The side consequence of this is that the middle group has to have a LIST or MAP annotation to allow the 3rd level to be repeated

No, the middle level can never be a LIST-annotated or MAP-annotated 3-level structure, because they cannot be repeated. As you have said, the middle level cannot be an unannotated group with a single repeated field due to mixing LIST-annotation and repeated field without annotation. So the only valid case for rule 3 is that the middle level is a LIST-annotated 2-level structure.

So is it true then that the repeated keyword may only be used for the child of a LIST or MAP annotated field when those annotations are in use? And if that's true, then a 2-level structure cannot be repeated as stated above?

You're right. A LIST-annotated 2-level structure can only be repeated when it is a direct child of another LIST-annotated 2-level structure. I had missed the statement that disallows annotated and unannotated types to be mixed and modified these rules back and forth. Now this is pretty clear. Let me reflect these discussions in the PR.

LogicalTypes.md Show resolved Hide resolved
LogicalTypes.md Show resolved Hide resolved
@etseidl
Copy link
Contributor

etseidl commented Nov 19, 2024

No, the middle level can never be a LIST-annotated or MAP-annotated 3-level structure, because they cannot be repeated.

Ah, ok. Now I understand. That's why the 2-level list needs the ability to be repeated, but only as part of a nested list. Thanks again for the clarification @wgtmac. I think all my questions are answered now.

@wgtmac
Copy link
Member Author

wgtmac commented Nov 22, 2024

@gszadovszky @etseidl @rdblue @pitrou @emkornfield Do you have any concern about this PR? I think we need to move forward now.

Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All my concerns have been addressed. Thanks!

LogicalTypes.md Show resolved Hide resolved
LogicalTypes.md Show resolved Hide resolved
LogicalTypes.md Show resolved Hide resolved
LogicalTypes.md Show resolved Hide resolved
LogicalTypes.md Show resolved Hide resolved
LogicalTypes.md Outdated Show resolved Hide resolved
LogicalTypes.md Outdated Show resolved Hide resolved
Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest LGTM

LogicalTypes.md Show resolved Hide resolved
LogicalTypes.md Show resolved Hide resolved
repeated group my_list {
required int32 num;
optional binary str (STRING);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this example is counter-productive. We don't want anyone using un-annotated lists and maps. While the paragraph above explains how to interpret un-annotated repeated fields, I don't want anyone to see an example here and think that it is something that should be copied. I think it is already clear enough and I would simply move on rather than drawing attention to this as a possibility.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That make sense. Let me remove these examples first. I think a followup is to deprecate it by moving it to the backward compatibility section and adding strong words to discourage writers to emit it.

LogicalTypes.md Outdated Show resolved Hide resolved
Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rules part is looking good, but I think that spending time documenting what people did incorrectly years ago makes the doc more confusing and increases chances that people will write invalid lists. I'd prefer to revert most of the changes that explain what people did incorrectly.

@mapleFU
Copy link
Member

mapleFU commented Nov 27, 2024

The rules part is looking good, but I think that spending time documenting what people did incorrectly years ago makes the doc more confusing and increases chances that people will write invalid lists. I'd prefer to revert most of the changes that explain what people did incorrectly.

I agree. But I think those can be posted on the pull-request description

@wgtmac
Copy link
Member Author

wgtmac commented Nov 27, 2024

@rdblue Thanks for your review! I have removed all unnecessary changes. Please take a look again.

or uses the `LIST`-annotated group's name with `_tuple` appended then the
repeated type is the element type and elements are required.
4. Otherwise, the repeated field's type is the element type with the repeated
5. Otherwise, the repeated field's type is the element type with the repeated
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have reverted most of the previous changes and now it should be clear. @etseidl @mapleFU

To resolve a LIST-annotated group, we should apply rules in order:

  • check if it is a 2-level structure (rule 1 to 3)
  • check if it is a special 2-level structure (rule 4)
  • it is a 3-level structure (rule 5)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to add an example for rule 5 because it is already at line 685

@pitrou
Copy link
Member

pitrou commented Nov 27, 2024

@rdblue @wgtmac I'm not sure I agree with removing the examples. They make it less likely that an implementation would interpret the written rules wrongly, IMO. Maybe we can instead strengthen the wording that these examples show existing legacy schemas, not anything to be written by modern writers.

@pitrou
Copy link
Member

pitrou commented Nov 27, 2024

Concretely, if I have to review a PR implementing said compatibility rules, I'd rather have the spec spell out the examples because it makes it easier to check the PR's implementation (and unit tests!).

@wgtmac
Copy link
Member Author

wgtmac commented Nov 27, 2024

Perhaps add them like this? @pitrou

WARNING: WRITERS SHOULD NOT PRODUCE LIST TYPES LIKE THESE EXAMPLES!
THEY ARE FOR THE PURPOSE OF READING EXISTING DATA ONLY.

// List<Integer> (non-null list, non-null elements)
repeated int32 num;

// List<Tuple<Integer, String>> (non-null list, non-null elements)
repeated group my_list {
  required int32 num;
  optional binary str (STRING);
}

@pitrou
Copy link
Member

pitrou commented Nov 27, 2024

Ok, please don't use UPPERCASE :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Backward-compatibility rules on LIST type is unclear
7 participants