feat: improved type validation for creating dataframes, implement xquery entity unescaping #1395
base: master
…ames to use RDDs directly
…d missing entity ref unescaping in several ast paths
Test Results (qt3tests): RumbleDB, XQuery parser; RumbleDB, JSONiq parser
I tested it with the Python library and the DataFrame query output, and it seems that some cases that were previously converted to DataFrames successfully no longer succeed. Would it be possible to extend the error message with the inferred common-denominator schema, to help understand what the issue is? Then I will run it again and see if there is an obvious oversight. Thanks!

My guess right now is that, upon merging two different object types, RumbleDB just outputs the topmost "object" primitive type instead of merging field by field. The reason is that the type hierarchy in JSound is explicitly based on the given base types and does not follow logically from the object content layout alone.

We might need to add an option to the method that computes the least common super type, set to either "lax" or "strict": if it is "lax", we merge field by field into a new anonymous type; if it is "strict", we output the topmost object primitive type.
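To make the "lax" versus "strict" proposal concrete, here is a minimal Python sketch of the idea. It is not RumbleDB's Java implementation: the function name `common_supertype`, the `lax` flag, and the type encoding (strings for atomic types, dicts for object types, one-element lists for array types) are all assumptions made for illustration. Atomic types that differ simply fall back to `"item"`, whereas a real type hierarchy would look for a closer common ancestor.

```python
def is_object(t):
    # An object type is either a field layout (dict) or the topmost "object".
    return isinstance(t, dict) or t == "object"


def is_array(t):
    # An array type is either [member_type] or the topmost "array".
    return isinstance(t, list) or t == "array"


def common_supertype(a, b, lax=False):
    """Hypothetical least-common-supertype computation (illustration only)."""
    if a == b:
        return a

    if is_object(a) and is_object(b):
        # Strict mode, or no layout to merge: fall back to the topmost object type.
        if not lax or not (isinstance(a, dict) and isinstance(b, dict)):
            return "object"
        merged = {}
        for name in sorted(a.keys() | b.keys()):
            if name in a and name in b:
                merged[name] = common_supertype(a[name], b[name], lax)
            else:
                # Field present on one side only: keep its type.
                # (Optionality of the field is not modeled in this sketch.)
                merged[name] = a[name] if name in a else b[name]
        return merged

    if is_array(a) and is_array(b):
        if not lax or not (isinstance(a, list) and isinstance(b, list)):
            return "array"
        return [common_supertype(a[0], b[0], lax)]

    # Simplification: unrelated or differing atomic types go straight to "item".
    return "item"


person = {"name": "string", "age": "integer"}
customer = {"name": "string", "address": {"city": "string"}}

strict_result = common_supertype(person, customer, lax=False)
lax_result = common_supertype(person, customer, lax=True)
```

In this sketch, strict mode collapses the two layouts to `"object"`, while lax mode yields an anonymous object type containing `name`, `age`, and `address`.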
I am setting the review on "Request changes" so we can investigate why some outputs no longer get validated (see my comment above).
…ode (#3)
* feat(types): implement common super type lax mode, custom implementations for objects and arrays
* test(types): add tests for common supertype lax mode
* fix(validatetypeiterator): use new lax common supertype method for rdd schema inference
@ghislainfourny I’ve added the common supertype lax mode as requested (see EPMatt#3). Please let me know if this resolves the issue. Regarding the detailed error log: could you share the output (logs and/or stack trace) you’re seeing with the failing test cases? This will help me determine where to improve the error reporting in the code. Thanks!
Changes
Support &-based escaped string literals
Implement &-based escaping for Character References and Predefined Character References in String Literals, Key Specifiers (XQuery 4.0), BracedURILiterals, Text Node content, and Attribute Content Values.
The unescaping is implemented at the static analysis level, that is, after the automated parsing step.
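As an illustration of what resolving these references involves, here is a minimal Python sketch. It is not the PR's Java code: the names `unescape`, `PREDEFINED`, and `_REF` are hypothetical. It handles the five predefined entity references (`&lt;`, `&gt;`, `&amp;`, `&quot;`, `&apos;`) plus decimal and hexadecimal character references, and it does not check that a referenced code point is a valid XML character.

```python
import re

# The five predefined entity references.
PREDEFINED = {"lt": "<", "gt": ">", "amp": "&", "quot": '"', "apos": "'"}

# Matches &#xHEX;, &#DEC;, or &name; in a single pass.
_REF = re.compile(r"&(#x[0-9a-fA-F]+|#[0-9]+|[A-Za-z]+);")


def unescape(text):
    """Replace character and predefined entity references in `text`."""

    def replace(match):
        body = match.group(1)
        if body.startswith("#x"):
            return chr(int(body[2:], 16))  # hexadecimal character reference
        if body.startswith("#"):
            return chr(int(body[1:]))  # decimal character reference
        if body in PREDEFINED:
            return PREDEFINED[body]
        raise ValueError(f"unknown entity reference: &{body};")

    return _REF.sub(replace, text)
```

Because the substitution runs in a single pass, the output of one replacement is never rescanned, so `&amp;lt;` becomes the literal text `&lt;` rather than `<`.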
Refactor schema detection in RuntimeIterator for creating Dataframes directly from an RDD
Replace the current JSON-based schema detection mechanism (which uses Spark's schema_of_variant_agg and only supports JSON types) with a native Java/JSONiq implementation that works directly with Item types using the findLeastCommonSuperTypeWith method.
Testing
See improved test coverage for XQuery Parser (PR CI comment).