feat: improved type validation for creating dataframes, implement xquery entity unescaping #1395

EPMatt · 2025-11-12T15:09:48Z

Changes

Support &-based escaped string literals

Implement &-based escaping for Character References and Predefined Entity References in String Literals, Key Specifiers (XQuery 4.0), BracedURILiterals, Text Node content, and Attribute Content Values.

The unescaping is performed at the static analysis level, after the automated parsing step.
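As a rough illustration of what resolving these references involves, here is a minimal Python sketch. It is not the code from this PR (the commits mention an Apache library on the Java side); the names `PREDEFINED` and `unescape_references` are hypothetical, and the sketch hand-rolls the replacement with a regular expression in a single pass.

```python
import re

# The five predefined entity references shared by XML and XQuery.
PREDEFINED = {"lt": "<", "gt": ">", "amp": "&", "quot": '"', "apos": "'"}

# Matches &#x1F600; (hexadecimal), &#123; (decimal) and &name; references.
_REFERENCE = re.compile(r"&(?:#x([0-9a-fA-F]+)|#([0-9]+)|([A-Za-z]+));")


def unescape_references(text: str) -> str:
    """Resolve character and predefined entity references in one pass."""

    def replace(match: re.Match) -> str:
        hex_digits, dec_digits, name = match.groups()
        if hex_digits is not None:
            return chr(int(hex_digits, 16))
        if dec_digits is not None:
            return chr(int(dec_digits, 10))
        if name in PREDEFINED:
            return PREDEFINED[name]
        # Assumption: unknown names are rejected rather than passed through.
        raise ValueError(f"unknown entity reference: &{name};")

    # A stray "&" that does not start a reference is left untouched here;
    # a real XQuery front end would report it as a static error instead.
    return _REFERENCE.sub(replace, text)
```

Because the substitution is a single pass, `&amp;lt;` resolves to the literal text `&lt;` and is not unescaped a second time.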

Refactor schema detection in RuntimeIterator for creating DataFrames directly from an RDD

Replace the current JSON-based schema detection mechanism (which uses Spark's schema_of_variant_agg and only supports JSON types) with a native Java/JSONiq implementation that works directly with Item types using the findLeastCommonSuperTypeWith method.
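The idea of inferring one type for a collection by repeatedly taking the least common supertype can be sketched as follows. This is a toy model in Python, not the Java implementation: the hierarchy in `PARENT` is a heavily simplified assumption, and `least_common_super_type` and `infer_type` are hypothetical names standing in for what `findLeastCommonSuperTypeWith` does on real item types.

```python
from functools import reduce

# Hypothetical, heavily simplified type hierarchy: child -> parent.
PARENT = {
    "integer": "decimal",
    "decimal": "atomic",
    "double": "atomic",
    "string": "atomic",
    "boolean": "atomic",
    "atomic": "item",
    "object": "item",
    "array": "item",
}


def ancestors(type_name):
    """Return the chain from type_name up to the root, inclusive."""
    chain = [type_name]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def least_common_super_type(left, right):
    """Return the most specific type that is an ancestor of both arguments."""
    right_chain = set(ancestors(right))
    for candidate in ancestors(left):
        if candidate in right_chain:
            return candidate
    raise ValueError(f"no common supertype for {left!r} and {right!r}")


def infer_type(type_names):
    """Fold a non-empty sequence of type names into one common supertype."""
    return reduce(least_common_super_type, type_names)
```

For example, `infer_type(["integer", "decimal", "integer"])` gives `"decimal"`. On a tree-shaped hierarchy like this one the pairwise operation is associative and commutative, which is what would allow the same fold to run as a distributed reduction over an RDD.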

Testing

See improved test coverage for XQuery Parser (PR CI comment).

github-actions · 2025-11-12T15:20:13Z

Test Results (qt3tests)

RumbleDB, XQuery parser

Test Suite	Passing	Failing	Errors	Skipped	Total
MathTest	147	0	2	0	149
MiscTest	181	373	180	137	871
Prod1Test	4836	851	1808	743	8238
Fn1Test	2588	610	1730	367	5295
SerTest	4	2	1	336	343
Fn2Test	3156	956	1264	464	5840
AppTest	989	46	1084	38	2157
Prod2Test	1733	543	1169	524	3969
ArrayTest	0	45	155	9	209
XsTest	89	14	12	49	164
OpTest	4012	117	194	43	4366
MapTest	4	23	193	0	220
Total	17739	3580	7792	2710	31821

RumbleDB, JSONiq parser

Test Suite	Passing	Failing	Errors	Skipped	Total
MiscTest	162	284	114	311	871
ArrayTest	0	0	0	209	209
Fn1Test	2400	177	118	2600	5295
XsTest	89	0	0	75	164
Prod1Test	3902	201	324	3811	8238
SerTest	4	0	0	339	343
Fn2Test	2659	288	85	2808	5840
MapTest	3	1	14	202	220
AppTest	971	17	20	1149	2157
Prod2Test	1320	221	129	2299	3969
OpTest	3742	28	21	575	4366
MathTest	147	0	1	1	149
Total	15399	1217	826	14379	31821

Download detailed test results

github-actions · 2025-11-20T13:19:07Z

Test Results (qt3tests)

RumbleDB, XQuery parser

Test Suite	Passing	Failing	Errors	Skipped	Total
MathTest	147	0	2	0	149
MiscTest	181	373	180	137	871
Prod1Test	4836	851	1808	743	8238
Fn1Test	2588	610	1730	367	5295
SerTest	4	2	1	336	343
Fn2Test	3156	956	1264	464	5840
AppTest	989	46	1084	38	2157
Prod2Test	1733	543	1169	524	3969
ArrayTest	0	45	155	9	209
XsTest	89	14	12	49	164
OpTest	4012	117	194	43	4366
MapTest	4	23	193	0	220
Total	17739	3580	7792	2710	31821

RumbleDB, JSONiq parser

Test Suite	Passing	Failing	Errors	Skipped	Total
MiscTest	162	284	114	311	871
ArrayTest	0	0	0	209	209
Fn1Test	2400	177	118	2600	5295
XsTest	89	0	0	75	164
Prod1Test	3902	201	324	3811	8238
SerTest	4	0	0	339	343
Fn2Test	2659	288	85	2808	5840
MapTest	3	1	14	202	220
AppTest	971	17	20	1149	2157
Prod2Test	1320	221	129	2299	3969
OpTest	3742	28	21	575	4366
MathTest	147	0	1	1	149
Total	15399	1217	826	14379	31821

Download detailed test results

ghislainfourny · 2025-11-20T13:30:15Z

I tested it with the Python library and the DataFrame query output, and it seems that some cases that were previously converted to DataFrames successfully no longer succeed. Would it be possible to extend the error message with the inferred common denominator schema, to help understand what the issue is? Then I will run it again and see if there is an obvious oversight. Thanks!

My guess right now is that, upon merging two different object types, RumbleDB just outputs the topmost "object" primitive type instead of merging field by field. The reason is that the type hierarchy in JSound is explicitly based on the given base types and does not follow logically from the object content layout alone. We might need to add an option, "lax" or "strict", to the method that computes the least common super type: if it is "lax", we merge field by field into a new anonymous type; if it is "strict", we output the topmost object primitive type.
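The difference between the two proposed modes can be sketched in Python as below. This is only an illustration of the idea, not RumbleDB code: object types are modelled as plain dicts from field name to type, `PARENT`, `named_super_type` and `super_type` are hypothetical names, and keeping a field unchanged when it appears on only one side is an assumption (a real implementation might mark it optional instead).

```python
# Hypothetical, simplified hierarchy of named types: child -> parent.
PARENT = {
    "integer": "decimal",
    "decimal": "atomic",
    "double": "atomic",
    "string": "atomic",
    "boolean": "atomic",
    "atomic": "item",
    "object": "item",
    "array": "item",
}


def named_super_type(left, right):
    """Least common supertype of two named types from PARENT."""
    chain = [left]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    candidate = right
    while candidate not in chain:
        candidate = PARENT[candidate]
    return candidate


def super_type(left, right, lax):
    """Common supertype; dicts stand for anonymous object types."""
    if isinstance(left, dict) and isinstance(right, dict):
        if left == right:
            return left
        if not lax:
            # Strict mode: unrelated object types only share "object".
            return "object"
        merged = {}
        for field in left.keys() | right.keys():
            if field in left and field in right:
                merged[field] = super_type(left[field], right[field], lax)
            else:
                # Assumption: a field present on one side is kept as-is.
                merged[field] = left[field] if field in left else right[field]
        return merged
    if isinstance(left, dict) or isinstance(right, dict):
        other = right if isinstance(left, dict) else left
        return "object" if other == "object" else "item"
    return named_super_type(left, right)
```

With `a = {"id": "integer", "name": "string"}` and `b = {"id": "decimal", "tags": "array"}`, strict mode collapses the pair to `"object"`, while lax mode yields an anonymous type with the fields `id: decimal`, `name: string` and `tags: array`.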

github-actions · 2025-11-20T14:22:38Z

Test Results (qt3tests)

RumbleDB, XQuery parser

Test Suite	Passing	Failing	Errors	Skipped	Total
MathTest	147	0	2	0	149
MiscTest	181	373	180	137	871
Prod1Test	4836	851	1808	743	8238
Fn1Test	2588	610	1730	367	5295
SerTest	4	2	1	336	343
Fn2Test	3156	948	1272	464	5840
AppTest	989	46	1084	38	2157
Prod2Test	1733	543	1169	524	3969
ArrayTest	0	45	155	9	209
XsTest	89	14	12	49	164
OpTest	4012	117	194	43	4366
MapTest	4	23	193	0	220
Total	17739	3572	7800	2710	31821

RumbleDB, JSONiq parser

Test Suite	Passing	Failing	Errors	Skipped	Total
MiscTest	162	284	114	311	871
ArrayTest	0	0	0	209	209
Fn1Test	2400	177	118	2600	5295
XsTest	89	0	0	75	164
Prod1Test	3902	201	324	3811	8238
SerTest	4	0	0	339	343
Fn2Test	2659	282	91	2808	5840
MapTest	3	1	14	202	220
AppTest	971	17	20	1149	2157
Prod2Test	1320	221	129	2299	3969
OpTest	3742	28	21	575	4366
MathTest	147	0	1	1	149
Total	15399	1211	832	14379	31821

Download detailed test results

ghislainfourny

I am setting the review on "Request changes" so we can investigate why some outputs no longer get validated (see my comment above).

…ode (#3)
* feat(types): implement common super type lax mode, custom implementations for objects and arrays
* test(types): add tests for common supertype lax mode
* fix(validatetypeiterator): use new lax common supertype method for rdd schema inference

EPMatt · 2025-11-28T17:59:24Z

@ghislainfourny I’ve added the common supertype lax mode as requested (see EPMatt#3). Please let me know if this resolves the issue.

Regarding the detailed error log: could you share the output (logs and/or stack trace) you’re seeing with the failing test cases? This will help me determine where to improve the error reporting in the code.

Thanks!

github-actions · 2025-11-28T18:02:54Z

Test Results (qt3tests)

RumbleDB, XQuery parser

Test Suite	Passing	Failing	Errors	Skipped	Total
MathTest	147	0	2	0	149
MiscTest	181	373	180	137	871
Prod1Test	4836	851	1808	743	8238
Fn1Test	2588	610	1730	367	5295
SerTest	4	2	1	336	343
Fn2Test	3156	948	1272	464	5840
AppTest	989	46	1084	38	2157
Prod2Test	1733	543	1169	524	3969
ArrayTest	0	45	155	9	209
XsTest	89	14	12	49	164
OpTest	4012	117	194	43	4366
MapTest	4	23	193	0	220
Total	17739	3572	7800	2710	31821

RumbleDB, JSONiq parser

Test Suite	Passing	Failing	Errors	Skipped	Total
MiscTest	162	284	114	311	871
ArrayTest	0	0	0	209	209
Fn1Test	2400	177	118	2600	5295
XsTest	89	0	0	75	164
Prod1Test	3902	201	324	3811	8238
SerTest	4	0	0	339	343
Fn2Test	2659	282	91	2808	5840
MapTest	3	1	14	202	220
AppTest	971	17	20	1149	2157
Prod2Test	1320	221	129	2299	3969
OpTest	3742	28	21	575	4366
MathTest	147	0	1	1	149
Total	15399	1211	832	14379	31821

Download detailed test results

github-actions · 2025-12-06T13:38:33Z

Test Results (qt3tests)

RumbleDB, XQuery parser

Test Suite	Passing	Failing	Errors	Skipped	Total
MathTest	147	0	2	0	149
MiscTest	181	373	180	137	871
Prod1Test	4836	851	1808	743	8238
Fn1Test	2588	610	1730	367	5295
SerTest	4	2	1	336	343
Fn2Test	3156	948	1272	464	5840
AppTest	989	46	1084	38	2157
Prod2Test	1733	543	1169	524	3969
ArrayTest	0	45	155	9	209
XsTest	89	14	12	49	164
OpTest	4012	117	194	43	4366
MapTest	4	23	193	0	220
Total	17739	3572	7800	2710	31821

RumbleDB, JSONiq parser

Test Suite	Passing	Failing	Errors	Skipped	Total
MiscTest	162	284	114	311	871
ArrayTest	0	0	0	209	209
Fn1Test	2400	177	118	2600	5295
XsTest	89	0	0	75	164
Prod1Test	3902	201	324	3811	8238
SerTest	4	0	0	339	343
Fn2Test	2659	282	91	2808	5840
MapTest	3	1	14	202	220
AppTest	971	17	20	1149	2157
Prod2Test	1320	221	129	2299	3969
OpTest	3742	28	21	575	4366
MathTest	147	0	1	1	149
Total	15399	1211	832	14379	31821

Download detailed test results

EPMatt added 4 commits November 11, 2025 13:45

feat: refactor type validation in RuntimeIterator for creating dataframes to use RDDs directly

9c8e424

feat(jsoniq): move string escaping to translation visitor

f036208

feat(xquery): implement entity ref unescaping with apache library, add missing entity ref unescaping in several ast paths

0d41a5c

trigger ci

15e20fc

EPMatt marked this pull request as ready for review November 12, 2025 15:15

EPMatt requested a review from ghislainfourny as a code owner November 12, 2025 15:15

Merge branch 'master' into master-matteo

5595b18

Merge branch 'master' into master-matteo

c6e4d2d

ghislainfourny self-requested a review November 20, 2025 16:06

ghislainfourny requested changes Nov 20, 2025

Merge branch 'master' into master-matteo

b153f58

3 participants