[FEA] Revisit multi-get_json_object now that CUDF has better performance #11879

revans2 · 2024-12-16T17:49:52Z

Is your feature request related to a problem? Please describe.
We put a lot of work into get_json_object and we were able to speed up a specific customer query by over 3x from the original GPU version we tested, and over 4x from the CPU version.

After that, together with the CUDF team, we optimized from_json and JSON Scan significantly. I think it is time for us to revisit multi-get_json_object. If I rewrite this customer query to use from_json where possible, we are able to speed up the current CPU implementation by an additional 1.45x, making the total GPU speedup closer to 5x than 3x.

Describe the solution you'd like
This is mostly an experiment. We could try to write custom code that uses the tokens from the cudf JSON tokenizer to process multiple JSON paths in parallel, similar to what we do today with multi-get_json_object. We could also just rewrite the query so that the parts we feel confident doing with from_json are done that way. We could also just say that we are at a good point and stay there. But we need to make an informed decision, and ideally use more than one benchmark/query to make it.
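To make the "multiple JSON paths in one pass" idea concrete, here is a minimal plain-Python sketch. It is not the cudf tokenizer or the spark-rapids implementation; `parse_path` and `multi_get` are hypothetical names, only simple `$.a.b` style paths with named fields are handled, and the output formatting is an assumption that is not guaranteed to match Spark's get_json_object exactly. The point is only that each document is parsed once and every path is resolved against that single parse.

```python
import json
from typing import List, Optional


def parse_path(path: str) -> List[str]:
    """Split a simple path such as '$.a.b' into ['a', 'b'].

    Hypothetical helper: only named fields are supported (no arrays, no wildcards).
    """
    if not path.startswith("$"):
        raise ValueError(f"path must start with '$': {path!r}")
    return [part for part in path[1:].split(".") if part]


def multi_get(doc: str, paths: List[str]) -> List[Optional[str]]:
    """Parse `doc` once, then resolve every path against the parsed value."""
    try:
        root = json.loads(doc)
    except json.JSONDecodeError:
        # Invalid JSON: every requested path yields null.
        return [None] * len(paths)

    results: List[Optional[str]] = []
    for path in paths:
        node = root
        for key in parse_path(path):
            if isinstance(node, dict) and key in node:
                node = node[key]
            else:
                node = None
                break
        if node is None:
            results.append(None)
        elif isinstance(node, str):
            # Strings are returned without surrounding quotes.
            results.append(node)
        else:
            # Assumption: other values are rendered as compact JSON text.
            results.append(json.dumps(node, separators=(",", ":")))
    return results


if __name__ == "__main__":
    row = '{"a": {"b": 1}, "c": "x"}'
    print(multi_get(row, ["$.a.b", "$.c", "$.missing"]))
```

The alternative of calling a single-path lookup once per path would re-parse the same document for every path, which is the cost that both multi-get_json_object and a from_json rewrite try to avoid.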

revans2 added the "? - Needs Triage" (Need team to review and classify) and "performance" (A performance related task/issue) labels on Dec 16, 2024

mattahrens removed the "? - Needs Triage" (Need team to review and classify) label on Dec 19, 2024
