A trick solution to achieve multi-field parsing and get compression of structured data by efficient parsing performance #59

Open
heykirby opened this issue Oct 4, 2024 · 2 comments

heykirby commented Oct 4, 2024

Hello piotrrzysko. In many business scenarios, extracting several values from a JSON document only needs an array of paths as parameters, and each value is read once as a string, much like Hive's json_tuple UDF. For example, parseValue(json, 'path1', 'path2', 'path3', ...) returns (value1, value2, value3, ...).
We can therefore read the values directly from the JSON using the bitindex (the structural index) that simdjson builds. This avoids creating a Java object for every JSON node, which reduces garbage collection overhead, and it allows subtrees that no path needs to be pruned, which improves performance further.

A simple example:
The JSON value is: {"field1":{"field2":"value2","field3":3},"field4":["value4","value5"]}
The paths we want are: [$.field1.field2, $.field4.0, $.field4] ($.field4 returns the whole list as its raw JSON string; $.field4.0 returns the first element of the list).
The expected return value is: [value2, value4, '["value4","value5"]']

Solution Implementation
First, we convert the path array to a tree. A blue node means we want the value at that path. If a blue node is a container type (map or list), its content is returned as a raw JSON string, for example $.field4.

image
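The path tree described above could be sketched roughly as follows. This is only an illustration, not code from simdjson-java: the names `PathTreeSketch`, `PathNode`, `build`, and `resultIndex` are hypothetical, and the sketch assumes paths are plain dot-separated segments starting with `$`, with list positions written as numeric segments (as in `$.field4.0`).

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PathTreeSketch {
    /** One path segment. resultIndex >= 0 marks a requested ("blue") node. */
    static final class PathNode {
        final Map<String, PathNode> children = new LinkedHashMap<>();
        int resultIndex = -1;
    }

    /** Builds one tree from all requested paths; shared prefixes share nodes. */
    static PathNode build(String... paths) {
        PathNode root = new PathNode();
        for (int i = 0; i < paths.length; i++) {
            PathNode node = root;
            String[] segments = paths[i].trim().split("\\.");
            // segments[0] is "$", which corresponds to the root node itself.
            for (int s = 1; s < segments.length; s++) {
                node = node.children.computeIfAbsent(segments[s], k -> new PathNode());
            }
            // If the same path is requested twice, only the last slot is kept.
            node.resultIndex = i;
        }
        return root;
    }

    /** Prints the tree, one segment per line, marking requested nodes. */
    static void print(PathNode node, String indent) {
        for (Map.Entry<String, PathNode> e : node.children.entrySet()) {
            PathNode child = e.getValue();
            String mark = child.resultIndex >= 0 ? "  <- result slot " + child.resultIndex : "";
            System.out.println(indent + e.getKey() + mark);
            print(child, indent + "  ");
        }
    }

    public static void main(String[] args) {
        print(build("$.field1.field2", "$.field4.0", "$.field4"), "");
    }
}
```

Here `field4` is both a requested node and the parent of the requested node `0`, which is the case where a container has to be returned as a string while one of its elements is also extracted.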

Second, loop through the bitindex and fill values into the path tree.
In the above example, the bitindex is [0, 1, 9, 10, 11, 19, 20, 28, 29, 37, 38, 39, 40, 41, 49, 50, 51, 59, 60, 68, 69]
In the picture below, I marked each position recorded in the bitindex with '#'.
The bitindex marks the start and end of every map and list ({ } [ ]), the start of every map key and every value, the ':' between a key and its value, and the ',' between elements.

image
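To make the marked positions concrete, here is a minimal scalar sketch that reproduces the bitindex listed above for the example JSON. It is an assumption-laden stand-in: simdjson computes these offsets with SIMD instructions, while this loop visits one character at a time, and the names `StructuralIndexSketch` and `structuralIndexes` are hypothetical. It assumes valid JSON.

```java
import java.util.ArrayList;
import java.util.List;

final class StructuralIndexSketch {
    /**
     * Records the offsets of { } [ ] : and , plus the opening quote of every
     * string and the first character of every unquoted scalar
     * (number, true, false, null).
     */
    static List<Integer> structuralIndexes(String json) {
        List<Integer> out = new ArrayList<>();
        boolean inString = false;
        boolean inScalar = false;
        for (int i = 0; i < json.length(); i++) {
            char c = json.charAt(i);
            if (inString) {
                if (c == '\\') {
                    i++; // skip the escaped character
                } else if (c == '"') {
                    inString = false; // closing quote is not recorded
                }
                continue;
            }
            switch (c) {
                case '"':
                    out.add(i);
                    inString = true;
                    inScalar = false;
                    break;
                case '{': case '}': case '[': case ']': case ':': case ',':
                    out.add(i);
                    inScalar = false;
                    break;
                case ' ': case '\t': case '\n': case '\r':
                    inScalar = false;
                    break;
                default:
                    if (!inScalar) { // first character of a number or literal
                        out.add(i);
                        inScalar = true;
                    }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String json = "{\"field1\":{\"field2\":\"value2\",\"field3\":3},"
                + "\"field4\":[\"value4\",\"value5\"]}";
        System.out.println(structuralIndexes(json));
    }
}
```

Note that a string contributes only its opening quote, so commas or brackets inside string values never appear in the index.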

For the above example, we loop through the bitindex and fill in the value of each node of the JSON path tree step by step. The following is a simple flow chart:

[images: the seven steps of the flow chart]
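The walk over the bitindex might look roughly like the sketch below. It is a self-contained approximation under stated assumptions, not the proposed implementation: every name in it (`PathExtractSketch`, `Node`, `buildTree`, `structuralIndexes`, `extract`) is hypothetical, the index is computed by a scalar loop instead of SIMD, the input is assumed to be valid JSON, and escape sequences in keys and string values are left as they appear in the source text.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class PathExtractSketch {
    /** One path segment; resultIndex >= 0 marks a requested ("blue") node. */
    static final class Node {
        final Map<String, Node> children = new HashMap<>();
        int resultIndex = -1;
    }

    private final String json;
    private final int[] idx;    // structural offsets (the bitindex)
    private final String[] out; // one slot per requested path
    private int pos;            // cursor into idx

    private PathExtractSketch(String json, int pathCount) {
        this.json = json;
        this.idx = structuralIndexes(json);
        this.out = new String[pathCount];
    }

    /** Builds the reusable path tree from paths such as "$.field1.field2". */
    static Node buildTree(String... paths) {
        Node root = new Node();
        for (int i = 0; i < paths.length; i++) {
            Node node = root;
            String[] seg = paths[i].trim().split("\\.");
            for (int s = 1; s < seg.length; s++) { // seg[0] is "$"
                node = node.children.computeIfAbsent(seg[s], k -> new Node());
            }
            node.resultIndex = i;
        }
        return root;
    }

    /** Scalar stand-in for the SIMD stage that produces the bitindex. */
    static int[] structuralIndexes(String json) {
        int[] buf = new int[json.length()];
        int n = 0;
        boolean inString = false, inScalar = false;
        for (int i = 0; i < json.length(); i++) {
            char c = json.charAt(i);
            if (inString) {
                if (c == '\\') i++;                 // skip escaped character
                else if (c == '"') inString = false;
            } else if (c == '"') {
                buf[n++] = i; inString = true; inScalar = false;
            } else if ("{}[]:,".indexOf(c) >= 0) {
                buf[n++] = i; inScalar = false;
            } else if (Character.isWhitespace(c)) {
                inScalar = false;
            } else if (!inScalar) {
                buf[n++] = i; inScalar = true;      // start of number/literal
            }
        }
        return Arrays.copyOf(buf, n);
    }

    /** Fills one slot per path; missing paths and JSON null stay null. */
    static String[] extract(String json, Node tree, int pathCount) {
        PathExtractSketch walker = new PathExtractSketch(json, pathCount);
        walker.value(tree);
        return walker.out;
    }

    /** Advances past one whole value using only bracket depth. */
    private void skip() {
        int depth = 0;
        do {
            char c = json.charAt(idx[pos++]);
            if (c == '{' || c == '[') depth++;
            else if (c == '}' || c == ']') depth--;
        } while (depth > 0);
    }

    private void value(Node node) {
        int start = idx[pos];
        char c = json.charAt(start);
        boolean container = c == '{' || c == '[';
        if (node == null || node.children.isEmpty() || !container) {
            skip(); // pruned subtree, requested leaf, or scalar
        } else {
            pos++;  // step past '{' or '['
            char close = c == '{' ? '}' : ']';
            for (int i = 0; json.charAt(idx[pos]) != close; ) {
                if (json.charAt(idx[pos]) == ',') { pos++; continue; }
                String key;
                if (c == '{') { // key text sits between its two quotes
                    key = json.substring(idx[pos] + 1, json.lastIndexOf('"', idx[pos + 1]));
                    pos += 2;   // step past the key and the ':'
                } else {
                    key = Integer.toString(i++);
                }
                value(node.children.get(key));
            }
            pos++;  // step past the closing bracket
        }
        if (node == null || node.resultIndex < 0) return;
        String text;
        if (container) {
            text = json.substring(start, idx[pos - 1] + 1); // raw JSON text
        } else {
            int end = pos < idx.length ? idx[pos] : json.length();
            text = json.substring(start, end).trim();
            if (c == '"') text = text.substring(1, text.length() - 1);
            else if (text.equals("null")) text = null;
        }
        out[node.resultIndex] = text;
    }

    public static void main(String[] args) {
        Node tree = buildTree("$.field1.field2", "$.field4.0", "$.field4");
        String json = "{\"field1\":{\"field2\":\"value2\",\"field3\":3},"
                + "\"field4\":[\"value4\",\"value5\"]}";
        System.out.println(Arrays.toString(extract(json, tree, 3)));
    }
}
```

The only per-document state is the index array, the cursor, and the result array. Subtrees that no path touches (such as `field3`) are passed over by counting bracket depth, without reading their keys or creating objects for them.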

Because the JSON path tree can be reused across documents, parsing multiple JSONs does not require building a node tree for each one; only the tree for the required paths is needed, which improves parsing performance. The approach also supports returning container-type JSON data as a raw string, extracts multiple values in one pass, and handles the case where the JSON value on a path is null.

heykirby commented Oct 7, 2024

Benchmark, simdjson2 vs. Jackson: performance is more than 6 times higher, and the improvement is especially pronounced when only a few JSON fields are parsed. (reference)

simdjson: 95.936 ops
jackson: 15.833

arouel commented Oct 12, 2024

> benchmark, simdjson2 vs jackson, performance is more than 6 times higher. if parsing less of json fields, the performance improvement is particularly obvious. reference
>
> simdjson: 95.936 ops
> jackson: 15.833

I think the benchmark is flawed due to the current setup, see #60 (comment) for details.
