Optimize get_json_object by calling the main kernel only once #2129

Conversation
src/main/cpp/src/get_json_object.cu (outdated)

// size so that we will not write output strings into an out-of-bound position.
// Checking out-of-bound needs to be performed in the main kernel to make sure we will not have
// data corruption.
constexpr auto padding_ratio = 1.01;
Question: How did we get this ratio? I think there are cases that will make the output larger than input_size * 1.01, such as number normalization, adding quotes under some styles, and escaping some characters.
This is the total size, not one row's size. If we have 1M rows, this will be the size of approximately 1'010'000 rows. In other words, we have around 10'000 extra rows of slack to avoid the invalid memory access issue. We don't care if data of row n is written across its boundary into row n+1 and causes corruption, since we will discard the entire buffer if such an overflow is detected. The extra 10'000 rows at the end just make sure that data of the last row will not be written out of bounds of the entire buffer.
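For illustration, here is a minimal sketch of that sizing scheme in standalone C++ (hypothetical helper names; the actual code lives in get_json_object.cu):

```cpp
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical sizing helper, not the actual implementation: the scratch
// buffer is sized from the *total* input size with a small fixed ratio of
// slack, and the whole buffer is discarded if the outputs turn out not to fit.
std::size_t padded_scratch_size(std::size_t total_input_bytes, double padding_ratio = 1.01)
{
  return static_cast<std::size_t>(
    std::ceil(static_cast<double>(total_input_bytes) * padding_ratio));
}

// Row-to-row overlap inside the buffer is tolerated; only an output that needs
// more bytes than the whole allocation forces the buffer to be thrown away.
bool output_overflowed(std::vector<std::size_t> const& row_output_sizes,
                       std::size_t scratch_bytes)
{
  auto const total_written =
    std::accumulate(row_output_sizes.begin(), row_output_sizes.end(), std::size_t{0});
  return total_written > scratch_bytes;
}
```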
The extra 10'000 rows seem to be too much. We could turn this padding into a dynamic value: for example, the padding could always be the average size of 100 rows. That should be large enough to avoid invalid memory access when writing the last row, but it is not guaranteed.
One extreme case:
scala> val d = Seq("['\n']", "['\n\n\n\n\n\n\n\n\n\n']")
scala> val df = d.toDF
scala> df.createOrReplaceTempView("tab")
scala> spark.sql("select length(value) from tab").show()
+-------------+
|length(value)|
+-------------+
| 5|
| 17|
+-------------+
scala> spark.sql("select length(get_json_object(value, '$')) from tab").show()
+---------------------------------+
|length(get_json_object(value, $))|
+---------------------------------+
| 6|
| 30|
+---------------------------------+
This case will cause invalid memory access: rounding up, 2 * 1.01 = 3 rows; the average row size is 11, so the total allocated size is 33, but the total write size will be 35 (5 + 30).
The overlapping write is expected (not an issue), but the trailing write causes an invalid memory access.
One option: add a bounds-checked write method to json_generator, for example:

// Count every byte so the required output size is still known, but only store
// it while the write position is inside the allocated buffer.
__device__ void write_byte(char c)
{
  if (curr_size < max_allowed_size) { *curr_pos++ = c; }
  curr_size++;
}
In my latest code, I pad the buffer by 100 * average_row_size. In this case, we have avg_row_size = 22 / 2 = 11, so the buffer size will be 22 + 100 * 11: no overflow.
I'll add that code and check the benchmark to see how it affects the overall performance.
Edit: I did attempt to implement such a bounds-checked write, but gave up because it requires modifying the entire JSON generator, JSON parser, and ftos converter. In the meantime, I have an idea to further optimize writing with shared memory (#2130), so let this be future work.
I updated the padding, which is now computed using the max row size. There is some overhead with this, but it is small. This approach avoids invalid memory access in normal situations and also in extreme cases, such as when the column is all nulls except for one very long row.
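For illustration, one possible reading of that padding scheme as a standalone C++ sketch (hypothetical helper name; in the real code the row sizes would come from the strings column's offsets on the device):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical sketch of max-row-size padding: the scratch buffer gets the
// total input size plus the size of the single largest row, so even the
// worst-case row (e.g. one very long row in an otherwise all-null column) can
// expand past its own slot without leaving the buffer.
std::size_t scratch_size_with_max_row_padding(std::vector<std::size_t> const& row_sizes)
{
  auto const total = std::accumulate(row_sizes.begin(), row_sizes.end(), std::size_t{0});
  auto const max_row =
    row_sizes.empty() ? std::size_t{0} : *std::max_element(row_sizes.begin(), row_sizes.end());
  return total + max_row;
}
```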
build
It would be nice to add a unit test:
build
Optimize get_json_object by calling the main kernel only once (NVIDIA#2129)

* Implement the CPU code
* Reimplement kernel
* Fix the kernel caller
* Optimize validity computation
* Cleanup
* Cleanup
* Cleanup
* Add comment
* Tuning kernel occupancy
* Cleanup
* Change padding for the scratch buffer
* Update docs
* Add test for overflow case
* Pad the output buffer using max row size
* Update test
* Change the padding ratio

Signed-off-by: Nghia Truong <[email protected]>
This PR optimizes the host code for the get_json_object function, improving performance by calling the main kernel only once. Currently, the main kernel is called twice: once to compute the output string sizes and once more to write the output strings. The idea of this work is to pre-allocate a temporary char column so the output can be written without knowing the output string sizes in advance. Then, in a single call to the main kernel, we write out both the output strings and their sizes. A (relatively cheap) gather step is then executed to extract the final output strings column.
Closes #2124.
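For illustration, a rough sketch of that single-kernel flow in plain C++ (every name below is a placeholder, not the actual cudf/spark-rapids-jni API; the real implementation works on device buffers and the CUDA kernel in get_json_object.cu):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical single kernel call: writes the output bytes of every row into
// `scratch` and records each row's output size in `out_sizes`.
void launch_main_kernel(std::vector<std::string> const& input,
                        std::string const& json_path,
                        std::vector<char>& scratch,
                        std::vector<std::size_t>& out_sizes);

// Hypothetical gather step: compacts the over-allocated scratch bytes into the
// final strings, using the sizes reported by the kernel.
std::vector<std::string> gather_strings(std::vector<char> const& scratch,
                                        std::vector<std::size_t> const& out_sizes);

std::vector<std::string> get_json_object_once(std::vector<std::string> const& input,
                                              std::string const& json_path)
{
  std::size_t total_bytes = 0;
  for (auto const& s : input) { total_bytes += s.size(); }

  // Pre-allocate a padded temporary char buffer so the kernel can write output
  // strings without knowing their sizes up front (padding as discussed above).
  std::vector<char> scratch(total_bytes + total_bytes / 100 + 1);
  std::vector<std::size_t> out_sizes(input.size());

  // One kernel call writes both the output strings and their sizes.
  launch_main_kernel(input, json_path, scratch, out_sizes);

  // A relatively cheap gather step extracts the final output strings column.
  return gather_strings(scratch, out_sizes);
}
```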