feat: add cleanup_non_empty_nulls kernel#9970
Conversation
This is useful in: 1. lambda function on lists - since we need to remove the values that are not null 2. `explode` sql function 3. have kernels that use the list values without need to check if the value should be processed or not - for example implementing `array_distinct` which is keeping in each list the unique items
| assert_eq!(cleaned.nulls(), Some(&input_nulls)); | ||
| } | ||
|
|
||
| // ===== Underlying child array is sliced ===== |
There was a problem hiding this comment.
Should the children arrays below be sliced? The first test below looks very similiar as list_cleanup_nulls_with_null_pointing_to_non_empty_list_and_have_empty_list in line 464
| unsafe { Buffer::from_trusted_len_iter(iter) } | ||
| }; | ||
|
|
||
| let cleanup_array = crate::take::take( |
There was a problem hiding this comment.
non-blocking: for lists/maps with big chunks of valid or empty sublists, is possible that using MutableArrayData directly be faster, since we can copy the given chunk in a single memcpy for some data types, and perform less dynamic dispatches compared to take_list?
Now for string/binary I'm not sure since is static
There was a problem hiding this comment.
yes it is possible
but we can optimize the code further to have dedicated impl for each type but for now it is ok I think.
the next optimization possible is using filter on values and only updating list offsets.
There was a problem hiding this comment.
I also think that is ok, that's for sure, I just wanted to comment instead of forgetting about it later. Using filter is a great idea, nice
| return Ok(Arc::new(self.clone())); | ||
| }; | ||
|
|
||
| // Find an empty value so we can use the `take` kernel |
There was a problem hiding this comment.
why does it matter? having empty value will allow us to use the optimized take version rather than the fallback
There was a problem hiding this comment.
Because I believe we may not need the the empty value and can simplify this to take(self, &UInt32Array::from_iter(0..self.len() as u32), None), including in the interleave fallback path, since take doesn't copy underlying values of nulls. Within the take kernel the version used would be the same as today
comphead
left a comment
There was a problem hiding this comment.
Thanks @rluvaton @gstvg before expanding arrow-rs lets try to scope the concern.
Ideally to include in the PR description the explanation why this problem is happening, what is the reason of ArrayRef require cleaning, because under usual circumstances it should not. When we discussed the change in apache/datafusion#22158 (comment) I feel the scope was to explore if BooleanBuffer can calculate has_false without adding another BooleanBuilder wrapper on top of it
This was before the discussion to move the this can happen under normal circumstances, for example using |
Which issue does this PR close?
N/A
Rationale for this change
when working with lists or variable size arrays you cant operate on the underlying values/bytes of variable length array as is as nulls might point to non empty values
Cases when this is useful:
are not null
explodesql function - list values behind nulls cannot be keptvalue should be processed or not - for example implementing
array_distinctwhich is keeping in each list the unique itemsthe cases where having nulls for non empty values can happen for example when using the
nullifkernelWhat changes are included in this PR?
added to
arrow-selectcleanup_non_empty_nullsmodule which include 2 functionscleanup_non_empty_nullswhich is the logic for removing non empty nulls valueshas_non_empty_nullswhich can be called before calling thecleanup_non_empty_nullsfunction to check if the expensive work is even neededOriginally I wanted to add the function on
ListArrayandStringArrayand so on, but because the use of take and interleave we cannot do thatAre these changes tested?
Yes
Are there any user-facing changes?
yes, new kernel