-
Notifications
You must be signed in to change notification settings - Fork 850
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add is_valid
and truncate
methods to NullBufferBuilder
#7013
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for picking this up, left some comments 👍
arrow-buffer/src/builder/null.rs
Outdated
if let Some(buf) = self.bitmap_builder.as_ref() { | ||
buf.capacity() | ||
} else { | ||
0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think need to reference self.capacity
here instead of 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I'm not sure when bitmap_builder
hasn't been initialized, we should return the actual capacity of it (that is 0 because no builder existed) or self.capacity
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion it would feel weird if I initialize a NullBufferBuilder
with a capacity:
let nbb = NullBufferBuilder::new(10)
But then checking the capacity right after would result in it saying 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it is a bit strange, but I think the PR as written makes the most sense.
Specifically, downstream in DataFusion (and other places) we use the capacity
as a way to calculate how much memory has been allocated -- if there is no bitmap_builder there is no memory allocated.
I think we can make this less confusing with some comments. I left some suggestions
Another thing that might help might be to rename it to fn allocated_capacity()
to make the difference more explicit 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I also made Minor: Clarify NullBufferBuilder::new capacity parameter #7016 to try and clarify the behavior of
capacity
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fn allocated_capacity()
would make sense, but it seems we already have that:
arrow-rs/arrow-buffer/src/builder/null.rs
Lines 188 to 194 in 0c07ec7
/// Return the allocated size of this builder, in bytes, useful for memory accounting. | |
pub fn allocated_size(&self) -> usize { | |
self.bitmap_builder | |
.as_ref() | |
.map(|b| b.capacity()) | |
.unwrap_or(0) | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or rename capacity
field to initial_capacity
or initial_bits
to clearly indicate this is for initialization only?
pub fn new(capacity: usize) -> Self {
Self {
bitmap_builder: None,
len: 0,
initial_capacity,
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we don't need to add a capacity()
method given allocated_size()
seems to already do what is required
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed capacity method in 8ed7851
arrow-buffer/src/builder/null.rs
Outdated
if len > self.len { | ||
return; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once we materialize, is self.len
accurate anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure what you mean -- in this case the user requested to truncate to a value larger than the current length which has no effect.
This behavior seems to be consistent with Vec::truncate
so it makes sense to me: https://doc.rust-lang.org/std/vec/struct.Vec.html#method.truncate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider this test case:
#[test]
fn test123() {
let mut builder = NullBufferBuilder::new(0);
assert_eq!(builder.len(), 0);
builder.append_n_nulls(2);
assert_eq!(builder.len(), 2);
builder.truncate(1);
assert_eq!(builder.len(), 1); // fails here
}
It would fail at the last assertion because it was materialized after appending two nulls, but then truncating down to 1 is a noop since the internal self.len
stays 0 (not updated after materialization)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are (of course) correct. This is a great test and find
I took the liberty of pushing a commit to this branch that includes this test case (and will fail CI until it is fixed so block this PR from merging)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in d0d3c62 and added some more documentation on the relationship between len
and builder
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Chen-Yuan-Lai and @Jefffrey -- I think this PR looks really nice.
The comments / suggestions from @Jefffrey are worth looking at I think but in my opinion we could also merge this PR as is.
I will wait for @Jefffrey 's comments before doing so
arrow-buffer/src/builder/null.rs
Outdated
if let Some(buf) = self.bitmap_builder.as_ref() { | ||
buf.capacity() | ||
} else { | ||
0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it is a bit strange, but I think the PR as written makes the most sense.
Specifically, downstream in DataFusion (and other places) we use the capacity
as a way to calculate how much memory has been allocated -- if there is no bitmap_builder there is no memory allocated.
I think we can make this less confusing with some comments. I left some suggestions
Another thing that might help might be to rename it to fn allocated_capacity()
to make the difference more explicit 🤔
arrow-buffer/src/builder/null.rs
Outdated
if len > self.len { | ||
return; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure what you mean -- in this case the user requested to truncate to a value larger than the current length which has no effect.
This behavior seems to be consistent with Vec::truncate
so it makes sense to me: https://doc.rust-lang.org/std/vec/struct.Vec.html#method.truncate
As I am working on the next release, I am going to try and push this PR along |
…-rs into add_required_methods
NullBufferBuilder
is_valid
and truncate
methods to NullBufferBuilder
@@ -221,6 +250,7 @@ mod tests { | |||
builder.append_n_nulls(2); | |||
builder.append_n_non_nulls(2); | |||
assert_eq!(6, builder.len()); | |||
assert_eq!(512, builder.allocated_size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it appears that @XiangpengHao had already added allocated_size
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we removed capacity()
, I think assertions for builder.allocated_size()
can also be removed in the test_null_buffer_builder_truncate
test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since there was no other testing of allocated_size
that I could find, I thought it would be good to leave these tests in as the increase coverage for the existing behavior
@Chen-Yuan-Lai and @Jefffrey -- I made some changes to this PR that I think address the (excellent) comments. Let me know what you think |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
Thanks again @Chen-Yuan-Lai and @Jefffrey |
Which issue does this PR close?
Closes #7002 .
Rationale for this change
As #7002 says, some required methods are implemented due to the need to replace
BooleanBufferBuilder
withNullBufferBuilder
in DataFsusion.What changes are included in this PR?
capacity()
is_valid()
(as same asget_bit()
inBooleanBufferBuilder
)truncate()
Are there any user-facing changes?