Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add is_valid and truncate methods to NullBufferBuilder #7013

Merged
merged 14 commits into from
Jan 27, 2025

Conversation

Chen-Yuan-Lai
Copy link
Contributor

@Chen-Yuan-Lai Chen-Yuan-Lai commented Jan 24, 2025

Which issue does this PR close?

Closes #7002 .

Rationale for this change

As #7002 says, some required methods are implemented due to the need to replace BooleanBufferBuilder with NullBufferBuilder in DataFsusion.

What changes are included in this PR?

  • Implemented methods below:
    • capacity()
    • is_valid() (as same as get_bit() in BooleanBufferBuilder)
    • truncate()
  • Add related unit tests

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jan 24, 2025
Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for picking this up, left some comments 👍

if let Some(buf) = self.bitmap_builder.as_ref() {
buf.capacity()
} else {
0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think need to reference self.capacity here instead of 0?

Copy link
Contributor Author

@Chen-Yuan-Lai Chen-Yuan-Lai Jan 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I'm not sure when bitmap_builder hasn't been initialized, we should return the actual capacity of it (that is 0 because no builder existed) or self.capacity.

Copy link
Contributor

@Jefffrey Jefffrey Jan 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion it would feel weird if I initialize a NullBufferBuilder with a capacity:

let nbb = NullBufferBuilder::new(10)

But then checking the capacity right after would result in it saying 0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it is a bit strange, but I think the PR as written makes the most sense.

Specifically, downstream in DataFusion (and other places) we use the capacity as a way to calculate how much memory has been allocated -- if there is no bitmap_builder there is no memory allocated.

I think we can make this less confusing with some comments. I left some suggestions

Another thing that might help might be to rename it to fn allocated_capacity() to make the difference more explicit 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fn allocated_capacity() would make sense, but it seems we already have that:

/// Return the allocated size of this builder, in bytes, useful for memory accounting.
pub fn allocated_size(&self) -> usize {
self.bitmap_builder
.as_ref()
.map(|b| b.capacity())
.unwrap_or(0)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or rename capacity field to initial_capacity or initial_bits to clearly indicate this is for initialization only?

pub fn new(capacity: usize) -> Self {
      Self {
          bitmap_builder: None,
          len: 0,
          initial_capacity,
      }
 }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need to add a capacity() method given allocated_size() seems to already do what is required

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed capacity method in 8ed7851

arrow-buffer/src/builder/null.rs Outdated Show resolved Hide resolved
Comment on lines 161 to 163
if len > self.len {
return;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we materialize, is self.len accurate anymore?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what you mean -- in this case the user requested to truncate to a value larger than the current length which has no effect.

This behavior seems to be consistent with Vec::truncate so it makes sense to me: https://doc.rust-lang.org/std/vec/struct.Vec.html#method.truncate

Copy link
Contributor

@Jefffrey Jefffrey Jan 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider this test case:

    #[test]
    fn test123() {
        let mut builder = NullBufferBuilder::new(0);
        assert_eq!(builder.len(), 0);
        builder.append_n_nulls(2);
        assert_eq!(builder.len(), 2);
        builder.truncate(1);
        assert_eq!(builder.len(), 1); // fails here
    }

It would fail at the last assertion because it was materialized after appending two nulls, but then truncating down to 1 is a noop since the internal self.len stays 0 (not updated after materialization)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are (of course) correct. This is a great test and find

I took the liberty of pushing a commit to this branch that includes this test case (and will fail CI until it is fixed so block this PR from merging)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in d0d3c62 and added some more documentation on the relationship between len and builder

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Chen-Yuan-Lai and @Jefffrey -- I think this PR looks really nice.

The comments / suggestions from @Jefffrey are worth looking at I think but in my opinion we could also merge this PR as is.

I will wait for @Jefffrey 's comments before doing so

if let Some(buf) = self.bitmap_builder.as_ref() {
buf.capacity()
} else {
0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it is a bit strange, but I think the PR as written makes the most sense.

Specifically, downstream in DataFusion (and other places) we use the capacity as a way to calculate how much memory has been allocated -- if there is no bitmap_builder there is no memory allocated.

I think we can make this less confusing with some comments. I left some suggestions

Another thing that might help might be to rename it to fn allocated_capacity() to make the difference more explicit 🤔

arrow-buffer/src/builder/null.rs Outdated Show resolved Hide resolved
Comment on lines 161 to 163
if len > self.len {
return;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what you mean -- in this case the user requested to truncate to a value larger than the current length which has no effect.

This behavior seems to be consistent with Vec::truncate so it makes sense to me: https://doc.rust-lang.org/std/vec/struct.Vec.html#method.truncate

@alamb
Copy link
Contributor

alamb commented Jan 27, 2025

As I am working on the next release, I am going to try and push this PR along

@alamb alamb changed the title Add required methods to NullBufferBuilder Add is_valid and truncate methods to NullBufferBuilder Jan 27, 2025
@@ -221,6 +250,7 @@ mod tests {
builder.append_n_nulls(2);
builder.append_n_non_nulls(2);
assert_eq!(6, builder.len());
assert_eq!(512, builder.allocated_size());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it appears that @XiangpengHao had already added allocated_size

Copy link
Contributor Author

@Chen-Yuan-Lai Chen-Yuan-Lai Jan 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we removed capacity(), I think assertions for builder.allocated_size() can also be removed in the test_null_buffer_builder_truncate test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there was no other testing of allocated_size that I could find, I thought it would be good to leave these tests in as the increase coverage for the existing behavior

@alamb
Copy link
Contributor

alamb commented Jan 27, 2025

@Chen-Yuan-Lai and @Jefffrey -- I made some changes to this PR that I think address the (excellent) comments. Let me know what you think

Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@Chen-Yuan-Lai
Copy link
Contributor Author

Chen-Yuan-Lai commented Jan 27, 2025

LGTM, thanks @alamb and @Jefffrey so much!

@alamb
Copy link
Contributor

alamb commented Jan 27, 2025

Thanks again @Chen-Yuan-Lai and @Jefffrey

@alamb alamb merged commit 6aaff7e into apache:main Jan 27, 2025
28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add required methods to access inner builder for NullBufferBuilder
3 participants