Skip to content

Conversation

@mkleen
Copy link
Contributor

@mkleen mkleen commented Jan 2, 2026

Which issue does this PR close?

Relates to #19052 (comment)

Rationale for this change

This adds a heap_size method returning the amount of memory a statistics struct allocates on the heap.

What changes are included in this PR?

NA.

Are these changes tested?

Yes

Are there any user-facing changes?

One new method on Statistics.

@github-actions github-actions bot added common Related to common crate execution Related to the execution crate labels Jan 2, 2026
pub fn heap_size(&self) -> usize {
// column_statistics + num_rows + total_byte_size
self.column_statistics.capacity() * size_of::<ColumnStatistics>()
+ size_of::<Precision<usize>>() * 2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here Precision<usize> is an enum and does not have a heap allocated fields, so it is allocated in the stack.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these things are usually Arc'ed - so everything should be moved to the heap, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So size_of::<Precision<usize>>() * 2 should be removed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so, if we want to follow the trait in arrow, which I think according to #19599 (comment) was the conclusion of the next step? Do you plan on push a commit to do so?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes i am on it. pr coming up soon.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to be able to get the heap size of arrays to implement it for Statistics? What's the chain of fields that takes us there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bummer. Isn't there ways to get the size of an array in memory? E.g. Array::get_array_memory_size?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may actually work. Thanks, I will try that.

/// Returns the memory size in bytes.
pub fn heap_size(&self) -> usize {
// column_statistics + num_rows + total_byte_size
self.column_statistics.capacity() * size_of::<ColumnStatistics>()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO you need to iterate over the column_statistics and add their heap allocations
https://github.com/mkleen/datafusion/blob/41b875d19d9c3671e141cec11814afd91a06a0f1/datafusion/common/src/stats.rs#L732-L736 - there are Precision<ScalarValue> fields and the ScalarValue enum has variants which use String, Vec and Box

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Should we add heap_size() to ScalarValue as well?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback. So would it make sense to add heap_size() to column_statistics ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we want to accurately calculate the sizes we need to propagate that method all the way down. I'm not sure at what point we should just be using https://crates.io/crates/deepsize or similar.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly none looks all that popular (going off of stars) or battle tested. Hard choice.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deepsize2 is a fork of deepsize + applying all opened PRs and updating the (optional) dependencies

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah makes sense then as the choice if we’re going to use an external crate. It seems like the most ergonomic. This does seems like something that should be more of a community wide decision than isolated in this PR though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be the next step then? File a new issue?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened #19615. I think for this PR I would recommend adopting arrow-rs's HeapSize with manual implementations. If we add a macro there later we can update to use that.

This adds a heap_size method returning the amount of memory a statistics
struct allocates on the heap.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common Related to common crate execution Related to the execution crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants