
Conversation

@CarloMariaProietti (Contributor) commented Dec 19, 2025

Fixed version of #1636

Fixes #1492
The idea is the following:
ValueColumnInternal is an interface for statistic values; this way, they are not exposed as part of the public API.
Implementations of ValueColumnInternal contain the actual cache.

Two caches are needed for each statistic (for the moment only max) because computing the statistic may give different results depending on the skipNaN boolean parameter.

I implemented the solution by overloading aggregateSingleColumn; the overload wraps the original aggregateSingleColumn so that the caches can be used.

For the moment only max is cached, but it would be easy to do the same for min, sum, mean, and median.
Something similar could be done for percentile and std.
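
A minimal sketch of the idea (the names and structure below are illustrative, not the exact code in this PR):

```kotlin
// Simplified sketch of the caching idea: the interface is internal, its implementation
// owns the cache, and each statistic can have two cached entries (skipNaN = true/false).
// All names below are illustrative placeholders.
internal interface ValueColumnInternal<T> {
    // Returns the cached value for (statName, skipNaN), computing and storing it on first use.
    fun <R : Any> cachedStat(statName: String, skipNaN: Boolean, compute: () -> R): R
}

internal class ValueColumnInternalImpl<T>(private val values: List<T>) : ValueColumnInternal<T> {
    // Keyed by (statistic name, skipNaN): the same statistic can yield different results
    // depending on how NaN values are treated, so both variants are cached separately.
    private val cache = mutableMapOf<Pair<String, Boolean>, Any>()

    @Suppress("UNCHECKED_CAST")
    override fun <R : Any> cachedStat(statName: String, skipNaN: Boolean, compute: () -> R): R =
        cache.getOrPut(statName to skipNaN) { compute() } as R
}

// The overloaded aggregateSingleColumn would then wrap the original computation,
// e.g. column.cachedStat("max", skipNaN) { /* call the original aggregateSingleColumn */ }.
```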

@Jolanrensen (Collaborator) commented Jan 19, 2026

Alright, it seems the implementation is in place. However, we never test that the cache is actually being used. For all we know, the statistic is still being recalculated every time. Please add some tests to the statistics tests verifying that the cache is actually populated when a new statistic is calculated.

You could also test the retrieval side by manually storing a deliberately wrong value in the cache and then checking that it reappears when the statistic is requested again.

(now you can see why it makes sense to have the logic of interacting with the cache inside the ValueColumn instead of on the aggregator side :) Makes it easier to test)
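
For illustration, tests along these lines could work, reusing the ValueColumnInternalImpl sketch from the description above (the real tests would of course go through the actual column and aggregator API):

```kotlin
import kotlin.test.Test
import kotlin.test.assertEquals
import kotlin.test.assertFalse

class MaxCacheTest {
    @Test
    fun `cache is populated and reused after computing max`() {
        val col = ValueColumnInternalImpl(listOf(3, 1, 2))
        assertEquals(3, col.cachedStat("max", skipNaN = true) { maxOf(3, 1, 2) })

        // Ask for the same statistic again: the compute lambda must not run a second time.
        var recomputed = false
        val second = col.cachedStat("max", skipNaN = true) { recomputed = true; maxOf(3, 1, 2) }
        assertEquals(3, second)
        assertFalse(recomputed, "max should be served from the cache, not recalculated")
    }

    @Test
    fun `a manually seeded wrong value is returned instead of being recomputed`() {
        val col = ValueColumnInternalImpl(listOf(3, 1, 2))
        // Seed the cache with a deliberately wrong value...
        col.cachedStat("max", skipNaN = true) { -999 }
        // ...and check that the wrong value reappears, proving the cache is actually read.
        assertEquals(-999, col.cachedStat("max", skipNaN = true) { maxOf(3, 1, 2) })
    }
}
```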

@CarloMariaProietti (Contributor Author)

> (now you can see why it makes sense to have the logic of interacting with the cache inside the ValueColumn instead of on the aggregator side :) Makes it easier to test)

Yes, much better. Until now I had only tested the cache behaviour with print statements, and it seems to work correctly.
Also, I posted a message in the Slack datascience channel about Google Summer of Code; I would be very grateful if you could read that proposal and give me your feedback :)

@Jolanrensen (Collaborator)

> Also, I posted a message in the Slack datascience channel about Google Summer of Code; I would be very grateful if you could read that proposal and give me your feedback :)

We'll discuss this in the team first and get back to you later, if that's okay

@CarloMariaProietti (Contributor Author)

> > Also, I posted a message in the Slack datascience channel about Google Summer of Code; I would be very grateful if you could read that proposal and give me your feedback :)
>
> We'll discuss this in the team first and get back to you later, if that's okay

Yes, of course.

@Jolanrensen (Collaborator) commented Jan 22, 2026

To be able to close #1492 we also need some tests that verify the original issue is now fixed. So please try the example with df.filter { x < df["valueCol"].cast<Int>().max() } or something like that, where .max() would be called for each row in the DataFrame and should now benefit from the cache.

Also test some dataframe-wide statistic function like df.max { "valueCol"<Int>() } (and then check that df["valueCol"].asValueColumn().internalValueColumn().getStatisticsCache... actually works).
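
A rough sketch of what those tests could look like, using only the public API from the examples above; the sample data and assertions are placeholders, and the cache inspection is left as a comment since the internal accessor names aren't final:

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*
import kotlin.test.Test
import kotlin.test.assertEquals

class LazyMaxRegressionTest {
    private val df = dataFrameOf("valueCol")(3, 1, 4, 1, 5)

    @Test
    fun `filtering against max benefits from the cache`() {
        // The scenario from #1492: max() is evaluated inside the row filter,
        // so without the cache it would be recomputed for every row.
        val filtered = df.filter { "valueCol"<Int>() < df["valueCol"].cast<Int>().max() }
        assertEquals(4, filtered.rowsCount())
    }

    @Test
    fun `dataframe-wide max populates the column cache`() {
        assertEquals(5, df.max { "valueCol"<Int>() })
        // Here the test would additionally inspect the internal column's statistics cache
        // (via whatever accessor the PR ends up exposing) to assert that "max" is now cached.
    }
}
```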

This actually made me think of something: try renaming a column that has a statistics cache :) Even though the contents of the column are exactly the same as before, the cache will be gone because a new ValueColumnImpl is instantiated. The same applies to changeType(). I believe you could supply your statisticsCache to the new instance :) (I also think you don't even need to 'copy' the cache, just pass it on. Since values is immutable, anything calculated from the same values still holds, even if the column is renamed or placed inside another dataframe.)
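
A minimal sketch of that pass-it-on idea, using a simplified, hypothetical column class rather than the real ValueColumnImpl:

```kotlin
// Hypothetical column class illustrating why the cache can simply be handed over:
// the cache only depends on `values`, and `values` does not change on rename.
class CachedValueColumn<T>(
    val name: String,
    val values: List<T>,
    // Keyed by (statistic name, skipNaN); derived purely from `values`.
    private val statisticsCache: MutableMap<Pair<String, Boolean>, Any> = mutableMapOf(),
) {
    fun rename(newName: String): CachedValueColumn<T> =
        // No copy needed: the same cache instance stays valid because the values are identical,
        // even if the renamed column ends up inside another dataframe.
        CachedValueColumn(newName, values, statisticsCache)
}
```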
