
Conversation

@CarloMariaProietti (Contributor) commented Dec 19, 2025

Fixed version of #1636

Fixes #1492
The idea is the following:
ValueColumnInternal is an interface for statistic values; this way, they are not exposed as part of the public API.
Implementations of ValueColumnInternal contain the actual cache.

Two caches are needed for each statistic (for the moment only max) because computing the statistic may give different results depending on the skipNaN boolean parameter.

I implemented the solution by overloading aggregateSingleColumn; the overload wraps the original aggregateSingleColumn so that the caches can be used.

For the moment only max is cached, but it would be easy to do the same for min, sum, mean, and median.
Something similar could be done for percentile and std.
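
A minimal sketch of the idea (the names and structure below are illustrative, not the exact code in this PR):

```kotlin
// Simplified sketch of the caching idea: the interface is internal, its implementation
// owns the cache, and each statistic can have two cached entries (skipNaN = true/false).
// All names below are illustrative placeholders.
internal interface ValueColumnInternal<T> {
    // Returns the cached value for (statName, skipNaN), computing and storing it on first use.
    fun <R : Any> cachedStat(statName: String, skipNaN: Boolean, compute: () -> R): R
}

internal class ValueColumnInternalImpl<T>(private val values: List<T>) : ValueColumnInternal<T> {
    // Keyed by (statistic name, skipNaN): the same statistic can yield different results
    // depending on how NaN values are treated, so both variants are cached separately.
    private val cache = mutableMapOf<Pair<String, Boolean>, Any>()

    @Suppress("UNCHECKED_CAST")
    override fun <R : Any> cachedStat(statName: String, skipNaN: Boolean, compute: () -> R): R =
        cache.getOrPut(statName to skipNaN) { compute() } as R
}

// The overloaded aggregateSingleColumn would then wrap the original computation,
// e.g. column.cachedStat("max", skipNaN) { /* call the original aggregateSingleColumn */ }.
```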

@Jolanrensen (Collaborator) commented Jan 19, 2026

Alright, it seems the implementation is in place. However, we never test that the cache is actually being used. For all we know, the statistic is still being recalculated every time. Please add some tests to the statistics tests verifying that the cache is actually populated when a new statistic is calculated.

You could also test the retrieval side by manually storing a deliberately wrong value in the cache and then checking that it reappears when the statistic is requested again.

(now you can see why it makes sense to have the logic of interacting with the cache inside the ValueColumn instead of on the aggregator side :) Makes it easier to test)
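
For illustration, tests along these lines could work, reusing the ValueColumnInternalImpl sketch from the description above (the real tests would of course go through the actual column and aggregator API):

```kotlin
import kotlin.test.Test
import kotlin.test.assertEquals
import kotlin.test.assertFalse

class MaxCacheTest {
    @Test
    fun `cache is populated and reused after computing max`() {
        val col = ValueColumnInternalImpl(listOf(3, 1, 2))
        assertEquals(3, col.cachedStat("max", skipNaN = true) { maxOf(3, 1, 2) })

        // Ask for the same statistic again: the compute lambda must not run a second time.
        var recomputed = false
        val second = col.cachedStat("max", skipNaN = true) { recomputed = true; maxOf(3, 1, 2) }
        assertEquals(3, second)
        assertFalse(recomputed, "max should be served from the cache, not recalculated")
    }

    @Test
    fun `a manually seeded wrong value is returned instead of being recomputed`() {
        val col = ValueColumnInternalImpl(listOf(3, 1, 2))
        // Seed the cache with a deliberately wrong value...
        col.cachedStat("max", skipNaN = true) { -999 }
        // ...and check that the wrong value reappears, proving the cache is actually read.
        assertEquals(-999, col.cachedStat("max", skipNaN = true) { maxOf(3, 1, 2) })
    }
}
```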

@CarloMariaProietti (Contributor Author)

> (now you can see why it makes sense to have the logic of interacting with the cache inside the ValueColumn instead of on the aggregator side :) Makes it easier to test)

Yes, much better. Until now I had only tested the cache behaviour with print statements, and it seems to work correctly.
Also, I posted a message in the Slack datascience channel about Google Summer of Code; I would be very grateful if you could read that proposal and give me your feedback :)

@Jolanrensen (Collaborator)

> Also, I posted a message in the Slack datascience channel about Google Summer of Code; I would be very grateful if you could read that proposal and give me your feedback :)

We'll discuss this in the team first and get back to you later, if that's okay

@CarloMariaProietti (Contributor Author)

> > Also, I posted a message in the Slack datascience channel about Google Summer of Code; I would be very grateful if you could read that proposal and give me your feedback :)
>
> We'll discuss this in the team first and get back to you later, if that's okay

Yes, of course.

@Jolanrensen (Collaborator) commented Jan 22, 2026

To be able to close #1492 we also need some tests that verify the original issue is now fixed. So please try the example with df.filter { x < df["valueCol"].cast<Int>().max() } or something like that, where .max() would be called for each row in the DataFrame and should now benefit from the cache.

Also test some dataframe-wide statistic function like df.max { "valueCol"<Int>() } (and then check that df["valueCol"].asValueColumn().internalValueColumn().getStatisticsCache... actually works).
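
A rough sketch of what those tests could look like, using only the public API from the examples above; the sample data and assertions are placeholders, and the cache inspection is left as a comment since the internal accessor names aren't final:

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*
import kotlin.test.Test
import kotlin.test.assertEquals

class LazyMaxRegressionTest {
    private val df = dataFrameOf("valueCol")(3, 1, 4, 1, 5)

    @Test
    fun `filtering against max benefits from the cache`() {
        // The scenario from #1492: max() is evaluated inside the row filter,
        // so without the cache it would be recomputed for every row.
        val filtered = df.filter { "valueCol"<Int>() < df["valueCol"].cast<Int>().max() }
        assertEquals(4, filtered.rowsCount())
    }

    @Test
    fun `dataframe-wide max populates the column cache`() {
        assertEquals(5, df.max { "valueCol"<Int>() })
        // Here the test would additionally inspect the internal column's statistics cache
        // (via whatever accessor the PR ends up exposing) to assert that "max" is now cached.
    }
}
```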

This actually made me think of something: try renaming a column that has a statistics cache :) Even though the contents of the column are exactly the same as before, the cache will be gone because a new ValueColumnImpl is instantiated. The same applies to changeType(). I believe you could supply your statisticsCache to the new instance :) (I also think you don't even need to 'copy' the cache, just pass it on. Since values is immutable, anything calculated from the same values still holds, even if the column is renamed or placed inside another dataframe.)
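
A minimal sketch of that pass-it-on idea, using a simplified, hypothetical column class rather than the real ValueColumnImpl:

```kotlin
// Hypothetical column class illustrating why the cache can simply be handed over:
// the cache only depends on `values`, and `values` does not change on rename.
class CachedValueColumn<T>(
    val name: String,
    val values: List<T>,
    // Keyed by (statistic name, skipNaN); derived purely from `values`.
    private val statisticsCache: MutableMap<Pair<String, Boolean>, Any> = mutableMapOf(),
) {
    fun rename(newName: String): CachedValueColumn<T> =
        // No copy needed: the same cache instance stays valid because the values are identical,
        // even if the renamed column ends up inside another dataframe.
        CachedValueColumn(newName, values, statisticsCache)
}
```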
