docs: simplify `Coreax.Coreset` docstring maths #684

tc85324 · 2024-06-28T09:03:55Z

PR Type

Documentation content changes

Description

Improves the presentation and clarity of the mathematics within the Coreax.Coreset and Coreax.Coresubset docstrings. This should make it much easier for users and contributors to understand our definitions and/or assumptions about "coresets".

If you have any ideas on how to make the documentation even simpler, or more mathematically precision, I would be keen to discuss them here.

How Has This Been Tested?

Documentation builds.

Does this PR introduce a breaking change?

N/A

Checklist before requesting a review

I have made sure that my PR is not a duplicate.
My code follows the style guidelines of this project.
I have commented my code, particularly in hard-to-understand areas.
I have performed a self-review of my code.
I have made corresponding changes to the documentation.
My changes generate no new warnings.
I have added tests that prove my fix is effective or that my feature works.
New and existing unit tests pass locally with my changes.
Any dependent changes have been merged and published in downstream modules.

tc85324 · 2024-06-28T10:49:30Z

Given the definitions here, should we be checking that the weights are non-negative in some kind of __check_init__? More generally, should this check be on the Data classes. I.E. should we only allow handling of non-negative weights in Data? If we don't expect any of the current or future solvers/metrics to handle negative weights, then I don't think we would be losing anything by adding this constraint to Data.

db091756 · 2024-07-01T10:14:43Z

Given the definitions here, should we be checking that the weights are non-negative in some kind of __check_init__? More generally, should this check be on the Data classes. I.E. should we only allow handling of non-negative weights in Data? If we don't expect any of the current or future solvers/metrics to handle negative weights, then I don't think we would be losing anything by adding this constraint to Data.

3.1.1 from https://arxiv.org/pdf/1204.1664 gives good reasoning for why non-negative and normalised weights should probably not be enforced in Coreax.

db091756 · 2024-07-01T10:30:06Z

The definition given in the docstring is very good, but seems quite linked to the way in which the recombination discusses coresets. e.g. not all coreset algorithms will satisfy the preservation of the CoM.

I think trying to get an extremely mathematically precise definition of a coreset is going to be very hard as coresets are discussed and introduced in very different ways in the literature. The TLDR:

"a coreset is a reduced set of :math:\hat{n} (potentially weighted) data
points that, in some sense, best represent the "important" properties of a larger
set of :math:n > \hat{n} (potentially weighted) data points."

is a great (not super mathematical) summary of what a coreset is. When we start talking about measures things could get more confusing for users as there exist many coreset algorithms out there, including probably the most popular in Kernel Herding, which don't discuss measures at all but instead talk in terms of distributions.

I am not sure what the best action would be here, but I would be tempted to keep things as simple as possible, and transfer the measure discussion to the Tree recombination docstring.

Perhaps following the TLDR we could add a few (as brief as possible without being misleading) examples of what the "important" properties are for specific algorithms such as CoM for Tree Recombination or the MMD for Kernel Herding.

db091756

See comments.

tc85324 · 2024-07-08T10:58:12Z

I would argue that Kernel Herding does preserve the CoM, where the test-functions are implied by the kernel. I.E. the kernel mean embedding (KME) is the CoM and the MMD gives a measure of how well the CoM has been preserved.

While I think we can argue the CoM case for any method that attempts to (approximately) preserve some integrated quantity of a dataset, it might be best to move this to the module docstring. There we can explain that certain algorithms such as recombination, (I think kernel herding), etc... attempt to solve this more precisely defined problem, but that the general definition of "important" can be very much solver/algorithm dependant. We could refer to such coreset algorithms as "integral-preserving/expectation-approximating" coresets, but finding a good taxonomy for all the algorithms would be a fair amount of work. What do you think?

I agree that there needs to be a clearer mention of distributions, or at least the fact that a measure, in the sense considered here, is a generalisation of a distribution.

db091756 · 2024-07-09T08:11:59Z

I would argue that Kernel Herding does preserve the CoM, where the test-functions are implied by the kernel. I.E. the kernel mean embedding (KME) is the CoM and the MMD gives a measure of how well the CoM has been preserved.

While I think we can argue the CoM case for any method that attempts to (approximately) preserve some integrated quantity of a dataset, it might be best to move this to the module docstring. There we can explain that certain algorithms such as recombination, (I think kernel herding), etc... attempt to solve this more precisely defined problem, but that the general definition of "important" can be very much solver/algorithm dependant. We could refer to such coreset algorithms as "integral-preserving/expectation-approximating" coresets, but finding a good taxonomy for all the algorithms would be a fair amount of work. What do you think?

I agree that there needs to be a clearer mention of distributions, or at least the fact that a measure, in the sense considered here, is a generalisation of a distribution.

I agree that KH falls under the CoM definition, but I think my main issue is that this language is never used to introduce the concept of, or discuss, coresets in the KH paper. It feels like quite a recombination specific definition. There is no problem with that if we feel like this definition is the widest possible one, and that is what we want the goal of the Coreset docstring to be . But I agree that finding a precise mathematical definition that covers all current (and future) Coreax coreset algorithms is going to be very tricky, and not even necessarily what the goal of the docstring should be.

I would lean towards avoiding trying to define some mathematical language i.e. "integral-preserving/expectation-approximating", and just say coresets tend to be built such that they "(approximately) preserve some integrated quantity of a dataset". The more specific we get with the mathematical terminology, I feel the more likely we are to exclude some algorithms when we need not, and confuse the user, when we could just point towards papers which introduce coresets in their own individual way.

Hopefully I am making sense.

db091756 · 2024-07-30T14:38:32Z

@tc85324 Do you have any more thoughts on this?

tc85324 · 2024-07-31T08:02:45Z

I will revisit this once #504 is merged.

db091756 · 2024-08-09T08:36:08Z

I will revisit this once #504 is merged.

Blocked due to ^

db091756 · 2024-09-08T08:24:09Z

Now unblocked @tc85324

tp832944 · 2024-10-16T10:34:40Z

@tc85324 where are we with this PR?

@db091756

Replaces complex measure theoretic definitions with more accessible ones while maintaining the required mathematical notation. Thanks to @db091756 for the helpful discussions. Refs: #684

github-actions · 2024-10-18T14:56:29Z

Performance review

Commit `e6e0b76` - Merge `e3d2291` into `75a8b41`

No significant changes to performance.

…et-docs

github-actions · 2024-10-19T10:16:02Z

Performance review

Commit `a230649` - Merge `66cf481` into `75a8b41`

No significant changes to performance.

tc85324

@tc85324 where are we with this PR?

Sorry for the long delay in getting back to this. With the latest commits, and thanks to the disucssions with @db091756, the PR should now be ready for review again.

github-actions · 2024-10-21T11:37:50Z

Performance review

Commit `25ef21d` - Merge `a9e8571` into `03220a5`

No significant changes to performance.

rg936672

Approving just @pc532627's latest commit - I haven't looked at the rest of the changes. So long as @pc532627 is happy with everything before that commit (and so long as the CI passes!), this can be merged :)

Changes made since, reviewed by others

docs: simplify Coreax.Coreset docstring maths

facc25f

tc85324 requested review from tp832944 and db091756 June 28, 2024 09:04

db091756 self-assigned this Jul 1, 2024

db091756 previously requested changes Jul 1, 2024

View reviewed changes

tp832944 assigned tp832944 and unassigned db091756 Oct 16, 2024

docs: simplify coreset docstrings.

e3d2291

Replaces complex measure theoretic definitions with more accessible ones while maintaining the required mathematical notation. Thanks to @db091756 for the helpful discussions. Refs: #684

Merge remote-tracking branch 'origin/main' into feature/improve-cores…

66cf481

…et-docs

tc85324 commented Oct 19, 2024

View reviewed changes

tc85324 requested a review from db091756 October 19, 2024 10:17

Minor typo updates

a9e8571

rg936672 approved these changes Oct 21, 2024

View reviewed changes

pc532627 merged commit bc6584f into main Oct 21, 2024
19 checks passed

pc532627 deleted the feature/improve-coreset-docs branch October 21, 2024 12:11

docs: simplify Coreax.Coreset docstring maths #684

docs: simplify Coreax.Coreset docstring maths #684

Uh oh!

Conversation

tc85324 commented Jun 28, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Type

Description

How Has This Been Tested?

Does this PR introduce a breaking change?

Checklist before requesting a review

Uh oh!

tc85324 commented Jun 28, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

db091756 commented Jul 1, 2024

Uh oh!

db091756 commented Jul 1, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

db091756 left a comment

Choose a reason for hiding this comment

Uh oh!

tc85324 commented Jul 8, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

db091756 commented Jul 9, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

db091756 commented Jul 30, 2024

Uh oh!

tc85324 commented Jul 31, 2024

Uh oh!

db091756 commented Aug 9, 2024

Uh oh!

db091756 commented Sep 8, 2024

Uh oh!

tp832944 commented Oct 16, 2024

Uh oh!

github-actions bot commented Oct 18, 2024

Performance review

Commit e6e0b76 - Merge e3d2291 into 75a8b41

Uh oh!

github-actions bot commented Oct 19, 2024

Performance review

Commit a230649 - Merge 66cf481 into 75a8b41

Uh oh!

tc85324 left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Oct 21, 2024

Performance review

Commit 25ef21d - Merge a9e8571 into 03220a5

Uh oh!

rg936672 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

docs: simplify `Coreax.Coreset` docstring maths #684

docs: simplify `Coreax.Coreset` docstring maths #684

tc85324 commented Jun 28, 2024 •

edited

Loading

tc85324 commented Jun 28, 2024 •

edited

Loading

db091756 commented Jul 1, 2024 •

edited

Loading

tc85324 commented Jul 8, 2024 •

edited

Loading

db091756 commented Jul 9, 2024 •

edited

Loading

Commit `e6e0b76` - Merge `e3d2291` into `75a8b41`

Commit `a230649` - Merge `66cf481` into `75a8b41`

Commit `25ef21d` - Merge `a9e8571` into `03220a5`

rg936672 left a comment •

edited

Loading