Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: simplify Coreax.Coreset docstring maths #684

Merged
merged 4 commits into from
Oct 21, 2024

Conversation

tc85324
Copy link
Contributor

@tc85324 tc85324 commented Jun 28, 2024

PR Type

  • Documentation content changes

Description

Improves the presentation and clarity of the mathematics within the Coreax.Coreset and Coreax.Coresubset docstrings. This should make it much easier for users and contributors to understand our definitions and/or assumptions about "coresets".

If you have any ideas on how to make the documentation even simpler, or more mathematically precision, I would be keen to discuss them here.

How Has This Been Tested?

Documentation builds.

Does this PR introduce a breaking change?

N/A

Checklist before requesting a review

  • I have made sure that my PR is not a duplicate.
  • My code follows the style guidelines of this project.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have performed a self-review of my code.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • I have added tests that prove my fix is effective or that my feature works.
  • New and existing unit tests pass locally with my changes.
  • Any dependent changes have been merged and published in downstream modules.

@tc85324
Copy link
Contributor Author

tc85324 commented Jun 28, 2024

Given the definitions here, should we be checking that the weights are non-negative in some kind of __check_init__? More generally, should this check be on the Data classes. I.E. should we only allow handling of non-negative weights in Data? If we don't expect any of the current or future solvers/metrics to handle negative weights, then I don't think we would be losing anything by adding this constraint to Data.

@db091756 db091756 self-assigned this Jul 1, 2024
@db091756
Copy link
Member

db091756 commented Jul 1, 2024

Given the definitions here, should we be checking that the weights are non-negative in some kind of __check_init__? More generally, should this check be on the Data classes. I.E. should we only allow handling of non-negative weights in Data? If we don't expect any of the current or future solvers/metrics to handle negative weights, then I don't think we would be losing anything by adding this constraint to Data.

3.1.1 from https://arxiv.org/pdf/1204.1664 gives good reasoning for why non-negative and normalised weights should probably not be enforced in Coreax.

@db091756
Copy link
Member

db091756 commented Jul 1, 2024

The definition given in the docstring is very good, but seems quite linked to the way in which the recombination discusses coresets. e.g. not all coreset algorithms will satisfy the preservation of the CoM.

I think trying to get an extremely mathematically precise definition of a coreset is going to be very hard as coresets are discussed and introduced in very different ways in the literature. The TLDR:

"a coreset is a reduced set of :math:\hat{n} (potentially weighted) data
points that, in some sense, best represent the "important" properties of a larger
set of :math:n > \hat{n} (potentially weighted) data points."

is a great (not super mathematical) summary of what a coreset is. When we start talking about measures things could get more confusing for users as there exist many coreset algorithms out there, including probably the most popular in Kernel Herding, which don't discuss measures at all but instead talk in terms of distributions.

I am not sure what the best action would be here, but I would be tempted to keep things as simple as possible, and transfer the measure discussion to the Tree recombination docstring.

Perhaps following the TLDR we could add a few (as brief as possible without being misleading) examples of what the "important" properties are for specific algorithms such as CoM for Tree Recombination or the MMD for Kernel Herding.

Copy link
Member

@db091756 db091756 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments.

@tc85324
Copy link
Contributor Author

tc85324 commented Jul 8, 2024

I would argue that Kernel Herding does preserve the CoM, where the test-functions are implied by the kernel. I.E. the kernel mean embedding (KME) is the CoM and the MMD gives a measure of how well the CoM has been preserved.

While I think we can argue the CoM case for any method that attempts to (approximately) preserve some integrated quantity of a dataset, it might be best to move this to the module docstring. There we can explain that certain algorithms such as recombination, (I think kernel herding), etc... attempt to solve this more precisely defined problem, but that the general definition of "important" can be very much solver/algorithm dependant. We could refer to such coreset algorithms as "integral-preserving/expectation-approximating" coresets, but finding a good taxonomy for all the algorithms would be a fair amount of work. What do you think?

I agree that there needs to be a clearer mention of distributions, or at least the fact that a measure, in the sense considered here, is a generalisation of a distribution.

@db091756
Copy link
Member

db091756 commented Jul 9, 2024

I would argue that Kernel Herding does preserve the CoM, where the test-functions are implied by the kernel. I.E. the kernel mean embedding (KME) is the CoM and the MMD gives a measure of how well the CoM has been preserved.

While I think we can argue the CoM case for any method that attempts to (approximately) preserve some integrated quantity of a dataset, it might be best to move this to the module docstring. There we can explain that certain algorithms such as recombination, (I think kernel herding), etc... attempt to solve this more precisely defined problem, but that the general definition of "important" can be very much solver/algorithm dependant. We could refer to such coreset algorithms as "integral-preserving/expectation-approximating" coresets, but finding a good taxonomy for all the algorithms would be a fair amount of work. What do you think?

I agree that there needs to be a clearer mention of distributions, or at least the fact that a measure, in the sense considered here, is a generalisation of a distribution.

I agree that KH falls under the CoM definition, but I think my main issue is that this language is never used to introduce the concept of, or discuss, coresets in the KH paper. It feels like quite a recombination specific definition. There is no problem with that if we feel like this definition is the widest possible one, and that is what we want the goal of the Coreset docstring to be . But I agree that finding a precise mathematical definition that covers all current (and future) Coreax coreset algorithms is going to be very tricky, and not even necessarily what the goal of the docstring should be.

I would lean towards avoiding trying to define some mathematical language i.e. "integral-preserving/expectation-approximating", and just say coresets tend to be built such that they "(approximately) preserve some integrated quantity of a dataset". The more specific we get with the mathematical terminology, I feel the more likely we are to exclude some algorithms when we need not, and confuse the user, when we could just point towards papers which introduce coresets in their own individual way.

Hopefully I am making sense.

@db091756
Copy link
Member

@tc85324 Do you have any more thoughts on this?

@tc85324
Copy link
Contributor Author

tc85324 commented Jul 31, 2024

I will revisit this once #504 is merged.

@db091756
Copy link
Member

db091756 commented Aug 9, 2024

I will revisit this once #504 is merged.

Blocked due to ^

@db091756
Copy link
Member

db091756 commented Sep 8, 2024

Now unblocked @tc85324

@tp832944 tp832944 assigned tp832944 and unassigned db091756 Oct 16, 2024
@tp832944
Copy link
Contributor

@tc85324 where are we with this PR?

Replaces complex measure theoretic definitions with more accessible
ones while maintaining the required mathematical notation.

Thanks to @db091756 for the helpful discussions.

Refs: #684
Copy link
Contributor

Performance review

Commit e6e0b76 - Merge e3d2291 into 75a8b41

No significant changes to performance.

Copy link
Contributor

Performance review

Commit a230649 - Merge 66cf481 into 75a8b41

No significant changes to performance.

Copy link
Contributor Author

@tc85324 tc85324 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tc85324 where are we with this PR?

Sorry for the long delay in getting back to this. With the latest commits, and thanks to the disucssions with @db091756, the PR should now be ready for review again.

@tc85324 tc85324 requested a review from db091756 October 19, 2024 10:17
Copy link
Contributor

Performance review

Commit 25ef21d - Merge a9e8571 into 03220a5

No significant changes to performance.

Copy link
Contributor

@rg936672 rg936672 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving just @pc532627's latest commit - I haven't looked at the rest of the changes. So long as @pc532627 is happy with everything before that commit (and so long as the CI passes!), this can be merged :)

@pc532627 pc532627 dismissed db091756’s stale review October 21, 2024 12:11

Changes made since, reviewed by others

@pc532627 pc532627 merged commit bc6584f into main Oct 21, 2024
19 checks passed
@pc532627 pc532627 deleted the feature/improve-coreset-docs branch October 21, 2024 12:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants