-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
docs: simplify Coreax.Coreset
docstring maths
#684
Conversation
Given the definitions here, should we be checking that the weights are non-negative in some kind of |
3.1.1 from https://arxiv.org/pdf/1204.1664 gives good reasoning for why non-negative and normalised weights should probably not be enforced in Coreax. |
The definition given in the docstring is very good, but seems quite linked to the way in which the recombination discusses coresets. e.g. not all coreset algorithms will satisfy the preservation of the CoM. I think trying to get an extremely mathematically precise definition of a coreset is going to be very hard as coresets are discussed and introduced in very different ways in the literature. The TLDR: "a coreset is a reduced set of :math: is a great (not super mathematical) summary of what a coreset is. When we start talking about measures things could get more confusing for users as there exist many coreset algorithms out there, including probably the most popular in Kernel Herding, which don't discuss measures at all but instead talk in terms of distributions. I am not sure what the best action would be here, but I would be tempted to keep things as simple as possible, and transfer the measure discussion to the Tree recombination docstring. Perhaps following the TLDR we could add a few (as brief as possible without being misleading) examples of what the "important" properties are for specific algorithms such as CoM for Tree Recombination or the MMD for Kernel Herding. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comments.
I would argue that Kernel Herding does preserve the CoM, where the test-functions are implied by the kernel. I.E. the kernel mean embedding (KME) is the CoM and the MMD gives a measure of how well the CoM has been preserved. While I think we can argue the CoM case for any method that attempts to (approximately) preserve some integrated quantity of a dataset, it might be best to move this to the module docstring. There we can explain that certain algorithms such as recombination, (I think kernel herding), etc... attempt to solve this more precisely defined problem, but that the general definition of "important" can be very much solver/algorithm dependant. We could refer to such coreset algorithms as "integral-preserving/expectation-approximating" coresets, but finding a good taxonomy for all the algorithms would be a fair amount of work. What do you think? I agree that there needs to be a clearer mention of distributions, or at least the fact that a measure, in the sense considered here, is a generalisation of a distribution. |
I agree that KH falls under the CoM definition, but I think my main issue is that this language is never used to introduce the concept of, or discuss, coresets in the KH paper. It feels like quite a recombination specific definition. There is no problem with that if we feel like this definition is the widest possible one, and that is what we want the goal of the I would lean towards avoiding trying to define some mathematical language i.e. "integral-preserving/expectation-approximating", and just say coresets tend to be built such that they "(approximately) preserve some integrated quantity of a dataset". The more specific we get with the mathematical terminology, I feel the more likely we are to exclude some algorithms when we need not, and confuse the user, when we could just point towards papers which introduce coresets in their own individual way. Hopefully I am making sense. |
@tc85324 Do you have any more thoughts on this? |
I will revisit this once #504 is merged. |
Blocked due to ^ |
Now unblocked @tc85324 |
@tc85324 where are we with this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes made since, reviewed by others
PR Type
Description
Improves the presentation and clarity of the mathematics within the
Coreax.Coreset
andCoreax.Coresubset
docstrings. This should make it much easier for users and contributors to understand our definitions and/or assumptions about "coresets".If you have any ideas on how to make the documentation even simpler, or more mathematically precision, I would be keen to discuss them here.
How Has This Been Tested?
Documentation builds.
Does this PR introduce a breaking change?
N/A
Checklist before requesting a review