Commit bc6584f

authored
Merge pull request #684 from gchq/feature/improve-coreset-docs
docs: simplify `Coreax.Coreset` docstring maths
2 parents 03220a5 + a9e8571 commit bc6584f

2 files changed

+19
-46
lines changed

.cspell/library_terms.txt

Lines changed: 1 addition & 0 deletions
@@ -83,6 +83,7 @@ ndim
 newaxis
 nobs
 nonzero
+notin
 numpy
 opencv
 operatorname

coreax/coreset.py

Lines changed: 18 additions & 46 deletions
@@ -35,41 +35,16 @@ class Coreset(eqx.Module, Generic[_Data]):
 r"""
 Data structure for representing a coreset.
 
-TLDR: a coreset is a reduced set of :math:`\hat{n}` (potentially weighted) data
-points that, in some sense, best represent the "important" properties of a larger
-set of :math:`n > \hat{n}` (potentially weighted) data points.
+A coreset is a reduced set of :math:`\hat{n}` (potentially weighted) data points,
+:math:`\hat{X} := \{(\hat{x}_i, \hat{w}_i)\}_{i=1}^\hat{n}` that, in some sense,
+best represent the "important" properties of a larger set of :math:`n > \hat{n}`
+(potentially weighted) data points :math:`X := \{(x_i, w_i)\}_{i=1}^n`.
 
-Given a dataset :math:`X = \{x_i\}_{i=1}^n, x \in \Omega`, where each node is paired
-with a non-negative (probability) weight :math:`w_i \in \mathbb{R} \ge 0`, there
-exists an implied discrete (probability) measure over :math:`\Omega`
+:math:`\hat{x}_i, x_i \in \Omega` represent the data points/nodes and
+:math:`\hat{w}_i, w_i \in \mathbb{R}` represent the associated weights.
 
-.. math::
-\eta_n = \sum_{i=1}^{n} w_i \delta_{x_i}.
-
-If we then specify a set of test-functions :math:`\Phi = {\phi_1, \dots, \phi_M}`,
-where :math:`\phi_i \colon \Omega \to \mathbb{R}`, which somehow capture the
-"important" properties of the data, then there also exists an implied push-forward
-measure over :math:`\mathbb{R}^M`
-
-.. math::
-\mu_n = \sum_{i=1}^{n} w_i \delta_{\Phi(x_i)}.
-
-A coreset is simply a reduced measure containing :math:`\hat{n} < n` updated nodes
-:math:`\hat{x}_i` and weights :math:`\hat{w}_i`, such that the push-forward measure
-of the coreset :math:`\nu_\hat{n}` has (approximately for some algorithms) the same
-"centre-of-mass" as the push-forward measure for the original data :math:`\mu_n`
-
-.. math::
-\text{CoM}(\mu_n) = \text{CoM}(\nu_\hat{n}),
-\text{CoM}(\nu_\hat{n}) = \int_\Omega \Phi(\omega) d\nu_\hat{x}(\omega),
-\text{CoM}(\nu_\hat{n}) = \sum_{i=1}^\hat{n} \hat{w}_i \delta_{\Phi(\hat{x}_i)}.
-
-.. note::
-Depending on the algorithm, the test-functions may be explicitly specified by
-the user, or implicitly defined by the algorithm's specific objectives.
-
-:param nodes: The (weighted) coreset nodes, math:`x_i \in \text{supp}(\nu_\hat{n})`;
-once instantiated, the nodes should be accessed via :meth:`Coresubset.coreset`
+:param nodes: The (weighted) coreset nodes, :math:`\hat{x}_i`; once instantiated,
+the nodes should only be accessed via :meth:`Coresubset.coreset`
 :param pre_coreset_data: The dataset :math:`X` used to construct the coreset.
 """
 

@@ -125,27 +100,24 @@ class Coresubset(Coreset[Data], Generic[_Data]):
 r"""
 Data structure for representing a coresubset.
 
-A coresubset is a :class`Coreset`, with the additional condition that the support of
-the reduced measure (the coreset), must be a subset of the support of the original
-measure (the original data), such that
+A coresubset is a :class:`Coreset`, with the additional condition that the coreset
+data points/nodes must be a subset of the original data points/nodes, such that
 
 .. math::
 \hat{x}_i = x_i, \forall i \in I,
-I \subset \{1, \dots, n\}, text{card}(I) = \hat{n}.
+I \subset \{1, \dots, n\}, \text{card}(I) = \hat{n}.
 
 Thus, a coresubset, unlike a coreset, ensures that feasibility constraints on the
-support of the measure are maintained :cite:`litterer2012recombination`. This is
-vital if, for example, the test-functions are only defined on the support of the
-original measure/nodes, rather than all of :math:`\Omega`.
+support of the measure are maintained :cite:`litterer2012recombination`.
 
-In coresubsets, the measure reduction can be implicit (setting weights/nodes to
-zero for all :math:`i \in I \ {1, \dots, n}`) or explicit (removing entries from the
-weight/node arrays). The implicit approach is useful when input/output array shape
-stability is required (E.G. for some JAX transformations); the explicit approach is
-more similar to a standard coreset.
+In coresubsets, the dataset reduction can be implicit (setting weights/nodes to zero
+for all :math:`i \notin I`) or explicit (removing entries from the weight/node
+arrays). The implicit approach is useful when input/output array shape stability is
+required (E.G. for some JAX transformations); the explicit approach is more similar
+to a standard coreset.
 
 :param nodes: The (weighted) coresubset node indices, :math:`I`; the materialised
-coresubset nodes should be accessed via :meth:`Coresubset.coreset`.
+coresubset nodes should only be accessed via :meth:`Coresubset.coreset`.
 :param pre_coreset_data: The dataset :math:`X` used to construct the coreset.
 """
 
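
The implicit/explicit distinction in the `Coresubset` docstring can be illustrated with a minimal sketch. It does not use the coreax API: plain Python lists stand in for arrays, and `explicit_reduction`, `implicit_reduction`, `data`, `weights` and `selected` are hypothetical names chosen for illustration. As a simplification, the implicit variant zeroes only the weights and leaves the nodes untouched.

```python
# Illustrative sketch only: plain Python lists stand in for arrays, and none
# of these names come from coreax.
data = [[0.0, 0.0], [1.0, 0.5], [2.0, 1.0], [3.0, 1.5]]  # x_1, ..., x_n
weights = [0.25, 0.25, 0.25, 0.25]                        # w_1, ..., w_n
selected = [0, 2]  # hypothetical index set I (0-based), card(I) = 2


def explicit_reduction(data, weights, selected):
    """Remove every entry whose index is not in ``selected``."""
    nodes = [data[i] for i in selected]
    kept_weights = [weights[i] for i in selected]
    return nodes, kept_weights


def implicit_reduction(data, weights, selected):
    """Zero the weight of every entry not in ``selected``; lengths are unchanged."""
    keep = set(selected)
    masked = [w if i in keep else 0.0 for i, w in enumerate(weights)]
    return data, masked


def weighted_sum(nodes, node_weights):
    """Sum of weight * node, computed per coordinate."""
    dim = len(nodes[0])
    return [sum(w * x[d] for x, w in zip(nodes, node_weights)) for d in range(dim)]


explicit_nodes, explicit_weights = explicit_reduction(data, weights, selected)
implicit_nodes, implicit_weights = implicit_reduction(data, weights, selected)
```

Both variants describe the same weighted point set, so any weighted sum over the nodes agrees. Only the explicit variant changes the array length, which is why the implicit form suits code that needs fixed input/output shapes.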
