-
Notifications
You must be signed in to change notification settings - Fork 814
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Possibility to fix points in the low embedding #606
Comments
The code, as it stands, would need some changes to allow this to happen. This is actually already a requested feature (see #432) -- although that request may differ in specific details compared to what you need. It was on my development plan, but I have not gotten to it yet. It is not really that difficult to add, so if you wanted to have a look at doing it yourself I can provide guidance as to what changes would need to be made. |
Yes, that would be fantastic! |
You can email me at "leland DOT mcinnes AT gmail DOT com" and I can give you a general outline and we can discuss from there. |
@matthieuheitz, thank you very much for tackling this! How is it going so far? Update: Ah, I just found your related pull request #620 . |
Yes, it's not complete yet, and it's still breaking some CI checks, so I'll work on that this week. |
Brute-force pinning as in "embedding constraints" (lmcinnes#432 (comment)), while "Possibility to fix points in the low embedding" (lmcinnes#606) is not done.
In the meantime I have tested @dribnet's brute force implementation from #432 (comment) (see my code). I have now first computed an embedding without pinning, then I saved some resulting data point coordinates from there. Of course, computing another embedding (with same data and random seed) by pinning these data points to the saved coordinates leads to an overall different embedding (except for the pinned points). I now wonder whether there could be a stable implementation of pinning, s.t. pinning data points to their current embedding locations would not change the embedding. Maybe one could internally first compute the unpinned embedding, then do an additional "pinning adjustment" phase (which would result in no further change in the above case)? |
In my tests, the mentioned brute force implementation also has the unwanted effect that sometimes the fixed points stay separated from the remaining data. This effect can also be seen in the last image of #432 (comment), I have marked it here: @matthieuheitz, do you also see this effect in your PR? What could be done about this? Maybe the fixed points' positions should already influence how the low dimensional embedding is initialized? |
With jondo@7416cce, based on the current PR state, I could even get an implementation of pinning that is "stable" in the sense I described above: I first compute the sumdist2 = 0
for (index, point) in pinning_points:
dist2 = (point - initial_embedding[index])**2
sumdist2 += dist2.mean()
initial_embedding[index] = point
pin_mask[index] = 0
pinning_rmse = np.sqrt(sumdist2 / len(pinning_points))
n_epochs = int(pinning_rmse * 200) Then I do the pinning epochs (which can be as low as zero due to e442bcd): reducer = umap.UMAP(n_epochs=n_epochs, init=initial_embedding)
embedding = reducer.fit_transform(X, pin_mask=pin_mask) For However, in comparison to just using the spectral embedding as |
The "fixed points stay less connected" artifacts could probably be avoided by implementing this whole "fixing points" feature outside UMAP instead: by using some graph layout library to pull the low-dimensional graph of the unconstrained embedding into the right shape/place. |
Sorry for the long silence! This above is an example where actually, fixed points are not super necessary, because the reconstruction without them makes sense. But this also happens on datasets where fixed points would make more sense. For example, this is an orthogonal star in high dimension: points where only one coordinate have values in [0,0.1,0.2,...,1.0]. I don't plot the ground truth because I can't in 2D. Result with fixed points are in bottom right figure: for N=100,200,500,1000 points: In both, where it says "no initialization" actually means spectral init. Anyway, I can identify 2 cases of wanting to fix points, and they are quite different:
Those 2 uses cases should be coded very differently, and I'm not sure how to address that. Any thoughts? Also, do you think there is a way to predict from the number of points and other parameters, the global size (bounding box) of the low dimensional embedding? If there was, we could initialize the embedding with the fixed points at their given positions, scaled to fit in the bounding box, and all other points initialized with the spectral init. |
Thanks for running experiments. It was these sorts of issues that had me scaling the initialization in the first place. I guess the real catch now is that the scaling (or the "natural size" of the final embedding) is not that fixed and varies with the number of data points (not that surprising, but still...) I can think of a couple of easy(ish) ways to handle this, and otherwise I think it would require some deeper thought to try to discern the "right" thing to do. The first approach is to try to find a heuristic formula for scaling with dataset size. It is probably possible to come up with something halfway reasonable. The second approach is harder, but probably more useful: if we zero-center the initialization then we can apply consistent uniform scaling of fixed points depending on some measure of stress on them in the embedding. I'm not that happy with either of those approaches, and they only address the first use case: pinning relative position (hence out freedom in rescaling things). The second use case wants very fixed positions. Supporting that seems a lot easier right now, so potentially the best interim solution is to leave it on the user to provide a suitable initialization for the fixed points. One final option may be the following: we instead allow users to supply an initialization for fixed points only and don't require them to initialize everything else; we can then perform a spectral initialization of the whole dataset, procrustes align that with respect to the fixed points, and then snap the fix points to their user specified positions. This takes a lot of the effort out of getting the initialization "right" in some ways. Thoughts? |
The heuristic formula idea is what I was thinking about but it might be too complicated, it seems like there would be so many particular cases where it wouldn't apply. And it also depends on how "converged" we are, but n_epochs doesn't tell us that, from what I've tried, some datasets require 500 epochs to converge, others 5000... And maybe if the natural scale of an embedding is [0,100] (when it started from [0,10]), starting directly at [0,100] might not converge to the same "natural" scale, or might converge to the same scale, but yield different results (better or worse, I don't know...). I like the idea of the zero-centring + scaling depending on stress. And it seems that the result is practically invariant to translation of the initialization, so that's nice. I thought it would be invariant to scaling if we also multiplied mindist and spread by the same scaling factor, but judging from my tests, that's not the case. For your last suggestion, I'm not sure I understand, can you give more details? I'm confused by "with respect to the fixed points". What do you want to align the spectral initialization of the whole dataset with? |
Sorry for the slow response. The plan for the last idea would be that the user supplies only initial locations for points that are to remain fixed (this obviously doesn't work with the soft mask ideas). The code then temporarily ignores the fixed points position and does the usual spectral initialization of the full dataset. This now gives us a dataset of locations for the fixed points as supplied by the user and a dataset of of locations of the corresponding points as generated by the spectral initialization of the full dataset. The goal, then, if to find a suitable affine transformation that maps the spectral initialized locations of the points to the user supplied fixed point locations. The easiest way to do this is by translating both datasets to be origin centered, globally scalling them to have the same variance, and then solving the orthogonal procrustes problem to find the rotation and reflection that minimises the error. You then have a linear map given the the rotation + reflection + scaling + translation from the spectral initialized locations of the would be fixed points to (as close an approximation as a suitable linear map can provide of) the user supplied locations. You can apply that linear map to the spectral initialization of the full dataset. This doesn't solve all the problems, but perhaps it makes it easier for the user since they don't have to specify everything just where they want to fixed points. To be fair I am still having concerns with this idea. The stress based scaling feels like the best answer, but it is certainly a challenging problem as to how best to calculate it. At this stage I suspect the best thing to do is simply to go with roughly what you have already and simply providing good documentation of the concerns that can arise (that the user may need to be careful and/or scale their initialization to get the results that they want). The plot you posted in this issue would work very well in tutorial documentation describing these sorts of issues. |
hey @matthieuheitz curious what the status is of this PR/feature request. Is this still in active development? (Is there something I can do to help out?) Reading the discussion above, I'm of the opinion that releasing a well documented approach using the heuristic is the way to go simply because it will get some aspects of this feature out for general use. As people use the tool we'll get a much better understanding of actual use cases and edge cases, compared to artificial datasets/experiments. |
In my constraints branch, I did the Procrustes approach Leland suggests, without having read this discussion. It just seemed a natural way to do it. My branch translates pin_mask things into more general framework, where you supply point or tangent space projectors as numba functions. This allows a lot of flexibility in constraint types, such as hard pinning, spring force “pinning”, as well as generic constraints like dimension wise bounds, etc. |
@theahura Sorry for the very delayed response. Unfortunately, I've stopped working on that because I realized I did not need it as much, and I've been quite busy so I haven't taken the time to code a clean interface and write up a good documentation for it in #620. |
See the Perhaps begin with the docstrings for A KISS Although folks are using my stuff in its current form (and actually really need it) there are numerous tricky |
Oh... I ignore finer points about data scaling. Presumably, 'min dist' should be small enough to be OK for most datasets (?) The +/-4 gradient bound did not concern me --- close to convergence, all gradients should be small. I view it as I also punted on the issue of constraints interacting badly with co-located points, (point-indexed constraints I haven't thought about constraints involving the embeddings of pairs of points, but don't see any reason you |
Your second example is really interesting, and relevant to me. I also predict feature requests for such "radial" pinnings, For 2-D, the most difficult case, I still want the potentially incompatible init embeddings to "move past one another" and |
Hi,
I'd like to know if it would be possible to add a prior on the low embedding, meaning that I want certain points to be at certain coordinates in the low embedding, and I would like UMAP to optimize all other points.
Is that something that is doable with the actual code, or would it need some changes?
How difficult would it be to add this feature?
Thank you very much.
The text was updated successfully, but these errors were encountered: