Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ElMD compositional featurizer #726

Draft
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

sgbaird
Copy link

@sgbaird sgbaird commented Nov 20, 2021

Summary

Add featurization scheme for ElM2D and corresponding UMAP embeddings.

  • ElM2D distance computations
  • UMAP embeddings of ElM2D distances

TODO (if any)

  • Conform to Google Code Style for docs
  • Once the code is settled, move it into composition.py etc.
  • add dist_matrix as conda/pip dependency
  • deal with pandas and scipy dependencies

@ardunn
I spent a few hours refactoring my refactor of ElM2D (chem_wasserstein) to try to conform the code to be (at least closer) to matminer, but I'm realizing I may not be able to justify the time it will take to push the PR all the way through. I'm open to your feedback (especially after you've glanced at the basic skeleton of the changes), but I'm thinking it might be more reasonable for now to use chem_wasserstein as an external, optional dependency.

@ardunn
Copy link
Contributor

ardunn commented Nov 23, 2021

Hi @sgbaird thanks for the PR!

This looks great and it is encouraging to see it basically in a working form already.

I think it is preferable to have the ElM2D code directly in matminer if it is not too complicated, and from briefly looking over the code it doesn't look like it is too complex to include (from the perspective of keeping matminer self contained and not too sprawling at least.. definitely not saying it is easy to port the code over from wasserstein haha)

We will need to make some changes to the locations of some of the classes and such to keep things organized but this is an excellent start if it is working on your end.

Here are some structural (not formatting) changes I think we'd need to make to take this PR to completion:

PCA class

Would it be possible to use the PCA class from sklearn instead?

Imports (umap, dist_matrix)

Would it be possible to import these within a try/except ImportError block such that it is clear what dependencies are needed in order to run the class?

ElM2D inheriting BaseFeaturizer

  • doesn't seem like ElM2D currently implements featurize, which it needs to. Featurize is just the simple operation of (once fitted), returning an array of features for one sample (i.e., one composition)
  • BaseFeaturizer overrides transform and fit_transform (if fit is defined), so those should probably be moved into featurize if possible.

Save and load

  • If possible, it would be best to save the variables to a more regular format like JSON or similar, rather than pickle.

Presets

  • If possible, it would be good to have some default "preset" options for people to load easily. See some of the other featurizers' from_preset methods to see what I mean. The idea being that people can load a "pre-fit" version which will work generally across use cases (again, if this is possible and makes sense)

If this seems like too much to be worth the effort, I think it would be fine to include the wasserstein code as an external dependency and include it in composition.external featurizers.

It would probably be best if the wasserstein code could be pip installable, but if not possible, we can still work with it by including it in the requirements as a git requirement directly. If we did go this way, I think the only things we'd need to implement are:

  • A basefeaturizer class wrapping the wasserstein code, implementing featurize and fit if needed
  • The featurizer can also implement from_preset if that is viable

Let me know what you think!

@sgbaird
Copy link
Author

sgbaird commented Nov 23, 2021

@ardunn, this is great!

Thanks for looking through the code. Using sklearn PCA shouldn't be a problem (technically I think it would be the MDS algorithm, i.e. distance-matrix based, even though it says PCA in a number of places). The imports, save/load, and presets also seem reasonable. For reference, I have a (refactored) version of ElM2D available on PyPI and Anaconda under the pseudonym chem_wasserstein to avoid conflict with the base, but based on what you've mentioned, I think it's worth pushing a bit further for the standalone matminer version as you mentioned. I also packaged dist_matrix on PyPI and Anaconda, so that dependency shouldn't be a problem. I feel pretty confident in the test suite I developed for dist_matrix as well.

Featurize

There are a few options for featurize and it wasn't immediately obvious to me which is the best one. It could either be:

  1. the vectors used as inputs to the Wasserstein distance
  2. a Wasserstein distance matrix
  3. embedded UMAP coordinates based on the distance matrix

(1) isn't particularly interesting (encodes the composition and a machine-learned scalar representation of each element), but they are "features" in the strictest sense - a fixed compound always maps to the same vector.
(2) is probably the most versatile information available from ElM2D, but distance matrices aren't feature vectors, so I have a hard time imagining this being the output of featurize
(3) is really useful for clustering and visualization (i.e. the embedded coordinates cluster according to chemical similarity). This definitely fits into the category of (2D Cartesian) feature vectors in an embedding space, but the implementation would probably need to be adjusted as mentioned below.

Do you lean towards one of the above? I'll leave this one up to your knowledge of matminer.

Adjust UMAP embedding implementation

If we go with (3), it would probably be better (if not essential) for me to implement a version that allows fit to be called once, and then a transform that can be called repeatedly/reliably on new data (right now, since it depends on distance matrices, you have to feed in both "training" and "validation" data at the same time, hence the quotes). This is something I've been meaning to get to, so I don't consider this "extra" work necessarily. I came up with an easy workaround for the immediate application I had intended it for, so I'm a bit behind on this one.

@ardunn
Copy link
Contributor

ardunn commented Nov 23, 2021

@sgbaird

External dependencies

For chem_wasserstein on PyPi, I think it's totally fine to just reference it as an external dependency and ElM2D can live in featurizers.composition.external.elm2d (it's own module). If you think you can adapt the code to be more along the structure of what's currently in matminer though, that's fine too - really your call and I'd approve a good implementation of either.

Features

I see two main ways of doing this:

  1. Just have the featurizer output the vectors used as inputs, and include Wasserstein distance matrix/UMAP as totally separate functions that can easily work on the output of the featurizer as a dataframe.
  2. Select between the 3 output types with an argument set in __init__, defining what you want to get out of the featurizer at the time you create the class. If vector is requested, it's easy to return just that. If distance matrix is requested, allow list of formulas as an extra argument to fit to use to compute the wasserstein distances. I.e., "I want to get the distances for each sample I'm featurizing with respect to this set of other formulas I fit on.". For example:
emd = ElM2DFeaturizer(method="distance")
emd.fit(X=df_i_want_to_featurize["composition"], comparison=other_df_want_to_compare_against["composition"])

The .fit syntax is pretty flexible for featurizers so feel free to get more creative with it as well

@ardunn
Copy link
Contributor

ardunn commented Jan 13, 2022

Hey @sgbaird any thoughts on this? Should I close?

@sgbaird
Copy link
Author

sgbaird commented Jan 13, 2022

Sorry, I kind of dropped the ball on this. Thank you for the reminder! I think I'll reference it as an external dependency for now. I'll plan on getting this worked out in the next month. Will that work out?

@ardunn
Copy link
Contributor

ardunn commented Jan 13, 2022

Sorry, I kind of dropped the ball on this. Thank you for the reminder! I think I'll reference it as an external dependency for now. I'll plan on getting this worked out in the next month. Will that work out?

Yeah no problem! We don't have any time restrictions or anything. Just wanted to see if it was still something you wanted to do :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants