MODNet for precalculated features and feature selection for a selected list of features #90

github-ML-fan · 2022-04-25T16:05:27Z

github-ML-fan
Apr 25, 2022

Hi,

I want to use MODNet to predict the activity of supported metal catalysts. I have already calculated all the features for the metal and the support separately.

Can I use MODNet for this kind of application without any precalculated features (i.e. using the featurizers in MODNet) when there are two different types of material (nanoparticles on the surface of a bulk metal oxide or graphite/pyrolytic carbon)?
Or
Since I have all the features I want precalculated, can I feed my dataset directly to MODNetModel without using any featurizers, skipping MODData?
My features include the mol% of elements such as Co and Pt, for which I have training data, and the mol% of elements such as Zr and Au, for which I have no training data. I want to keep the latter as features because I want to use the model to predict the activity of catalysts containing elements such as Zr and Au.
Would those features be eliminated during feature selection in MODNet because there is no training data for them? If so, is there a way to supply a fixed list of features to the feature selection?

Answered by ppdebreuck

May 2, 2022


ml-evs · 2022-04-26T08:49:37Z

ml-evs
Apr 26, 2022
Collaborator

Hi @github-ML-fan, sounds like an interesting project!

I think using your precalculated features makes most sense. MODNet by default will just use all of the matminer featurizers, which, as you suggest, may not be suitable for your problem as they require a single structure or composition. You probably still want to create a MODData that contains your precalculated descriptors, as this class is what does the feature selection. If you have a dataframe that contains just your calculated descriptors (my_featurized_df) and an array of activity values (targets), you should be able to do something like:
```
from modnet.preprocessing import MODData
data = MODData(
    targets=targets,
    df_featurized=my_featurized_df
)
data.feature_selection(n=num_desired_features)
```
You probably want to do feature selection separately for each test-train split, so you would need 1 MODData per split in that case.
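As a rough illustration of the per-split idea, here is a minimal sketch of generating k-fold index splits with the standard library only. The helper name `kfold_indices` is hypothetical and not part of MODNet; the comment marks where one MODData per split would be built and its feature selection run.

```python
import random


def kfold_indices(n_samples, n_splits, seed=0):
    """Yield (train_idx, test_idx) index lists for a shuffled k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into n_splits folds.
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(
            i for j, fold in enumerate(folds) if j != k for i in fold
        )
        yield train_idx, test_idx


splits = list(kfold_indices(n_samples=10, n_splits=5))
for train_idx, test_idx in splits:
    # Here one would build a MODData from the training rows only and run
    # feature selection on it, so test rows never influence the ranking.
    assert not set(train_idx) & set(test_idx)
```

Selecting features inside each split keeps the held-out rows from leaking into the feature ranking, at the cost of running the selection once per split.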
The null features will be dropped during the feature selection process (i.e., they will be ranked at the bottom), but you could manually include them in the final data.optimal_features list by adding their column names. I'm not sure what you think including them will achieve though.
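To make the "null feature" point concrete, here is a small sketch, assuming a toy feature table held in a plain dict rather than a real MODData. It finds columns that are constant over the training set and appends them by name to a selected list. The `optimal_features` list here is only a stand-in for `data.optimal_features`, and the numbers are made up.

```python
from statistics import pvariance

# Toy training table: column name -> values over the training samples.
train_features = {
    "Co_mol%": [10.0, 0.0, 5.0, 2.0],
    "Pt_mol%": [0.0, 3.0, 1.0, 4.0],
    "Zr_mol%": [0.0, 0.0, 0.0, 0.0],  # element absent from all training rows
    "Au_mol%": [0.0, 0.0, 0.0, 0.0],  # element absent from all training rows
}


def constant_columns(features):
    """Return the names of columns whose values never vary."""
    return [name for name, values in features.items() if pvariance(values) == 0.0]


null_features = constant_columns(train_features)

# Stand-in for the list produced by feature selection.
optimal_features = ["Co_mol%", "Pt_mol%"]
# Manually re-add the null features by column name, avoiding duplicates.
optimal_features += [f for f in null_features if f not in optimal_features]
```

A column that never varies in training carries no information about the target, which is why a relevance-based ranking puts it last and why forcing it back in is unlikely to help the model by itself.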


ppdebreuck · 2022-04-26T09:05:27Z

ppdebreuck
Apr 26, 2022
Maintainer

Hi, interesting project!
Just a further suggestion on the second part:

> As my features I have the mol % of elements such as Co mol%, Pt mol% etc for which I have the training data and I also have features such as Zr mol %, Au mol% for which I have zero training data, but I want to include them as features since I want to use the model to predict the activity of catalysts with elements such as Zr and Au.

In this case, I would still suggest using featurize() on the composition. If you provide only molar fractions, the model will not be able to extrapolate to other dimensions (i.e. unseen elements). With matminer features (i.e. featurize()), however, you will have average mass, radii, and other quantities that could extrapolate to your new elements. So you could use a combination of your own features and the MODNet featurizers by joining your features to the md.df_featurized dataframe after featurization and before feature selection. I can provide an example script if needed.
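The difference between per-element molar fractions and averaged elemental properties can be sketched without matminer. The snippet below is only an illustration: `ELEMENT_PROPS` and `averaged_features` are hypothetical names, and only two properties (standard atomic mass and Pauling electronegativity) are included, whereas matminer's composition featurizers compute statistics over many more.

```python
# Elemental properties: standard atomic mass (u) and Pauling electronegativity.
ELEMENT_PROPS = {
    "Co": {"mass": 58.933, "electronegativity": 1.88},
    "Pt": {"mass": 195.084, "electronegativity": 2.28},
    "Zr": {"mass": 91.224, "electronegativity": 1.33},
    "Au": {"mass": 196.967, "electronegativity": 2.54},
}


def averaged_features(mol_fractions):
    """Mole-fraction-weighted mean of each elemental property."""
    total = sum(mol_fractions.values())
    feats = {}
    for prop in ("mass", "electronegativity"):
        weighted = sum(
            frac * ELEMENT_PROPS[element][prop]
            for element, frac in mol_fractions.items()
        )
        feats["mean_" + prop] = weighted / total
    return feats


# A composition made of elements seen in training...
seen = averaged_features({"Co": 0.5, "Pt": 0.5})
# ...and one made only of unseen elements still lands in the same feature space.
unseen = averaged_features({"Zr": 0.5, "Au": 0.5})
```

With mol% columns, the Zr/Au sample would only activate inputs that were always zero during training. With averaged properties, it is described by the same columns as the Co/Pt samples, just with different values.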


github-ML-fan · 2022-04-26T13:33:35Z

github-ML-fan
Apr 26, 2022
Author

Sounds great. Thank you very much @ml-evs for the quick response with code.

I thought of adding the null features because the mol% values cover only the supported metals, which are crucial for activity.


github-ML-fan · 2022-04-26T13:44:57Z

github-ML-fan
Apr 26, 2022
Author

Thanks for the reply @ppdebreuck. I forgot to mention that I have used mol% only for the metals that are supported on the bulk compound. I have calculated the molar average and standard deviation of matminer features (and more) for the metals and the support separately. I am not sure whether I could featurize the whole catalyst, since the support has one crystal system and the metals, which are in the form of nano/bulk particles, have another.

But I would like to use the method you suggested and would love to have an example script.


ppdebreuck · 2022-05-02T12:41:44Z

ppdebreuck
May 2, 2022
Maintainer

Hi @github-ML-fan
Please find below an example in which I build features from two compounds (surface + particle) and also add some custom features. I hope this is more or less in line with your problem.
The important part is to work with the data.df_featurized dataframe and the join() method to add features as desired. This dataframe is then used for feature selection and for fitting, e.g. with ga = FitGenetic(data) followed by model = ga.run().

```
from pymatgen.core import Composition
from modnet.preprocessing import MODData
from modnet.hyper_opt import FitGenetic
import pandas as pd
import numpy as np


def main():
    # define material ids and compositions (structures also work)
    my_ids = ["id1", "id2", "id3"]
    surface = [Composition("Li2O"), Composition("MgO"), Composition("TiO2")]
    particle = [Composition("Zr"), Composition("Co"), Composition("Au")]

    # some pandas dataframe containing your own features, indexed by the same ids
    my_features = pd.DataFrame({"f1": [0.12, 0.16, 0.56]}, index=my_ids)

    # define one MODData per component
    surface_data = MODData(materials=surface, structure_ids=my_ids)
    particle_data = MODData(materials=particle, structure_ids=my_ids)

    # featurization
    surface_data.featurize()
    particle_data.featurize()
    # rename the particle columns so that they do not clash with the surface columns
    particle_data.df_featurized.columns = [
        x + "_particle" for x in particle_data.df_featurized.columns
    ]

    # join all features on the shared index, including the custom ones
    new_df_featurized = (
        surface_data.df_featurized.join(particle_data.df_featurized)
    ).join(my_features)

    # final MODData used for feature selection and fitting
    final_data = MODData(
        materials=[
            None for _ in range(len(surface))
        ],  # not used, since the features are provided directly
        targets=np.array([[1, 2, 3]]).T,
        target_names=["my_property"],
        df_featurized=new_df_featurized,
        structure_ids=my_ids,
    )
    final_data.feature_selection()

    # train model
    ga = FitGenetic(final_data)
    model = ga.run()

    # predict on new samples: test_data is not defined in this example and must be
    # a MODData built with the same featurization pipeline as the training data
    # predictions = model.predict(test_data)


if __name__ == "__main__":
    main()
```
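The requirement that test data follow the same featurization pipeline as the training data can be illustrated with a small standard-library sketch. The helper names `build_feature_row` and `to_matrix` are hypothetical, and the numeric values are arbitrary toy numbers. The point is only that one function builds the rows for both sets and that the column order is fixed once from the training data.

```python
def build_feature_row(surface_feats, particle_feats, custom_feats):
    """Merge the three feature groups, suffixing particle features to avoid clashes."""
    row = dict(surface_feats)
    row.update({name + "_particle": value for name, value in particle_feats.items()})
    row.update(custom_feats)
    return row


def to_matrix(rows, columns):
    """Lay rows out in a fixed column order; a missing column raises KeyError."""
    return [[row[column] for column in columns] for row in rows]


# Training rows (arbitrary toy values).
train_rows = [
    build_feature_row({"mean_mass": 10.0}, {"mean_mass": 91.2}, {"f1": 0.12}),
    build_feature_row({"mean_mass": 20.2}, {"mean_mass": 58.9}, {"f1": 0.16}),
]
# The column order is decided once, from the training data.
columns = sorted(train_rows[0])
X_train = to_matrix(train_rows, columns)

# Test rows go through the very same function and column order.
test_rows = [
    build_feature_row({"mean_mass": 26.6}, {"mean_mass": 197.0}, {"f1": 0.56}),
]
X_test = to_matrix(test_rows, columns)
```

Because `to_matrix` looks columns up by name, a test row produced by a different pipeline (for example one missing the `_particle` suffix) fails loudly instead of silently shifting columns.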


github-ML-fan · 2022-05-10T13:36:15Z

github-ML-fan
May 10, 2022
Author

Thank you so much!!
