Does MODNet have facilities for including state variables such as temperature or pressure? #69

sgbaird · 2022-01-06T05:49:03Z

I was looking through the docs and example notebooks and didn't see where this information might fit in with the typical pipelines, but maybe I missed something.

ppdebreuck · 2022-01-06T11:21:59Z

Hi @sgbaird,

At this stage there is nothing that "easily" includes state variables (at least not explicitly). Two quick solutions exist, though. If the property is available over a fixed range of states (e.g. a temperature-dependent property), it can be treated as a vector (multi-property). Another (explicit) way is to append the state to the generated features.
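A minimal sketch of the second (explicit) option, using plain pandas as a stand-in for a featurized table. The column names, IDs, and values are made up for illustration, and one row per (material, state) measurement is assumed:

```python
import pandas as pd

# Hypothetical featurized table: one row per measurement, indexed by a unique ID.
features = pd.DataFrame(
    {"feat_a": [0.1, 0.1, 0.7], "feat_b": [1.5, 1.5, 2.0]},
    index=["id0", "id1", "id2"],
)

# Hypothetical state variable (temperature in K) for each measurement, keyed by
# the same IDs so that pandas aligns on the index rather than on row order.
temperature = pd.Series({"id2": 500.0, "id0": 5.0, "id1": 300.0}, name="T")

# Column assignment modifies the frame in place and aligns on the index.
features["T"] = temperature
```

Here `id0` and `id1` share the same composition-derived features and differ only in `T`, which is what lets a model learn the state dependence.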

sgbaird · 2022-01-06T14:43:45Z

Hi @ppdebreuck,

Thanks for the quick reply! For the first one, it sounds like you mean add the temperature as an additional target property? For the second one, perhaps I can just append a column to the ModData.df_featurized attribute?

modnet/modnet/preprocessing.py

Lines 530 to 556 in 719e028

    
    class MODData:
        """The MODData class takes takes a list of `pymatgen.Structure`
        objects and creates a `pandas.DataFrame` that contains many matminer
        features per structure. It then uses mutual information between
        features and targets, and between the features themselves, to
        perform feature selection using relevance-redundancy indices.

        Attributes:
            df_structure (pd.DataFrame): dataframe storing the `pymatgen.Structure`
                representations for each structured, indexed by ID.
            df_targets (pd.Dataframe): dataframe storing the prediction targets
                per structure, indexed by ID.
            df_featurized (pd.DataFrame): dataframe with columns storing all
                computed features per structure, indexed by ID.
            optimal_features (List[str]): if feature selection has been performed
                this attribute stores a list of the selected features.
            optimal_features_by_target (Dict[str, List[str]]): If feature selection has been performed
                this attribute stores a list of the selected features, broken down by target property.
            featurizer (MODFeaturizer): the class used to featurize the data.
            __modnet_version__ (str): The MODNet version number used to create the object
            cross_nmi (pd.DataFrame): If feature selection has been performed, this attribute
                stores the normalized mutual information between all features.
            feature_entropy (Dictionary): Information entropy of all features. Only computed after a call to compute cross_nmi.
            num_classes (Dictionary): Defining the target types (classification or regression).
                Should be constructed as follows: key: string giving the target name; value: integer n,
                with n=0 for regression and n>=2 for classification with n the number of classes.
        """

Maybe something like the following:

from modnet.preprocessing import MODData
from modnet.models import MODNetModel

# Creating MODData
data = MODData(materials=structures, targets=targets)
data.featurize()
# Add the state variable as an extra feature column. DataFrame.append would add
# rows and return a new frame, so assign a column instead. `temperatures` is
# assumed to be in the same order as the rows of df_featurized.
data.df_featurized["T"] = temperatures
data.feature_selection(n=200)

# Creating MODNetModel
model = MODNetModel(target_hierarchy,
                    weights,
                    num_neurons=[[256],[64,64],[32]],
                    )
model.fit(data)

# Predicting on unlabeled data
data_to_predict = MODData(new_structures)
data_to_predict.featurize()
data_to_predict.df_featurized["T"] = new_temperatures
df_predictions = model.predict(data_to_predict)  # returns dataframe containing the predictions on new_structures

(modified from Getting Started)

I haven't tried this yet, but if it seems reasonable I will probably give it a go later today.

ppdebreuck · 2022-01-06T15:14:41Z

For solution (1), yes the idea would be to have one target per temperature, like the thermodynamical data notebook.

# Creating MODNetModel
model = MODNetModel([[["S_5K","S_300K","S_500K"]]],
                    {"S_5K":1,"S_300K":1,"S_500K":1},
                    num_neurons=[[256],[64],[64],[32]],
                    )

This has a few limitations: the temperature dependence is implicit, the temperatures are fixed, the property should be available at every temperature for each sample, and training is slower.
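The "available for each sample" limitation can be illustrated with a small pandas sketch (all IDs and numbers are invented). Pivoting long-format measurements into one target column per temperature leaves gaps wherever a sample was not measured at one of the fixed temperatures:

```python
import pandas as pd

# Hypothetical long-format data: sample m2 has no measurement at 500 K.
long_df = pd.DataFrame({
    "id": ["m1", "m1", "m1", "m2", "m2"],
    "T": [5, 300, 500, 5, 300],
    "S": [0.1, 20.0, 35.0, 0.2, 25.0],
})

# One column per fixed temperature, one row per sample.
wide = long_df.pivot(index="id", columns="T", values="S")
wide.columns = [f"S_{t}K" for t in wide.columns]

# Only samples with a value at every temperature survive as complete targets.
complete = wide.dropna()
```

With this toy data only `m1` remains in `complete`, which is why the multi-target route fits poorly when states vary from sample to sample.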

I would indeed try what you suggested.

Created in response to ppdebreuck#69. No license was given in the original repository. See ziyan1996/VickersHardnessPrediction#1

sgbaird · 2022-01-07T09:17:14Z

It took some time, but I got it figured out and made an example notebook (see the PR above)

ppdebreuck · 2022-01-07T13:32:28Z

Cool! Indeed, option (1) was infeasible here. Thanks for this addition. A simple hyper opt might be worth adding as example:

from modnet.hyper_opt import FitGenetic
ga = FitGenetic(train)
model = ga.run(refit=0, nested=0, size_pop=10, num_generations=3, n_jobs=20) 
# size_pop, num_generations and n_jobs can be increased if computational power available

This avoids dealing with the model setup (number of neurons, etc.), takes around 5 minutes to run, and lowers the MAE to roughly 2.2.

Btw, are any benchmarking results available on this dataset?

sgbaird · 2022-01-07T23:51:33Z

@ppdebreuck thanks!

Can you use both hyper_opt and EnsembleMODNetModel simultaneously? I'm guessing this just means using hyper_opt and then passing in a list of the optimized parameters to EnsembleMODNetModel. I tried with the EnsembleMODNetModel (no hyper_opt) and got a test MAE of around +/- 3.1 and test R^2 of 0.81.

As for benchmarking, in VickersHardnessPrediction/hv_predictions.py they split data according to:

train_test_split(train_size=0.9, test_size=0.1, random_state=100, shuffle=True)
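In scikit-learn, that call also takes the arrays to split as positional arguments. A minimal sketch with placeholder data (the `X` and `y` below are invented, not the hardness dataset):

```python
from sklearn.model_selection import train_test_split

# Placeholder features and targets: 20 samples, purely illustrative.
X = [[i, i % 3] for i in range(20)]
y = [float(i) for i in range(20)]

# Same split settings as quoted above: 90/10, shuffled, fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.9, test_size=0.1, random_state=100, shuffle=True
)
```

Fixing `random_state` makes the split repeatable, which matters when comparing against numbers reported for the same seed.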

They use XGBoost with recursive feature elimination (RFE) on physical descriptors. In the paper, they report an MSE of 5.7 GPa (RMSE --> 2.4) and an R-squared value of 0.97 (see also parity plots in Figure 2 of 10.1002/adma.202005112). The scripts they give in the repo aren't in a working state and it looks like a decent bit of work to resolve all the errors. If I don't get a response I might continue trying to refactor the repo. I'm also not sure if the repo is a reproducer for the paper results, so I wanted to run it myself.
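The RMSE quoted above is just the square root of the reported MSE (strictly, an MSE carries squared units, GPa², and the RMSE is in GPa). A quick check:

```python
import math

mse = 5.7              # value reported in the paper
rmse = math.sqrt(mse)  # root-mean-square error
print(f"{rmse:.2f}")   # prints 2.39
```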

Btw looks like modnet.hyper_opt isn't contained in 0.1.11, so I used:

pip install git+https://github.com/ppdebreuck/modnet@master

I know that MEGNet is geared towards state variables, but MEGNet only takes structures as inputs, not compositions. It can be paired with something like BOWSR, but I'd only imagine that working for single-phase structures (i.e. it would be a somewhat nonsensical physical representation if alloys are involved).

ppdebreuck · 2022-01-10T07:38:21Z

FitGenetic.run() will in fact always return an EnsembleModel, with the ensemble depending on the refit and nested arguments.

If refit = 0: no refitting is done. Fitted models from the (nested) validation are simply reused. An ensemble is constructed from the best architecture over the inner folds (thus size 1 if nested=0, and size x if nested=x).
If refit = x > 0: the best parameters are refitted x times, so an ensemble of x refitted models is returned. All models have the same architecture (i.e. the best one found by the GA). This is exactly what you want, I think. (Using refit=1 would just be one MODNetModel within the Ensemble container.)
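Restated as a small function, purely as a paraphrase of the two cases described here. This is not MODNet code, and `ensemble_size` is a hypothetical helper name:

```python
def ensemble_size(refit: int, nested: int) -> int:
    """Number of models in the returned ensemble, per the description above."""
    if refit > 0:
        # The best architecture is refitted `refit` times.
        return refit
    # No refit: reuse the models from validation, one per inner fold,
    # or a single model when nested validation is off.
    return nested if nested > 0 else 1
```

For example, `ensemble_size(0, 0)` gives 1 and `ensemble_size(0, 5)` gives 5, matching the `refit=0` case, while `ensemble_size(3, 0)` gives 3.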

Thanks for the info! Yep, we need to clean things a bit up and make a new release on pypi when we find time :p

sgbaird · 2022-01-10T10:41:48Z

Ah, gotcha. Thank you!


sgbaird added a commit to sparks-baird/modnet that referenced this issue Jan 7, 2022

composition state example (hardness + load), dataset, and citation

2b4a88a

Created in response to ppdebreuck#69. No license was given in the original repository. See ziyan1996/VickersHardnessPrediction#1

sgbaird mentioned this issue Jan 7, 2022

composition state example (hardness + load), dataset, and citation #76

Merged

sgbaird mentioned this issue Jan 8, 2022

FitGenetic taking over 2.5 hrs instead of 5 min #77

Closed

ppdebreuck pushed a commit that referenced this issue Feb 8, 2022

composition state example (hardness + load), dataset, and citation

5a3346e

Created in response to #69. No license was given in the original repository. See ziyan1996/VickersHardnessPrediction#1
