
finetuning problem #20

@Asif-Iqbal-Bhatti


Hello, I am trying to fine-tune GRACE-2L-OAM using this input YAML file:

'''
seed: 1
cutoff: 6.0

data:
  filename: train_data.extxyz
  test_filename: test_data.extxyz
  reference_energy: {Cl: -0.29658376, S: -1.884791, P: -0.88440305, Li: -0.25125556}
  # reference_energy: {Al: -1.23, Li: -3.56}
  save_dataset: False
  stress_units: eV/A3  # eV/A3 (default) or GPa or kbar or -kbar

potential:
  # if elements are not provided, they are determined automatically from the data
  preset: GRACE_1LAYER  # LINEAR, FS, GRACE_1LAYER, GRACE_2LAYER
  finetune_foundation_model: GRACE-2L-OAM
  reduce_elements: False

  # for a custom model from model.py::custom_model
  # custom: model.custom_model

  shift: True   # True/False
  scale: True   # False/True or float

fit:
  loss: {
    energy: { weight: 100, type: huber, delta: 0.01 },
    forces: { weight: 10, type: huber, delta: 0.01 },
    stress: { weight: 0.1, type: huber, delta: 0.01 },
  }

  maxiter: 600  # number of epochs / iterations
  optimizer: Adam
  opt_params: {
    learning_rate: 0.0001,
    amsgrad: True,
    use_ema: True,
    ema_momentum: 0.99,
    weight_decay: null,
    clipvalue: 1.0,
  }

  # for learning-rate reduction
  learning_rate_reduction: { patience: 10, factor: 0.99, min: 5.0e-4, stop_at_min: True, resume_lr: True, }

  # optimizer: L-BFGS-B
  # opt_params: { "maxcor": 100, "maxls": 20 }

  # needed for low-energy-tier metrics and for the "convex_hull"-based distance of the energy-based weighting scheme
  compute_convex_hull: False
  batch_size: 32       # important hyperparameter for Adam; irrelevant (but still required) for L-BFGS-B
  test_batch_size: 8   # test batch size (optional)

  jit_compile: True
  eval_init_stats: True  # evaluate initial metrics

  train_max_n_buckets: 10  # max number of buckets (groups of batches of the same shape) in the train set
  test_max_n_buckets: 5    # same for the test set

  checkpoint_freq: 1  # frequency of REGULAR checkpoints
  save_all_regular_checkpoints: True  # store ALL regular checkpoints

  progressbar: True    # show the batch-evaluation progress bar
  train_shuffle: True  # shuffle train batches every epoch

'''
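
For context on the reference_energy values in the config: a minimal sketch, assuming the DFT total energies are stored in the extxyz frames and readable via ASE, of how per-element reference energies can be estimated by a least-squares fit of composition against total energy. The file name and element set are taken from the config above; the least-squares approach itself is a common convention, not necessarily how these particular numbers were obtained.

'''
# Hedged sketch: estimate per-element reference energies from train_data.extxyz
# by least-squares. Assumes each frame carries its DFT total energy so that
# atoms.get_potential_energy() returns it after ase.io.read.
import numpy as np
from ase.io import read

frames = read("train_data.extxyz", index=":")
elements = sorted({s for atoms in frames for s in atoms.get_chemical_symbols()})

# Composition matrix: number of atoms of each element in every structure
A = np.array(
    [[atoms.get_chemical_symbols().count(el) for el in elements] for atoms in frames],
    dtype=float,
)
E = np.array([atoms.get_potential_energy() for atoms in frames])

ref, *_ = np.linalg.lstsq(A, E, rcond=None)
print(dict(zip(elements, ref.round(6))))
'''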

During fine-tuning, however, I keep seeing an energy shift:

[attached screenshot]

My fine-tuning data are compatible with MPRelaxSet (PBE). Despite setting shift and scale, I don't see any improvement. Could you tell me what is going on?
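
One way to quantify such a shift is to measure the mean per-atom energy offset between model predictions and the DFT reference energies. A minimal sketch, assuming the grace_fm helper from the tensorpotential package is available (as shown in the GRACE documentation) and that test_data.extxyz carries the DFT energies; the calculator for the fine-tuned model could be substituted for the foundation model here:

'''
# Hedged sketch: constant per-atom offset between model predictions and the
# DFT reference energies stored in test_data.extxyz.
# Assumptions: ASE reads the reference energies from the extxyz frames, and
# grace_fm (tensorpotential, import path as in the GRACE docs) provides an
# ASE calculator for GRACE-2L-OAM.
import numpy as np
from ase.io import read
from tensorpotential.calculator import grace_fm  # assumed import path

frames = read("test_data.extxyz", index=":")
calc = grace_fm("GRACE-2L-OAM")

diffs = []
for atoms in frames:
    e_ref = atoms.get_potential_energy() / len(atoms)    # stored DFT energy, eV/atom
    atoms.calc = calc
    e_model = atoms.get_potential_energy() / len(atoms)  # model prediction, eV/atom
    diffs.append(e_model - e_ref)

diffs = np.array(diffs)
print(f"mean offset: {diffs.mean():.4f} eV/atom, scatter: {diffs.std():.4f} eV/atom")
'''

A large mean offset with small scatter would point to a constant shift (e.g. different reference-energy conventions) rather than a genuine accuracy problem.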

'''
2025/11/27 14:57:52 I - Iteration #42/600 TRAIN(TEST): total_loss: 2.110e-02 (2.728e-02) mae/depa: 1.308e-02 (3.028e-02) rmse/depa: 2.052e-02 (5.360e-02) mae/f_comp: 1.219e-01 (8.690e-02) rmse/f_comp: 1.876e-01 (1.691e-01) mae/stress(GPa): 1.123e+00 (2.502e-01) rmse/stress(GPa): 1.518e+00 (3.833e-01) Time(mcs/at): 71 (21)
2025/11/27 14:57:52 I - Minimum value of learning rate 0.0005 is achieved 11 times - stopping
2025/11/27 14:57:53 I - Regular checkpointing
2025/11/27 14:57:53 I - Sharding callback duration: 42 microseconds
2025/11/27 14:57:54 I - Loading best test loss model
2025/11/27 14:57:56 I - Loaded checkpoint from seed/1/checkpoints/checkpoint.best_test_loss
2025/11/27 14:57:56 I - Saving model to final_model
INFO:tensorflow:Assets written to: seed/1/final_model/assets
2025/11/27 14:58:04 I - Assets written to: seed/1/final_model/assets
'''
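
For reference, the "Minimum value of learning rate 0.0005 is achieved ... - stopping" message matches the learning_rate_reduction settings in the config (min: 5.0e-4, stop_at_min: True). A small sketch of the schedule arithmetic, under the assumption (not verified against gracemaker internals) that each reduction event multiplies the rate by factor and that training stops once the rate is at or below min:

'''
# Hedged sketch of the learning-rate reduction arithmetic from the config above.
# Assumption (not verified against gracemaker internals): each reduction event
# multiplies the rate by `factor`; stop_at_min ends training once the rate is
# at or below `min`.
lr = 0.0001       # opt_params.learning_rate
factor = 0.99     # learning_rate_reduction.factor
lr_min = 5.0e-4   # learning_rate_reduction.min

reductions = 0
while lr > lr_min:
    lr *= factor
    reductions += 1

# With the posted values the initial rate (1.0e-4) is already below min (5.0e-4),
# so the loop body never runs and the minimum counts as reached from the start.
print(f"reductions needed to reach min: {reductions}")
'''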
