The following error log is printed when using more than two GPUs on a single node. I am not sure why it works with two GPUs; there may just be a lucky race condition. The problem is that each process removes the lock file, even though the file still needs to be present for the other processes.
A quick fix is to tolerate the case where the file has already been removed, but I wonder whether this will cause different problems down the line. I think @chrisiacovella might know more about this.
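Such a tolerant removal might look like the sketch below (the helper name is hypothetical; the lock-file path matches the one in the traceback further down):

```python
import contextlib
import os


def remove_lockfile_if_present(path: str) -> None:
    """Remove a lock file, tolerating the case where another
    process (e.g. another DDP rank) has already removed it."""
    with contextlib.suppress(FileNotFoundError):
        os.remove(path)
```

For example, `remove_lockfile_if_present("./cache/PhAlkEthOH_dataset_v0_processed.json.lockfile")` would then be safe to call from every rank, whichever one wins the race.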
(modelforge) [mwieder@node01 test]$ bash training_run.sh
2024-08-23 18:54:03.939 | INFO | modelforge.train.training:read_config:1272 - Reading config from : config.toml
2024-08-23 18:54:03.940 | DEBUG | modelforge.potential.models:generate_potential:618 - training_parameter=TrainingParameters(number_of_epochs=1000, remove_self_energies=True, batch_size=32, lr=0.0005, monitor='val/per_molecule_energy/rmse', lr_scheduler=SchedulerConfig(frequency=1, mode='min', factor=0.1, patience=100, cooldown=50, min_lr=1e-08, threshold=0.1, threshold_mode='abs', monitor='val/per_molecule_energy/rmse', interval='epoch'), loss_parameter=LossParameter(loss_property=['per_molecule_energy'], weight={'per_molecule_energy': 0.9999}), early_stopping=EarlyStopping(verbose=True, monitor='loss/per_molecule_energy/mse', min_delta=0.001, patience=50), splitting_strategy=SplittingStrategy(name='random_record_splitting_strategy', data_split=[0.8, 0.1, 0.1], seed=42), stochastic_weight_averaging=None, experiment_logger=ExperimentLogger(logger_name='tensorboard', tensorboard_configuration=TensorboardConfig(save_dir='logs'), wandb_configuration=None), verbose=False, optimizer=<class 'torch.optim.adamw.AdamW'>)
2024-08-23 18:54:03.940 | DEBUG | modelforge.potential.models:generate_potential:619 - potential_parameter=SchNetParameters(potential_name='SchNet', core_parameter=CoreParameter(number_of_radial_basis_functions=32, maximum_interaction_radius=<Quantity(5.0, 'angstrom')>, number_of_interaction_modules=8, number_of_filters=128, shared_interactions=True, activation_function_parameter=ActivationFunctionConfig(activation_function_name='ShiftedSoftplus', activation_function_arguments=None, activation_function=ShiftedSoftplus()), featurization=Featurization(properties_to_featurize=['atomic_number'], maximum_atomic_number=101, number_of_per_atom_features=128)), postprocessing_parameter=PostProcessingParameter(per_atom_energy=PerAtomEnergy(normalize=True, from_atom_to_molecule_reduction=True, keep_per_atom_property=True), general_postprocessing_operation=GeneralPostProcessingOperation(calculate_molecular_self_energy=True, calculate_atomic_self_energy=False)), potential_seed=None)
2024-08-23 18:54:03.940 | DEBUG | modelforge.potential.models:generate_potential:620 - dataset_parameter=DatasetParameters(dataset_name='PhAlkEthOH', version_select='latest', num_workers=6, pin_memory=True)
2024-08-23 18:54:03.943 | DEBUG | modelforge.dataset.phalkethoh:__init__:119 - Loading config data from /data/shared/projects/mamba/envs/modelforge/lib/python3.11/site-packages/modelforge/dataset/yaml_files/PhAlkEthOH.yaml
2024-08-23 18:54:03.945 | INFO | modelforge.dataset.phalkethoh:__init__:129 - Using the latest dataset: full_dataset_v0
2024-08-23 18:54:03.945 | INFO | modelforge.dataset.dataset:create_dataset:1040 - Creating PhAlkEthOH dataset
2024-08-23 18:54:03.955 | DEBUG | modelforge.dataset.dataset:_from_file_cache:861 - Loading processed data from ./cache/PhAlkEthOH_dataset_v0_processed.npz generated on 2024-08-23 17:52:47.843887
2024-08-23 18:54:03.955 | DEBUG | modelforge.dataset.dataset:_from_file_cache:864 - Properties of Interest in .npz file: ['atomic_numbers', 'dft_total_energy', 'geometry', 'dft_total_force', 'total_charge']
2024-08-23 18:54:06.432 | INFO | modelforge.dataset.dataset:prepare_data:1155 - Loading dataset statistics from disk: ./cache/PhAlkEthOH_dataset_statistic.toml
2024-08-23 18:54:06.435 | DEBUG | modelforge.dataset.dataset:prepare_data:1176 - Process dataset ...
2024-08-23 18:54:06.435 | INFO | modelforge.dataset.dataset:_per_datapoint_operations:1346 - Performing per datapoint operations in the dataset dataset
2024-08-23 18:54:06.436 | INFO | modelforge.dataset.dataset:_per_datapoint_operations:1348 - Removing self energies from the dataset
Process dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1188709/1188709 [00:18<00:00, 65143.67it/s]
Calculating pairlist for dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2378/2378 [00:53<00:00, 44.32it/s]
2024-08-23 18:57:01.136 | DEBUG | modelforge.dataset.utils:split:383 - Using random splitting strategy with seed 42 ...
2024-08-23 18:57:01.137 | DEBUG | modelforge.dataset.utils:split:384 - Splitting dataset into 0.8, 0.1, 0.1 ...
2024-08-23 18:57:01.174 | INFO | modelforge.train.training:read_dataset_statistics:956 - Setting per_atom_energy_mean and per_atom_energy_stddev for SchNet
2024-08-23 18:57:01.174 | INFO | modelforge.train.training:read_dataset_statistics:959 - per_atom_energy_mean: -399.4757085765604 kilojoule_per_mole
2024-08-23 18:57:01.174 | INFO | modelforge.train.training:read_dataset_statistics:962 - per_atom_energy_stddev: 16.901179852695574 kilojoule_per_mole
2024-08-23 18:57:01.201 | DEBUG | modelforge.potential.models:_initialize_postprocessing:953 - ['normalize', 'from_atom_to_molecule_reduction']
2024-08-23 18:57:01.201 | DEBUG | modelforge.potential.models:_initialize_postprocessing:953 - ['calculate_molecular_self_energy']
2024-08-23 18:57:01.201 | DEBUG | modelforge.potential.schnet:__init__:78 - Initializing the SchNet architecture.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/3
2024-08-23 18:57:09.223 | INFO | modelforge.train.training:read_config:1272 - Reading config from : config.toml
2024-08-23 18:57:09.223 | DEBUG | modelforge.potential.models:generate_potential:618 - training_parameter=TrainingParameters(number_of_epochs=1000, remove_self_energies=True, batch_size=32, lr=0.0005, monitor='val/per_molecule_energy/rmse', lr_scheduler=SchedulerConfig(frequency=1, mode='min', factor=0.1, patience=100, cooldown=50, min_lr=1e-08, threshold=0.1, threshold_mode='abs', monitor='val/per_molecule_energy/rmse', interval='epoch'), loss_parameter=LossParameter(loss_property=['per_molecule_energy'], weight={'per_molecule_energy': 0.9999}), early_stopping=EarlyStopping(verbose=True, monitor='loss/per_molecule_energy/mse', min_delta=0.001, patience=50), splitting_strategy=SplittingStrategy(name='random_record_splitting_strategy', data_split=[0.8, 0.1, 0.1], seed=42), stochastic_weight_averaging=None, experiment_logger=ExperimentLogger(logger_name='tensorboard', tensorboard_configuration=TensorboardConfig(save_dir='logs'), wandb_configuration=None), verbose=False, optimizer=<class 'torch.optim.adamw.AdamW'>)
2024-08-23 18:57:09.223 | INFO | modelforge.train.training:read_config:1272 - Reading config from : config.toml
2024-08-23 18:57:09.223 | DEBUG | modelforge.potential.models:generate_potential:619 - potential_parameter=SchNetParameters(potential_name='SchNet', core_parameter=CoreParameter(number_of_radial_basis_functions=32, maximum_interaction_radius=<Quantity(5.0, 'angstrom')>, number_of_interaction_modules=8, number_of_filters=128, shared_interactions=True, activation_function_parameter=ActivationFunctionConfig(activation_function_name='ShiftedSoftplus', activation_function_arguments=None, activation_function=ShiftedSoftplus()), featurization=Featurization(properties_to_featurize=['atomic_number'], maximum_atomic_number=101, number_of_per_atom_features=128)), postprocessing_parameter=PostProcessingParameter(per_atom_energy=PerAtomEnergy(normalize=True, from_atom_to_molecule_reduction=True, keep_per_atom_property=True), general_postprocessing_operation=GeneralPostProcessingOperation(calculate_molecular_self_energy=True, calculate_atomic_self_energy=False)), potential_seed=None)
2024-08-23 18:57:09.223 | DEBUG | modelforge.potential.models:generate_potential:620 - dataset_parameter=DatasetParameters(dataset_name='PhAlkEthOH', version_select='latest', num_workers=6, pin_memory=True)
2024-08-23 18:57:09.223 | DEBUG | modelforge.potential.models:generate_potential:618 - training_parameter=TrainingParameters(number_of_epochs=1000, remove_self_energies=True, batch_size=32, lr=0.0005, monitor='val/per_molecule_energy/rmse', lr_scheduler=SchedulerConfig(frequency=1, mode='min', factor=0.1, patience=100, cooldown=50, min_lr=1e-08, threshold=0.1, threshold_mode='abs', monitor='val/per_molecule_energy/rmse', interval='epoch'), loss_parameter=LossParameter(loss_property=['per_molecule_energy'], weight={'per_molecule_energy': 0.9999}), early_stopping=EarlyStopping(verbose=True, monitor='loss/per_molecule_energy/mse', min_delta=0.001, patience=50), splitting_strategy=SplittingStrategy(name='random_record_splitting_strategy', data_split=[0.8, 0.1, 0.1], seed=42), stochastic_weight_averaging=None, experiment_logger=ExperimentLogger(logger_name='tensorboard', tensorboard_configuration=TensorboardConfig(save_dir='logs'), wandb_configuration=None), verbose=False, optimizer=<class 'torch.optim.adamw.AdamW'>)
2024-08-23 18:57:09.224 | DEBUG | modelforge.potential.models:generate_potential:619 - potential_parameter=SchNetParameters(potential_name='SchNet', core_parameter=CoreParameter(number_of_radial_basis_functions=32, maximum_interaction_radius=<Quantity(5.0, 'angstrom')>, number_of_interaction_modules=8, number_of_filters=128, shared_interactions=True, activation_function_parameter=ActivationFunctionConfig(activation_function_name='ShiftedSoftplus', activation_function_arguments=None, activation_function=ShiftedSoftplus()), featurization=Featurization(properties_to_featurize=['atomic_number'], maximum_atomic_number=101, number_of_per_atom_features=128)), postprocessing_parameter=PostProcessingParameter(per_atom_energy=PerAtomEnergy(normalize=True, from_atom_to_molecule_reduction=True, keep_per_atom_property=True), general_postprocessing_operation=GeneralPostProcessingOperation(calculate_molecular_self_energy=True, calculate_atomic_self_energy=False)), potential_seed=None)
2024-08-23 18:57:09.224 | DEBUG | modelforge.potential.models:generate_potential:620 - dataset_parameter=DatasetParameters(dataset_name='PhAlkEthOH', version_select='latest', num_workers=6, pin_memory=True)
2024-08-23 18:57:09.225 | DEBUG | modelforge.dataset.phalkethoh:__init__:119 - Loading config data from /data/shared/projects/mamba/envs/modelforge/lib/python3.11/site-packages/modelforge/dataset/yaml_files/PhAlkEthOH.yaml
2024-08-23 18:57:09.225 | DEBUG | modelforge.dataset.phalkethoh:__init__:119 - Loading config data from /data/shared/projects/mamba/envs/modelforge/lib/python3.11/site-packages/modelforge/dataset/yaml_files/PhAlkEthOH.yaml
2024-08-23 18:57:09.227 | INFO | modelforge.dataset.phalkethoh:__init__:129 - Using the latest dataset: full_dataset_v0
2024-08-23 18:57:09.227 | INFO | modelforge.dataset.dataset:create_dataset:1040 - Creating PhAlkEthOH dataset
2024-08-23 18:57:09.227 | INFO | modelforge.dataset.phalkethoh:__init__:129 - Using the latest dataset: full_dataset_v0
2024-08-23 18:57:09.227 | INFO | modelforge.dataset.dataset:create_dataset:1040 - Creating PhAlkEthOH dataset
2024-08-23 18:57:09.229 | DEBUG | modelforge.utils.misc:__enter__:296 - ./cache/PhAlkEthOH_dataset_v0_processed.json.lockfile in locked by another process; waiting until lock is released.
2024-08-23 18:57:09.232 | DEBUG | modelforge.dataset.dataset:_from_file_cache:861 - Loading processed data from ./cache/PhAlkEthOH_dataset_v0_processed.npz generated on 2024-08-23 17:52:47.843887
2024-08-23 18:57:09.232 | DEBUG | modelforge.dataset.dataset:_from_file_cache:864 - Properties of Interest in .npz file: ['atomic_numbers', 'dft_total_energy', 'geometry', 'dft_total_force', 'total_charge']
Traceback (most recent call last):
File "/home/mwieder/Work/Projects/modelforge/scripts/test/perform_training.py", line 47, in <module>
read_config_and_train(
File "/data/shared/projects/mamba/envs/modelforge/lib/python3.11/site-packages/modelforge/train/training.py", line 1413, in read_config_and_train
model = NeuralNetworkPotentialFactory.generate_potential(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/shared/projects/mamba/envs/modelforge/lib/python3.11/site-packages/modelforge/potential/models.py", line 637, in generate_potential
model = ModelTrainer(
^^^^^^^^^^^^^
File "/data/shared/projects/mamba/envs/modelforge/lib/python3.11/site-packages/modelforge/train/training.py", line 910, in __init__
self.datamodule = self.setup_datamodule()
^^^^^^^^^^^^^^^^^^^^^^^
File "/data/shared/projects/mamba/envs/modelforge/lib/python3.11/site-packages/modelforge/train/training.py", line 992, in setup_datamodule
dm.prepare_data()
File "/data/shared/projects/mamba/envs/modelforge/lib/python3.11/site-packages/modelforge/dataset/dataset.py", line 1148, in prepare_data
torch_dataset = self._create_torch_dataset(dataset)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/shared/projects/mamba/envs/modelforge/lib/python3.11/site-packages/modelforge/dataset/dataset.py", line 1251, in _create_torch_dataset
return DatasetFactory().create_dataset(dataset)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/shared/projects/mamba/envs/modelforge/lib/python3.11/site-packages/modelforge/dataset/dataset.py", line 1041, in create_dataset
DatasetFactory._load_or_process_data(data)
File "/data/shared/projects/mamba/envs/modelforge/lib/python3.11/site-packages/modelforge/dataset/dataset.py", line 999, in _load_or_process_data
data._from_file_cache()
File "/data/shared/projects/mamba/envs/modelforge/lib/python3.11/site-packages/modelforge/dataset/dataset.py", line 857, in _from_file_cache
if self._metadata_validation(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/shared/projects/mamba/envs/modelforge/lib/python3.11/site-packages/modelforge/dataset/dataset.py", line 610, in _metadata_validation
os.remove(f"{file_path}/{file_name}.lockfile")
FileNotFoundError: [Errno 2] No such file or directory: './cache/PhAlkEthOH_dataset_v0_processed.json.lockfile'
I added a decorator to lock a method. The use case is slightly different from what you had implemented: here we don't want to lock a file stream; instead, only a single process should execute the method at a time.
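A decorator along these lines could be sketched as follows (this is an illustration, not the actual modelforge implementation; the name `single_process_lock` and the use of POSIX `fcntl` advisory locks are assumptions). Note that the lock file is never deleted, which sidesteps the `FileNotFoundError` race shown above:

```python
import fcntl
import functools
import os


def single_process_lock(lock_path: str):
    """Decorator: acquire an exclusive advisory lock on lock_path before
    running the wrapped function, so only one process executes it at a
    time; other processes block until the lock is released."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Open (creating if necessary) the lock file; it is left in
            # place afterwards, so there is nothing to race over removing.
            fd = os.open(lock_path, os.O_CREAT | os.O_RDWR)
            try:
                fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the lock is free
                return func(*args, **kwargs)
            finally:
                fcntl.flock(fd, fcntl.LOCK_UN)
                os.close(fd)
        return wrapper
    return decorator
```

With this shape, each DDP rank calls the decorated method, but only one holds the lock and runs the body at any moment, which matches the "only a single process should execute the method" use case.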