Error loading dataset when using more than 2 GPUs #244

Open
wiederm opened this issue Aug 23, 2024 · 3 comments
Labels: bug (Something isn't working)
wiederm commented Aug 23, 2024

The following is the error log printed when using more than 2 GPUs on a single node. I am not sure why it works with 2 GPUs; there may simply be a lucky race condition. The problem is that each process removes the lock file, even though the file still needs to be present at this point.

A quick fix is to tolerate the file having already been removed, but I wonder whether this will cause different problems down the line. I think @chrisiacovella might know more about this.

(modelforge) [mwieder@node01 test]$ bash training_run.sh 
2024-08-23 18:54:03.939 | INFO     | modelforge.train.training:read_config:1272 - Reading config from : config.toml
2024-08-23 18:54:03.940 | DEBUG    | modelforge.potential.models:generate_potential:618 - training_parameter=TrainingParameters(number_of_epochs=1000, remove_self_energies=True, batch_size=32, lr=0.0005, monitor='val/per_molecule_energy/rmse', lr_scheduler=SchedulerConfig(frequency=1, mode='min', factor=0.1, patience=100, cooldown=50, min_lr=1e-08, threshold=0.1, threshold_mode='abs', monitor='val/per_molecule_energy/rmse', interval='epoch'), loss_parameter=LossParameter(loss_property=['per_molecule_energy'], weight={'per_molecule_energy': 0.9999}), early_stopping=EarlyStopping(verbose=True, monitor='loss/per_molecule_energy/mse', min_delta=0.001, patience=50), splitting_strategy=SplittingStrategy(name='random_record_splitting_strategy', data_split=[0.8, 0.1, 0.1], seed=42), stochastic_weight_averaging=None, experiment_logger=ExperimentLogger(logger_name='tensorboard', tensorboard_configuration=TensorboardConfig(save_dir='logs'), wandb_configuration=None), verbose=False, optimizer=<class 'torch.optim.adamw.AdamW'>)
2024-08-23 18:54:03.940 | DEBUG    | modelforge.potential.models:generate_potential:619 - potential_parameter=SchNetParameters(potential_name='SchNet', core_parameter=CoreParameter(number_of_radial_basis_functions=32, maximum_interaction_radius=<Quantity(5.0, 'angstrom')>, number_of_interaction_modules=8, number_of_filters=128, shared_interactions=True, activation_function_parameter=ActivationFunctionConfig(activation_function_name='ShiftedSoftplus', activation_function_arguments=None, activation_function=ShiftedSoftplus()), featurization=Featurization(properties_to_featurize=['atomic_number'], maximum_atomic_number=101, number_of_per_atom_features=128)), postprocessing_parameter=PostProcessingParameter(per_atom_energy=PerAtomEnergy(normalize=True, from_atom_to_molecule_reduction=True, keep_per_atom_property=True), general_postprocessing_operation=GeneralPostProcessingOperation(calculate_molecular_self_energy=True, calculate_atomic_self_energy=False)), potential_seed=None)
2024-08-23 18:54:03.940 | DEBUG    | modelforge.potential.models:generate_potential:620 - dataset_parameter=DatasetParameters(dataset_name='PhAlkEthOH', version_select='latest', num_workers=6, pin_memory=True)
2024-08-23 18:54:03.943 | DEBUG    | modelforge.dataset.phalkethoh:__init__:119 - Loading config data from /data/shared/projects/mamba/envs/modelforge/lib/python3.11/site-packages/modelforge/dataset/yaml_files/PhAlkEthOH.yaml
2024-08-23 18:54:03.945 | INFO     | modelforge.dataset.phalkethoh:__init__:129 - Using the latest dataset: full_dataset_v0
2024-08-23 18:54:03.945 | INFO     | modelforge.dataset.dataset:create_dataset:1040 - Creating PhAlkEthOH dataset
2024-08-23 18:54:03.955 | DEBUG    | modelforge.dataset.dataset:_from_file_cache:861 - Loading processed data from ./cache/PhAlkEthOH_dataset_v0_processed.npz generated on 2024-08-23 17:52:47.843887
2024-08-23 18:54:03.955 | DEBUG    | modelforge.dataset.dataset:_from_file_cache:864 - Properties of Interest in .npz file: ['atomic_numbers', 'dft_total_energy', 'geometry', 'dft_total_force', 'total_charge']
2024-08-23 18:54:06.432 | INFO     | modelforge.dataset.dataset:prepare_data:1155 - Loading dataset statistics from disk: ./cache/PhAlkEthOH_dataset_statistic.toml
2024-08-23 18:54:06.435 | DEBUG    | modelforge.dataset.dataset:prepare_data:1176 - Process dataset ...
2024-08-23 18:54:06.435 | INFO     | modelforge.dataset.dataset:_per_datapoint_operations:1346 - Performing per datapoint operations in the dataset dataset
2024-08-23 18:54:06.436 | INFO     | modelforge.dataset.dataset:_per_datapoint_operations:1348 - Removing self energies from the dataset
Process dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1188709/1188709 [00:18<00:00, 65143.67it/s]
Calculating pairlist for dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2378/2378 [00:53<00:00, 44.32it/s]
2024-08-23 18:57:01.136 | DEBUG    | modelforge.dataset.utils:split:383 - Using random splitting strategy with seed 42 ...
2024-08-23 18:57:01.137 | DEBUG    | modelforge.dataset.utils:split:384 - Splitting dataset into 0.8, 0.1, 0.1 ...
2024-08-23 18:57:01.174 | INFO     | modelforge.train.training:read_dataset_statistics:956 - Setting per_atom_energy_mean and per_atom_energy_stddev for SchNet
2024-08-23 18:57:01.174 | INFO     | modelforge.train.training:read_dataset_statistics:959 - per_atom_energy_mean: -399.4757085765604 kilojoule_per_mole
2024-08-23 18:57:01.174 | INFO     | modelforge.train.training:read_dataset_statistics:962 - per_atom_energy_stddev: 16.901179852695574 kilojoule_per_mole
2024-08-23 18:57:01.201 | DEBUG    | modelforge.potential.models:_initialize_postprocessing:953 - ['normalize', 'from_atom_to_molecule_reduction']
2024-08-23 18:57:01.201 | DEBUG    | modelforge.potential.models:_initialize_postprocessing:953 - ['calculate_molecular_self_energy']
2024-08-23 18:57:01.201 | DEBUG    | modelforge.potential.schnet:__init__:78 - Initializing the SchNet architecture.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/3
2024-08-23 18:57:09.223 | INFO     | modelforge.train.training:read_config:1272 - Reading config from : config.toml
2024-08-23 18:57:09.223 | DEBUG    | modelforge.potential.models:generate_potential:618 - training_parameter=TrainingParameters(number_of_epochs=1000, remove_self_energies=True, batch_size=32, lr=0.0005, monitor='val/per_molecule_energy/rmse', lr_scheduler=SchedulerConfig(frequency=1, mode='min', factor=0.1, patience=100, cooldown=50, min_lr=1e-08, threshold=0.1, threshold_mode='abs', monitor='val/per_molecule_energy/rmse', interval='epoch'), loss_parameter=LossParameter(loss_property=['per_molecule_energy'], weight={'per_molecule_energy': 0.9999}), early_stopping=EarlyStopping(verbose=True, monitor='loss/per_molecule_energy/mse', min_delta=0.001, patience=50), splitting_strategy=SplittingStrategy(name='random_record_splitting_strategy', data_split=[0.8, 0.1, 0.1], seed=42), stochastic_weight_averaging=None, experiment_logger=ExperimentLogger(logger_name='tensorboard', tensorboard_configuration=TensorboardConfig(save_dir='logs'), wandb_configuration=None), verbose=False, optimizer=<class 'torch.optim.adamw.AdamW'>)
2024-08-23 18:57:09.223 | INFO     | modelforge.train.training:read_config:1272 - Reading config from : config.toml
2024-08-23 18:57:09.223 | DEBUG    | modelforge.potential.models:generate_potential:619 - potential_parameter=SchNetParameters(potential_name='SchNet', core_parameter=CoreParameter(number_of_radial_basis_functions=32, maximum_interaction_radius=<Quantity(5.0, 'angstrom')>, number_of_interaction_modules=8, number_of_filters=128, shared_interactions=True, activation_function_parameter=ActivationFunctionConfig(activation_function_name='ShiftedSoftplus', activation_function_arguments=None, activation_function=ShiftedSoftplus()), featurization=Featurization(properties_to_featurize=['atomic_number'], maximum_atomic_number=101, number_of_per_atom_features=128)), postprocessing_parameter=PostProcessingParameter(per_atom_energy=PerAtomEnergy(normalize=True, from_atom_to_molecule_reduction=True, keep_per_atom_property=True), general_postprocessing_operation=GeneralPostProcessingOperation(calculate_molecular_self_energy=True, calculate_atomic_self_energy=False)), potential_seed=None)
2024-08-23 18:57:09.223 | DEBUG    | modelforge.potential.models:generate_potential:620 - dataset_parameter=DatasetParameters(dataset_name='PhAlkEthOH', version_select='latest', num_workers=6, pin_memory=True)
2024-08-23 18:57:09.223 | DEBUG    | modelforge.potential.models:generate_potential:618 - training_parameter=TrainingParameters(number_of_epochs=1000, remove_self_energies=True, batch_size=32, lr=0.0005, monitor='val/per_molecule_energy/rmse', lr_scheduler=SchedulerConfig(frequency=1, mode='min', factor=0.1, patience=100, cooldown=50, min_lr=1e-08, threshold=0.1, threshold_mode='abs', monitor='val/per_molecule_energy/rmse', interval='epoch'), loss_parameter=LossParameter(loss_property=['per_molecule_energy'], weight={'per_molecule_energy': 0.9999}), early_stopping=EarlyStopping(verbose=True, monitor='loss/per_molecule_energy/mse', min_delta=0.001, patience=50), splitting_strategy=SplittingStrategy(name='random_record_splitting_strategy', data_split=[0.8, 0.1, 0.1], seed=42), stochastic_weight_averaging=None, experiment_logger=ExperimentLogger(logger_name='tensorboard', tensorboard_configuration=TensorboardConfig(save_dir='logs'), wandb_configuration=None), verbose=False, optimizer=<class 'torch.optim.adamw.AdamW'>)
2024-08-23 18:57:09.224 | DEBUG    | modelforge.potential.models:generate_potential:619 - potential_parameter=SchNetParameters(potential_name='SchNet', core_parameter=CoreParameter(number_of_radial_basis_functions=32, maximum_interaction_radius=<Quantity(5.0, 'angstrom')>, number_of_interaction_modules=8, number_of_filters=128, shared_interactions=True, activation_function_parameter=ActivationFunctionConfig(activation_function_name='ShiftedSoftplus', activation_function_arguments=None, activation_function=ShiftedSoftplus()), featurization=Featurization(properties_to_featurize=['atomic_number'], maximum_atomic_number=101, number_of_per_atom_features=128)), postprocessing_parameter=PostProcessingParameter(per_atom_energy=PerAtomEnergy(normalize=True, from_atom_to_molecule_reduction=True, keep_per_atom_property=True), general_postprocessing_operation=GeneralPostProcessingOperation(calculate_molecular_self_energy=True, calculate_atomic_self_energy=False)), potential_seed=None)
2024-08-23 18:57:09.224 | DEBUG    | modelforge.potential.models:generate_potential:620 - dataset_parameter=DatasetParameters(dataset_name='PhAlkEthOH', version_select='latest', num_workers=6, pin_memory=True)
2024-08-23 18:57:09.225 | DEBUG    | modelforge.dataset.phalkethoh:__init__:119 - Loading config data from /data/shared/projects/mamba/envs/modelforge/lib/python3.11/site-packages/modelforge/dataset/yaml_files/PhAlkEthOH.yaml
2024-08-23 18:57:09.225 | DEBUG    | modelforge.dataset.phalkethoh:__init__:119 - Loading config data from /data/shared/projects/mamba/envs/modelforge/lib/python3.11/site-packages/modelforge/dataset/yaml_files/PhAlkEthOH.yaml
2024-08-23 18:57:09.227 | INFO     | modelforge.dataset.phalkethoh:__init__:129 - Using the latest dataset: full_dataset_v0
2024-08-23 18:57:09.227 | INFO     | modelforge.dataset.dataset:create_dataset:1040 - Creating PhAlkEthOH dataset
2024-08-23 18:57:09.227 | INFO     | modelforge.dataset.phalkethoh:__init__:129 - Using the latest dataset: full_dataset_v0
2024-08-23 18:57:09.227 | INFO     | modelforge.dataset.dataset:create_dataset:1040 - Creating PhAlkEthOH dataset
2024-08-23 18:57:09.229 | DEBUG    | modelforge.utils.misc:__enter__:296 - ./cache/PhAlkEthOH_dataset_v0_processed.json.lockfile in locked by another process; waiting until lock is released.
2024-08-23 18:57:09.232 | DEBUG    | modelforge.dataset.dataset:_from_file_cache:861 - Loading processed data from ./cache/PhAlkEthOH_dataset_v0_processed.npz generated on 2024-08-23 17:52:47.843887
2024-08-23 18:57:09.232 | DEBUG    | modelforge.dataset.dataset:_from_file_cache:864 - Properties of Interest in .npz file: ['atomic_numbers', 'dft_total_energy', 'geometry', 'dft_total_force', 'total_charge']
Traceback (most recent call last):
  File "/home/mwieder/Work/Projects/modelforge/scripts/test/perform_training.py", line 47, in <module>
    read_config_and_train(
  File "/data/shared/projects/mamba/envs/modelforge/lib/python3.11/site-packages/modelforge/train/training.py", line 1413, in read_config_and_train
    model = NeuralNetworkPotentialFactory.generate_potential(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/shared/projects/mamba/envs/modelforge/lib/python3.11/site-packages/modelforge/potential/models.py", line 637, in generate_potential
    model = ModelTrainer(
            ^^^^^^^^^^^^^
  File "/data/shared/projects/mamba/envs/modelforge/lib/python3.11/site-packages/modelforge/train/training.py", line 910, in __init__
    self.datamodule = self.setup_datamodule()
                      ^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/shared/projects/mamba/envs/modelforge/lib/python3.11/site-packages/modelforge/train/training.py", line 992, in setup_datamodule
    dm.prepare_data()
  File "/data/shared/projects/mamba/envs/modelforge/lib/python3.11/site-packages/modelforge/dataset/dataset.py", line 1148, in prepare_data
    torch_dataset = self._create_torch_dataset(dataset)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/shared/projects/mamba/envs/modelforge/lib/python3.11/site-packages/modelforge/dataset/dataset.py", line 1251, in _create_torch_dataset
    return DatasetFactory().create_dataset(dataset)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/shared/projects/mamba/envs/modelforge/lib/python3.11/site-packages/modelforge/dataset/dataset.py", line 1041, in create_dataset
    DatasetFactory._load_or_process_data(data)
  File "/data/shared/projects/mamba/envs/modelforge/lib/python3.11/site-packages/modelforge/dataset/dataset.py", line 999, in _load_or_process_data
    data._from_file_cache()
  File "/data/shared/projects/mamba/envs/modelforge/lib/python3.11/site-packages/modelforge/dataset/dataset.py", line 857, in _from_file_cache
    if self._metadata_validation(
       ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/shared/projects/mamba/envs/modelforge/lib/python3.11/site-packages/modelforge/dataset/dataset.py", line 610, in _metadata_validation
    os.remove(f"{file_path}/{file_name}.lockfile")
FileNotFoundError: [Errno 2] No such file or directory: './cache/PhAlkEthOH_dataset_v0_processed.json.lockfile'
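The quick fix described above could look roughly like this: wrap the failing `os.remove` so that a lockfile already deleted by another rank is not an error. This is only a sketch; `remove_lockfile` is a hypothetical helper, not the actual modelforge code in `_metadata_validation`.

```python
import contextlib
import os


def remove_lockfile(file_path: str, file_name: str) -> None:
    # Tolerate the case where another rank has already removed the
    # lockfile; contextlib.suppress swallows only FileNotFoundError,
    # so any other OSError still surfaces.
    # (remove_lockfile is a hypothetical helper, not modelforge's API.)
    with contextlib.suppress(FileNotFoundError):
        os.remove(f"{file_path}/{file_name}.lockfile")
```

Whether silently continuing is safe depends on what the lockfile is still guarding at that point, which is exactly the open question above.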
wiederm added the bug (Something isn't working) label on Aug 23, 2024
wiederm commented Aug 27, 2024

I have been looking at the locking procedure for some other work, and I am wondering whether we should transition to the filelock library.
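With `filelock`, the guard around the cache read might look something like the sketch below. The function name and the 600-second timeout are illustrative, not from modelforge; the point is that `FileLock.release` only unlocks, so concurrent ranks block on acquire instead of racing to delete each other's lockfile.

```python
from filelock import FileLock


def read_cache_with_lock(cache_file: str) -> bool:
    # Hypothetical sketch: guard metadata validation and .npz loading
    # with filelock's FileLock instead of a hand-rolled lockfile.
    lock = FileLock(cache_file + ".lock", timeout=600)
    with lock:
        # ... metadata validation and cache loading would go here ...
        return lock.is_locked  # True while this process holds the lock
```

Since `filelock` is already a transitive dependency of PyTorch, adopting it should not add a new install requirement.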

wiederm commented Aug 27, 2024

Ah, I see that you have added a context manager here! That is super useful!

wiederm commented Aug 27, 2024

I added a decorator to lock a method. The use case is slightly different from what you had implemented: here we don't want to lock a file stream; instead, only a single process should execute the method.
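A minimal sketch of that decorator idea (names and the polling strategy are assumptions, not the actual implementation): the first process to atomically create the lockfile runs the method; every other process waits for the lock to disappear and then skips the work, which is the behavior you want for one-time dataset preparation.

```python
import functools
import os
import time


def single_process(lock_path: str, poll_interval: float = 0.1):
    """Hypothetical sketch: run the decorated function in exactly one
    process; other processes wait for it to finish and return None."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                # O_CREAT | O_EXCL makes lockfile creation atomic
                # across processes: exactly one open() succeeds.
                fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            except FileExistsError:
                # Another process holds the lock; wait until it is done.
                while os.path.exists(lock_path):
                    time.sleep(poll_interval)
                return None
            try:
                return func(*args, **kwargs)
            finally:
                # Only the lock holder removes the lockfile, so the
                # FileNotFoundError race from the traceback cannot occur.
                os.close(fd)
                os.remove(lock_path)
        return wrapper
    return decorator
```

One caveat with this pattern: if the lock-holding process crashes before the `finally` block runs, the stale lockfile blocks everyone, which is another argument for `filelock` (it handles stale locks via OS-level advisory locking).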
