NODE path error #33

Open
duncanmcelfresh opened this issue Sep 16, 2022 · 5 comments
Labels: bug (Something isn't working)

@duncanmcelfresh
Collaborator

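NODE fails with an error related to saving or loading model files or checkpoints. The traceback shows load_checkpoint(tag="best") failing because checkpoint_best.pth does not exist: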

Traceback (most recent call last):
  File "/home/shared/tabzilla/TabSurvey/tabzilla_experiment.py", line 136, in __call__
    result = cross_validation(model, self.dataset, self.time_limit)
  File "/home/shared/tabzilla/TabSurvey/tabzilla_utils.py", line 237, in cross_validation
    loss_history, val_loss_history = curr_model.fit(
  File "/home/shared/tabzilla/TabSurvey/models/node.py", line 174, in fit
    self.trainer.load_checkpoint(tag="best")
  File "/home/shared/tabzilla/TabSurvey/models/node_lib/trainer.py", line 73, in load_checkpoint
    checkpoint = torch.load(path)
  File "/opt/conda/envs/torch/lib/python3.10/site-packages/torch/serialization.py", line 699, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/opt/conda/envs/torch/lib/python3.10/site-packages/torch/serialization.py", line 231, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/opt/conda/envs/torch/lib/python3.10/site-packages/torch/serialization.py", line 212, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'logs/openml__arrhythmia__5_2022.09.16_20:21:07/checkpoint_best.pth'

duncanmcelfresh added the bug label Sep 16, 2022
@jonathan-valverde-l
Collaborator

jonathan-valverde-l commented Sep 30, 2022

The issue seems to occur when logging_period is greater than the number of epochs. Checkpointing is only done periodically (according to logging_period), so in that case no checkpoint is ever written and load_checkpoint(tag="best") has nothing to load. I have changed the behavior so that checkpointing occurs at every multiple of logging_period and also at the last iteration. This ensures that at least one checkpoint is always saved.
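
A minimal sketch of the checkpoint schedule described above, assuming 1-indexed training steps. The names `should_checkpoint`, `checkpoint_steps`, and `total_steps` are hypothetical and are not taken from the NODE trainer code:

```python
def should_checkpoint(step: int, total_steps: int, logging_period: int) -> bool:
    """Return True at every multiple of logging_period and at the final step.

    Assumes steps are numbered 1..total_steps (an assumption for this sketch).
    """
    if logging_period <= 0:
        raise ValueError("logging_period must be positive")
    return step % logging_period == 0 or step == total_steps


def checkpoint_steps(total_steps: int, logging_period: int) -> list[int]:
    """List every step at which a checkpoint would be written."""
    return [
        step
        for step in range(1, total_steps + 1)
        if should_checkpoint(step, total_steps, logging_period)
    ]
```

With this rule, a run of 5 steps and a logging_period of 100 still writes one checkpoint (at step 5), which is the case that previously left no file to load.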

However, we need to be careful when selecting hyperparameters for this method. If logging_period is too high, the validation loss is tracked infrequently, so we run a greater risk of overfitting. If it is too low, too many checkpoints are saved, which can cause storage issues. (We can fix this separately by having the model delete old checkpoints.)
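
One possible way to delete old checkpoints, sketched under assumptions: periodic checkpoints are assumed to match `checkpoint_*.pth` inside the log directory, and `prune_checkpoints`, `keep_last`, and `protected` are hypothetical names. Only the file name checkpoint_best.pth comes from the traceback in this issue.

```python
from pathlib import Path


def prune_checkpoints(log_dir, keep_last=3, protected=("checkpoint_best.pth",)):
    """Delete all but the newest `keep_last` checkpoints in `log_dir`.

    Files named in `protected` are never deleted. The glob pattern is an
    assumption about how periodic checkpoints are named.
    """
    if keep_last < 0:
        raise ValueError("keep_last must be non-negative")
    candidates = [
        path
        for path in Path(log_dir).glob("checkpoint_*.pth")
        if path.name not in protected
    ]
    # Newest first, by modification time.
    candidates.sort(key=lambda path: path.stat().st_mtime, reverse=True)
    removed = []
    for path in candidates[keep_last:]:
        path.unlink()
        removed.append(path)
    return removed
```

Calling this after each save would bound disk usage to `keep_last` periodic checkpoints plus the best one.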

While fixing this issue, I ran into problems related to #27. I have implemented a fix for that in the latest commit as well. After fixing these two issues, I am able to run a full trial for NODE on openml__arrhythmia__5 without a problem.

@suj97
Collaborator

suj97 commented Oct 3, 2022

Have you checked whether creating the logs directory manually solves the issue?
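
For reference, a sketch of creating the directory in code instead of by hand. `ensure_parent_dir` is a hypothetical helper, not part of the repository. Note that the traceback above fails in torch.load, not in a save call, so a missing directory may not be the cause here.

```python
from pathlib import Path


def ensure_parent_dir(file_path):
    """Create the parent directory of `file_path` if needed and return the path."""
    path = Path(file_path)
    # parents=True creates intermediate directories; exist_ok=True makes
    # the call safe to repeat when the directory already exists.
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Possible usage before saving (torch.save accepts a path-like object):
#   torch.save(state, ensure_parent_dir("logs/<run_name>/checkpoint_best.pth"))
```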

@suj97
Collaborator

suj97 commented Oct 3, 2022

oh nvm, looks like it's fixed.

@duncanmcelfresh
Collaborator Author

@jonathan-valverde-l resolved this with a previous commit

@duncanmcelfresh
Collaborator Author

NODE needs to be tested

duncanmcelfresh assigned palti117 and unassigned pcvishak Oct 17, 2022
paper3193 pushed a commit to paper3193/tabzilla that referenced this issue Oct 21, 2024