restart run failure #1159

Closed
juliasloan25 opened this issue Jan 30, 2025 · 1 comment · Fixed by #1161
Labels
bug Something isn't working

Comments

@juliasloan25
Member

After #1121, our restart run has been failing because the files to restart from can't be found. cc @Sbozzolo

For example, see the following error from this build:

ERROR: LoadError: unable to determine if experiments/ClimaEarth/output/amip_coarse_ft64_hourly_checkpoints_restart/artifacts/checkpoint/checkpoint_ClimaAtmosSimulation_400.hdf5 is accessible in the HDF5 format (file may not exist)
--
  | Stacktrace:
  | [1] error(s::String)
  | @ Base ./error.jl:35
  | [2] h5open(filename::String, mode::String, fapl::HDF5.FileAccessProperties, fcpl::HDF5.FileCreateProperties; swmr::Bool)
  | @ HDF5 /central/scratch/esm/slurm-buildkite/climacoupler-ci/depot/cpu/packages/HDF5/Z859u/src/file.jl:48
  | [3] h5open
  | @ /central/scratch/esm/slurm-buildkite/climacoupler-ci/depot/cpu/packages/HDF5/Z859u/src/file.jl:20 [inlined]
  | [4] h5open(filename::String, mode::String; swmr::Bool, fapl::HDF5.FileAccessProperties, fcpl::HDF5.FileCreateProperties, pv::@Kwargs{…})
  | @ HDF5 /central/scratch/esm/slurm-buildkite/climacoupler-ci/depot/cpu/packages/HDF5/Z859u/src/file.jl:75
  | [5] #h5open#2
  | @ /central/scratch/esm/slurm-buildkite/climacoupler-ci/depot/cpu/packages/HDF5/Z859u/ext/MPIExt.jl:96 [inlined]
  | [6] h5open (repeats 2 times)
  | @ /central/scratch/esm/slurm-buildkite/climacoupler-ci/depot/cpu/packages/HDF5/Z859u/ext/MPIExt.jl:89 [inlined]
  | [4] h5open(filename::String, mode::String; swmr::Bool, fapl::HDF5.FileAccessProperties, fcpl::HDF5.FileCreateProperties, pv::@Kwargs{…})
  | @ HDF5 /central/scratch/esm/slurm-buildkite/climacoupler-ci/depot/cpu/packages/HDF5/Z859u/src/file.jl:75
  | [5] #h5open#2
  | @ /central/scratch/esm/slurm-buildkite/climacoupler-ci/depot/cpu/packages/HDF5/Z859u/ext/MPIExt.jl:96 [inlined]
  | [6] h5open (repeats 2 times)
  | @ /central/scratch/esm/slurm-buildkite/climacoupler-ci/depot/cpu/packages/HDF5/Z859u/ext/MPIExt.jl:89 [inlined]
  | [7] ClimaCore.InputOutput.HDF5Reader(filename::String, context::ClimaComms.MPICommsContext{…})
  | @ ClimaCore.InputOutput /central/scratch/esm/slurm-buildkite/climacoupler-ci/depot/cpu/packages/ClimaCore/rSpyb/src/InputOutput/readers.jl:98
  | [7] ClimaCore.InputOutput.HDF5Reader(filename::String, context::ClimaComms.MPICommsContext{…})
  | @ ClimaCore.InputOutput /central/scratch/esm/slurm-buildkite/climacoupler-ci/depot/cpu/packages/ClimaCore/rSpyb/src/InputOutput/readers.jl:98
  | [8] restart_model_state!(sim::ClimaAtmosSimulation{…}, comms_ctx::ClimaComms.MPICommsContext{…}, t::Int64; input_dir::String)
  | @ ClimaCoupler.Checkpointer /central/scratch/esm/slurm-buildkite/climacoupler-ci/5390/climacoupler-ci/src/Checkpointer.jl:67
  | [9] top-level scope
  | @ /central/scratch/esm/slurm-buildkite/climacoupler-ci/5390/climacoupler-ci/experiments/ClimaEarth/run_amip.jl:601
  | in expression starting at /central/scratch/esm/slurm-buildkite/climacoupler-ci/5390/climacoupler-ci/experiments/ClimaEarth/run_amip.jl:598
  | [8] restart_model_state!(sim::ClimaAtmosSimulation{…}, comms_ctx::ClimaComms.MPICommsContext{…}, t::Int64; input_dir::String)
  | @ ClimaCoupler.Checkpointer /central/scratch/esm/slurm-buildkite/climacoupler-ci/5390/climacoupler-ci/src/Checkpointer.jl:67
  | [9] top-level scope
  | @ /central/scratch/esm/slurm-buildkite/climacoupler-ci/5390/climacoupler-ci/experiments/ClimaEarth/run_amip.jl:601
  | in expression starting at /central/scratch/esm/slurm-buildkite/climacoupler-ci/5390/climacoupler-ci/experiments/ClimaEarth/run_amip.jl:598
  | srun: error: hpc-35-32: task 1: Exited with exit code 1
  | srun: launch/slurm: _step_signal: Terminating StepId=46451093.1
  | slurmstepd: error: *** STEP 46451093.1 ON hpc-34-07 CANCELLED AT 2024-12-23T15:52:08 ***
  | srun: error: hpc-34-07: task 0: Terminated
  | srun: Force Terminated StepId=46451093.1
  | ls: cannot access 'experiments/ClimaEarth/output/amip_coarse_ft64_hourly_checkpoints_restart/artifacts//checkpoint': No such file or directory
  | Error: RESTART_DIR does not contain enough files
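
A pre-flight check run before the restart job launches could surface the missing checkpoint directory earlier and with a clearer message. The sketch below is hypothetical: the function name `check_restart_dir`, its arguments, and the `*.hdf5` pattern are assumptions for illustration and are not taken from the project's CI scripts. Only `RESTART_DIR` and the `.hdf5` checkpoint file appear in the log above.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check for a restart directory.
# The function name, arguments, and "*.hdf5" pattern are assumptions,
# not the project's actual CI logic.
check_restart_dir() {
    local dir="$1"
    local min_files="${2:-1}"   # minimum number of checkpoint files expected

    # Fail early with a clear message if the directory itself is missing.
    if [ ! -d "$dir" ]; then
        echo "Error: restart directory '$dir' does not exist" >&2
        return 1
    fi

    # Count checkpoint files directly in the directory (no `ls` parsing).
    local count
    count=$(find "$dir" -maxdepth 1 -type f -name '*.hdf5' | wc -l | tr -d '[:space:]')

    if [ "$count" -lt "$min_files" ]; then
        echo "Error: '$dir' contains $count checkpoint file(s), expected at least $min_files" >&2
        return 1
    fi
    return 0
}
```

A CI step could then call `check_restart_dir "$RESTART_DIR" 1 || exit 1` before starting the restart run, so a missing directory and a directory with too few files are reported separately.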
@juliasloan25 juliasloan25 added the bug Something isn't working label Jan 30, 2025
@juliasloan25
Member Author

For now, I'm commenting out the restart run in #1157, since it's slowing down all of our CI.
