Below is a minimal working example of the problem:
using Oceananigans
using Printf

grid = RectilinearGrid(Float64,
                       size = (4, 4, 4),
                       x = (0, 1),
                       y = (0, 1),
                       z = (-1, 0),
                       topology = (Periodic, Periodic, Bounded))

b_initial(x, y, z) = rand()

model = NonhydrostaticModel(;
                            grid = grid,
                            buoyancy = BuoyancyTracer(),
                            tracers = (:b),
                            timestepper = :RungeKutta3)

simulation = Simulation(model, Δt=0.1, stop_iteration=200)

outputs = merge(model.velocities, model.tracers)

simulation.output_writers[:jld2] = JLD2OutputWriter(model, outputs,
                                                    filename = "$(FILE_DIR)/instantaneous_fields.jld2",
                                                    schedule = IterationInterval(1),
                                                    max_filesize = 200e3,
                                                    part = 1)

simulation.output_writers[:checkpointer] = Checkpointer(model, schedule=TimeInterval(100), prefix="$(FILE_DIR)/model_checkpoint")

run!(simulation)

# At this point the most recent output file is instantaneous_fields_part4.jld2,
# and I want to continue running the simulation.
simulation.stop_iteration = 400

simulation.output_writers[:jld2] = JLD2OutputWriter(model, outputs,
                                                    filename = "instantaneous_fields.jld2",
                                                    schedule = IterationInterval(1),
                                                    max_filesize = 200e3,
                                                    part = 4)

run!(simulation, pickup="model_checkpoint_iteration0.jld2")
What I'm doing is creating a directory test_outputwriter, and then writing fields into it with a specified maximum file size and starting part number.
After the first run!(simulation), 4 output files have been written, the most recent being instantaneous_fields_part4.jld2, along with a checkpoint file model_checkpoint_iteration0.jld2.
Let's say I want to keep running this model, so I increase simulation.stop_iteration. I pick up the model from the most recent checkpoint and specify part=4 (the most recent file written). This creates an instantaneous_fields.jld2 and keeps writing into it, while throwing a warning:
Warning: Failed to save and serialize [:grid, :coriolis, :buoyancy, :closure] in ./test_outputwriter/instantaneous_fields.jld2 because ArgumentError: ArgumentError: a group or dataset named Nx is already present within this group
It never actually writes into instantaneous_fields_part4.jld2; instead it keeps writing and rewriting into instantaneous_fields.jld2. If I specify part=10, or any other number larger than 4, the same problem occurs.
If I use part=1 when I start the simulation the second time, it throws:
ERROR: ArgumentError: '.\./test_outputwriter/instantaneous_fields_part1.jld2' exists. `force=true` is required to remove '.\./test_outputwriter/instantaneous_fields_part1.jld2' before moving.
I'm not sure what the intended user experience is, but I was imagining that if the simulation stops for some reason and I want to rerun it from a checkpoint, 2 potential options would be available:
1) The model runs from the latest checkpoint and continues writing into the most recent output file once it catches up to the latest unsaved iteration. Note that, because the model restarts from the checkpoint, the iterations it first passes through may already be saved in parts earlier than the most recent one. The simulation should know that and only start writing into the latest part once it has caught up to the latest saved iteration.
2) I specify a part number that is larger than those of all previous output files, and the simulation picks up from the checkpoint and writes into the new part. This could mean that some iterations are saved more than once across the full set of output files (new and old).
1) is potentially the most important and common use case, but 2) might not be an unreasonable usage either. However, neither can be achieved in the current implementation.
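Both options depend on knowing which part files already exist on disk. A minimal sketch of that lookup follows, assuming the `<prefix>_part<N>.jld2` naming seen above. `latest_part` is a hypothetical helper written for illustration, not an Oceananigans function, and it uses `chopprefix`/`chopsuffix`, which need Julia 1.8 or later.

```julia
# Hypothetical helper (not part of Oceananigans): return the highest part number N
# among files named "<prefix>_part<N>.jld2" in `dir`, or `nothing` if there are none.
function latest_part(dir::AbstractString, prefix::AbstractString)
    head = prefix * "_part"
    tail = ".jld2"
    parts = Int[]
    for name in readdir(dir)
        if startswith(name, head) && endswith(name, tail)
            # Strip the fixed prefix and suffix; what remains should be the part number.
            middle = chopsuffix(chopprefix(name, head), tail)
            n = tryparse(Int, middle)
            n === nothing || push!(parts, n)
        end
    end
    return isempty(parts) ? nothing : maximum(parts)
end
```

For option 1 the writer would resume in the part returned here; for option 2 it would start at that number plus one.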
The intended user experience is that only one line should need to be changed: pickup=false to pickup=true in run!.
Therefore, users should not have to manually specify the "part" that they want to pick up from. I don't like option 2 above.
I think that fixing this problem may become much easier if we can "delay" the creation of the output file. Right now, the output file is created when we build the output writer. But at that point, we have no way of knowing whether we are going to pick up or not.
I've long wanted to implement this "delay" but more pressing matters have intervened...
The basic thing we need to do is add an initialize!(output_writer, sim) utility, which will create the output file. That function will then know whether the simulation is starting fresh (because iteration(sim) == 0) or whether it is "continuing". One huge feature this will enable is the ability to avoid overwriting an existing file when it holds the output from the run that is being continued. With the current interface you have to be really careful about overwrite_existing if you are trying to pick up from a checkpoint, and I think that's a big problem.
With that feature I think we can also figure out how to handle output that is split into multiple files, because if a simulation is continuing we know that we will have to figure out which part to use (if any).
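A rough sketch of the delayed-creation idea, under stated assumptions: `SketchWriter` and `SketchSimulation` are hypothetical stand-ins rather than Oceananigans types, and this `initialize!` only illustrates the decision logic described above. It is not an existing implementation, and it ignores multi-part output.

```julia
# Hypothetical stand-ins for an output writer and a simulation (not Oceananigans types).
mutable struct SketchWriter
    filepath::String
    overwrite_existing::Bool
    initialized::Bool
end

SketchWriter(filepath; overwrite_existing=false) =
    SketchWriter(filepath, overwrite_existing, false)

struct SketchSimulation
    iteration::Int
end

iteration(sim::SketchSimulation) = sim.iteration

# Decide what to do with the output file only once the simulation state is known.
function initialize!(writer::SketchWriter, sim::SketchSimulation)
    exists = isfile(writer.filepath)
    action = if iteration(sim) > 0 && exists
        :append   # continuing run: keep the existing output and add to it
    elseif exists && !writer.overwrite_existing
        error("$(writer.filepath) exists and overwrite_existing is false")
    else
        open(io -> nothing, writer.filepath, "w")   # create (or truncate) the file
        :create
    end
    writer.initialized = true
    return action
end
```

Because the file is touched only inside `initialize!`, building the writer has no side effects, and a picked-up run never truncates the output it is meant to extend.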
"continues writing into the most recent output file once it catches up to the latest unsaved iteration."
This is a separate feature from what I was talking about, but I think it's also a great idea! It may also offer a clue for solving a round-off issue, where two outputs are written one iteration apart but at virtually identical times (e.g. distinguished only by machine epsilon).
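One way to picture the connection is a guard that skips a write when the current time matches an already-saved time up to round-off. The sketch below assumes the saved times are available as a plain vector; `already_saved` and its default tolerance are hypothetical choices for illustration, not part of Oceananigans.

```julia
# Hypothetical guard (not part of Oceananigans): treat `t` as already saved if it
# agrees with any saved time to within a few units of machine epsilon.
function already_saved(t::Real, saved_times::AbstractVector{<:Real}; rtol = 10 * eps(Float64))
    return any(s -> isapprox(t, s; rtol = rtol), saved_times)
end

# Accumulating Δt = 0.1 ten times does not give exactly 1.0 in floating point,
# which is the kind of near-duplicate time the guard is meant to catch.
t = foldl(+, fill(0.1, 10))
skip_write = already_saved(t, [0.5, 1.0])
```

The same check could serve the "catch up" behaviour: after a pickup, the writer would stay silent until the model reaches a time that is not already in the file.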
PS: I simplified the example a bit to help me understand it