`JLD2OutputWriter` and `Checkpointer` don't work when `max_filesize` and `part` are specified. #3399

xkykai · 2023-12-01T19:36:02Z

Below is a minimal working example of the problem:

using Oceananigans
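using Printf

# Directory that holds the output and checkpoint files
FILE_DIR = "./test_outputwriter"
mkpath(FILE_DIR)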

grid = RectilinearGrid(Float64,
                       size = (4, 4, 4),
                       x = (0, 1),
                       y = (0, 1),
                       z = (-1, 0),
                       topology = (Periodic, Periodic, Bounded))

b_initial(x, y, z) = rand()

model = NonhydrostaticModel(; 
            grid = grid,
            buoyancy = BuoyancyTracer(),
            tracers = (:b),
            timestepper = :RungeKutta3)

simulation = Simulation(model, Δt=0.1, stop_iteration=200)
outputs = merge(model.velocities, model.tracers)

simulation.output_writers[:jld2] = JLD2OutputWriter(model, outputs,
                                                          filename = "$(FILE_DIR)/instantaneous_fields.jld2",
                                                          schedule = IterationInterval(1),
                                                          max_filesize=200e3,
                                                          part=1)

simulation.output_writers[:checkpointer] = Checkpointer(model, schedule=TimeInterval(100), prefix="$(FILE_DIR)/model_checkpoint")

run!(simulation)

# up until here most recent output file is instantaneous_fields_part4.jld2, but I want to continue running the simulation

simulation.stop_iteration = 400

simulation.output_writers[:jld2] = JLD2OutputWriter(model, outputs,
                                                          filename = "$(FILE_DIR)/instantaneous_fields.jld2",
                                                          schedule = IterationInterval(1),
                                                          max_filesize=200e3,
                                                          part=4)

run!(simulation, pickup="$(FILE_DIR)/model_checkpoint_iteration0.jld2")

What I'm doing is creating a directory test_outputwriter, and then writing fields into it with a specified file size and starting part number.
After the first run!(simulation), 4 output files have been written, the most recent being instantaneous_fields_part4.jld2, along with a checkpoint file model_checkpoint_iteration0.jld2.

Let's say I want to keep running this model, so I increase simulation.stop_iteration. I pick up the model from the most recent checkpoint and specify part=4 (the most recent file written). This creates an instantaneous_fields.jld2 and keeps writing into it, while throwing a warning:

Warning: Failed to save and serialize [:grid, :coriolis, :buoyancy, :closure] in ./test_outputwriter/instantaneous_fields.jld2 because ArgumentError: ArgumentError: a group or dataset named Nx is already present within this group

It never actually writes into instantaneous_fields_part4.jld2; it keeps writing and rewriting instantaneous_fields.jld2. If I instead specify part=10, or any number larger than 4, the same problem occurs.

If I use part=1 in my 2nd spin up of the simulation, it throws

ERROR: ArgumentError: '.\./test_outputwriter/instantaneous_fields_part1.jld2' exists. `force=true` is required to remove '.\./test_outputwriter/instantaneous_fields_part1.jld2' before moving.
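The wording of this error matches what Julia's Base `mv` raises when the destination already exists. That suggests (an assumption, not checked against the Oceananigans source) that the writer renames the current file to a `_partN` name. A minimal sketch that reproduces the same class of error with `mv` alone, using temporary files and no Oceananigans:

```julia
# Reproduce the "exists; force=true is required" error with Base.mv only.
dir = mktempdir()
src = joinpath(dir, "instantaneous_fields.jld2")
dst = joinpath(dir, "instantaneous_fields_part1.jld2")
write(src, "new data")
write(dst, "old data")

# Without force=true, mv refuses to replace an existing destination.
threw = try
    mv(src, dst)
    false
catch err
    err isa ArgumentError
end

# With force=true the existing destination is replaced.
mv(src, dst; force=true)
replaced = read(dst, String) == "new data"
```

Whether replacing the old part file is ever the right behavior is a separate question; the sketch only shows where a message of this form can come from.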

I'm not sure what the intended user experience is, but I was imagining that if for some reason the simulation stops and I want to rerun it from a checkpoint, 2 potential options would be available:

1. The model runs from the latest checkpoint and continues writing into the most recent output file once it catches up to the latest unsaved iteration. Note that since the model restarts from the checkpoint, the iterations it first runs through may already be saved in parts earlier than the most recent output. The simulation should know that and only start writing into the latest part once it catches up to the latest saved iteration.
2. I specify a part number that is larger than all the previous output files, and the simulation picks up from the checkpoint and writes into the new part number. This could mean that some iterations are saved more than once when examining all output files (new and old).

Option 1 is potentially the most important and common use case, but option 2 might not be an unreasonable usage either. However, in the current implementation neither can be achieved.
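The "catch up before writing" behavior in the first option can be sketched in plain Julia. The helper `should_write` and the iteration numbers below are hypothetical and only illustrate the idea; nothing here is Oceananigans API.

```julia
# Hypothetical helper, not part of Oceananigans: after picking up from a
# checkpoint, write only iterations newer than the last one already saved.
should_write(iteration, last_saved_iteration) = iteration > last_saved_iteration

checkpoint_iteration = 0      # iteration restored from the checkpoint
last_saved_iteration = 200    # newest iteration already in the output files
stop_iteration = 400

written = Int[]
for iteration in checkpoint_iteration:stop_iteration
    # Iterations 0..200 are recomputed but not written a second time.
    should_write(iteration, last_saved_iteration) && push!(written, iteration)
end
```

With these numbers, `written` holds iterations 201 through 400, so the existing files contain no duplicates.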


glwagner · 2023-12-02T17:37:44Z

The intended user experience is that only one line should need to be changed: pickup=false to pickup=true in run!.

Therefore, users should not have to manually specify the "part" that they want to pick up from. I don't like option 2 above.

I think that fixing this problem may become much easier if we can "delay" the creation of the output file. Right now, the output file is created when we build the output writer. But at that point, we have no way of knowing whether we are going to pick up or not.

I've long wanted to implement this "delay" but more pressing matters have intervened...

The basic thing we need to do is add an initialize!(output_writer, sim) utility that creates the output file. That function will know whether the simulation is starting fresh (because iteration(sim) == 0) or "continuing". One huge feature this enables is avoiding overwriting an existing file when it holds the output of the run being continued. That is a big problem with the current interface: you have to be really careful about overwrite_existing if you are trying to pick up from a checkpoint.
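A rough sketch of what such a delayed initialization might look like, using toy stand-in types. `ToyWriter`, `ToySimulation`, and the body of `initialize!` are all assumptions made for illustration, not the Oceananigans implementation:

```julia
# Toy stand-ins; these are NOT the Oceananigans types.
mutable struct ToyWriter
    filepath::String
    overwrite_existing::Bool
    initialized::Bool
end

ToyWriter(filepath; overwrite_existing=false) =
    ToyWriter(filepath, overwrite_existing, false)

struct ToySimulation
    iteration::Int
end

iteration(sim::ToySimulation) = sim.iteration

# Hypothetical initialize!: the file is touched only here, at run time,
# when a fresh start can be told apart from a continuing run.
function initialize!(writer::ToyWriter, sim::ToySimulation)
    continuing = iteration(sim) > 0
    if isfile(writer.filepath) && !continuing
        if writer.overwrite_existing
            rm(writer.filepath)   # fresh start: replace the old output
        else
            error("$(writer.filepath) exists; set overwrite_existing=true to replace it.")
        end
    end
    # A continuing run keeps any existing file so output can be appended.
    isfile(writer.filepath) || touch(writer.filepath)
    writer.initialized = true
    return continuing ? :appending : :fresh
end

# Usage: a continuing run (iteration 100) leaves existing output untouched.
path = joinpath(mktempdir(), "fields.jld2")
write(path, "existing output")
mode = initialize!(ToyWriter(path), ToySimulation(100))
kept = read(path, String) == "existing output"
```

In this sketch the constructor never touches the file system, so building a writer before deciding whether to pick up cannot destroy earlier output.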

With that feature I think we can also figure out how to handle output that is split into multiple files --- because we know if a simulation is continuing that we will have to figure out which part to use (if any).
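One way a continuing run could work out which part to use is to scan the output directory for existing part files. The helper `latest_part` below is hypothetical. It assumes the `<prefix>_partN.jld2` naming seen in this issue and a prefix with no regex metacharacters:

```julia
# Hypothetical helper, not part of Oceananigans: return the highest N among
# files named "<prefix>_partN.jld2" in `dir`, or `nothing` if none exist.
function latest_part(dir, prefix)
    pattern = Regex("^" * prefix * "_part(\\d+)\\.jld2\$")
    parts = Int[]
    for name in readdir(dir)
        m = match(pattern, name)
        m === nothing || push!(parts, parse(Int, m.captures[1]))
    end
    return isempty(parts) ? nothing : maximum(parts)
end

# Usage with empty placeholder files laid out like the example in this issue.
dir = mktempdir()
for n in 1:4
    touch(joinpath(dir, "instantaneous_fields_part$(n).jld2"))
end
touch(joinpath(dir, "model_checkpoint_iteration0.jld2"))
latest = latest_part(dir, "instantaneous_fields")
```

Parsing the part number as an integer, rather than sorting file names, keeps `part10` ordered after `part9`.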

> continues writing into the most recent output file once it catches up to the latest unsaved iteration.

This is a separate feature from what I was talking about, but I think it's also a great idea! It may also offer a clue for solving a roundoff error issue, where two outputs are written one iteration apart but at virtually identical times (e.g. distinguished only by machine epsilon).
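To make the roundoff issue concrete, here is a small sketch of a guard that treats two output times as identical when they differ only by floating-point error. The function `same_output_time` and its tolerance are assumptions for illustration, not existing Oceananigans behavior:

```julia
# Hypothetical guard, not part of Oceananigans: two output times count as the
# same if they agree to within a relative tolerance.
same_output_time(t, t_last; rtol=1e-12) = isapprox(t, t_last; rtol=rtol)

t_last = 0.3        # time of the previous output
t = 0.1 + 0.2       # nominally the same time, reached by accumulation

differs_exactly = t != t_last               # the two floats are not identical
is_duplicate = same_output_time(t, t_last)  # but they agree within tolerance
```

A purely relative tolerance does not handle a previous output time of exactly zero, so a real implementation would likely also need an absolute tolerance.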

PS: I simplified the example a bit to help me understand it

glwagner · 2023-12-02T17:38:42Z

Why do we even have the "part" kw for JLD2OutputWriter? I feel this is a weird detail and users should not have to set that.

xkykai added output 💾 user interface/experience 💻 labels Dec 1, 2023
