Intermittent RuntimeError: NetCDF: HDF #2038
-
Dear all, I'm running an experiment using the `from_nemo()` method with multiple NetCDF files as input. For the most part it works well, and I've been able to run several shorter simulations successfully. However, when I attempt longer experiments, I occasionally encounter the error below. To troubleshoot, I re-downloaded the hydrodynamic input files (based on #1029) and checked that xarray can open every file listed by glob in the code below (using both `xr.open_dataset` and `xr.open_mfdataset`), but the issue persists. What's particularly puzzling is that a 20-day experiment might run fine once, then fail with the exact same setup when rerun. Unfortunately, I haven't been able to reproduce this in a minimal example. That said, I'm including my full script and the error log in the hope that they provide enough context for debugging, or for pointing me in the right direction. Digging around, it might be something related to `computeTimeChunk` and xarray (version 2025.3.1), but I haven't found what is causing it. Thank you in advance for any help!

Supporting code/error messages:

```python
from glob import glob
from datetime import datetime, timedelta

import numpy as np
import parcels


def delete_particle(particle, fieldset, time):
    particle.delete()


def random_pset(fieldset=None, lon_range=(-48, -44), lat_range=(3, 6), npart=100):
    """Build a ParticleSet of npart particles at uniformly random positions."""
    return parcels.ParticleSet.from_list(
        fieldset=fieldset,
        pclass=parcels.ScipyParticle,
        lon=np.random.uniform(*lon_range, size=(npart,)),
        lat=np.random.uniform(*lat_range, size=(npart,)),
        time=np.zeros(shape=(npart,)),
    )


# general settings
data_path = '/home/nilodna/postdoc/data/glob16'
mesh_mask = f'{data_path}/GLOB16L98_mesh_mask_atlantic.nc'

# simulation_start should match the time span available in the filenames
simulation_start = datetime(2021, 9, 10, 12, 0, 0)
random_test = True

ufiles = sorted(glob(f"{data_path}/ROMEO.01_1d_uo_2021*.nc"))
vfiles = sorted(glob(f"{data_path}/ROMEO.01_1d_vo_2021*.nc"))
wfiles = sorted(glob(f"{data_path}/ROMEO.01_1d_wo_2021*.nc"))

filenames = {
    'U': {'lon': mesh_mask, 'lat': mesh_mask, 'depth': ufiles[0], 'data': ufiles},
    'V': {'lon': mesh_mask, 'lat': mesh_mask, 'depth': ufiles[0], 'data': vfiles},
    'W': {'lon': mesh_mask, 'lat': mesh_mask, 'depth': ufiles[0], 'data': wfiles},
}
variables = {'U': 'uo', 'V': 'vo', 'W': 'wo'}
dimensions = {'lon': 'glamf', 'lat': 'gphif', 'depth': 'depthu', 'time': 'time_counter'}

fieldset = parcels.FieldSet.from_nemo(
    filenames, variables, dimensions,
    indices={'lon': [0, 1800], 'lat': [1000, 3000]},
    chunksize=False,
    allow_time_extrapolation=True,
)

pset = random_pset(fieldset)
kernels = pset.Kernel(parcels.AdvectionRK4_3D)

output_file = pset.ParticleFile(name="Output.zarr", outputdt=timedelta(hours=3))
output_file.metadata["date_created"] = datetime.now().isoformat()

pset.execute(
    kernels,
    runtime=timedelta(days=14),
    dt=timedelta(hours=3),
    output_file=output_file,
)
```
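For reference, the xarray open check mentioned above was essentially the following (reconstructed for this post rather than copied from my script; note that `open_dataset` is lazy by default, so it mostly validates metadata):

```python
# Reconstructed sketch of the per-file open check (not the exact code I ran).
# Note: xr.open_dataset is lazy by default, so this mainly reads metadata,
# not the data blocks themselves.
from glob import glob
import xarray as xr

data_path = '/home/nilodna/postdoc/data/glob16'
for path in sorted(glob(f"{data_path}/ROMEO.01_1d_*o_2021*.nc")):
    with xr.open_dataset(path) as ds:
        print(path, 'opened OK:', list(ds.data_vars))
```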
-
This error looks very similar to pydata/xarray#4050. Perhaps your file is corrupted? When you were loading your datasets in xarray, were you loading them into memory? (By default, with large datasets, xarray creates Dask arrays, which are evaluated lazily, so simply opening a file may never touch corrupt data blocks.) The following code should be close to what you want to try.

```python
# I haven't tested this code, but it should work...
import xarray as xr
from itertools import pairwise  # requires Python >= 3.10

data_path = '/home/nilodna/postdoc/data/glob16'
mesh_mask = f'{data_path}/GLOB16L98_mesh_mask_atlantic.nc'

ds_mesh = xr.open_dataset(mesh_mask)
ds_u = xr.open_mfdataset(f"{data_path}/ROMEO.01_1d_uo_2021*.nc")
ds_v = xr.open_mfdataset(f"{data_path}/ROMEO.01_1d_vo_2021*.nc")
ds_w = xr.open_mfdataset(f"{data_path}/ROMEO.01_1d_wo_2021*.nc")

_ = ds_mesh.load()  # force a full read, then discard the result

# Load the velocity arrays into memory, a few timesteps at a time
try:
    for ds_full in [ds_u, ds_v, ds_w]:
        # Slice boundaries along the time dimension (the final partial slice included)
        bounds = list(range(0, ds_full.time_counter.size, 3)) + [ds_full.time_counter.size]
        for start, end in pairwise(bounds):
            ds = ds_full.isel(time_counter=slice(start, end))
            _ = ds.load()  # force a full read, then discard the result
except RuntimeError as e:
    e.add_note(f"Error encountered on:\n{ds}")  # add_note() requires Python >= 3.11
    raise
```

Hopefully that helps. My only idea at the moment is that it's a data issue; I haven't seen this before.

Really not sure why this happens, or why it's flaky. Hopefully the code above sheds some light on what the problem is. Thanks for the error log, the code, and for trying to produce a minimal reproducer; it helps quite a bit with debugging!
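If you want to check integrity below the NetCDF layer as well, here is another untested sketch, this time using h5py (it assumes your files are NetCDF-4, i.e. HDF5 containers, and that h5py is installed):

```python
# Untested sketch: force a full read of every dataset in each NetCDF-4/HDF5 file.
# Loads each variable fully into memory, one file at a time.
from glob import glob
import h5py

def read_all(name, obj):
    if isinstance(obj, h5py.Dataset):
        _ = obj[...]  # full read; raises on corrupt chunks
    # return None so visititems() keeps walking the file

data_path = '/home/nilodna/postdoc/data/glob16'
for path in sorted(glob(f"{data_path}/ROMEO.01_1d_*o_2021*.nc")):
    try:
        with h5py.File(path, "r") as f:
            f.visititems(read_all)
    except (OSError, RuntimeError) as e:
        print(f"{path}: FAILED ({e})")
    else:
        print(f"{path}: OK")
```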
-
Hi @VeckoTheGecko, I'm coming back to this just to report what I've found about the intermittent problem. The problem indeed wasn't with Parcels, and it took me a few months to figure out what was going on. I'm not completely sure about this, and I don't know how to verify it, but here it goes:

I found this comment from Deepak Cherian mentioning the "bad disk" possibility and dug into it. What I found is that HDF5 can have issues with NetCDF files on SSD storage (here), which sometimes corrupts a file during writing or reading. I dealt with that by downloading my entire dataset onto an HDD (using your code to double-check whether each file was corrupted) and running the model with the same setup as before. Surprisingly, no RuntimeError popped up! I then rsynced these files from the HDD to an SSD (everything on the same machine) and ran the model: the error came back. I then went back to running the model with the files from the HDD, and the error was there again.

That made us suspect a second problem: cache management. The thing is that xarray (or the system underneath it) caches file reads to speed up re-reading the same file, perhaps using checksums to identify the files, but I'm not sure about this. The error kept appearing even when using different copies of the files on different partitions. By forcing a clean-up of the cache on the Unix system, we were able to run the model again without having to reboot the machine.

So I'm not entirely sure about all of this, but I've now been running the model on my old laptop (HDD, Unix system) for a while without any RuntimeError. On my new laptop, with an SSD, the model does not run, and the same was true on the server I was working on. So I'm fairly confident it has something to do with this SSD/HDD difference. My problem is solved, and I hope this information might be helpful for future users. Thank you very much for your help. Best,
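For future users who want to reproduce the cache clean-up step: on Linux this is the page-cache drop. A minimal sketch (requires root; `/proc/sys/vm/drop_caches` is the standard Linux interface, nothing Parcels- or xarray-specific):

```python
# Sketch: drop the Linux page cache without rebooting (run as root).
# Equivalent to the shell command: sync && echo 3 > /proc/sys/vm/drop_caches
import os

os.sync()  # flush dirty pages to disk first
with open('/proc/sys/vm/drop_caches', 'w') as f:
    f.write('3\n')  # 3 = free the page cache plus dentries and inodes
```

Separately, xarray keeps its own least-recently-used cache of open file handles, tunable via `xr.set_options(file_cache_maxsize=...)`; lowering it is one way to rule that cache in or out.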
How did you deal with the corrupt files? Did you remove them from the simulation, fix them some other way, or re-download them?
I'm not sure why the problem is intermittent, but it sounds like it's caused by corrupt data; this is something your data provider should be able to help you with... Have you tried re-downloading the data, or discussing with the provider to check whether they're hosting corrupt files?
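If the provider publishes checksums for the files (an assumption; the `expected` mapping and the filename below are hypothetical), a quick verification loop might look like this:

```python
# Sketch: compare local file hashes against provider-published checksums.
# The `expected` entries are hypothetical; fill them in from the provider's list.
import hashlib
from pathlib import Path

def sha256sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in streaming fashion so large NetCDF inputs fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

data_path = Path('/home/nilodna/postdoc/data/glob16')
expected = {'ROMEO.01_1d_uo_20210910.nc': '...'}  # hypothetical entries
for name, checksum in expected.items():
    actual = sha256sum(data_path / name)
    print(name, 'OK' if actual == checksum else f'MISMATCH ({actual})')
```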
This problem i…