nco randomly fails to append ts plots #365

Open
wwieder opened this issue Apr 5, 2025 · 5 comments
Labels
bug Something isn't working LDF Specific request for land diagnostics

Comments

@wwieder
Collaborator

wwieder commented Apr 5, 2025

What happened?

It seems like nco randomly fails to append land timeseries plots with area & landfrac data, which causes subsequent parts of the adf workflow to fail. Error code from ts generation is below.

nco_err_exit(): ERROR Short NCO-generated message (usually name of function that triggered error): nco_inq()
nco_err_exit(): ERROR Error code is -107. Translation into English with nc_strerror(-107) is "NetCDF: Can't open HDF5 attribute"
nco_err_exit(): ERROR NCO will now exit with system call exit(EXIT_FAILURE)

The source code that's causing the failure is in adf_diag

  "ncks", "-A", "-C", "-v", "area,landfrac,landmask",

I've tried changing the ncarenv being used, as well as nco and hdf5 versions that are loaded (e.g. nco/5.2.4 or 5.3.1). I also tried pointing to different cases and adding the -C flag to the ncks command.

The timeseries and climo generation typically proceeds, but then the LDF can fail when trying to calculate tables of global means / sums, although this is also inconsistent. In cases when the LDF fails I've been able to identify the case and variables where ncks fails to append ts files appropriately. Then I delete both of these ts and climo files, and repeat the LDF. Eventually this manual workaround is successful, but it's a frustrating, time-consuming process.
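
For anyone who wants to automate the "find which files failed" step of this workaround, here is a minimal sketch. It is not the ADF code: `build_append_cmd`, `append_and_report`, and the file names are hypothetical, and it assumes the ncks call takes the source file followed by the ts file after the arguments shown above. It only checks the ncks return code and collects the targets that failed, so they can be deleted and regenerated.

```python
import subprocess
from pathlib import Path

# Variable names taken from the ncks call quoted in this issue.
STATIC_VARS = ("area", "landfrac", "landmask")


def build_append_cmd(src, dst, variables=STATIC_VARS):
    """Build the ncks command that appends `variables` from src into dst."""
    return ["ncks", "-A", "-C", "-v", ",".join(variables), str(src), str(dst)]


def append_and_report(src, targets, run=subprocess.run):
    """Run the append for every target and return the targets where ncks failed.

    `run` defaults to subprocess.run; it is a parameter only so the logic can
    be exercised without ncks installed.
    """
    failed = []
    for dst in targets:
        result = run(build_append_cmd(src, dst), capture_output=True, text=True)
        if result.returncode != 0:
            # A non-zero exit means the append did not complete for this file.
            failed.append(Path(dst))
    return failed
```

The returned list could then drive the cleanup of the matching ts and climo files before the LDF is repeated, instead of tracking the failures down by hand.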

ADF Hash you are using

clm-diag branch

What machine were you running the ADF on?

CISL machine

@wwieder wwieder added bug Something isn't working LDF Specific request for land diagnostics labels Apr 5, 2025
@wwieder wwieder moved this to Todo in LDF on ADF Apr 6, 2025
@mvdebolskiy

@adagj @YanchunHe have you encountered this on nird?

@mvdebolskiy

@justin-richling
This might be the cause: https://www.unidata.ucar.edu/mailing_lists/archives/netcdfgroup/2017/msg00015.html
I suspect that because the multiprocessing is done outside of the nco calls, the code modifies the _FillValue attributes too many times, and you get this error due to some old netcdf linked into NCO.
Can you try removing the attributes from the variables you are processing before you append them, and then adding them back at the end? Or reorganize the multiprocessing in that code, so that the nco commands are parallelized inside the calls and not outside them?

@wwieder
Collaborator Author

wwieder commented Apr 7, 2025

This seems to be an issue with running on too many processors and overloading the number of times a file is being opened/used? It's also more likely when we're running with more variables. Regardless, running on a single processor makes this work fine (although slowly). Is there another way to ncks time series files with multiprocessing?

wwieder added a commit that referenced this issue Apr 7, 2025
Avoid crashes related to #365, but runs slowly
wwieder added a commit that referenced this issue Apr 7, 2025
Avoid crashes related to #365, but runs slowly
@wwieder
Collaborator Author

wwieder commented Apr 8, 2025

FWIW

Richard Valent at the NCAR help desk commented:

Thanks, Will. I'm glad you have a workaround running sequentially.

The NCO User Guide https://nco.sourceforge.net/nco.pdf describes some parallel strategy.

It looks like it's on the user to program it correctly.

I'll need to study the Guide further to be sure, esp the section named "Parallel" starting on p. 415 of the Guide. I'll let you know when I have had a look.

PS I checked our past tickets and find no references to users trying to run CDO in parallel.

@samsrabin
Member

This might be better solved by using (u)xarray with dask in Python rather than calling out to the nco utilities.
