"xarray.to_netcdf" needs too much memory for long datasets(e.g., R95P based on the ERA5 data) #1950
Hi @millet5818
percentile_doy computes percentiles for each day of the year, so you get 365 values per grid cell. quantile, on the other hand, when computed along the time axis, gives 1 value per grid cell. These values can then be used as thresholds to compute the exceedance in indices such as R95p.

I would also like to draw your attention to the fact that, according to the ECAD's ATBD, R95p should be computed on "period percentiles" and not on doy percentiles as you have shown in your example. See https://knmi-ecad-assets-prd.s3.amazonaws.com/documents/atbd.pdf

Regarding performance

First, if you compute period percentiles instead of doy percentiles, you may not have any performance issue at all, because computing doy percentiles requires many more operations than computing period percentiles.

Then, if you still have performance issues, read on. I suggest that you try the distributed scheduler of dask; it gives much more control over the memory management of the computation. Have a look at the quickstart here: https://distributed.dask.org/en/latest/quickstart.html

In short, you first need to install it with pip or conda. Then you can set up dask's LocalCluster (adapt the memory limit and the number of threads to your machine) and run your computation in the same Python process (typically the same notebook).

I hope this helps!
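A minimal sketch of such a LocalCluster setup, assuming dask and distributed are installed. The worker count, thread count and memory limit are placeholder values rather than the settings from this reply, and start_local_cluster is a hypothetical helper name used only for illustration.

```python
# Install first, for example: python -m pip install dask distributed
from dask.distributed import Client, LocalCluster


def start_local_cluster(n_workers=1, threads_per_worker=4,
                        memory_limit="8GB", processes=True):
    """Start a local dask cluster and return (cluster, client).

    All defaults are placeholders to adapt to the machine;
    memory_limit applies to each worker, not to the whole cluster.
    """
    cluster = LocalCluster(
        n_workers=n_workers,
        threads_per_worker=threads_per_worker,
        memory_limit=memory_limit,
        processes=processes,
    )
    # Creating a Client makes this cluster the default dask scheduler.
    client = Client(cluster)
    return cluster, client


# Typical use in a script (the guard matters when workers are processes):
#
# if __name__ == "__main__":
#     cluster, client = start_local_cluster()
#     ...  # open the dataset with chunks, compute, write the result
#     client.close()
#     cluster.close()
```

Once the client exists, dask-backed xarray computations started in the same Python process run on this cluster, so the per-worker memory limit applies to them.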
Generic Issue
Description
Dear developers:
I want to calculate R95P based on ERA5 data from 1950 to 2023. The functions create_ensemble or open_mfdataset were used to load the dataset. The functions ensemble_percentiles or percentile_doy were used to calculate the day-of-year percentiles. R95P was then computed with xclim.indicators.icclim.R95p. However, memory usage exploded when the R95P results were exported. I am also confused about the difference between the functions quantile and percentile_doy.
Code
Memory usage explodes at this point:
results_R95P.to_netcdf('../R95P.nc', format='NETCDF4', engine='netcdf4')
What I Did
I initially guessed that the computer's memory was too small, so I loaded the data for each grid into memory separately and finally concatenated all grids, but this made the calculation too time-consuming. Is there a better way?
Simple example for my solution
Thank you very much for your help, and I look forward to your reply!
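On the quantile versus percentile_doy question, the difference in output shape can be sketched with plain xarray on synthetic data. The grid size, the date range and the gamma-distributed values are made up for illustration, and the groupby call is only a simplified stand-in for xclim's percentile_doy, not its actual implementation.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic daily "precipitation" for two non-leap years on a 2 x 3 grid
# (made-up values, not ERA5).
time = pd.date_range("2001-01-01", "2002-12-31", freq="D")
rng = np.random.default_rng(0)
pr = xr.DataArray(
    rng.gamma(shape=0.5, scale=4.0, size=(time.size, 2, 3)),
    dims=("time", "lat", "lon"),
    coords={"time": time},
    name="pr",
)

# Period percentile: one threshold per grid cell.
period_p95 = pr.quantile(0.95, dim="time")

# Day-of-year percentile: one threshold per calendar day and grid cell.
doy_p95 = pr.groupby("time.dayofyear").quantile(0.95, dim="time")

print(dict(period_p95.sizes))  # only lat and lon remain
print(dict(doy_p95.sizes))     # adds a dayofyear dimension of length 365

# The period threshold broadcasts against the full series, so counting
# the days above it is a single vectorized comparison.
exceedance_days = (pr > period_p95).sum("time")
```

The day-of-year result is 365 times larger than the period result and needs a separate percentile computation for every calendar day, which is consistent with the reply's point that period percentiles are much cheaper.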