Iris save taking very long (hanging) with some netcdf files #3362

Closed
rswamina opened this issue Jul 29, 2019 · 19 comments

@rswamina

I am trying to read in a few netcdf files, make a simple change to the data and concatenate them using the concatenate_cube function. It looks like iris takes a very long time, or hangs, when saving or accessing this data. I am using this data for some development work with the ESMValTool (https://github.com/ESMValGroup/ESMValTool) and the ultimate goal is to make the dataset CMOR compliant. The version of Python being used is 3.7.3 and Iris is 2.2.1. Sample files are at
https://drive.google.com/drive/folders/1ZZQ6jRP3MsY4WPku-dj9wc2H1jKrNOZF?usp=sharing
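
Roughly, the processing looks like this (the file pattern and the "simple change" here are illustrative stand-ins, not the exact ESMValTool code):

import iris
from iris.cube import CubeList

# load the yearly files as separate cubes (lazy data)
cubes = iris.load_raw('GPP.ANN.CRUNCEPv6.monthly.198*.nc', 'GPP')

# any simple arithmetic stands in for the data change; it stays lazy
cubes = CubeList([cube * 1.0 for cube in cubes])

# join the yearly cubes along the time dimension
cube = cubes.concatenate_cube()

# the hang shows up when the lazy data finally has to be realised,
# e.g. here at save time
iris.save(cube, 'GPP_concatenated.nc')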
A suggestion made by @valeriupredoi, a core developer of ESMValTool, is:

--
it fails to realize the lazy data - for a smaller array it's still crazy slow:

time c[0, :, :].data
CPU times: user 16.4 s, sys: 6.13 s, total: 22.6 s
Wall time: 23.1 s
where c is the cube from one of the files only (the 0th)

--
Would it be possible to identify and fix this issue?

@rswamina
Author

An ESMValTool core development team member suggested tagging @bjlittle on this, not sure if this does it.

@bjlittle
Member

@rswamina Thanks for taking the time to raise this issue, much appreciated.

I'll see if I can recreate the issue on my side and investigate further 👍

@valeriupredoi
Contributor

cheers @bjlittle man - one thing to mention is that I tried with both dask=1.1.4 and dask=2.1.0 and I get the same weird behaviour - the cube loads fine (and all lazy operations work) but when realizing the data, simply by print(cube.data), it hangs. Accessing individual members of the data works fine but gets slower the larger the data slice to be accessed gets, as the code snippet @rswamina posted above shows 🍺 I am going on hols today till Monday so won't be able to help much in the meantime, I'm afraid

@bjlittle
Member

@rswamina and @valeriupredoi This is pretty interesting, and also very relevant at the moment, see issue #3357

The behaviour that you're witnessing is the direct result of bad/unfavourable chunking at the file-format level of the associated netCDF variables.

The netCDF chunking of each data variable in the provided files (thanks @rswamina for providing the data btw 👍) is as follows:

>>> import netCDF4 as nc
>>> for i in range(3):
...     fname = 'GPP.ANN.CRUNCEPv6.monthly.198{}.nc'.format(i)
...     ds = nc.Dataset(fname)
...     var = ds.variables['GPP']
...     print(fname)
...     print(var)
...     print('chunking = {}\n'.format(var.chunking()))
... 
GPP.ANN.CRUNCEPv6.monthly.1980.nc
<class 'netCDF4._netCDF4.Variable'>
float32 GPP(time, lat, lon)
    long_name: GPP
    scale_factor: 1.0
    add_offset: 0.0
    missing_value: -9999.0
    _FillValue: -9999.0
    units: gC m-2 day-1
unlimited dimensions: 
current shape = (12, 360, 720)
filling on
chunking = [1, 10, 10]

GPP.ANN.CRUNCEPv6.monthly.1981.nc
<class 'netCDF4._netCDF4.Variable'>
float32 GPP(time, lat, lon)
    long_name: GPP
    scale_factor: 1.0
    add_offset: 0.0
    missing_value: -9999.0
    _FillValue: -9999.0
    units: gC m-2 day-1
unlimited dimensions: 
current shape = (12, 360, 720)
filling on
chunking = [1, 10, 10]

GPP.ANN.CRUNCEPv6.monthly.1982.nc
<class 'netCDF4._netCDF4.Variable'>
float32 GPP(time, lat, lon)
    long_name: GPP
    scale_factor: 1.0
    add_offset: 0.0
    missing_value: -9999.0
    _FillValue: -9999.0
    units: gC m-2 day-1
unlimited dimensions: 
current shape = (12, 360, 720)
filling on
chunking = [1, 10, 10]

As you can see, each GPP data variable has been given a netCDF chunk shape of [1, 10, 10], which is:

  1. odd (to me, anyways)
  2. very small relative to the data variable shape of (12, 360, 720)

More importantly, at the moment iris will honour this netCDF data variable chunking and instruct dask that this is the chunk size for the associated lazy data of its cube.

Note that each netCDF variable is a float32, so 4 x 12 x 360 x 720 ~= 12 MB in size, which easily fits in memory. So let's see netCDF4 load one of these variables:

>>> from timeit import timeit
>>> timeit('var[:]', number=3, globals=dict(var=var))
1.6068386339993594
>>> timeit('var[:]', number=1, globals=dict(var=var))
0.5159940029998324

This seems quite reasonable. Now let's try this with dask, but with sensible chunks:

>>> import dask.array as da
>>> dvar = da.from_array(var, chunks=var.shape)
>>> timeit('dvar.compute()', number=3, globals=dict(dvar=dvar))
1.6165974420000566
>>> timeit('dvar.compute()', number=1, globals=dict(dvar=dvar))
0.543516056000044

This seems pretty reasonable to me too.

Now, let's do the same again, but this time set the chunks with the associated netCDF chunking value:

>>> dvar = da.from_array(var, chunks=var.chunking())
>>> dvar.compute()
^CTraceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bill/miniconda3/envs/iris/lib/python3.7/site-packages/dask/base.py", line 156, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/home/bill/miniconda3/envs/iris/lib/python3.7/site-packages/dask/base.py", line 398, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/bill/miniconda3/envs/iris/lib/python3.7/site-packages/dask/threaded.py", line 76, in get
    pack_exception=pack_exception, **kwargs)
  File "/home/bill/miniconda3/envs/iris/lib/python3.7/site-packages/dask/local.py", line 453, in get_async
    key, res_info, failed = queue_get(queue)
  File "/home/bill/miniconda3/envs/iris/lib/python3.7/site-packages/dask/local.py", line 141, in queue_get
    return q.get()
  File "/home/bill/miniconda3/envs/iris/lib/python3.7/queue.py", line 170, in get
    self.not_empty.wait()
  File "/home/bill/miniconda3/envs/iris/lib/python3.7/threading.py", line 296, in wait
    waiter.acquire()
KeyboardInterrupt

Yeah, well, I pretty much gave up... it hung, as I expected, and as you've been witnessing 😢

@bjlittle
Member

@rswamina and @valeriupredoi

So in a nutshell, the issue (to me) is directly related to the netCDF chunking size specified for the data variables - it's just not appropriate for dask.

For this data and chunking, you hit the issue at the very point you attempt to realise the data, i.e. in this case when you attempt to save the cube. This issue is related to #3333 and #3357, and will hopefully be resolved by #3361.

@bjlittle
Member

Yup, just checked, #3361 does the job, nice one @pp-mo 👍

@rswamina
Author

Hi, @bjlittle - Thanks for looking into this. Is there any other information I can provide towards debugging? Just that I am working towards a deadline and need the file saved after processing. Do let me know.

@rswamina
Author

Ha...just noticed my page did not get refreshed...will check the messages on the thread and your responses...Thanks!

@rswamina
Author

If I understand this correctly, the change will be made in Iris v2.3.0? Does that mean I should update my version of iris within ESMValTool (from 2.2.1, which is what it is currently, to 2.3.0) for the fix to work? Sorry, I am not very familiar with how these changes propagate.

@bjlittle
Member

@rswamina We've yet to incorporate the change that fixes your issue into iris. Even so, we're not due to release iris version 2.3.0 until 30 August, see https://github.com/SciTools/iris/milestones

So it seems that you're going to have to work around this issue in the meantime, sorry. You could do this in a couple of ways, either:

  • re-chunk the netCDF files themselves before loading them (e.g. with nccopy, as sketched below), or
  • install the iris development branch from #3361 that fixes the chunking behaviour.
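
For the re-chunking route, a single nccopy command along these lines should rewrite a file with much bigger chunks (the chunk sizes and file names here are only an illustration, pick whatever suits your data):

> nccopy -c time/12,lat/360,lon/720 GPP.ANN.CRUNCEPv6.monthly.1980.nc GPP.1980.rechunked.nc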

Does this help?

@rswamina
Author

Right. Let me try the rechunking. This is particularly a problem when I try concatenating all the data into one cube. I will report back with problems/status.

@bjlittle
Member

bjlittle commented Jul 31, 2019

@rswamina Are you using conda to install ESMValTool? If so, then I could talk you through installing the PR branch of iris, if you're unable to rechunk using nccopy (or something similar).

@rswamina
Author

I am installing ESMValTool from source. This is because I am adding code for new data sets. Is it still possible to install the PR branch of iris? Rechunking seems to take considerable time.

I followed these instructions https://github.com/ESMValGroup/ESMValTool/blob/version2_development/CONTRIBUTING.md

for installation from source. Is there any way to install and use a different branch of iris in my local environment for now?

@bjlittle
Member

@rswamina I also tried using the nccopy approach last night, and kinda gave up after no response for 20 minutes, but I simply thought it was more of an issue with my laptop.

So if you've got ESMValTool in a conda environment, simply do the following:

  • activate the conda environment containing ESMValTool
  • clone the iris development branch somewhere e.g.
> mkdir iris-dev
> cd iris-dev
> git clone git@github.com:pp-mo/iris.git
> cd iris
> git fetch origin
> git checkout chunk_control
> pip install -e .

Or you could simply try

> pip install git+https://github.com/pp-mo/iris.git@chunk_control

I'm writing this on my mobile, so I can't test that what I'm suggesting is perfectly correct.

Once installed, fire up python to check that the development iris version is installed and being used, e.g.

> python
>>> import iris
>>> iris.__version__

and this should give you the development version of iris.

Give that a try...

@rswamina
Author

@bjlittle - Quick update: I figured out an optimum chunk size to rechunk to, (12, 180, 360), and that seemed to do it. Chunking time was short and processing was fine. I will try out the iris cloning option and update this thread.

@bjlittle
Member

Yup, dask Rechunking would also work 👍

@pp-mo
Member

pp-mo commented Aug 1, 2019

dask Rechunking would also work

Re-chunking the files will work, but I don't think dask re-chunking will??
Re-chunking some existing lazy arrays only adds another "layer" to the dask graph: the original small chunks still exist "underneath". So if they are too small, and file data fetching (not calculation) is the slow part, then dask rechunking can't fix that: you need it to take bigger bites of the file. Unfortunately, that decision can't (easily) be changed after the initial load is done. That's true for Iris, but also for Dask itself.
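
To illustrate the point, a sketch using the same file and variable as in the examples above:

>>> import dask.array as da
>>> import netCDF4 as nc
>>> var = nc.Dataset('GPP.ANN.CRUNCEPv6.monthly.1980.nc').variables['GPP']
>>> # rechunking afterwards only adds a layer on top of the tiny file chunks,
>>> # so every (1, 10, 10) block is still read from the file individually:
>>> slow = da.from_array(var, chunks=var.chunking()).rechunk(var.shape)
>>> # asking for big chunks up front is what actually avoids all the small reads:
>>> fast = da.from_array(var, chunks=var.shape)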

@github-actions
Contributor

github-actions bot commented Jan 8, 2022

In order to maintain a backlog of relevant issues, we automatically label them as stale after 500 days of inactivity.

If this issue is still important to you, then please comment on this issue and the stale label will be removed.

Otherwise this issue will be automatically closed in 28 days time.

github-actions bot added the Stale (a stale issue/pull-request) label Jan 8, 2022
@github-actions
Contributor

github-actions bot commented Feb 6, 2022

This stale issue has been automatically closed due to a lack of community activity.

If you still care about this issue, then please either:

  • Re-open this issue, if you have sufficient permissions, or
  • Add a comment pinging @SciTools/iris-devs who will re-open on your behalf.
