From the Pangeo working meeting discussion with @mgrover1 @jmunroe @norlandrhagen
Here's an outline for an intermediate tutorial on Dask chunking, aimed specifically at Xarray users
Motivation: why care about chunk size?
- demonstrate relation between chunk size and computation time / number of tasks with a simple example?
- maybe even memory usage
- https://tutorial.dask.org/02_array.html#Choosing-good-chunk-sizes
- https://docs.dask.org/en/stable/array-chunks.html
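A possible "simple example" for the motivation section, sketched with plain `dask.array` so it runs without any data files. The array shape, the two chunk sizes, and the helper name `count_tasks` are illustrative assumptions, not part of the outline. It only counts chunks and tasks in the lazy graph; timing and memory would need an actual `.compute()`.

```python
import dask.array as da

SHAPE = (4000, 4000)  # illustrative array size (assumption)


def count_tasks(chunks):
    """Build a lazy computation and return (number of chunks, number of tasks)."""
    arr = da.random.random(SHAPE, chunks=chunks)
    result = (arr + 1).mean()  # nothing is computed here, only a graph is built
    return arr.npartitions, len(result.__dask_graph__())


small_chunks = count_tasks((100, 100))    # many small chunks
large_chunks = count_tasks((1000, 1000))  # few large chunks

print("100 x 100 chunks   (n_chunks, n_tasks):", small_chunks)
print("1000 x 1000 chunks (n_chunks, n_tasks):", large_chunks)
```

The same computation produces a much larger task graph with the smaller chunks, which is the scheduler overhead the tutorial could then relate to wall-clock time.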
Keeping track
- monitoring chunk sizes and num tasks throughout the pipeline using the repr
- use some images
- while output blocks may be small (say after a big reduction), intermediate blocks need not be.
- So keep monitoring chunksizes (and tasks) throughout the pipeline.
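One way to illustrate the monitoring points above, as a minimal sketch: the variable name `tas`, the dimensions, and the helpers `max_chunk_bytes` and `report` are hypothetical. The repr in a notebook shows the same information graphically; here the chunk size and task count are printed at each step.

```python
import math

import dask.array as da
import xarray as xr

# Synthetic dask-backed DataArray, chunked along time only (illustrative sizes).
data = da.ones((365, 180, 360), chunks=(30, 180, 360))
arr = xr.DataArray(data, dims=("time", "lat", "lon"), name="tas")


def max_chunk_bytes(obj):
    """Size in bytes of the largest chunk of a dask-backed DataArray."""
    return math.prod(obj.data.chunksize) * obj.dtype.itemsize


def report(label, obj):
    """Print largest chunk shape, its size, and the number of tasks so far."""
    n_tasks = len(obj.__dask_graph__())
    print(f"{label}: chunksize={obj.data.chunksize}, "
          f"bytes={max_chunk_bytes(obj)}, tasks={n_tasks}")


anomaly = arr - arr.mean("time")       # intermediate: as large as the input
result = anomaly.mean(("lat", "lon"))  # output: one value per time step

report("input", arr)
report("intermediate", anomaly)
report("output", result)
```

The output chunks are tiny, but the intermediate `anomaly` chunks are as large as the input chunks, so looking only at the final result would hide the memory footprint of the pipeline.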
Why is it important to choose appropriate chunks early in the pipeline?
- Demonstrate that rechunking is not cheap in most cases
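A rough sketch of how the rechunking cost could be shown without timing anything, by comparing task-graph sizes before and after a rechunk. The shapes and chunk sizes are assumptions chosen so that the number of chunks stays the same while their orientation changes.

```python
import dask.array as da

# 100 chunks, each spanning the full last two axes (chunked along axis 0).
arr = da.ones((1000, 100, 100), chunks=(10, 100, 100))

# Still 100 chunks, but now each spans the full first axis instead.
rechunked = arr.rechunk((1000, 10, 10))

tasks_before = len(arr.__dask_graph__())
tasks_after = len(rechunked.__dask_graph__())

print("chunks before / after:", arr.npartitions, rechunked.npartitions)
print("tasks  before / after:", tasks_before, tasks_after)
```

Every output chunk needs a piece of every input chunk here, so the graph grows well beyond the original 100 chunks even though no analysis has been done yet.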
Specify chunks when reading data
- Avoid `chunks="auto"`.
- Specifying `chunks` during data read (`open_dataset`, `open_mfdataset`)
- Analysis vs storage chunks:
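For the point about passing `chunks` at read time, a minimal sketch follows. It writes a small synthetic file first so the example is self-contained; the file name, variable name `tas`, and chunk size are hypothetical, and it assumes a netCDF backend (for example `netCDF4`, `h5netcdf`, or `scipy`) is installed alongside dask.

```python
import os
import tempfile

import numpy as np
import xarray as xr

# Write a tiny synthetic dataset to disk (stand-in for real data).
ds = xr.Dataset({"tas": (("time", "lat", "lon"), np.zeros((120, 18, 36)))})
path = os.path.join(tempfile.mkdtemp(), "example.nc")
ds.to_netcdf(path)

# Ask for explicit chunks along time when opening, instead of chunks="auto".
opened = xr.open_dataset(path, chunks={"time": 12})
time_chunks = opened["tas"].chunks[0]
print("chunks along time:", time_chunks)

opened.close()
```

`open_mfdataset` accepts the same `chunks` argument, so the same pattern applies when reading many files.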