
Workflow to rechunk NWM retrospective zarr data #122

Merged 1 commit into master from workflow/dask-rechunk on Feb 2, 2023

Conversation

@jpolchlo (Collaborator) commented Jan 6, 2023

Overview

We have wanted to rechunk the NWM retrospective data to be optimized for time-series queries over a limited spatial extent, which seems to be a more common use case. Previous studies done by Azavea have shown that this style of query is faster on a rechunked data set.

When we run this process on a single EC2 node, the job runs out of memory. This PR presents a solution based on a Dask cluster running in Kubernetes: using an Argo workflow, we can execute the included Python script (rechunk-retro-data.py) on an arbitrarily-sized cluster to perform the rechunking operation. In my test, I used 48 workers with 8 GB of RAM. The result can be found at s3://azavea-noaa-hydro-data/experiments/jp/rechunk/output.zarr.
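For reference, a minimal sketch of what the rechunking step looks like. This is not the actual rechunk-retro-data.py: it assumes the rechunker library is used, and the chunk sizes, memory budget, scheduler address, and output/temp URIs are placeholders.

import fsspec
import zarr
from dask.distributed import Client
from rechunker import rechunk

# Connect to the Dask scheduler running in the Kubernetes-backed cluster
client = Client("<dask-scheduler-address>")

# Open the NOAA-provided retrospective zarr store (chunked to favor spatial queries)
src = zarr.open_consolidated(
    fsspec.get_mapper("s3://noaa-nwm-retrospective-2-1-zarr-pds/chrtout.zarr", anon=True)
)

# Plan a rechunk of the main streamflow array so each chunk covers a long time span
# over a modest number of reaches (placeholder sizes)
plan = rechunk(
    src["streamflow"],
    target_chunks=(672, 30000),   # (time, feature_id)
    max_mem="6GB",                # stay under the 8 GB per-worker limit
    target_store=fsspec.get_mapper("s3://<bucket>/<prefix>/output.zarr"),
    temp_store=fsspec.get_mapper("s3://<bucket>/<prefix>/temp.zarr"),
)
plan.execute()                    # runs across the Dask workers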

Closes #119

Checklist

  • Ran nbautoexport export . in /opt/src/notebooks and committed the generated scripts. This is to make reviewing notebooks easier. (Note the export will happen automatically after saving notebooks from the Jupyter web app.)
  • Documentation updated if needed
  • PR has a name that won't get you publicly shamed for vagueness

Notes

This is built on top of the contents of #120 and was to be treated as a draft until that PR merged. It has since been rebased and is ready.

Testing Instructions

  • Start a workflow based on the run-dask-job.yaml workflow template (an example argo submit invocation is sketched after this list)
  • Point the script-location parameter at a version of rechunk-retro-data.py that sets the proper output URI (figuring out a clean way to pass arguments via the Argo UI is a task for the future)
  • Tune the scale of your cluster
  • Create the workflow
  • Monitor the logs for the Dask dashboard path, append that path to https://jupyter.noaa.azavea.com, and point a browser at the resulting URL to watch the job's progress
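Assuming the template defined in run-dask-job.yaml is registered as a WorkflowTemplate named run-dask-job and exposes the script-location parameter, the same workflow could also be created from the Argo CLI rather than the UI:

argo submit --from workflowtemplate/run-dask-job -p script-location=<URI-of-rechunk-retro-data.py>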

@jpolchlo jpolchlo force-pushed the workflow/dask-rechunk branch from 94ddeda to dbe318c on January 6, 2023 23:08
@jpolchlo jpolchlo requested review from rajadain and vlulla January 6, 2023 23:09
@jpolchlo (Collaborator, Author) commented Jan 6, 2023

This is ready for review, but I don't necessarily intend for you to actually run it. I was thinking that the review would involve checking the zarr file I created against the original, NOAA-provided zarr file. If there are benchmarks from the ESIP work that can simply be rerun by changing a couple of URIs, that would be a good idea.

@jpolchlo (Collaborator, Author) commented Feb 2, 2023

Pushing the go button here. I had wanted positive confirmation that the generated zarr shows the same speedup as our ESIP tests, but that shouldn't hold this up any longer. When we run that confirmation test, we can revisit the script contributed here if there are problems.

@jpolchlo jpolchlo merged commit 3c1bf70 into master Feb 2, 2023
@jpolchlo jpolchlo deleted the workflow/dask-rechunk branch February 2, 2023 22:30
@jpolchlo jpolchlo restored the workflow/dask-rechunk branch February 21, 2023 19:21
@jpolchlo jpolchlo deleted the workflow/dask-rechunk branch February 21, 2023 20:14
@jpolchlo jpolchlo restored the workflow/dask-rechunk branch June 6, 2023 18:37
@jpolchlo (Collaborator, Author) commented:

For posterity: This job required two r5.8xlarge and one r5.xlarge for two hours, which should have burned about $2.60 in additional compute costs at the time of execution on the spot market. I can't say how much additional S3 costs would have been triggered.

@rajadain (Collaborator) commented Jun 14, 2023

Also, we checked out the size of the generated dataset:

aws --profile=noaa s3 ls --human-readable --summarize --recursive s3://azavea-noaa-hydro-data/experiments/jp/rechunk/output.zarr

...

Total Objects: 132293
   Total Size: 638.6 GiB

@jpolchlo (Collaborator, Author) commented:

It's important to put the above in context:

aws s3 ls --recursive --summarize --human-readable s3://noaa-nwm-retrospective-2-1-zarr-pds/chrtout.zarr

...

Total Objects: 102330
   Total Size: 1.3 TiB

We don't yet have an explanation for the factor-of-two reduction in size, though an obvious thing to check is whether we inadvertently coerced the data type.
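One quick way to check for an accidental dtype coercion would be to compare per-variable dtypes (and compressors) between the two stores. A rough sketch, assuming anonymous access works for the public NOAA bucket and that credentials for our bucket are available in the environment:

import fsspec
import xarray as xr

orig = xr.open_zarr(
    fsspec.get_mapper("s3://noaa-nwm-retrospective-2-1-zarr-pds/chrtout.zarr", anon=True)
)
ours = xr.open_zarr(
    fsspec.get_mapper("s3://azavea-noaa-hydro-data/experiments/jp/rechunk/output.zarr")
)

# Report any variable whose dtype or compressor differs between the two stores
for name in orig.data_vars:
    if name in ours.data_vars:
        if orig[name].dtype != ours[name].dtype:
            print(name, "dtype:", orig[name].dtype, "->", ours[name].dtype)
        if orig[name].encoding.get("compressor") != ours[name].encoding.get("compressor"):
            print(name, "compressor differs")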
