
Workflow to rechunk NWM retrospective zarr data #122

Merged 1 commit into master from workflow/dask-rechunk on Feb 2, 2023

Conversation

@jpolchlo (Collaborator) commented Jan 6, 2023

Overview

We have wanted to rechunk the NWM retrospective data to be optimized for time-series queries over a limited spatial extent, which seems to be a more common use case. Previous studies done by Azavea have shown that this style of query is faster on a rechunked data set.

When we run this process on a single EC2 node, the job runs out of memory. This PR presents a solution based on a Dask cluster running in Kubernetes: using an Argo workflow, we can execute the included Python script (rechunk-retro-data.py) on an arbitrarily-sized cluster to perform the rechunking operation. In my test, I used 48 workers with 8 GB of RAM. The result can be found at s3://azavea-noaa-hydro-data/experiments/jp/rechunk/output.zarr.
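For reference, a minimal sketch of what the rechunking step looks like. This is not the actual rechunk-retro-data.py: it assumes the rechunker library is used, and the chunk sizes, memory budget, scheduler address, and output/temp URIs are placeholders.

import fsspec
import zarr
from dask.distributed import Client
from rechunker import rechunk

# Connect to the Dask scheduler running in the Kubernetes-backed cluster
client = Client("<dask-scheduler-address>")

# Open the NOAA-provided retrospective zarr store (chunked to favor spatial queries)
src = zarr.open_consolidated(
    fsspec.get_mapper("s3://noaa-nwm-retrospective-2-1-zarr-pds/chrtout.zarr", anon=True)
)

# Plan a rechunk of the main streamflow array so each chunk covers a long time span
# over a modest number of reaches (placeholder sizes)
plan = rechunk(
    src["streamflow"],
    target_chunks=(672, 30000),   # (time, feature_id)
    max_mem="6GB",                # stay under the 8 GB per-worker limit
    target_store=fsspec.get_mapper("s3://<bucket>/<prefix>/output.zarr"),
    temp_store=fsspec.get_mapper("s3://<bucket>/<prefix>/temp.zarr"),
)
plan.execute()                    # runs across the Dask workers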

Closes #119

Checklist

  • Ran nbautoexport export . in /opt/src/notebooks and committed the generated scripts. This is to make reviewing notebooks easier. (Note the export will happen automatically after saving notebooks from the Jupyter web app.)
  • Documentation updated if needed
  • PR has a name that won't get you publicly shamed for vagueness

Notes

This is built on top of the contents of #120 and was to be treated as a draft until that PR merged. It has since been rebased and is ready.

Testing Instructions

  • Start a workflow based on the run-dask-job.yaml workflow template (an example argo submit invocation is sketched after this list)
  • Point the script-location parameter at a version of rechunk-retro-data.py that sets the proper output URI (figuring out a clean way to pass arguments via the Argo UI is a task for the future)
  • Tune the scale of your cluster
  • Create the workflow
  • Monitor the logs for the Dask dashboard path, append that path to https://jupyter.noaa.azavea.com, and point a browser at the resulting URL to watch the job's progress
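Assuming the template defined in run-dask-job.yaml is registered as a WorkflowTemplate named run-dask-job and exposes the script-location parameter, the same workflow could also be created from the Argo CLI rather than the UI:

argo submit --from workflowtemplate/run-dask-job -p script-location=<URI-of-rechunk-retro-data.py>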

@jpolchlo jpolchlo force-pushed the workflow/dask-rechunk branch from 94ddeda to dbe318c on January 6, 2023 23:08
@jpolchlo jpolchlo requested review from rajadain and vlulla January 6, 2023 23:09
@jpolchlo (Collaborator, Author) commented Jan 6, 2023

This is ready for review, but I don't necessarily intend for you to actually run it. I was thinking that the review would involve checking the zarr file I created against the original, NOAA-provided zarr file. If there are benchmarks from the ESIP work that can simply be rerun by changing a couple of URIs, that would be a good idea.

@jpolchlo (Collaborator, Author) commented Feb 2, 2023

Pushing the go button here. I had wanted positive confirmation that the generated zarr shows the same speedup as our ESIP tests, but that shouldn't hold this up any longer. When we run that confirmation test, we can revisit the script contributed here if there are problems.

@jpolchlo jpolchlo merged commit 3c1bf70 into master Feb 2, 2023
@jpolchlo jpolchlo deleted the workflow/dask-rechunk branch February 2, 2023 22:30
@jpolchlo jpolchlo restored the workflow/dask-rechunk branch February 21, 2023 19:21
@jpolchlo jpolchlo deleted the workflow/dask-rechunk branch February 21, 2023 20:14
@jpolchlo jpolchlo restored the workflow/dask-rechunk branch June 6, 2023 18:37
@jpolchlo (Collaborator, Author) commented:

For posterity: This job required two r5.8xlarge and one r5.xlarge for two hours, which should have burned about $2.60 in additional compute costs at the time of execution on the spot market. I can't say how much additional S3 costs would have been triggered.

@rajadain (Collaborator) commented Jun 14, 2023

Also, we checked out the size of the generated dataset:

aws --profile=noaa s3 ls --human-readable --summarize --recursive s3://azavea-noaa-hydro-data/experiments/jp/rechunk/output.zarr

...

Total Objects: 132293
   Total Size: 638.6 GiB

@jpolchlo (Collaborator, Author) commented:

It's important to put the above in context:

aws s3 ls --recursive --summarize --human-readable s3://noaa-nwm-retrospective-2-1-zarr-pds/chrtout.zarr

...

Total Objects: 102330
   Total Size: 1.3 TiB

We don't yet have an explanation for the factor-of-two reduction in size, though an obvious thing to check is whether we inadvertently coerced the data type.
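One quick way to check for an accidental dtype coercion would be to compare per-variable dtypes (and compressors) between the two stores. A rough sketch, assuming anonymous access works for the public NOAA bucket and that credentials for our bucket are available in the environment:

import fsspec
import xarray as xr

orig = xr.open_zarr(
    fsspec.get_mapper("s3://noaa-nwm-retrospective-2-1-zarr-pds/chrtout.zarr", anon=True)
)
ours = xr.open_zarr(
    fsspec.get_mapper("s3://azavea-noaa-hydro-data/experiments/jp/rechunk/output.zarr")
)

# Report any variable whose dtype or compressor differs between the two stores
for name in orig.data_vars:
    if name in ours.data_vars:
        if orig[name].dtype != ours[name].dtype:
            print(name, "dtype:", orig[name].dtype, "->", ours[name].dtype)
        if orig[name].encoding.get("compressor") != ours[name].encoding.get("compressor"):
            print(name, "compressor differs")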
