Xarray GPU optimization #771
Conversation
Thanks for writing this up!
Co-authored-by: Tom Augspurger <[email protected]>
- name: Katelyn Fitzgerald
  github: kafitzgerald

summary: 'How to accelerate AI/ML workflows in Earth Sciences with GPU-native Xarray and Zarr.'
Can we make this more direct? "X% speedup" or "X MB/s throughput"?
src/posts/gpu-pipeline/index.md
Outdated
(TODO ongoing work) Eventually with this [cupy-xarray Pull Request merged](https://github.com/xarray-contrib/cupy-xarray/pull/70) (based on earlier work at https://xarray.dev/blog/xarray-kvikio), this can be simplified to:

```python
import cupy as cp
import xarray as xr

import cupy_xarray  # provides the "kvikio" backend engine used below

ds = xr.open_dataset(filename_or_obj="/tmp/air-temp.zarr", engine="kvikio")
assert isinstance(ds.air.data, cp.ndarray)
```
This could go in a future work section at the end
Yeah, I'm not sure if this API is feasible or even desirable (have tried to implement this in xarray-contrib/cupy-xarray#70, but no luck yet patching the buffer protocol). So ok to move this towards the end.
src/posts/gpu-pipeline/index.md
Outdated
[figure: profiling screenshot]

(TODO insert better nsight profiling figure than above showing overlapping CPU and GPU compute)
that would be really nice!
src/posts/gpu-pipeline/index.md
Outdated
- Consider using GPU Direct Storage (GDS) for optimal performance, but be aware of the setup and configuration required.
- GPU Direct Storage (GDS) can be an improvement for data-intensive workflows, but requires some setup and configuration.
- NVIDIA DALI is a powerful tool for optimizing data loading, but requires some effort to integrate into existing workflows.
- GPU-based decompression is a promising area for future work, but requires further development and testing.
Icechunk!
@@ -0,0 +1,223 @@
---
title: 'Accelerating AI/ML Workflows in Earth Sciences with GPU-Native Xarray and Zarr (and more!)' |
Suggested change:
title: 'GPU-Native Earth Science AI/ML Workflows: Xarray, Zarr, DALI, and nvcomp'
better SEO this way?
src/posts/gpu-pipeline/index.md
Outdated
- GPU Direct Storage (GDS) for optimal performance
- NVIDIA DALI
- Work out how to use GDS when reading from cloud object store instead of on-prem disk.
- etc
Want to shout out that reading/writing Zarr shards with GPU buffers (thanks @maxrjones and @TomAugspurger!) at zarr-developers/zarr-python#2978 was just merged, and could go in here or somewhere above, depending on when this blog post gets published.
Co-authored-by: Deepak Cherian <[email protected]>
This is so great, thank you @negin513!!!
/>
</div>

We further quantified this bottleneck by comparing data loading and training throughput, as shown in the figure below:
Suggested change:
We further quantified this bottleneck by comparing data loading and training throughput, as shown in the figure below (higher bars/more throughput is better):
In the plot above, the three bars represent:

- Baseline: Baseline throughput of the end-to-end pipeline using real data.
Suggested change:
- Training (real data): Baseline throughput of the end-to-end pipeline using real data.
src/posts/gpu-pipeline/index.md
Outdated
1. **Optimized Chunking & Compression**
   - We explored different chunking and compression strategies to optimize the data loading performance. We found that using Zarr v3 with optimized chunking and compression significantly improved the data loading performance.
2. **GPU-native data loading with Zarr v3 and KvikIO**
   - Leveraging Zarr v3's support for reading data directly into GPU memory using CuPy arrays, we utilized [KvikIO](https://docs.rapids.ai/api/kvikio/stable/) to bypass CPU memory, enabling direct data transfer from storage to GPU.
Suggested change:
- Leveraging Zarr Python 3's support for reading data directly into GPU memory using CuPy arrays, we utilized [KvikIO](https://docs.rapids.ai/api/kvikio/stable/) to bypass CPU memory, enabling direct data transfer from storage to GPU.
src/posts/gpu-pipeline/index.md
Outdated
1. **Optimized Chunking & Compression**
   - We explored different chunking and compression strategies to optimize the data loading performance. We found that using Zarr v3 with optimized chunking and compression significantly improved the data loading performance.
2. **GPU-native data loading with Zarr v3 and KvikIO**
Suggested change:
2. **GPU-native data loading with Zarr Python 3 and KvikIO**
### Step 1: Optimized chunking & Compression

The ERA-5 dataset we were using had a sub-optimal chunking scheme of `{'time': 10, 'channel': C, 'height': H, 'width': W}`, which meant that a minimum of 10 timesteps of data was being read even if we only needed 2 consecutive timesteps at a time.
We decided to rechunk the data to align with our access pattern of 1-timestep at a time, while reformatting to Zarr v3.
Suggested change:
We decided to rechunk the data to align with our access pattern of 1-timestep at a time, while reformatting to Zarr format 3.
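For illustration, a rechunk-and-rewrite step like the one described here looks roughly like the sketch below. The store paths (`era5.zarr`, `era5_rechunked.zarr`) and the `{'time': 1}` chunk spec are placeholders, not the exact values used in the post.

```python
import xarray as xr

# Open the existing store lazily (dask-backed) so it can be rechunked.
ds = xr.open_dataset("era5.zarr", engine="zarr", chunks={})

# One time step per chunk, matching the 1-timestep access pattern.
ds_rechunked = ds.chunk({"time": 1})

# Drop the chunk encoding inherited from the source store so the new
# dask chunking defines the on-disk chunks of the rewritten store.
for var in ds_rechunked.variables.values():
    var.encoding.pop("chunks", None)

# Rewrite as Zarr format 3.
ds_rechunked.to_zarr(
    "era5_rechunked.zarr", mode="w", zarr_format=3, consolidated=False
)
```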
airt.to_zarr(store="/tmp/air-temp.zarr", mode="w", zarr_format=3, consolidated=False)

with zarr.config.enable_gpu():
    ds = xr.open_dataset("/tmp/air-temp.zarr", engine="zarr", consolidated=False)
Can you explain why not to use consolidated metadata for direct to GPU reading? I am surprised about this because I would expect it to improve performance.
Consolidated metadata wasn't supported with GDS/kvikio, at least when I tested it 3 years ago in xarray-contrib/cupy-xarray#10 (comment). Maybe it works with Zarr v3, but we don't have a GDS device to verify on 🙂
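For reference, a minimal, self-contained sketch of the direct-to-GPU read path being discussed (the store path and variable name are illustrative; it assumes zarr-python 3 with CuPy installed):

```python
import cupy as cp
import xarray as xr
import zarr

# With the GPU config enabled, zarr-python 3 returns CuPy-backed buffers,
# so the xarray variables come back as cupy.ndarray instead of numpy.ndarray.
with zarr.config.enable_gpu():
    ds = xr.open_dataset(
        "/tmp/air-temp.zarr",
        engine="zarr",
        consolidated=False,
    )
    air = ds["air"].isel(time=0).load()  # this slice is loaded into GPU memory
    assert isinstance(air.data, cp.ndarray)
```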
src/posts/gpu-pipeline/index.md
Outdated
Eventually with this [cupy-xarray Pull Request merged](https://github.com/xarray-contrib/cupy-xarray/pull/70) (based on earlier work at https://xarray.dev/blog/xarray-kvikio), this can be simplified to:
Suggested change:
Eventually with this [cupy-xarray Pull Request merged](https://github.com/xarray-contrib/cupy-xarray/pull/70) (based on earlier work at [https://xarray.dev/blog/xarray-kvikio](https://xarray.dev/blog/xarray-kvikio)), this can be simplified to:
The figure above shows a benchmark comparing CPU vs GPU-based decompression, with or without GDS enabled, using [the data reading benchmark here](https://github.com/pangeo-data/ncar-hackathon-xarray-on-gpus/blob/main/benchmark/era5_zarr_benchmark.py).

[figure]
do you know why it's slower with GDS?
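For context, the kind of comparison run by the linked benchmark script can be sketched roughly as follows. This is an illustrative timing harness, not the actual `era5_zarr_benchmark.py`; the store path is a placeholder.

```python
import time

import xarray as xr
import zarr

STORE = "/tmp/air-temp.zarr"  # illustrative path

def time_read(gpu: bool, n_repeats: int = 5) -> float:
    """Time reading the full 'air' variable, either onto the CPU or directly to GPU."""
    times = []
    for _ in range(n_repeats):
        start = time.perf_counter()
        if gpu:
            with zarr.config.enable_gpu():
                xr.open_dataset(STORE, engine="zarr", consolidated=False)["air"].load()
        else:
            xr.open_dataset(STORE, engine="zarr", consolidated=False)["air"].load()
        times.append(time.perf_counter() - start)
    return min(times)

print(f"CPU read: {time_read(gpu=False):.3f} s")
print(f"GPU read: {time_read(gpu=True):.3f} s")
```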
Next, check out the [end-to-end example](https://github.com/pangeo-data/ncar-hackathon-xarray-on-gpus/tree/main/zarr_ML_optimization) directory, where we show how to integrate the DALI pipeline into a PyTorch DataLoader and training loop. This example demonstrates how to use DALI to load data from Zarr stores, preprocess it on the GPU, and feed it into a PyTorch model for training.

Profiling results from the DALI pipeline demonstrate effective overlap between CPU and GPU workloads, significantly reducing GPU idle time (blue) and increasing overall training throughput:
the GPU activity is blue, right? rather than the idle time
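To illustrate the DataLoader-style integration mentioned above, here is a simplified, self-contained sketch using DALI's PyTorch iterator. The real pipeline lives in the linked zarr_ML_optimization example; this stand-in feeds random NumPy batches instead of Zarr data and the sizes are arbitrary.

```python
import numpy as np
from nvidia.dali import fn, pipeline_def
from nvidia.dali.plugin.pytorch import DALIGenericIterator

BATCH_SIZE = 4

@pipeline_def(batch_size=BATCH_SIZE, num_threads=2, device_id=0)
def training_pipeline():
    # Stand-in for the Zarr-backed external source used in the real example:
    # each call returns one batch of (channel, height, width) samples.
    batch = fn.external_source(
        source=lambda: [np.random.rand(3, 64, 64).astype(np.float32) for _ in range(BATCH_SIZE)]
    )
    return batch.gpu()  # move the batch to the GPU inside the pipeline

pipe = training_pipeline()
pipe.build()

# DALIGenericIterator yields dicts of torch tensors that already live on the GPU,
# so it can stand in for a torch.utils.data.DataLoader in the training loop.
loader = DALIGenericIterator([pipe], output_map=["x"], size=4 * BATCH_SIZE)
for batch in loader:
    x = batch[0]["x"]  # torch.Tensor on the GPU
    # ... forward pass / loss / optimizer step would go here
```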
src/posts/gpu-pipeline/index.md
Outdated
> - **Compression trade-offs**: Using compression can reduce the amount of data transferred, but can also increase the time spent on decompression. We found that using Zarr v3 with GPU-based decompression can significantly improve performance.
> - **NVIDIA DALI** is a powerful tool for optimizing data loading, but requires some effort to integrate into existing workflows.
> - **CuPy-Xarray integration** is still a work in progress, but can be very useful for GPU-native workflows. Please see this PR for more details: [xarray-contrib/cupy-xarray#70](https://github.com/xarray-contrib/cupy-xarray/pull/70).
> - **GPU-native decompression** is a promising area for future work, but full support (e.g. GPU-side Zstd decompression) requires further development and testing.
Suggested change:
> - **Compression trade-offs**: Using compression can reduce the amount of data transferred, but can also increase the time spent on decompression. We found that using Zarr v3 with GPU-based decompression can significantly improve performance.
> - **GPU-native decompression** is a promising area for future work, but full support (e.g. GPU-side Zstd decompression) requires further development and testing.
> - **NVIDIA DALI** is a powerful tool for optimizing data loading, but requires some effort to integrate into existing workflows.
> - **CuPy-Xarray integration** is still a work in progress, but can be very useful for GPU-native workflows. Please see this PR for more details: [xarray-contrib/cupy-xarray#70](https://github.com/xarray-contrib/cupy-xarray/pull/70).
> - **NVIDIA Nsight** provides a [powerful tool](https://developer.nvidia.com/nsight-systems) for identifying bottlenecks.
Awesome work, this is coming along really nicely already! Just some minor nitpicks, but hope that we can publish this next month!
- GPU Direct Storage (GDS) for optimal performance
- NVIDIA DALI
- Support for sharded Zarr with GPU-friendly access patterns [already merged in Zarr v3]().
Suggested change:
- Support for sharded Zarr with GPU-friendly access patterns already [merged](https://github.com/zarr-developers/zarr-python/pull/2978) in Zarr v3.0.8.
  github: weiji14
- name: Max Jones
  github: maxrjones
- name: Akshay Subranian
Suggested change:
- name: Akshay Subramaniam
ML pipelines for large scientific datasets typically include steps:

- Reading raw data from disk or object storage (often CPU-bound)
Suggested change:
- Reading raw data from disk or object storage (often IO-bound)
- Transforming / preprocessing data (often CPU-bound)
- Model Training/Inference (often GPU-bound)

Although GPU compute is incredibly fast, the CPU can become a bottleneck when dealing with large datasets. In an ideal scenario, we want to saturate the GPU with data as quickly as possible to minimize idle time on both the CPU and GPU.
Suggested change:
Although GPU compute is incredibly fast, IO and CPU bottlenecks can be a pain when dealing with large datasets. In an ideal scenario, we want to saturate the GPU with data as quickly as possible to minimize idle time on both the CPU and GPU.
This will read the data directly from the Zarr store to GPU memory, significantly reducing I/O latency, especially for large datasets.
However, it relies on the [NVIDIA GPUDirect Storage (GDS)](https://docs.nvidia.com/datacenter/pgp/gds/index.html) feature to be enabled and correctly configured on your system.

**Note**: Even with GDS, the decompression step is still occurs on the CPU (see next section for GPU solutions!). This means that the data is still being decompressed on the CPU before being transferred to the GPU. However, this is still a significant improvement over the previous method, as it reduces the amount of data that needs to be transferred over the PCIe bus. In the figure below, we show the flowchart of the data loading process with GDS enabled (i.e. using `kvikio`):
Suggested change:
**Note**: Even with GDS, the decompression step will still occur on the CPU (see next section for GPU solutions!). This means that the data is still being decompressed on the CPU before being transferred to the GPU. However, this is still a significant improvement over the previous method, as it reduces the amount of data that needs to be transferred over the PCIe bus. In the figure below, we show the flowchart of the data loading process with GDS enabled (i.e. using `kvikio`):
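To make the GDS path a bit more concrete, here is a minimal sketch of a direct storage-to-GPU read using KvikIO's `CuFile` API. The file name and buffer size are illustrative; when GDS is not available, KvikIO falls back to a POSIX compatibility mode.

```python
import cupy as cp
import kvikio

# Allocate the destination buffer in GPU memory up front.
n = 1024 * 1024
gpu_buffer = cp.empty(n, dtype=cp.float32)

# Read raw bytes from disk straight into the GPU buffer.
# With GDS enabled this avoids a CPU bounce buffer entirely.
with kvikio.CuFile("/tmp/raw-chunk.bin", "r") as f:
    nbytes = f.read(gpu_buffer)

print(f"read {nbytes} bytes directly into GPU memory")
```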
For a fully GPU-native pipeline, the decompression step should also be done on the GPU. This is where [NVIDIA's nvCOMP](https://developer.nvidia.com/nvcomp) library comes in. nvCOMP provides fast, GPU-native implementations of popular compression algorithms like Zstandard (Zstd).

With nvCOMP, all steps of data loading including reading, decompressing, and transforming data can be done on the GPU, significantly reducing the time spent on data loading. Here is a flowchart of the data loading process with GDS and GPU-based decompression enabled:
Suggested change:
With nvCOMP, all steps of data loading including reading from disk, decompression, and transforming data can be done on the GPU, significantly reducing the time spent on data loading. Here is a flowchart of the data loading process with GDS and GPU-based decompression enabled:
> These results show that GPU-based decompression can significantly reduce the time spent on data loading and cut I/O latency from storage to device (less data transfer over PCIe/NVLink). This is especially useful for large datasets, as it allows for faster data loading and processing.

Keep an eye on this space, as we are working on integrating this into the Zarr ecosystem to enable GPU-based decompression for Zarr stores. This will allow for a fully GPU-native workflow, where all steps of data loading, including reading, decompressing, and transforming data, can be done on the GPU.
Suggested change:
Keep an eye on this space, as we are working on integrating this into the Zarr ecosystem to enable GPU-based decompression for Zarr stores. This will allow for a fully GPU-native workflow, where all steps of data loading, including reading, decompression, and transforming data, can be done on the GPU.
Thanks so much for putting this together!
Mostly just a few minor suggestions from my end beyond the existing comments / questions.
  github: akshaysubr
- name: Thomas Augspurger
  github: tomaugspurger
- name: Katelyn Fitzgerald
Suggested change:
- name: Katelyn FitzGerald
## TL;DR

Earth science AI/ML workflows are often bottlenecked by slow data loading, leaving GPUs underutilized while CPUs struggle to feed large climate datasets like ERA5. In this blog post, we discuss how to build a GPU-native pipeline using Zarr v3, CuPy, KvikIO, and NVIDIA DALI to accelerate data throughput. We walk through profiling results, chunking strategies, direct-to-GPU data reads, and GPU-accelerated preprocessing, all aimed at maximizing GPU usage and minimizing I/O overhead.
Suggested change:
Earth science AI/ML workflows are often limited by slow data loading, leaving GPUs underutilized while CPUs struggle to feed large climate datasets like ERA5. In this blog post, we discuss how to build a GPU-native pipeline using Zarr v3, CuPy, KvikIO, and NVIDIA DALI to accelerate data throughput. We walk through profiling results, chunking strategies, direct-to-GPU data reads, and GPU-accelerated preprocessing, all aimed at maximizing GPU usage and minimizing I/O overhead.
Not committed to this - just trying to vary the language a bit.
In large-scale geospatial AI and machine learning workflows, data loading is often the main bottleneck. Traditional pipelines rely on CPUs to preprocess and transfer massive datasets from storage to GPU memory, consuming resources and limiting scalability and effective use of GPU resources.

To tackle this issue, a team from the [National Center for Atmospheric Research (NSF-NCAR)](https://ncar.ucar.edu) and [Development Seed](https://developmentseed.org) with mentors from [NVIDIA](https://www.nvidia.com) participated in the [OpenHackathon](https://www.openhackathons.org/s/) to demonstrate how AI/ML workflows in Earth system sciences can benefit from GPU-native workflows using tools such as [Zarr](https://zarr.readthedocs.io/), [KvikIO](https://docs.rapids.ai/api/kvikio/stable/), and [DALI](https://developer.nvidia.com/dali).
Suggested change:
To tackle this issue, a team from the [NSF National Center for Atmospheric Research (NSF NCAR)](https://ncar.ucar.edu) and [Development Seed](https://developmentseed.org) with mentors from [NVIDIA](https://www.nvidia.com) participated in an [Open Hackathon](https://www.openhackathons.org/s/) to demonstrate how Earth system science AI/ML workflows can benefit from GPU-native workflows using tools such as [Zarr](https://zarr.readthedocs.io/), [KvikIO](https://docs.rapids.ai/api/kvikio/stable/), and [DALI](https://developer.nvidia.com/dali).
## Problem

ML pipelines for large scientific datasets typically include steps:
Suggested change:
ML pipelines for large scientific datasets typically include the following steps:
- Reading raw data from disk or object storage (often CPU-bound)
- Transforming / preprocessing data (often CPU-bound)
- Model Training/Inference (often GPU-bound)
Suggested change:
- Model training / inference (often GPU-bound)
### Step 1: Optimized Chunking & Compression

The ERA-5 dataset we were using had a sub-optimal chunking scheme of `{'time': 10, 'channel': C, 'height': H, 'width': W}`, which meant that a minimum of 10 timesteps of data was being read even if we only needed 2 consecutive timesteps at a time.
Suggested change:
The copy of the ERA5 dataset we were using initially had a suboptimal chunking scheme of `{'time': 10, 'channel': C, 'height': H, 'width': W}`, which meant that a minimum of 10 time steps of data was being read even if we only needed 2 consecutive time steps.
This will read the data directly from the Zarr store to GPU memory, significantly reducing I/O latency, especially for large datasets.
However, it relies on the [NVIDIA GPUDirect Storage (GDS)](https://docs.nvidia.com/datacenter/pgp/gds/index.html) feature to be enabled and correctly configured on your system.
Suggested change:
However, it relies on the [NVIDIA GPUDirect Storage (GDS)](https://docs.nvidia.com/datacenter/pgp/gds/index.html) feature being enabled and correctly configured on your system.
To address this inefficiency, we adopted [NVIDIA DALI (Data Loading Library)](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html), which provides a flexible, GPU-accelerated data pipeline with built-in support for asynchronous execution across CPU and GPU stages. DALI helps reduce CPU pressure, enables concurrent preprocessing, and increases training throughput by pipelining operations.

First, we began with a minimal example in [zarr_DALI directory](https://github.com/pangeo-data/ncar-hackathon-xarray-on-gpus/tree/main/zarr_DALI) with short, contained examples of a DALI pipeline loading directly from Zarr stores. This example shows how to build a custom DALI `pipeline` that uses an `ExternalSource` operator to load batched image data from a Zarr store and transfer them directly to GPU memory using CuPy arrays.
Suggested change:
First, we began with a minimal example in the [zarr_DALI directory](https://github.com/pangeo-data/ncar-hackathon-xarray-on-gpus/tree/main/zarr_DALI) with short, contained examples of a DALI pipeline loading directly from Zarr stores. This example shows how to build a custom DALI `pipeline` that uses an `ExternalSource` operator to load batched image data from a Zarr store and transfer them directly to GPU memory using CuPy arrays.
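For readers who want a feel for that `ExternalSource` pattern without opening the repository, here is a rough, self-contained sketch. The store path, variable name, and batch size are assumptions, and the actual zarr_DALI example returns CuPy arrays directly; this simplified version reads NumPy samples and moves them to the GPU with `.gpu()`.

```python
import numpy as np
import zarr
from nvidia.dali import fn, pipeline_def

STORE = "/tmp/air-temp.zarr"  # illustrative path
BATCH_SIZE = 8

class ZarrSource:
    """Callable source: returns one sample (a single time step) per call."""

    def __init__(self, store, variable="air"):
        self.array = zarr.open(store, mode="r")[variable]
        self.num_samples = self.array.shape[0]

    def __call__(self, sample_info):
        t = sample_info.idx_in_epoch % self.num_samples
        return self.array[t].astype(np.float32)

@pipeline_def(batch_size=BATCH_SIZE, num_threads=4, device_id=0)
def zarr_pipeline():
    # external_source pulls samples from the Zarr store on the CPU side;
    # .gpu() hands them to the GPU portion of the DALI graph.
    sample = fn.external_source(source=ZarrSource(STORE), batch=False)
    return sample.gpu()

pipe = zarr_pipeline()
pipe.build()
(gpu_batch,) = pipe.run()  # a TensorListGPU living in GPU memory
```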
- GPU Direct Storage (GDS) for optimal performance
- NVIDIA DALI
- Support for sharded Zarr with GPU-friendly access patterns [already merged in Zarr v3]().
- Work out how to use GDS when reading from cloud object store instead of on-prem disk
Suggested change:
- Using GDS when reading from cloud object storage instead of on-prem disk storage
## Acknowledgements 🙌

This work was developed during the [NCAR/NOAA Open Hackathon](https://www.openhackathons.org/s/siteevent/a0CUP00000rwYYZ2A2/se000355) in Golden, Colorado from 18-27 February 2025. We would like to thank the OpenACC Hackathon for the opportunity to participate and learn from this experience. Special thanks to NCAR for providing access to NCAR's Derecho supercomputer which we used for this project. Thanks also to the open-source communities behind [Xarray](https://github.com/pydata/xarray), [Zarr](https://github.com/zarr-developers/zarr-python), [CuPy](https://github.com/cupy/cupy), [KvikIO](https://github.com/rapidsai/kvikio), and [DALI](https://github.com/NVIDIA/DALI).
Suggested change:
This work was developed during the [NCAR/NOAA Open Hackathon](https://www.openhackathons.org/s/siteevent/a0CUP00000rwYYZ2A2/se000355) in Golden, Colorado from 18-27 February 2025. We would like to thank the OpenACC Hackathon for the opportunity to participate and learn from this experience. Special thanks to NSF NCAR for providing access to their Derecho supercomputer which we used for this project. Thanks also to the open-source communities behind [Xarray](https://github.com/pydata/xarray), [Zarr](https://github.com/zarr-developers/zarr-python), [CuPy](https://github.com/cupy/cupy), [KvikIO](https://github.com/rapidsai/kvikio), and [DALI](https://github.com/NVIDIA/DALI).
Contributors: @negin513, @weiji14, @TomAugspurger, @maxrjones, @akshaysubr, @kafitzgerald