
Xarray GPU optimization #771


Open · wants to merge 43 commits into main

Conversation

@negin513 (Contributor) commented May 1, 2025:

Contributors: @negin513, @weiji14, @TomAugspurger, @maxrjones, @akshaysubr, @kafitzgerald


@TomAugspurger left a comment:

Thanks for writing this up!

- name: Katelyn Fitzgerald
github: kafitzgerald

summary: 'How to accelerate AI/ML workflows in Earth Sciences with GPU-native Xarray and Zarr.'
Contributor:

Can we make this more direct? "X% speedup" or "XMBps throughput"?

Comment on lines 158 to 165
(TODO ongoing work) Eventually with this [cupy-xarray Pull Request merged](https://github.com/xarray-contrib/cupy-xarray/pull/70) (based on earlier work at https://xarray.dev/blog/xarray-kvikio), this can be simplified to:

```python
import cupy_xarray

ds = xr.open_dataset(filename_or_obj="/tmp/air-temp.zarr", engine="kvikio")
assert isinstance(ds.air.data, cp.ndarray)
```
Contributor:

This could go in a future work section at the end

Member:

Yeah, I'm not sure if this API is feasible or even desirable (have tried to implement this in xarray-contrib/cupy-xarray#70, but no luck yet patching the buffer protocol). So ok to move this towards the end.


![image](https://hackmd.io/_uploads/H1YVp6tR1l.png)

(TODO insert better nsight profiling figure than above showing overlapping CPU and GPU compute)
Contributor:

that would be really nice!
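
One way to make that CPU/GPU overlap show up clearly in Nsight Systems is to wrap the pipeline stages in NVTX ranges. A minimal sketch using PyTorch's built-in NVTX bindings (the function and range names here are illustrative, not code from the post):

```python
import torch

# Label pipeline stages with NVTX ranges so they appear as named regions
# on the Nsight Systems timeline (e.g. when run under `nsys profile python train.py`).
def training_step(model, batch, optimizer):
    torch.cuda.nvtx.range_push("host_to_device_copy")
    batch = batch.to("cuda", non_blocking=True)
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("forward_backward")
    loss = model(batch).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    torch.cuda.nvtx.range_pop()
    return loss
```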

- Consider using GPU Direct Storage (GDS) for optimal performance, but be aware of the setup and configuration required.
- GPU Direct Storage (GDS) can be an improvement for data-intensive workflows, but requires some setup and configuration.
- NVIDIA DALI is a powerful tool for optimizing data loading, but requires some effort to integrate into existing workflows.
- GPU-based decompression is a promising area for future work, but requires further development and testing.
Contributor:

Icechunk!

---
title: 'Accelerating AI/ML Workflows in Earth Sciences with GPU-Native Xarray and Zarr (and more!)'
@dcherian (Contributor) commented May 8, 2025:

Suggested change
title: 'Accelerating AI/ML Workflows in Earth Sciences with GPU-Native Xarray and Zarr (and more!)'
title: 'GPU-Native Earth Science AI/ML Workflows with Xarray, Zarr, DALI, and nvcomp'

better SEO this way?

- GPU Direct Storage (GDS) for optimal performance
- NVIDIA DALI
- Work out how to use GDS when reading from cloud object store instead of on-prem disk.
- etc
Member:

Want to shout out that reading/writing Zarr shards with GPU buffers (thanks @maxrjones and @TomAugspurger!) at zarr-developers/zarr-python#2978 was just merged, and could go in here or somewhere above, depending on when this blog post gets published.
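
For anyone who wants to try that out, a minimal sketch of what writing and reading sharded Zarr with GPU buffers could look like (assumes zarr-python ≥ 3.0.8 with the `shards=` argument to `zarr.create_array`; the store path, shape, and chunk/shard sizes below are made up for the example):

```python
import cupy as cp
import zarr

# Write a sharded Zarr array from GPU (CuPy) buffers, then read a slice back.
with zarr.config.enable_gpu():
    arr = zarr.create_array(
        store="/tmp/sharded-air-temp.zarr",
        shape=(64, 721, 1440),
        chunks=(1, 721, 1440),   # one time step per chunk
        shards=(16, 721, 1440),  # pack 16 chunks into each shard object
        dtype="float32",
        overwrite=True,
    )
    arr[:] = cp.random.random((64, 721, 1440), dtype=cp.float32)
    first_step = arr[0]  # returned as a CuPy array under the GPU config
```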


@maxrjones (Member) left a comment:

This is so great, thank you @negin513!!!


We further quantified this bottleneck by comparing data loading and training throughput, as shown in the figure below:
Member:

Suggested change
We further quantified this bottleneck by comparing data loading and training throughput, as shown in the figure below:
We further quantified this bottleneck by comparing data loading and training throughput, as shown in the figure below (higher bars/more throughput is better):


In the plot above, the three bars represent:

- Baseline: Baseline throughput of the end-to-end pipeline using real data.
Member:

Suggested change
- Baseline: Baseline throughput of the end-to-end pipeline using real data.
- Training (real data): Baseline throughput of the end-to-end pipeline using real data.
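
For readers who want to reproduce this kind of comparison, a rough, illustrative helper for measuring data-loading throughput in samples per second (not the benchmark code behind the figure):

```python
import time

def samples_per_second(loader, num_batches=50):
    """Rough throughput estimate over the first `num_batches` of an iterable data loader."""
    n_samples = 0
    start = time.perf_counter()
    for i, batch in enumerate(loader):
        n_samples += len(batch)  # assumes each batch has a meaningful len()
        if i + 1 >= num_batches:
            break
    return n_samples / (time.perf_counter() - start)
```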

1. **Optimized Chunking & Compression**
- We explored different chunking and compression strategies to optimize the data loading performance. We found that using Zarr v3 with optimized chunking and compression significantly improved the data loading performance.
2. **GPU-native data loading with Zarr v3 and KvikIO**
- Leveraging Zarr v3's support for reading data directly into GPU memory using CuPy arrays, we utilized [KvikIO](https://docs.rapids.ai/api/kvikio/stable/) to bypass CPU memory, enabling direct data transfer from storage to GPU.
Member:

Suggested change
- Leveraging Zarr v3's support for reading data directly into GPU memory using CuPy arrays, we utilized [KvikIO](https://docs.rapids.ai/api/kvikio/stable/) to bypass CPU memory, enabling direct data transfer from storage to GPU.
- Leveraging Zarr Python 3's support for reading data directly into GPU memory using CuPy arrays, we utilized [KvikIO](https://docs.rapids.ai/api/kvikio/stable/) to bypass CPU memory, enabling direct data transfer from storage to GPU.


1. **Optimized Chunking & Compression**
- We explored different chunking and compression strategies to optimize the data loading performance. We found that using Zarr v3 with optimized chunking and compression significantly improved the data loading performance.
2. **GPU-native data loading with Zarr v3 and KvikIO**
Member:

Suggested change
2. **GPU-native data loading with Zarr v3 and KvikIO**
2. **GPU-native data loading with Zarr Python 3 and KvikIO**

### Step 1: Optimized chunking & Compression

The ERA-5 dataset we were using had a sub-optimal chunking scheme of `{'time': 10, 'channel': C, 'height': H, 'width': W}`, which meant that a minimum of 10 timesteps of data was being read even if we only needed 2 consecutive timesteps at a time.
We decided to rechunk the data to align with our access pattern of 1-timestep at a time, while reformatting to Zarr v3.
Member:

Suggested change
We decided to rechunk the data to align with our access pattern of 1-timestep at a time, while reformatting to Zarr v3.
We decided to rechunk the data to align with our access pattern of 1-timestep at a time, while reformatting to Zarr format 3.

airt.to_zarr(store="/tmp/air-temp.zarr", mode="w", zarr_format=3, consolidated=False)

with zarr.config.enable_gpu():
    ds = xr.open_dataset("/tmp/air-temp.zarr", engine="zarr", consolidated=False)
Member:

Can you explain why not to use consolidated metadata for direct to GPU reading? I am surprised about this because I would expect it to improve performance.

Member:

Consolidated metadata wasn't supported with GDS/kvikio, at least when I tested it 3 years ago in xarray-contrib/cupy-xarray#10 (comment). Maybe it works with Zarr v3, but we don't have a GDS device to verify on 🙂
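
Related to that: a quick way to check whether GDS is actually being used on a given machine is to query KvikIO's runtime settings. A sketch, assuming the installed KvikIO exposes `kvikio.defaults.compat_mode()` and `kvikio.DriverProperties`:

```python
import kvikio
import kvikio.defaults

# If compat mode is enabled, KvikIO falls back to POSIX I/O instead of GDS (cuFile).
print("KvikIO version:", kvikio.__version__)
print("compat mode:", kvikio.defaults.compat_mode())
print("GDS available:", kvikio.DriverProperties().is_gds_available)
```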


![Flowchart-technically decompression is still done on CPUs](/posts/gpu-pipline/flowchart_2.png)

Eventually with this [cupy-xarray Pull Request merged](https://github.com/xarray-contrib/cupy-xarray/pull/70) (based on earlier work at https://xarray.dev/blog/xarray-kvikio), this can be simplified to:
Member:

Suggested change
Eventually with this [cupy-xarray Pull Request merged](https://github.com/xarray-contrib/cupy-xarray/pull/70) (based on earlier work at https://xarray.dev/blog/xarray-kvikio), this can be simplified to:
Eventually with this [cupy-xarray Pull Request merged](https://github.com/xarray-contrib/cupy-xarray/pull/70) (based on earlier work at [https://xarray.dev/blog/xarray-kvikio](https://xarray.dev/blog/xarray-kvikio)), this can be simplified to:


The figure above shows a benchmark comparing CPU- vs GPU-based decompression, with and without GDS enabled, using [the data reading benchmark here](https://github.com/pangeo-data/ncar-hackathon-xarray-on-gpus/blob/main/benchmark/era5_zarr_benchmark.py).

![GPU native decompression](/posts/gpu-pipline/zstd_benchmark.png)
Member:

do you know why it's slower with GDS?


Next, check out the [end-to-end example](https://github.com/pangeo-data/ncar-hackathon-xarray-on-gpus/tree/main/zarr_ML_optimization) directory, where we show how to integrate the DALI pipeline into a PyTorch DataLoader and training loop. This example demonstrates how to use DALI to load data from Zarr stores, preprocess it on the GPU, and feed it into a PyTorch model for training.

Profiling results from the DALI pipeline demonstrate effective overlap between CPU and GPU workloads, significantly reducing GPU idle time (blue) and increasing overall training throughput:
Member:

the GPU activity is blue, right? rather than the idle time
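
For readers who don't follow the link, a condensed, illustrative version of that ExternalSource pattern (class and variable names are invented for the sketch, and the store path/variable reuse the toy air-temperature example from earlier in the post):

```python
import numpy as np
import zarr
from nvidia.dali import fn, pipeline_def, types

class ZarrBatchSource:
    """Toy callable returning batches of consecutive time steps from a Zarr array."""

    def __init__(self, path, variable, batch_size):
        self.array = zarr.open(path, mode="r")[variable]
        self.batch_size = batch_size

    def __call__(self, iteration_idx):
        start = (iteration_idx * self.batch_size) % self.array.shape[0]
        return [np.asarray(self.array[(start + i) % self.array.shape[0]], dtype=np.float32)
                for i in range(self.batch_size)]

@pipeline_def(batch_size=8, num_threads=4, device_id=0)
def zarr_pipeline(source):
    # ExternalSource pulls a CPU batch from the callable; .gpu() moves it to the device,
    # and any further DALI operators (here just a normalize) run on the GPU.
    data = fn.external_source(source=source, batch=True, dtype=types.FLOAT)
    return fn.normalize(data.gpu())

pipe = zarr_pipeline(source=ZarrBatchSource("/tmp/air-temp.zarr", "air", batch_size=8))
pipe.build()
(gpu_batch,) = pipe.run()
```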

Comment on lines 294 to 297
> - **Compression trade-offs**: Using compression can reduce the amount of data transferred, but can also increase the time spent on decompression. We found that using Zarr v3 with GPU-based decompression can significantly improve performance.
> - **NVIDIA DALI** is a powerful tool for optimizing data loading, but requires some effort to integrate into existing workflows.
> - **CuPy-Xarray integration** is still a work in progress, but can be very useful for GPU-native workflows. Please see this PR for more details: [xarray-contrib/cupy-xarray#70](https://github.com/xarray-contrib/cupy-xarray/pull/70).
> - **GPU-native decompression** is a promising area for future work, but full support (e.g. GPU-side Zstd decompression) requires further development and testing.
Member:

Suggested change
> - **Compression trade-offs**: Using compression can reduce the amount of data transferred, but can also increase the time spent on decompression. We found that using Zarr v3 with GPU-based decompression can significantly improve performance.
> - **NVIDIA DALI** is a powerful tool for optimizing data loading, but requires some effort to integrate into existing workflows.
> - **CuPy-Xarray integration** is still a work in progress, but can be very useful for GPU-native workflows. Please see this PR for more details: [xarray-contrib/cupy-xarray#70](https://github.com/xarray-contrib/cupy-xarray/pull/70).
> - **GPU-native decompression** is a promising area for future work, but full support (e.g. GPU-side Zstd decompression) requires further development and testing.
> - **Compression trade-offs**: Using compression can reduce the amount of data transferred, but can also increase the time spent on decompression. We found that using Zarr v3 with GPU-based decompression can significantly improve performance.
> - **GPU-native decompression** is a promising area for future work, but full support (e.g. GPU-side Zstd decompression) requires further development and testing.
> - **NVIDIA DALI** is a powerful tool for optimizing data loading, but requires some effort to integrate into existing workflows.
> - **CuPy-Xarray integration** is still a work in progress, but can be very useful for GPU-native workflows. Please see this PR for more details: [xarray-contrib/cupy-xarray#70](https://github.com/xarray-contrib/cupy-xarray/pull/70).
> - **NVIDIA Nsight** provides a [powerful tool](https://developer.nvidia.com/nsight-systems) for identifying bottlenecks.

@weiji14 (Member) left a comment:

Awesome work, this is coming along really nicely already! Just some minor nitpicks, but hope that we can publish this next month!


- GPU Direct Storage (GDS) for optimal performance
- NVIDIA DALI
- Support for sharded Zarr with GPU-friendly access patterns [already merged in Zarr v3]().
Member:

Suggested change
- Support for sharded Zarr with GPU-friendly access patterns [already merged in Zarr v3]().
- Support for sharded Zarr with GPU-friendly access patterns already [merged](https://github.com/zarr-developers/zarr-python/pull/2978) in Zarr v3.0.8.

github: weiji14
- name: Max Jones
github: maxrjones
- name: Akshay Subranian
Member:

Suggested change
- name: Akshay Subranian
- name: Akshay Subramaniam


ML pipelines for large scientific datasets typically include steps:

- Reading raw data from disk or object storage (often CPU-bound)
Member:

Suggested change
- Reading raw data from disk or object storage (often CPU-bound)
- Reading raw data from disk or object storage (often IO-bound)

- Transforming / preprocessing data (often CPU-bound)
- Model Training/Inference (often GPU-bound)

Although GPU compute is incredibly fast, the CPU can become a bottleneck when dealing with large datasets. In an ideal scenario, we want to saturate the GPU with data as quickly as possible to minimize idle time on both the CPU and GPU.
Member:

Suggested change
Although GPU compute is incredibly fast, the CPU can become a bottleneck when dealing with large datasets. In an ideal scenario, we want to saturate the GPU with data as quickly as possible to minimize idle time on both the CPU and GPU.
Although GPU compute is incredibly fast, IO and CPU bottlenecks can be a pain when dealing with large datasets. In an ideal scenario, we want to saturate the GPU with data as quickly as possible to minimize idle time on both the CPU and GPU.

This will read the data directly from the Zarr store to GPU memory, significantly reducing I/O latency, especially for large datasets.
However, it relies on the [NVIDIA GPUDirect Storage (GDS)](https://docs.nvidia.com/datacenter/pgp/gds/index.html) feature to be enabled and correctly configured on your system.

**Note**: Even with GDS, the decompression step is still occurs on the CPU (see next section for GPU solutions!). This means that the data is still being decompressed on the CPU before being transferred to the GPU. However, this is still a significant improvement over the previous method, as it reduces the amount of data that needs to be transferred over the PCIe bus. In the figure below, we show the flowchart of the data loading process with GDS enabled (i.e. using `kvikio`):
Member:

Suggested change
**Note**: Even with GDS, the decompression step is still occurs on the CPU (see next section for GPU solutions!). This means that the data is still being decompressed on the CPU before being transferred to the GPU. However, this is still a significant improvement over the previous method, as it reduces the amount of data that needs to be transferred over the PCIe bus. In the figure below, we show the flowchart of the data loading process with GDS enabled (i.e. using `kvikio`):
**Note**: Even with GDS, the decompression step will still occur on the CPU (see next section for GPU solutions!). This means that the data is still being decompressed on the CPU before being transferred to the GPU. However, this is still a significant improvement over the previous method, as it reduces the amount of data that needs to be transferred over the PCIe bus. In the figure below, we show the flowchart of the data loading process with GDS enabled (i.e. using `kvikio`):


For a fully GPU-native pipeline, the decompression step should also be done on the GPU. This is where [NVIDIA's nvCOMP](https://developer.nvidia.com/nvcomp) library comes in. nvCOMP provides fast, GPU-native implementations of popular compression algorithms like Zstandard (Zstd).

With nvCOMP, all steps of data loading including reading, decompressing, and transforming data can be done on the GPU, significantly reducing the time spent on data loading. Here is a flowchart of the data loading process with GDS and GPU-based decompression enabled:
Member:

Suggested change
With nvCOMP, all steps of data loading including reading, decompressing, and transforming data can be done on the GPU, significantly reducing the time spent on data loading. Here is a flowchart of the data loading process with GDS and GPU-based decompression enabled:
With nvCOMP, all steps of data loading including reading from disk, decompression, and transforming data can be done on the GPU, significantly reducing the time spent on data loading. Here is a flowchart of the data loading process with GDS and GPU-based decompression enabled:


> These results show that GPU-based decompression can significantly reduce the time spent on data loading and cut I/O latency from storage to device (less data transfer over PCIe/NVLink). This is especially useful for large datasets, as it allows for faster data loading and processing.

Keep an eye on this space, as we are working on integrating this into the Zarr ecosystem to enable GPU-based decompression for Zarr stores. This will allow for a fully GPU-native workflow, where all steps of data loading, including reading, decompressing, and transforming data, can be done on the GPU.
Member:

Suggested change
Keep an eye on this space, as we are working on integrating this into the Zarr ecosystem to enable GPU-based decompression for Zarr stores. This will allow for a fully GPU-native workflow, where all steps of data loading, including reading, decompressing, and transforming data, can be done on the GPU.
Keep an eye on this space, as we are working on integrating this into the Zarr ecosystem to enable GPU-based decompression for Zarr stores. This will allow for a fully GPU-native workflow, where all steps of data loading, including reading, decompression, and transforming data, can be done on the GPU.

@kafitzgerald left a comment:

Thanks so much for putting this together!

Mostly just a few minor suggestions from my end beyond the existing comments / questions.

github: akshaysubr
- name: Thomas Augspurger
github: tomaugspurger
- name: Katelyn Fitzgerald


Suggested change
- name: Katelyn Fitzgerald
- name: Katelyn FitzGerald


## TL;DR

Earth science AI/ML workflows are often bottlenecked by slow data loading, leaving GPUs underutilized while CPUs struggle to feed large climate datasets like ERA5. In this blog post, we discuss how to build a GPU-native pipeline using Zarr v3, CuPy, KvikIO, and NVIDIA DALI to accelerate data throughput. We walk through profiling results, chunking strategies, direct-to-GPU data reads, and GPU-accelerated preprocessing, all aimed at maximizing GPU usage and minimizing I/O overhead.


Suggested change
Earth science AI/ML workflows are often bottlenecked by slow data loading, leaving GPUs underutilized while CPUs struggle to feed large climate datasets like ERA5. In this blog post, we discuss how to build a GPU-native pipeline using Zarr v3, CuPy, KvikIO, and NVIDIA DALI to accelerate data throughput. We walk through profiling results, chunking strategies, direct-to-GPU data reads, and GPU-accelerated preprocessing, all aimed at maximizing GPU usage and minimizing I/O overhead.
Earth science AI/ML workflows are often limited by slow data loading, leaving GPUs underutilized while CPUs struggle to feed large climate datasets like ERA5. In this blog post, we discuss how to build a GPU-native pipeline using Zarr v3, CuPy, KvikIO, and NVIDIA DALI to accelerate data throughput. We walk through profiling results, chunking strategies, direct-to-GPU data reads, and GPU-accelerated preprocessing, all aimed at maximizing GPU usage and minimizing I/O overhead.

Not committed to this - just trying to vary the language a bit.


In large-scale geospatial AI and machine learning workflows, data loading is often the main bottleneck. Traditional pipelines rely on CPUs to preprocess and transfer massive datasets from storage to GPU memory, consuming resources and limiting scalability and effective use of GPU resources.

To tackle this issue, a team from the [National Center for Atmospheric Research (NSF-NCAR)](https://ncar.ucar.edu) and [Development Seed](https://developmentseed.org) with mentors from [NVIDIA](https://www.nvidia.com) participated in the [OpenHackathon](https://www.openhackathons.org/s/) to demonstrate how AI/ML workflows in Earth system sciences can benefit from GPU-native workflows using tools such as [Zarr](https://zarr.readthedocs.io/), [KvikIO](https://docs.rapids.ai/api/kvikio/stable/), and [DALI](https://developer.nvidia.com/dali).


Suggested change
To tackle this issue, a team from the [National Center for Atmospheric Research (NSF-NCAR)](https://ncar.ucar.edu) and [Development Seed](https://developmentseed.org) with mentors from [NVIDIA](https://www.nvidia.com) participated in the [OpenHackathon](https://www.openhackathons.org/s/) to demonstrate how AI/ML workflows in Earth system sciences can benefit from GPU-native workflows using tools such as [Zarr](https://zarr.readthedocs.io/), [KvikIO](https://docs.rapids.ai/api/kvikio/stable/), and [DALI](https://developer.nvidia.com/dali).
To tackle this issue, a team from the [NSF National Center for Atmospheric Research (NSF NCAR)](https://ncar.ucar.edu) and [Development Seed](https://developmentseed.org) with mentors from [NVIDIA](https://www.nvidia.com) participated in an [Open Hackathon](https://www.openhackathons.org/s/) to demonstrate how Earth system science AI/ML workflows can benefit from GPU-native workflows using tools such as [Zarr](https://zarr.readthedocs.io/), [KvikIO](https://docs.rapids.ai/api/kvikio/stable/), and [DALI](https://developer.nvidia.com/dali).


## Problem

ML pipelines for large scientific datasets typically include steps:


Suggested change
ML pipelines for large scientific datasets typically include steps:
ML pipelines for large scientific datasets typically include the following steps:


- Reading raw data from disk or object storage (often CPU-bound)
- Transforming / preprocessing data (often CPU-bound)
- Model Training/Inference (often GPU-bound)


Suggested change
- Model Training/Inference (often GPU-bound)
- Model training / inference (often GPU-bound)


### Step 1: Optimized Chunking & Compression

The ERA-5 dataset we were using had a sub-optimal chunking scheme of `{'time': 10, 'channel': C, 'height': H, 'width': W}`, which meant that a minimum of 10 timesteps of data was being read even if we only needed 2 consecutive timesteps at a time.


Suggested change
The ERA-5 dataset we were using had a sub-optimal chunking scheme of `{'time': 10, 'channel': C, 'height': H, 'width': W}`, which meant that a minimum of 10 timesteps of data was being read even if we only needed 2 consecutive timesteps at a time.
The copy of the ERA5 dataset we were using initially had a suboptimal chunking scheme of `{'time': 10, 'channel': C, 'height': H, 'width': W}`, which meant that a minimum of 10 time steps of data was being read even if we only needed 2 consecutive time steps.
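
As a concrete illustration of that rechunking step, a sketch based on the toy air-temperature store used elsewhere in the post (paths and variable names are illustrative, not the actual ERA5 pipeline code):

```python
import xarray as xr

# Open the original store lazily, rechunk to one time step per chunk,
# and rewrite it as Zarr format 3 to match the 1-time-step access pattern.
ds = xr.open_dataset("/tmp/air-temp.zarr", engine="zarr", chunks={})
ds = ds.chunk({"time": 1})
for var in ds.variables:
    ds[var].encoding.pop("chunks", None)  # drop stale chunk encoding from the source store
ds.to_zarr("/tmp/air-temp-rechunked.zarr", mode="w", zarr_format=3, consolidated=False)
```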


This will read the data directly from the Zarr store to GPU memory, significantly reducing I/O latency, especially for large datasets.
However, it relies on the [NVIDIA GPUDirect Storage (GDS)](https://docs.nvidia.com/datacenter/pgp/gds/index.html) feature to be enabled and correctly configured on your system.


Suggested change
However, it relies on the [NVIDIA GPUDirect Storage (GDS)](https://docs.nvidia.com/datacenter/pgp/gds/index.html) feature to be enabled and correctly configured on your system.
However, it relies on the [NVIDIA GPUDirect Storage (GDS)](https://docs.nvidia.com/datacenter/pgp/gds/index.html) feature being enabled and correctly configured on your system.


To address this inefficiency, we adopted [NVIDIA DALI (Data Loading Library)](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html), which provides a flexible, GPU-accelerated data pipeline with built-in support for asynchronous execution across CPU and GPU stages. DALI helps reduce CPU pressure, enables concurrent preprocessing, and increases training throughput by pipelining operations.

First, we began with a minimal example in [zarr_DALI directory](https://github.com/pangeo-data/ncar-hackathon-xarray-on-gpus/tree/main/zarr_DALI) with short, contained examples of a DALI pipeline loading directly from Zarr stores. This example shows how to build a custom DALI `pipeline` that uses an `ExternalSource` operator to load batched image data from a Zarr store and transfer them directly to GPU memory using CuPy arrays.


Suggested change
First, we began with a minimal example in [zarr_DALI directory](https://github.com/pangeo-data/ncar-hackathon-xarray-on-gpus/tree/main/zarr_DALI) with short, contained examples of a DALI pipeline loading directly from Zarr stores. This example shows how to build a custom DALI `pipeline` that uses an `ExternalSource` operator to load batched image data from a Zarr store and transfer them directly to GPU memory using CuPy arrays.
First, we began with a minimal example in the [zarr_DALI directory](https://github.com/pangeo-data/ncar-hackathon-xarray-on-gpus/tree/main/zarr_DALI) with short, contained examples of a DALI pipeline loading directly from Zarr stores. This example shows how to build a custom DALI `pipeline` that uses an `ExternalSource` operator to load batched image data from a Zarr store and transfer them directly to GPU memory using CuPy arrays.

- GPU Direct Storage (GDS) for optimal performance
- NVIDIA DALI
- Support for sharded Zarr with GPU-friendly access patterns [already merged in Zarr v3]().
- Work out how to use GDS when reading from cloud object store instead of on-prem disk


Suggested change
- Work out how to use GDS when reading from cloud object store instead of on-prem disk
- Using GDS when reading from cloud object storage instead of on-prem disk storage


## Acknowledgements 🙌

This work was developed during the [NCAR/NOAA Open Hackathon](https://www.openhackathons.org/s/siteevent/a0CUP00000rwYYZ2A2/se000355) in Golden, Colorado from 18-27 February 2025. We would like to thank the OpenACC Hackathon for the opportunity to participate and learn from this experience. Special thanks to NCAR for providing access to NCAR’s Derecho supercomputer which we used for this project. Thanks also to the open-source communities behind [Xarray](https://github.com/pydata/xarray), [Zarr](https://github.com/zarr-developers/zarr-python), [CuPy](https://github.com/cupy/cupy), [KvikIO](https://github.com/rapidsai/kvikio), and [DALI](https://github.com/NVIDIA/DALI).


Suggested change
This work was developed during the [NCAR/NOAA Open Hackathon](https://www.openhackathons.org/s/siteevent/a0CUP00000rwYYZ2A2/se000355) in Golden, Colorado from 18-27 February 2025. We would like to thank the OpenACC Hackathon for the opportunity to participate and learn from this experience. Special thanks to NCAR for providing access to NCAR’s Derecho supercomputer which we used for this project. Thanks also to the open-source communities behind [Xarray](https://github.com/pydata/xarray), [Zarr](https://github.com/zarr-developers/zarr-python), [CuPy](https://github.com/cupy/cupy), [KvikIO](https://github.com/rapidsai/kvikio), and [DALI](https://github.com/NVIDIA/DALI).
This work was developed during the [NCAR/NOAA Open Hackathon](https://www.openhackathons.org/s/siteevent/a0CUP00000rwYYZ2A2/se000355) in Golden, Colorado from 18-27 February 2025. We would like to thank the OpenACC Hackathon for the opportunity to participate and learn from this experience. Special thanks to NSF NCAR for providing access to their Derecho supercomputer which we used for this project. Thanks also to the open-source communities behind [Xarray](https://github.com/pydata/xarray), [Zarr](https://github.com/zarr-developers/zarr-python), [CuPy](https://github.com/cupy/cupy), [KvikIO](https://github.com/rapidsai/kvikio), and [DALI](https://github.com/NVIDIA/DALI).
