
Xarray GPU optimization #771


Open · wants to merge 43 commits into main

Conversation

@negin513 (Contributor) commented May 1, 2025:

Contributors: @negin513, @weiji14, @TomAugspurger, @maxrjones, @akshaysubr, @kafitzgerald


@TomAugspurger left a comment:

Thanks for writing this up!

- name: Katelyn Fitzgerald
github: kafitzgerald

summary: 'How to accelerate AI/ML workflows in Earth Sciences with GPU-native Xarray and Zarr.'
Contributor:

Can we make this more direct? "X% speedup" or "XMBps throughput"?

Comment on lines 158 to 165
(TODO ongoing work) Eventually with this [cupy-xarray Pull Request merged](https://github.com/xarray-contrib/cupy-xarray/pull/70) (based on earlier work at https://xarray.dev/blog/xarray-kvikio), this can be simplified to:

```python
import cupy_xarray

ds = xr.open_dataset(filename_or_obj="/tmp/air-temp.zarr", engine="kvikio")
assert isinstance(ds.air.data, cp.ndarray)
```
Contributor:

This could go in a future work section at the end

Member:

Yeah, I'm not sure if this API is feasible or even desirable (have tried to implement this in xarray-contrib/cupy-xarray#70, but no luck yet patching the buffer protocol). So ok to move this towards the end.


![image](https://hackmd.io/_uploads/H1YVp6tR1l.png)

(TODO insert better nsight profiling figure than above showing overlapping CPU and GPU compute)
Contributor:

that would be really nice!
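
One way to make that CPU/GPU overlap show up clearly in Nsight Systems is to wrap the pipeline stages in NVTX ranges. A minimal sketch using PyTorch's built-in NVTX bindings (the function and range names here are illustrative, not code from the post):

```python
import torch

# Label pipeline stages with NVTX ranges so they appear as named regions
# on the Nsight Systems timeline (e.g. when run under `nsys profile python train.py`).
def training_step(model, batch, optimizer):
    torch.cuda.nvtx.range_push("host_to_device_copy")
    batch = batch.to("cuda", non_blocking=True)
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("forward_backward")
    loss = model(batch).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    torch.cuda.nvtx.range_pop()
    return loss
```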

- Consider using GPU Direct Storage (GDS) for optimal performance, but be aware of the setup and configuration required.
- GPU Direct Storage (GDS) can be an improvement for data-intensive workflows, but requires some setup and configuration.
- NVIDIA DALI is a powerful tool for optimizing data loading, but requires some effort to integrate into existing workflows.
- GPU-based decompression is a promising area for future work, but requires further development and testing.
Contributor:

Icechunk!

---
title: 'Accelerating AI/ML Workflows in Earth Sciences with GPU-Native Xarray and Zarr (and more!)'
@dcherian (Contributor) commented May 8, 2025:

Suggested change
title: 'Accelerating AI/ML Workflows in Earth Sciences with GPU-Native Xarray and Zarr (and more!)'
title: 'GPU-Native Earth Science AI/ML Workflows with Xarray, Zarr, DALI, and nvcomp'

better SEO this way?

- GPU Direct Storage (GDS) for optimal performance
- NVIDIA DALI
- Work out how to use GDS when reading from cloud object store instead of on-prem disk.
- etc
Member:

Want to shout out that reading/writing Zarr shards with GPU buffers (thanks @maxrjones and @TomAugspurger!) at zarr-developers/zarr-python#2978 was just merged, and could go in here or somewhere above, depending on when this blog post gets published.
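
For anyone who wants to try that out, a minimal sketch of what writing and reading sharded Zarr with GPU buffers could look like (assumes zarr-python ≥ 3.0.8 with the `shards=` argument to `zarr.create_array`; the store path, shape, and chunk/shard sizes below are made up for the example):

```python
import cupy as cp
import zarr

# Write a sharded Zarr array from GPU (CuPy) buffers, then read a slice back.
with zarr.config.enable_gpu():
    arr = zarr.create_array(
        store="/tmp/sharded-air-temp.zarr",
        shape=(64, 721, 1440),
        chunks=(1, 721, 1440),   # one time step per chunk
        shards=(16, 721, 1440),  # pack 16 chunks into each shard object
        dtype="float32",
        overwrite=True,
    )
    arr[:] = cp.random.random((64, 721, 1440), dtype=cp.float32)
    first_step = arr[0]  # returned as a CuPy array under the GPU config
```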


@maxrjones (Member) left a comment:

This is so great, thank you @negin513!!!


We further quantified this bottleneck by comparing data loading and training throughput, as shown in the figure below:
Member:

Suggested change
We further quantified this bottleneck by comparing data loading and training throughput, as shown in the figure below:
We further quantified this bottleneck by comparing data loading and training throughput, as shown in the figure below (higher bars/more throughput is better):


In the plot above, the three bars represent:

- Baseline: Baseline throughput of the end-to-end pipeline using real data.
Member:

Suggested change
- Baseline: Baseline throughput of the end-to-end pipeline using real data.
- Training (real data): Baseline throughput of the end-to-end pipeline using real data.
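
For readers who want to reproduce this kind of comparison, a rough, illustrative helper for measuring data-loading throughput in samples per second (not the benchmark code behind the figure):

```python
import time

def samples_per_second(loader, num_batches=50):
    """Rough throughput estimate over the first `num_batches` of an iterable data loader."""
    n_samples = 0
    start = time.perf_counter()
    for i, batch in enumerate(loader):
        n_samples += len(batch)  # assumes each batch has a meaningful len()
        if i + 1 >= num_batches:
            break
    return n_samples / (time.perf_counter() - start)
```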

1. **Optimized Chunking & Compression**
- We explored different chunking and compression strategies to optimize the data loading performance. We found that using Zarr v3 with optimized chunking and compression significantly improved the data loading performance.
2. **GPU-native data loading with Zarr v3 and KvikIO**
- Leveraging Zarr v3's support for reading data directly into GPU memory using CuPy arrays, we utilized [KvikIO](https://docs.rapids.ai/api/kvikio/stable/) to bypass CPU memory, enabling direct data transfer from storage to GPU.
Member:

Suggested change
- Leveraging Zarr v3's support for reading data directly into GPU memory using CuPy arrays, we utilized [KvikIO](https://docs.rapids.ai/api/kvikio/stable/) to bypass CPU memory, enabling direct data transfer from storage to GPU.
- Leveraging Zarr Python 3's support for reading data directly into GPU memory using CuPy arrays, we utilized [KvikIO](https://docs.rapids.ai/api/kvikio/stable/) to bypass CPU memory, enabling direct data transfer from storage to GPU.


1. **Optimized Chunking & Compression**
- We explored different chunking and compression strategies to optimize the data loading performance. We found that using Zarr v3 with optimized chunking and compression significantly improved the data loading performance.
2. **GPU-native data loading with Zarr v3 and KvikIO**
Member:

Suggested change
2. **GPU-native data loading with Zarr v3 and KvikIO**
2. **GPU-native data loading with Zarr Python 3 and KvikIO**

### Step 1: Optimized chunking & Compression

The ERA-5 dataset we were using had a sub-optimal chunking scheme of `{'time': 10, 'channel': C, 'height': H, 'width': W}`, which meant that a minimum of 10 timesteps of data was being read even if we only needed 2 consecutive timesteps at a time.
We decided to rechunk the data to align with our access pattern of 1-timestep at a time, while reformatting to Zarr v3.
Member:

Suggested change
We decided to rechunk the data to align with our access pattern of 1-timestep at a time, while reformatting to Zarr v3.
We decided to rechunk the data to align with our access pattern of 1-timestep at a time, while reformatting to Zarr format 3.

airt.to_zarr(store="/tmp/air-temp.zarr", mode="w", zarr_format=3, consolidated=False)

with zarr.config.enable_gpu():
    ds = xr.open_dataset("/tmp/air-temp.zarr", engine="zarr", consolidated=False)
Member:

Can you explain why not to use consolidated metadata for direct to GPU reading? I am surprised about this because I would expect it to improve performance.

Member:

Consolidated metadata wasn't supported with GDS/kvikio, at least when I tested it 3 years ago in xarray-contrib/cupy-xarray#10 (comment). Maybe it works with Zarr v3, but we don't have a GDS device to verify on 🙂
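
Related to that: a quick way to check whether GDS is actually being used on a given machine is to query KvikIO's runtime settings. A sketch, assuming the installed KvikIO exposes `kvikio.defaults.compat_mode()` and `kvikio.DriverProperties`:

```python
import kvikio
import kvikio.defaults

# If compat mode is enabled, KvikIO falls back to POSIX I/O instead of GDS (cuFile).
print("KvikIO version:", kvikio.__version__)
print("compat mode:", kvikio.defaults.compat_mode())
print("GDS available:", kvikio.DriverProperties().is_gds_available)
```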


![Flowchart-technically decompression is still done on CPUs](/posts/gpu-pipline/flowchart_2.png)

Eventually with this [cupy-xarray Pull Request merged](https://github.com/xarray-contrib/cupy-xarray/pull/70) (based on earlier work at https://xarray.dev/blog/xarray-kvikio), this can be simplified to:
Member:

Suggested change
Eventually with this [cupy-xarray Pull Request merged](https://github.com/xarray-contrib/cupy-xarray/pull/70) (based on earlier work at https://xarray.dev/blog/xarray-kvikio), this can be simplified to:
Eventually with this [cupy-xarray Pull Request merged](https://github.com/xarray-contrib/cupy-xarray/pull/70) (based on earlier work at [https://xarray.dev/blog/xarray-kvikio](https://xarray.dev/blog/xarray-kvikio)), this can be simplified to:


The figure above shows a benchmark comparing CPU- vs GPU-based decompression, with and without GDS enabled, using [the data reading benchmark here](https://github.com/pangeo-data/ncar-hackathon-xarray-on-gpus/blob/main/benchmark/era5_zarr_benchmark.py).

![GPU native decompression](/posts/gpu-pipline/zstd_benchmark.png)
Member:

do you know why it's slower with GDS?


Next, check out the [end-to-end example](https://github.com/pangeo-data/ncar-hackathon-xarray-on-gpus/tree/main/zarr_ML_optimization) directory, where we show how to integrate the DALI pipeline into a PyTorch DataLoader and training loop. This example demonstrates how to use DALI to load data from Zarr stores, preprocess it on the GPU, and feed it into a PyTorch model for training.

Profiling results from the DALI pipeline demonstrate effective overlap between CPU and GPU workloads, significantly reducing GPU idle time (blue) and increasing overall training throughput:
Member:

the GPU activity is blue, right? rather than the idle time
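
For readers who don't follow the link, a condensed, illustrative version of that ExternalSource pattern (class and variable names are invented for the sketch, and the store path/variable reuse the toy air-temperature example from earlier in the post):

```python
import numpy as np
import zarr
from nvidia.dali import fn, pipeline_def, types

class ZarrBatchSource:
    """Toy callable returning batches of consecutive time steps from a Zarr array."""

    def __init__(self, path, variable, batch_size):
        self.array = zarr.open(path, mode="r")[variable]
        self.batch_size = batch_size

    def __call__(self, iteration_idx):
        start = (iteration_idx * self.batch_size) % self.array.shape[0]
        return [np.asarray(self.array[(start + i) % self.array.shape[0]], dtype=np.float32)
                for i in range(self.batch_size)]

@pipeline_def(batch_size=8, num_threads=4, device_id=0)
def zarr_pipeline(source):
    # ExternalSource pulls a CPU batch from the callable; .gpu() moves it to the device,
    # and any further DALI operators (here just a normalize) run on the GPU.
    data = fn.external_source(source=source, batch=True, dtype=types.FLOAT)
    return fn.normalize(data.gpu())

pipe = zarr_pipeline(source=ZarrBatchSource("/tmp/air-temp.zarr", "air", batch_size=8))
pipe.build()
(gpu_batch,) = pipe.run()
```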

Comment on lines 294 to 297
> - **Compression trade-offs**: Using compression can reduce the amount of data transferred, but can also increase the time spent on decompression. We found that using Zarr v3 with GPU-based decompression can significantly improve performance.
> - **NVIDIA DALI** is a powerful tool for optimizing data loading, but requires some effort to integrate into existing workflows.
> - **CuPy-Xarray integration** is still a work in progress, but can be very useful for GPU-native workflows. Please see this PR for more details: [xarray-contrib/cupy-xarray#70](https://github.com/xarray-contrib/cupy-xarray/pull/70).
> - **GPU-native decompression** is a promising area for future work, but full support (e.g. GPU-side Zstd decompression) requires further development and testing.
Member:

Suggested change
> - **Compression trade-offs**: Using compression can reduce the amount of data transferred, but can also increase the time spent on decompression. We found that using Zarr v3 with GPU-based decompression can significantly improve performance.
> - **NVIDIA DALI** is a powerful tool for optimizing data loading, but requires some effort to integrate into existing workflows.
> - **CuPy-Xarray integration** is still a work in progress, but can be very useful for GPU-native workflows. Please see this PR for more details: [xarray-contrib/cupy-xarray#70](https://github.com/xarray-contrib/cupy-xarray/pull/70).
> - **GPU-native decompression** is a promising area for future work, but full support (e.g. GPU-side Zstd decompression) requires further development and testing.
> - **Compression trade-offs**: Using compression can reduce the amount of data transferred, but can also increase the time spent on decompression. We found that using Zarr v3 with GPU-based decompression can significantly improve performance.
> - **GPU-native decompression** is a promising area for future work, but full support (e.g. GPU-side Zstd decompression) requires further development and testing.
> - **NVIDIA DALI** is a powerful tool for optimizing data loading, but requires some effort to integrate into existing workflows.
> - **CuPy-Xarray integration** is still a work in progress, but can be very useful for GPU-native workflows. Please see this PR for more details: [xarray-contrib/cupy-xarray#70](https://github.com/xarray-contrib/cupy-xarray/pull/70).
> - **NVIDIA Nsight** provides a [powerful tool](https://developer.nvidia.com/nsight-systems) for identifying bottlenecks.

@weiji14 (Member) left a comment:

Awesome work, this is coming along really nicely already! Just some minor nitpicks, but hope that we can publish this next month!


- GPU Direct Storage (GDS) for optimal performance
- NVIDIA DALI
- Support for sharded Zarr with GPU-friendly access patterns [already merged in Zarr v3]().
Member:

Suggested change
- Support for sharded Zarr with GPU-friendly access patterns [already merged in Zarr v3]().
- Support for sharded Zarr with GPU-friendly access patterns already [merged](https://github.com/zarr-developers/zarr-python/pull/2978) in Zarr v3.0.8.

github: weiji14
- name: Max Jones
github: maxrjones
- name: Akshay Subranian
Member:

Suggested change
- name: Akshay Subranian
- name: Akshay Subramaniam


ML pipelines for large scientific datasets typically include steps:

- Reading raw data from disk or object storage (often CPU-bound)
Member:

Suggested change
- Reading raw data from disk or object storage (often CPU-bound)
- Reading raw data from disk or object storage (often IO-bound)

- Transforming / preprocessing data (often CPU-bound)
- Model Training/Inference (often GPU-bound)

Although GPU compute is incredibly fast, the CPU can become a bottleneck when dealing with large datasets. In an ideal scenario, we want to saturate the GPU with data as quickly as possible to minimize idle time on both the CPU and GPU.
Member:

Suggested change
Although GPU compute is incredibly fast, the CPU can become a bottleneck when dealing with large datasets. In an ideal scenario, we want to saturate the GPU with data as quickly as possible to minimize idle time on both the CPU and GPU.
Although GPU compute is incredibly fast, IO and CPU bottlenecks can be a pain when dealing with large datasets. In an ideal scenario, we want to saturate the GPU with data as quickly as possible to minimize idle time on both the CPU and GPU.

This will read the data directly from the Zarr store to GPU memory, significantly reducing I/O latency, especially for large datasets.
However, it relies on the [NVIDIA GPUDirect Storage (GDS)](https://docs.nvidia.com/datacenter/pgp/gds/index.html) feature to be enabled and correctly configured on your system.

**Note**: Even with GDS, the decompression step is still occurs on the CPU (see next section for GPU solutions!). This means that the data is still being decompressed on the CPU before being transferred to the GPU. However, this is still a significant improvement over the previous method, as it reduces the amount of data that needs to be transferred over the PCIe bus. In the figure below, we show the flowchart of the data loading process with GDS enabled (i.e. using `kvikio`):
Member:

Suggested change
**Note**: Even with GDS, the decompression step is still occurs on the CPU (see next section for GPU solutions!). This means that the data is still being decompressed on the CPU before being transferred to the GPU. However, this is still a significant improvement over the previous method, as it reduces the amount of data that needs to be transferred over the PCIe bus. In the figure below, we show the flowchart of the data loading process with GDS enabled (i.e. using `kvikio`):
**Note**: Even with GDS, the decompression step will still occur on the CPU (see next section for GPU solutions!). This means that the data is still being decompressed on the CPU before being transferred to the GPU. However, this is still a significant improvement over the previous method, as it reduces the amount of data that needs to be transferred over the PCIe bus. In the figure below, we show the flowchart of the data loading process with GDS enabled (i.e. using `kvikio`):


For a fully GPU-native pipeline, the decompression step should also be done on the GPU. This is where [NVIDIA's nvCOMP](https://developer.nvidia.com/nvcomp) library comes in. nvCOMP provides fast, GPU-native implementations of popular compression algorithms like Zstandard (Zstd).

With nvCOMP, all steps of data loading including reading, decompressing, and transforming data can be done on the GPU, significantly reducing the time spent on data loading. Here is a flowchart of the data loading process with GDS and GPU-based decompression enabled:
Member:

Suggested change
With nvCOMP, all steps of data loading including reading, decompressing, and transforming data can be done on the GPU, significantly reducing the time spent on data loading. Here is a flowchart of the data loading process with GDS and GPU-based decompression enabled:
With nvCOMP, all steps of data loading including reading from disk, decompression, and transforming data can be done on the GPU, significantly reducing the time spent on data loading. Here is a flowchart of the data loading process with GDS and GPU-based decompression enabled:


> These results show that GPU-based decompression can significantly reduce the time spent on data loading and cut I/O latency from storage to device (less data transfer over PCIe/NVLink). This is especially useful for large datasets, as it allows for faster data loading and processing.

Keep an eye on this space, as we are working on integrating this into the Zarr ecosystem to enable GPU-based decompression for Zarr stores. This will allow for a fully GPU-native workflow, where all steps of data loading, including reading, decompressing, and transforming data, can be done on the GPU.
Member:

Suggested change
Keep an eye on this space, as we are working on integrating this into the Zarr ecosystem to enable GPU-based decompression for Zarr stores. This will allow for a fully GPU-native workflow, where all steps of data loading, including reading, decompressing, and transforming data, can be done on the GPU.
Keep an eye on this space, as we are working on integrating this into the Zarr ecosystem to enable GPU-based decompression for Zarr stores. This will allow for a fully GPU-native workflow, where all steps of data loading, including reading, decompression, and transforming data, can be done on the GPU.

@kafitzgerald left a comment:

Thanks so much for putting this together!

Mostly just a few minor suggestions from my end beyond the existing comments / questions.

github: akshaysubr
- name: Thomas Augspurger
github: tomaugspurger
- name: Katelyn Fitzgerald


Suggested change
- name: Katelyn Fitzgerald
- name: Katelyn FitzGerald


## TL;DR

Earth science AI/ML workflows are often bottlenecked by slow data loading, leaving GPUs underutilized while CPUs struggle to feed large climate datasets like ERA5. In this blog post, we discuss how to build a GPU-native pipeline using Zarr v3, CuPy, KvikIO, and NVIDIA DALI to accelerate data throughput. We walk through profiling results, chunking strategies, direct-to-GPU data reads, and GPU-accelerated preprocessing, all aimed at maximizing GPU usage and minimizing I/O overhead.


Suggested change
Earth science AI/ML workflows are often bottlenecked by slow data loading, leaving GPUs underutilized while CPUs struggle to feed large climate datasets like ERA5. In this blog post, we discuss how to build a GPU-native pipeline using Zarr v3, CuPy, KvikIO, and NVIDIA DALI to accelerate data throughput. We walk through profiling results, chunking strategies, direct-to-GPU data reads, and GPU-accelerated preprocessing, all aimed at maximizing GPU usage and minimizing I/O overhead.
Earth science AI/ML workflows are often limited by slow data loading, leaving GPUs underutilized while CPUs struggle to feed large climate datasets like ERA5. In this blog post, we discuss how to build a GPU-native pipeline using Zarr v3, CuPy, KvikIO, and NVIDIA DALI to accelerate data throughput. We walk through profiling results, chunking strategies, direct-to-GPU data reads, and GPU-accelerated preprocessing, all aimed at maximizing GPU usage and minimizing I/O overhead.

Not committed to this - just trying to vary the language a bit.


In large-scale geospatial AI and machine learning workflows, data loading is often the main bottleneck. Traditional pipelines rely on CPUs to preprocess and transfer massive datasets from storage to GPU memory, consuming resources and limiting scalability and effective use of GPU resources.

To tackle this issue, a team from the [National Center for Atmospheric Research (NSF-NCAR)](https://ncar.ucar.edu) and [Development Seed](https://developmentseed.org) with mentors from [NVIDIA](https://www.nvidia.com) participated in the [OpenHackathon](https://www.openhackathons.org/s/) to demonstrate how AI/ML workflows in Earth system sciences can benefit from GPU-native workflows using tools such as [Zarr](https://zarr.readthedocs.io/), [KvikIO](https://docs.rapids.ai/api/kvikio/stable/), and [DALI](https://developer.nvidia.com/dali).


Suggested change
To tackle this issue, a team from the [National Center for Atmospheric Research (NSF-NCAR)](https://ncar.ucar.edu) and [Development Seed](https://developmentseed.org) with mentors from [NVIDIA](https://www.nvidia.com) participated in the [OpenHackathon](https://www.openhackathons.org/s/) to demonstrate how AI/ML workflows in Earth system sciences can benefit from GPU-native workflows using tools such as [Zarr](https://zarr.readthedocs.io/), [KvikIO](https://docs.rapids.ai/api/kvikio/stable/), and [DALI](https://developer.nvidia.com/dali).
To tackle this issue, a team from the [NSF National Center for Atmospheric Research (NSF NCAR)](https://ncar.ucar.edu) and [Development Seed](https://developmentseed.org) with mentors from [NVIDIA](https://www.nvidia.com) participated in an [Open Hackathon](https://www.openhackathons.org/s/) to demonstrate how Earth system science AI/ML workflows can benefit from GPU-native workflows using tools such as [Zarr](https://zarr.readthedocs.io/), [KvikIO](https://docs.rapids.ai/api/kvikio/stable/), and [DALI](https://developer.nvidia.com/dali).


## Problem

ML pipelines for large scientific datasets typically include steps:


Suggested change
ML pipelines for large scientific datasets typically include steps:
ML pipelines for large scientific datasets typically include the following steps:


- Reading raw data from disk or object storage (often CPU-bound)
- Transforming / preprocessing data (often CPU-bound)
- Model Training/Inference (often GPU-bound)


Suggested change
- Model Training/Inference (often GPU-bound)
- Model training / inference (often GPU-bound)


### Step 1: Optimized Chunking & Compression

The ERA-5 dataset we were using had a sub-optimal chunking scheme of `{'time': 10, 'channel': C, 'height': H, 'width': W}`, which meant that a minimum of 10 timesteps of data was being read even if we only needed 2 consecutive timesteps at a time.


Suggested change
The ERA-5 dataset we were using had a sub-optimal chunking scheme of `{'time': 10, 'channel': C, 'height': H, 'width': W}`, which meant that a minimum of 10 timesteps of data was being read even if we only needed 2 consecutive timesteps at a time.
The copy of the ERA5 dataset we were using initially had a suboptimal chunking scheme of `{'time': 10, 'channel': C, 'height': H, 'width': W}`, which meant that a minimum of 10 time steps of data was being read even if we only needed 2 consecutive time steps.
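
As a concrete illustration of that rechunking step, a sketch based on the toy air-temperature store used elsewhere in the post (paths and variable names are illustrative, not the actual ERA5 pipeline code):

```python
import xarray as xr

# Open the original store lazily, rechunk to one time step per chunk,
# and rewrite it as Zarr format 3 to match the 1-time-step access pattern.
ds = xr.open_dataset("/tmp/air-temp.zarr", engine="zarr", chunks={})
ds = ds.chunk({"time": 1})
for var in ds.variables:
    ds[var].encoding.pop("chunks", None)  # drop stale chunk encoding from the source store
ds.to_zarr("/tmp/air-temp-rechunked.zarr", mode="w", zarr_format=3, consolidated=False)
```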


This will read the data directly from the Zarr store to GPU memory, significantly reducing I/O latency, especially for large datasets.
However, it relies on the [NVIDIA GPUDirect Storage (GDS)](https://docs.nvidia.com/datacenter/pgp/gds/index.html) feature to be enabled and correctly configured on your system.


Suggested change
However, it relies on the [NVIDIA GPUDirect Storage (GDS)](https://docs.nvidia.com/datacenter/pgp/gds/index.html) feature to be enabled and correctly configured on your system.
However, it relies on the [NVIDIA GPUDirect Storage (GDS)](https://docs.nvidia.com/datacenter/pgp/gds/index.html) feature being enabled and correctly configured on your system.


To address this inefficiency, we adopted [NVIDIA DALI (Data Loading Library)](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html), which provides a flexible, GPU-accelerated data pipeline with built-in support for asynchronous execution across CPU and GPU stages. DALI helps reduce CPU pressure, enables concurrent preprocessing, and increases training throughput by pipelining operations.

First, we began with a minimal example in [zarr_DALI directory](https://github.com/pangeo-data/ncar-hackathon-xarray-on-gpus/tree/main/zarr_DALI) with short, contained examples of a DALI pipeline loading directly from Zarr stores. This example shows how to build a custom DALI `pipeline` that uses an `ExternalSource` operator to load batched image data from a Zarr store and transfer them directly to GPU memory using CuPy arrays.


Suggested change
First, we began with a minimal example in [zarr_DALI directory](https://github.com/pangeo-data/ncar-hackathon-xarray-on-gpus/tree/main/zarr_DALI) with short, contained examples of a DALI pipeline loading directly from Zarr stores. This example shows how to build a custom DALI `pipeline` that uses an `ExternalSource` operator to load batched image data from a Zarr store and transfer them directly to GPU memory using CuPy arrays.
First, we began with a minimal example in the [zarr_DALI directory](https://github.com/pangeo-data/ncar-hackathon-xarray-on-gpus/tree/main/zarr_DALI) with short, contained examples of a DALI pipeline loading directly from Zarr stores. This example shows how to build a custom DALI `pipeline` that uses an `ExternalSource` operator to load batched image data from a Zarr store and transfer them directly to GPU memory using CuPy arrays.

- GPU Direct Storage (GDS) for optimal performance
- NVIDIA DALI
- Support for sharded Zarr with GPU-friendly access patterns [already merged in Zarr v3]().
- Work out how to use GDS when reading from cloud object store instead of on-prem disk


Suggested change
- Work out how to use GDS when reading from cloud object store instead of on-prem disk
- Using GDS when reading from cloud object storage instead of on-prem disk storage


## Acknowledgements 🙌

This work was developed during the [NCAR/NOAA Open Hackathon](https://www.openhackathons.org/s/siteevent/a0CUP00000rwYYZ2A2/se000355) in Golden, Colorado from 18-27 February 2025. We would like to thank the OpenACC Hackathon for the opportunity to participate and learn from this experience. Special thanks to NCAR for providing access to NCAR’s Derecho supercomputer which we used for this project. Thanks also to the open-source communities behind [Xarray](https://github.com/pydata/xarray), [Zarr](https://github.com/zarr-developers/zarr-python), [CuPy](https://github.com/cupy/cupy), [KvikIO](https://github.com/rapidsai/kvikio), and [DALI](https://github.com/NVIDIA/DALI).


Suggested change
This work was developed during the [NCAR/NOAA Open Hackathon](https://www.openhackathons.org/s/siteevent/a0CUP00000rwYYZ2A2/se000355) in Golden, Colorado from 18-27 February 2025. We would like to thank the OpenACC Hackathon for the opportunity to participate and learn from this experience. Special thanks to NCAR for providing access to NCAR’s Derecho supercomputer which we used for this project. Thanks also to the open-source communities behind [Xarray](https://github.com/pydata/xarray), [Zarr](https://github.com/zarr-developers/zarr-python), [CuPy](https://github.com/cupy/cupy), [KvikIO](https://github.com/rapidsai/kvikio), and [DALI](https://github.com/NVIDIA/DALI).
This work was developed during the [NCAR/NOAA Open Hackathon](https://www.openhackathons.org/s/siteevent/a0CUP00000rwYYZ2A2/se000355) in Golden, Colorado from 18-27 February 2025. We would like to thank the OpenACC Hackathon for the opportunity to participate and learn from this experience. Special thanks to NSF NCAR for providing access to their Derecho supercomputer which we used for this project. Thanks also to the open-source communities behind [Xarray](https://github.com/pydata/xarray), [Zarr](https://github.com/zarr-developers/zarr-python), [CuPy](https://github.com/cupy/cupy), [KvikIO](https://github.com/rapidsai/kvikio), and [DALI](https://github.com/NVIDIA/DALI).
