Numba/CUDA projection kernels for reproject, README update #1046
Open
brendancol wants to merge 22 commits into geotiff-reader-writer from
Conversation
The intro, GDAL caveat, and docs note were written when the library was much smaller. Now there are 100+ functions, native GeoTIFF I/O, 4-backend dispatch, and 33+ user guide notebooks. Updated the prose to say what the library actually does instead of underselling it.
:gpu: doesn't render on GitHub -- switched to 🖥️. Added a quick start snippet showing the read_geotiff -> reproject -> write_geotiff workflow.
Switched quick start to 'import xrspatial as xrs', moved the read/reproject/write flow into the main example using the .xrs accessor, and simplified the standalone function example to use the xrs namespace.
Three fixes to _grid.py:

- Resolution estimation now transforms each axis independently and uses the geometric mean for square pixels (was a diagonal step)
- Edge sampling increased from 21 to 101 points, plus a 21x21 interior grid for better bounds coverage
- ceil() replaced with round() for dimension calculation to prevent floating-point noise from adding spurious rows/columns
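As a rough sketch of the last two ideas (plain NumPy, hypothetical helper names, not the actual `_grid.py` code):

```python
import math
import numpy as np

def estimate_resolution(xs, ys):
    """Square-pixel resolution as the geometric mean of per-axis steps.

    xs, ys are 1-D arrays of already-transformed axis coordinates
    (each axis transformed independently, as the fix describes).
    """
    dx = np.abs(np.diff(xs)).mean()
    dy = np.abs(np.diff(ys)).mean()
    return float(np.sqrt(dx * dy))  # geometric mean, not a diagonal step

def grid_shape(extent, resolution):
    """Dimension count via round() so float noise can't add a row/column."""
    # math.ceil(677.0000000000001) == 678, but round() gives the intended 677
    return int(round(extent / resolution))
```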
Ports six projections from the PROJ library to Numba (CPU) and Numba CUDA (GPU), bypassing pyproj for common CRS pairs:

- Web Mercator (EPSG:3857) -- spherical, 3 lines per direction
- Transverse Mercator / UTM (326xx, 327xx, 269xx) -- 6th-order Krueger series (Karney 2011), closed-form forward and inverse
- Ellipsoidal Mercator (EPSG:3395) -- Newton inverse
- Lambert Conformal Conic (e.g. EPSG:2154) -- Newton inverse
- Albers Equal Area (e.g. EPSG:5070) -- authalic latitude series
- Cylindrical Equal Area (e.g. EPSG:6933) -- authalic latitude series

CPU Numba kernels are 6-9x faster than pyproj; CUDA kernels are 40-165x faster. Unsupported CRS pairs fall back to pyproj: _transform_coords now tries Numba first, then pyproj. The CuPy chunk worker tries CUDA first, keeping coordinates on-device.
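The spherical Web Mercator case really is only a few lines per direction. A plain-NumPy sketch of the standard formulas (the PR compiles the equivalent math as Numba/CUDA kernels; function names here are hypothetical):

```python
import numpy as np

R = 6378137.0  # WGS84 semi-major axis, used as the sphere radius by EPSG:3857

def webmercator_forward(lon_deg, lat_deg):
    """lon/lat in degrees -> EPSG:3857 metres, spherical formulas."""
    x = R * np.radians(lon_deg)
    y = R * np.log(np.tan(np.pi / 4 + np.radians(lat_deg) / 2))
    return x, y

def webmercator_inverse(x, y):
    """EPSG:3857 metres -> lon/lat in degrees."""
    lon = np.degrees(x / R)
    lat = np.degrees(2 * np.arctan(np.exp(y / R)) - np.pi / 2)
    return lon, lat
```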
Compares xrspatial (approx, exact, numba) against rioxarray on synthetic grids and real-world GeoTIFFs. Measures both performance (median of 5 runs) and pixel-level consistency (RMSE, R^2, NaN agreement) by forcing both libraries onto the same output grid.
Updated Reproject description and added a table listing the six projections with native Numba CPU and CUDA GPU kernels.
Three new CPU Numba projections for bathymetric/ocean use cases:

- Sinusoidal (ellipsoidal): MODIS ocean/land products. Uses the pj_mlfn meridional arc length with a 5th-order series. Forward: sub-micrometer vs pyproj. Roundtrip: exact.
- Polar Stereographic (N/S): IBCAO Arctic (3996/3413), IBCSO Antarctic (3031), UPS. Iterative inverse (15 iterations max). Forward: sub-nanometer. Roundtrip: exact.

LAEA is implemented but disabled in dispatch pending investigation of a ~940m oblique-mode error (the kernels are in the file for a future fix, just not wired into the dispatch table).
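The iterative Polar Stereographic inverse is the standard Snyder fixed-point iteration on the conformal latitude factor. A pure-Python sketch under assumed WGS84 parameters (helper names hypothetical, not the PR's kernels):

```python
import math

# WGS84 eccentricity (assumed ellipsoid for this sketch)
A = 6378137.0
F = 1 / 298.257223563
E = math.sqrt(F * (2 - F))

def ts_from_lat(phi):
    """Snyder's t factor used by the ellipsoidal Polar Stereographic forward."""
    con = E * math.sin(phi)
    return math.tan(math.pi / 4 - phi / 2) / ((1 - con) / (1 + con)) ** (E / 2)

def lat_from_ts(t, max_iter=15, tol=1e-12):
    """Iterative inverse: recover latitude from t, capped at 15 iterations
    like the commit describes."""
    phi = math.pi / 2 - 2 * math.atan(t)  # spherical first guess
    for _ in range(max_iter):
        con = E * math.sin(phi)
        new = math.pi / 2 - 2 * math.atan(t * ((1 - con) / (1 + con)) ** (E / 2))
        if abs(new - phi) < tol:
            return new
        phi = new
    return phi
```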
The oblique-mode LAEA had xmf and ymf swapped (rq/dd vs rq*dd). Research agent found the bug by comparing against PROJ's laea.cpp source. Forward accuracy is now sub-millimeter vs pyproj.
State Plane zones that use Transverse Mercator (Maine, New Hampshire, Wisconsin, etc.) now hit the Numba fast path. Uses the same Krueger 6th-order series as UTM, with a Zb offset for non-zero lat_0. Meter-based zones only; US survey foot zones fall back to pyproj.
State Plane zones in us-ft (136 zones) and ft (28 zones) now use the Numba fast path. The Krueger/LCC kernels compute in metres internally, then divide by the unit conversion factor (0.3048006 for us-ft, 0.3048 for ft) on output. x_0/y_0 from PROJ4 dicts are already in metres regardless of the units parameter.
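A minimal sketch of the unit handling described above (helper names hypothetical; the conversion factors are from the commit text, and the false-origin ordering follows from x_0/y_0 being metres):

```python
# The projection kernel always computes in metres; only the output is rescaled.
US_SURVEY_FOOT = 1200.0 / 3937.0   # ~0.3048006 m, the us-ft factor
INTL_FOOT = 0.3048                 # the ft factor

def apply_false_origin(x_m, y_m, x_0, y_0):
    """x_0/y_0 from PROJ4 dicts are already metres, so add them before
    the unit conversion, not after."""
    return x_m + x_0, y_m + y_0

def to_output_units(x_m, y_m, unit_factor):
    """Convert kernel output (metres) into the CRS's linear unit."""
    return x_m / unit_factor, y_m / unit_factor
```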
…#1045)

- Benchmark table showing Numba vs pyproj for all 12 supported projections
- Fixed dispatch to work with custom PROJ strings (no EPSG code), needed for Sinusoidal and other non-registered CRS definitions
- Fixed _utm_params to handle None epsg_code
All 12 projections now have GPU CUDA kernels. Performance on an A6000:

- Sinusoidal: 18ms (56x vs pyproj)
- LAEA Europe: 18ms (92x)
- Polar Stereographic: 57ms (64-67x)
- State Plane tmerc: 23ms (88x)
- State Plane lcc ftUS: 36ms (124x)
- LCC France: 39ms (78x)

All bit-exact against the CPU Numba kernels. Updated the README benchmark table and projection support matrix.
CRS on non-WGS84/GRS80 ellipsoids (Airy for BNG, Clarke 1866 for NAD27, Bessel for Tokyo, etc.) now fall back to pyproj instead of silently skipping the datum transformation. Without this, BNG had ~100m error, NAD27 ~24m, Tokyo ~900m. Each _*_params() function now checks _is_wgs84_compatible_ellipsoid() before returning parameters. EPSG-specific paths (UTM, Web Mercator) were already safe since they only match WGS84/NAD83 codes.
NAD27 (EPSG:4267) sources and targets now go through the Numba fast path. After the projection kernel runs in WGS84, a 3-parameter geocentric Helmert transform (dx=-8, dy=160, dz=176 for Clarke 1866) shifts coordinates to/from NAD27. Accuracy: mean 2.9m, p95 5.8m vs pyproj across CONUS. This matches PROJ's own accuracy when NADCON grids aren't installed. The framework is extensible to other datums by adding entries to _DATUM_PARAMS. Also broadened geographic CRS detection from WGS84/NAD83-only to include NAD27, so NAD27 -> UTM and NAD27 -> State Plane dispatch correctly.
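The 3-parameter shift can be sketched in pure Python with the standard geodetic/geocentric conversions (heights ignored; helper names are hypothetical, while the dx/dy/dz values and ellipsoids are from the commit):

```python
import math

CLARKE_A, CLARKE_B = 6378206.4, 6356583.8   # Clarke 1866 (NAD27)
WGS84_A, WGS84_F = 6378137.0, 1 / 298.257223563
DX, DY, DZ = -8.0, 160.0, 176.0             # Clarke 1866 -> WGS84 shift, metres

def geodetic_to_ecef(lat, lon, a, e2):
    """lat/lon (radians, h=0) -> geocentric XYZ on the given ellipsoid."""
    n = a / math.sqrt(1 - e2 * math.sin(lat) ** 2)
    return (n * math.cos(lat) * math.cos(lon),
            n * math.cos(lat) * math.sin(lon),
            n * (1 - e2) * math.sin(lat))

def ecef_to_geodetic(x, y, z, a, e2, iters=10):
    """Geocentric XYZ -> lat/lon (radians), fixed-point latitude iteration."""
    lon = math.atan2(y, x)
    p = math.hypot(x, y)
    lat = math.atan2(z, p * (1 - e2))  # first guess
    for _ in range(iters):
        n = a / math.sqrt(1 - e2 * math.sin(lat) ** 2)
        lat = math.atan2(z + e2 * n * math.sin(lat), p)
    return lat, lon

def nad27_to_wgs84(lat, lon):
    """Project through geocentric space, apply the 3-parameter Helmert shift."""
    e2_c = 1 - (CLARKE_B / CLARKE_A) ** 2
    e2_w = WGS84_F * (2 - WGS84_F)
    x, y, z = geodetic_to_ecef(lat, lon, CLARKE_A, e2_c)
    return ecef_to_geodetic(x + DX, y + DY, z + DZ, WGS84_A, e2_w)
```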
Native CUDA nearest, bilinear, and cubic (Catmull-Rom) resampling kernels replace cupyx.scipy.ndimage.map_coordinates. When the CUDA projection path produces on-device coordinates, the entire pipeline now stays on GPU with no CPU roundtrip.

Full reproject pipeline (2048x2048, bilinear, 4326 -> UTM):

- GPU end-to-end: 78ms
- CPU Numba: 161ms
- Speedup: 2.1x
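For reference, the Catmull-Rom weights (cubic convolution with a = -0.5) that such a kernel evaluates per output sample, as a 1-D NumPy sketch rather than the actual CUDA code:

```python
import numpy as np

def catmull_rom_weights(t):
    """Weights for the 4 neighbouring samples at fractional offset t in [0, 1).
    Standard Catmull-Rom polynomial; weights always sum to 1."""
    t2, t3 = t * t, t * t * t
    return np.array([
        -0.5 * t3 + t2 - 0.5 * t,
         1.5 * t3 - 2.5 * t2 + 1.0,
        -1.5 * t3 + 2.0 * t2 + 0.5 * t,
         0.5 * t3 - 0.5 * t2,
    ])

def cubic_sample_1d(data, x):
    """Sample a 1-D array at float position x (requires 1 <= x <= len(data) - 3)."""
    i = int(np.floor(x))
    w = catmull_rom_weights(x - i)
    return float(w @ data[i - 1:i + 3])
```

The 2-D kernel applies the same weights separably in rows and columns; Catmull-Rom interpolates the samples exactly and reproduces linear ramps, which makes it easy to sanity-check.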
Merges 4 overlapping WGS84 tiles and compares timing and pixel-level consistency against rioxarray.merge_arrays. Baseline results (xrspatial is currently slower on merge):

- 512x512 tiles: 59ms vs 56ms (1.1x)
- 1024x1024: 293ms vs 114ms (2.6x)
- 2048x2048: 2.52s vs 656ms (3.8x)

Consistency: RMSE < 0.012, NaN agreement > 99.8%.
Two optimizations that make merge 4.5-7.3x faster:

1. Same-CRS tiles skip reprojection entirely. When source and target CRS match, tiles are placed into the output grid by direct coordinate alignment (array slicing). Falls back to full reprojection if resolutions differ by >1%.
2. try_numba_transform now bails before allocating coordinate arrays when neither CRS side is a supported geographic system. This saved ~100ms per tile for unsupported pairs.

Merge benchmark (4 overlapping WGS84 tiles):

- 512x512: 13ms (was 59ms, now 2.3x faster than rioxarray)
- 1024x1024: 48ms (was 293ms, now 2.6x faster than rioxarray)
- 2048x2048: 344ms (was 2520ms, now 1.6x faster than rioxarray)
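The same-CRS fast path amounts to integer offset arithmetic plus a slice assignment. A hypothetical NumPy sketch (assumes axis-aligned grids with the same orientation and the tile fully inside the canvas):

```python
import numpy as np

def place_tile(canvas, canvas_x0, canvas_y0, res, tile, tile_x0, tile_y0):
    """Place a same-CRS, same-resolution tile into the output grid by
    direct coordinate alignment -- no reprojection, just slicing."""
    col = int(round((tile_x0 - canvas_x0) / res))
    row = int(round((tile_y0 - canvas_y0) / res))
    h, w = tile.shape
    canvas[row:row + h, col:col + w] = tile
    return canvas
```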
…bles (#1045) README now shows full pipeline times (transform + resampling) and merge times, both compared against rioxarray. More useful than the previous coordinate-transform-only table since users care about total wall time.
For dask+cupy inputs, eagerly compute the source to GPU memory and run the in-memory CuPy reproject in a single pass. This avoids the per-chunk overhead of 64+ dask.delayed calls, each creating a pyproj Transformer and launching small CUDA kernels.

- Before: 958ms (64 delayed chunks, 512x512 each)
- After: 43ms (single CuPy pass, pixel-exact same output)
- Speedup: 22x

The output is a plain CuPy array. For truly out-of-core GPU data that doesn't fit in GPU memory, the old dask.delayed path remains available by passing the data as dask+numpy.
Replaces the eager .compute() approach with a chunked GPU pipeline that fetches only the needed source window per output chunk. This handles sources larger than GPU memory while still being 8-20x faster than the old dask.delayed path. The key optimizations vs dask.delayed:

- CRS objects and transformer created once (not per chunk)
- CUDA projection + native CUDA resampling per chunk
- Default 2048x2048 GPU chunks (not 512x512)
- Sequential loop avoids dask scheduler overhead

Performance (4096x4096 WGS84 -> UTM, bilinear):

- CuPy single pass: 34ms
- Dask+CuPy (2048): 49ms (was 958ms)
- Dask+CuPy (512): 71ms
- Dask+CuPy (256): 124ms

All chunk sizes are pixel-exact vs plain CuPy (max_err < 1e-11).
Closes #1045.
Summary
Reproject fixes
_grid.py had three problems that showed up when comparing output grids against rioxarray; the most visible was ceil() turning 677.0000000000001 into 678 when computing output dimensions. Switched to round().

Numba projection kernels
Six projections ported from the PROJ C library. Unsupported CRS pairs fall back to pyproj.
CUDA kernels
Same projections as the CPU kernels, compiled as @cuda.jit(device=True) functions. Coordinates stay on-device instead of round-tripping through the CPU. All kernels are bit-exact against pyproj (max error < 5e-14 degrees).
Test plan