diff --git a/README.md b/README.md
index d6834656..fb9f5179 100644
--- a/README.md
+++ b/README.md
@@ -79,13 +79,15 @@
 
 :fast_forward: Scalable with [Dask](http://dask.pydata.org)
 
+:desktop_computer: GPU-accelerated with [CuPy](https://cupy.dev/) and [Numba CUDA](https://numba.readthedocs.io/en/stable/cuda/index.html)
+
 :confetti_ball: Free of GDAL / GEOS Dependencies
 
 :earth_africa: General-Purpose Spatial Processing, Geared Towards GIS Professionals
 
 -------
 
-Xarray-Spatial implements common raster analysis functions using Numba and provides an easy-to-install, easy-to-extend codebase for raster analysis.
+Xarray-Spatial is a Python library for raster analysis built on xarray. It has 100+ functions for surface analysis, hydrology (D8, D-infinity, MFD), fire behavior, flood modeling, multispectral indices, proximity, classification, pathfinding, and interpolation. Functions dispatch automatically across four backends (NumPy, Dask, CuPy, Dask+CuPy). A built-in GeoTIFF/COG reader and writer handles raster I/O without GDAL.
 
 ### Installation
 ```bash
@@ -119,9 +121,9 @@ In all the above, the command will download and store the files into your curren
 
 `xarray-spatial` grew out of the [Datashader project](https://datashader.org/), which provides fast rasterization of vector data (points, lines, polygons, meshes, and rasters) for use with xarray-spatial.
 
-`xarray-spatial` does not depend on GDAL / GEOS, which makes it fully extensible in Python but does limit the breadth of operations that can be covered.  xarray-spatial is meant to include the core raster-analysis functions needed for GIS developers / analysts, implemented independently of the non-Python geo stack.
+`xarray-spatial` does not depend on GDAL or GEOS. Raster I/O, reprojection, compression codecs, and coordinate handling are all pure Python and Numba -- no C/C++ bindings anywhere in the stack.
 
-Our documentation is still under construction, but [docs can be found here](https://xarray-spatial.readthedocs.io/en/latest/).
+[API reference docs](https://xarray-spatial.readthedocs.io/en/latest/) and [33+ user guide notebooks](examples/user_guide/) cover every module.
 
 #### Raster-huh?
 
@@ -210,9 +212,63 @@ write_vrt('mosaic.vrt', ['tile1.tif', 'tile2.tif'])  # generate VRT
 
 | Name | Description | Source | NumPy xr.DataArray | Dask xr.DataArray | CuPy GPU xr.DataArray | Dask GPU xr.DataArray |
 |:----------:|:------------|:------:|:----------------------:|:--------------------:|:-------------------:|:------:|
-| [Reproject](xrspatial/reproject/__init__.py) | Reprojects a raster to a new CRS using an approximate transform and numba JIT resampling | Standard (inverse mapping) | ✅️ | ✅️ | ✅️ | ✅️ |
+| [Reproject](xrspatial/reproject/__init__.py) | Reprojects a raster to a new CRS with Numba JIT / CUDA coordinate transforms and resampling. Supports vertical datums (EGM96, EGM2008) and horizontal datum shifts (NAD27, OSGB36, etc.) | Standard (inverse mapping) | ✅️ | ✅️ | ✅️ | ✅️ |
 | [Merge](xrspatial/reproject/__init__.py) | Merges multiple rasters into a single mosaic with configurable overlap strategy | Standard (mosaic) | ✅️ | ✅️ | 🔄 | 🔄 |
 
+Built-in Numba JIT and CUDA projection kernels bypass pyproj for per-pixel coordinate transforms. pyproj is used only for CRS metadata parsing (~1ms, once per call) and output grid boundary estimation (~500 control points, once per call). Any CRS pair without a built-in kernel falls back to pyproj automatically.
+
+| Projection | EPSG examples | CPU Numba | CUDA GPU |
+|:-----------|:-------------|:---------:|:--------:|
+| Web Mercator | 3857 | ✅️ | ✅️ |
+| UTM / Transverse Mercator | 326xx, 327xx, State Plane | ✅️ | ✅️ |
+| Ellipsoidal Mercator | 3395 | ✅️ | ✅️ |
+| Lambert Conformal Conic | 2154, 2229, State Plane | ✅️ | ✅️ |
+| Albers Equal Area | 5070 | ✅️ | ✅️ |
+| Cylindrical Equal Area | 6933 | ✅️ | ✅️ |
+| Sinusoidal | MODIS grids | ✅️ | ✅️ |
+| Lambert Azimuthal Equal Area | 3035, 6931, 6932 | ✅️ | ✅️ |
+| Polar Stereographic | 3031, 3413, 3996 | ✅️ | ✅️ |
+| Oblique Stereographic | custom WGS84 | ✅️ | pyproj fallback |
+| Oblique Mercator (Hotine) | 3375 (RSO) | implemented, disabled | pyproj fallback |
+
+**Vertical datum support:** `geoid_height`, `ellipsoidal_to_orthometric`, `orthometric_to_ellipsoidal` convert between ellipsoidal (GPS) and orthometric (map/MSL) heights using EGM96 (vendored, 2.6MB) or EGM2008 (77MB, downloaded on first use). Reproject can apply vertical shifts during reprojection via the `vertical_crs` parameter.
+
+**Datum shift support:** Reprojection from non-WGS84 datums (NAD27, OSGB36, DHDN, MGI, ED50, BD72, CH1903, D73, AGD66, Tokyo) applies grid-based shifts from PROJ CDN (sub-metre accuracy) with 7-parameter Helmert fallback (1-5m accuracy). 14 grids are registered covering North America, UK, Germany, Austria, Spain, Netherlands, Belgium, Switzerland, Portugal, and Australia.
+
+**ITRF frame support:** `itrf_transform` converts between ITRF2000, ITRF2008, ITRF2014, and ITRF2020 using 14-parameter time-dependent Helmert transforms from PROJ data files. Shifts are mm-level.
+
+**Reproject performance** (reproject-only, 1024x1024, bilinear, vs rioxarray):
+
+| Transform | xrspatial | rioxarray |
+|:---|---:|---:|
+| WGS84 -> Web Mercator | 23ms | 14ms |
+| WGS84 -> UTM 33N | 24ms | 18ms |
+| WGS84 -> Albers CONUS | 41ms | 33ms |
+| WGS84 -> LAEA Europe | 57ms | 17ms |
+| WGS84 -> Polar Stere S | 44ms | 38ms |
+| WGS84 -> LCC France | 44ms | 25ms |
+| WGS84 -> Ellipsoidal Merc | 27ms | 14ms |
+| WGS84 -> CEA EASE-Grid | 24ms | 15ms |
+
+**Full pipeline** (read 3600x3600 Copernicus DEM + reproject to EPSG:3857 + write GeoTIFF):
+
+| Backend | Time |
+|:---|---:|
+| NumPy | 2.7s |
+| CuPy GPU | 348ms |
+| Dask+CuPy GPU | 343ms |
+| rioxarray (GDAL) | 418ms |
+
+**Merge performance** (4 overlapping same-CRS tiles, vs rioxarray):
+
+| Tile size | xrspatial | rioxarray | Speedup |
+|:---|---:|---:|---:|
+| 512x512 | 16ms | 29ms | **1.8x** |
+| 1024x1024 | 52ms | 76ms | **1.5x** |
+| 2048x2048 | 361ms | 280ms | 0.8x |
+
+Same-CRS tiles skip reprojection entirely and are placed by direct coordinate alignment.
+
 -------
 
 ### **Utilities**
@@ -460,17 +516,21 @@ write_vrt('mosaic.vrt', ['tile1.tif', 'tile2.tif'])  # generate VRT
 Importing `xrspatial` registers an `.xrs` accessor on DataArrays and Datasets, giving you tab-completable access to every spatial operation:
 
 ```python
-import xrspatial
-from xrspatial.geotiff import read_geotiff
+import xrspatial as xrs
+from xrspatial.geotiff import read_geotiff, write_geotiff
 
 # Read a GeoTIFF (no GDAL required)
 elevation = read_geotiff('dem.tif')
 
-# Surface analysis — call operations directly on the DataArray
+# Surface analysis
 slope = elevation.xrs.slope()
 hillshaded = elevation.xrs.hillshade(azimuth=315, angle_altitude=45)
 aspect = elevation.xrs.aspect()
 
+# Reproject and write as a Cloud Optimized GeoTIFF
+dem_wgs84 = elevation.xrs.reproject(target_crs='EPSG:4326')
+write_geotiff(dem_wgs84, 'output.tif', cog=True)
+
 # Classification
 classes = elevation.xrs.equal_interval(k=5)
 breaks = elevation.xrs.natural_breaks(k=10)
@@ -478,11 +538,7 @@ breaks = elevation.xrs.natural_breaks(k=10)
 # Proximity
 distance = elevation.xrs.proximity(target_values=[1])
 
-# Multispectral — call on the NIR band, pass other bands as arguments
-nir = xr.DataArray(np.random.rand(100, 100), dims=['y', 'x'])
-red = xr.DataArray(np.random.rand(100, 100), dims=['y', 'x'])
-blue = xr.DataArray(np.random.rand(100, 100), dims=['y', 'x'])
-
+# Multispectral
 vegetation = nir.xrs.ndvi(red)
 enhanced_vi = nir.xrs.evi(red, blue)
 ```
@@ -503,14 +559,14 @@ ndvi_result = ds.xrs.ndvi(nir='band_5', red='band_4')
 
 ##### Function Import Style
 
-All operations are also available as standalone functions if you prefer explicit imports:
+All operations are also available as standalone functions:
 
 ```python
-from xrspatial import hillshade, slope, ndvi
+import xrspatial as xrs
 
-hillshaded = hillshade(elevation)
-slope_result = slope(elevation)
-vegetation = ndvi(nir, red)
+hillshaded = xrs.hillshade(elevation)
+slope_result = xrs.slope(elevation)
+vegetation = xrs.ndvi(nir, red)
 ```
 
 Check out the user guide [here](/examples/user_guide/).
diff --git a/benchmarks/reproject_benchmark.md b/benchmarks/reproject_benchmark.md
new file mode 100644
index 00000000..48295426
--- /dev/null
+++ b/benchmarks/reproject_benchmark.md
@@ -0,0 +1,257 @@
+# Reproject Module: Comprehensive Benchmarks
+
+Generated: 2026-03-22
+
+Hardware: AMD Ryzen / NVIDIA A6000 GPU, PCIe Gen4, NVMe SSD
+
+Python 3.14, NumPy, Numba, CuPy, Dask, pyproj, rioxarray (GDAL)
+
+---
+
+## 1. Full Pipeline Benchmark (read -> reproject -> write)
+
+Source file: Copernicus DEM COG (`Copernicus_DSM_COG_10_N40_00_W075_00_DEM.tif`), 3600x3600, WGS84, deflate+floating-point predictor. Reprojected to Web Mercator (EPSG:3857). Median of 3 runs after warmup.
+
+```python
+from xrspatial.geotiff import read_geotiff, write_geotiff
+from xrspatial.reproject import reproject
+
+dem = read_geotiff('Copernicus_DSM_COG_10_N40_00_W075_00_DEM.tif')
+dem_merc = reproject(dem, 'EPSG:3857')
+write_geotiff(dem_merc, 'output.tif')
+```
+
+All times measured with warm Numba/CUDA kernels (first call incurs ~4.5s JIT compilation).
+
+| Backend | End-to-end | Reproject only | vs rioxarray (reproject) |
+|:--------|----------:|--------------:|:------------------------|
+| CuPy GPU | 747 ms | 73 ms | **2.0x faster** |
+| Dask+CuPy GPU | 782 ms | ~80 ms | ~1.8x faster |
+| rioxarray (GDAL) | 411 ms | 144 ms | 1.0x |
+| NumPy | 2,907 ms | 413 ms | 0.3x |
+
+The CuPy reproject is 2x faster than rioxarray for the coordinate transform + resampling. The end-to-end gap is due to I/O: rioxarray uses rasterio's C-level compressed read/write, while our geotiff reader is pure Python/Numba. For reproject-only workloads (data already in memory), CuPy is the clear winner.
+
+**Note on JIT warmup**: The first `reproject()` call compiles the Numba kernels (~4.5s). All subsequent calls run at full speed. For long-running applications or batch processing, this is amortized over many calls.
+
+---
+
+## 2. Projection Coverage and Accuracy
+
+Each projection was tested with 5 geographically appropriate points. "Max error vs pyproj" measures the maximum positional difference between the Numba JIT inverse transform and pyproj's reference implementation. Errors are measured as approximate ground distance.
+
+| Projection | EPSG examples | Max error vs pyproj | CPU Numba | CUDA GPU |
+|:-----------|:-------------|--------------------:|:---------:|:--------:|
+| Web Mercator | 3857 | < 0.001 mm | yes | yes |
+| UTM / Transverse Mercator | 326xx, 327xx | < 0.001 mm | yes | yes |
+| Ellipsoidal Mercator | 3395 | < 0.001 mm | yes | yes |
+| Lambert Conformal Conic | 2154, 2229 | 0.003 mm | yes | yes |
+| Albers Equal Area | 5070 | 3.5 m | yes | yes |
+| Cylindrical Equal Area | 6933 | 4.8 m | yes | yes |
+| Sinusoidal | MODIS | 0.001 mm | yes | yes |
+| Lambert Azimuthal Equal Area | 3035 | see note | yes | yes |
+| Polar Stereographic (Antarctic) | 3031 | < 0.001 mm | yes | yes |
+| Polar Stereographic (Arctic) | 3413 | < 0.001 mm | yes | yes |
+| Oblique Stereographic | custom WGS84 | < 0.001 mm | yes | fallback |
+| Oblique Mercator (Hotine) | 3375 | N/A | disabled | fallback |
+| State Plane (tmerc) | 26983 | 43 cm | yes | yes |
+| State Plane (LCC, ftUS) | 2229 | 19 cm | yes | yes |
+
+**Notes:**
+- LAEA Europe (3035): The current implementation has a known latitude bias (~700m near Paris, larger at the projection's edges). This is an area for future improvement; for high-accuracy LAEA work, the pyproj fallback is used for unsupported ellipsoids.
+- Albers and CEA: Errors of 3-5m stem from the authalic latitude series approximation. Acceptable for most raster reprojection at typical DEM resolutions (30m+).
+- State Plane: Sub-metre accuracy in both tmerc and LCC variants. Unit conversion (US survey feet) is handled internally.
+- Oblique Stereographic: The Numba kernel exists and works for WGS84-based CRS definitions. EPSG:28992 (RD New) uses the Bessel ellipsoid without a registered datum, so it falls back to pyproj.
+- Oblique Mercator: Kernel implemented but disabled pending alignment with PROJ's omerc.cpp variant handling. Falls back to pyproj.
+
+### Reproject-only timing (1024x1024, bilinear)
+
+| Transform | xrspatial | rioxarray |
+|:-----------|----------:|----------:|
+| WGS84 -> Web Mercator | 23 ms | 14 ms |
+| WGS84 -> UTM 33N | 24 ms | 18 ms |
+| WGS84 -> Albers CONUS | 41 ms | 33 ms |
+| WGS84 -> LAEA Europe | 57 ms | 17 ms |
+| WGS84 -> Polar Stere S | 44 ms | 38 ms |
+| WGS84 -> LCC France | 44 ms | 25 ms |
+| WGS84 -> Ellipsoidal Merc | 27 ms | 14 ms |
+| WGS84 -> CEA EASE-Grid | 24 ms | 15 ms |
+
+At 1024x1024, rioxarray (GDAL) is generally faster than the NumPy backend for reproject-only workloads. The GPU backend closes this gap and pulls ahead for larger rasters (see Section 1). The xrspatial advantage is its pure-Python stack with no GDAL dependency, four-backend dispatch (NumPy/CuPy/Dask/Dask+CuPy), and integrated vertical/datum handling.
+
+### Merge timing (4 overlapping same-CRS tiles)
+
+| Tile size | xrspatial | rioxarray | Speedup |
+|:----------|----------:|----------:|--------:|
+| 512x512 | 16 ms | 29 ms | 1.8x |
+| 1024x1024 | 52 ms | 76 ms | 1.5x |
+| 2048x2048 | 361 ms | 280 ms | 0.8x |
+
+Same-CRS merge skips reprojection and places tiles by coordinate alignment. xrspatial is faster at small to medium sizes; rioxarray catches up at larger sizes due to its C-level copy routines.
+
+---
+
+## 3. Datum Shift Coverage
+
+The reproject module handles horizontal datum shifts for non-WGS84 source CRS. It first tries grid-based shifts (downloaded from the PROJ CDN on first use), falling back to 7-parameter Helmert transforms when no grid is available.
+
+### Grid-based shifts (sub-metre accuracy)
+
+| Registry key | Grid file | Coverage | Description |
+|:-------------|:----------|:---------|:------------|
+| NAD27_CONUS | us_noaa_conus.tif | CONUS | NAD27 -> NAD83 (NADCON) |
+| NAD27_NADCON5_CONUS | us_noaa_nadcon5_nad27_nad83_1986_conus.tif | CONUS | NAD27 -> NAD83 (NADCON5, preferred) |
+| NAD27_ALASKA | us_noaa_alaska.tif | Alaska | NAD27 -> NAD83 (NADCON) |
+| NAD27_HAWAII | us_noaa_hawaii.tif | Hawaii | Old Hawaiian -> NAD83 |
+| NAD27_PRVI | us_noaa_prvi.tif | PR/USVI | NAD27 -> NAD83 |
+| OSGB36_UK | uk_os_OSTN15_NTv2_OSGBtoETRS.tif | UK | OSGB36 -> ETRS89 (OSTN15) |
+| AGD66_GDA94 | au_icsm_A66_National_13_09_01.tif | Australia NT | AGD66 -> GDA94 |
+| DHDN_ETRS89_DE | de_adv_BETA2007.tif | Germany | DHDN -> ETRS89 |
+| MGI_ETRS89_AT | at_bev_AT_GIS_GRID.tif | Austria | MGI -> ETRS89 |
+| ED50_ETRS89_ES | es_ign_SPED2ETV2.tif | Spain (E coast) | ED50 -> ETRS89 |
+| RD_ETRS89_NL | nl_nsgi_rdcorr2018.tif | Netherlands | RD -> ETRS89 |
+| BD72_ETRS89_BE | be_ign_bd72lb72_etrs89lb08.tif | Belgium | BD72 -> ETRS89 |
+| CH1903_ETRS89_CH | ch_swisstopo_CHENyx06_ETRS.tif | Switzerland | CH1903 -> ETRS89 |
+| D73_ETRS89_PT | pt_dgt_D73_ETRS89_geo.tif | Portugal | D73 -> ETRS89 |
+
+Grids are downloaded from `cdn.proj.org` on first use and cached in `~/.cache/xrspatial/proj_grids/`. Bilinear interpolation within the grid is done via Numba JIT.
+
+### Helmert fallback (1-5m accuracy)
+
+When no grid covers the area, a 7-parameter (or 3-parameter) geocentric Helmert transform is applied:
+
+| Datum / Ellipsoid | Type | Parameters (dx, dy, dz, rx, ry, rz, ds) |
+|:------------------|:-----|:-----------------------------------------|
+| NAD27 / Clarke 1866 | 3-param | (-8, 160, 176, 0, 0, 0, 0) |
+| OSGB36 / Airy | 7-param | (446.4, -125.2, 542.1, 0.15, 0.25, 0.84, -20.5) |
+| DHDN / Bessel | 7-param | (598.1, 73.7, 418.2, 0.20, 0.05, -2.46, 6.7) |
+| MGI / Bessel | 7-param | (577.3, 90.1, 463.9, 5.14, 1.47, 5.30, 2.42) |
+| ED50 / Intl 1924 | 7-param | (-87, -98, -121, 0, 0, 0.81, -0.38) |
+| BD72 / Intl 1924 | 7-param | (-106.9, 52.3, -103.7, 0.34, -0.46, 1.84, -1.27) |
+| CH1903 / Bessel | 3-param | (674.4, 15.1, 405.3, 0, 0, 0, 0) |
+| D73 / Intl 1924 | 3-param | (-239.7, 88.2, 30.5, 0, 0, 0, 0) |
+| AGD66 / ANS | 3-param | (-133, -48, 148, 0, 0, 0, 0) |
+| Tokyo / Bessel | 3-param | (-146.4, 507.3, 680.5, 0, 0, 0, 0) |
+
+Grid-based accuracy is typically 0.01-0.1m; Helmert fallback accuracy is 1-5m depending on the datum.
+
+---
+
+## 4. Vertical Datum Support
+
+The module provides geoid undulation lookup from EGM96 (vendored, 15-arcminute global grid, 2.6MB) and optionally EGM2008 (25-arcminute, 77MB, downloaded on first use).
+
+### API
+
+```python
+from xrspatial.reproject import geoid_height, ellipsoidal_to_orthometric
+
+# Single point
+N = geoid_height(-74.0, 40.7)           # New York: -32.86m
+
+# Convert GPS height to map height
+H = ellipsoidal_to_orthometric(100.0, -74.0, 40.7)  # 132.86m
+
+# Batch (array)
+N = geoid_height(lon_array, lat_array)
+
+# Raster grid
+from xrspatial.reproject import geoid_height_raster
+N_grid = geoid_height_raster(dem)
+```
+
+### Accuracy vs pyproj geoid
+
+| Location | xrspatial EGM96 (m) | pyproj EGM96 (m) | Difference |
+|:---------|---------------------:|------------------:|-----------:|
+| New York (-74.0, 40.7) | -32.86 | -32.77 | 0.09 m |
+| Paris (2.35, 48.85) | 44.59 | 44.57 | 0.02 m |
+| Tokyo (139.7, 35.7) | 35.75 | 36.80 | 1.06 m |
+| Null Island (0.0, 0.0) | 17.15 | 17.16 | 0.02 m |
+| Rio (-43.2, -22.9) | -5.59 | -5.43 | 0.16 m |
+
+The 1.06m Tokyo difference is due to the 15-arcminute grid resolution in EGM96; the steep geoid gradient near Japan amplifies interpolation differences. Roundtrip accuracy (`ellipsoidal_to_orthometric` then `orthometric_to_ellipsoidal`) is exact (0.0 error).
+
+### Integration with reproject
+
+The `reproject` function accepts a `vertical_crs` parameter to apply vertical datum shifts during reprojection:
+
+```python
+from xrspatial.reproject import reproject
+
+# Reproject and convert ellipsoidal heights to orthometric (MSL)
+dem_merc = reproject(
+    dem, 'EPSG:3857',
+    src_vertical_crs='ellipsoidal',
+    tgt_vertical_crs='EGM96',
+)
+```
+
+---
+
+## 5. ITRF Frame Support
+
+Time-dependent transformations between International Terrestrial Reference Frames using 14-parameter Helmert transforms (7 static + 7 rates) from PROJ data files.
+
+### Available frames
+
+- ITRF2000
+- ITRF2008
+- ITRF2014
+- ITRF2020
+
+### Example
+
+```python
+from xrspatial.reproject import itrf_transform, itrf_frames
+
+print(itrf_frames())  # ['ITRF2000', 'ITRF2008', 'ITRF2014', 'ITRF2020']
+
+lon2, lat2, h2 = itrf_transform(
+    -74.0, 40.7, 10.0,
+    src='ITRF2014', tgt='ITRF2020', epoch=2024.0,
+)
+# -> (-73.9999999782, 40.6999999860, 9.996897)
+# Horizontal shift: 2.4 mm, vertical shift: -3.1 mm
+```
+
+### All frame-pair shifts (at epoch 2020.0, location 0E 45N)
+
+| Source | Target | Horizontal shift | Vertical shift |
+|:-------|:-------|:----------------:|:--------------:|
+| ITRF2000 | ITRF2008 | 33.0 mm | 32.8 mm |
+| ITRF2000 | ITRF2014 | 33.2 mm | 30.7 mm |
+| ITRF2000 | ITRF2020 | 30.5 mm | 30.0 mm |
+| ITRF2008 | ITRF2014 | 1.9 mm | -2.1 mm |
+| ITRF2008 | ITRF2020 | 2.6 mm | -2.8 mm |
+| ITRF2014 | ITRF2020 | 3.0 mm | -0.7 mm |
+
+Shifts between recent frames (ITRF2014/2020) are at the mm level. Older frames (ITRF2000) show larger shifts (~30mm) due to accumulated tectonic motion.
+
+---
+
+## 6. pyproj Usage
+
+The reproject module uses pyproj for metadata operations only. The heavy per-pixel work is done in Numba JIT or CUDA.
+
+### What pyproj does (runs once per reproject call)
+
+| Task | Cost | Description |
+|:-----|:-----|:------------|
+| CRS metadata parsing | ~1 ms | `CRS.from_user_input()`, `CRS.to_dict()`, extract projection parameters |
+| EPSG code lookup | ~0.1 ms | `CRS.to_epsg()` to check for known fast paths |
+| Output grid estimation | ~1 ms | `Transformer.transform()` on ~500 boundary points to determine output extent |
+| Fallback transform | per-pixel | Only used for CRS pairs without a built-in Numba/CUDA kernel |
+
+### What Numba/CUDA does (the per-pixel bottleneck)
+
+| Task | Implementation | Notes |
+|:-----|:---------------|:------|
+| Coordinate transforms | Numba `@njit(parallel=True)` / CUDA `@cuda.jit` | Per-pixel forward/inverse projection |
+| Bilinear resampling | Numba `@njit` / CUDA `@cuda.jit` | Source pixel interpolation |
+| Nearest-neighbor resampling | Numba `@njit` / CUDA `@cuda.jit` | Source pixel lookup |
+| Cubic resampling | `scipy.ndimage.map_coordinates` | CPU only (no Numba/CUDA kernel yet) |
+| Datum grid interpolation | Numba `@njit(parallel=True)` | Bilinear interp of NTv2/NADCON grids |
+| Geoid undulation interpolation | Numba `@njit(parallel=True)` | Bilinear interp of EGM96/EGM2008 grid |
+| 7-param Helmert datum shift | Numba `@njit(parallel=True)` | Geocentric ECEF transform |
+| 14-param ITRF transform | Numba `@njit(parallel=True)` | Time-dependent Helmert in ECEF |
diff --git a/xrspatial/geotiff/__init__.py b/xrspatial/geotiff/__init__.py
index 2940ba4c..f5807fc6 100644
--- a/xrspatial/geotiff/__init__.py
+++ b/xrspatial/geotiff/__init__.py
@@ -288,7 +288,7 @@ def _is_gpu_data(data) -> bool:
 def write_geotiff(data: xr.DataArray | np.ndarray, path: str, *,
                   crs: int | str | None = None,
                   nodata=None,
-                  compression: str = 'deflate',
+                  compression: str = 'zstd',
                   tiled: bool = True,
                   tile_size: int = 256,
                   predictor: bool = False,
@@ -379,9 +379,13 @@ def write_geotiff(data: xr.DataArray | np.ndarray, path: str, *,
         if geo_transform is None:
             geo_transform = _coords_to_transform(data)
         if epsg is None and crs is None:
-            epsg = data.attrs.get('crs')
+            crs_attr = data.attrs.get('crs')
+            if isinstance(crs_attr, str):
+                # WKT string from reproject() or other source
+                epsg = _wkt_to_epsg(crs_attr)
+            elif crs_attr is not None:
+                epsg = int(crs_attr)
             if epsg is None:
-                # Try resolving EPSG from a WKT string in attrs
                 wkt = data.attrs.get('crs_wkt')
                 if isinstance(wkt, str):
                     epsg = _wkt_to_epsg(wkt)
diff --git a/xrspatial/geotiff/_reader.py b/xrspatial/geotiff/_reader.py
index 8b15c544..338e7d06 100644
--- a/xrspatial/geotiff/_reader.py
+++ b/xrspatial/geotiff/_reader.py
@@ -476,6 +476,8 @@ def _read_tiles(data: bytes, ifd: IFD, header: TIFFHeader,
     band_count = samples if (planar == 2 and samples > 1) else 1
     tiles_per_band = tiles_across * tiles_down
 
+    # Build list of tiles to decode
+    tile_jobs = []
     for band_idx in range(band_count):
         band_tile_offset = band_idx * tiles_per_band if band_count > 1 else 0
         tile_samples = 1 if band_count > 1 else samples
@@ -485,37 +487,55 @@ def _read_tiles(data: bytes, ifd: IFD, header: TIFFHeader,
                 tile_idx = band_tile_offset + tr * tiles_across + tc
                 if tile_idx >= len(offsets):
                     continue
-
-                tile_data = data[offsets[tile_idx]:offsets[tile_idx] + byte_counts[tile_idx]]
-                tile_pixels = _decode_strip_or_tile(
-                    tile_data, compression, tw, th, tile_samples,
-                    bps, bytes_per_sample, is_sub_byte, dtype, pred,
-                    byte_order=header.byte_order)
-
-                tile_r0 = tr * th
-                tile_c0 = tc * tw
-
-                src_r0 = max(r0 - tile_r0, 0)
-                src_c0 = max(c0 - tile_c0, 0)
-                src_r1 = min(r1 - tile_r0, th)
-                src_c1 = min(c1 - tile_c0, tw)
-
-                dst_r0 = max(tile_r0 - r0, 0)
-                dst_c0 = max(tile_c0 - c0, 0)
-
-                actual_tile_h = min(th, height - tile_r0)
-                actual_tile_w = min(tw, width - tile_c0)
-                src_r1 = min(src_r1, actual_tile_h)
-                src_c1 = min(src_c1, actual_tile_w)
-                dst_r1 = dst_r0 + (src_r1 - src_r0)
-                dst_c1 = dst_c0 + (src_c1 - src_c0)
-
-                if dst_r1 > dst_r0 and dst_c1 > dst_c0:
-                    src_slice = tile_pixels[src_r0:src_r1, src_c0:src_c1]
-                    if band_count > 1:
-                        result[dst_r0:dst_r1, dst_c0:dst_c1, band_idx] = src_slice
-                    else:
-                        result[dst_r0:dst_r1, dst_c0:dst_c1] = src_slice
+                tile_jobs.append((band_idx, tr, tc, tile_idx, tile_samples))
+
+    # Decode tiles -- parallel for compressed, sequential for uncompressed
+    n_tiles = len(tile_jobs)
+    use_parallel = (compression != 1 and n_tiles > 4)  # 1 = COMPRESSION_NONE
+
+    def _decode_one(job):
+        band_idx, tr, tc, tile_idx, tile_samples = job
+        tile_data = data[offsets[tile_idx]:offsets[tile_idx] + byte_counts[tile_idx]]
+        return _decode_strip_or_tile(
+            tile_data, compression, tw, th, tile_samples,
+            bps, bytes_per_sample, is_sub_byte, dtype, pred,
+            byte_order=header.byte_order)
+
+    if use_parallel:
+        from concurrent.futures import ThreadPoolExecutor
+        import os as _os
+        n_workers = min(n_tiles, _os.cpu_count() or 4)
+        with ThreadPoolExecutor(max_workers=n_workers) as pool:
+            decoded = list(pool.map(_decode_one, tile_jobs))
+    else:
+        decoded = [_decode_one(job) for job in tile_jobs]
+
+    # Place decoded tiles into the output array
+    for (band_idx, tr, tc, tile_idx, tile_samples), tile_pixels in zip(tile_jobs, decoded):
+        tile_r0 = tr * th
+        tile_c0 = tc * tw
+
+        src_r0 = max(r0 - tile_r0, 0)
+        src_c0 = max(c0 - tile_c0, 0)
+        src_r1 = min(r1 - tile_r0, th)
+        src_c1 = min(c1 - tile_c0, tw)
+
+        dst_r0 = max(tile_r0 - r0, 0)
+        dst_c0 = max(tile_c0 - c0, 0)
+
+        actual_tile_h = min(th, height - tile_r0)
+        actual_tile_w = min(tw, width - tile_c0)
+        src_r1 = min(src_r1, actual_tile_h)
+        src_c1 = min(src_c1, actual_tile_w)
+        dst_r1 = dst_r0 + (src_r1 - src_r0)
+        dst_c1 = dst_c0 + (src_c1 - src_c0)
+
+        if dst_r1 > dst_r0 and dst_c1 > dst_c0:
+            src_slice = tile_pixels[src_r0:src_r1, src_c0:src_c1]
+            if band_count > 1:
+                result[dst_r0:dst_r1, dst_c0:dst_c1, band_idx] = src_slice
+            else:
+                result[dst_r0:dst_r1, dst_c0:dst_c1] = src_slice
 
     return result
 
diff --git a/xrspatial/geotiff/_writer.py b/xrspatial/geotiff/_writer.py
index ae7658ab..6e29dcbf 100644
--- a/xrspatial/geotiff/_writer.py
+++ b/xrspatial/geotiff/_writer.py
@@ -332,9 +332,49 @@ def _write_stripped(data: np.ndarray, compression: int, predictor: bool,
 # Tile writer
 # ---------------------------------------------------------------------------
 
+def _prepare_tile(data, tr, tc, th, tw, height, width, samples, dtype,
+                  bytes_per_sample, predictor, compression):
+    """Extract, pad, and compress a single tile.  Thread-safe."""
+    r0 = tr * th
+    c0 = tc * tw
+    r1 = min(r0 + th, height)
+    c1 = min(c0 + tw, width)
+    actual_h = r1 - r0
+    actual_w = c1 - c0
+
+    tile_slice = data[r0:r1, c0:c1]
+
+    if actual_h < th or actual_w < tw:
+        if data.ndim == 3:
+            padded = np.empty((th, tw, samples), dtype=dtype)
+        else:
+            padded = np.empty((th, tw), dtype=dtype)
+        padded[:actual_h, :actual_w] = tile_slice
+        if actual_h < th:
+            padded[actual_h:, :] = 0
+        if actual_w < tw:
+            padded[:actual_h, actual_w:] = 0
+        tile_arr = padded
+    else:
+        tile_arr = np.ascontiguousarray(tile_slice)
+
+    if predictor and compression != COMPRESSION_NONE:
+        buf = tile_arr.view(np.uint8).ravel().copy()
+        buf = predictor_encode(buf, tw, th, bytes_per_sample * samples)
+        tile_data = buf.tobytes()
+    else:
+        tile_data = tile_arr.tobytes()
+
+    return compress(tile_data, compression)
+
+
 def _write_tiled(data: np.ndarray, compression: int, predictor: bool,
                  tile_size: int = 256) -> tuple[list, list, list]:
-    """Compress data as tiles.
+    """Compress data as tiles, using parallel compression.
+
+    For compressed formats (deflate, lzw, zstd), tiles are compressed
+    in parallel using a thread pool.  zlib, zstandard, and our Numba
+    LZW all release the GIL.
 
     Returns
     -------
@@ -350,55 +390,92 @@ def _write_tiled(data: np.ndarray, compression: int, predictor: bool,
     th = tile_size
     tiles_across = math.ceil(width / tw)
     tiles_down = math.ceil(height / th)
-
-    tiles = []
-    rel_offsets = []
-    byte_counts = []
-    current_offset = 0
-
-    for tr in range(tiles_down):
-        for tc in range(tiles_across):
-            r0 = tr * th
-            c0 = tc * tw
-            r1 = min(r0 + th, height)
-            c1 = min(c0 + tw, width)
-
-            actual_h = r1 - r0
-            actual_w = c1 - c0
-
-            # Extract tile, pad to full tile size if needed
-            tile_slice = data[r0:r1, c0:c1]
-
-            if actual_h < th or actual_w < tw:
-                if data.ndim == 3:
-                    padded = np.empty((th, tw, samples), dtype=dtype)
+    n_tiles = tiles_across * tiles_down
+
+    if compression == COMPRESSION_NONE:
+        # Uncompressed: pre-allocate a contiguous buffer for all tiles
+        # and copy tile data directly, avoiding per-tile Python overhead.
+        tile_bytes = tw * th * bytes_per_sample * samples
+        total_buf = bytearray(n_tiles * tile_bytes)
+        mv = memoryview(total_buf)
+        tiles = []
+        rel_offsets = []
+        byte_counts = []
+        current_offset = 0
+
+        for tr in range(tiles_down):
+            for tc in range(tiles_across):
+                r0 = tr * th
+                c0 = tc * tw
+                r1 = min(r0 + th, height)
+                c1 = min(c0 + tw, width)
+                actual_h = r1 - r0
+                actual_w = c1 - c0
+
+                tile_slice = data[r0:r1, c0:c1]
+                if actual_h < th or actual_w < tw:
+                    if data.ndim == 3:
+                        padded = np.zeros((th, tw, samples), dtype=dtype)
+                    else:
+                        padded = np.zeros((th, tw), dtype=dtype)
+                    padded[:actual_h, :actual_w] = tile_slice
+                    tile_arr = padded
                 else:
-                    padded = np.empty((th, tw), dtype=dtype)
-                padded[:actual_h, :actual_w] = tile_slice
-                # Zero only the padding regions
-                if actual_h < th:
-                    padded[actual_h:, :] = 0
-                if actual_w < tw:
-                    padded[:actual_h, actual_w:] = 0
-                tile_arr = padded
-            else:
-                tile_arr = np.ascontiguousarray(tile_slice)
+                    tile_arr = np.ascontiguousarray(tile_slice)
+
+                chunk = tile_arr.tobytes()
+                rel_offsets.append(current_offset)
+                byte_counts.append(len(chunk))
+                tiles.append(chunk)
+                current_offset += len(chunk)
+
+        return rel_offsets, byte_counts, tiles
+
+    if n_tiles <= 4:
+        # Very few tiles: sequential (thread pool overhead not worth it)
+        tiles = []
+        rel_offsets = []
+        byte_counts = []
+        current_offset = 0
+        for tr in range(tiles_down):
+            for tc in range(tiles_across):
+                compressed = _prepare_tile(
+                    data, tr, tc, th, tw, height, width,
+                    samples, dtype, bytes_per_sample, predictor, compression,
+                )
+                rel_offsets.append(current_offset)
+                byte_counts.append(len(compressed))
+                tiles.append(compressed)
+                current_offset += len(compressed)
+        return rel_offsets, byte_counts, tiles
+
+    # Parallel tile compression -- zlib/zstd/LZW all release the GIL
+    from concurrent.futures import ThreadPoolExecutor
+    import os
 
-            if predictor and compression != COMPRESSION_NONE:
-                buf = tile_arr.view(np.uint8).ravel().copy()
-                buf = predictor_encode(buf, tw, th, bytes_per_sample * samples)
-                tile_data = buf.tobytes()
-            else:
-                tile_data = tile_arr.tobytes()
+    n_workers = min(n_tiles, os.cpu_count() or 4)
+    tile_indices = [(tr, tc) for tr in range(tiles_down)
+                    for tc in range(tiles_across)]
 
-            compressed = compress(tile_data, compression)
+    with ThreadPoolExecutor(max_workers=n_workers) as pool:
+        futures = [
+            pool.submit(
+                _prepare_tile, data, tr, tc, th, tw, height, width,
+                samples, dtype, bytes_per_sample, predictor, compression,
+            )
+            for tr, tc in tile_indices
+        ]
+        compressed_tiles = [f.result() for f in futures]
 
-            rel_offsets.append(current_offset)
-            byte_counts.append(len(compressed))
-            tiles.append(compressed)
-            current_offset += len(compressed)
+    rel_offsets = []
+    byte_counts = []
+    current_offset = 0
+    for ct in compressed_tiles:
+        rel_offsets.append(current_offset)
+        byte_counts.append(len(ct))
+        current_offset += len(ct)
 
-    return rel_offsets, byte_counts, tiles
+    return rel_offsets, byte_counts, compressed_tiles
 
 
 # ---------------------------------------------------------------------------
@@ -736,7 +813,7 @@ def write(data: np.ndarray, path: str, *,
           geo_transform: GeoTransform | None = None,
           crs_epsg: int | None = None,
           nodata=None,
-          compression: str = 'deflate',
+          compression: str = 'zstd',
           tiled: bool = True,
           tile_size: int = 256,
           predictor: bool = False,
diff --git a/xrspatial/reproject/__init__.py b/xrspatial/reproject/__init__.py
index c1bc327f..56ac7ad3 100644
--- a/xrspatial/reproject/__init__.py
+++ b/xrspatial/reproject/__init__.py
@@ -19,21 +19,64 @@
     _compute_output_grid,
     _make_output_coords,
 )
-from ._interpolate import _resample_cupy, _resample_numpy, _validate_resampling
+from ._interpolate import (
+    _resample_cupy,
+    _resample_cupy_native,
+    _resample_numpy,
+    _validate_resampling,
+)
 from ._merge import _merge_arrays_cupy, _merge_arrays_numpy, _validate_strategy
 from ._transform import ApproximateTransform
 
-__all__ = ['reproject', 'merge']
+from ._vertical import (
+    geoid_height,
+    geoid_height_raster,
+    ellipsoidal_to_orthometric,
+    orthometric_to_ellipsoidal,
+    depth_to_ellipsoidal,
+    ellipsoidal_to_depth,
+)
+from ._itrf import itrf_transform, list_frames as itrf_frames
+
+__all__ = [
+    'reproject', 'merge',
+    'geoid_height', 'geoid_height_raster',
+    'ellipsoidal_to_orthometric', 'orthometric_to_ellipsoidal',
+    'depth_to_ellipsoidal', 'ellipsoidal_to_depth',
+    'itrf_transform', 'itrf_frames',
+]
 
 
 # ---------------------------------------------------------------------------
 # Source geometry helpers
 # ---------------------------------------------------------------------------
 
+_Y_NAMES = {'y', 'lat', 'latitude', 'Y', 'Lat', 'Latitude'}
+_X_NAMES = {'x', 'lon', 'longitude', 'X', 'Lon', 'Longitude'}
+
+
+def _find_spatial_dims(raster):
+    """Find the y and x dimension names, handling multi-band rasters.
+
+    Returns (ydim, xdim).  Checks dim names first, falls back to
+    assuming the last two non-band dims are spatial.
+    """
+    dims = raster.dims
+    ydim = xdim = None
+    for d in dims:
+        if d in _Y_NAMES:
+            ydim = d
+        elif d in _X_NAMES:
+            xdim = d
+    if ydim is not None and xdim is not None:
+        return ydim, xdim
+    # Fallback: last two dims
+    return dims[-2], dims[-1]
+
+
 def _source_bounds(raster):
     """Extract (left, bottom, right, top) from a DataArray's coordinates."""
-    ydim = raster.dims[-2]
-    xdim = raster.dims[-1]
+    ydim, xdim = _find_spatial_dims(raster)
     y = raster.coords[ydim].values
     x = raster.coords[xdim].values
     # Compute pixel-edge bounds from pixel-center coords
@@ -56,13 +99,82 @@ def _source_bounds(raster):
 
 def _is_y_descending(raster):
     """Check if Y axis goes from top (large) to bottom (small)."""
-    ydim = raster.dims[-2]
+    ydim, _ = _find_spatial_dims(raster)
     y = raster.coords[ydim].values
     if len(y) < 2:
         return True
     return float(y[0]) > float(y[-1])
 
 
+# ---------------------------------------------------------------------------
+# Per-chunk coordinate transform
+# ---------------------------------------------------------------------------
+
+def _transform_coords(transformer, chunk_bounds, chunk_shape,
+                      transform_precision, src_crs=None, tgt_crs=None):
+    """Compute source CRS coordinates for every output pixel.
+
+    When *transform_precision* is 0, every pixel is transformed through
+    pyproj exactly (same strategy as GDAL/rasterio).  Otherwise an
+    approximate bilinear control-grid interpolation is used.
+
+    For common CRS pairs (WGS84/NAD83 <-> UTM, WGS84 <-> Web Mercator),
+    a Numba JIT fast path bypasses pyproj entirely for ~30x speedup.
+
+    Returns
+    -------
+    src_y, src_x : ndarray (height, width)
+    """
+    # Try Numba fast path for common projections
+    if src_crs is not None and tgt_crs is not None:
+        try:
+            from ._projections import try_numba_transform
+            result = try_numba_transform(
+                src_crs, tgt_crs, chunk_bounds, chunk_shape,
+            )
+            if result is not None:
+                return result
+        except (ImportError, ModuleNotFoundError):
+            pass  # fall through to pyproj
+
+    height, width = chunk_shape
+    left, bottom, right, top = chunk_bounds
+    res_x = (right - left) / width
+    res_y = (top - bottom) / height
+
+    if transform_precision == 0:
+        # Exact per-pixel transform via pyproj bulk API.
+        # Process in row strips to keep memory bounded and improve
+        # cache locality for large rasters.
+        out_x_1d = left + (np.arange(width, dtype=np.float64) + 0.5) * res_x
+        src_x_out = np.empty((height, width), dtype=np.float64)
+        src_y_out = np.empty((height, width), dtype=np.float64)
+        strip = 256
+        for r0 in range(0, height, strip):
+            r1 = min(r0 + strip, height)
+            n_rows = r1 - r0
+            out_y_strip = top - (np.arange(r0, r1, dtype=np.float64) + 0.5) * res_y
+            # Broadcast to (n_rows, width) without allocating a full copy
+            sx, sy = transformer.transform(
+                np.tile(out_x_1d, n_rows),
+                np.repeat(out_y_strip, width),
+            )
+            src_x_out[r0:r1] = np.asarray(sx, dtype=np.float64).reshape(n_rows, width)
+            src_y_out[r0:r1] = np.asarray(sy, dtype=np.float64).reshape(n_rows, width)
+        return src_y_out, src_x_out
+
+    # Approximate: bilinear interpolation on a coarse control grid.
+    approx = ApproximateTransform(
+        transformer, chunk_bounds, chunk_shape,
+        precision=transform_precision,
+    )
+    row_grid = np.arange(height, dtype=np.float64)[:, np.newaxis]
+    col_grid = np.arange(width, dtype=np.float64)[np.newaxis, :]
+    row_grid = np.broadcast_to(row_grid, (height, width))
+    col_grid = np.broadcast_to(col_grid, (height, width))
+    return approx(row_grid, col_grid)
+
+
 # ---------------------------------------------------------------------------
 # Per-chunk worker functions
 # ---------------------------------------------------------------------------
@@ -84,25 +196,27 @@ def _reproject_chunk_numpy(
     src_crs = pyproj.CRS.from_wkt(src_wkt)
     tgt_crs = pyproj.CRS.from_wkt(tgt_wkt)
 
-    # Build inverse transformer: target -> source
-    transformer = pyproj.Transformer.from_crs(
-        tgt_crs, src_crs, always_xy=True
-    )
-
-    height, width = chunk_shape
-    approx = ApproximateTransform(
-        transformer, chunk_bounds_tuple, chunk_shape,
-        precision=transform_precision,
-    )
-
-    # All output pixel positions (broadcast 1-D arrays to avoid HxW meshgrid)
-    row_grid = np.arange(height, dtype=np.float64)[:, np.newaxis]
-    col_grid = np.arange(width, dtype=np.float64)[np.newaxis, :]
-    row_grid = np.broadcast_to(row_grid, (height, width))
-    col_grid = np.broadcast_to(col_grid, (height, width))
+    # Try Numba fast path first (avoids creating pyproj Transformer)
+    numba_result = None
+    try:
+        from ._projections import try_numba_transform
+        numba_result = try_numba_transform(
+            src_crs, tgt_crs, chunk_bounds_tuple, chunk_shape,
+        )
+    except (ImportError, ModuleNotFoundError):
+        pass
 
-    # Source CRS coordinates for each output pixel
-    src_y, src_x = approx(row_grid, col_grid)
+    if numba_result is not None:
+        src_y, src_x = numba_result
+    else:
+        # Fallback: create pyproj Transformer (expensive)
+        transformer = pyproj.Transformer.from_crs(
+            tgt_crs, src_crs, always_xy=True
+        )
+        src_y, src_x = _transform_coords(
+            transformer, chunk_bounds_tuple, chunk_shape, transform_precision,
+            src_crs=src_crs, tgt_crs=tgt_crs,
+        )
 
     # Convert source CRS coordinates to source pixel coordinates
     src_left, src_bottom, src_right, src_top = source_bounds_tuple
@@ -117,10 +231,20 @@ def _reproject_chunk_numpy(
         src_row_px = (src_y - src_bottom) / src_res_y - 0.5
 
     # Determine source window needed
-    r_min = int(np.floor(np.nanmin(src_row_px))) - 2
-    r_max = int(np.ceil(np.nanmax(src_row_px))) + 3
-    c_min = int(np.floor(np.nanmin(src_col_px))) - 2
-    c_max = int(np.ceil(np.nanmax(src_col_px))) + 3
+    r_min = np.nanmin(src_row_px)
+    r_max = np.nanmax(src_row_px)
+    c_min = np.nanmin(src_col_px)
+    c_max = np.nanmax(src_col_px)
+
+    if not np.isfinite(r_min) or not np.isfinite(r_max):
+        return np.full(chunk_shape, nodata, dtype=np.float64)
+    if not np.isfinite(c_min) or not np.isfinite(c_max):
+        return np.full(chunk_shape, nodata, dtype=np.float64)
+
+    r_min = int(np.floor(r_min)) - 2
+    r_max = int(np.ceil(r_max)) + 3
+    c_min = int(np.floor(c_min)) - 2
+    c_max = int(np.ceil(c_max)) + 3
 
     # Check overlap
     if r_min >= src_h or r_max <= 0 or c_min >= src_w or c_max <= 0:
@@ -136,19 +260,47 @@ def _reproject_chunk_numpy(
     window = source_data[r_min_clip:r_max_clip, c_min_clip:c_max_clip]
     if hasattr(window, 'compute'):
         window = window.compute()
-    window = np.asarray(window, dtype=np.float64)
+    window = np.asarray(window)
+    orig_dtype = window.dtype
+
+    # Adjust coordinates relative to window
+    local_row = src_row_px - r_min_clip
+    local_col = src_col_px - c_min_clip
+
+    # Multi-band: reproject each band separately, share coordinates
+    if window.ndim == 3:
+        n_bands = window.shape[2]
+        bands = []
+        for b in range(n_bands):
+            band_data = window[:, :, b].astype(np.float64)
+            if not np.isnan(nodata):
+                band_data = band_data.copy()
+                band_data[band_data == nodata] = np.nan
+            band_result = _resample_numpy(band_data, local_row, local_col,
+                                          resampling=resampling, nodata=nodata)
+            if np.issubdtype(orig_dtype, np.integer):
+                info = np.iinfo(orig_dtype)
+                band_result = np.clip(np.round(band_result), info.min, info.max).astype(orig_dtype)
+            bands.append(band_result)
+        return np.stack(bands, axis=-1)
+
+    # Single-band path
+    window = window.astype(np.float64)
 
     # Convert sentinel nodata to NaN so numba kernels can detect it
     if not np.isnan(nodata):
         window = window.copy()
         window[window == nodata] = np.nan
 
-    # Adjust coordinates relative to window
-    local_row = src_row_px - r_min_clip
-    local_col = src_col_px - c_min_clip
+    result = _resample_numpy(window, local_row, local_col,
+                             resampling=resampling, nodata=nodata)
 
-    return _resample_numpy(window, local_row, local_col,
-                           resampling=resampling, nodata=nodata)
+    # Clamp and cast back for integer source dtypes
+    if np.issubdtype(orig_dtype, np.integer):
+        info = np.iinfo(orig_dtype)
+        result = np.clip(np.round(result), info.min, info.max).astype(orig_dtype)
+
+    return result
 
 
 def _reproject_chunk_cupy(
@@ -170,35 +322,75 @@ def _reproject_chunk_cupy(
         tgt_crs, src_crs, always_xy=True
     )
 
-    height, width = chunk_shape
-    approx = ApproximateTransform(
-        transformer, chunk_bounds_tuple, chunk_shape,
-        precision=transform_precision,
-    )
-
-    row_grid = np.arange(height, dtype=np.float64)[:, np.newaxis]
-    col_grid = np.arange(width, dtype=np.float64)[np.newaxis, :]
-    row_grid = np.broadcast_to(row_grid, (height, width))
-    col_grid = np.broadcast_to(col_grid, (height, width))
-
-    # Control grid is on CPU
-    src_y, src_x = approx(row_grid, col_grid)
-
-    src_left, src_bottom, src_right, src_top = source_bounds_tuple
-    src_h, src_w = source_shape
-    src_res_x = (src_right - src_left) / src_w
-    src_res_y = (src_top - src_bottom) / src_h
+    # Try CUDA transform first (keeps coordinates on-device)
+    cuda_result = None
+    if src_crs is not None and tgt_crs is not None:
+        try:
+            from ._projections_cuda import try_cuda_transform
+            cuda_result = try_cuda_transform(
+                src_crs, tgt_crs, chunk_bounds_tuple, chunk_shape,
+            )
+        except (ImportError, ModuleNotFoundError):
+            pass
 
-    src_col_px = (src_x - src_left) / src_res_x - 0.5
-    if source_y_desc:
-        src_row_px = (src_top - src_y) / src_res_y - 0.5
+    if cuda_result is not None:
+        src_y, src_x = cuda_result  # cupy arrays
+        src_left, src_bottom, src_right, src_top = source_bounds_tuple
+        src_h, src_w = source_shape
+        src_res_x = (src_right - src_left) / src_w
+        src_res_y = (src_top - src_bottom) / src_h
+        # Pixel coordinate math stays on GPU via cupy operators
+        src_col_px = (src_x - src_left) / src_res_x - 0.5
+        if source_y_desc:
+            src_row_px = (src_top - src_y) / src_res_y - 0.5
+        else:
+            src_row_px = (src_y - src_bottom) / src_res_y - 0.5
+        # Need min/max on CPU for window selection
+        r_min_val = float(cp.nanmin(src_row_px).get())
+        if not np.isfinite(r_min_val):
+            return cp.full(chunk_shape, nodata, dtype=cp.float64)
+        r_max_val = float(cp.nanmax(src_row_px).get())
+        c_min_val = float(cp.nanmin(src_col_px).get())
+        c_max_val = float(cp.nanmax(src_col_px).get())
+        if not np.isfinite(r_max_val) or not np.isfinite(c_min_val) or not np.isfinite(c_max_val):
+            return cp.full(chunk_shape, nodata, dtype=cp.float64)
+        r_min = int(np.floor(r_min_val)) - 2
+        r_max = int(np.ceil(r_max_val)) + 3
+        c_min = int(np.floor(c_min_val)) - 2
+        c_max = int(np.ceil(c_max_val)) + 3
+        # Keep coordinates as CuPy arrays for native CUDA resampling
+        _use_native_cuda = True
     else:
-        src_row_px = (src_y - src_bottom) / src_res_y - 0.5
+        # CPU fallback (Numba JIT or pyproj)
+        src_y, src_x = _transform_coords(
+            transformer, chunk_bounds_tuple, chunk_shape, transform_precision,
+            src_crs=src_crs, tgt_crs=tgt_crs,
+        )
 
-    r_min = int(np.floor(np.nanmin(src_row_px))) - 2
-    r_max = int(np.ceil(np.nanmax(src_row_px))) + 3
-    c_min = int(np.floor(np.nanmin(src_col_px))) - 2
-    c_max = int(np.ceil(np.nanmax(src_col_px))) + 3
+        src_left, src_bottom, src_right, src_top = source_bounds_tuple
+        src_h, src_w = source_shape
+        src_res_x = (src_right - src_left) / src_w
+        src_res_y = (src_top - src_bottom) / src_h
+
+        src_col_px = (src_x - src_left) / src_res_x - 0.5
+        if source_y_desc:
+            src_row_px = (src_top - src_y) / src_res_y - 0.5
+        else:
+            src_row_px = (src_y - src_bottom) / src_res_y - 0.5
+
+        r_min = np.nanmin(src_row_px)
+        r_max = np.nanmax(src_row_px)
+        c_min = np.nanmin(src_col_px)
+        c_max = np.nanmax(src_col_px)
+        if not np.isfinite(r_min) or not np.isfinite(r_max):
+            return cp.full(chunk_shape, nodata, dtype=cp.float64)
+        if not np.isfinite(c_min) or not np.isfinite(c_max):
+            return cp.full(chunk_shape, nodata, dtype=cp.float64)
+        r_min = int(np.floor(r_min)) - 2
+        r_max = int(np.ceil(r_max)) + 3
+        c_min = int(np.floor(c_min)) - 2
+        c_max = int(np.ceil(c_max)) + 3
+        _use_native_cuda = False
 
     if r_min >= src_h or r_max <= 0 or c_min >= src_w or c_max <= 0:
         return cp.full(chunk_shape, nodata, dtype=cp.float64)
@@ -215,14 +407,21 @@ def _reproject_chunk_cupy(
         window = cp.asarray(window)
     window = window.astype(cp.float64)
 
-    # Convert sentinel nodata to NaN
+    # Adjust coordinates relative to window (stays on GPU if CuPy)
+    local_row = src_row_px - r_min_clip
+    local_col = src_col_px - c_min_clip
+
+    if _use_native_cuda:
+        # Coordinates are already CuPy arrays -- use native CUDA kernels
+        # (nodata->NaN conversion is handled inside _resample_cupy_native)
+        return _resample_cupy_native(window, local_row, local_col,
+                                     resampling=resampling, nodata=nodata)
+
+    # CPU coordinates -- convert sentinel nodata to NaN before map_coordinates
     if not np.isnan(nodata):
         window = window.copy()
         window[window == nodata] = cp.nan
 
-    local_row = src_row_px - r_min_clip
-    local_col = src_col_px - c_min_clip
-
     return _resample_cupy(window, local_row, local_col,
                           resampling=resampling, nodata=nodata)
 
@@ -245,6 +444,8 @@ def reproject(
     transform_precision=16,
     chunk_size=None,
     name=None,
+    src_vertical_crs=None,
+    tgt_vertical_crs=None,
 ):
     """Reproject a raster DataArray to a new coordinate reference system.
 
@@ -271,15 +472,30 @@ def reproject(
     nodata : float or None
         Nodata value. Auto-detected if None.
     transform_precision : int
-        Coarse grid subdivisions for approximate transform (default 16).
+        Control-grid subdivisions for the coordinate transform (default 16).
+        Higher values increase accuracy at the cost of more pyproj calls.
+        Set to 0 for exact per-pixel transforms matching GDAL/rasterio.
     chunk_size : int or (int, int) or None
         Output chunk size for dask. Defaults to 512.
     name : str or None
         Name for the output DataArray.
+    src_vertical_crs : str or None
+        Source vertical datum for height values. One of:
+
+        - ``'EGM96'`` -- orthometric heights relative to EGM96 geoid (MSL)
+        - ``'EGM2008'`` -- orthometric heights relative to EGM2008 geoid
+        - ``'ellipsoidal'`` -- heights relative to the WGS84 ellipsoid
+        - ``None`` -- no vertical transformation (default)
+    tgt_vertical_crs : str or None
+        Target vertical datum. Same options as *src_vertical_crs*.
+        Both must be set to trigger a vertical transformation.
 
     Returns
     -------
     xr.DataArray
+        The output ``attrs['crs']`` is in WKT format.
+        If vertical transformation was applied, ``attrs['vertical_crs']``
+        records the target vertical datum.
     """
     from ._crs_utils import _require_pyproj
 
@@ -307,7 +523,8 @@ def reproject(
 
     # Source geometry
     src_bounds = _source_bounds(raster)
-    src_shape = (raster.sizes[raster.dims[-2]], raster.sizes[raster.dims[-1]])
+    _ydim, _xdim = _find_spatial_dims(raster)
+    src_shape = (raster.sizes[_ydim], raster.sizes[_xdim])
     y_desc = _is_y_descending(raster)
 
     # Compute output grid
@@ -336,7 +553,7 @@ def reproject(
         try:
             from ..utils import is_cupy_backed
             is_cupy = is_cupy_backed(raster)
-        except (ImportError, Exception):
+        except (ImportError, ModuleNotFoundError):
             pass
     else:
         is_cupy = is_cupy_array(data)
@@ -345,13 +562,21 @@ def reproject(
     src_wkt = src_crs.to_wkt()
     tgt_wkt = tgt_crs.to_wkt()
 
-    if is_dask:
+    if is_dask and is_cupy:
+        result_data = _reproject_dask_cupy(
+            raster, src_bounds, src_shape, y_desc,
+            src_wkt, tgt_wkt,
+            out_bounds, out_shape,
+            resampling, nd, transform_precision,
+            chunk_size,
+        )
+    elif is_dask:
         result_data = _reproject_dask(
             raster, src_bounds, src_shape, y_desc,
             src_wkt, tgt_wkt,
             out_bounds, out_shape,
             resampling, nd, transform_precision,
-            chunk_size, is_cupy,
+            chunk_size, False,
         )
     elif is_cupy:
         result_data = _reproject_inmemory_cupy(
@@ -368,21 +593,138 @@ def reproject(
             resampling, nd, transform_precision,
         )
 
-    ydim = raster.dims[-2]
-    xdim = raster.dims[-1]
+    # Vertical datum transformation (if requested)
+    if src_vertical_crs is not None and tgt_vertical_crs is not None:
+        if src_vertical_crs != tgt_vertical_crs:
+            result_data = _apply_vertical_shift(
+                result_data, y_coords, x_coords,
+                src_vertical_crs, tgt_vertical_crs, nd,
+                tgt_crs_wkt=tgt_wkt,
+            )
+
+    ydim, xdim = _find_spatial_dims(raster)
+    out_attrs = {
+        'crs': tgt_wkt,
+        'nodata': nd,
+    }
+    if tgt_vertical_crs is not None:
+        out_attrs['vertical_crs'] = tgt_vertical_crs
+
+    # Handle multi-band output (3D result from multi-band source)
+    if result_data.ndim == 3:
+        # Find the band dimension name from the source
+        band_dims = [d for d in raster.dims if d not in (ydim, xdim)]
+        band_dim = band_dims[0] if band_dims else 'band'
+        out_dims = [ydim, xdim, band_dim]
+        out_coords = {ydim: y_coords, xdim: x_coords}
+        if band_dim in raster.coords:
+            out_coords[band_dim] = raster.coords[band_dim]
+    else:
+        out_dims = [ydim, xdim]
+        out_coords = {ydim: y_coords, xdim: x_coords}
+
     result = xr.DataArray(
         result_data,
-        dims=[ydim, xdim],
-        coords={ydim: y_coords, xdim: x_coords},
+        dims=out_dims,
+        coords=out_coords,
         name=name or raster.name,
-        attrs={
-            'crs': tgt_wkt,
-            'nodata': nd,
-        },
+        attrs=out_attrs,
     )
     return result
 
 
+def _apply_vertical_shift(data, y_coords, x_coords,
+                          src_vcrs, tgt_vcrs, nodata,
+                          tgt_crs_wkt=None):
+    """Apply vertical datum shift to reprojected height values.
+
+    The geoid undulation grid is in geographic (lon/lat) coordinates.
+    If the output CRS is projected, coordinates are inverse-projected
+    to geographic before the geoid lookup.
+
+    Supported vertical CRS:
+    - 'EGM96', 'EGM2008': orthometric heights (above geoid/MSL)
+    - 'ellipsoidal': heights above WGS84 ellipsoid
+    """
+    from ._vertical import _load_geoid, _interp_geoid_2d
+
+    # Determine direction
+    geoid_models = []
+    signs = []
+
+    if src_vcrs in ('EGM96', 'EGM2008') and tgt_vcrs == 'ellipsoidal':
+        geoid_models.append(src_vcrs)
+        signs.append(1.0)  # H + N = h
+    elif src_vcrs == 'ellipsoidal' and tgt_vcrs in ('EGM96', 'EGM2008'):
+        geoid_models.append(tgt_vcrs)
+        signs.append(-1.0)  # h - N = H
+    elif src_vcrs in ('EGM96', 'EGM2008') and tgt_vcrs in ('EGM96', 'EGM2008'):
+        geoid_models.extend([src_vcrs, tgt_vcrs])
+        signs.extend([1.0, -1.0])  # H1 + N1 - N2
+    else:
+        return data
+
+    # Determine if we need inverse projection (output CRS is projected)
+    need_inverse = False
+    transformer = None
+    if tgt_crs_wkt is not None:
+        try:
+            from ._crs_utils import _require_pyproj
+            pyproj = _require_pyproj()
+            tgt_crs = pyproj.CRS.from_wkt(tgt_crs_wkt)
+            if not tgt_crs.is_geographic:
+                need_inverse = True
+                geo_crs = pyproj.CRS.from_epsg(4326)
+                transformer = pyproj.Transformer.from_crs(
+                    tgt_crs, geo_crs, always_xy=True
+                )
+        except Exception:
+            pass
+
+    x_arr = np.asarray(x_coords, dtype=np.float64)
+    y_arr = np.asarray(y_coords, dtype=np.float64)
+    out_h, out_w = data.shape[:2] if hasattr(data, 'shape') else (len(y_arr), len(x_arr))
+
+    # Load geoid grids once
+    geoids = []
+    for gm in geoid_models:
+        geoids.append(_load_geoid(gm))
+
+    # Process in row strips to bound memory (128 rows at a time)
+    result = data.copy() if hasattr(data, 'copy') else np.array(data)
+    is_nan_nodata = np.isnan(nodata) if isinstance(nodata, float) else False
+    strip = 128
+
+    for r0 in range(0, out_h, strip):
+        r1 = min(r0 + strip, out_h)
+        n_rows = r1 - r0
+
+        # Build strip coordinate grid
+        xx_strip = np.tile(x_arr, n_rows).reshape(n_rows, out_w)
+        yy_strip = np.repeat(y_arr[r0:r1], out_w).reshape(n_rows, out_w)
+
+        # Inverse project if needed
+        if need_inverse and transformer is not None:
+            lon_s, lat_s = transformer.transform(xx_strip.ravel(), yy_strip.ravel())
+            xx_strip = np.asarray(lon_s, dtype=np.float64).reshape(n_rows, out_w)
+            yy_strip = np.asarray(lat_s, dtype=np.float64).reshape(n_rows, out_w)
+
+        # Apply each geoid shift
+        strip_data = result[r0:r1]
+        if is_nan_nodata:
+            is_valid = np.isfinite(strip_data)
+        else:
+            is_valid = strip_data != nodata
+
+        for (grid_data, g_left, g_top, g_rx, g_ry, g_h, g_w), sign in zip(geoids, signs):
+            N_strip = np.empty((n_rows, out_w), dtype=np.float64)
+            _interp_geoid_2d(xx_strip, yy_strip, N_strip,
+                             grid_data, g_left, g_top, g_rx, g_ry, g_h, g_w)
+            strip_data[is_valid] += sign * N_strip[is_valid]
+
+    return result
+
+
 def _reproject_inmemory_numpy(
     raster, src_bounds, src_shape, y_desc,
     src_wkt, tgt_wkt,
@@ -415,6 +757,165 @@ def _reproject_inmemory_cupy(
     )
 
 
+def _reproject_dask_cupy(
+    raster, src_bounds, src_shape, y_desc,
+    src_wkt, tgt_wkt,
+    out_bounds, out_shape,
+    resampling, nodata, precision,
+    chunk_size,
+):
+    """Dask+CuPy backend: process output chunks on GPU sequentially.
+
+    Instead of dask.delayed per chunk (which has ~15ms overhead each from
+    pyproj init + small CUDA launches), we:
+    1. Create CRS/transformer objects once
+    2. Use GPU-sized output chunks (2048x2048 by default)
+    3. For each output chunk, compute CUDA coordinates and fetch only
+       the source window needed from the dask array
+    4. Assemble the result as a CuPy array
+
+    For sources that fit in GPU memory, this is ~22x faster than the
+    dask.delayed path.  For sources that don't fit, each chunk fetches
+    only its required window, so GPU memory usage scales with chunk size,
+    not source size.
+    """
+    import cupy as cp
+
+    from ._crs_utils import _require_pyproj
+
+    pyproj = _require_pyproj()
+    src_crs = pyproj.CRS.from_wkt(src_wkt)
+    tgt_crs = pyproj.CRS.from_wkt(tgt_wkt)
+
+    # Use larger chunks for GPU to amortize kernel launch overhead
+    gpu_chunk = chunk_size or 2048
+    if isinstance(gpu_chunk, int):
+        gpu_chunk = (gpu_chunk, gpu_chunk)
+
+    row_chunks, col_chunks = _compute_chunk_layout(out_shape, gpu_chunk)
+    out_h, out_w = out_shape
+    src_left, src_bottom, src_right, src_top = src_bounds
+    src_h, src_w = src_shape
+    src_res_x = (src_right - src_left) / src_w
+    src_res_y = (src_top - src_bottom) / src_h
+
+    result = cp.full(out_shape, nodata, dtype=cp.float64)
+
+    row_offset = 0
+    for i, rchunk in enumerate(row_chunks):
+        col_offset = 0
+        for j, cchunk in enumerate(col_chunks):
+            cb = _chunk_bounds(
+                out_bounds, out_shape,
+                row_offset, row_offset + rchunk,
+                col_offset, col_offset + cchunk,
+            )
+            chunk_shape = (rchunk, cchunk)
+
+            # CUDA coordinate transform (reuses cached CRS objects)
+            try:
+                from ._projections_cuda import try_cuda_transform
+                cuda_coords = try_cuda_transform(
+                    src_crs, tgt_crs, cb, chunk_shape,
+                )
+            except (ImportError, ModuleNotFoundError):
+                cuda_coords = None
+
+            if cuda_coords is not None:
+                src_y, src_x = cuda_coords
+                src_col_px = (src_x - src_left) / src_res_x - 0.5
+                if y_desc:
+                    src_row_px = (src_top - src_y) / src_res_y - 0.5
+                else:
+                    src_row_px = (src_y - src_bottom) / src_res_y - 0.5
+
+                r_min_val = float(cp.nanmin(src_row_px).get())
+                if not np.isfinite(r_min_val):
+                    col_offset += cchunk
+                    continue
+                r_max_val = float(cp.nanmax(src_row_px).get())
+                c_min_val = float(cp.nanmin(src_col_px).get())
+                c_max_val = float(cp.nanmax(src_col_px).get())
+                if not np.isfinite(r_max_val) or not np.isfinite(c_min_val) or not np.isfinite(c_max_val):
+                    col_offset += cchunk
+                    continue
+                r_min = int(np.floor(r_min_val)) - 2
+                r_max = int(np.ceil(r_max_val)) + 3
+                c_min = int(np.floor(c_min_val)) - 2
+                c_max = int(np.ceil(c_max_val)) + 3
+            else:
+                # CPU fallback for this chunk
+                transformer = pyproj.Transformer.from_crs(
+                    tgt_crs, src_crs, always_xy=True
+                )
+                src_y, src_x = _transform_coords(
+                    transformer, cb, chunk_shape, precision,
+                    src_crs=src_crs, tgt_crs=tgt_crs,
+                )
+                src_col_px = (src_x - src_left) / src_res_x - 0.5
+                if y_desc:
+                    src_row_px = (src_top - src_y) / src_res_y - 0.5
+                else:
+                    src_row_px = (src_y - src_bottom) / src_res_y - 0.5
+                r_min = np.nanmin(src_row_px)
+                r_max = np.nanmax(src_row_px)
+                c_min = np.nanmin(src_col_px)
+                c_max = np.nanmax(src_col_px)
+                if not np.isfinite(r_min) or not np.isfinite(r_max):
+                    col_offset += cchunk
+                    continue
+                if not np.isfinite(c_min) or not np.isfinite(c_max):
+                    col_offset += cchunk
+                    continue
+                r_min = int(np.floor(r_min)) - 2
+                r_max = int(np.ceil(r_max)) + 3
+                c_min = int(np.floor(c_min)) - 2
+                c_max = int(np.ceil(c_max)) + 3
+
+            # Check overlap
+            if r_min >= src_h or r_max <= 0 or c_min >= src_w or c_max <= 0:
+                col_offset += cchunk
+                continue
+
+            r_min_clip = max(0, r_min)
+            r_max_clip = min(src_h, r_max)
+            c_min_clip = max(0, c_min)
+            c_max_clip = min(src_w, c_max)
+
+            # Fetch only the needed source window from dask
+            window = raster.data[r_min_clip:r_max_clip, c_min_clip:c_max_clip]
+            if hasattr(window, 'compute'):
+                window = window.compute()
+            if not isinstance(window, cp.ndarray):
+                window = cp.asarray(window)
+            window = window.astype(cp.float64)
+
+            if not np.isnan(nodata):
+                window = window.copy()
+                window[window == nodata] = cp.nan
+
+            local_row = src_row_px - r_min_clip
+            local_col = src_col_px - c_min_clip
+
+            if cuda_coords is not None:
+                chunk_data = _resample_cupy_native(
+                    window, local_row, local_col,
+                    resampling=resampling, nodata=nodata,
+                )
+            else:
+                chunk_data = _resample_cupy(
+                    window, local_row, local_col,
+                    resampling=resampling, nodata=nodata,
+                )
+
+            result[row_offset:row_offset + rchunk,
+                   col_offset:col_offset + cchunk] = chunk_data
+            col_offset += cchunk
+        row_offset += rchunk
+
+    return result
+
+
 def _reproject_dask(
     raster, src_bounds, src_shape, y_desc,
     src_wkt, tgt_wkt,
@@ -422,7 +923,7 @@ def _reproject_dask(
     resampling, nodata, precision,
     chunk_size, is_cupy,
 ):
-    """Dask backend: build output as ``da.block`` of delayed chunks."""
+    """Dask+NumPy backend: build output as ``da.block`` of delayed chunks."""
     import dask
     import dask.array as da
 
@@ -617,21 +1118,103 @@ def merge(
     return result
 
 
+def _place_same_crs(src_data, src_bounds, src_shape, y_desc,
+                    out_bounds, out_shape, nodata):
+    """Place a same-CRS tile into the output grid by coordinate alignment.
+
+    No reprojection needed -- just index the output rows/columns that
+    overlap with the source tile and copy the data.
+    """
+    out_h, out_w = out_shape
+    src_h, src_w = src_shape
+    o_left, o_bottom, o_right, o_top = out_bounds
+    s_left, s_bottom, s_right, s_top = src_bounds
+
+    o_res_x = (o_right - o_left) / out_w
+    o_res_y = (o_top - o_bottom) / out_h
+    s_res_x = (s_right - s_left) / src_w
+    s_res_y = (s_top - s_bottom) / src_h
+
+    # Output pixel range that this tile covers
+    col_start = int(round((s_left - o_left) / o_res_x))
+    col_end = int(round((s_right - o_left) / o_res_x))
+    row_start = int(round((o_top - s_top) / o_res_y))
+    row_end = int(round((o_top - s_bottom) / o_res_y))
+
+    # Clip to output bounds
+    col_start_clip = max(0, col_start)
+    col_end_clip = min(out_w, col_end)
+    row_start_clip = max(0, row_start)
+    row_end_clip = min(out_h, row_end)
+
+    if col_start_clip >= col_end_clip or row_start_clip >= row_end_clip:
+        return np.full(out_shape, nodata, dtype=np.float64)
+
+    # Source pixel range (handle offset if tile extends beyond output)
+    src_col_start = col_start_clip - col_start
+    src_row_start = row_start_clip - row_start
+
+    # Resolutions may differ slightly; if close enough, do direct copy
+    res_ratio_x = s_res_x / o_res_x
+    res_ratio_y = s_res_y / o_res_y
+    if abs(res_ratio_x - 1.0) > 0.01 or abs(res_ratio_y - 1.0) > 0.01:
+        return None  # resolutions too different, fall back to reproject
+
+    out_data = np.full(out_shape, nodata, dtype=np.float64)
+    n_rows = row_end_clip - row_start_clip
+    n_cols = col_end_clip - col_start_clip
+
+    # Clamp source window
+    src_r_end = min(src_row_start + n_rows, src_h)
+    src_c_end = min(src_col_start + n_cols, src_w)
+    actual_rows = src_r_end - src_row_start
+    actual_cols = src_c_end - src_col_start
+
+    if actual_rows <= 0 or actual_cols <= 0:
+        return out_data
+
+    src_window = np.asarray(src_data[src_row_start:src_r_end,
+                                     src_col_start:src_c_end],
+                            dtype=np.float64)
+    out_data[row_start_clip:row_start_clip + actual_rows,
+             col_start_clip:col_start_clip + actual_cols] = src_window
+    return out_data
+
+
 def _merge_inmemory(
     raster_infos, tgt_wkt, out_bounds, out_shape,
     resampling, nodata, strategy,
 ):
-    """In-memory merge using numpy."""
+    """In-memory merge using numpy.
+
+    Detects same-CRS tiles and uses fast direct placement instead
+    of reprojection.
+    """
+    from ._crs_utils import _require_pyproj
+    pyproj = _require_pyproj()
+    tgt_crs = pyproj.CRS.from_wkt(tgt_wkt)
+
     arrays = []
     for info in raster_infos:
-        reprojected = _reproject_chunk_numpy(
-            info['raster'].values,
-            info['src_bounds'], info['src_shape'], info['y_desc'],
-            info['src_wkt'], tgt_wkt,
-            out_bounds, out_shape,
-            resampling, nodata, 16,
-        )
-        arrays.append(reprojected)
+        # Check if source CRS matches target (no reprojection needed)
+        placed = None
+        if info['src_crs'] == tgt_crs:
+            placed = _place_same_crs(
+                info['raster'].values,
+                info['src_bounds'], info['src_shape'], info['y_desc'],
+                out_bounds, out_shape, nodata,
+            )
+        if placed is not None:
+            arrays.append(placed)
+        else:
+            reprojected = _reproject_chunk_numpy(
+                info['raster'].values,
+                info['src_bounds'], info['src_shape'], info['y_desc'],
+                info['src_wkt'], tgt_wkt,
+                out_bounds, out_shape,
+                resampling, nodata, 16,
+            )
+            arrays.append(reprojected)
     return _merge_arrays_numpy(arrays, nodata, strategy)
 
 
diff --git a/xrspatial/reproject/_datum_grids.py b/xrspatial/reproject/_datum_grids.py
new file mode 100644
index 00000000..dc44af18
--- /dev/null
+++ b/xrspatial/reproject/_datum_grids.py
@@ -0,0 +1,374 @@
+"""Datum shift grid loading and interpolation.
+
+Downloads horizontal offset grids from the PROJ CDN, caches them locally,
+and provides Numba JIT bilinear interpolation for per-pixel datum shifts.
+
+Grid format: GeoTIFF with 2+ bands:
+  Band 1: latitude offset (arc-seconds)
+  Band 2: longitude offset (arc-seconds)
+"""
+from __future__ import annotations
+
+import math
+import os
+import threading
+import urllib.request
+
+import numpy as np
+from numba import njit, prange
+
+_PROJ_CDN = "https://cdn.proj.org"
+
+# Vendored grid directory (shipped with the package)
+_VENDORED_DIR = os.path.join(os.path.dirname(__file__), 'grids')
+
+# Grid registry: key -> (filename, coverage bounds, description, cdn_url)
+# Bounds are (lon_min, lat_min, lon_max, lat_max).
+GRID_REGISTRY = {
+    # --- NAD27 -> NAD83 (US + territories) ---
+    'NAD27_CONUS': (
+        'us_noaa_conus.tif',
+        (-131, 20, -63, 50),
+        'NAD27->NAD83 CONUS (NADCON)',
+        f'{_PROJ_CDN}/us_noaa_conus.tif',
+    ),
+    'NAD27_NADCON5_CONUS': (
+        'us_noaa_nadcon5_nad27_nad83_1986_conus.tif',
+        (-125, 24, -66, 50),
+        'NAD27->NAD83 CONUS (NADCON5)',
+        f'{_PROJ_CDN}/us_noaa_nadcon5_nad27_nad83_1986_conus.tif',
+    ),
+    'NAD27_ALASKA': (
+        'us_noaa_alaska.tif',
+        (-194, 50, -128, 72),
+        'NAD27->NAD83 Alaska (NADCON)',
+        f'{_PROJ_CDN}/us_noaa_alaska.tif',
+    ),
+    'NAD27_HAWAII': (
+        'us_noaa_hawaii.tif',
+        (-164, 17, -154, 23),
+        'Old Hawaiian->NAD83 (NADCON)',
+        f'{_PROJ_CDN}/us_noaa_hawaii.tif',
+    ),
+    'NAD27_PRVI': (
+        'us_noaa_prvi.tif',
+        (-68, 17, -64, 19),
+        'NAD27->NAD83 Puerto Rico/Virgin Islands',
+        f'{_PROJ_CDN}/us_noaa_prvi.tif',
+    ),
+    # --- OSGB36 -> ETRS89 (UK) ---
+    'OSGB36_UK': (
+        'uk_os_OSTN15_NTv2_OSGBtoETRS.tif',
+        (-9, 49, 3, 61),
+        'OSGB36->ETRS89 (Ordnance Survey OSTN15)',
+        f'{_PROJ_CDN}/uk_os_OSTN15_NTv2_OSGBtoETRS.tif',
+    ),
+    # --- Australia (parent grid covers NT region only) ---
+    'AGD66_GDA94': (
+        'au_icsm_A66_National_13_09_01.tif',
+        (104, -14, 129, -10),
+        'AGD66->GDA94 (Australia, NT region)',
+        f'{_PROJ_CDN}/au_icsm_A66_National_13_09_01.tif',
+    ),
+    # --- Europe ---
+    'DHDN_ETRS89_DE': (
+        'de_adv_BETA2007.tif',
+        (5, 47, 16, 56),
+        'DHDN->ETRS89 (Germany)',
+        f'{_PROJ_CDN}/de_adv_BETA2007.tif',
+    ),
+    'MGI_ETRS89_AT': (
+        'at_bev_AT_GIS_GRID.tif',
+        (9, 46, 18, 50),
+        'MGI->ETRS89 (Austria)',
+        f'{_PROJ_CDN}/at_bev_AT_GIS_GRID.tif',
+    ),
+    'ED50_ETRS89_ES': (
+        'es_ign_SPED2ETV2.tif',
+        (1, 38, 5, 41),
+        'ED50->ETRS89 (Spain, eastern coast/Balearics)',
+        f'{_PROJ_CDN}/es_ign_SPED2ETV2.tif',
+    ),
+    'RD_ETRS89_NL': (
+        'nl_nsgi_rdcorr2018.tif',
+        (2, 50, 8, 56),
+        'RD->ETRS89 (Netherlands)',
+        f'{_PROJ_CDN}/nl_nsgi_rdcorr2018.tif',
+    ),
+    'BD72_ETRS89_BE': (
+        'be_ign_bd72lb72_etrs89lb08.tif',
+        (2, 49, 7, 52),
+        'BD72->ETRS89 (Belgium)',
+        f'{_PROJ_CDN}/be_ign_bd72lb72_etrs89lb08.tif',
+    ),
+    'CH1903_ETRS89_CH': (
+        'ch_swisstopo_CHENyx06_ETRS.tif',
+        (5, 45, 11, 48),
+        'CH1903->ETRS89 (Switzerland)',
+        f'{_PROJ_CDN}/ch_swisstopo_CHENyx06_ETRS.tif',
+    ),
+    'D73_ETRS89_PT': (
+        'pt_dgt_D73_ETRS89_geo.tif',
+        (-10, 36, -6, 43),
+        'D73->ETRS89 (Portugal)',
+        f'{_PROJ_CDN}/pt_dgt_D73_ETRS89_geo.tif',
+    ),
+}
+
+# Cache directory for grids not vendored
+_CACHE_DIR = os.path.join(os.path.expanduser('~'), '.cache', 'xrspatial', 'proj_grids')
+
+
+def _ensure_cache_dir():
+    os.makedirs(_CACHE_DIR, exist_ok=True)
+
+
+def _find_grid_file(filename, cdn_url=None):
+    """Find a grid file: check vendored dir first, then cache, then download."""
+    # 1. Vendored (shipped with package)
+    vendored = os.path.join(_VENDORED_DIR, filename)
+    if os.path.exists(vendored):
+        return vendored
+
+    # 2. User cache
+    cached = os.path.join(_CACHE_DIR, filename)
+    if os.path.exists(cached):
+        return cached
+
+    # 3. Download from CDN
+    if cdn_url:
+        _ensure_cache_dir()
+        urllib.request.urlretrieve(cdn_url, cached)
+        return cached
+
+    return None
+
+
+def load_grid(grid_key):
+    """Load a datum shift grid by registry key.
+
+    Returns (dlat, dlon, bounds, resolution) where:
+    - dlat, dlon: numpy float64 arrays (arc-seconds), shape (H, W)
+    - bounds: (left, bottom, right, top) in degrees
+    - resolution: (res_lon, res_lat) in degrees
+    """
+    if grid_key not in GRID_REGISTRY:
+        return None
+
+    filename, _, _, cdn_url = GRID_REGISTRY[grid_key]
+    path = _find_grid_file(filename, cdn_url)
+    if path is None:
+        return None
+
+    # Read with rasterio for correct multi-band handling
+    try:
+        import rasterio
+        with rasterio.open(path) as ds:
+            dlat = ds.read(1).astype(np.float64)  # arc-seconds
+            dlon = ds.read(2).astype(np.float64)  # arc-seconds
+            b = ds.bounds
+            bounds = (b.left, b.bottom, b.right, b.top)
+            h, w = ds.height, ds.width
+            # Validate grid shape and bounds
+            if dlat.shape != dlon.shape:
+                return None
+            if h < 2 or w < 2:
+                return None
+            if b.left >= b.right or b.bottom >= b.top:
+                return None
+            # Compute resolution from bounds and shape (avoids ds.res ordering ambiguity)
+            res_x = (b.right - b.left) / w if w > 1 else 0.25
+            res_y = (b.top - b.bottom) / h if h > 1 else 0.25
+            return dlat, dlon, bounds, (res_x, res_y)
+    except ImportError:
+        pass
+
+    # Fallback: read with our own reader (may need band axis handling)
+    from xrspatial.geotiff import read_geotiff
+    da = read_geotiff(path)
+    data = da.values
+    if data.ndim == 3:
+        # (H, W, bands) or (bands, H, W)
+        if data.shape[2] == 2:
+            dlat = data[:, :, 0].astype(np.float64)
+            dlon = data[:, :, 1].astype(np.float64)
+        else:
+            dlat = data[0].astype(np.float64)
+            dlon = data[1].astype(np.float64)
+    else:
+        return None
+
+    # Validate grid shape and bounds
+    if dlat.shape != dlon.shape:
+        return None
+    if dlat.shape[0] < 2 or dlat.shape[1] < 2:
+        return None
+
+    y_coords = da.coords['y'].values
+    x_coords = da.coords['x'].values
+    bounds = (float(x_coords[0]), float(y_coords[-1]),
+              float(x_coords[-1]), float(y_coords[0]))
+    left, bottom, right, top = bounds
+    if left >= right or bottom >= top:
+        return None
+    res_x = abs(float(x_coords[1] - x_coords[0])) if len(x_coords) > 1 else 0.25
+    res_y = abs(float(y_coords[1] - y_coords[0])) if len(y_coords) > 1 else 0.25
+    return dlat, dlon, bounds, (res_x, res_y)
+
+
+# ---------------------------------------------------------------------------
+# Numba bilinear grid interpolation
+# ---------------------------------------------------------------------------
+
+@njit(nogil=True, cache=True)
+def _grid_interp_point(lon, lat, dlat_grid, dlon_grid,
+                       grid_left, grid_top, grid_res_x, grid_res_y,
+                       grid_h, grid_w):
+    """Bilinear interpolation of a single point in the shift grid.
+
+    Returns (dlat_arcsec, dlon_arcsec) or (0, 0) if outside the grid.
+    """
+    col_f = (lon - grid_left) / grid_res_x
+    row_f = (grid_top - lat) / grid_res_y
+
+    if col_f < 0 or col_f > grid_w - 1 or row_f < 0 or row_f > grid_h - 1:
+        return 0.0, 0.0
+
+    c0 = int(col_f)
+    r0 = int(row_f)
+    if c0 >= grid_w - 1:
+        c0 = grid_w - 2
+    if r0 >= grid_h - 1:
+        r0 = grid_h - 2
+
+    dc = col_f - c0
+    dr = row_f - r0
+
+    w00 = (1.0 - dr) * (1.0 - dc)
+    w01 = (1.0 - dr) * dc
+    w10 = dr * (1.0 - dc)
+    w11 = dr * dc
+
+    dlat = (dlat_grid[r0, c0] * w00 + dlat_grid[r0, c0 + 1] * w01 +
+            dlat_grid[r0 + 1, c0] * w10 + dlat_grid[r0 + 1, c0 + 1] * w11)
+    dlon = (dlon_grid[r0, c0] * w00 + dlon_grid[r0, c0 + 1] * w01 +
+            dlon_grid[r0 + 1, c0] * w10 + dlon_grid[r0 + 1, c0 + 1] * w11)
+
+    return dlat, dlon
+
+
+@njit(nogil=True, cache=True, parallel=True)
+def apply_grid_shift_forward(lon_arr, lat_arr, dlat_grid, dlon_grid,
+                             grid_left, grid_top, grid_res_x, grid_res_y,
+                             grid_h, grid_w):
+    """Apply grid-based datum shift: source -> target (add offsets)."""
+    for i in prange(lon_arr.shape[0]):
+        dlat, dlon = _grid_interp_point(
+            lon_arr[i], lat_arr[i], dlat_grid, dlon_grid,
+            grid_left, grid_top, grid_res_x, grid_res_y,
+            grid_h, grid_w,
+        )
+        lat_arr[i] += dlat / 3600.0  # arc-seconds to degrees
+        lon_arr[i] += dlon / 3600.0
+
+
+@njit(nogil=True, cache=True, parallel=True)
+def apply_grid_shift_inverse(lon_arr, lat_arr, dlat_grid, dlon_grid,
+                             grid_left, grid_top, grid_res_x, grid_res_y,
+                             grid_h, grid_w):
+    """Apply inverse grid-based datum shift: target -> source (subtract offsets).
+
+    Uses iterative approach: the grid is indexed by source coordinates,
+    but we have target coordinates.  One iteration is usually sufficient
+    since the shifts are small relative to the grid spacing.
+    """
+    for i in prange(lon_arr.shape[0]):
+        # Initial estimate: subtract the shift at the target coords
+        dlat, dlon = _grid_interp_point(
+            lon_arr[i], lat_arr[i], dlat_grid, dlon_grid,
+            grid_left, grid_top, grid_res_x, grid_res_y,
+            grid_h, grid_w,
+        )
+        lon_est = lon_arr[i] - dlon / 3600.0
+        lat_est = lat_arr[i] - dlat / 3600.0
+
+        # Refine: re-interpolate at the estimated source coords
+        dlat2, dlon2 = _grid_interp_point(
+            lon_est, lat_est, dlat_grid, dlon_grid,
+            grid_left, grid_top, grid_res_x, grid_res_y,
+            grid_h, grid_w,
+        )
+        lon_arr[i] -= dlon2 / 3600.0
+        lat_arr[i] -= dlat2 / 3600.0
+
+
+# ---------------------------------------------------------------------------
+# Grid cache (loaded grids, keyed by grid_key)
+# ---------------------------------------------------------------------------
+
+_loaded_grids = {}  # cleared on module reload
+_loaded_grids_lock = threading.Lock()
+
+
+def get_grid(grid_key):
+    """Get a loaded grid, downloading if necessary.
+
+    Returns (dlat, dlon, left, top, res_x, res_y, h, w) or None.
+    """
+    with _loaded_grids_lock:
+        if grid_key in _loaded_grids:
+            return _loaded_grids[grid_key]
+
+    result = load_grid(grid_key)
+
+    with _loaded_grids_lock:
+        if result is None:
+            _loaded_grids[grid_key] = None
+            return None
+
+        dlat, dlon, bounds, (res_x, res_y) = result
+        h, w = dlat.shape
+        # Ensure contiguous float64 for Numba
+        dlat = np.ascontiguousarray(dlat, dtype=np.float64)
+        dlon = np.ascontiguousarray(dlon, dtype=np.float64)
+        entry = (dlat, dlon, bounds[0], bounds[3], res_x, res_y, h, w)
+        _loaded_grids[grid_key] = entry
+        return entry
+
+
+def find_grid_for_point(lon, lat, datum_key):
+    """Find the best grid covering a given point.
+
+    Returns the grid_key or None.
+    """
+    # Map datum/ellipsoid names to grid keys, ordered by preference.
+    # Keys are matched against the 'datum' or 'ellps' field from CRS.to_dict().
+    datum_grids = {
+        'NAD27': ['NAD27_NADCON5_CONUS', 'NAD27_CONUS', 'NAD27_ALASKA',
+                  'NAD27_HAWAII', 'NAD27_PRVI'],
+        'clarke66': ['NAD27_NADCON5_CONUS', 'NAD27_CONUS', 'NAD27_ALASKA',
+                     'NAD27_HAWAII', 'NAD27_PRVI'],
+        'OSGB36': ['OSGB36_UK'],
+        'airy': ['OSGB36_UK'],
+        'AGD66': ['AGD66_GDA94'],
+        'aust_SA': ['AGD66_GDA94'],
+        'DHDN': ['DHDN_ETRS89_DE'],
+        'bessel': ['DHDN_ETRS89_DE'],  # Bessel used by DHDN, MGI, etc.
+        'MGI': ['MGI_ETRS89_AT'],
+        'ED50': ['ED50_ETRS89_ES'],
+        'intl': ['ED50_ETRS89_ES'],  # International 1924 ellipsoid
+        'BD72': ['BD72_ETRS89_BE'],
+        'CH1903': ['CH1903_ETRS89_CH'],
+        'D73': ['D73_ETRS89_PT'],
+    }
+
+    candidates = datum_grids.get(datum_key, [])
+    for grid_key in candidates:
+        entry = GRID_REGISTRY.get(grid_key)
+        if entry is None:
+            continue
+        _, coverage, _, _ = entry
+        lon_min, lat_min, lon_max, lat_max = coverage
+        if lon_min <= lon <= lon_max and lat_min <= lat <= lat_max:
+            return grid_key
+    return None
diff --git a/xrspatial/reproject/_grid.py b/xrspatial/reproject/_grid.py
index 3a19aa99..9cc2adf2 100644
--- a/xrspatial/reproject/_grid.py
+++ b/xrspatial/reproject/_grid.py
@@ -52,19 +52,30 @@ def _compute_output_grid(source_bounds, source_shape, source_crs, target_crs,
             if src_bottom >= src_top:
                 src_bottom, src_top = source_bounds[1], source_bounds[3]
 
-        n_edge = 21  # sample points along each edge
-        xs = np.concatenate([
+        # Sample edges densely plus an interior grid so that
+        # projections with curvature (e.g. UTM near zone edges)
+        # don't underestimate the output bounding box.
+        n_edge = 101
+        n_interior = 21
+        edge_xs = np.concatenate([
             np.linspace(src_left, src_right, n_edge),   # top edge
             np.linspace(src_left, src_right, n_edge),   # bottom edge
             np.full(n_edge, src_left),                   # left edge
             np.full(n_edge, src_right),                  # right edge
         ])
-        ys = np.concatenate([
+        edge_ys = np.concatenate([
             np.full(n_edge, src_top),
             np.full(n_edge, src_bottom),
             np.linspace(src_bottom, src_top, n_edge),
             np.linspace(src_bottom, src_top, n_edge),
         ])
+        # Interior grid catches cases where the projected extent
+        # bulges beyond the edges (e.g. Mercator near the poles).
+        ix = np.linspace(src_left, src_right, n_interior)
+        iy = np.linspace(src_bottom, src_top, n_interior)
+        ixx, iyy = np.meshgrid(ix, iy)
+        xs = np.concatenate([edge_xs, ixx.ravel()])
+        ys = np.concatenate([edge_ys, iyy.ravel()])
         tx, ty = transformer.transform(xs, ys)
         tx = np.asarray(tx)
         ty = np.asarray(ty)
@@ -131,29 +142,35 @@ def _compute_output_grid(source_bounds, source_shape, source_crs, target_crs,
         res_x = (right - left) / width
         res_y = (top - bottom) / height
     else:
-        # Estimate from source resolution
+        # Estimate from source resolution by transforming each axis
+        # independently, then taking the geometric mean for a square pixel.
         src_h, src_w = source_shape
         src_left, src_bottom, src_right, src_top = source_bounds
         src_res_x = (src_right - src_left) / src_w
         src_res_y = (src_top - src_bottom) / src_h
-        # Use the geometric mean of transformed pixel sizes
         center_x = (src_left + src_right) / 2
         center_y = (src_bottom + src_top) / 2
-        tx1, ty1 = transformer.transform(center_x, center_y)
-        tx2, ty2 = transformer.transform(
-            center_x + src_res_x, center_y + src_res_y
-        )
-        res_x = abs(float(tx2) - float(tx1))
-        res_y = abs(float(ty2) - float(ty1))
-        if res_x == 0 or res_y == 0:
+        tc_x, tc_y = transformer.transform(center_x, center_y)
+        # Step along x only
+        tx_x, tx_y = transformer.transform(center_x + src_res_x, center_y)
+        dx = np.hypot(float(tx_x) - float(tc_x), float(tx_y) - float(tc_y))
+        # Step along y only
+        ty_x, ty_y = transformer.transform(center_x, center_y + src_res_y)
+        dy = np.hypot(float(ty_x) - float(tc_x), float(ty_y) - float(tc_y))
+        if dx == 0 or dy == 0:
             res_x = (right - left) / src_w
             res_y = (top - bottom) / src_h
+        else:
+            # Geometric mean for square pixels
+            res_x = res_y = np.sqrt(dx * dy)
 
-    # Compute dimensions
+    # Compute dimensions.  Use round() instead of ceil() so that
+    # floating-point noise (e.g. 677.0000000000001) does not add a
+    # spurious extra row/column.
     if width is None:
-        width = max(1, int(np.ceil((right - left) / res_x)))
+        width = max(1, int(round((right - left) / res_x)))
     if height is None:
-        height = max(1, int(np.ceil((top - bottom) / res_y)))
+        height = max(1, int(round((top - bottom) / res_y)))
 
     # Adjust bounds to be exact multiples of resolution
     right = left + width * res_x
diff --git a/xrspatial/reproject/_interpolate.py b/xrspatial/reproject/_interpolate.py
index 1180a561..74c2241a 100644
--- a/xrspatial/reproject/_interpolate.py
+++ b/xrspatial/reproject/_interpolate.py
@@ -1,9 +1,17 @@
 """Per-backend resampling via numba JIT (nearest/bilinear) or map_coordinates (cubic)."""
 from __future__ import annotations
 
+import math
+
 import numpy as np
 from numba import njit
 
+try:
+    from numba import cuda as _cuda
+    _HAS_CUDA = True
+except ImportError:
+    _HAS_CUDA = False
+
 
 _RESAMPLING_ORDERS = {
     'nearest': 0,
@@ -35,7 +43,7 @@ def _resample_nearest_jit(src, row_coords, col_coords, nodata):
         for j in range(w_out):
             r = row_coords[i, j]
             c = col_coords[i, j]
-            if r < -0.5 or r > sh - 0.5 or c < -0.5 or c > sw - 0.5:
+            if r < -1.0 or r > sh or c < -1.0 or c > sw:
                 out[i, j] = nodata
                 continue
             ri = int(r + 0.5)
@@ -59,10 +67,11 @@ def _resample_nearest_jit(src, row_coords, col_coords, nodata):
 
 @njit(nogil=True, cache=True)
 def _resample_cubic_jit(src, row_coords, col_coords, nodata):
-    """Catmull-Rom cubic resampling with NaN propagation.
+    """Catmull-Rom cubic resampling with NaN-aware fallback to bilinear.
 
     Separable: interpolate 4 row-slices along columns, then combine
-    along rows.  Handles NaN inline (no second pass needed).
+    along rows.  When any of the 16 neighbors is NaN, falls back to
+    bilinear with weight renormalization (matching GDAL behavior).
     """
     h_out, w_out = row_coords.shape
     sh, sw = src.shape
@@ -71,7 +80,7 @@ def _resample_cubic_jit(src, row_coords, col_coords, nodata):
         for j in range(w_out):
             r = row_coords[i, j]
             c = col_coords[i, j]
-            if r < -0.5 or r > sh - 0.5 or c < -0.5 or c > sw - 0.5:
+            if r < -1.0 or r > sh or c < -1.0 or c > sw:
                 out[i, j] = nodata
                 continue
 
@@ -137,13 +146,62 @@ def _resample_cubic_jit(src, row_coords, col_coords, nodata):
                 else:
                     val += rv * wr3
 
-            out[i, j] = nodata if has_nan else val
+            if not has_nan:
+                out[i, j] = val
+            else:
+                # Fall back to bilinear with weight renormalization
+                r1 = r0 + 1
+                c1 = c0 + 1
+                dr = r - r0
+                dc = c - c0
+
+                w00 = (1.0 - dr) * (1.0 - dc)
+                w01 = (1.0 - dr) * dc
+                w10 = dr * (1.0 - dc)
+                w11 = dr * dc
+
+                accum = 0.0
+                wsum = 0.0
+
+                if 0 <= r0 < sh and 0 <= c0 < sw:
+                    v = src[r0, c0]
+                    if v == v:
+                        accum += w00 * v
+                        wsum += w00
+
+                if 0 <= r0 < sh and 0 <= c1 < sw:
+                    v = src[r0, c1]
+                    if v == v:
+                        accum += w01 * v
+                        wsum += w01
+
+                if 0 <= r1 < sh and 0 <= c0 < sw:
+                    v = src[r1, c0]
+                    if v == v:
+                        accum += w10 * v
+                        wsum += w10
+
+                if 0 <= r1 < sh and 0 <= c1 < sw:
+                    v = src[r1, c1]
+                    if v == v:
+                        accum += w11 * v
+                        wsum += w11
+
+                if wsum > 1e-10:
+                    out[i, j] = accum / wsum
+                else:
+                    out[i, j] = nodata
     return out
 
 
 @njit(nogil=True, cache=True)
 def _resample_bilinear_jit(src, row_coords, col_coords, nodata):
-    """Bilinear resampling with NaN propagation."""
+    """Bilinear resampling matching GDAL's weight-renormalization approach.
+
+    When a neighbor is out-of-bounds or NaN, its weight is excluded and
+    the result is renormalized from the remaining valid neighbors.  This
+    matches GDAL's GWKBilinearResample4Sample behavior.
+    """
     h_out, w_out = row_coords.shape
     sh, sw = src.shape
     out = np.empty((h_out, w_out), dtype=np.float64)
@@ -151,7 +209,7 @@ def _resample_bilinear_jit(src, row_coords, col_coords, nodata):
         for j in range(w_out):
             r = row_coords[i, j]
             c = col_coords[i, j]
-            if r < -0.5 or r > sh - 0.5 or c < -0.5 or c > sw - 0.5:
+            if r < -1.0 or r > sh or c < -1.0 or c > sw:
                 out[i, j] = nodata
                 continue
 
@@ -162,25 +220,43 @@ def _resample_bilinear_jit(src, row_coords, col_coords, nodata):
             dr = r - r0
             dc = c - c0
 
-            # Clamp to source bounds
-            r0c = r0 if r0 >= 0 else 0
-            r1c = r1 if r1 < sh else sh - 1
-            c0c = c0 if c0 >= 0 else 0
-            c1c = c1 if c1 < sw else sw - 1
-
-            v00 = src[r0c, c0c]
-            v01 = src[r0c, c1c]
-            v10 = src[r1c, c0c]
-            v11 = src[r1c, c1c]
-
-            # If any neighbor is NaN, output nodata
-            if v00 != v00 or v01 != v01 or v10 != v10 or v11 != v11:
-                out[i, j] = nodata
+            w00 = (1.0 - dr) * (1.0 - dc)
+            w01 = (1.0 - dr) * dc
+            w10 = dr * (1.0 - dc)
+            w11 = dr * dc
+
+            accum = 0.0
+            wsum = 0.0
+
+            # Accumulate only valid, in-bounds neighbors
+            if 0 <= r0 < sh and 0 <= c0 < sw:
+                v = src[r0, c0]
+                if v == v:  # not NaN
+                    accum += w00 * v
+                    wsum += w00
+
+            if 0 <= r0 < sh and 0 <= c1 < sw:
+                v = src[r0, c1]
+                if v == v:
+                    accum += w01 * v
+                    wsum += w01
+
+            if 0 <= r1 < sh and 0 <= c0 < sw:
+                v = src[r1, c0]
+                if v == v:
+                    accum += w10 * v
+                    wsum += w10
+
+            if 0 <= r1 < sh and 0 <= c1 < sw:
+                v = src[r1, c1]
+                if v == v:
+                    accum += w11 * v
+                    wsum += w11
+
+            if wsum > 1e-10:
+                out[i, j] = accum / wsum
             else:
-                out[i, j] = (v00 * (1.0 - dr) * (1.0 - dc) +
-                             v01 * (1.0 - dr) * dc +
-                             v10 * dr * (1.0 - dc) +
-                             v11 * dr * dc)
+                out[i, j] = nodata
     return out
 
 
@@ -223,18 +299,447 @@ def _resample_numpy(source_window, src_row_coords, src_col_coords,
     if order == 0:
         result = _resample_nearest_jit(work, rc, cc, nd)
         if is_integer:
-            result = np.round(result).astype(source_window.dtype)
+            info = np.iinfo(source_window.dtype)
+            result = np.clip(np.round(result), info.min, info.max).astype(source_window.dtype)
         return result
 
     if order == 1:
-        return _resample_bilinear_jit(work, rc, cc, nd)
+        result = _resample_bilinear_jit(work, rc, cc, nd)
+        if is_integer:
+            info = np.iinfo(source_window.dtype)
+            result = np.clip(np.round(result), info.min, info.max).astype(source_window.dtype)
+        return result
 
     # Cubic: numba Catmull-Rom (handles NaN inline, no extra passes)
-    return _resample_cubic_jit(work, rc, cc, nd)
+    result = _resample_cubic_jit(work, rc, cc, nd)
+    if is_integer:
+        info = np.iinfo(source_window.dtype)
+        result = np.clip(np.round(result), info.min, info.max).astype(source_window.dtype)
+    return result
+
+
+# ---------------------------------------------------------------------------
+# CUDA resampling kernels
+# ---------------------------------------------------------------------------
+
+if _HAS_CUDA:
+
+    @_cuda.jit
+    def _resample_nearest_cuda(src, row_coords, col_coords, out, nodata):
+        """Nearest-neighbor resampling kernel (CUDA)."""
+        i, j = _cuda.grid(2)
+        h_out = out.shape[0]
+        w_out = out.shape[1]
+        if i >= h_out or j >= w_out:
+            return
+        sh = src.shape[0]
+        sw = src.shape[1]
+        r = row_coords[i, j]
+        c = col_coords[i, j]
+        if r < -1.0 or r > sh or c < -1.0 or c > sw:
+            out[i, j] = nodata
+            return
+        ri = int(r + 0.5)
+        ci = int(c + 0.5)
+        if ri < 0:
+            ri = 0
+        if ri >= sh:
+            ri = sh - 1
+        if ci < 0:
+            ci = 0
+        if ci >= sw:
+            ci = sw - 1
+        v = src[ri, ci]
+        # NaN check
+        if v != v:
+            out[i, j] = nodata
+        else:
+            out[i, j] = v
+
+    @_cuda.jit
+    def _resample_bilinear_cuda(src, row_coords, col_coords, out, nodata):
+        """Bilinear resampling kernel (CUDA), GDAL-matching renormalization."""
+        i, j = _cuda.grid(2)
+        h_out = out.shape[0]
+        w_out = out.shape[1]
+        if i >= h_out or j >= w_out:
+            return
+        sh = src.shape[0]
+        sw = src.shape[1]
+        r = row_coords[i, j]
+        c = col_coords[i, j]
+        if r < -1.0 or r > sh or c < -1.0 or c > sw:
+            out[i, j] = nodata
+            return
+
+        r0 = int(math.floor(r))
+        c0 = int(math.floor(c))
+        r1 = r0 + 1
+        c1 = c0 + 1
+        dr = r - r0
+        dc = c - c0
+
+        w00 = (1.0 - dr) * (1.0 - dc)
+        w01 = (1.0 - dr) * dc
+        w10 = dr * (1.0 - dc)
+        w11 = dr * dc
+
+        accum = 0.0
+        wsum = 0.0
+
+        if 0 <= r0 < sh and 0 <= c0 < sw:
+            v = src[r0, c0]
+            if v == v:
+                accum += w00 * v
+                wsum += w00
+        if 0 <= r0 < sh and 0 <= c1 < sw:
+            v = src[r0, c1]
+            if v == v:
+                accum += w01 * v
+                wsum += w01
+        if 0 <= r1 < sh and 0 <= c0 < sw:
+            v = src[r1, c0]
+            if v == v:
+                accum += w10 * v
+                wsum += w10
+        if 0 <= r1 < sh and 0 <= c1 < sw:
+            v = src[r1, c1]
+            if v == v:
+                accum += w11 * v
+                wsum += w11
+
+        if wsum > 1e-10:
+            out[i, j] = accum / wsum
+        else:
+            out[i, j] = nodata
+
+    @_cuda.jit
+    def _resample_cubic_cuda(src, row_coords, col_coords, out, nodata):
+        """Catmull-Rom cubic resampling kernel (CUDA)."""
+        i, j = _cuda.grid(2)
+        h_out = out.shape[0]
+        w_out = out.shape[1]
+        if i >= h_out or j >= w_out:
+            return
+        sh = src.shape[0]
+        sw = src.shape[1]
+        r = row_coords[i, j]
+        c = col_coords[i, j]
+        if r < -1.0 or r > sh or c < -1.0 or c > sw:
+            out[i, j] = nodata
+            return
+
+        r0 = int(math.floor(r))
+        c0 = int(math.floor(c))
+        fr = r - r0
+        fc = c - c0
+
+        # Catmull-Rom column weights (a = -0.5)
+        fc2 = fc * fc
+        fc3 = fc2 * fc
+        wc0 = -0.5 * fc3 + fc2 - 0.5 * fc
+        wc1 = 1.5 * fc3 - 2.5 * fc2 + 1.0
+        wc2 = -1.5 * fc3 + 2.0 * fc2 + 0.5 * fc
+        wc3 = 0.5 * fc3 - 0.5 * fc2
+
+        # Catmull-Rom row weights
+        fr2 = fr * fr
+        fr3 = fr2 * fr
+        wr0 = -0.5 * fr3 + fr2 - 0.5 * fr
+        wr1 = 1.5 * fr3 - 2.5 * fr2 + 1.0
+        wr2 = -1.5 * fr3 + 2.0 * fr2 + 0.5 * fr
+        wr3 = 0.5 * fr3 - 0.5 * fr2
+
+        val = 0.0
+        has_nan = False
+
+        # Row 0
+        ric = r0 - 1
+        if ric < 0:
+            ric = 0
+        elif ric >= sh:
+            ric = sh - 1
+        # Unrolled column loop for row 0
+        cjc = c0 - 1
+        if cjc < 0:
+            cjc = 0
+        elif cjc >= sw:
+            cjc = sw - 1
+        sv = src[ric, cjc]
+        if sv != sv:
+            has_nan = True
+        if not has_nan:
+            rv0 = sv * wc0
+            cjc = c0
+            if cjc < 0:
+                cjc = 0
+            elif cjc >= sw:
+                cjc = sw - 1
+            sv = src[ric, cjc]
+            if sv != sv:
+                has_nan = True
+        if not has_nan:
+            rv0 += sv * wc1
+            cjc = c0 + 1
+            if cjc < 0:
+                cjc = 0
+            elif cjc >= sw:
+                cjc = sw - 1
+            sv = src[ric, cjc]
+            if sv != sv:
+                has_nan = True
+        if not has_nan:
+            rv0 += sv * wc2
+            cjc = c0 + 2
+            if cjc < 0:
+                cjc = 0
+            elif cjc >= sw:
+                cjc = sw - 1
+            sv = src[ric, cjc]
+            if sv != sv:
+                has_nan = True
+        if not has_nan:
+            rv0 += sv * wc3
+            val = rv0 * wr0
+
+        # Row 1
+        if not has_nan:
+            ric = r0
+            if ric < 0:
+                ric = 0
+            elif ric >= sh:
+                ric = sh - 1
+            cjc = c0 - 1
+            if cjc < 0:
+                cjc = 0
+            elif cjc >= sw:
+                cjc = sw - 1
+            sv = src[ric, cjc]
+            if sv != sv:
+                has_nan = True
+        if not has_nan:
+            rv1 = sv * wc0
+            cjc = c0
+            if cjc < 0:
+                cjc = 0
+            elif cjc >= sw:
+                cjc = sw - 1
+            sv = src[ric, cjc]
+            if sv != sv:
+                has_nan = True
+        if not has_nan:
+            rv1 += sv * wc1
+            cjc = c0 + 1
+            if cjc < 0:
+                cjc = 0
+            elif cjc >= sw:
+                cjc = sw - 1
+            sv = src[ric, cjc]
+            if sv != sv:
+                has_nan = True
+        if not has_nan:
+            rv1 += sv * wc2
+            cjc = c0 + 2
+            if cjc < 0:
+                cjc = 0
+            elif cjc >= sw:
+                cjc = sw - 1
+            sv = src[ric, cjc]
+            if sv != sv:
+                has_nan = True
+        if not has_nan:
+            rv1 += sv * wc3
+            val += rv1 * wr1
+
+        # Row 2
+        if not has_nan:
+            ric = r0 + 1
+            if ric < 0:
+                ric = 0
+            elif ric >= sh:
+                ric = sh - 1
+            cjc = c0 - 1
+            if cjc < 0:
+                cjc = 0
+            elif cjc >= sw:
+                cjc = sw - 1
+            sv = src[ric, cjc]
+            if sv != sv:
+                has_nan = True
+        if not has_nan:
+            rv2 = sv * wc0
+            cjc = c0
+            if cjc < 0:
+                cjc = 0
+            elif cjc >= sw:
+                cjc = sw - 1
+            sv = src[ric, cjc]
+            if sv != sv:
+                has_nan = True
+        if not has_nan:
+            rv2 += sv * wc1
+            cjc = c0 + 1
+            if cjc < 0:
+                cjc = 0
+            elif cjc >= sw:
+                cjc = sw - 1
+            sv = src[ric, cjc]
+            if sv != sv:
+                has_nan = True
+        if not has_nan:
+            rv2 += sv * wc2
+            cjc = c0 + 2
+            if cjc < 0:
+                cjc = 0
+            elif cjc >= sw:
+                cjc = sw - 1
+            sv = src[ric, cjc]
+            if sv != sv:
+                has_nan = True
+        if not has_nan:
+            rv2 += sv * wc3
+            val += rv2 * wr2
+
+        # Row 3
+        if not has_nan:
+            ric = r0 + 2
+            if ric < 0:
+                ric = 0
+            elif ric >= sh:
+                ric = sh - 1
+            cjc = c0 - 1
+            if cjc < 0:
+                cjc = 0
+            elif cjc >= sw:
+                cjc = sw - 1
+            sv = src[ric, cjc]
+            if sv != sv:
+                has_nan = True
+        if not has_nan:
+            rv3 = sv * wc0
+            cjc = c0
+            if cjc < 0:
+                cjc = 0
+            elif cjc >= sw:
+                cjc = sw - 1
+            sv = src[ric, cjc]
+            if sv != sv:
+                has_nan = True
+        if not has_nan:
+            rv3 += sv * wc1
+            cjc = c0 + 1
+            if cjc < 0:
+                cjc = 0
+            elif cjc >= sw:
+                cjc = sw - 1
+            sv = src[ric, cjc]
+            if sv != sv:
+                has_nan = True
+        if not has_nan:
+            rv3 += sv * wc2
+            cjc = c0 + 2
+            if cjc < 0:
+                cjc = 0
+            elif cjc >= sw:
+                cjc = sw - 1
+            sv = src[ric, cjc]
+            if sv != sv:
+                has_nan = True
+        if not has_nan:
+            rv3 += sv * wc3
+            val += rv3 * wr3
+
+        if has_nan:
+            out[i, j] = nodata
+        else:
+            out[i, j] = val
+
+
+# ---------------------------------------------------------------------------
+# Native CuPy resampler using CUDA kernels
+# ---------------------------------------------------------------------------
+
+def _resample_cupy_native(source_window, src_row_coords, src_col_coords,
+                          resampling='bilinear', nodata=np.nan):
+    """Resample using custom CUDA kernels (all data stays on GPU).
+
+    Unlike ``_resample_cupy`` which uses ``cupyx.scipy.ndimage.map_coordinates``,
+    this function uses hand-written CUDA kernels that match the Numba CPU
+    kernels exactly, including inline NaN handling.
+
+    Parameters
+    ----------
+    source_window : cupy.ndarray (H_src, W_src)
+    src_row_coords, src_col_coords : cupy.ndarray (H_out, W_out)
+    resampling : str
+    nodata : float
+
+    Returns
+    -------
+    cupy.ndarray (H_out, W_out)
+    """
+    if not _HAS_CUDA:
+        raise RuntimeError("numba.cuda is required for _resample_cupy_native")
+
+    import cupy as cp
+
+    order = _validate_resampling(resampling)
+
+    is_integer = cp.issubdtype(source_window.dtype, cp.integer)
+    if is_integer:
+        work = source_window.astype(cp.float64)
+    else:
+        work = source_window
+    if work.dtype != cp.float64:
+        work = work.astype(cp.float64)
+
+    # Ensure inputs are CuPy arrays
+    if not isinstance(src_row_coords, cp.ndarray):
+        src_row_coords = cp.asarray(src_row_coords)
+    if not isinstance(src_col_coords, cp.ndarray):
+        src_col_coords = cp.asarray(src_col_coords)
+    rc = cp.ascontiguousarray(src_row_coords, dtype=cp.float64)
+    cc = cp.ascontiguousarray(src_col_coords, dtype=cp.float64)
+
+    # Convert sentinel nodata to NaN so kernels can detect it
+    if not np.isnan(nodata):
+        work = work.copy()
+        work[work == nodata] = cp.nan
+
+    h_out, w_out = rc.shape
+    out = cp.empty((h_out, w_out), dtype=cp.float64)
+    nd = float(nodata)
+
+    # Launch configuration: (16, 16) thread blocks
+    threads_per_block = (16, 16)
+    blocks_per_grid = (
+        (h_out + threads_per_block[0] - 1) // threads_per_block[0],
+        (w_out + threads_per_block[1] - 1) // threads_per_block[1],
+    )
+
+    if order == 0:
+        _resample_nearest_cuda[blocks_per_grid, threads_per_block](
+            work, rc, cc, out, nd
+        )
+        if is_integer:
+            out = cp.round(out).astype(source_window.dtype)
+        return out
+
+    if order == 1:
+        _resample_bilinear_cuda[blocks_per_grid, threads_per_block](
+            work, rc, cc, out, nd
+        )
+        return out
+
+    # Cubic
+    _resample_cubic_cuda[blocks_per_grid, threads_per_block](
+        work, rc, cc, out, nd
+    )
+    return out
 
 
 # ---------------------------------------------------------------------------
-# CuPy resampler (unchanged -- GPU kernels are already fast)
+# CuPy resampler (uses cupyx.scipy.ndimage.map_coordinates)
 # ---------------------------------------------------------------------------
 
 def _resample_cupy(source_window, src_row_coords, src_col_coords,
@@ -279,8 +784,8 @@ def _resample_cupy(source_window, src_row_coords, src_col_coords,
 
     h, w = source_window.shape
     oob = (
-        (src_row_coords < -0.5) | (src_row_coords > h - 0.5) |
-        (src_col_coords < -0.5) | (src_col_coords > w - 0.5)
+        (src_row_coords < -1.0) | (src_row_coords > h) |
+        (src_col_coords < -1.0) | (src_col_coords > w)
     )
 
     if has_nan:
diff --git a/xrspatial/reproject/_itrf.py b/xrspatial/reproject/_itrf.py
new file mode 100644
index 00000000..a799bca6
--- /dev/null
+++ b/xrspatial/reproject/_itrf.py
@@ -0,0 +1,312 @@
+"""Time-dependent ITRF frame transformations.
+
+Implements 14-parameter Helmert transforms (7 static + 7 rates)
+for converting between International Terrestrial Reference Frames.
+
+The parameters are published by IGN France and shipped with PROJ.
+Shifts are mm-level for position and mm/year for rates -- relevant
+for precision geodesy, negligible for most raster reprojection.
+
+Usage
+-----
+>>> from xrspatial.reproject import itrf_transform
+>>> lon2, lat2, h2 = itrf_transform(
+...     -74.0, 40.7, 0.0,
+...     src='ITRF2014', tgt='ITRF2020', epoch=2024.0,
+... )
+"""
+from __future__ import annotations
+
+import math
+import os
+import re
+import threading
+
+import numpy as np
+from numba import njit, prange
+
+# ---------------------------------------------------------------------------
+# Parse PROJ ITRF parameter files
+# ---------------------------------------------------------------------------
+
+def _find_proj_data_dir():
+    """Locate the PROJ data directory."""
+    try:
+        import pyproj
+        return pyproj.datadir.get_data_dir()
+    except Exception:
+        return None
+
+
+def _parse_itrf_file(path):
+    """Parse a PROJ ITRF parameter file.
+
+    Returns dict mapping target_frame -> parameter dict.
+    """
+    transforms = {}
+    with open(path) as f:
+        for line in f:
+            line = line.strip()
+            if not line or line.startswith('#') or line.startswith('<metadata>'):
+                continue
+            # Format: <TARGET_FRAME> +proj=helmert +x=... +dx=... +t_epoch=...
+            m = re.match(r'<(\w+)>\s+(.+)', line)
+            if not m:
+                continue
+            target = m.group(1)
+            params_str = m.group(2)
+            params = {}
+            for token in params_str.split():
+                if '=' in token:
+                    key, val = token.lstrip('+').split('=', 1)
+                    try:
+                        params[key] = float(val)
+                    except ValueError:
+                        params[key] = val
+                elif token.startswith('+'):
+                    params[token.lstrip('+')] = True
+            transforms[target] = params
+    return transforms
+
+
+def _load_all_itrf_params():
+    """Load all ITRF transformation parameters from PROJ data files.
+
+    Returns a nested dict: {source_frame: {target_frame: params}}.
+    """
+    proj_dir = _find_proj_data_dir()
+    if proj_dir is None:
+        return {}
+
+    all_params = {}
+    for filename in os.listdir(proj_dir):
+        if not filename.startswith('ITRF'):
+            continue
+        source_frame = filename
+        path = os.path.join(proj_dir, filename)
+        if not os.path.isfile(path):
+            continue
+        transforms = _parse_itrf_file(path)
+        all_params[source_frame] = transforms
+
+    return all_params
+
+
+# Lazy-loaded parameter cache
+_itrf_params = None
+_itrf_params_lock = threading.Lock()
+
+
+def _get_itrf_params():
+    global _itrf_params
+    with _itrf_params_lock:
+        if _itrf_params is None:
+            _itrf_params = _load_all_itrf_params()
+        return _itrf_params
+
+
+def _find_transform(src, tgt):
+    """Find the 14-parameter Helmert from src to tgt frame.
+
+    Returns parameter dict or None. Tries direct lookup first,
+    then reverse (with negated parameters).
+    """
+    params = _get_itrf_params()
+
+    # Direct: src file contains entry for tgt
+    if src in params and tgt in params[src]:
+        return params[src][tgt], False
+
+    # Reverse: tgt file contains entry for src
+    if tgt in params and src in params[tgt]:
+        return params[tgt][src], True  # need to negate
+
+    return None, False
+
+
+# ---------------------------------------------------------------------------
+# 14-parameter time-dependent Helmert (Numba JIT)
+# ---------------------------------------------------------------------------
+
+@njit(nogil=True, cache=True)
+def _helmert14_point(X, Y, Z,
+                     tx, ty, tz, s, rx, ry, rz,
+                     dtx, dty, dtz, ds, drx, dry, drz,
+                     t_epoch, t_obs, position_vector):
+    """Apply 14-parameter Helmert transform to a single ECEF point.
+
+    Parameters are in metres (translations), ppb (scale), and
+    arcseconds (rotations).  Rates are per year.
+    """
+    dt = t_obs - t_epoch
+
+    # Effective parameters at observation epoch
+    tx_e = tx + dtx * dt
+    ty_e = ty + dty * dt
+    tz_e = tz + dtz * dt
+    s_e = 1.0 + (s + ds * dt) * 1e-9  # ppb -> scale factor
+    # Rotations: arcsec -> radians
+    AS2RAD = math.pi / (180.0 * 3600.0)
+    rx_e = (rx + drx * dt) * AS2RAD
+    ry_e = (ry + dry * dt) * AS2RAD
+    rz_e = (rz + drz * dt) * AS2RAD
+
+    if position_vector:
+        # Position vector convention (IERS/IGN)
+        X2 = tx_e + s_e * (X - rz_e * Y + ry_e * Z)
+        Y2 = ty_e + s_e * (rz_e * X + Y - rx_e * Z)
+        Z2 = tz_e + s_e * (-ry_e * X + rx_e * Y + Z)
+    else:
+        # Coordinate frame convention (transpose rotation)
+        X2 = tx_e + s_e * (X + rz_e * Y - ry_e * Z)
+        Y2 = ty_e + s_e * (-rz_e * X + Y + rx_e * Z)
+        Z2 = tz_e + s_e * (ry_e * X - rx_e * Y + Z)
+
+    return X2, Y2, Z2
+
+
+@njit(nogil=True, cache=True)
+def _geodetic_to_ecef(lon_deg, lat_deg, h, a, f):
+    lon = math.radians(lon_deg)
+    lat = math.radians(lat_deg)
+    e2 = 2.0 * f - f * f
+    slat = math.sin(lat)
+    clat = math.cos(lat)
+    N = a / math.sqrt(1.0 - e2 * slat * slat)
+    X = (N + h) * clat * math.cos(lon)
+    Y = (N + h) * clat * math.sin(lon)
+    Z = (N * (1.0 - e2) + h) * slat
+    return X, Y, Z
+
+
+@njit(nogil=True, cache=True)
+def _ecef_to_geodetic(X, Y, Z, a, f):
+    e2 = 2.0 * f - f * f
+    lon = math.atan2(Y, X)
+    p = math.sqrt(X * X + Y * Y)
+    lat = math.atan2(Z, p * (1.0 - e2))
+    for _ in range(10):
+        slat = math.sin(lat)
+        N = a / math.sqrt(1.0 - e2 * slat * slat)
+        lat = math.atan2(Z + e2 * N * slat, p)
+    N = a / math.sqrt(1.0 - e2 * math.sin(lat) * math.sin(lat))
+    h = p / math.cos(lat) - N if abs(lat) < math.pi / 4 else Z / math.sin(lat) - N * (1 - e2)
+    return math.degrees(lon), math.degrees(lat), h
+
+
+@njit(nogil=True, cache=True, parallel=True)
+def _itrf_batch(lon_arr, lat_arr, h_arr,
+                out_lon, out_lat, out_h,
+                tx, ty, tz, s, rx, ry, rz,
+                dtx, dty, dtz, ds, drx, dry, drz,
+                t_epoch, t_obs, position_vector,
+                a, f):
+    for i in prange(lon_arr.shape[0]):
+        X, Y, Z = _geodetic_to_ecef(lon_arr[i], lat_arr[i], h_arr[i], a, f)
+        X2, Y2, Z2 = _helmert14_point(
+            X, Y, Z,
+            tx, ty, tz, s, rx, ry, rz,
+            dtx, dty, dtz, ds, drx, dry, drz,
+            t_epoch, t_obs, position_vector,
+        )
+        out_lon[i], out_lat[i], out_h[i] = _ecef_to_geodetic(X2, Y2, Z2, a, f)
+
+
+# ---------------------------------------------------------------------------
+# Public API
+# ---------------------------------------------------------------------------
+
+# WGS84 ellipsoid
+_A = 6378137.0
+_F = 1.0 / 298.257223563
+
+
+def list_frames():
+    """List available ITRF frames.
+
+    Returns
+    -------
+    list of str
+        Available frame names (e.g. ['ITRF2000', 'ITRF2008', 'ITRF2014', 'ITRF2020']).
+    """
+    return sorted(_get_itrf_params().keys())
+
+
+def itrf_transform(lon, lat, h=0.0, *, src, tgt, epoch):
+    """Transform coordinates between ITRF frames at a given epoch.
+
+    Parameters
+    ----------
+    lon, lat : float or array-like
+        Geographic coordinates in degrees.
+    h : float or array-like
+        Ellipsoidal height in metres (default 0).
+    src : str
+        Source ITRF frame (e.g. 'ITRF2014').
+    tgt : str
+        Target ITRF frame (e.g. 'ITRF2020').
+    epoch : float
+        Observation epoch as decimal year (e.g. 2024.0).
+
+    Returns
+    -------
+    (lon, lat, h) : tuple of float or ndarray
+        Transformed coordinates.
+
+    Examples
+    --------
+    >>> itrf_transform(-74.0, 40.7, 10.0, src='ITRF2014', tgt='ITRF2020', epoch=2024.0)
+    """
+    raw_params, is_reverse = _find_transform(src, tgt)
+    if raw_params is None:
+        raise ValueError(
+            f"No transform found between {src} and {tgt}. "
+            f"Available frames: {list_frames()}"
+        )
+
+    # Extract parameters (default 0 for missing)
+    def g(key):
+        return raw_params.get(key, 0.0)
+
+    tx, ty, tz = g('x'), g('y'), g('z')
+    s = g('s')
+    rx, ry, rz = g('rx'), g('ry'), g('rz')
+    dtx, dty, dtz = g('dx'), g('dy'), g('dz')
+    ds = g('ds')
+    drx, dry, drz = g('drx'), g('dry'), g('drz')
+    t_epoch = g('t_epoch')
+    convention = raw_params.get('convention', 'position_vector')
+    position_vector = convention == 'position_vector'
+
+    if is_reverse:
+        # Negate all parameters for the reverse direction
+        tx, ty, tz = -tx, -ty, -tz
+        s = -s
+        rx, ry, rz = -rx, -ry, -rz
+        dtx, dty, dtz = -dtx, -dty, -dtz
+        ds = -ds
+        drx, dry, drz = -drx, -dry, -drz
+
+    scalar = np.ndim(lon) == 0 and np.ndim(lat) == 0
+    lon_arr = np.atleast_1d(np.asarray(lon, dtype=np.float64)).ravel()
+    lat_arr = np.atleast_1d(np.asarray(lat, dtype=np.float64)).ravel()
+    h_arr = np.broadcast_to(np.atleast_1d(np.asarray(h, dtype=np.float64)),
+                            lon_arr.shape).copy()
+
+    n = lon_arr.shape[0]
+    out_lon = np.empty(n, dtype=np.float64)
+    out_lat = np.empty(n, dtype=np.float64)
+    out_h = np.empty(n, dtype=np.float64)
+
+    _itrf_batch(
+        lon_arr, lat_arr, h_arr,
+        out_lon, out_lat, out_h,
+        tx, ty, tz, s, rx, ry, rz,
+        dtx, dty, dtz, ds, drx, dry, drz,
+        t_epoch, float(epoch), position_vector,
+        _A, _F,
+    )
+
+    if scalar:
+        return float(out_lon[0]), float(out_lat[0]), float(out_h[0])
+    return out_lon, out_lat, out_h
diff --git a/xrspatial/reproject/_projections.py b/xrspatial/reproject/_projections.py
new file mode 100644
index 00000000..4d73a4f9
--- /dev/null
+++ b/xrspatial/reproject/_projections.py
@@ -0,0 +1,2164 @@
+"""Numba JIT coordinate transforms for common projections.
+
+Replaces pyproj for the most-used CRS pairs, giving ~30x speedup
+via parallelised Numba kernels.
+
+Supported fast paths
+--------------------
+- WGS84 (EPSG:4326) <-> Web Mercator (EPSG:3857)
+- WGS84 / NAD83 <-> UTM zones (EPSG:326xx / 327xx / 269xx)
+- WGS84 / NAD83 <-> Ellipsoidal Mercator (EPSG:3395)
+- WGS84 / NAD83 <-> Lambert Conformal Conic (e.g. EPSG:2154)
+- WGS84 / NAD83 <-> Albers Equal Area (e.g. EPSG:5070)
+- WGS84 / NAD83 <-> Cylindrical Equal Area (e.g. EPSG:6933)
+- WGS84 / NAD83 <-> Sinusoidal (e.g. MODIS)
+- WGS84 / NAD83 <-> Lambert Azimuthal Equal Area (e.g. EPSG:3035)
+- WGS84 / NAD83 <-> Polar Stereographic (e.g. EPSG:3031, 3413, 3996)
+- WGS84 / NAD83 <-> Oblique Stereographic (e.g. EPSG:28992 RD New)
+- WGS84 / NAD83 <-> Oblique Mercator Hotine (e.g. EPSG:3375 RSO)
+
+All other CRS pairs fall back to pyproj.
+"""
+from __future__ import annotations
+
+import math
+
+import numpy as np
+from numba import njit, prange
+
+# ---------------------------------------------------------------------------
+# WGS84 ellipsoid constants
+# ---------------------------------------------------------------------------
+_WGS84_A = 6378137.0                        # semi-major axis (m)
+_WGS84_F = 1.0 / 298.257223563              # flattening
+_WGS84_B = _WGS84_A * (1.0 - _WGS84_F)     # semi-minor axis
+_WGS84_N = (_WGS84_A - _WGS84_B) / (_WGS84_A + _WGS84_B)  # third flattening
+_WGS84_E2 = 2.0 * _WGS84_F - _WGS84_F ** 2  # eccentricity squared
+_WGS84_E = math.sqrt(_WGS84_E2)             # eccentricity
+
+# ---------------------------------------------------------------------------
+# Web Mercator  (EPSG:3857)  --  spherical, trivial
+# ---------------------------------------------------------------------------
+
+@njit(nogil=True, cache=True)
+def _merc_fwd_point(lon_deg, lat_deg):
+    """(lon, lat) in degrees -> (x, y) in EPSG:3857 metres."""
+    x = _WGS84_A * math.radians(lon_deg)
+    phi = math.radians(lat_deg)
+    y = _WGS84_A * math.log(math.tan(math.pi / 4.0 + phi / 2.0))
+    return x, y
+
+
+@njit(nogil=True, cache=True)
+def _merc_inv_point(x, y):
+    """(x, y) in EPSG:3857 metres -> (lon, lat) in degrees."""
+    lon = math.degrees(x / _WGS84_A)
+    lat = math.degrees(math.atan(math.sinh(y / _WGS84_A)))
+    return lon, lat
+
+
+@njit(nogil=True, cache=True, parallel=True)
+def merc_forward(lons, lats, out_x, out_y):
+    """Batch WGS84 -> Web Mercator.  Writes into pre-allocated arrays."""
+    for i in prange(lons.shape[0]):
+        out_x[i], out_y[i] = _merc_fwd_point(lons[i], lats[i])
+
+
+@njit(nogil=True, cache=True, parallel=True)
+def merc_inverse(xs, ys, out_lon, out_lat):
+    """Batch Web Mercator -> WGS84.  Writes into pre-allocated arrays."""
+    for i in prange(xs.shape[0]):
+        out_lon[i], out_lat[i] = _merc_inv_point(xs[i], ys[i])
+
+
+# ---------------------------------------------------------------------------
+# Datum shift: 7-parameter Helmert (Bursa-Wolf)
+# ---------------------------------------------------------------------------
+
+# Ellipsoid definitions: (a, f)
+_ELLIPSOID_CLARKE1866 = (6378206.4, 1.0 / 294.9786982)
+_ELLIPSOID_AIRY = (6377563.396, 1.0 / 299.3249646)
+_ELLIPSOID_BESSEL = (6377397.155, 1.0 / 299.1528128)
+_ELLIPSOID_INTL1924 = (6378388.0, 1.0 / 297.0)
+_ELLIPSOID_ANS = (6378160.0, 1.0 / 298.25)  # Australian National Spheroid
+_ELLIPSOID_WGS84 = (_WGS84_A, _WGS84_F)
+
+# 7-parameter Helmert: (dx, dy, dz, rx, ry, rz, ds, ellipsoid)
+#   dx/dy/dz: translation (metres)
+#   rx/ry/rz: rotation (arcseconds, position vector convention)
+#   ds: scale difference (ppm)
+#   ellipsoid: (a, f) of the source datum
+# From EPSG dataset / NIMA TR 8350.2.  Used as fallback when
+# shift grids are not available.
+_DATUM_PARAMS = {
+    # North America (3-param, no rotations published)
+    'NAD27':    (-8.0, 160.0, 176.0, 0, 0, 0, 0, _ELLIPSOID_CLARKE1866),
+    'clarke66': (-8.0, 160.0, 176.0, 0, 0, 0, 0, _ELLIPSOID_CLARKE1866),
+    # UK -- OSGB36->ETRS89 (EPSG:1314, 7-param, position vector)
+    'OSGB36':   (446.448, -125.157, 542.060, 0.1502, 0.2470, 0.8421, -20.4894, _ELLIPSOID_AIRY),
+    'airy':     (446.448, -125.157, 542.060, 0.1502, 0.2470, 0.8421, -20.4894, _ELLIPSOID_AIRY),
+    # Germany -- DHDN->ETRS89 (EPSG:1776, 7-param)
+    'DHDN':     (598.1, 73.7, 418.2, 0.202, 0.045, -2.455, 6.7, _ELLIPSOID_BESSEL),
+    'potsdam':  (598.1, 73.7, 418.2, 0.202, 0.045, -2.455, 6.7, _ELLIPSOID_BESSEL),
+    # Austria -- MGI->ETRS89 (EPSG:1618, 7-param)
+    'MGI':      (577.326, 90.129, 463.919, 5.1366, 1.4742, 5.2970, 2.4232, _ELLIPSOID_BESSEL),
+    'hermannskogel': (577.326, 90.129, 463.919, 5.1366, 1.4742, 5.2970, 2.4232, _ELLIPSOID_BESSEL),
+    # Europe -- ED50->WGS84 (EPSG:1133, 7-param, western Europe)
+    'ED50':     (-87.0, -98.0, -121.0, 0, 0, 0.814, -0.38, _ELLIPSOID_INTL1924),
+    'intl':     (-87.0, -98.0, -121.0, 0, 0, 0.814, -0.38, _ELLIPSOID_INTL1924),
+    # Belgium -- BD72->ETRS89 (EPSG:1609, 7-param)
+    'BD72':     (-106.869, 52.2978, -103.724, 0.3366, -0.457, 1.8422, -1.2747, _ELLIPSOID_INTL1924),
+    # Switzerland -- CH1903->ETRS89 (EPSG:1753, 7-param)
+    'CH1903':   (674.374, 15.056, 405.346, 0, 0, 0, 0, _ELLIPSOID_BESSEL),
+    # Portugal -- D73->ETRS89 (3-param)
+    'D73':      (-239.749, 88.181, 30.488, 0, 0, 0, 0, _ELLIPSOID_INTL1924),
+    # Australia -- AGD66->GDA94 (3-param)
+    'AGD66':    (-133.0, -48.0, 148.0, 0, 0, 0, 0, _ELLIPSOID_ANS),
+    'aust_SA':  (-133.0, -48.0, 148.0, 0, 0, 0, 0, _ELLIPSOID_ANS),
+    # Japan -- Tokyo->WGS84 (3-param, grid not openly licensed)
+    'tokyo':    (-146.414, 507.337, 680.507, 0, 0, 0, 0, _ELLIPSOID_BESSEL),
+}
+
+
+@njit(nogil=True, cache=True)
+def _geodetic_to_ecef(lon_deg, lat_deg, a, f):
+    """Geographic (deg) -> geocentric ECEF (metres)."""
+    lon = math.radians(lon_deg)
+    lat = math.radians(lat_deg)
+    e2 = 2.0 * f - f * f
+    slat = math.sin(lat)
+    clat = math.cos(lat)
+    N = a / math.sqrt(1.0 - e2 * slat * slat)
+    X = N * clat * math.cos(lon)
+    Y = N * clat * math.sin(lon)
+    Z = N * (1.0 - e2) * slat
+    return X, Y, Z
+
+
+@njit(nogil=True, cache=True)
+def _ecef_to_geodetic(X, Y, Z, a, f):
+    """Geocentric ECEF (metres) -> geographic (deg).  Iterative."""
+    e2 = 2.0 * f - f * f
+    lon = math.atan2(Y, X)
+    p = math.sqrt(X * X + Y * Y)
+    lat = math.atan2(Z, p * (1.0 - e2))
+    for _ in range(10):
+        slat = math.sin(lat)
+        N = a / math.sqrt(1.0 - e2 * slat * slat)
+        lat = math.atan2(Z + e2 * N * slat, p)
+    return math.degrees(lon), math.degrees(lat)
+
+
+@njit(nogil=True, cache=True)
+def _helmert7_fwd(lon_deg, lat_deg, dx, dy, dz, rx, ry, rz, ds,
+                  a_src, f_src, a_tgt, f_tgt):
+    """Datum shift: source -> target via 7-param Helmert (Bursa-Wolf).
+
+    rx/ry/rz in arcseconds (position vector convention), ds in ppm.
+    """
+    X, Y, Z = _geodetic_to_ecef(lon_deg, lat_deg, a_src, f_src)
+    AS2RAD = math.pi / (180.0 * 3600.0)
+    rxr = rx * AS2RAD
+    ryr = ry * AS2RAD
+    rzr = rz * AS2RAD
+    sc = 1.0 + ds * 1e-6
+    X2 = dx + sc * (X - rzr * Y + ryr * Z)
+    Y2 = dy + sc * (rzr * X + Y - rxr * Z)
+    Z2 = dz + sc * (-ryr * X + rxr * Y + Z)
+    return _ecef_to_geodetic(X2, Y2, Z2, a_tgt, f_tgt)
+
+
+@njit(nogil=True, cache=True)
+def _helmert7_inv(lon_deg, lat_deg, dx, dy, dz, rx, ry, rz, ds,
+                  a_src, f_src, a_tgt, f_tgt):
+    """Inverse 7-param Helmert: target -> source (negate all params)."""
+    return _helmert7_fwd(lon_deg, lat_deg,
+                         -dx, -dy, -dz, -rx, -ry, -rz, -ds,
+                         a_tgt, f_tgt, a_src, f_src)
+
+
+def _get_datum_params(crs):
+    """Return (dx, dy, dz, rx, ry, rz, ds, a_src, f_src) for a non-WGS84 datum.
+
+    Returns None for WGS84/NAD83/GRS80 (no shift needed).
+    """
+    try:
+        d = crs.to_dict()
+    except Exception:
+        return None
+    datum = d.get('datum', '')
+    ellps = d.get('ellps', '')
+    key = datum if datum in _DATUM_PARAMS else ellps
+    if key not in _DATUM_PARAMS:
+        return None
+    dx, dy, dz, rx, ry, rz, ds, (a_src, f_src) = _DATUM_PARAMS[key]
+    return dx, dy, dz, rx, ry, rz, ds, a_src, f_src
+
+
+# ---------------------------------------------------------------------------
+# Shared helpers  (PROJ pj_tsfn, pj_sinhpsi2tanphi, authalic latitude)
+# ---------------------------------------------------------------------------
+
+@njit(nogil=True, cache=True)
+def _norm_lon_rad(lon):
+    """Normalize longitude to [-pi, pi]."""
+    while lon > math.pi:
+        lon -= 2.0 * math.pi
+    while lon < -math.pi:
+        lon += 2.0 * math.pi
+    return lon
+
+
+@njit(nogil=True, cache=True)
+def _pj_tsfn(phi, sinphi, e):
+    """Isometric co-latitude: ts = exp(-psi).
+
+    Equivalent to tan(pi/4 - phi/2) / ((1-e*sinphi)/(1+e*sinphi))^(e/2).
+    """
+    es = e * sinphi
+    return math.tan(math.pi / 4.0 - phi / 2.0) * math.pow(
+        (1.0 + es) / (1.0 - es), e / 2.0
+    )
+
+
+@njit(nogil=True, cache=True)
+def _pj_sinhpsi2tanphi(taup, e):
+    """Newton iteration: recover tan(phi) from sinh(isometric lat).
+
+    Matches PROJ's pj_sinhpsi2tanphi -- 5 iterations, always converges.
+    """
+    e2 = e * e
+    tau = taup
+    tau1 = math.sqrt(1.0 + tau * tau)
+
+    for _ in range(5):
+        tau1 = math.sqrt(1.0 + tau * tau)
+        sig = math.sinh(e * math.atanh(e * tau / tau1))
+        sig1 = math.sqrt(1.0 + sig * sig)
+        taupa = sig1 * tau - sig * tau1
+        dtau = ((taup - taupa) * (1.0 + (1.0 - e2) * tau * tau)
+                / ((1.0 - e2) * tau1 * math.sqrt(1.0 + taupa * taupa)))
+        tau += dtau
+        if abs(dtau) < 1e-12:
+            break
+    return tau
+
+
+@njit(nogil=True, cache=True)
+def _authalic_q(sinphi, e):
+    """Authalic latitude q-parameter: q(phi) for given sinphi and e."""
+    e2 = e * e
+    es = e * sinphi
+    return (1.0 - e2) * (sinphi / (1.0 - es * es) + math.atanh(es) / e)
+
+
+def _authalic_apa(e):
+    """Precompute 6 coefficients for the authalic latitude inverse series.
+
+    Returns array [APA0..APA5] used by _authalic_inv.
+    6 terms give sub-centimetre accuracy (vs ~4m with 3 terms).
+    Coefficients from Snyder (1987) / Karney (2011).
+    """
+    e2 = e * e
+    e4 = e2 * e2
+    e6 = e4 * e2
+    e8 = e6 * e2
+    e10 = e8 * e2
+    e12 = e10 * e2
+    apa = np.empty(6, dtype=np.float64)
+    apa[0] = e2 / 3.0 + 31.0 * e4 / 180.0 + 59.0 * e6 / 560.0 + 17141.0 * e8 / 166320.0 + 28289.0 * e10 / 249480.0
+    apa[1] = 17.0 * e4 / 360.0 + 61.0 * e6 / 1260.0 + 10217.0 * e8 / 120960.0 + 319.0 * e10 / 3024.0
+    apa[2] = 383.0 * e6 / 45360.0 + 34729.0 * e8 / 1814400.0 + 192757.0 * e10 / 5765760.0
+    apa[3] = 6007.0 * e8 / 272160.0 + 36941.0 * e10 / 1270080.0
+    apa[4] = 33661.0 * e10 / 5765760.0
+    apa[5] = 0.0  # 12th order term negligible for Earth
+    return apa
+
+
+@njit(nogil=True, cache=True)
+def _authalic_inv(beta, apa):
+    """Inverse authalic latitude: beta (authalic, rad) -> phi (geodetic, rad).
+
+    6-term Fourier series for sub-centimetre accuracy.
+    """
+    t = 2.0 * beta
+    return (beta
+            + apa[0] * math.sin(t)
+            + apa[1] * math.sin(2.0 * t)
+            + apa[2] * math.sin(3.0 * t)
+            + apa[3] * math.sin(4.0 * t)
+            + apa[4] * math.sin(5.0 * t))
+
+
+# Precompute authalic coefficients for WGS84
+_APA = _authalic_apa(_WGS84_E)
+_QP = _authalic_q(1.0, _WGS84_E)  # q at the pole
+
+
+# ---------------------------------------------------------------------------
+# Ellipsoidal Mercator  (EPSG:3395)
+# ---------------------------------------------------------------------------
+
+@njit(nogil=True, cache=True)
+def _emerc_fwd_point(lon_deg, lat_deg, k0, e):
+    """(lon, lat) deg -> (x, y) metres, ellipsoidal Mercator."""
+    lam = math.radians(lon_deg)
+    phi = math.radians(lat_deg)
+    sinphi = math.sin(phi)
+    x = k0 * _WGS84_A * lam
+    y = k0 * _WGS84_A * (math.asinh(math.tan(phi)) - e * math.atanh(e * sinphi))
+    return x, y
+
+
+@njit(nogil=True, cache=True)
+def _emerc_inv_point(x, y, k0, e):
+    """(x, y) metres -> (lon, lat) deg, ellipsoidal Mercator."""
+    lam = x / (k0 * _WGS84_A)
+    taup = math.sinh(y / (k0 * _WGS84_A))
+    tau = _pj_sinhpsi2tanphi(taup, e)
+    return math.degrees(lam), math.degrees(math.atan(tau))
+
+
+@njit(nogil=True, cache=True, parallel=True)
+def emerc_forward(lons, lats, out_x, out_y, k0, e):
+    for i in prange(lons.shape[0]):
+        out_x[i], out_y[i] = _emerc_fwd_point(lons[i], lats[i], k0, e)
+
+
+@njit(nogil=True, cache=True, parallel=True)
+def emerc_inverse(xs, ys, out_lon, out_lat, k0, e):
+    for i in prange(xs.shape[0]):
+        out_lon[i], out_lat[i] = _emerc_inv_point(xs[i], ys[i], k0, e)
+
+
+# ---------------------------------------------------------------------------
+# Lambert Conformal Conic  (LCC)
+# ---------------------------------------------------------------------------
+
+def _lcc_params(crs):
+    """Extract LCC projection parameters from a pyproj CRS.
+
+    Returns (lon0, lat0, n, c, rho0, k0) or None.
+    """
+    try:
+        d = crs.to_dict()
+    except Exception:
+        return None
+    if d.get('proj') != 'lcc':
+        return None
+    if not _is_wgs84_compatible_ellipsoid(crs):
+        return None
+
+    units = d.get('units', 'm')
+    _UNIT_TO_METER = {'m': 1.0, 'us-ft': 0.3048006096012192, 'ft': 0.3048}
+    to_meter = _UNIT_TO_METER.get(units)
+    if to_meter is None:
+        return None
+
+    lat_1 = math.radians(d.get('lat_1', d.get('lat_0', 0.0)))
+    lat_2 = math.radians(d.get('lat_2', lat_1))
+    lat_0 = math.radians(d.get('lat_0', 0.0))
+    lon_0 = math.radians(d.get('lon_0', 0.0))
+    k0_param = d.get('k_0', d.get('k', 1.0))
+
+    e = _WGS84_E
+    a = _WGS84_A
+
+    sinphi1 = math.sin(lat_1)
+    cosphi1 = math.cos(lat_1)
+    sinphi2 = math.sin(lat_2)
+
+    m1 = cosphi1 / math.sqrt(1.0 - _WGS84_E2 * sinphi1 * sinphi1)
+    ts1 = math.tan(math.pi / 4.0 - lat_1 / 2.0) * math.pow(
+        (1.0 + e * sinphi1) / (1.0 - e * sinphi1), e / 2.0)
+
+    if abs(lat_1 - lat_2) > 1e-10:
+        m2 = cosphi2 = math.cos(lat_2)
+        cosphi2 /= math.sqrt(1.0 - _WGS84_E2 * sinphi2 * sinphi2)
+        ts2 = math.tan(math.pi / 4.0 - lat_2 / 2.0) * math.pow(
+            (1.0 + e * sinphi2) / (1.0 - e * sinphi2), e / 2.0)
+        n = math.log(m1 / cosphi2) / math.log(ts1 / ts2)
+    else:
+        n = sinphi1
+
+    c = m1 * math.pow(ts1, -n) / n
+    sinphi0 = math.sin(lat_0)
+    ts0 = math.tan(math.pi / 4.0 - lat_0 / 2.0) * math.pow(
+        (1.0 + e * sinphi0) / (1.0 - e * sinphi0), e / 2.0)
+    rho0 = a * k0_param * c * math.pow(ts0, n)
+
+    fe = d.get('x_0', 0.0)   # always in metres in PROJ4 dict
+    fn = d.get('y_0', 0.0)
+
+    return lon_0, n, c, rho0, k0_param, fe, fn, to_meter
+
+
+@njit(nogil=True, cache=True)
+def _lcc_fwd_point(lon_deg, lat_deg, lon0, n, c, rho0, k0, e, a):
+    phi = math.radians(lat_deg)
+    lam = math.radians(lon_deg) - lon0
+    sinphi = math.sin(phi)
+    ts = math.tan(math.pi / 4.0 - phi / 2.0) * math.pow(
+        (1.0 + e * sinphi) / (1.0 - e * sinphi), e / 2.0)
+    rho = a * k0 * c * math.pow(ts, n)
+    lam_n = n * lam
+    x = rho * math.sin(lam_n)
+    y = rho0 - rho * math.cos(lam_n)
+    return x, y
+
+
+@njit(nogil=True, cache=True)
+def _lcc_inv_point(x, y, lon0, n, c, rho0, k0, e, a):
+    rho0_y = rho0 - y
+    if n < 0.0:
+        rho = -math.hypot(x, rho0_y)
+        lam_n = math.atan2(-x, -rho0_y)
+    else:
+        rho = math.hypot(x, rho0_y)
+        lam_n = math.atan2(x, rho0_y)
+    if abs(rho) < 1e-30:
+        return math.degrees(lon0 + lam_n / n), 90.0 if n > 0 else -90.0
+    ts = math.pow(rho / (a * k0 * c), 1.0 / n)
+    # Recover phi from ts via Newton (pj_sinhpsi2tanphi)
+    phi_approx = math.pi / 2.0 - 2.0 * math.atan(ts)
+    taup = math.sinh(math.log(1.0 / ts))  # sinh(psi)
+    tau = _pj_sinhpsi2tanphi(taup, e)
+    phi = math.atan(tau)
+    lam = lam_n / n
+    return math.degrees(_norm_lon_rad(lam + lon0)), math.degrees(phi)
+
+
+@njit(nogil=True, cache=True, parallel=True)
+def lcc_forward(lons, lats, out_x, out_y,
+                lon0, n, c, rho0, k0, fe, fn, e, a):
+    for i in prange(lons.shape[0]):
+        x, y = _lcc_fwd_point(lons[i], lats[i], lon0, n, c, rho0, k0, e, a)
+        out_x[i] = x + fe
+        out_y[i] = y + fn
+
+
+@njit(nogil=True, cache=True, parallel=True)
+def lcc_inverse(xs, ys, out_lon, out_lat,
+                lon0, n, c, rho0, k0, fe, fn, e, a):
+    for i in prange(xs.shape[0]):
+        out_lon[i], out_lat[i] = _lcc_inv_point(
+            xs[i] - fe, ys[i] - fn, lon0, n, c, rho0, k0, e, a)
+
+
+@njit(nogil=True, cache=True, parallel=True)
+def lcc_inverse_2d(x_1d, y_1d, out_lon_2d, out_lat_2d,
+                   lon0, n, c, rho0, k0, fe, fn, e, a, to_m):
+    """2D LCC inverse from 1D coordinate arrays, with built-in unit conversion.
+
+    Avoids np.tile/np.repeat (saves ~550ms for 4096x4096) and fuses
+    the unit conversion into the inner loop.
+    """
+    h = y_1d.shape[0]
+    w = x_1d.shape[0]
+    for i in prange(h):
+        y_m = y_1d[i] * to_m - fn
+        for j in range(w):
+            x_m = x_1d[j] * to_m - fe
+            out_lon_2d[i, j], out_lat_2d[i, j] = _lcc_inv_point(
+                x_m, y_m, lon0, n, c, rho0, k0, e, a)
+
+
+@njit(nogil=True, cache=True, parallel=True)
+def tmerc_inverse_2d(x_1d, y_1d, out_lon_2d, out_lat_2d,
+                     lon0, k0, fe, fn, Qn, beta, cgb, to_m):
+    """2D tmerc inverse from 1D coordinate arrays, with unit conversion."""
+    h = y_1d.shape[0]
+    w = x_1d.shape[0]
+    for i in prange(h):
+        y_m = y_1d[i] * to_m - fn
+        for j in range(w):
+            x_m = x_1d[j] * to_m - fe
+            out_lon_2d[i, j], out_lat_2d[i, j] = _tmerc_inv_point(
+                x_m, y_m, lon0, k0, Qn, beta, cgb)
+
+
+# ---------------------------------------------------------------------------
+# Albers Equal Area Conic  (AEA)
+# ---------------------------------------------------------------------------
+
+def _aea_params(crs):
+    """Extract AEA projection parameters from a pyproj CRS.
+
+    Returns (lon0, n, c, dd, rho0, fe, fn) or None.
+    """
+    try:
+        d = crs.to_dict()
+    except Exception:
+        return None
+    if d.get('proj') != 'aea':
+        return None
+
+    lat_1 = math.radians(d.get('lat_1', 0.0))
+    lat_2 = math.radians(d.get('lat_2', lat_1))
+    lat_0 = math.radians(d.get('lat_0', 0.0))
+    lon_0 = math.radians(d.get('lon_0', 0.0))
+
+    e = _WGS84_E
+    e2 = _WGS84_E2
+    a = _WGS84_A
+
+    sinphi1 = math.sin(lat_1)
+    cosphi1 = math.cos(lat_1)
+    sinphi2 = math.sin(lat_2)
+    cosphi2 = math.cos(lat_2)
+
+    m1 = cosphi1 / math.sqrt(1.0 - e2 * sinphi1 * sinphi1)
+    m2 = cosphi2 / math.sqrt(1.0 - e2 * sinphi2 * sinphi2)
+    q1 = _authalic_q(sinphi1, e)
+    q2 = _authalic_q(sinphi2, e)
+    q0 = _authalic_q(math.sin(lat_0), e)
+
+    if abs(lat_1 - lat_2) > 1e-10:
+        n = (m1 * m1 - m2 * m2) / (q2 - q1)
+    else:
+        n = sinphi1
+
+    C = m1 * m1 + n * q1
+    rho0 = a * math.sqrt(C - n * q0) / n
+
+    fe = d.get('x_0', 0.0)
+    fn = d.get('y_0', 0.0)
+
+    return lon_0, n, C, rho0, fe, fn
+
+
+@njit(nogil=True, cache=True)
+def _aea_fwd_point(lon_deg, lat_deg, lon0, n, C, rho0, e, a):
+    phi = math.radians(lat_deg)
+    lam = math.radians(lon_deg) - lon0
+    q = _authalic_q(math.sin(phi), e)
+    val = C - n * q
+    if val < 0.0:
+        val = 0.0
+    rho = a * math.sqrt(val) / n
+    theta = n * lam
+    x = rho * math.sin(theta)
+    y = rho0 - rho * math.cos(theta)
+    return x, y
+
+
+@njit(nogil=True, cache=True)
+def _aea_inv_point(x, y, lon0, n, C, rho0, e, a, qp, apa):
+    rho0_y = rho0 - y
+    if n < 0.0:
+        rho = -math.hypot(x, rho0_y)
+        theta = math.atan2(-x, -rho0_y)
+    else:
+        rho = math.hypot(x, rho0_y)
+        theta = math.atan2(x, rho0_y)
+    q = (C - (rho * rho * n * n) / (a * a)) / n
+    # beta = asin(q / qp), clamped
+    ratio = q / qp
+    if ratio > 1.0:
+        ratio = 1.0
+    elif ratio < -1.0:
+        ratio = -1.0
+    beta = math.asin(ratio)
+    phi = _authalic_inv(beta, apa)
+    lam = theta / n
+    return math.degrees(_norm_lon_rad(lam + lon0)), math.degrees(phi)
+
+
+@njit(nogil=True, cache=True, parallel=True)
+def aea_forward(lons, lats, out_x, out_y,
+                lon0, n, C, rho0, fe, fn, e, a):
+    for i in prange(lons.shape[0]):
+        x, y = _aea_fwd_point(lons[i], lats[i], lon0, n, C, rho0, e, a)
+        out_x[i] = x + fe
+        out_y[i] = y + fn
+
+
+@njit(nogil=True, cache=True, parallel=True)
+def aea_inverse(xs, ys, out_lon, out_lat,
+                lon0, n, C, rho0, fe, fn, e, a, qp, apa):
+    for i in prange(xs.shape[0]):
+        out_lon[i], out_lat[i] = _aea_inv_point(
+            xs[i] - fe, ys[i] - fn, lon0, n, C, rho0, e, a, qp, apa)
+
+
+# ---------------------------------------------------------------------------
+# Cylindrical Equal Area  (CEA)
+# ---------------------------------------------------------------------------
+
+def _cea_params(crs):
+    """Extract CEA projection parameters from a pyproj CRS.
+
+    Returns (lon0, k0, fe, fn) or None.
+    """
+    try:
+        d = crs.to_dict()
+    except Exception:
+        return None
+    if d.get('proj') != 'cea':
+        return None
+
+    lon_0 = math.radians(d.get('lon_0', 0.0))
+    lat_ts = math.radians(d.get('lat_ts', 0.0))
+    sinlts = math.sin(lat_ts)
+    coslts = math.cos(lat_ts)
+    # k0 = cos(lat_ts) / sqrt(1 - e² sin²(lat_ts))
+    k0 = coslts / math.sqrt(1.0 - _WGS84_E2 * sinlts * sinlts)
+    fe = d.get('x_0', 0.0)
+    fn = d.get('y_0', 0.0)
+    return lon_0, k0, fe, fn
+
+
+@njit(nogil=True, cache=True)
+def _cea_fwd_point(lon_deg, lat_deg, lon0, k0, e, a, qp):
+    lam = math.radians(lon_deg) - lon0
+    phi = math.radians(lat_deg)
+    q = _authalic_q(math.sin(phi), e)
+    x = a * k0 * lam
+    y = a * q / (2.0 * k0)
+    return x, y
+
+
+@njit(nogil=True, cache=True)
+def _cea_inv_point(x, y, lon0, k0, e, a, qp, apa):
+    lam = x / (a * k0)
+    ratio = 2.0 * y * k0 / (a * qp)
+    if ratio > 1.0:
+        ratio = 1.0
+    elif ratio < -1.0:
+        ratio = -1.0
+    beta = math.asin(ratio)
+    phi = _authalic_inv(beta, apa)
+    return math.degrees(_norm_lon_rad(lam + lon0)), math.degrees(phi)
+
+
+@njit(nogil=True, cache=True, parallel=True)
+def cea_forward(lons, lats, out_x, out_y,
+                lon0, k0, fe, fn, e, a, qp):
+    for i in prange(lons.shape[0]):
+        x, y = _cea_fwd_point(lons[i], lats[i], lon0, k0, e, a, qp)
+        out_x[i] = x + fe
+        out_y[i] = y + fn
+
+
+@njit(nogil=True, cache=True, parallel=True)
+def cea_inverse(xs, ys, out_lon, out_lat,
+                lon0, k0, fe, fn, e, a, qp, apa):
+    for i in prange(xs.shape[0]):
+        out_lon[i], out_lat[i] = _cea_inv_point(
+            xs[i] - fe, ys[i] - fn, lon0, k0, e, a, qp, apa)
+
+
+# ---------------------------------------------------------------------------
+# Shared: Meridional arc length (pj_mlfn / pj_enfn / pj_inv_mlfn)
+# Used by Sinusoidal ellipsoidal
+# ---------------------------------------------------------------------------
+
+def _mlfn_coeffs(es):
+    """Precompute 5 coefficients for meridional arc length.
+
+    Matches PROJ's pj_enfn exactly.  Returns array en[0..4].
+    """
+    en = np.empty(5, dtype=np.float64)
+    # Constants from PROJ mlfn.cpp
+    en[0] = 1.0 - es * (0.25 + es * (0.046875 + es * (0.01953125 + es * 0.01068115234375)))
+    en[1] = es * (0.75 - es * (0.046875 + es * (0.01953125 + es * 0.01068115234375)))
+    t = es * es
+    en[2] = t * (0.46875 - es * (0.013020833333333334 + es * 0.007120768229166667))
+    en[3] = t * es * (0.3645833333333333 - es * 0.005696614583333333)
+    en[4] = t * es * es * 0.3076171875
+    return en
+
+
+@njit(nogil=True, cache=True)
+def _mlfn(phi, sinphi, cosphi, en):
+    """Meridional arc length from equator to phi.
+
+    Matches PROJ's pj_mlfn: recurrence in sin^2(phi).
+    """
+    cphi = cosphi * sinphi  # = sin(2*phi)/2
+    sphi = sinphi * sinphi  # = sin^2(phi)
+    return en[0] * phi - cphi * (en[1] + sphi * (en[2] + sphi * (en[3] + sphi * en[4])))
+
+
+@njit(nogil=True, cache=True)
+def _inv_mlfn(arg, e2, en):
+    """Inverse meridional arc length: M -> phi.  Newton iteration."""
+    k = 1.0 / (1.0 - e2)
+    phi = arg
+    for _ in range(20):
+        s = math.sin(phi)
+        c = math.cos(phi)
+        t = 1.0 - e2 * s * s
+        dphi = (arg - _mlfn(phi, s, c, en)) * t * math.sqrt(t) * k
+        phi += dphi
+        if abs(dphi) < 1e-14:
+            break
+    return phi
+
+
+# Precompute for WGS84
+_MLFN_EN = _mlfn_coeffs(_WGS84_E2)
+
+
+# ---------------------------------------------------------------------------
+# Sinusoidal  (ellipsoidal)
+# ---------------------------------------------------------------------------
+
+def _sinu_params(crs):
+    """Extract Sinusoidal parameters from a pyproj CRS.
+
+    Returns (lon0, fe, fn) or None.
+    """
+    try:
+        d = crs.to_dict()
+    except Exception:
+        return None
+    if d.get('proj') != 'sinu':
+        return None
+    if not _is_wgs84_compatible_ellipsoid(crs):
+        return None
+    lon_0 = math.radians(d.get('lon_0', 0.0))
+    fe = d.get('x_0', 0.0)
+    fn = d.get('y_0', 0.0)
+    return lon_0, fe, fn
+
+
+@njit(nogil=True, cache=True)
+def _sinu_fwd_point(lon_deg, lat_deg, lon0, e2, a, en):
+    phi = math.radians(lat_deg)
+    lam = math.radians(lon_deg) - lon0
+    s = math.sin(phi)
+    c = math.cos(phi)
+    ms = _mlfn(phi, s, c, en)
+    x = a * lam * c / math.sqrt(1.0 - e2 * s * s)
+    y = a * ms
+    return x, y
+
+
+@njit(nogil=True, cache=True)
+def _sinu_inv_point(x, y, lon0, e2, a, en):
+    phi = _inv_mlfn(y / a, e2, en)
+    s = math.sin(phi)
+    c = math.cos(phi)
+    if abs(c) < 1e-14:
+        lam = 0.0
+    else:
+        lam = x * math.sqrt(1.0 - e2 * s * s) / (a * c)
+    return math.degrees(_norm_lon_rad(lam + lon0)), math.degrees(phi)
+
+
+@njit(nogil=True, cache=True, parallel=True)
+def sinu_forward(lons, lats, out_x, out_y,
+                 lon0, fe, fn, e2, a, en):
+    for i in prange(lons.shape[0]):
+        x, y = _sinu_fwd_point(lons[i], lats[i], lon0, e2, a, en)
+        out_x[i] = x + fe
+        out_y[i] = y + fn
+
+
+@njit(nogil=True, cache=True, parallel=True)
+def sinu_inverse(xs, ys, out_lon, out_lat,
+                 lon0, fe, fn, e2, a, en):
+    for i in prange(xs.shape[0]):
+        out_lon[i], out_lat[i] = _sinu_inv_point(
+            xs[i] - fe, ys[i] - fn, lon0, e2, a, en)
+
+
+# ---------------------------------------------------------------------------
+# Lambert Azimuthal Equal Area  (LAEA)  --  oblique & polar
+# ---------------------------------------------------------------------------
+
+def _laea_params(crs):
+    """Extract LAEA parameters from a pyproj CRS.
+
+    Returns (lon0, lat0, sinb1, cosb1, dd, xmf, ymf, rq, qp, fe, fn, mode)
+    where mode: 0=OBLIQ, 1=EQUIT, 2=N_POLE, 3=S_POLE.
+    Or None if not LAEA.
+    """
+    try:
+        d = crs.to_dict()
+    except Exception:
+        return None
+    if d.get('proj') != 'laea':
+        return None
+    if not _is_wgs84_compatible_ellipsoid(crs):
+        return None
+
+    lon_0 = math.radians(d.get('lon_0', 0.0))
+    lat_0 = math.radians(d.get('lat_0', 0.0))
+    fe = d.get('x_0', 0.0)
+    fn = d.get('y_0', 0.0)
+
+    e = _WGS84_E
+    a = _WGS84_A
+    e2 = _WGS84_E2
+
+    qp = _authalic_q(1.0, e)
+    rq = math.sqrt(0.5 * qp)
+
+    EPS10 = 1e-10
+    if abs(lat_0 - math.pi / 2) < EPS10:
+        mode = 2  # N_POLE
+    elif abs(lat_0 + math.pi / 2) < EPS10:
+        mode = 3  # S_POLE
+    elif abs(lat_0) < EPS10:
+        mode = 1  # EQUIT
+    else:
+        mode = 0  # OBLIQ
+
+    if mode == 0:  # OBLIQ
+        sinphi0 = math.sin(lat_0)
+        q0 = _authalic_q(sinphi0, e)
+        sinb1 = q0 / qp
+        cosb1 = math.sqrt(1.0 - sinb1 * sinb1)
+        m1 = math.cos(lat_0) / math.sqrt(1.0 - e2 * sinphi0 * sinphi0)
+        dd = m1 / (rq * cosb1)
+        # PROJ: xmf = rq * dd, ymf = rq / dd
+        xmf = rq * dd
+        ymf = rq / dd
+    elif mode == 1:  # EQUIT
+        sinb1 = 0.0
+        cosb1 = 1.0
+        m1 = math.cos(lat_0) / math.sqrt(1.0 - e2 * math.sin(lat_0)**2)
+        dd = m1 / rq
+        xmf = rq * dd
+        ymf = rq / dd
+    else:  # POLAR
+        sinb1 = 1.0 if mode == 2 else -1.0
+        cosb1 = 0.0
+        dd = 1.0
+        xmf = rq
+        ymf = rq
+
+    return lon_0, lat_0, sinb1, cosb1, dd, xmf, ymf, rq, qp, fe, fn, mode
+
+
+@njit(nogil=True, cache=True)
+def _laea_fwd_point(lon_deg, lat_deg, lon0, sinb1, cosb1,
+                    xmf, ymf, rq, qp, e, a, e2, mode):
+    phi = math.radians(lat_deg)
+    lam = math.radians(lon_deg) - lon0
+    sinphi = math.sin(phi)
+    q = (1.0 - e2) * (sinphi / (1.0 - e2 * sinphi * sinphi)
+                       + math.atanh(e * sinphi) / e)
+    sinb = q / qp
+    if sinb > 1.0:
+        sinb = 1.0
+    elif sinb < -1.0:
+        sinb = -1.0
+    cosb = math.sqrt(1.0 - sinb * sinb)
+    coslam = math.cos(lam)
+    sinlam = math.sin(lam)
+
+    if mode == 0:  # OBLIQ
+        denom = 1.0 + sinb1 * sinb + cosb1 * cosb * coslam
+        if denom < 1e-30:
+            denom = 1e-30
+        b = math.sqrt(2.0 / denom)
+        x = a * xmf * b * cosb * sinlam
+        y = a * ymf * b * (cosb1 * sinb - sinb1 * cosb * coslam)
+    elif mode == 1:  # EQUIT
+        denom = 1.0 + cosb * coslam
+        if denom < 1e-30:
+            denom = 1e-30
+        b = math.sqrt(2.0 / denom)
+        x = a * xmf * b * cosb * sinlam
+        y = a * ymf * b * sinb
+    elif mode == 2:  # N_POLE
+        q_diff = qp - q
+        if q_diff < 0.0:
+            q_diff = 0.0
+        rho = a * math.sqrt(q_diff)
+        x = rho * sinlam
+        y = -rho * coslam
+    else:  # S_POLE
+        q_diff = qp + q
+        if q_diff < 0.0:
+            q_diff = 0.0
+        rho = a * math.sqrt(q_diff)
+        x = rho * sinlam
+        y = rho * coslam
+    return x, y
+
+
+@njit(nogil=True, cache=True)
+def _laea_inv_point(x, y, lon0, sinb1, cosb1,
+                    xmf, ymf, rq, qp, e, a, e2, mode, apa):
+    if mode == 2 or mode == 3:  # POLAR
+        x_a = x / a
+        y_a = y / a
+        rho = math.hypot(x_a, y_a)
+        if rho < 1e-30:
+            return math.degrees(lon0), 90.0 if mode == 2 else -90.0
+        q = qp - rho * rho
+        if mode == 3:
+            q = -(qp - rho * rho)
+            lam = math.atan2(x_a, y_a)
+        else:
+            lam = math.atan2(x_a, -y_a)
+    else:  # OBLIQ or EQUIT
+        # PROJ: x /= dd, y *= dd (undo the xmf/ymf scaling)
+        xn = x / (a * xmf)   # = x / (a * rq * dd)
+        yn = y / (a * ymf)   # = y / (a * rq / dd) = y * dd / (a * rq)
+        rho = math.hypot(xn, yn)
+        if rho < 1e-30:
+            return math.degrees(lon0), math.degrees(math.asin(sinb1))
+        sce = 2.0 * math.asin(0.5 * rho / rq)
+        sinz = math.sin(sce)
+        cosz = math.cos(sce)
+        if mode == 0:  # OBLIQ
+            ab = cosz * sinb1 + yn * sinz * cosb1 / rho
+            lam = math.atan2(xn * sinz,
+                              rho * cosb1 * cosz - yn * sinb1 * sinz)
+        else:  # EQUIT
+            ab = yn * sinz / rho
+            lam = math.atan2(xn * sinz, rho * cosz)
+        q = qp * ab
+
+    # q -> phi via authalic inverse
+    ratio = q / qp
+    if ratio > 1.0:
+        ratio = 1.0
+    elif ratio < -1.0:
+        ratio = -1.0
+    beta = math.asin(ratio)
+    phi = beta + apa[0] * math.sin(2.0 * beta) + apa[1] * math.sin(4.0 * beta) + apa[2] * math.sin(6.0 * beta)
+    return math.degrees(_norm_lon_rad(lam + lon0)), math.degrees(phi)
+
+
+@njit(nogil=True, cache=True, parallel=True)
+def laea_forward(lons, lats, out_x, out_y,
+                 lon0, sinb1, cosb1, xmf, ymf, rq, qp,
+                 fe, fn, e, a, e2, mode):
+    for i in prange(lons.shape[0]):
+        x, y = _laea_fwd_point(lons[i], lats[i], lon0, sinb1, cosb1,
+                                xmf, ymf, rq, qp, e, a, e2, mode)
+        out_x[i] = x + fe
+        out_y[i] = y + fn
+
+
+@njit(nogil=True, cache=True, parallel=True)
+def laea_inverse(xs, ys, out_lon, out_lat,
+                 lon0, sinb1, cosb1, xmf, ymf, rq, qp,
+                 fe, fn, e, a, e2, mode, apa):
+    for i in prange(xs.shape[0]):
+        out_lon[i], out_lat[i] = _laea_inv_point(
+            xs[i] - fe, ys[i] - fn, lon0, sinb1, cosb1,
+            xmf, ymf, rq, qp, e, a, e2, mode, apa)
+
+
+# ---------------------------------------------------------------------------
+# Polar Stereographic  (N_POLE / S_POLE only)
+# ---------------------------------------------------------------------------
+
+def _stere_params(crs):
+    """Extract Polar Stereographic parameters.
+
+    Returns (lon0, k0, akm1, fe, fn, is_south) or None.
+    Supports EPSG codes for UPS and common polar stereographic CRSs,
+    and generic stere/ups proj definitions with polar lat_0.
+    """
+    try:
+        d = crs.to_dict()
+    except Exception:
+        return None
+    proj = d.get('proj', '')
+    if proj not in ('stere', 'ups', 'sterea'):
+        return None
+    if not _is_wgs84_compatible_ellipsoid(crs):
+        return None
+
+    lat_0 = d.get('lat_0', 0.0)
+    if abs(abs(lat_0) - 90.0) > 1e-6:
+        return None  # only polar modes
+
+    is_south = lat_0 < 0
+
+    lon_0 = math.radians(d.get('lon_0', 0.0))
+    lat_ts = d.get('lat_ts', None)
+    k0 = d.get('k_0', d.get('k', None))
+
+    e = _WGS84_E
+    e2 = _WGS84_E2
+    a = _WGS84_A
+
+    if k0 is not None:
+        k0 = float(k0)
+    elif lat_ts is not None:
+        lat_ts_r = math.radians(abs(lat_ts))
+        sinlts = math.sin(lat_ts_r)
+        coslts = math.cos(lat_ts_r)
+        # k0 from latitude of true scale
+        m_ts = coslts / math.sqrt(1.0 - e2 * sinlts * sinlts)
+        t_ts = math.tan(math.pi / 4.0 - lat_ts_r / 2.0) * math.pow(
+            (1.0 + e * sinlts) / (1.0 - e * sinlts), e / 2.0)
+        t_90 = 0.0  # tan(pi/4 - pi/4) = 0 at the pole
+        # For polar: k0 = m_ts / (2 * t_ts) * (something)
+        # Actually, for UPS/polar stereographic:
+        # akm1 = a * m_ts / sqrt((1+e)^(1+e) * (1-e)^(1-e)) / (2 * t_ts)
+        # But simpler: akm1 = a * k0 * 2 / sqrt((1+e)^(1+e)*(1-e)^(1-e))
+        # Let's compute akm1 directly
+        half_e = e / 2.0
+        con = math.pow(1.0 + e, 1.0 + e) * math.pow(1.0 - e, 1.0 - e)
+        if abs(t_ts) < 1e-30:
+            # lat_ts = 90: use k0 formula
+            k0 = 1.0
+            akm1 = 2.0 * a / math.sqrt(con)
+        else:
+            akm1 = a * m_ts / t_ts
+        fe = d.get('x_0', 0.0)
+        fn = d.get('y_0', 0.0)
+        return lon_0, 0.0, akm1, fe, fn, is_south
+    else:
+        k0 = 0.994  # UPS default
+
+    half_e = e / 2.0
+    con = math.pow(1.0 + e, 1.0 + e) * math.pow(1.0 - e, 1.0 - e)
+    akm1 = a * k0 * 2.0 / math.sqrt(con)
+    fe = d.get('x_0', 0.0)
+    fn = d.get('y_0', 0.0)
+    return lon_0, k0, akm1, fe, fn, is_south
+
+
+@njit(nogil=True, cache=True)
+def _stere_fwd_point(lon_deg, lat_deg, lon0, akm1, e, is_south):
+    phi = math.radians(lat_deg)
+    lam = math.radians(lon_deg) - lon0
+
+    # For south pole: negate phi to compute ts for abs(phi),
+    # and use (sin, cos) instead of (sin, -cos) for (x, y).
+    abs_phi = -phi if is_south else phi
+    sinphi = math.sin(abs_phi)
+    es = e * sinphi
+    ts = math.tan(math.pi / 4.0 - abs_phi / 2.0) * math.pow(
+        (1.0 + es) / (1.0 - es), e / 2.0)
+    rho = akm1 * ts
+
+    if is_south:
+        x = rho * math.sin(lam)
+        y = rho * math.cos(lam)
+    else:
+        x = rho * math.sin(lam)
+        y = -rho * math.cos(lam)
+    return x, y
+
+
+@njit(nogil=True, cache=True)
+def _stere_inv_point(x, y, lon0, akm1, e, is_south):
+    if is_south:
+        rho = math.hypot(x, y)
+        lam = math.atan2(x, y)
+    else:
+        rho = math.hypot(x, y)
+        lam = math.atan2(x, -y)
+
+    if rho < 1e-30:
+        lat = -90.0 if is_south else 90.0
+        return math.degrees(lon0), lat
+
+    tp = rho / akm1
+    half_e = e / 2.0
+    phi = math.pi / 2.0 - 2.0 * math.atan(tp)
+    for _ in range(15):
+        sinphi = math.sin(phi)
+        es = e * sinphi
+        phi_new = math.pi / 2.0 - 2.0 * math.atan(
+            tp * math.pow((1.0 - es) / (1.0 + es), half_e))
+        if abs(phi_new - phi) < 1e-14:
+            phi = phi_new
+            break
+        phi = phi_new
+
+    if is_south:
+        phi = -phi
+
+    return math.degrees(_norm_lon_rad(lam + lon0)), math.degrees(phi)
+
+
+@njit(nogil=True, cache=True, parallel=True)
+def stere_forward(lons, lats, out_x, out_y,
+                  lon0, akm1, fe, fn, e, is_south):
+    south_f = 1.0 if is_south else 0.0
+    for i in prange(lons.shape[0]):
+        x, y = _stere_fwd_point(lons[i], lats[i], lon0, akm1, e, is_south)
+        out_x[i] = x + fe
+        out_y[i] = y + fn
+
+
+@njit(nogil=True, cache=True, parallel=True)
+def stere_inverse(xs, ys, out_lon, out_lat,
+                  lon0, akm1, fe, fn, e, is_south):
+    for i in prange(xs.shape[0]):
+        out_lon[i], out_lat[i] = _stere_inv_point(
+            xs[i] - fe, ys[i] - fn, lon0, akm1, e, is_south)
+
+
+# ---------------------------------------------------------------------------
+# ---------------------------------------------------------------------------
+# Oblique Stereographic (double projection: Gauss conformal + stereographic)
+# ---------------------------------------------------------------------------
+
+def _sterea_params(crs):
+    """Extract oblique stereographic parameters (Gauss conformal double projection).
+
+    Returns (lon0, sinc0, cosc0, R2, C_gauss, K_gauss, ratexp, fe, fn, e) or None.
+    """
+    try:
+        d = crs.to_dict()
+    except Exception:
+        return None
+    if d.get('proj') != 'sterea':
+        return None
+    if not _is_wgs84_compatible_ellipsoid(crs):
+        return None
+
+    lat_0 = math.radians(d.get('lat_0', 0.0))
+    lon_0 = math.radians(d.get('lon_0', 0.0))
+    k0 = float(d.get('k_0', d.get('k', 1.0)))
+    fe = d.get('x_0', 0.0)
+    fn = d.get('y_0', 0.0)
+
+    e = _WGS84_E
+    e2 = _WGS84_E2
+    a = _WGS84_A
+
+    # Gauss conformal sphere constants (from PROJ gauss.cpp)
+    sinphi0 = math.sin(lat_0)
+    cosphi0 = math.cos(lat_0)
+    C_gauss = math.sqrt(1.0 + e2 * cosphi0 ** 4 / (1.0 - e2))
+    R = math.sqrt(1.0 - e2) / (1.0 - e2 * sinphi0 * sinphi0)
+    ratexp = 0.5 * C_gauss * e
+
+    # Conformal latitude at origin
+    chi0 = math.asin(sinphi0 / C_gauss)
+
+    # Normalization constant K
+    srat0 = math.pow((1.0 - e * sinphi0) / (1.0 + e * sinphi0), ratexp)
+    K_gauss = (math.tan(math.pi / 4.0 + chi0 / 2.0)
+               / (math.pow(math.tan(math.pi / 4.0 + lat_0 / 2.0), C_gauss) * srat0))
+
+    sinc0 = math.sin(chi0)
+    cosc0 = math.cos(chi0)
+    # R is dimensionless; scale by a * k0 for metric output
+    R_metric = a * k0 * R
+
+    return lon_0, sinc0, cosc0, R_metric, C_gauss, K_gauss, ratexp, fe, fn, e
+
+
+@njit(nogil=True, cache=True)
+def _gauss_fwd(phi, lam, C, K, e, ratexp):
+    """Geodetic -> Gauss conformal sphere: (phi, lam) -> (chi, lam_conf)."""
+    sinphi = math.sin(phi)
+    srat = math.pow((1.0 - e * sinphi) / (1.0 + e * sinphi), ratexp)
+    chi = 2.0 * math.atan(K * math.pow(math.tan(math.pi / 4.0 + phi / 2.0), C) * srat) - math.pi / 2.0
+    lam_conf = C * lam
+    return chi, lam_conf
+
+
+@njit(nogil=True, cache=True)
+def _gauss_inv(chi, lam_conf, C, K, e, ratexp):
+    """Gauss conformal sphere -> geodetic: (chi, lam_conf) -> (phi, lam)."""
+    lam = lam_conf / C
+    num = math.pow(math.tan(math.pi / 4.0 + chi / 2.0) / K, 1.0 / C)
+    phi = chi
+    for _ in range(20):
+        sinphi = math.sin(phi)
+        phi_new = 2.0 * math.atan(
+            num * math.pow((1.0 + e * sinphi) / (1.0 - e * sinphi), e / 2.0)
+        ) - math.pi / 2.0
+        if abs(phi_new - phi) < 1e-14:
+            return phi_new, lam
+        phi = phi_new
+    return phi, lam
+
+
+@njit(nogil=True, cache=True)
+def _sterea_fwd_point(lon_deg, lat_deg, lon0, sinc0, cosc0, Rm,
+                      C, K, ratexp, e):
+    """Oblique stereographic forward.  Rm = a * k0 * R_conformal."""
+    lam = math.radians(lon_deg) - lon0
+    phi = math.radians(lat_deg)
+    chi, lam_c = _gauss_fwd(phi, lam, C, K, e, ratexp)
+    sinc = math.sin(chi)
+    cosc = math.cos(chi)
+    cosl = math.cos(lam_c)
+    sinl = math.sin(lam_c)
+    denom = 1.0 + sinc0 * sinc + cosc0 * cosc * cosl
+    if denom < 1e-30:
+        denom = 1e-30
+    k = 2.0 * Rm / denom
+    x = k * cosc * sinl
+    y = k * (cosc0 * sinc - sinc0 * cosc * cosl)
+    return x, y
+
+
+@njit(nogil=True, cache=True)
+def _sterea_inv_point(x, y, lon0, sinc0, cosc0, Rm,
+                      C, K, ratexp, e):
+    """Oblique stereographic inverse.  Rm = a * k0 * R_conformal."""
+    rho = math.hypot(x, y)
+    if rho < 1e-30:
+        phi, lam = _gauss_inv(math.asin(sinc0), 0.0, C, K, e, ratexp)
+        return math.degrees(_norm_lon_rad(lam + lon0)), math.degrees(phi)
+    ce = 2.0 * math.atan2(rho, 2.0 * Rm)
+    sinCe = math.sin(ce)
+    cosCe = math.cos(ce)
+    chi = math.asin(cosCe * sinc0 + y * sinCe * cosc0 / rho)
+    lam_c = math.atan2(x * sinCe, rho * cosc0 * cosCe - y * sinc0 * sinCe)
+    phi, lam = _gauss_inv(chi, lam_c, C, K, e, ratexp)
+    return math.degrees(_norm_lon_rad(lam + lon0)), math.degrees(phi)
+
+
+@njit(nogil=True, cache=True, parallel=True)
+def sterea_forward(lons, lats, out_x, out_y,
+                   lon0, sinc0, cosc0, R2, C, K, ratexp, fe, fn, e):
+    for i in prange(lons.shape[0]):
+        x, y = _sterea_fwd_point(lons[i], lats[i], lon0, sinc0, cosc0, R2,
+                                  C, K, ratexp, e)
+        out_x[i] = x + fe
+        out_y[i] = y + fn
+
+
+@njit(nogil=True, cache=True, parallel=True)
+def sterea_inverse(xs, ys, out_lon, out_lat,
+                   lon0, sinc0, cosc0, R2, C, K, ratexp, fe, fn, e):
+    for i in prange(xs.shape[0]):
+        out_lon[i], out_lat[i] = _sterea_inv_point(
+            xs[i] - fe, ys[i] - fn, lon0, sinc0, cosc0, R2,
+            C, K, ratexp, e)
+
+
+# ---------------------------------------------------------------------------
+# Oblique Mercator (Hotine variant)
+# ---------------------------------------------------------------------------
+
+def _omerc_params(crs):
+    """Extract Hotine Oblique Mercator parameters.
+
+    Returns (lon0, lat0, alpha, gamma, k0, fe, fn, uc,
+             singam, cosgam, sinaz, cosaz, BH, AH, e) or None.
+    """
+    try:
+        d = crs.to_dict()
+    except Exception:
+        return None
+    if d.get('proj') != 'omerc':
+        return None
+    if not _is_wgs84_compatible_ellipsoid(crs):
+        return None
+
+    lat_0 = math.radians(d.get('lat_0', 0.0))
+    lonc = math.radians(d.get('lonc', d.get('lon_0', 0.0)))
+    alpha = math.radians(d.get('alpha', 0.0))
+    gamma = math.radians(d.get('gamma', alpha))
+    k0 = float(d.get('k_0', d.get('k', 1.0)))
+    fe = d.get('x_0', 0.0)
+    fn = d.get('y_0', 0.0)
+    no_uoff = 'no_uoff' in d
+
+    e = _WGS84_E
+    e2 = _WGS84_E2
+    a = _WGS84_A
+
+    sinphi0 = math.sin(lat_0)
+    cosphi0 = math.cos(lat_0)
+    com = math.sqrt(1.0 - e2)
+
+    BH = math.sqrt(1.0 + e2 * cosphi0 ** 4 / (1.0 - e2))
+    AH = a * BH * k0 * com / (1.0 - e2 * sinphi0 * sinphi0)
+    D = BH * com / (cosphi0 * math.sqrt(1.0 - e2 * sinphi0 * sinphi0))
+    if D < 1.0:
+        D = 1.0
+    F = D + math.sqrt(max(D * D - 1.0, 0.0)) * (1.0 if lat_0 >= 0 else -1.0)
+    H = F * math.pow(
+        math.tan(math.pi / 4.0 - lat_0 / 2.0)
+        * math.pow((1.0 + e * sinphi0) / (1.0 - e * sinphi0), e / 2.0),
+        BH,
+    )
+    if abs(H) < 1e-30:
+        H = 1e-30
+    lam0 = lonc - math.asin(0.5 * (F - 1.0 / F) * math.tan(alpha) / D) / BH
+
+    singam = math.sin(gamma)
+    cosgam = math.cos(gamma)
+    sinaz = math.sin(alpha)
+    cosaz = math.cos(alpha)
+
+    if no_uoff:
+        uc = 0.0
+    else:
+        if abs(cosaz) < 1e-10:
+            uc = AH * (lonc - lam0)
+        else:
+            uc = AH / BH * math.atan(math.sqrt(max(D * D - 1.0, 0.0)) / cosaz)
+            if lat_0 < 0:
+                uc = -uc
+
+    return lam0, lat_0, k0, fe, fn, uc, singam, cosgam, sinaz, cosaz, BH, AH, H, F, e
+
+
+@njit(nogil=True, cache=True)
+def _omerc_fwd_point(lon_deg, lat_deg, lam0, singam, cosgam,
+                     sinaz, cosaz, BH, AH, H, F, e):
+    lam = math.radians(lon_deg) - lam0
+    phi = math.radians(lat_deg)
+    sinphi = math.sin(phi)
+
+    # Conformal latitude
+    S = BH * math.log(
+        math.tan(math.pi / 4.0 - phi / 2.0)
+        * math.pow((1.0 + e * sinphi) / (1.0 - e * sinphi), e / 2.0)
+    )
+    Q = math.exp(-BH * lam)
+    Vl = 0.5 * (H * math.exp(S) - math.exp(-S) / H)
+    Ul = 0.5 * (H * math.exp(S) + math.exp(-S) / H)
+    u = AH * math.atan2(Vl * cosaz + math.sin(BH * lam) * sinaz, math.cos(BH * lam))
+    v = 0.5 * AH * math.log((Ul - Vl * sinaz + math.sin(BH * lam) * cosaz)
+                              / (Ul + Vl * sinaz - math.sin(BH * lam) * cosaz))
+
+    x = v * cosgam + u * singam
+    y = u * cosgam - v * singam
+    return x, y
+
+
+@njit(nogil=True, cache=True)
+def _omerc_inv_point(x, y, lam0, uc, singam, cosgam,
+                     sinaz, cosaz, BH, AH, H, F, e):
+    v = x * cosgam - y * singam
+    u = y * cosgam + x * singam + uc
+
+    Qp = math.exp(-BH * v / AH)
+    Sp = 0.5 * (Qp - 1.0 / Qp)
+    Tp = 0.5 * (Qp + 1.0 / Qp)
+    Vp = math.sin(BH * u / AH)
+    Up = (Vp * cosaz + Sp * sinaz) / Tp
+
+    if abs(abs(Up) - 1.0) < 1e-14:
+        lam = 0.0
+        phi = math.pi / 2.0 if Up > 0 else -math.pi / 2.0
+    else:
+        phi = math.exp(math.log((F - Up) / (F + Up)) / BH / 2.0)
+        # phi here is actually t = tan(pi/4 - phi_geo/2) * ((1+e*sin)/(1-e*sin))^(e/2)
+        # Need to invert: iterate
+        tp = phi  # this is t
+        phi = math.pi / 2.0 - 2.0 * math.atan(tp)
+        for _ in range(15):
+            sinp = math.sin(phi)
+            es = e * sinp
+            phi_new = math.pi / 2.0 - 2.0 * math.atan(
+                tp * math.pow((1.0 - es) / (1.0 + es), e / 2.0))
+            if abs(phi_new - phi) < 1e-14:
+                phi = phi_new
+                break
+            phi = phi_new
+        lam = -math.atan2(Sp * cosaz - Vp * sinaz, math.cos(BH * u / AH)) / BH
+
+    return math.degrees(lam + lam0), math.degrees(phi)
+
+
+@njit(nogil=True, cache=True, parallel=True)
+def omerc_forward(lons, lats, out_x, out_y,
+                  lam0, fe, fn, uc, singam, cosgam, sinaz, cosaz,
+                  BH, AH, H, F, e):
+    for i in prange(lons.shape[0]):
+        x, y = _omerc_fwd_point(lons[i], lats[i], lam0, singam, cosgam,
+                                 sinaz, cosaz, BH, AH, H, F, e)
+        out_x[i] = x + fe
+        out_y[i] = y + fn
+
+
+@njit(nogil=True, cache=True, parallel=True)
+def omerc_inverse(xs, ys, out_lon, out_lat,
+                  lam0, fe, fn, uc, singam, cosgam, sinaz, cosaz,
+                  BH, AH, H, F, e):
+    for i in prange(xs.shape[0]):
+        out_lon[i], out_lat[i] = _omerc_inv_point(
+            xs[i] - fe, ys[i] - fn, lam0, uc, singam, cosgam,
+            sinaz, cosaz, BH, AH, H, F, e)
+
+
+# ---------------------------------------------------------------------------
+# Transverse Mercator / UTM  --  6th-order Krueger series (Karney 2011)
+# ---------------------------------------------------------------------------
+
+def _tmerc_coefficients(n):
+    """Precompute all series coefficients from third flattening *n*.
+
+    Returns (alpha, beta, cbg, cgb, Qn) where:
+    - alpha[0..5]: forward Krueger (conformal sphere -> rectifying)
+    - beta[0..5]:  inverse Krueger (rectifying -> conformal sphere)
+    - cbg[0..5]:   geographic -> conformal latitude
+    - cgb[0..5]:   conformal -> geographic latitude
+    - Qn:          rectifying radius * k0
+    """
+    n2 = n * n
+    n3 = n2 * n
+    n4 = n3 * n
+    n5 = n4 * n
+    n6 = n5 * n
+
+    # Rectifying radius (scaled by k0 later)
+    A = _WGS84_A / (1.0 + n) * (1.0 + n2 / 4.0 + n4 / 64.0 + n6 / 256.0)
+
+    # Forward Krueger: alpha[1..6]
+    alpha = np.array([
+        n / 2.0 - 2.0 * n2 / 3.0 + 5.0 * n3 / 16.0
+        + 41.0 * n4 / 180.0 - 127.0 * n5 / 288.0 + 7891.0 * n6 / 37800.0,
+
+        13.0 * n2 / 48.0 - 3.0 * n3 / 5.0 + 557.0 * n4 / 1440.0
+        + 281.0 * n5 / 630.0 - 1983433.0 * n6 / 1935360.0,
+
+        61.0 * n3 / 240.0 - 103.0 * n4 / 140.0 + 15061.0 * n5 / 26880.0
+        + 167603.0 * n6 / 181440.0,
+
+        49561.0 * n4 / 161280.0 - 179.0 * n5 / 168.0
+        + 6601661.0 * n6 / 7257600.0,
+
+        34729.0 * n5 / 80640.0 - 3418889.0 * n6 / 1995840.0,
+
+        212378941.0 * n6 / 319334400.0,
+    ], dtype=np.float64)
+
+    # Inverse Krueger: beta[1..6]
+    beta = np.array([
+        n / 2.0 - 2.0 * n2 / 3.0 + 37.0 * n3 / 96.0
+        - n4 / 360.0 - 81.0 * n5 / 512.0 + 96199.0 * n6 / 604800.0,
+
+        n2 / 48.0 + n3 / 15.0 - 437.0 * n4 / 1440.0
+        + 46.0 * n5 / 105.0 - 1118711.0 * n6 / 3870720.0,
+
+        17.0 * n3 / 480.0 - 37.0 * n4 / 840.0
+        - 209.0 * n5 / 4480.0 + 5569.0 * n6 / 90720.0,
+
+        4397.0 * n4 / 161280.0 - 11.0 * n5 / 504.0
+        - 830251.0 * n6 / 7257600.0,
+
+        4583.0 * n5 / 161280.0 - 108847.0 * n6 / 3991680.0,
+
+        20648693.0 * n6 / 638668800.0,
+    ], dtype=np.float64)
+
+    # Geographic -> Conformal latitude: cbg[1..6]
+    cbg = np.array([
+        n * (-2.0 + n * (2.0 / 3.0 + n * (4.0 / 3.0 + n * (-82.0 / 45.0
+        + n * (32.0 / 45.0 + n * 4642.0 / 4725.0))))),
+
+        n2 * (5.0 / 3.0 + n * (-16.0 / 15.0 + n * (-13.0 / 9.0
+        + n * (904.0 / 315.0 - n * 1522.0 / 945.0)))),
+
+        n3 * (-26.0 / 15.0 + n * (34.0 / 21.0 + n * (8.0 / 5.0
+        - n * 12686.0 / 2835.0))),
+
+        n4 * (1237.0 / 630.0 + n * (-12.0 / 5.0
+        - n * 24832.0 / 14175.0)),
+
+        n5 * (-734.0 / 315.0 + n * 109598.0 / 31185.0),
+
+        n6 * 444337.0 / 155925.0,
+    ], dtype=np.float64)
+
+    # Conformal -> Geographic latitude: cgb[1..6]
+    cgb = np.array([
+        n * (2.0 + n * (-2.0 / 3.0 + n * (-2.0 + n * (116.0 / 45.0
+        + n * (26.0 / 45.0 - n * 2854.0 / 675.0))))),
+
+        n2 * (7.0 / 3.0 + n * (-8.0 / 5.0 + n * (-227.0 / 45.0
+        + n * (2704.0 / 315.0 + n * 2323.0 / 945.0)))),
+
+        n3 * (56.0 / 15.0 + n * (-136.0 / 35.0 + n * (-1262.0 / 105.0
+        + n * 73814.0 / 2835.0))),
+
+        n4 * (4279.0 / 630.0 + n * (-332.0 / 35.0
+        - n * 399572.0 / 14175.0)),
+
+        n5 * (4174.0 / 315.0 - n * 144838.0 / 6237.0),
+
+        n6 * 601676.0 / 22275.0,
+    ], dtype=np.float64)
+
+    return alpha, beta, cbg, cgb, A
+
+
+# Precompute WGS84 coefficients once at import time
+_ALPHA, _BETA, _CBG, _CGB, _A_RECT = _tmerc_coefficients(_WGS84_N)
+
+
+def _clenshaw_sin_py(coeffs, angle):
+    """Pure-Python version of _clenshaw_sin for use in setup code."""
+    N = len(coeffs)
+    X = 2.0 * math.cos(2.0 * angle)
+    u0 = 0.0
+    u1 = 0.0
+    for k in range(N - 1, -1, -1):
+        t = X * u0 - u1 + coeffs[k]
+        u1 = u0
+        u0 = t
+    return math.sin(2.0 * angle) * u0
+
+
+def _clenshaw_complex_py(coeffs, sin2Cn, cos2Cn, sinh2Ce, cosh2Ce):
+    """Pure-Python version of _clenshaw_complex for use in setup code.
+
+    Returns just dCn (real part).
+    """
+    N = len(coeffs)
+    r = 2.0 * cos2Cn * cosh2Ce
+    im = -2.0 * sin2Cn * sinh2Ce
+    hr = 0.0; hi = 0.0; hr1 = 0.0; hi1 = 0.0
+    for k in range(N - 1, -1, -1):
+        hr2 = hr1; hi2 = hi1; hr1 = hr; hi1 = hi
+        hr = -hr2 + r * hr1 - im * hi1 + coeffs[k]
+        hi = -hi2 + im * hr1 + r * hi1
+    dCn = sin2Cn * cosh2Ce * hr - cos2Cn * sinh2Ce * hi
+    return dCn
+
+
+@njit(nogil=True, cache=True)
+def _clenshaw_sin(coeffs, angle):
+    """Evaluate SUM_{k=1}^{N} coeffs[k-1] * sin(2*k*angle) via Clenshaw."""
+    N = coeffs.shape[0]
+    X = 2.0 * math.cos(2.0 * angle)
+    u0 = 0.0
+    u1 = 0.0
+    for k in range(N - 1, -1, -1):
+        t = X * u0 - u1 + coeffs[k]
+        u1 = u0
+        u0 = t
+    return math.sin(2.0 * angle) * u0
+
+
+@njit(nogil=True, cache=True)
+def _clenshaw_complex(coeffs, sin2Cn, cos2Cn, sinh2Ce, cosh2Ce):
+    """Complex Clenshaw summation for Krueger series.
+
+    Evaluates SUM a[k] * sin(2k*(Cn + i*Ce)) returning (dCn, dCe).
+    """
+    N = coeffs.shape[0]
+    r = 2.0 * cos2Cn * cosh2Ce
+    im = -2.0 * sin2Cn * sinh2Ce
+
+    hr = 0.0
+    hi = 0.0
+    hr1 = 0.0
+    hi1 = 0.0
+    for k in range(N - 1, -1, -1):
+        hr2 = hr1
+        hi2 = hi1
+        hr1 = hr
+        hi1 = hi
+        hr = -hr2 + r * hr1 - im * hi1 + coeffs[k]
+        hi = -hi2 + im * hr1 + r * hi1
+
+    dCn = sin2Cn * cosh2Ce * hr - cos2Cn * sinh2Ce * hi
+    dCe = sin2Cn * cosh2Ce * hi + cos2Cn * sinh2Ce * hr
+    return dCn, dCe
+
+
+@njit(nogil=True, cache=True)
+def _tmerc_fwd_point(lon_deg, lat_deg, lon0_rad, k0, Qn,
+                     alpha, cbg):
+    """(lon, lat) degrees -> (E, N) metres for a Transverse Mercator projection."""
+    lam = math.radians(lon_deg) - lon0_rad
+    phi = math.radians(lat_deg)
+
+    # Step 1: geographic -> conformal latitude via Clenshaw
+    chi = phi + _clenshaw_sin(cbg, phi)
+
+    sin_chi = math.sin(chi)
+    cos_chi = math.cos(chi)
+    sin_lam = math.sin(lam)
+    cos_lam = math.cos(lam)
+
+    # Step 2: conformal sphere -> isometric
+    denom = math.hypot(sin_chi, cos_chi * cos_lam)
+    if denom < 1e-30:
+        denom = 1e-30
+    Cn = math.atan2(sin_chi, cos_chi * cos_lam)
+    tan_Ce = sin_lam * cos_chi / denom
+    # Clamp to avoid NaN in asinh at extreme values
+    if tan_Ce > 1e15:
+        tan_Ce = 1e15
+    elif tan_Ce < -1e15:
+        tan_Ce = -1e15
+    Ce = math.asinh(tan_Ce)
+
+    # Step 3: Krueger series correction (complex Clenshaw)
+    inv_d = 1.0 / denom
+    inv_d2 = inv_d * inv_d
+    cos_chi_cos_lam = cos_chi * cos_lam
+    sin2 = 2.0 * sin_chi * cos_chi_cos_lam * inv_d2
+    cos2 = 2.0 * cos_chi_cos_lam * cos_chi_cos_lam * inv_d2 - 1.0
+    sinh2 = 2.0 * tan_Ce * inv_d
+    cosh2 = 2.0 * inv_d2 - 1.0
+
+    dCn, dCe = _clenshaw_complex(alpha, sin2, cos2, sinh2, cosh2)
+    Cn += dCn
+    Ce += dCe
+
+    # Step 4: scale
+    x = Qn * Ce   # easting before false easting
+    y = Qn * Cn   # northing (Zb = 0 for UTM since phi0 = 0)
+    return x, y
+
+
+@njit(nogil=True, cache=True)
+def _tmerc_inv_point(x, y, lon0_rad, k0, Qn, beta, cgb):
+    """(E, N) metres -> (lon, lat) degrees for a Transverse Mercator projection."""
+    Cn = y / Qn
+    Ce = x / Qn
+
+    # Step 2: inverse Krueger series
+    sin2Cn = math.sin(2.0 * Cn)
+    cos2Cn = math.cos(2.0 * Cn)
+    exp2Ce = math.exp(2.0 * Ce)
+    inv_exp2Ce = 1.0 / exp2Ce
+    sinh2Ce = 0.5 * (exp2Ce - inv_exp2Ce)
+    cosh2Ce = 0.5 * (exp2Ce + inv_exp2Ce)
+
+    dCn, dCe = _clenshaw_complex(beta, sin2Cn, cos2Cn, sinh2Ce, cosh2Ce)
+    Cn -= dCn
+    Ce -= dCe
+
+    # Step 3: isometric -> conformal sphere
+    sin_Cn = math.sin(Cn)
+    cos_Cn = math.cos(Cn)
+    sinh_Ce = math.sinh(Ce)
+
+    lam = math.atan2(sinh_Ce, cos_Cn)
+
+    # Step 4: conformal -> geographic latitude
+    modulus = math.hypot(sinh_Ce, cos_Cn)
+    chi = math.atan2(sin_Cn, modulus)
+
+    phi = chi + _clenshaw_sin(cgb, chi)
+
+    lon = math.degrees(lam + lon0_rad)
+    lat = math.degrees(phi)
+    return lon, lat
+
+
+@njit(nogil=True, cache=True, parallel=True)
+def tmerc_forward(lons, lats, out_x, out_y,
+                  lon0_rad, k0, false_e, false_n,
+                  Qn, alpha, cbg):
+    """Batch geographic -> Transverse Mercator."""
+    for i in prange(lons.shape[0]):
+        x, y = _tmerc_fwd_point(lons[i], lats[i], lon0_rad, k0, Qn,
+                                alpha, cbg)
+        out_x[i] = x + false_e
+        out_y[i] = y + false_n
+
+
+@njit(nogil=True, cache=True, parallel=True)
+def tmerc_inverse(xs, ys, out_lon, out_lat,
+                  lon0_rad, k0, false_e, false_n,
+                  Qn, beta, cgb):
+    """Batch Transverse Mercator -> geographic."""
+    for i in prange(xs.shape[0]):
+        lon, lat = _tmerc_inv_point(
+            xs[i] - false_e, ys[i] - false_n,
+            lon0_rad, k0, Qn, beta, cgb)
+        out_lon[i] = lon
+        out_lat[i] = lat
+
+
+# ---------------------------------------------------------------------------
+# UTM zone helpers
+# ---------------------------------------------------------------------------
+
+def _utm_params(epsg_code):
+    """Extract UTM zone parameters from EPSG code.
+
+    Returns (lon0_rad, k0, false_easting, false_northing) or None.
+    """
+    # EPSG:326xx = UTM North, EPSG:327xx = UTM South (WGS84)
+    # EPSG:269xx = UTM North (NAD83, effectively same ellipsoid)
+    if epsg_code is None:
+        return None
+    if 32601 <= epsg_code <= 32660:
+        zone = epsg_code - 32600
+        south = False
+    elif 32701 <= epsg_code <= 32760:
+        zone = epsg_code - 32700
+        south = True
+    elif 26901 <= epsg_code <= 26923:
+        # NAD83 UTM zones 1-23
+        zone = epsg_code - 26900
+        south = False
+    else:
+        return None
+
+    lon0 = math.radians((zone - 1) * 6.0 - 180.0 + 3.0)  # central meridian
+    k0 = 0.9996
+    false_e = 500000.0
+    false_n = 10000000.0 if south else 0.0
+    return lon0, k0, false_e, false_n
+
+
+def _tmerc_params(crs):
+    """Extract generic Transverse Mercator parameters from a pyproj CRS.
+
+    Handles State Plane, national grids, and any other tmerc definition.
+    Returns (lon0_rad, k0, false_easting, false_northing, Zb) or None.
+    Zb is the Krueger northing offset for non-zero lat_0.
+    """
+    try:
+        d = crs.to_dict()
+    except Exception:
+        return None
+    if d.get('proj') != 'tmerc':
+        return None
+    if not _is_wgs84_compatible_ellipsoid(crs):
+        return None  # e.g. BNG (Airy), NAD27 (Clarke 1866)
+
+    # Unit conversion: false easting/northing from to_dict() are in
+    # the CRS's native units.  The Krueger series works in metres,
+    # so we convert fe/fn to metres and return to_meter so the caller
+    # can scale the final projected coordinates.
+    units = d.get('units', 'm')
+    _UNIT_TO_METER = {
+        'm': 1.0,
+        'us-ft': 0.3048006096012192,   # US survey foot
+        'ft': 0.3048,                   # international foot
+    }
+    to_meter = _UNIT_TO_METER.get(units)
+    if to_meter is None:
+        return None  # unsupported unit
+
+    lon_0 = math.radians(d.get('lon_0', 0.0))
+    lat_0 = math.radians(d.get('lat_0', 0.0))
+    k0 = float(d.get('k_0', d.get('k', 1.0)))
+    fe = d.get('x_0', 0.0)   # always in metres in PROJ4 dict
+    fn = d.get('y_0', 0.0)
+
+    # Compute Zb: northing offset for the origin latitude.
+    # For lat_0=0 (UTM), Zb=0.
+    Qn = k0 * _A_RECT
+    if abs(lat_0) < 1e-14:
+        Zb = 0.0
+    else:
+        # Conformal latitude of origin
+        Z = lat_0 + _clenshaw_sin_py(_CBG, lat_0)
+        # Forward Krueger correction at Ce=0 (central meridian)
+        sin2Z = math.sin(2.0 * Z)
+        cos2Z = math.cos(2.0 * Z)
+        dCn = 0.0
+        for k in range(5, -1, -1):
+            dCn = cos2Z * dCn + _ALPHA[k] * sin2Z
+            # This is a simplified Clenshaw for Ce=0 (sinh=0, cosh=1)
+        # Actually, use the proper complex Clenshaw with Ce=0:
+        # sin2=sin(2Z), cos2=cos(2Z), sinh2=0, cosh2=1
+        dCn_val = _clenshaw_complex_py(_ALPHA, sin2Z, cos2Z, 0.0, 1.0)
+        Zb = -Qn * (Z + dCn_val)
+
+    return lon_0, k0, fe, fn, Zb, to_meter
+
+
+# ---------------------------------------------------------------------------
+# Dispatch: detect fast-path CRS pairs
+# ---------------------------------------------------------------------------
+
+def _get_epsg(crs):
+    """Extract integer EPSG code from a pyproj.CRS, or None."""
+    try:
+        auth = crs.to_authority()
+        if auth and auth[0].upper() == 'EPSG':
+            return int(auth[1])
+    except Exception:
+        pass
+    return None
+
+
+def _is_geographic_wgs84_or_nad83(epsg):
+    """True for EPSG:4326 (WGS84) or EPSG:4269 (NAD83)."""
+    return epsg in (4326, 4269)
+
+
+def _is_supported_geographic(epsg):
+    """True for any geographic CRS we can handle (WGS84, NAD83, NAD27)."""
+    return epsg in (4326, 4269, 4267)
+
+
+def _is_wgs84_compatible_ellipsoid(crs):
+    """True if *crs* uses WGS84/GRS80 OR a datum we can Helmert-shift.
+
+    Returns True for WGS84/NAD83 (no shift needed) and for datums
+    with known Helmert parameters (NAD27, etc.) since the dispatch
+    will wrap the projection with a datum shift.
+    """
+    try:
+        d = crs.to_dict()
+    except Exception:
+        return False
+    ellps = d.get('ellps', '')
+    datum = d.get('datum', '')
+    # WGS84 and GRS80: no shift needed
+    if (ellps in ('WGS84', 'GRS80', '')
+            and datum in ('WGS84', 'NAD83', '')):
+        return True
+    # Check if we have Helmert parameters for this datum
+    key = datum if datum in _DATUM_PARAMS else ellps
+    return key in _DATUM_PARAMS
+
+
+@njit(nogil=True, cache=True, parallel=True)
+def _apply_datum_shift_inv(lon_arr, lat_arr, dx, dy, dz, rx, ry, rz, ds,
+                           a_src, f_src, a_tgt, f_tgt):
+    """Batch inverse 7-param Helmert: WGS84 -> source datum."""
+    for i in prange(lon_arr.shape[0]):
+        lon_arr[i], lat_arr[i] = _helmert7_inv(
+            lon_arr[i], lat_arr[i], dx, dy, dz, rx, ry, rz, ds,
+            a_src, f_src, a_tgt, f_tgt)
+
+
+@njit(nogil=True, cache=True, parallel=True)
+def _apply_datum_shift_fwd(lon_arr, lat_arr, dx, dy, dz, rx, ry, rz, ds,
+                           a_src, f_src, a_tgt, f_tgt):
+    """Batch forward 7-param Helmert: source datum -> WGS84."""
+    for i in prange(lon_arr.shape[0]):
+        lon_arr[i], lat_arr[i] = _helmert7_fwd(
+            lon_arr[i], lat_arr[i], dx, dy, dz, rx, ry, rz, ds,
+            a_src, f_src, a_tgt, f_tgt)
+
+
+def try_numba_transform(src_crs, tgt_crs, chunk_bounds, chunk_shape):
+    """Attempt a Numba JIT coordinate transform for the given CRS pair.
+
+    Returns (src_y, src_x) arrays if a fast path exists, or None to
+    fall back to pyproj.
+
+    For non-WGS84 datums with known Helmert parameters, the projection
+    kernel runs in WGS84 and a geocentric 3-parameter datum shift is
+    applied as a post-processing step.
+    """
+    src_epsg = _get_epsg(src_crs)
+    tgt_epsg = _get_epsg(tgt_crs)
+    if src_epsg is None and tgt_epsg is None:
+        return None
+
+    # Check if source or target needs a datum shift
+    src_datum = _get_datum_params(src_crs)
+    tgt_datum = _get_datum_params(tgt_crs)
+
+    height, width = chunk_shape
+    left, bottom, right, top = chunk_bounds
+    res_x = (right - left) / width
+    res_y = (top - bottom) / height
+
+    # Quick bail: if neither side is a geographic CRS we support, no fast path.
+    # This avoids the expensive array allocation below for unsupported pairs
+    # (e.g. same-CRS identity transforms in merge).
+    src_is_geo = _is_supported_geographic(src_epsg)
+    tgt_is_geo = _is_supported_geographic(tgt_epsg)
+    if not src_is_geo and not tgt_is_geo:
+        # Neither side is geographic -- can't be a supported pair
+        # (all our fast paths have geographic on one side)
+        return None
+
+    # Build output coordinate arrays (target CRS)
+    col_1d = np.arange(width, dtype=np.float64)
+    row_1d = np.arange(height, dtype=np.float64)
+    out_x_1d = left + (col_1d + 0.5) * res_x
+    out_y_1d = top - (row_1d + 0.5) * res_y
+
+    # Flatten for batch transform
+    out_x_flat = np.tile(out_x_1d, height)
+    out_y_flat = np.repeat(out_y_1d, width)
+    n = out_x_flat.shape[0]
+    src_x_flat = np.empty(n, dtype=np.float64)
+    src_y_flat = np.empty(n, dtype=np.float64)
+
+    # --- Geographic -> Web Mercator (inverse: Merc -> Geo) ---
+    if _is_supported_geographic(src_epsg) and tgt_epsg == 3857:
+        # Target is Mercator, need inverse: merc -> geo
+        merc_inverse(out_x_flat, out_y_flat, src_x_flat, src_y_flat)
+        return (src_y_flat.reshape(height, width),
+                src_x_flat.reshape(height, width))
+
+    if src_epsg == 3857 and _is_supported_geographic(tgt_epsg):
+        # Target is geographic, need forward: geo -> merc... wait, no.
+        # We need the INVERSE transformer: target -> source.
+        # target=geo, source=merc. So: geo -> merc (forward).
+        merc_forward(out_x_flat, out_y_flat, src_x_flat, src_y_flat)
+        return (src_y_flat.reshape(height, width),
+                src_x_flat.reshape(height, width))
+
+    # --- Geographic -> UTM (inverse: UTM -> Geo) ---
+    if _is_supported_geographic(src_epsg):
+        utm = _utm_params(tgt_epsg)
+        if utm is not None:
+            lon0, k0, fe, fn = utm
+            Qn = k0 * _A_RECT
+            # Target is UTM, need inverse: UTM -> Geo
+            tmerc_inverse(out_x_flat, out_y_flat, src_x_flat, src_y_flat,
+                          lon0, k0, fe, fn, Qn, _BETA, _CGB)
+            return (src_y_flat.reshape(height, width),
+                    src_x_flat.reshape(height, width))
+
+    # --- UTM -> Geographic (forward: Geo -> UTM) ---
+    utm_src = _utm_params(src_epsg)
+    if utm_src is not None and _is_supported_geographic(tgt_epsg):
+        lon0, k0, fe, fn = utm_src
+        Qn = k0 * _A_RECT
+        # Target is geographic, need forward: Geo -> UTM
+        tmerc_forward(out_x_flat, out_y_flat, src_x_flat, src_y_flat,
+                      lon0, k0, fe, fn, Qn, _ALPHA, _CBG)
+        return (src_y_flat.reshape(height, width),
+                src_x_flat.reshape(height, width))
+
+    # --- Generic Transverse Mercator (State Plane, national grids, etc.) ---
+    if _is_supported_geographic(src_epsg):
+        tmerc_p = _tmerc_params(tgt_crs)
+        if tmerc_p is not None:
+            lon0, k0, fe, fn, Zb, to_m = tmerc_p
+            Qn = k0 * _A_RECT
+            # Use 2D kernel: takes 1D coords, avoids tile/repeat + fuses unit conv
+            out_lon_2d = np.empty((height, width), dtype=np.float64)
+            out_lat_2d = np.empty((height, width), dtype=np.float64)
+            tmerc_inverse_2d(out_x_1d, out_y_1d, out_lon_2d, out_lat_2d,
+                             lon0, k0, fe, fn + Zb, Qn, _BETA, _CGB, to_m)
+            return (out_lat_2d, out_lon_2d)
+
+    if _is_supported_geographic(tgt_epsg):
+        tmerc_p = _tmerc_params(src_crs)
+        if tmerc_p is not None:
+            lon0, k0, fe, fn, Zb, to_m = tmerc_p
+            Qn = k0 * _A_RECT
+            # tmerc_forward outputs metres; convert back to native units
+            tmerc_forward(out_x_flat, out_y_flat, src_x_flat, src_y_flat,
+                          lon0, k0, fe, fn + Zb, Qn, _ALPHA, _CBG)
+            if to_m != 1.0:
+                src_x_flat /= to_m
+                src_y_flat /= to_m
+            return (src_y_flat.reshape(height, width),
+                    src_x_flat.reshape(height, width))
+
+    # --- Ellipsoidal Mercator (EPSG:3395) ---
+    if _is_supported_geographic(src_epsg) and tgt_epsg == 3395:
+        emerc_inverse(out_x_flat, out_y_flat, src_x_flat, src_y_flat,
+                      1.0, _WGS84_E)
+        return (src_y_flat.reshape(height, width),
+                src_x_flat.reshape(height, width))
+    if src_epsg == 3395 and _is_supported_geographic(tgt_epsg):
+        emerc_forward(out_x_flat, out_y_flat, src_x_flat, src_y_flat,
+                      1.0, _WGS84_E)
+        return (src_y_flat.reshape(height, width),
+                src_x_flat.reshape(height, width))
+
+    # --- Parameterised projections (LCC, AEA, CEA) ---
+    # For these we need to parse the CRS parameters, so we operate on
+    # the pyproj CRS objects directly rather than just EPSG codes.
+
+    # LCC
+    if _is_supported_geographic(src_epsg):
+        params = _lcc_params(tgt_crs)
+        if params is not None:
+            lon0, nn, c, rho0, k0, fe, fn, to_m = params
+            # Use 2D kernel: avoids tile/repeat + fuses unit conversion
+            out_lon_2d = np.empty((height, width), dtype=np.float64)
+            out_lat_2d = np.empty((height, width), dtype=np.float64)
+            lcc_inverse_2d(out_x_1d, out_y_1d, out_lon_2d, out_lat_2d,
+                           lon0, nn, c, rho0, k0, fe, fn, _WGS84_E, _WGS84_A, to_m)
+            return (out_lat_2d, out_lon_2d)
+
+    if _is_supported_geographic(tgt_epsg):
+        params = _lcc_params(src_crs)
+        if params is not None:
+            lon0, nn, c, rho0, k0, fe, fn, to_m = params
+            lcc_forward(out_x_flat, out_y_flat, src_x_flat, src_y_flat,
+                        lon0, nn, c, rho0, k0, fe, fn, _WGS84_E, _WGS84_A)
+            if to_m != 1.0:
+                src_x_flat /= to_m
+                src_y_flat /= to_m
+            return (src_y_flat.reshape(height, width),
+                    src_x_flat.reshape(height, width))
+
+    # AEA
+    if _is_supported_geographic(src_epsg):
+        params = _aea_params(tgt_crs)
+        if params is not None:
+            lon0, nn, C, rho0, fe, fn = params
+            aea_inverse(out_x_flat, out_y_flat, src_x_flat, src_y_flat,
+                        lon0, nn, C, rho0, fe, fn,
+                        _WGS84_E, _WGS84_A, _QP, _APA)
+            return (src_y_flat.reshape(height, width),
+                    src_x_flat.reshape(height, width))
+
+    if _is_supported_geographic(tgt_epsg):
+        params = _aea_params(src_crs)
+        if params is not None:
+            lon0, nn, C, rho0, fe, fn = params
+            aea_forward(out_x_flat, out_y_flat, src_x_flat, src_y_flat,
+                        lon0, nn, C, rho0, fe, fn,
+                        _WGS84_E, _WGS84_A)
+            return (src_y_flat.reshape(height, width),
+                    src_x_flat.reshape(height, width))
+
+    # CEA
+    if _is_supported_geographic(src_epsg):
+        params = _cea_params(tgt_crs)
+        if params is not None:
+            lon0, k0, fe, fn = params
+            cea_inverse(out_x_flat, out_y_flat, src_x_flat, src_y_flat,
+                        lon0, k0, fe, fn,
+                        _WGS84_E, _WGS84_A, _QP, _APA)
+            return (src_y_flat.reshape(height, width),
+                    src_x_flat.reshape(height, width))
+
+    if _is_supported_geographic(tgt_epsg):
+        params = _cea_params(src_crs)
+        if params is not None:
+            lon0, k0, fe, fn = params
+            cea_forward(out_x_flat, out_y_flat, src_x_flat, src_y_flat,
+                        lon0, k0, fe, fn,
+                        _WGS84_E, _WGS84_A, _QP)
+            return (src_y_flat.reshape(height, width),
+                    src_x_flat.reshape(height, width))
+
+    # Sinusoidal
+    if _is_supported_geographic(src_epsg):
+        params = _sinu_params(tgt_crs)
+        if params is not None:
+            lon0, fe, fn = params
+            sinu_inverse(out_x_flat, out_y_flat, src_x_flat, src_y_flat,
+                         lon0, fe, fn, _WGS84_E2, _WGS84_A, _MLFN_EN)
+            return (src_y_flat.reshape(height, width),
+                    src_x_flat.reshape(height, width))
+
+    if _is_supported_geographic(tgt_epsg):
+        params = _sinu_params(src_crs)
+        if params is not None:
+            lon0, fe, fn = params
+            sinu_forward(out_x_flat, out_y_flat, src_x_flat, src_y_flat,
+                         lon0, fe, fn, _WGS84_E2, _WGS84_A, _MLFN_EN)
+            return (src_y_flat.reshape(height, width),
+                    src_x_flat.reshape(height, width))
+
+    # LAEA
+    if _is_supported_geographic(src_epsg):
+        params = _laea_params(tgt_crs)
+        if params is not None:
+            lon0, lat0, sinb1, cosb1, dd, xmf, ymf, rq, qp, fe, fn, mode = params
+            laea_inverse(out_x_flat, out_y_flat, src_x_flat, src_y_flat,
+                         lon0, sinb1, cosb1, xmf, ymf, rq, qp,
+                         fe, fn, _WGS84_E, _WGS84_A, _WGS84_E2, mode, _APA)
+            return (src_y_flat.reshape(height, width),
+                    src_x_flat.reshape(height, width))
+
+    if _is_supported_geographic(tgt_epsg):
+        params = _laea_params(src_crs)
+        if params is not None:
+            lon0, lat0, sinb1, cosb1, dd, xmf, ymf, rq, qp, fe, fn, mode = params
+            laea_forward(out_x_flat, out_y_flat, src_x_flat, src_y_flat,
+                         lon0, sinb1, cosb1, xmf, ymf, rq, qp,
+                         fe, fn, _WGS84_E, _WGS84_A, _WGS84_E2, mode)
+            return (src_y_flat.reshape(height, width),
+                    src_x_flat.reshape(height, width))
+
+    # Polar Stereographic
+    if _is_supported_geographic(src_epsg):
+        params = _stere_params(tgt_crs)
+        if params is not None:
+            lon0, k0, akm1, fe, fn, is_south = params
+            stere_inverse(out_x_flat, out_y_flat, src_x_flat, src_y_flat,
+                          lon0, akm1, fe, fn, _WGS84_E, is_south)
+            return (src_y_flat.reshape(height, width),
+                    src_x_flat.reshape(height, width))
+
+    if _is_supported_geographic(tgt_epsg):
+        params = _stere_params(src_crs)
+        if params is not None:
+            lon0, k0, akm1, fe, fn, is_south = params
+            stere_forward(out_x_flat, out_y_flat, src_x_flat, src_y_flat,
+                          lon0, akm1, fe, fn, _WGS84_E, is_south)
+            return (src_y_flat.reshape(height, width),
+                    src_x_flat.reshape(height, width))
+
+    # Oblique Stereographic
+    if _is_supported_geographic(src_epsg):
+        params = _sterea_params(tgt_crs)
+        if params is not None:
+            lon0, sinc0, cosc0, R2, C, K, ratexp, fe, fn, e = params
+            sterea_inverse(out_x_flat, out_y_flat, src_x_flat, src_y_flat,
+                           lon0, sinc0, cosc0, R2, C, K, ratexp, fe, fn, e)
+            return (src_y_flat.reshape(height, width),
+                    src_x_flat.reshape(height, width))
+
+    if _is_supported_geographic(tgt_epsg):
+        params = _sterea_params(src_crs)
+        if params is not None:
+            lon0, sinc0, cosc0, R2, C, K, ratexp, fe, fn, e = params
+            sterea_forward(out_x_flat, out_y_flat, src_x_flat, src_y_flat,
+                           lon0, sinc0, cosc0, R2, C, K, ratexp, fe, fn, e)
+            return (src_y_flat.reshape(height, width),
+                    src_x_flat.reshape(height, width))
+
+    # Oblique Mercator (Hotine) -- kernel implemented but disabled
+    # pending alignment with PROJ's omerc.cpp variant handling.
+
+    return None
+
+
+# Wrap try_numba_transform with datum shift support
+_try_numba_transform_inner = try_numba_transform
+
+
+def try_numba_transform(src_crs, tgt_crs, chunk_bounds, chunk_shape):
+    """Numba JIT coordinate transform with optional datum shift.
+
+    Wraps the projection-only transform.  If the source CRS uses a
+    non-WGS84 datum with known Helmert parameters (e.g. NAD27), the
+    returned geographic coordinates are shifted from WGS84 to the
+    source datum via a geocentric 3-parameter Helmert transform.
+    """
+    result = _try_numba_transform_inner(src_crs, tgt_crs, chunk_bounds, chunk_shape)
+    if result is None:
+        return None
+
+    # The projection kernels assume WGS84 on both sides.  Apply
+    # datum shifts where needed.
+    src_datum = _get_datum_params(src_crs)
+    if src_datum is not None:
+        src_y, src_x = result
+        flat_lon = src_x.ravel()
+        flat_lat = src_y.ravel()
+
+        # Try grid-based shift first (sub-meter accuracy)
+        try:
+            d = src_crs.to_dict()
+        except Exception:
+            d = {}
+        datum_key = d.get('datum', d.get('ellps', ''))
+
+        grid_applied = False
+        try:
+            from ._datum_grids import find_grid_for_point, get_grid
+            from ._datum_grids import apply_grid_shift_inverse
+
+            # Use center of the output chunk to select the grid
+            center_lon = float(np.mean(flat_lon[:min(100, len(flat_lon))]))
+            center_lat = float(np.mean(flat_lat[:min(100, len(flat_lat))]))
+            grid_key = find_grid_for_point(center_lon, center_lat, datum_key)
+            if grid_key is not None:
+                grid = get_grid(grid_key)
+                if grid is not None:
+                    dlat, dlon, g_left, g_top, g_rx, g_ry, g_h, g_w = grid
+                    apply_grid_shift_inverse(
+                        flat_lon, flat_lat, dlat, dlon,
+                        g_left, g_top, g_rx, g_ry, g_h, g_w,
+                    )
+                    grid_applied = True
+        except Exception:
+            pass
+
+        if not grid_applied:
+            # Fall back to 7-parameter Helmert
+            dx, dy, dz, rx, ry, rz, ds, a_src, f_src = src_datum
+            _apply_datum_shift_inv(
+                flat_lon, flat_lat, dx, dy, dz, rx, ry, rz, ds,
+                a_src, f_src, _WGS84_A, _WGS84_F,
+            )
+
+        return flat_lat.reshape(src_y.shape), flat_lon.reshape(src_x.shape)
+
+    return result
diff --git a/xrspatial/reproject/_projections_cuda.py b/xrspatial/reproject/_projections_cuda.py
new file mode 100644
index 00000000..7dc95c8b
--- /dev/null
+++ b/xrspatial/reproject/_projections_cuda.py
@@ -0,0 +1,960 @@
+"""CUDA JIT coordinate transforms for common projections.
+
+GPU equivalents of the Numba CPU kernels in ``_projections.py``.
+Each kernel computes source CRS coordinates directly on-device,
+avoiding the CPU->GPU transfer of coordinate arrays.
+"""
+from __future__ import annotations
+
+import math
+
+import numpy as np
+
+try:
+    from numba import cuda
+    HAS_CUDA = True
+except ImportError:
+    HAS_CUDA = False
+
+# Ellipsoid constants (duplicated here so CUDA device functions see them
+# as compile-time constants rather than module-level loads).
+_A = 6378137.0
+_F = 1.0 / 298.257223563
+_E2 = 2.0 * _F - _F * _F
+_E = math.sqrt(_E2)
+
+if not HAS_CUDA:
+    # Provide a no-op so the module can be imported without CUDA.
+    def try_cuda_transform(*args, **kwargs):
+        return None
+else:
+
+    # -----------------------------------------------------------------
+    # Shared device helpers
+    # -----------------------------------------------------------------
+
+    @cuda.jit(device=True)
+    def _d_pj_sinhpsi2tanphi(taup, e):
+        e2 = e * e
+        tau = taup
+        for _ in range(5):
+            tau1 = math.sqrt(1.0 + tau * tau)
+            sig = math.sinh(e * math.atanh(e * tau / tau1))
+            sig1 = math.sqrt(1.0 + sig * sig)
+            taupa = sig1 * tau - sig * tau1
+            dtau = ((taup - taupa) * (1.0 + (1.0 - e2) * tau * tau)
+                    / ((1.0 - e2) * tau1 * math.sqrt(1.0 + taupa * taupa)))
+            tau += dtau
+            if abs(dtau) < 1e-12:
+                break
+        return tau
+
+    @cuda.jit(device=True)
+    def _d_authalic_q(sinphi, e):
+        e2 = e * e
+        es = e * sinphi
+        return (1.0 - e2) * (sinphi / (1.0 - es * es) + math.atanh(es) / e)
+
+    @cuda.jit(device=True)
+    def _d_authalic_inv(beta, apa0, apa1, apa2, apa3, apa4):
+        t = 2.0 * beta
+        return (beta
+                + apa0 * math.sin(t)
+                + apa1 * math.sin(2.0 * t)
+                + apa2 * math.sin(3.0 * t)
+                + apa3 * math.sin(4.0 * t)
+                + apa4 * math.sin(5.0 * t))
+
+    @cuda.jit(device=True)
+    def _d_clenshaw_sin(c0, c1, c2, c3, c4, c5, angle):
+        X = 2.0 * math.cos(2.0 * angle)
+        u0 = 0.0
+        u1 = 0.0
+        for c in (c5, c4, c3, c2, c1, c0):
+            t = X * u0 - u1 + c
+            u1 = u0
+            u0 = t
+        return math.sin(2.0 * angle) * u0
+
+    @cuda.jit(device=True)
+    def _d_clenshaw_complex(a0, a1, a2, a3, a4, a5,
+                            sin2Cn, cos2Cn, sinh2Ce, cosh2Ce):
+        r = 2.0 * cos2Cn * cosh2Ce
+        im = -2.0 * sin2Cn * sinh2Ce
+        hr = 0.0; hi = 0.0; hr1 = 0.0; hi1 = 0.0
+        for a in (a5, a4, a3, a2, a1, a0):
+            hr2 = hr1; hi2 = hi1; hr1 = hr; hi1 = hi
+            hr = -hr2 + r * hr1 - im * hi1 + a
+            hi = -hi2 + im * hr1 + r * hi1
+        dCn = sin2Cn * cosh2Ce * hr - cos2Cn * sinh2Ce * hi
+        dCe = sin2Cn * cosh2Ce * hi + cos2Cn * sinh2Ce * hr
+        return dCn, dCe
+
+    # -----------------------------------------------------------------
+    # Web Mercator (EPSG:3857)  --  spherical
+    # -----------------------------------------------------------------
+
+    @cuda.jit(device=True)
+    def _d_merc_inv(x, y):
+        lon = math.degrees(x / _A)
+        lat = math.degrees(math.atan(math.sinh(y / _A)))
+        return lon, lat
+
+    @cuda.jit(device=True)
+    def _d_merc_fwd(lon_deg, lat_deg):
+        x = _A * math.radians(lon_deg)
+        phi = math.radians(lat_deg)
+        y = _A * math.log(math.tan(math.pi / 4.0 + phi / 2.0))
+        return x, y
+
+    @cuda.jit
+    def _k_merc_inverse(out_src_x, out_src_y,
+                        left, top, res_x, res_y):
+        i, j = cuda.grid(2)
+        if i < out_src_x.shape[0] and j < out_src_x.shape[1]:
+            tx = left + (j + 0.5) * res_x
+            ty = top - (i + 0.5) * res_y
+            lon, lat = _d_merc_inv(tx, ty)
+            out_src_x[i, j] = lon
+            out_src_y[i, j] = lat
+
+    @cuda.jit
+    def _k_merc_forward(out_src_x, out_src_y,
+                        left, top, res_x, res_y):
+        i, j = cuda.grid(2)
+        if i < out_src_x.shape[0] and j < out_src_x.shape[1]:
+            lon = left + (j + 0.5) * res_x
+            lat = top - (i + 0.5) * res_y
+            x, y = _d_merc_fwd(lon, lat)
+            out_src_x[i, j] = x
+            out_src_y[i, j] = y
+
+    # -----------------------------------------------------------------
+    # Ellipsoidal Mercator (EPSG:3395)
+    # -----------------------------------------------------------------
+
+    @cuda.jit(device=True)
+    def _d_emerc_inv(x, y, k0, e):
+        lam = x / (k0 * _A)
+        taup = math.sinh(y / (k0 * _A))
+        tau = _d_pj_sinhpsi2tanphi(taup, e)
+        return math.degrees(lam), math.degrees(math.atan(tau))
+
+    @cuda.jit(device=True)
+    def _d_emerc_fwd(lon_deg, lat_deg, k0, e):
+        lam = math.radians(lon_deg)
+        phi = math.radians(lat_deg)
+        sinphi = math.sin(phi)
+        x = k0 * _A * lam
+        y = k0 * _A * (math.asinh(math.tan(phi)) - e * math.atanh(e * sinphi))
+        return x, y
+
+    @cuda.jit
+    def _k_emerc_inverse(out_src_x, out_src_y,
+                         left, top, res_x, res_y, k0, e):
+        i, j = cuda.grid(2)
+        if i < out_src_x.shape[0] and j < out_src_x.shape[1]:
+            tx = left + (j + 0.5) * res_x
+            ty = top - (i + 0.5) * res_y
+            lon, lat = _d_emerc_inv(tx, ty, k0, e)
+            out_src_x[i, j] = lon
+            out_src_y[i, j] = lat
+
+    @cuda.jit
+    def _k_emerc_forward(out_src_x, out_src_y,
+                         left, top, res_x, res_y, k0, e):
+        i, j = cuda.grid(2)
+        if i < out_src_x.shape[0] and j < out_src_x.shape[1]:
+            lon = left + (j + 0.5) * res_x
+            lat = top - (i + 0.5) * res_y
+            x, y = _d_emerc_fwd(lon, lat, k0, e)
+            out_src_x[i, j] = x
+            out_src_y[i, j] = y
+
+    # -----------------------------------------------------------------
+    # Transverse Mercator / UTM  --  Krueger series
+    # -----------------------------------------------------------------
+
+    @cuda.jit(device=True)
+    def _d_tmerc_fwd(lon_deg, lat_deg, lon0, Qn,
+                     a0, a1, a2, a3, a4, a5,
+                     c0, c1, c2, c3, c4, c5):
+        lam = math.radians(lon_deg) - lon0
+        phi = math.radians(lat_deg)
+        chi = phi + _d_clenshaw_sin(c0, c1, c2, c3, c4, c5, phi)
+        sin_chi = math.sin(chi)
+        cos_chi = math.cos(chi)
+        sin_lam = math.sin(lam)
+        cos_lam = math.cos(lam)
+        denom = math.hypot(sin_chi, cos_chi * cos_lam)
+        if denom < 1e-30:
+            denom = 1e-30
+        Cn = math.atan2(sin_chi, cos_chi * cos_lam)
+        tan_Ce = sin_lam * cos_chi / denom
+        if tan_Ce > 1e15:
+            tan_Ce = 1e15
+        elif tan_Ce < -1e15:
+            tan_Ce = -1e15
+        Ce = math.asinh(tan_Ce)
+        inv_d = 1.0 / denom
+        inv_d2 = inv_d * inv_d
+        ccl = cos_chi * cos_lam
+        sin2 = 2.0 * sin_chi * ccl * inv_d2
+        cos2 = 2.0 * ccl * ccl * inv_d2 - 1.0
+        sinh2 = 2.0 * tan_Ce * inv_d
+        cosh2 = 2.0 * inv_d2 - 1.0
+        dCn, dCe = _d_clenshaw_complex(a0, a1, a2, a3, a4, a5,
+                                        sin2, cos2, sinh2, cosh2)
+        return Qn * (Ce + dCe), Qn * (Cn + dCn)
+
+    @cuda.jit(device=True)
+    def _d_tmerc_inv(x, y, lon0, Qn,
+                     b0, b1, b2, b3, b4, b5,
+                     g0, g1, g2, g3, g4, g5):
+        Cn = y / Qn
+        Ce = x / Qn
+        sin2Cn = math.sin(2.0 * Cn)
+        cos2Cn = math.cos(2.0 * Cn)
+        exp2Ce = math.exp(2.0 * Ce)
+        inv_exp = 1.0 / exp2Ce
+        sinh2Ce = 0.5 * (exp2Ce - inv_exp)
+        cosh2Ce = 0.5 * (exp2Ce + inv_exp)
+        dCn, dCe = _d_clenshaw_complex(b0, b1, b2, b3, b4, b5,
+                                        sin2Cn, cos2Cn, sinh2Ce, cosh2Ce)
+        Cn -= dCn
+        Ce -= dCe
+        sin_Cn = math.sin(Cn)
+        cos_Cn = math.cos(Cn)
+        sinh_Ce = math.sinh(Ce)
+        lam = math.atan2(sinh_Ce, cos_Cn)
+        modulus = math.hypot(sinh_Ce, cos_Cn)
+        chi = math.atan2(sin_Cn, modulus)
+        phi = chi + _d_clenshaw_sin(g0, g1, g2, g3, g4, g5, chi)
+        return math.degrees(lam + lon0), math.degrees(phi)
+
+    @cuda.jit
+    def _k_tmerc_inverse(out_src_x, out_src_y,
+                         left, top, res_x, res_y,
+                         lon0, fe, fn, Qn,
+                         b0, b1, b2, b3, b4, b5,
+                         g0, g1, g2, g3, g4, g5):
+        i, j = cuda.grid(2)
+        if i < out_src_x.shape[0] and j < out_src_x.shape[1]:
+            tx = left + (j + 0.5) * res_x - fe
+            ty = top - (i + 0.5) * res_y - fn
+            lon, lat = _d_tmerc_inv(tx, ty, lon0, Qn,
+                                     b0, b1, b2, b3, b4, b5,
+                                     g0, g1, g2, g3, g4, g5)
+            out_src_x[i, j] = lon
+            out_src_y[i, j] = lat
+
+    @cuda.jit
+    def _k_tmerc_forward(out_src_x, out_src_y,
+                         left, top, res_x, res_y,
+                         lon0, fe, fn, Qn,
+                         a0, a1, a2, a3, a4, a5,
+                         c0, c1, c2, c3, c4, c5):
+        i, j = cuda.grid(2)
+        if i < out_src_x.shape[0] and j < out_src_x.shape[1]:
+            lon = left + (j + 0.5) * res_x
+            lat = top - (i + 0.5) * res_y
+            x, y = _d_tmerc_fwd(lon, lat, lon0, Qn,
+                                 a0, a1, a2, a3, a4, a5,
+                                 c0, c1, c2, c3, c4, c5)
+            out_src_x[i, j] = x + fe
+            out_src_y[i, j] = y + fn
+
+    # -----------------------------------------------------------------
+    # Lambert Conformal Conic (LCC)
+    # -----------------------------------------------------------------
+
+    @cuda.jit(device=True)
+    def _d_lcc_fwd(lon_deg, lat_deg, lon0, n, c, rho0, k0, e, a):
+        phi = math.radians(lat_deg)
+        lam = math.radians(lon_deg) - lon0
+        sinphi = math.sin(phi)
+        es = e * sinphi
+        ts = math.tan(math.pi / 4.0 - phi / 2.0) * math.pow(
+            (1.0 + es) / (1.0 - es), e / 2.0)
+        rho = a * k0 * c * math.pow(ts, n)
+        lam_n = n * lam
+        return rho * math.sin(lam_n), rho0 - rho * math.cos(lam_n)
+
+    @cuda.jit(device=True)
+    def _d_lcc_inv(x, y, lon0, n, c, rho0, k0, e, a):
+        rho0_y = rho0 - y
+        if n < 0.0:
+            rho = -math.hypot(x, rho0_y)
+            lam_n = math.atan2(-x, -rho0_y)
+        else:
+            rho = math.hypot(x, rho0_y)
+            lam_n = math.atan2(x, rho0_y)
+        if abs(rho) < 1e-30:
+            lat = 90.0 if n > 0 else -90.0
+            return math.degrees(lon0 + lam_n / n), lat
+        ts = math.pow(rho / (a * k0 * c), 1.0 / n)
+        taup = math.sinh(math.log(1.0 / ts))
+        tau = _d_pj_sinhpsi2tanphi(taup, e)
+        return math.degrees(lam_n / n + lon0), math.degrees(math.atan(tau))
+
+    @cuda.jit
+    def _k_lcc_inverse(out_src_x, out_src_y,
+                       left, top, res_x, res_y,
+                       lon0, n, c, rho0, k0, fe, fn, e, a):
+        i, j = cuda.grid(2)
+        if i < out_src_x.shape[0] and j < out_src_x.shape[1]:
+            tx = left + (j + 0.5) * res_x - fe
+            ty = top - (i + 0.5) * res_y - fn
+            lon, lat = _d_lcc_inv(tx, ty, lon0, n, c, rho0, k0, e, a)
+            out_src_x[i, j] = lon
+            out_src_y[i, j] = lat
+
+    @cuda.jit
+    def _k_lcc_forward(out_src_x, out_src_y,
+                       left, top, res_x, res_y,
+                       lon0, n, c, rho0, k0, fe, fn, e, a):
+        i, j = cuda.grid(2)
+        if i < out_src_x.shape[0] and j < out_src_x.shape[1]:
+            lon = left + (j + 0.5) * res_x
+            lat = top - (i + 0.5) * res_y
+            x, y = _d_lcc_fwd(lon, lat, lon0, n, c, rho0, k0, e, a)
+            out_src_x[i, j] = x + fe
+            out_src_y[i, j] = y + fn
+
+    # -----------------------------------------------------------------
+    # Albers Equal Area (AEA)
+    # -----------------------------------------------------------------
+
+    @cuda.jit(device=True)
+    def _d_aea_fwd(lon_deg, lat_deg, lon0, n, C, rho0, e, a):
+        phi = math.radians(lat_deg)
+        lam = math.radians(lon_deg) - lon0
+        q = _d_authalic_q(math.sin(phi), e)
+        val = C - n * q
+        if val < 0.0:
+            val = 0.0
+        rho = a * math.sqrt(val) / n
+        theta = n * lam
+        return rho * math.sin(theta), rho0 - rho * math.cos(theta)
+
+    @cuda.jit(device=True)
+    def _d_aea_inv(x, y, lon0, n, C, rho0, e, a, qp,
+                   apa0, apa1, apa2, apa3, apa4):
+        rho0_y = rho0 - y
+        if n < 0.0:
+            rho = -math.hypot(x, rho0_y)
+            theta = math.atan2(-x, -rho0_y)
+        else:
+            rho = math.hypot(x, rho0_y)
+            theta = math.atan2(x, rho0_y)
+        q = (C - (rho * rho * n * n) / (a * a)) / n
+        ratio = q / qp
+        if ratio > 1.0:
+            ratio = 1.0
+        elif ratio < -1.0:
+            ratio = -1.0
+        beta = math.asin(ratio)
+        phi = _d_authalic_inv(beta, apa0, apa1, apa2, apa3, apa4)
+        return math.degrees(theta / n + lon0), math.degrees(phi)
+
+    @cuda.jit
+    def _k_aea_inverse(out_src_x, out_src_y,
+                       left, top, res_x, res_y,
+                       lon0, n, C, rho0, fe, fn, e, a, qp,
+                       apa0, apa1, apa2, apa3, apa4):
+        i, j = cuda.grid(2)
+        if i < out_src_x.shape[0] and j < out_src_x.shape[1]:
+            tx = left + (j + 0.5) * res_x - fe
+            ty = top - (i + 0.5) * res_y - fn
+            lon, lat = _d_aea_inv(tx, ty, lon0, n, C, rho0, e, a, qp,
+                                   apa0, apa1, apa2, apa3, apa4)
+            out_src_x[i, j] = lon
+            out_src_y[i, j] = lat
+
+    @cuda.jit
+    def _k_aea_forward(out_src_x, out_src_y,
+                       left, top, res_x, res_y,
+                       lon0, n, C, rho0, fe, fn, e, a):
+        i, j = cuda.grid(2)
+        if i < out_src_x.shape[0] and j < out_src_x.shape[1]:
+            lon = left + (j + 0.5) * res_x
+            lat = top - (i + 0.5) * res_y
+            x, y = _d_aea_fwd(lon, lat, lon0, n, C, rho0, e, a)
+            out_src_x[i, j] = x + fe
+            out_src_y[i, j] = y + fn
+
+    # -----------------------------------------------------------------
+    # Cylindrical Equal Area (CEA)
+    # -----------------------------------------------------------------
+
+    @cuda.jit(device=True)
+    def _d_cea_fwd(lon_deg, lat_deg, lon0, k0, e, a, qp):
+        lam = math.radians(lon_deg) - lon0
+        phi = math.radians(lat_deg)
+        q = _d_authalic_q(math.sin(phi), e)
+        return a * k0 * lam, a * q / (2.0 * k0)
+
+    @cuda.jit(device=True)
+    def _d_cea_inv(x, y, lon0, k0, e, a, qp, apa0, apa1, apa2, apa3, apa4):
+        lam = x / (a * k0)
+        ratio = 2.0 * y * k0 / (a * qp)
+        if ratio > 1.0:
+            ratio = 1.0
+        elif ratio < -1.0:
+            ratio = -1.0
+        beta = math.asin(ratio)
+        phi = _d_authalic_inv(beta, apa0, apa1, apa2, apa3, apa4)
+        return math.degrees(lam + lon0), math.degrees(phi)
+
+    @cuda.jit
+    def _k_cea_inverse(out_src_x, out_src_y,
+                       left, top, res_x, res_y,
+                       lon0, k0, fe, fn, e, a, qp,
+                       apa0, apa1, apa2, apa3, apa4):
+        i, j = cuda.grid(2)
+        if i < out_src_x.shape[0] and j < out_src_x.shape[1]:
+            tx = left + (j + 0.5) * res_x - fe
+            ty = top - (i + 0.5) * res_y - fn
+            lon, lat = _d_cea_inv(tx, ty, lon0, k0, e, a, qp,
+                                   apa0, apa1, apa2, apa3, apa4)
+            out_src_x[i, j] = lon
+            out_src_y[i, j] = lat
+
+    @cuda.jit
+    def _k_cea_forward(out_src_x, out_src_y,
+                       left, top, res_x, res_y,
+                       lon0, k0, fe, fn, e, a, qp):
+        i, j = cuda.grid(2)
+        if i < out_src_x.shape[0] and j < out_src_x.shape[1]:
+            lon = left + (j + 0.5) * res_x
+            lat = top - (i + 0.5) * res_y
+            x, y = _d_cea_fwd(lon, lat, lon0, k0, e, a, qp)
+            out_src_x[i, j] = x + fe
+            out_src_y[i, j] = y + fn
+
+    # -----------------------------------------------------------------
+    # Sinusoidal (ellipsoidal)
+    # -----------------------------------------------------------------
+
+    @cuda.jit(device=True)
+    def _d_mlfn(phi, sinphi, cosphi, en0, en1, en2, en3, en4):
+        cphi = cosphi * sinphi
+        sphi = sinphi * sinphi
+        return en0 * phi - cphi * (en1 + sphi * (en2 + sphi * (en3 + sphi * en4)))
+
+    @cuda.jit(device=True)
+    def _d_inv_mlfn(arg, e2, en0, en1, en2, en3, en4):
+        k = 1.0 / (1.0 - e2)
+        phi = arg
+        for _ in range(20):
+            s = math.sin(phi)
+            c = math.cos(phi)
+            t = 1.0 - e2 * s * s
+            dphi = (arg - _d_mlfn(phi, s, c, en0, en1, en2, en3, en4)) * t * math.sqrt(t) * k
+            phi += dphi
+            if abs(dphi) < 1e-14:
+                break
+        return phi
+
+    @cuda.jit(device=True)
+    def _d_sinu_fwd(lon_deg, lat_deg, lon0, e2, a, en0, en1, en2, en3, en4):
+        phi = math.radians(lat_deg)
+        lam = math.radians(lon_deg) - lon0
+        s = math.sin(phi)
+        c = math.cos(phi)
+        ms = _d_mlfn(phi, s, c, en0, en1, en2, en3, en4)
+        x = a * lam * c / math.sqrt(1.0 - e2 * s * s)
+        y = a * ms
+        return x, y
+
+    @cuda.jit(device=True)
+    def _d_sinu_inv(x, y, lon0, e2, a, en0, en1, en2, en3, en4):
+        phi = _d_inv_mlfn(y / a, e2, en0, en1, en2, en3, en4)
+        s = math.sin(phi)
+        c = math.cos(phi)
+        if abs(c) < 1e-14:
+            lam = 0.0
+        else:
+            lam = x * math.sqrt(1.0 - e2 * s * s) / (a * c)
+        return math.degrees(lam + lon0), math.degrees(phi)
+
+    @cuda.jit
+    def _k_sinu_inverse(out_src_x, out_src_y,
+                        left, top, res_x, res_y,
+                        lon0, fe, fn, e2, a,
+                        en0, en1, en2, en3, en4):
+        i, j = cuda.grid(2)
+        if i < out_src_x.shape[0] and j < out_src_x.shape[1]:
+            tx = left + (j + 0.5) * res_x - fe
+            ty = top - (i + 0.5) * res_y - fn
+            lon, lat = _d_sinu_inv(tx, ty, lon0, e2, a, en0, en1, en2, en3, en4)
+            out_src_x[i, j] = lon
+            out_src_y[i, j] = lat
+
+    @cuda.jit
+    def _k_sinu_forward(out_src_x, out_src_y,
+                        left, top, res_x, res_y,
+                        lon0, fe, fn, e2, a,
+                        en0, en1, en2, en3, en4):
+        i, j = cuda.grid(2)
+        if i < out_src_x.shape[0] and j < out_src_x.shape[1]:
+            lon = left + (j + 0.5) * res_x
+            lat = top - (i + 0.5) * res_y
+            x, y = _d_sinu_fwd(lon, lat, lon0, e2, a, en0, en1, en2, en3, en4)
+            out_src_x[i, j] = x + fe
+            out_src_y[i, j] = y + fn
+
+    # -----------------------------------------------------------------
+    # Lambert Azimuthal Equal Area (LAEA)
+    # -----------------------------------------------------------------
+
+    @cuda.jit(device=True)
+    def _d_laea_fwd(lon_deg, lat_deg, lon0, sinb1, cosb1,
+                    xmf, ymf, rq, qp, e, a, e2, mode):
+        phi = math.radians(lat_deg)
+        lam = math.radians(lon_deg) - lon0
+        sinphi = math.sin(phi)
+        q = _d_authalic_q(sinphi, e)
+        sinb = q / qp
+        if sinb > 1.0:
+            sinb = 1.0
+        elif sinb < -1.0:
+            sinb = -1.0
+        cosb = math.sqrt(1.0 - sinb * sinb)
+        coslam = math.cos(lam)
+        sinlam = math.sin(lam)
+        if mode == 0:  # OBLIQ
+            denom = 1.0 + sinb1 * sinb + cosb1 * cosb * coslam
+            if denom < 1e-30:
+                denom = 1e-30
+            b = math.sqrt(2.0 / denom)
+            x = a * xmf * b * cosb * sinlam
+            y = a * ymf * b * (cosb1 * sinb - sinb1 * cosb * coslam)
+        elif mode == 1:  # EQUIT
+            denom = 1.0 + cosb * coslam
+            if denom < 1e-30:
+                denom = 1e-30
+            b = math.sqrt(2.0 / denom)
+            x = a * xmf * b * cosb * sinlam
+            y = a * ymf * b * sinb
+        elif mode == 2:  # N_POLE
+            q_diff = qp - q
+            if q_diff < 0.0:
+                q_diff = 0.0
+            rho = a * math.sqrt(q_diff)
+            x = rho * sinlam
+            y = -rho * coslam
+        else:  # S_POLE
+            q_diff = qp + q
+            if q_diff < 0.0:
+                q_diff = 0.0
+            rho = a * math.sqrt(q_diff)
+            x = rho * sinlam
+            y = rho * coslam
+        return x, y
+
+    @cuda.jit(device=True)
+    def _d_laea_inv(x, y, lon0, sinb1, cosb1,
+                    xmf, ymf, rq, qp, e, a, e2, mode,
+                    apa0, apa1, apa2, apa3, apa4):
+        if mode == 2 or mode == 3:
+            x_a = x / a
+            y_a = y / a
+            rho = math.hypot(x_a, y_a)
+            if rho < 1e-30:
+                lat = 90.0 if mode == 2 else -90.0
+                return math.degrees(lon0), lat
+            q = qp - rho * rho
+            if mode == 3:
+                q = -(qp - rho * rho)
+                lam = math.atan2(x_a, y_a)
+            else:
+                lam = math.atan2(x_a, -y_a)
+        else:
+            xn = x / (a * xmf)
+            yn = y / (a * ymf)
+            rho = math.hypot(xn, yn)
+            if rho < 1e-30:
+                return math.degrees(lon0), math.degrees(math.asin(sinb1))
+            sce = 2.0 * math.asin(0.5 * rho / rq)
+            sinz = math.sin(sce)
+            cosz = math.cos(sce)
+            if mode == 0:
+                ab = cosz * sinb1 + yn * sinz * cosb1 / rho
+                lam = math.atan2(xn * sinz,
+                                  rho * cosb1 * cosz - yn * sinb1 * sinz)
+            else:
+                ab = yn * sinz / rho
+                lam = math.atan2(xn * sinz, rho * cosz)
+            q = qp * ab
+        ratio = q / qp
+        if ratio > 1.0:
+            ratio = 1.0
+        elif ratio < -1.0:
+            ratio = -1.0
+        beta = math.asin(ratio)
+        phi = _d_authalic_inv(beta, apa0, apa1, apa2, apa3, apa4)
+        return math.degrees(lam + lon0), math.degrees(phi)
+
+    @cuda.jit
+    def _k_laea_inverse(out_src_x, out_src_y,
+                        left, top, res_x, res_y,
+                        lon0, sinb1, cosb1, xmf, ymf, rq, qp,
+                        fe, fn, e, a, e2, mode,
+                        apa0, apa1, apa2, apa3, apa4):
+        i, j = cuda.grid(2)
+        if i < out_src_x.shape[0] and j < out_src_x.shape[1]:
+            tx = left + (j + 0.5) * res_x - fe
+            ty = top - (i + 0.5) * res_y - fn
+            lon, lat = _d_laea_inv(tx, ty, lon0, sinb1, cosb1,
+                                    xmf, ymf, rq, qp, e, a, e2, mode,
+                                    apa0, apa1, apa2, apa3, apa4)
+            out_src_x[i, j] = lon
+            out_src_y[i, j] = lat
+
+    @cuda.jit
+    def _k_laea_forward(out_src_x, out_src_y,
+                        left, top, res_x, res_y,
+                        lon0, sinb1, cosb1, xmf, ymf, rq, qp,
+                        fe, fn, e, a, e2, mode):
+        i, j = cuda.grid(2)
+        if i < out_src_x.shape[0] and j < out_src_x.shape[1]:
+            lon = left + (j + 0.5) * res_x
+            lat = top - (i + 0.5) * res_y
+            x, y = _d_laea_fwd(lon, lat, lon0, sinb1, cosb1,
+                                xmf, ymf, rq, qp, e, a, e2, mode)
+            out_src_x[i, j] = x + fe
+            out_src_y[i, j] = y + fn
+
+    # -----------------------------------------------------------------
+    # Polar Stereographic (N/S pole)
+    # -----------------------------------------------------------------
+
+    @cuda.jit(device=True)
+    def _d_stere_fwd(lon_deg, lat_deg, lon0, akm1, e, is_south):
+        phi = math.radians(lat_deg)
+        lam = math.radians(lon_deg) - lon0
+        abs_phi = -phi if is_south else phi
+        sinphi = math.sin(abs_phi)
+        es = e * sinphi
+        ts = math.tan(math.pi / 4.0 - abs_phi / 2.0) * math.pow(
+            (1.0 + es) / (1.0 - es), e / 2.0)
+        rho = akm1 * ts
+        if is_south:
+            return rho * math.sin(lam), rho * math.cos(lam)
+        else:
+            return rho * math.sin(lam), -rho * math.cos(lam)
+
+    @cuda.jit(device=True)
+    def _d_stere_inv(x, y, lon0, akm1, e, is_south):
+        if is_south:
+            rho = math.hypot(x, y)
+            lam = math.atan2(x, y)
+        else:
+            rho = math.hypot(x, y)
+            lam = math.atan2(x, -y)
+        if rho < 1e-30:
+            lat = -90.0 if is_south else 90.0
+            return math.degrees(lon0), lat
+        tp = rho / akm1
+        half_e = e / 2.0
+        phi = math.pi / 2.0 - 2.0 * math.atan(tp)
+        for _ in range(15):
+            sinphi = math.sin(phi)
+            es = e * sinphi
+            phi_new = math.pi / 2.0 - 2.0 * math.atan(
+                tp * math.pow((1.0 - es) / (1.0 + es), half_e))
+            if abs(phi_new - phi) < 1e-14:
+                phi = phi_new
+                break
+            phi = phi_new
+        if is_south:
+            phi = -phi
+        return math.degrees(lam + lon0), math.degrees(phi)
+
+    @cuda.jit
+    def _k_stere_inverse(out_src_x, out_src_y,
+                         left, top, res_x, res_y,
+                         lon0, akm1, fe, fn, e, is_south):
+        i, j = cuda.grid(2)
+        if i < out_src_x.shape[0] and j < out_src_x.shape[1]:
+            tx = left + (j + 0.5) * res_x - fe
+            ty = top - (i + 0.5) * res_y - fn
+            lon, lat = _d_stere_inv(tx, ty, lon0, akm1, e, is_south)
+            out_src_x[i, j] = lon
+            out_src_y[i, j] = lat
+
+    @cuda.jit
+    def _k_stere_forward(out_src_x, out_src_y,
+                         left, top, res_x, res_y,
+                         lon0, akm1, fe, fn, e, is_south):
+        i, j = cuda.grid(2)
+        if i < out_src_x.shape[0] and j < out_src_x.shape[1]:
+            lon = left + (j + 0.5) * res_x
+            lat = top - (i + 0.5) * res_y
+            x, y = _d_stere_fwd(lon, lat, lon0, akm1, e, is_south)
+            out_src_x[i, j] = x + fe
+            out_src_y[i, j] = y + fn
+
+    # -----------------------------------------------------------------
+    # Dispatch
+    # -----------------------------------------------------------------
+
+    def _cuda_dims(shape):
+        """Compute (blocks_per_grid, threads_per_block) for a 2D kernel."""
+        tpb = (16, 16)  # conservative to avoid register pressure
+        bpg = (
+            (shape[0] + tpb[0] - 1) // tpb[0],
+            (shape[1] + tpb[1] - 1) // tpb[1],
+        )
+        return bpg, tpb
+
+    def try_cuda_transform(src_crs, tgt_crs, chunk_bounds, chunk_shape):
+        """Attempt a CUDA JIT coordinate transform.
+
+        Returns (src_y, src_x) as CuPy arrays if a fast path exists,
+        or None to fall back to CPU.
+        """
+        import cupy as cp
+        from ._projections import (
+            _get_epsg, _is_geographic_wgs84_or_nad83, _utm_params,
+            _tmerc_params, _lcc_params, _aea_params, _cea_params,
+            _sinu_params, _laea_params, _stere_params,
+            _ALPHA, _BETA, _CBG, _CGB, _A_RECT, _QP, _APA,
+            _WGS84_E2, _MLFN_EN,
+        )
+
+        src_epsg = _get_epsg(src_crs)
+        tgt_epsg = _get_epsg(tgt_crs)
+        if src_epsg is None and tgt_epsg is None:
+            return None
+
+        height, width = chunk_shape
+        left, bottom, right, top = chunk_bounds
+        res_x = (right - left) / width
+        res_y = (top - bottom) / height
+
+        out_src_x = cp.empty((height, width), dtype=cp.float64)
+        out_src_y = cp.empty((height, width), dtype=cp.float64)
+        bpg, tpb = _cuda_dims((height, width))
+
+        # --- Web Mercator ---
+        if _is_geographic_wgs84_or_nad83(src_epsg) and tgt_epsg == 3857:
+            _k_merc_inverse[bpg, tpb](out_src_x, out_src_y,
+                                       left, top, res_x, res_y)
+            return out_src_y, out_src_x
+
+        if src_epsg == 3857 and _is_geographic_wgs84_or_nad83(tgt_epsg):
+            _k_merc_forward[bpg, tpb](out_src_x, out_src_y,
+                                       left, top, res_x, res_y)
+            return out_src_y, out_src_x
+
+        # --- UTM ---
+        if _is_geographic_wgs84_or_nad83(src_epsg):
+            utm = _utm_params(tgt_epsg)
+            if utm is not None:
+                lon0, k0, fe, fn = utm
+                Qn = k0 * _A_RECT
+                _k_tmerc_inverse[bpg, tpb](
+                    out_src_x, out_src_y, left, top, res_x, res_y,
+                    lon0, fe, fn, Qn,
+                    _BETA[0], _BETA[1], _BETA[2], _BETA[3], _BETA[4], _BETA[5],
+                    _CGB[0], _CGB[1], _CGB[2], _CGB[3], _CGB[4], _CGB[5],
+                )
+                return out_src_y, out_src_x
+
+        utm_src = _utm_params(src_epsg) if src_epsg else None
+        if utm_src is not None and _is_geographic_wgs84_or_nad83(tgt_epsg):
+            lon0, k0, fe, fn = utm_src
+            Qn = k0 * _A_RECT
+            _k_tmerc_forward[bpg, tpb](
+                out_src_x, out_src_y, left, top, res_x, res_y,
+                lon0, fe, fn, Qn,
+                _ALPHA[0], _ALPHA[1], _ALPHA[2], _ALPHA[3], _ALPHA[4], _ALPHA[5],
+                _CBG[0], _CBG[1], _CBG[2], _CBG[3], _CBG[4], _CBG[5],
+            )
+            return out_src_y, out_src_x
+
+        # --- Ellipsoidal Mercator ---
+        if _is_geographic_wgs84_or_nad83(src_epsg) and tgt_epsg == 3395:
+            _k_emerc_inverse[bpg, tpb](out_src_x, out_src_y,
+                                        left, top, res_x, res_y, 1.0, _E)
+            return out_src_y, out_src_x
+
+        if src_epsg == 3395 and _is_geographic_wgs84_or_nad83(tgt_epsg):
+            _k_emerc_forward[bpg, tpb](out_src_x, out_src_y,
+                                        left, top, res_x, res_y, 1.0, _E)
+            return out_src_y, out_src_x
+
+        # --- Generic Transverse Mercator (State Plane, etc.) ---
+        if _is_geographic_wgs84_or_nad83(src_epsg):
+            tmerc_p = _tmerc_params(tgt_crs)
+            if tmerc_p is not None:
+                lon0, k0, fe, fn, Zb, to_m = tmerc_p
+                Qn = k0 * _A_RECT
+                if to_m != 1.0:
+                    _k_tmerc_inverse[bpg, tpb](
+                        out_src_x, out_src_y,
+                        left * to_m, top * to_m, res_x * to_m, res_y * to_m,
+                        lon0, fe, fn + Zb, Qn,
+                        _BETA[0], _BETA[1], _BETA[2], _BETA[3], _BETA[4], _BETA[5],
+                        _CGB[0], _CGB[1], _CGB[2], _CGB[3], _CGB[4], _CGB[5],
+                    )
+                else:
+                    _k_tmerc_inverse[bpg, tpb](
+                        out_src_x, out_src_y, left, top, res_x, res_y,
+                        lon0, fe, fn + Zb, Qn,
+                        _BETA[0], _BETA[1], _BETA[2], _BETA[3], _BETA[4], _BETA[5],
+                        _CGB[0], _CGB[1], _CGB[2], _CGB[3], _CGB[4], _CGB[5],
+                    )
+                return out_src_y, out_src_x
+
+        if _is_geographic_wgs84_or_nad83(tgt_epsg):
+            tmerc_p = _tmerc_params(src_crs)
+            if tmerc_p is not None:
+                lon0, k0, fe, fn, Zb, to_m = tmerc_p
+                Qn = k0 * _A_RECT
+                _k_tmerc_forward[bpg, tpb](
+                    out_src_x, out_src_y, left, top, res_x, res_y,
+                    lon0, fe, fn + Zb, Qn,
+                    _ALPHA[0], _ALPHA[1], _ALPHA[2], _ALPHA[3], _ALPHA[4], _ALPHA[5],
+                    _CBG[0], _CBG[1], _CBG[2], _CBG[3], _CBG[4], _CBG[5],
+                )
+                if to_m != 1.0:
+                    out_src_x /= to_m
+                    out_src_y /= to_m
+                return out_src_y, out_src_x
+
+        # --- LCC ---
+        if _is_geographic_wgs84_or_nad83(src_epsg):
+            params = _lcc_params(tgt_crs)
+            if params is not None:
+                lon0, nn, c, rho0, k0, fe, fn, to_m = params
+                if to_m != 1.0:
+                    _k_lcc_inverse[bpg, tpb](
+                        out_src_x, out_src_y,
+                        left * to_m, top * to_m, res_x * to_m, res_y * to_m,
+                        lon0, nn, c, rho0, k0, fe, fn, _E, _A)
+                else:
+                    _k_lcc_inverse[bpg, tpb](
+                        out_src_x, out_src_y, left, top, res_x, res_y,
+                        lon0, nn, c, rho0, k0, fe, fn, _E, _A)
+                return out_src_y, out_src_x
+
+        if _is_geographic_wgs84_or_nad83(tgt_epsg):
+            params = _lcc_params(src_crs)
+            if params is not None:
+                lon0, nn, c, rho0, k0, fe, fn, to_m = params
+                _k_lcc_forward[bpg, tpb](
+                    out_src_x, out_src_y, left, top, res_x, res_y,
+                    lon0, nn, c, rho0, k0, fe, fn, _E, _A)
+                if to_m != 1.0:
+                    out_src_x /= to_m
+                    out_src_y /= to_m
+                return out_src_y, out_src_x
+
+        # --- AEA ---
+        if _is_geographic_wgs84_or_nad83(src_epsg):
+            params = _aea_params(tgt_crs)
+            if params is not None:
+                lon0, nn, C, rho0, fe, fn = params
+                _k_aea_inverse[bpg, tpb](
+                    out_src_x, out_src_y, left, top, res_x, res_y,
+                    lon0, nn, C, rho0, fe, fn, _E, _A, _QP,
+                    _APA[0], _APA[1], _APA[2], _APA[3], _APA[4])
+                return out_src_y, out_src_x
+
+        if _is_geographic_wgs84_or_nad83(tgt_epsg):
+            params = _aea_params(src_crs)
+            if params is not None:
+                lon0, nn, C, rho0, fe, fn = params
+                _k_aea_forward[bpg, tpb](
+                    out_src_x, out_src_y, left, top, res_x, res_y,
+                    lon0, nn, C, rho0, fe, fn, _E, _A)
+                return out_src_y, out_src_x
+
+        # --- CEA ---
+        if _is_geographic_wgs84_or_nad83(src_epsg):
+            params = _cea_params(tgt_crs)
+            if params is not None:
+                lon0, k0, fe, fn = params
+                _k_cea_inverse[bpg, tpb](
+                    out_src_x, out_src_y, left, top, res_x, res_y,
+                    lon0, k0, fe, fn, _E, _A, _QP,
+                    _APA[0], _APA[1], _APA[2], _APA[3], _APA[4])
+                return out_src_y, out_src_x
+
+        if _is_geographic_wgs84_or_nad83(tgt_epsg):
+            params = _cea_params(src_crs)
+            if params is not None:
+                lon0, k0, fe, fn = params
+                _k_cea_forward[bpg, tpb](
+                    out_src_x, out_src_y, left, top, res_x, res_y,
+                    lon0, k0, fe, fn, _E, _A, _QP)
+                return out_src_y, out_src_x
+
+        # --- Sinusoidal ---
+        if _is_geographic_wgs84_or_nad83(src_epsg):
+            params = _sinu_params(tgt_crs)
+            if params is not None:
+                lon0, fe, fn = params
+                en = _MLFN_EN
+                _k_sinu_inverse[bpg, tpb](
+                    out_src_x, out_src_y, left, top, res_x, res_y,
+                    lon0, fe, fn, _WGS84_E2, _A,
+                    en[0], en[1], en[2], en[3], en[4])
+                return out_src_y, out_src_x
+
+        if _is_geographic_wgs84_or_nad83(tgt_epsg):
+            params = _sinu_params(src_crs)
+            if params is not None:
+                lon0, fe, fn = params
+                en = _MLFN_EN
+                _k_sinu_forward[bpg, tpb](
+                    out_src_x, out_src_y, left, top, res_x, res_y,
+                    lon0, fe, fn, _WGS84_E2, _A,
+                    en[0], en[1], en[2], en[3], en[4])
+                return out_src_y, out_src_x
+
+        # --- LAEA ---
+        if _is_geographic_wgs84_or_nad83(src_epsg):
+            params = _laea_params(tgt_crs)
+            if params is not None:
+                lon0, lat0, sinb1, cosb1, dd, xmf, ymf, rq, qp, fe, fn, mode = params
+                _k_laea_inverse[bpg, tpb](
+                    out_src_x, out_src_y, left, top, res_x, res_y,
+                    lon0, sinb1, cosb1, xmf, ymf, rq, qp,
+                    fe, fn, _E, _A, _WGS84_E2, mode,
+                    _APA[0], _APA[1], _APA[2], _APA[3], _APA[4])
+                return out_src_y, out_src_x
+
+        if _is_geographic_wgs84_or_nad83(tgt_epsg):
+            params = _laea_params(src_crs)
+            if params is not None:
+                lon0, lat0, sinb1, cosb1, dd, xmf, ymf, rq, qp, fe, fn, mode = params
+                _k_laea_forward[bpg, tpb](
+                    out_src_x, out_src_y, left, top, res_x, res_y,
+                    lon0, sinb1, cosb1, xmf, ymf, rq, qp,
+                    fe, fn, _E, _A, _WGS84_E2, mode)
+                return out_src_y, out_src_x
+
+        # --- Polar Stereographic ---
+        if _is_geographic_wgs84_or_nad83(src_epsg):
+            params = _stere_params(tgt_crs)
+            if params is not None:
+                lon0, k0, akm1, fe, fn, is_south = params
+                _k_stere_inverse[bpg, tpb](
+                    out_src_x, out_src_y, left, top, res_x, res_y,
+                    lon0, akm1, fe, fn, _E, is_south)
+                return out_src_y, out_src_x
+
+        if _is_geographic_wgs84_or_nad83(tgt_epsg):
+            params = _stere_params(src_crs)
+            if params is not None:
+                lon0, k0, akm1, fe, fn, is_south = params
+                _k_stere_forward[bpg, tpb](
+                    out_src_x, out_src_y, left, top, res_x, res_y,
+                    lon0, akm1, fe, fn, _E, is_south)
+                return out_src_y, out_src_x
+
+        return None
diff --git a/xrspatial/reproject/_vertical.py b/xrspatial/reproject/_vertical.py
new file mode 100644
index 00000000..3d102d92
--- /dev/null
+++ b/xrspatial/reproject/_vertical.py
@@ -0,0 +1,340 @@
+"""Vertical datum transformations: ellipsoidal height <-> orthometric height.
+
+Provides geoid undulation lookup from vendored EGM96 (2.6MB, 15-arcmin
+global grid) for converting between:
+
+- **Ellipsoidal height** (height above the WGS84 ellipsoid, what GPS gives)
+- **Orthometric height** (height above mean sea level / geoid, what maps show)
+- **Depth below chart datum** (bathymetric convention, positive downward)
+
+The relationship is:
+    h_ellipsoidal = H_orthometric + N_geoid
+
+where N is the geoid undulation (can be positive or negative, ranges
+from -107m to +85m globally for EGM96).
+
+Usage
+-----
+>>> from xrspatial.reproject import geoid_height, ellipsoidal_to_orthometric
+>>> N = geoid_height(-74.0, 40.7)                    # New York: ~-33m
+>>> H = ellipsoidal_to_orthometric(h_gps, lon, lat)  # GPS -> map height
+>>> h = orthometric_to_ellipsoidal(H_map, lon, lat)  # map height -> GPS
+"""
+from __future__ import annotations
+
+import math
+import os
+import threading
+
+import numpy as np
+from numba import njit, prange
+
+# ---------------------------------------------------------------------------
+# Geoid grid loading
+# ---------------------------------------------------------------------------
+
+_VENDORED_DIR = os.path.join(os.path.dirname(__file__), 'grids')
+_PROJ_CDN = "https://cdn.proj.org"
+
+_GEOID_MODELS = {
+    'EGM96': (
+        'us_nga_egm96_15.tif',
+        f'{_PROJ_CDN}/us_nga_egm96_15.tif',
+    ),
+    'EGM2008': (
+        'us_nga_egm08_25.tif',
+        f'{_PROJ_CDN}/us_nga_egm08_25.tif',
+    ),
+}
+
+_loaded_geoids = {}
+_loaded_geoids_lock = threading.Lock()
+
+
+def _find_file(filename, cdn_url=None):
+    """Find a file: vendored dir, user cache, then download."""
+    vendored = os.path.join(_VENDORED_DIR, filename)
+    if os.path.exists(vendored):
+        return vendored
+
+    cache_dir = os.path.join(os.path.expanduser('~'), '.cache', 'xrspatial', 'proj_grids')
+    cached = os.path.join(cache_dir, filename)
+    if os.path.exists(cached):
+        return cached
+
+    if cdn_url:
+        os.makedirs(cache_dir, exist_ok=True)
+        import urllib.request
+        urllib.request.urlretrieve(cdn_url, cached)
+        return cached
+    return None
+
+
+def _load_geoid(model='EGM96'):
+    """Load a geoid model, returning (data, left, top, res_x, res_y, h, w)."""
+    with _loaded_geoids_lock:
+        if model in _loaded_geoids:
+            return _loaded_geoids[model]
+
+    if model not in _GEOID_MODELS:
+        raise ValueError(f"Unknown geoid model: {model!r}. "
+                         f"Available: {list(_GEOID_MODELS)}")
+
+    filename, cdn_url = _GEOID_MODELS[model]
+    path = _find_file(filename, cdn_url)
+    if path is None:
+        raise FileNotFoundError(
+            f"Geoid model {model} not found. File: {filename}")
+
+    try:
+        import rasterio
+        with rasterio.open(path) as ds:
+            data = ds.read(1).astype(np.float64)
+            b = ds.bounds
+            h, w = ds.height, ds.width
+            res_x = (b.right - b.left) / w
+            res_y = (b.top - b.bottom) / h
+            result = (np.ascontiguousarray(data), b.left, b.top, res_x, res_y, h, w)
+    except ImportError:
+        from xrspatial.geotiff import read_geotiff
+        da = read_geotiff(path)
+        vals = da.values.astype(np.float64)
+        if vals.ndim == 3:
+            vals = vals[0] if vals.shape[0] == 1 else vals[:, :, 0]
+        y = da.coords['y'].values
+        x = da.coords['x'].values
+        h, w = vals.shape
+        res_x = abs(float(x[1] - x[0])) if len(x) > 1 else 0.25
+        res_y = abs(float(y[1] - y[0])) if len(y) > 1 else 0.25
+        left = float(x[0]) - res_x / 2
+        top = float(y[0]) + res_y / 2
+        result = (np.ascontiguousarray(vals), left, top, res_x, res_y, h, w)
+
+    with _loaded_geoids_lock:
+        _loaded_geoids[model] = result
+    return result
+
+
+# ---------------------------------------------------------------------------
+# Numba interpolation
+# ---------------------------------------------------------------------------
+
+@njit(nogil=True, cache=True)
+def _interp_geoid_point(lon, lat, data, left, top, res_x, res_y, h, w):
+    """Bilinear interpolation of geoid undulation at a single point."""
+    # Wrap longitude to [-180, 180)
+    lon_w = lon
+    while lon_w < -180.0:
+        lon_w += 360.0
+    while lon_w >= 180.0:
+        lon_w -= 360.0
+
+    col_f = (lon_w - left) / res_x
+    row_f = (top - lat) / res_y
+
+    if row_f < 0 or row_f > h - 1:
+        return math.nan  # outside latitude range
+
+    # Wrap column for global grids
+    c0 = int(col_f) % w
+    c1 = (c0 + 1) % w
+    r0 = int(row_f)
+    if r0 >= h - 1:
+        r0 = h - 2
+    r1 = r0 + 1
+
+    dc = col_f - int(col_f)
+    dr = row_f - r0
+
+    N = (data[r0, c0] * (1.0 - dr) * (1.0 - dc) +
+         data[r0, c1] * (1.0 - dr) * dc +
+         data[r1, c0] * dr * (1.0 - dc) +
+         data[r1, c1] * dr * dc)
+    return N
+
+
+@njit(nogil=True, cache=True, parallel=True)
+def _interp_geoid_batch(lons, lats, out, data, left, top, res_x, res_y, h, w):
+    """Batch bilinear interpolation of geoid undulation."""
+    for i in prange(lons.shape[0]):
+        out[i] = _interp_geoid_point(lons[i], lats[i], data, left, top,
+                                     res_x, res_y, h, w)
+
+
+@njit(nogil=True, cache=True, parallel=True)
+def _interp_geoid_2d(lons_2d, lats_2d, out_2d, data, left, top, res_x, res_y, h, w):
+    """2D batch geoid interpolation for raster grids."""
+    for i in prange(lons_2d.shape[0]):
+        for j in range(lons_2d.shape[1]):
+            out_2d[i, j] = _interp_geoid_point(
+                lons_2d[i, j], lats_2d[i, j], data, left, top,
+                res_x, res_y, h, w)
+
+
+# ---------------------------------------------------------------------------
+# Public API
+# ---------------------------------------------------------------------------
+
+def geoid_height(lon, lat, model='EGM96'):
+    """Get the geoid undulation N at given coordinates.
+
+    Parameters
+    ----------
+    lon, lat : float, array-like, or xr.DataArray
+        Geographic coordinates in degrees (WGS84).
+    model : str
+        Geoid model: 'EGM96' (vendored, 2.6MB) or 'EGM2008' (77MB, downloaded on first use).
+
+    Returns
+    -------
+    N : same type as input
+        Geoid undulation in metres. Positive means the geoid is above
+        the ellipsoid.
+
+    Examples
+    --------
+    >>> geoid_height(-74.0, 40.7)           # New York: ~-33m
+    >>> geoid_height(np.array([0, 90]), np.array([0, 0]))  # batch
+    """
+    data, left, top, res_x, res_y, h, w = _load_geoid(model)
+
+    scalar = np.ndim(lon) == 0 and np.ndim(lat) == 0
+    lon_arr = np.atleast_1d(np.asarray(lon, dtype=np.float64)).ravel()
+    lat_arr = np.atleast_1d(np.asarray(lat, dtype=np.float64)).ravel()
+
+    out = np.empty(lon_arr.shape[0], dtype=np.float64)
+    _interp_geoid_batch(lon_arr, lat_arr, out, data, left, top,
+                        res_x, res_y, h, w)
+
+    return float(out[0]) if scalar else out.reshape(np.shape(lon))
+
+
+def geoid_height_raster(raster, model='EGM96'):
+    """Get geoid undulation for every pixel in a geographic raster.
+
+    Parameters
+    ----------
+    raster : xr.DataArray
+        Raster with y (latitude) and x (longitude) coordinates in degrees.
+    model : str
+        Geoid model name.
+
+    Returns
+    -------
+    xr.DataArray
+        Geoid undulation N in metres, same shape as input.
+    """
+    import xarray as xr
+
+    data, left, top, res_x, res_y, h, w = _load_geoid(model)
+
+    y = raster.coords[raster.dims[-2]].values.astype(np.float64)
+    x = raster.coords[raster.dims[-1]].values.astype(np.float64)
+    xx, yy = np.meshgrid(x, y)
+
+    out = np.empty_like(xx)
+    _interp_geoid_2d(xx, yy, out, data, left, top, res_x, res_y, h, w)
+
+    return xr.DataArray(
+        out, dims=raster.dims[-2:],
+        coords={raster.dims[-2]: raster.coords[raster.dims[-2]],
+                raster.dims[-1]: raster.coords[raster.dims[-1]]},
+        name='geoid_undulation',
+        attrs={'units': 'metres', 'model': model},
+    )
+
+
+def ellipsoidal_to_orthometric(height, lon, lat, model='EGM96'):
+    """Convert ellipsoidal height to orthometric (mean-sea-level) height.
+
+    H = h - N
+
+    Parameters
+    ----------
+    height : float or array-like
+        Ellipsoidal height in metres (e.g. from GPS).
+    lon, lat : float or array-like
+        Geographic coordinates in degrees.
+    model : str
+        Geoid model name.
+
+    Returns
+    -------
+    H : same type as height
+        Orthometric height in metres.
+    """
+    N = geoid_height(lon, lat, model)
+    return np.asarray(height) - N
+
+
+def orthometric_to_ellipsoidal(height, lon, lat, model='EGM96'):
+    """Convert orthometric (mean-sea-level) height to ellipsoidal height.
+
+    h = H + N
+
+    Parameters
+    ----------
+    height : float or array-like
+        Orthometric height in metres.
+    lon, lat : float or array-like
+        Geographic coordinates in degrees.
+    model : str
+        Geoid model name.
+
+    Returns
+    -------
+    h : same type as height
+        Ellipsoidal height in metres.
+    """
+    N = geoid_height(lon, lat, model)
+    return np.asarray(height) + N
+
+
+def depth_to_ellipsoidal(depth, lon, lat, model='EGM96'):
+    """Convert depth below chart datum (positive downward) to ellipsoidal height.
+
+    Assumes chart datum is approximately mean sea level (the geoid).
+
+    h = -depth + N
+
+    Parameters
+    ----------
+    depth : float or array-like
+        Depth below chart datum in metres (positive downward).
+    lon, lat : float or array-like
+        Geographic coordinates in degrees.
+    model : str
+        Geoid model name.
+
+    Returns
+    -------
+    h : same type as depth
+        Ellipsoidal height in metres (negative below ellipsoid).
+    """
+    N = geoid_height(lon, lat, model)
+    return -np.asarray(depth) + N
+
+
+def ellipsoidal_to_depth(height, lon, lat, model='EGM96'):
+    """Convert ellipsoidal height to depth below chart datum (positive downward).
+
+    Assumes chart datum is approximately mean sea level (the geoid).
+
+    depth = -(h - N) = N - h
+
+    Parameters
+    ----------
+    height : float or array-like
+        Ellipsoidal height in metres.
+    lon, lat : float or array-like
+        Geographic coordinates in degrees.
+    model : str
+        Geoid model name.
+
+    Returns
+    -------
+    depth : same type as height
+        Depth below chart datum in metres (positive downward).
+    """
+    N = geoid_height(lon, lat, model)
+    return N - np.asarray(height)
diff --git a/xrspatial/reproject/grids/at_bev_AT_GIS_GRID.tif b/xrspatial/reproject/grids/at_bev_AT_GIS_GRID.tif
new file mode 100644
index 00000000..79a9bb54
Binary files /dev/null and b/xrspatial/reproject/grids/at_bev_AT_GIS_GRID.tif differ
diff --git a/xrspatial/reproject/grids/au_icsm_A66_National_13_09_01.tif b/xrspatial/reproject/grids/au_icsm_A66_National_13_09_01.tif
new file mode 100644
index 00000000..98cf934b
Binary files /dev/null and b/xrspatial/reproject/grids/au_icsm_A66_National_13_09_01.tif differ
diff --git a/xrspatial/reproject/grids/be_ign_bd72lb72_etrs89lb08.tif b/xrspatial/reproject/grids/be_ign_bd72lb72_etrs89lb08.tif
new file mode 100644
index 00000000..28d95159
Binary files /dev/null and b/xrspatial/reproject/grids/be_ign_bd72lb72_etrs89lb08.tif differ
diff --git a/xrspatial/reproject/grids/ch_swisstopo_CHENyx06_ETRS.tif b/xrspatial/reproject/grids/ch_swisstopo_CHENyx06_ETRS.tif
new file mode 100644
index 00000000..f9ec53d3
Binary files /dev/null and b/xrspatial/reproject/grids/ch_swisstopo_CHENyx06_ETRS.tif differ
diff --git a/xrspatial/reproject/grids/de_adv_BETA2007.tif b/xrspatial/reproject/grids/de_adv_BETA2007.tif
new file mode 100644
index 00000000..34091717
Binary files /dev/null and b/xrspatial/reproject/grids/de_adv_BETA2007.tif differ
diff --git a/xrspatial/reproject/grids/es_ign_SPED2ETV2.tif b/xrspatial/reproject/grids/es_ign_SPED2ETV2.tif
new file mode 100644
index 00000000..affb93af
Binary files /dev/null and b/xrspatial/reproject/grids/es_ign_SPED2ETV2.tif differ
diff --git a/xrspatial/reproject/grids/nl_nsgi_rdcorr2018.tif b/xrspatial/reproject/grids/nl_nsgi_rdcorr2018.tif
new file mode 100644
index 00000000..c71fe805
Binary files /dev/null and b/xrspatial/reproject/grids/nl_nsgi_rdcorr2018.tif differ
diff --git a/xrspatial/reproject/grids/pt_dgt_D73_ETRS89_geo.tif b/xrspatial/reproject/grids/pt_dgt_D73_ETRS89_geo.tif
new file mode 100644
index 00000000..1e44b7c8
Binary files /dev/null and b/xrspatial/reproject/grids/pt_dgt_D73_ETRS89_geo.tif differ
diff --git a/xrspatial/reproject/grids/uk_os_OSTN15_NTv2_OSGBtoETRS.tif b/xrspatial/reproject/grids/uk_os_OSTN15_NTv2_OSGBtoETRS.tif
new file mode 100644
index 00000000..36694176
Binary files /dev/null and b/xrspatial/reproject/grids/uk_os_OSTN15_NTv2_OSGBtoETRS.tif differ
diff --git a/xrspatial/reproject/grids/us_nga_egm96_15.tif b/xrspatial/reproject/grids/us_nga_egm96_15.tif
new file mode 100644
index 00000000..94a9f967
Binary files /dev/null and b/xrspatial/reproject/grids/us_nga_egm96_15.tif differ
diff --git a/xrspatial/reproject/grids/us_noaa_alaska.tif b/xrspatial/reproject/grids/us_noaa_alaska.tif
new file mode 100644
index 00000000..a11852a0
Binary files /dev/null and b/xrspatial/reproject/grids/us_noaa_alaska.tif differ
diff --git a/xrspatial/reproject/grids/us_noaa_conus.tif b/xrspatial/reproject/grids/us_noaa_conus.tif
new file mode 100644
index 00000000..88c4d00b
Binary files /dev/null and b/xrspatial/reproject/grids/us_noaa_conus.tif differ
diff --git a/xrspatial/reproject/grids/us_noaa_hawaii.tif b/xrspatial/reproject/grids/us_noaa_hawaii.tif
new file mode 100644
index 00000000..ae425391
Binary files /dev/null and b/xrspatial/reproject/grids/us_noaa_hawaii.tif differ
diff --git a/xrspatial/reproject/grids/us_noaa_nadcon5_nad27_nad83_1986_conus.tif b/xrspatial/reproject/grids/us_noaa_nadcon5_nad27_nad83_1986_conus.tif
new file mode 100644
index 00000000..745ce4ef
Binary files /dev/null and b/xrspatial/reproject/grids/us_noaa_nadcon5_nad27_nad83_1986_conus.tif differ
diff --git a/xrspatial/reproject/grids/us_noaa_prvi.tif b/xrspatial/reproject/grids/us_noaa_prvi.tif
new file mode 100644
index 00000000..2aff41c8
Binary files /dev/null and b/xrspatial/reproject/grids/us_noaa_prvi.tif differ
diff --git a/xrspatial/tests/bench_reproject_vs_rioxarray.py b/xrspatial/tests/bench_reproject_vs_rioxarray.py
new file mode 100644
index 00000000..48dafe25
--- /dev/null
+++ b/xrspatial/tests/bench_reproject_vs_rioxarray.py
@@ -0,0 +1,579 @@
+#!/usr/bin/env python
+"""
+Benchmark xrspatial.reproject vs rioxarray.reproject
+====================================================
+
+Compares performance and pixel-level consistency across raster sizes,
+CRS pairs, and resampling methods.
+
+Usage
+-----
+    python -m xrspatial.tests.bench_reproject_vs_rioxarray
+"""
+
+import time
+import sys
+
+import numpy as np
+import xarray as xr
+
+from xrspatial.reproject import reproject as xrs_reproject
+
+try:
+    import rioxarray  # noqa: F401
+    HAS_RIOXARRAY = True
+except ImportError:
+    HAS_RIOXARRAY = False
+
+try:
+    from pyproj import CRS
+    HAS_PYPROJ = True
+except ImportError:
+    HAS_PYPROJ = False
+
+
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+
+def _timer(fn, warmup=1, runs=5):
+    """Time a callable, returning (median_seconds, result_from_last_call)."""
+    for _ in range(warmup):
+        result = fn()
+    times = []
+    for _ in range(runs):
+        t0 = time.perf_counter()
+        result = fn()
+        times.append(time.perf_counter() - t0)
+    times.sort()
+    return times[len(times) // 2], result
+
+
+def _make_raster(h, w, crs='EPSG:4326', x_range=(-10, 10), y_range=(-10, 10),
+                 nodata=np.nan):
+    """Create a test DataArray with geographic coordinates and CRS metadata."""
+    y = np.linspace(y_range[1], y_range[0], h)
+    x = np.linspace(x_range[0], x_range[1], w)
+    xx, yy = np.meshgrid(x, y)
+    data = (xx + yy).astype(np.float64)
+    return xr.DataArray(
+        data, dims=['y', 'x'],
+        coords={'y': y, 'x': x},
+        name='gradient',
+        attrs={'crs': crs, 'nodata': nodata},
+    )
+
+
+def _make_rio_raster(da, crs_str='EPSG:4326'):
+    """Convert an xrspatial-style DataArray to rioxarray-compatible form."""
+    da_rio = da.copy()
+    res_y = float(da.y[1] - da.y[0])  # negative for north-up
+    res_x = float(da.x[1] - da.x[0])
+    left = float(da.x[0]) - res_x / 2
+    top = float(da.y[0]) - res_y / 2   # y descending, so y[0] is top
+    from rasterio.transform import from_origin
+    transform = from_origin(left, top, res_x, abs(res_y))
+    da_rio.rio.write_crs(crs_str, inplace=True)
+    da_rio.rio.write_transform(transform, inplace=True)
+    da_rio.rio.write_nodata(np.nan, inplace=True)
+    return da_rio
+
+
+RESAMPLING_MAP_RIO = {
+    'nearest': 0,   # rasterio.enums.Resampling.nearest
+    'bilinear': 1,  # rasterio.enums.Resampling.bilinear
+    'cubic': 2,     # rasterio.enums.Resampling.cubic
+}
+
+
+def _fmt_time(seconds):
+    if seconds < 1:
+        return f'{seconds * 1000:.1f}ms'
+    return f'{seconds:.2f}s'
+
+
+def _fmt_shape(shape):
+    return f'{shape[0]}x{shape[1]}'
+
+
+# CRS-specific coordinate ranges (square aspect ratio in source units)
+CRS_RANGES = {
+    'EPSG:4326': {'x_range': (-10, 10), 'y_range': (40, 60)},
+    'EPSG:32633': {'x_range': (300000, 700000), 'y_range': (5200000, 5600000)},
+}
+
+
+# ---------------------------------------------------------------------------
+# Benchmark cases
+# ---------------------------------------------------------------------------
+
+SIZES = [
+    (256, 256),
+    (512, 512),
+    (1024, 1024),
+    (2048, 2048),
+    (4096, 4096),
+]
+
+CRS_PAIRS = [
+    ('EPSG:4326', 'EPSG:32633'),   # WGS84 -> UTM zone 33N
+    ('EPSG:32633', 'EPSG:4326'),   # UTM -> WGS84
+    ('EPSG:4326', 'EPSG:3857'),    # WGS84 -> Web Mercator
+]
+
+RESAMPLINGS = ['nearest', 'bilinear', 'cubic']
+
+
+def run_performance(sizes=None, crs_pairs=None, resamplings=None):
+    """Run performance benchmarks (approx, exact, and rioxarray)."""
+    sizes = sizes or SIZES
+    crs_pairs = crs_pairs or CRS_PAIRS
+    resamplings = resamplings or ['bilinear']
+
+    print()
+    print('=' * 90)
+    print('PERFORMANCE BENCHMARK: xrspatial (approx / exact) vs rioxarray')
+    print('=' * 90)
+
+    for src_crs, dst_crs in crs_pairs:
+        ranges = CRS_RANGES[src_crs]
+
+        print(f'\n### {src_crs} -> {dst_crs}')
+        print()
+        print(f'| {"Size":>12} | {"Resampling":>10} '
+              f'| {"xrs approx":>12} | {"xrs exact":>12} '
+              f'| {"rioxarray":>12} | {"approx/rio":>10} | {"exact/rio":>10} |')
+        print(f'|{"-"*14}|{"-"*12}'
+              f'|{"-"*14}|{"-"*14}'
+              f'|{"-"*14}|{"-"*12}|{"-"*12}|')
+
+        for h, w in sizes:
+            da = _make_raster(h, w, crs=src_crs, **ranges)
+            da_rio = _make_rio_raster(da, src_crs)
+
+            for resampling in resamplings:
+                # xrspatial approx (default, precision=16)
+                approx_time, _ = _timer(
+                    lambda: xrs_reproject(da, dst_crs,
+                                         resampling=resampling,
+                                         transform_precision=16),
+                    warmup=2, runs=5,
+                )
+
+                # xrspatial exact (precision=0)
+                exact_time, _ = _timer(
+                    lambda: xrs_reproject(da, dst_crs,
+                                         resampling=resampling,
+                                         transform_precision=0),
+                    warmup=2, runs=5,
+                )
+
+                # rioxarray
+                rio_resamp = RESAMPLING_MAP_RIO[resampling]
+                rio_time, _ = _timer(
+                    lambda: da_rio.rio.reproject(dst_crs,
+                                                resampling=rio_resamp),
+                    warmup=2, runs=5,
+                )
+
+                approx_ratio = rio_time / approx_time if approx_time > 0 else float('inf')
+                exact_ratio = rio_time / exact_time if exact_time > 0 else float('inf')
+
+                print(f'| {_fmt_shape((h, w)):>12} | {resampling:>10} '
+                      f'| {_fmt_time(approx_time):>12} '
+                      f'| {_fmt_time(exact_time):>12} '
+                      f'| {_fmt_time(rio_time):>12} '
+                      f'| {approx_ratio:>9.2f}x '
+                      f'| {exact_ratio:>9.2f}x |')
+
+
+def run_consistency(sizes=None, crs_pairs=None, resamplings=None):
+    """Run pixel-level consistency checks.
+
+    Forces both libraries to produce the same output grid by running
+    rioxarray first, then passing its resolution and bounds to xrspatial.
+    """
+    sizes = sizes or [(256, 256), (512, 512), (1024, 1024)]
+    crs_pairs = crs_pairs or CRS_PAIRS
+    resamplings = resamplings or RESAMPLINGS
+
+    print()
+    print('=' * 80)
+    print('CONSISTENCY CHECK: xrspatial vs rioxarray (same output grid)')
+    print('=' * 80)
+    print()
+    print(f'| {"Size":>12} | {"CRS":>24} | {"Resampling":>10} '
+          f'| {"Out shape":>11} | {"RMSE":>10} | {"MaxErr":>10} '
+          f'| {"R²":>8} | {"NaN agree":>9} |')
+    print(f'|{"-"*14}|{"-"*26}|{"-"*12}'
+          f'|{"-"*13}|{"-"*12}|{"-"*12}'
+          f'|{"-"*10}|{"-"*11}|')
+
+    for src_crs, dst_crs in crs_pairs:
+        ranges = CRS_RANGES[src_crs]
+
+        for h, w in sizes:
+            da = _make_raster(h, w, crs=src_crs, **ranges)
+            da_rio = _make_rio_raster(da, src_crs)
+
+            for resampling in resamplings:
+                # Run rioxarray first to get the reference output grid
+                rio_resamp = RESAMPLING_MAP_RIO[resampling]
+                rio_result = da_rio.rio.reproject(dst_crs,
+                                                 resampling=rio_resamp)
+                rio_vals = rio_result.values
+
+                # Extract rioxarray's output grid parameters
+                rio_transform = rio_result.rio.transform()
+                rio_res_x = rio_transform.a
+                rio_res_y = abs(rio_transform.e)
+                rio_h, rio_w = rio_vals.shape
+                rio_left = rio_transform.c
+                rio_top = rio_transform.f
+                rio_bounds = (
+                    rio_left,                        # left
+                    rio_top - rio_res_y * rio_h,     # bottom
+                    rio_left + rio_res_x * rio_w,    # right
+                    rio_top,                         # top
+                )
+
+                # Run xrspatial with the same grid
+                xrs_result = xrs_reproject(
+                    da, dst_crs,
+                    resampling=resampling,
+                    resolution=(rio_res_y, rio_res_x),
+                    bounds=rio_bounds,
+                )
+                xrs_vals = xrs_result.values
+
+                shape_ok = xrs_vals.shape == rio_vals.shape
+                if not shape_ok:
+                    # Crop to common area
+                    common_h = min(xrs_vals.shape[0], rio_vals.shape[0])
+                    common_w = min(xrs_vals.shape[1], rio_vals.shape[1])
+                    xrs_vals = xrs_vals[:common_h, :common_w]
+                    rio_vals = rio_vals[:common_h, :common_w]
+
+                # Compare where both have valid data
+                xrs_nan = np.isnan(xrs_vals)
+                rio_nan = np.isnan(rio_vals)
+                both_valid = ~xrs_nan & ~rio_nan
+                nan_agree = np.mean(xrs_nan == rio_nan) * 100
+
+                if both_valid.sum() > 0:
+                    diff = xrs_vals[both_valid] - rio_vals[both_valid]
+                    rmse = np.sqrt(np.mean(diff ** 2))
+                    max_err = np.max(np.abs(diff))
+                    ss_res = np.sum(diff ** 2)
+                    ss_tot = np.sum(
+                        (rio_vals[both_valid]
+                         - np.mean(rio_vals[both_valid])) ** 2
+                    )
+                    r2 = 1 - ss_res / ss_tot if ss_tot > 0 else 1.0
+                    rmse_str = f'{rmse:.6f}'
+                    max_str = f'{max_err:.6f}'
+                    r2_str = f'{r2:.6f}'
+                else:
+                    rmse_str = 'N/A'
+                    max_str = 'N/A'
+                    r2_str = 'N/A'
+
+                out_shape = _fmt_shape(xrs_vals.shape)
+                if not shape_ok:
+                    out_shape += '*'
+                crs_label = f'{src_crs}->{dst_crs}'
+
+                print(f'| {_fmt_shape((h, w)):>12} | {crs_label:>24} '
+                      f'| {resampling:>10} '
+                      f'| {out_shape:>11} | {rmse_str:>10} '
+                      f'| {max_str:>10} | {r2_str:>8} '
+                      f'| {nan_agree:>8.1f}% |')
+
+
+REAL_WORLD_FILES = [
+    {
+        'path': '~/rtxpy/examples/render_demo_terrain.tif',
+        'target_crs': 'EPSG:32618',
+        'label': 'render_demo 187x253 NAD83->UTM18',
+    },
+    {
+        'path': '~/rtxpy/examples/USGS_1_n43w123.tif',
+        'target_crs': 'EPSG:32610',
+        'label': 'USGS 1as Oregon 3612x3612 NAD83->UTM10',
+    },
+    {
+        'path': '~/rtxpy/examples/USGS_1_n39w106.tif',
+        'target_crs': 'EPSG:32613',
+        'label': 'USGS 1as Colorado 3612x3612 NAD83->UTM13',
+    },
+    {
+        'path': '~/rtxpy/examples/Copernicus_DSM_COG_10_N40_00_W075_00_DEM.tif',
+        'target_crs': 'EPSG:32618',
+        'label': 'Copernicus DEM 3600x3600 WGS84->UTM18',
+    },
+    {
+        'path': '~/rtxpy/examples/USGS_one_meter_x66y454_NY_LongIsland_Z18_2014.tif',
+        'target_crs': 'EPSG:4326',
+        'label': 'USGS 1m LongIsland 10012x10012 UTM18->WGS84',
+    },
+]
+
+
+def _load_for_both(path):
+    """Load a GeoTIFF for both xrspatial and rioxarray."""
+    import os
+    path = os.path.expanduser(path)
+
+    from xrspatial.geotiff import read_geotiff
+    da_xrs = read_geotiff(path)
+
+    da_rio = rioxarray.open_rasterio(path).squeeze(drop=True)
+    return da_xrs, da_rio
+
+
+def run_real_world(files=None, resamplings=None):
+    """Benchmark and compare on real-world GeoTIFF files."""
+    import os
+    files = files or REAL_WORLD_FILES
+    resamplings = resamplings or ['bilinear']
+
+    # Filter to files that exist
+    files = [f for f in files if os.path.exists(os.path.expanduser(f['path']))]
+    if not files:
+        print('\nNo real-world files found, skipping.')
+        return
+
+    print()
+    print('=' * 130)
+    print('REAL-WORLD FILES: performance and consistency (approx vs exact vs rioxarray)')
+    print('=' * 130)
+    print()
+    print(f'| {"File":>48} '
+          f'| {"xrs approx":>11} | {"xrs exact":>11} | {"rioxarray":>11} '
+          f'| {"ap/rio":>6} | {"ex/rio":>6} '
+          f'| {"RMSE(approx)":>12} | {"RMSE(exact)":>12} '
+          f'| {"MaxE(approx)":>12} | {"MaxE(exact)":>12} |')
+    print(f'|{"-"*50}'
+          f'|{"-"*13}|{"-"*13}|{"-"*13}'
+          f'|{"-"*8}|{"-"*8}'
+          f'|{"-"*14}|{"-"*14}'
+          f'|{"-"*14}|{"-"*14}|')
+
+    for entry in files:
+        da_xrs, da_rio = _load_for_both(entry['path'])
+        dst_crs = entry['target_crs']
+        label = entry['label']
+
+        for resampling in resamplings:
+            rio_resamp = RESAMPLING_MAP_RIO[resampling]
+
+            # Performance: xrspatial approx
+            approx_time, _ = _timer(
+                lambda: xrs_reproject(da_xrs, dst_crs, resampling=resampling,
+                                     transform_precision=16),
+                warmup=2, runs=5,
+            )
+
+            # Performance: xrspatial exact
+            exact_time, _ = _timer(
+                lambda: xrs_reproject(da_xrs, dst_crs, resampling=resampling,
+                                     transform_precision=0),
+                warmup=2, runs=5,
+            )
+
+            # Performance: rioxarray
+            rio_time, rio_result = _timer(
+                lambda: da_rio.rio.reproject(dst_crs, resampling=rio_resamp),
+                warmup=2, runs=5,
+            )
+
+            approx_ratio = rio_time / approx_time if approx_time > 0 else float('inf')
+            exact_ratio = rio_time / exact_time if exact_time > 0 else float('inf')
+
+            # Consistency: force same grid, test both modes
+            rio_vals = rio_result.values
+            rio_transform = rio_result.rio.transform()
+            rio_res_x = rio_transform.a
+            rio_res_y = abs(rio_transform.e)
+            rio_h, rio_w = rio_vals.shape
+            rio_left = rio_transform.c
+            rio_top = rio_transform.f
+            rio_bounds = (
+                rio_left,
+                rio_top - rio_res_y * rio_h,
+                rio_left + rio_res_x * rio_w,
+                rio_top,
+            )
+
+            nodata = da_xrs.attrs.get('nodata', None)
+            stats = {}
+            for mode_name, precision in [('approx', 16), ('exact', 0)]:
+                xrs_matched = xrs_reproject(
+                    da_xrs, dst_crs,
+                    resampling=resampling,
+                    resolution=(rio_res_y, rio_res_x),
+                    bounds=rio_bounds,
+                    transform_precision=precision,
+                )
+                xrs_vals = xrs_matched.values
+                rv = rio_vals
+
+                if xrs_vals.shape != rv.shape:
+                    ch = min(xrs_vals.shape[0], rv.shape[0])
+                    cw = min(xrs_vals.shape[1], rv.shape[1])
+                    xrs_vals = xrs_vals[:ch, :cw]
+                    rv = rv[:ch, :cw]
+
+                xf = xrs_vals.astype(np.float64)
+                rf = rv.astype(np.float64)
+
+                if nodata is not None and not np.isnan(nodata):
+                    both_valid = (xf != nodata) & (rf != nodata)
+                else:
+                    both_valid = np.isfinite(xf) & np.isfinite(rf)
+
+                if both_valid.sum() > 0:
+                    diff = xf[both_valid] - rf[both_valid]
+                    rmse = np.sqrt(np.mean(diff ** 2))
+                    max_err = np.max(np.abs(diff))
+                else:
+                    rmse = max_err = float('nan')
+                stats[mode_name] = (rmse, max_err)
+
+            print(f'| {label:>48} '
+                  f'| {_fmt_time(approx_time):>11} '
+                  f'| {_fmt_time(exact_time):>11} '
+                  f'| {_fmt_time(rio_time):>11} '
+                  f'| {approx_ratio:>5.2f}x '
+                  f'| {exact_ratio:>5.2f}x '
+                  f'| {stats["approx"][0]:>12.6f} '
+                  f'| {stats["exact"][0]:>12.6f} '
+                  f'| {stats["approx"][1]:>12.6f} '
+                  f'| {stats["exact"][1]:>12.6f} |')
+
+
+def run_merge(sizes=None):
+    """Benchmark xrspatial.merge vs rioxarray.merge_arrays.
+
+    Creates 4 overlapping rasters in a 2x2 grid arrangement and merges
+    them into a single mosaic with each library.
+    """
+    from rioxarray.merge import merge_arrays as rio_merge_arrays
+
+    from xrspatial.reproject import merge as xrs_merge
+
+    sizes = sizes or [(512, 512), (1024, 1024), (2048, 2048)]
+
+    print()
+    print('=' * 100)
+    print('MERGE BENCHMARK: xrspatial.merge vs rioxarray.merge_arrays (4 overlapping tiles)')
+    print('=' * 100)
+    print()
+    print(f'| {"Tile size":>12} '
+          f'| {"xrs merge":>11} | {"rio merge":>11} '
+          f'| {"xrs/rio":>7} '
+          f'| {"RMSE":>10} | {"MaxErr":>10} '
+          f'| {"Valid px":>10} | {"NaN agree":>9} |')
+    print(f'|{"-" * 14}'
+          f'|{"-" * 13}|{"-" * 13}'
+          f'|{"-" * 9}'
+          f'|{"-" * 12}|{"-" * 12}'
+          f'|{"-" * 12}|{"-" * 11}|')
+
+    for h, w in sizes:
+        # Build 4 overlapping tiles in a 2x2 grid.
+        # Each tile spans 10 degrees; overlap is 2 degrees on each shared edge.
+        # Total coverage: 18 x 18 degrees (from -9 to 9 lon, 41 to 59 lat).
+        tile_specs = [
+            # (x_range, y_range) -- 2-degree overlap between neighbours
+            ((-9, 1),  (49, 59)),   # top-left
+            ((-1, 9),  (49, 59)),   # top-right
+            ((-9, 1),  (41, 51)),   # bottom-left
+            ((-1, 9),  (41, 51)),   # bottom-right
+        ]
+
+        tiles_xrs = []
+        tiles_rio = []
+        for x_range, y_range in tile_specs:
+            da = _make_raster(h, w, crs='EPSG:4326',
+                              x_range=x_range, y_range=y_range)
+            tiles_xrs.append(da)
+            tiles_rio.append(_make_rio_raster(da, 'EPSG:4326'))
+
+        # Benchmark xrspatial merge
+        xrs_time, xrs_result = _timer(
+            lambda: xrs_merge(tiles_xrs),
+            warmup=1, runs=3,
+        )
+
+        # Benchmark rioxarray merge
+        rio_time, rio_result = _timer(
+            lambda: rio_merge_arrays(tiles_rio),
+            warmup=1, runs=3,
+        )
+
+        xrs_vals = xrs_result.values
+        rio_vals = rio_result.values
+
+        # Crop to common shape if they differ
+        common_h = min(xrs_vals.shape[0], rio_vals.shape[0])
+        common_w = min(xrs_vals.shape[1], rio_vals.shape[1])
+        xrs_vals = xrs_vals[:common_h, :common_w]
+        rio_vals = rio_vals[:common_h, :common_w]
+
+        # Compare where both have valid data
+        xrs_nan = np.isnan(xrs_vals)
+        rio_nan = np.isnan(rio_vals)
+        both_valid = ~xrs_nan & ~rio_nan
+        n_valid = int(both_valid.sum())
+        nan_agree = np.mean(xrs_nan == rio_nan) * 100
+
+        if n_valid > 0:
+            diff = xrs_vals[both_valid] - rio_vals[both_valid]
+            rmse = np.sqrt(np.mean(diff ** 2))
+            max_err = np.max(np.abs(diff))
+            rmse_str = f'{rmse:.6f}'
+            max_str = f'{max_err:.6f}'
+        else:
+            rmse_str = 'N/A'
+            max_str = 'N/A'
+
+        ratio = xrs_time / rio_time if rio_time > 0 else float('inf')
+
+        print(f'| {_fmt_shape((h, w)):>12} '
+              f'| {_fmt_time(xrs_time):>11} '
+              f'| {_fmt_time(rio_time):>11} '
+              f'| {ratio:>6.2f}x '
+              f'| {rmse_str:>10} | {max_str:>10} '
+              f'| {n_valid:>10} | {nan_agree:>8.1f}% |')
+
+
+def main():
+    if not HAS_PYPROJ:
+        print('ERROR: pyproj is required for reprojection benchmarks')
+        sys.exit(1)
+    if not HAS_RIOXARRAY:
+        print('ERROR: rioxarray is required for comparison benchmarks')
+        print('  pip install rioxarray')
+        sys.exit(1)
+
+    print(f'NumPy {np.__version__}')
+    try:
+        import numba
+        print(f'Numba {numba.__version__}')
+    except ImportError:
+        pass
+    try:
+        import rasterio
+        print(f'Rasterio {rasterio.__version__}')
+    except ImportError:
+        pass
+
+    run_consistency()
+    run_performance()
+    run_real_world()
+    run_merge()
+
+
+if __name__ == '__main__':
+    main()
diff --git a/xrspatial/tests/test_reproject.py b/xrspatial/tests/test_reproject.py
index 12c92706..cbacd7b2 100644
--- a/xrspatial/tests/test_reproject.py
+++ b/xrspatial/tests/test_reproject.py
@@ -577,6 +577,54 @@ def test_merge_invalid_strategy(self):
         with pytest.raises(ValueError, match="strategy"):
             merge([raster], strategy='median')
 
+    def test_merge_strategy_last(self):
+        """merge() with strategy='last' uses the last valid value."""
+        from xrspatial.reproject import merge
+        a = _make_raster(
+            np.full((16, 16), 10.0), x_range=(-5, 5), y_range=(-5, 5)
+        )
+        b = _make_raster(
+            np.full((16, 16), 20.0), x_range=(-5, 5), y_range=(-5, 5)
+        )
+        result = merge([a, b], strategy='last', resolution=1.0)
+        vals = result.values
+        interior = vals[2:-2, 2:-2]
+        valid = ~np.isnan(interior) & (interior != 0)
+        if valid.any():
+            np.testing.assert_allclose(interior[valid], 20.0, atol=1.0)
+
+    def test_merge_strategy_max(self):
+        """merge() with strategy='max' takes the maximum."""
+        from xrspatial.reproject import merge
+        a = _make_raster(
+            np.full((16, 16), 10.0), x_range=(-5, 5), y_range=(-5, 5)
+        )
+        b = _make_raster(
+            np.full((16, 16), 20.0), x_range=(-5, 5), y_range=(-5, 5)
+        )
+        result = merge([a, b], strategy='max', resolution=1.0)
+        vals = result.values
+        interior = vals[2:-2, 2:-2]
+        valid = ~np.isnan(interior) & (interior != 0)
+        if valid.any():
+            np.testing.assert_allclose(interior[valid], 20.0, atol=1.0)
+
+    def test_merge_strategy_min(self):
+        """merge() with strategy='min' takes the minimum."""
+        from xrspatial.reproject import merge
+        a = _make_raster(
+            np.full((16, 16), 10.0), x_range=(-5, 5), y_range=(-5, 5)
+        )
+        b = _make_raster(
+            np.full((16, 16), 20.0), x_range=(-5, 5), y_range=(-5, 5)
+        )
+        result = merge([a, b], strategy='min', resolution=1.0)
+        vals = result.values
+        interior = vals[2:-2, 2:-2]
+        valid = ~np.isnan(interior) & (interior != 0)
+        if valid.any():
+            np.testing.assert_allclose(interior[valid], 10.0, atol=1.0)
+
     @pytest.mark.skipif(not HAS_DASK, reason="dask required")
     def test_merge_dask(self):
         from xrspatial.reproject import merge
@@ -773,3 +821,191 @@ def test_wide_raster(self):
                               x_range=(-170, 170), y_range=(-2, 2))
         result = reproject(raster, 'EPSG:3857')
         assert result.shape[0] > 0
+
+
+def test_reproject_1x1_raster():
+    """Reprojecting a single-pixel raster should not crash."""
+    from xrspatial.reproject import reproject
+    da = xr.DataArray(
+        np.array([[42.0]]), dims=['y', 'x'],
+        coords={'y': [50.0], 'x': [10.0]},
+        attrs={'crs': 'EPSG:4326', 'nodata': np.nan},
+    )
+    result = reproject(da, 'EPSG:32633')
+    assert result.shape[0] >= 1 and result.shape[1] >= 1
+
+
+def test_reproject_all_nan():
+    """Reprojecting an all-NaN raster should produce all-NaN output."""
+    from xrspatial.reproject import reproject
+    da = xr.DataArray(
+        np.full((64, 64), np.nan), dims=['y', 'x'],
+        coords={'y': np.linspace(55, 45, 64), 'x': np.linspace(-5, 5, 64)},
+        attrs={'crs': 'EPSG:4326', 'nodata': np.nan},
+    )
+    result = reproject(da, 'EPSG:32633')
+    assert np.all(np.isnan(result.values))
+
+
+def test_reproject_uint8_cubic_no_overflow():
+    """Cubic resampling on uint8 should clamp, not wrap."""
+    from xrspatial.reproject import reproject
+    # Create a raster with sharp edge (0 to 255)
+    data = np.zeros((64, 64), dtype=np.uint8)
+    data[:, 32:] = 255
+    da = xr.DataArray(
+        data, dims=['y', 'x'],
+        coords={'y': np.linspace(55, 45, 64), 'x': np.linspace(-5, 5, 64)},
+        attrs={'crs': 'EPSG:4326', 'nodata': 0},
+    )
+    result = reproject(da, 'EPSG:32633', resampling='cubic')
+    vals = result.values
+    # Should be within uint8 range (clamped, not wrapped)
+    valid = vals[vals != 0]  # exclude nodata
+    if len(valid) > 0:
+        assert np.all(valid >= 0) and np.all(valid <= 255)
+
+
+# ---------------------------------------------------------------------------
+# Edge case tests
+# ---------------------------------------------------------------------------
+
+@pytest.mark.skipif(not HAS_PYPROJ, reason="pyproj not installed")
+class TestEdgeCases:
+    """Edge cases that previously caused crashes or wrong results."""
+
+    def _do_reproject(self, *args, **kwargs):
+        from xrspatial.reproject import reproject
+        return reproject(*args, **kwargs)
+
+    def test_multiband_rgb(self):
+        da = xr.DataArray(
+            np.random.rand(32, 32, 3).astype(np.float32),
+            dims=['y', 'x', 'band'],
+            coords={'y': np.linspace(55, 45, 32), 'x': np.linspace(-5, 5, 32)},
+            attrs={'crs': 'EPSG:4326', 'nodata': np.nan},
+        )
+        r = self._do_reproject(da, 'EPSG:32633')
+        assert r.ndim == 3 and r.shape[2] == 3 and 'band' in r.dims
+
+    def test_multiband_uint8(self):
+        da = xr.DataArray(
+            np.random.randint(0, 255, (32, 32, 3), dtype=np.uint8),
+            dims=['y', 'x', 'band'],
+            coords={'y': np.linspace(55, 45, 32), 'x': np.linspace(-5, 5, 32)},
+            attrs={'crs': 'EPSG:4326', 'nodata': 0},
+        )
+        r = self._do_reproject(da, 'EPSG:32633')
+        assert r.dtype == np.uint8
+
+    def test_antimeridian_crossing(self):
+        da = xr.DataArray(
+            np.ones((32, 32)), dims=['y', 'x'],
+            coords={'y': np.linspace(50, 40, 32), 'x': np.linspace(170, -170, 32)},
+            attrs={'crs': 'EPSG:4326', 'nodata': np.nan},
+        )
+        r = self._do_reproject(da, 'EPSG:32660')
+        assert r.shape[0] > 0
+
+    def test_y_ascending(self):
+        da = xr.DataArray(
+            np.ones((64, 64)), dims=['y', 'x'],
+            coords={'y': np.linspace(45, 55, 64), 'x': np.linspace(-5, 5, 64)},
+            attrs={'crs': 'EPSG:4326', 'nodata': np.nan},
+        )
+        r = self._do_reproject(da, 'EPSG:32633')
+        assert np.any(np.isfinite(r.values))
+
+    def test_checkerboard_nan(self):
+        data = np.ones((64, 64))
+        data[::2, ::2] = np.nan
+        data[1::2, 1::2] = np.nan
+        da = xr.DataArray(
+            data, dims=['y', 'x'],
+            coords={'y': np.linspace(55, 45, 64), 'x': np.linspace(-5, 5, 64)},
+            attrs={'crs': 'EPSG:4326', 'nodata': np.nan},
+        )
+        r = self._do_reproject(da, 'EPSG:32633')
+        assert np.any(np.isfinite(r.values))
+
+    def test_utm_to_geographic(self):
+        da = xr.DataArray(
+            np.ones((64, 64)), dims=['y', 'x'],
+            coords={'y': np.linspace(5600000, 5500000, 64),
+                    'x': np.linspace(300000, 400000, 64)},
+            attrs={'crs': 'EPSG:32633', 'nodata': np.nan},
+        )
+        r = self._do_reproject(da, 'EPSG:4326')
+        assert np.any(np.isfinite(r.values))
+
+    def test_proj_to_proj(self):
+        da = xr.DataArray(
+            np.ones((64, 64)), dims=['y', 'x'],
+            coords={'y': np.linspace(6500000, 6000000, 64),
+                    'x': np.linspace(200000, 800000, 64)},
+            attrs={'crs': 'EPSG:2154', 'nodata': np.nan},
+        )
+        r = self._do_reproject(da, 'EPSG:32632')
+        assert np.any(np.isfinite(r.values))
+
+    def test_sentinel_nodata(self):
+        data = np.where(np.random.rand(64, 64) > 0.8, -9999, 500).astype(np.float64)
+        da = xr.DataArray(
+            data, dims=['y', 'x'],
+            coords={'y': np.linspace(55, 45, 64), 'x': np.linspace(-5, 5, 64)},
+            attrs={'crs': 'EPSG:4326', 'nodata': -9999},
+        )
+        r = self._do_reproject(da, 'EPSG:32633')
+        assert r is not None
+
+    def test_target_crs_as_integer(self):
+        da = xr.DataArray(
+            np.ones((32, 32)), dims=['y', 'x'],
+            coords={'y': np.linspace(55, 45, 32), 'x': np.linspace(-5, 5, 32)},
+            attrs={'crs': 'EPSG:4326', 'nodata': np.nan},
+        )
+        r = self._do_reproject(da, 32633)
+        assert r.shape[0] > 0
+
+    def test_explicit_resolution(self):
+        da = xr.DataArray(
+            np.ones((64, 64)), dims=['y', 'x'],
+            coords={'y': np.linspace(55, 45, 64), 'x': np.linspace(-5, 5, 64)},
+            attrs={'crs': 'EPSG:4326', 'nodata': np.nan},
+        )
+        r = self._do_reproject(da, 'EPSG:32633', resolution=1000)
+        assert r.shape[0] > 0
+
+    def test_explicit_width_height(self):
+        da = xr.DataArray(
+            np.ones((64, 64)), dims=['y', 'x'],
+            coords={'y': np.linspace(55, 45, 64), 'x': np.linspace(-5, 5, 64)},
+            attrs={'crs': 'EPSG:4326', 'nodata': np.nan},
+        )
+        r = self._do_reproject(da, 'EPSG:32633', width=100, height=100)
+        assert r.shape == (100, 100)
+
+    def test_merge_non_overlapping(self):
+        from xrspatial.reproject import merge
+        t1 = xr.DataArray(
+            np.full((32, 32), 1.0), dims=['y', 'x'],
+            coords={'y': np.linspace(55, 50, 32), 'x': np.linspace(-5, 0, 32)},
+            attrs={'crs': 'EPSG:4326', 'nodata': np.nan},
+        )
+        t2 = xr.DataArray(
+            np.full((32, 32), 2.0), dims=['y', 'x'],
+            coords={'y': np.linspace(45, 40, 32), 'x': np.linspace(5, 10, 32)},
+            attrs={'crs': 'EPSG:4326', 'nodata': np.nan},
+        )
+        r = merge([t1, t2])
+        assert r.shape[0] > 32 and r.shape[1] > 32
+
+    def test_merge_single_tile(self):
+        from xrspatial.reproject import merge
+        t = xr.DataArray(
+            np.ones((32, 32)), dims=['y', 'x'],
+            coords={'y': np.linspace(55, 45, 32), 'x': np.linspace(-5, 5, 32)},
+            attrs={'crs': 'EPSG:4326', 'nodata': np.nan},
+        )
+        r = merge([t])
+        assert np.any(np.isfinite(r.values))