GeoParquet provider via DuckDB — interest check #2343

henrik716 · 2026-05-19T07:27:54Z

I've been running a pygeoapi provider for GeoParquet files on remote object storage (Cloudflare R2) in production for a few months and wanted to gauge interest in upstreaming it before writing a formal PR.

The gap

The existing OGR provider lacks native Parquet predicate pushdown — bbox and pagination don't map to row-group pruning, which means unnecessary data transfer on remote storage. This provider uses DuckDB + httpfs to push LIMIT/OFFSET and bbox predicates directly into the Parquet scan, so only relevant row groups are fetched.
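To make the pushdown concrete, here is a minimal sketch of how one page of features could be expressed as a single DuckDB query. It is not the code from the repo. The helper name build_items_query, the BBox type, and the column names xmin/ymin/xmax/ymax are assumptions for illustration, and it assumes the DuckDB version in use accepts bound parameters in read_parquet and in LIMIT/OFFSET.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


@dataclass(frozen=True)
class BBox:
    """Query bounding box in the dataset's CRS (hypothetical helper type)."""
    minx: float
    miny: float
    maxx: float
    maxy: float


def build_items_query(
    url: str, bbox: Optional[BBox], limit: int, offset: int
) -> Tuple[str, List[Any]]:
    """Return (sql, params) for one page of features.

    Assumes the file carries plain numeric columns named
    xmin/ymin/xmax/ymax; real files may name or nest them differently.
    """
    sql = "SELECT * FROM read_parquet(?)"
    params: List[Any] = [url]
    if bbox is not None:
        # Overlap test on numeric columns only, so the Parquet reader can
        # compare it against row-group min/max statistics.
        sql += " WHERE xmax >= ? AND xmin <= ? AND ymax >= ? AND ymin <= ?"
        params += [bbox.minx, bbox.maxx, bbox.miny, bbox.maxy]
    # Pagination is part of the same statement, not applied in Python.
    sql += " LIMIT ? OFFSET ?"
    params += [limit, offset]
    return sql, params
```

With a DuckDB connection in hand, the result would be passed as con.execute(sql, params). Without an ORDER BY the page boundaries are not guaranteed to be stable across requests, so a real provider would likely add one.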

What it does

Native SQL pagination via DuckDB — no local download
Parquet row-group pruning via numeric bbox columns before geometry intersection, which matters for remote object store latency
Two connection modes: s3:// (any S3-compatible endpoint) and https:// (public CDN)
CRS reprojection via ST_Transform inside the DuckDB query — same pass as geometry fetch and GeoJSON serialization, no per-feature Python roundtrips
Lazy spatial extension loading (DuckDB ≥1.5 optimization)
CQL2 filter support via pygeofilter
One DuckDB connection shared per Gunicorn worker process to preserve the HTTP keep-alive pool and in-memory metadata cache across requests
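As an illustration of the "same pass" point about reprojection and serialization, a rough sketch of the SELECT expression follows. ST_Transform and ST_AsGeoJSON are functions from DuckDB's spatial extension. The helper names, the output alias, and the assumption that the geometry column is already read as a GEOMETRY value are mine, not taken from the repo.

```python
def quote_ident(name: str) -> str:
    """Quote a SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value: str) -> str:
    """Quote a SQL string literal, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def geometry_select(geom_col: str, source_crs: str, target_crs: str) -> str:
    """Build the SELECT expression that reprojects and serializes geometry.

    Hypothetical helper: reprojection is skipped when the requested CRS
    already matches the storage CRS, so the common case pays nothing.
    """
    expr = quote_ident(geom_col)
    if source_crs != target_crs:
        expr = (
            f"ST_Transform({expr}, "
            f"{quote_literal(source_crs)}, {quote_literal(target_crs)})"
        )
    # Serialization happens inside DuckDB, so Python receives one string
    # per feature instead of a geometry object to convert.
    return f"ST_AsGeoJSON({expr}) AS geojson"
```

Axis order for geographic CRSs such as EPSG:4326 is worth checking against the spatial extension documentation before relying on a two-CRS call like this.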

Demo and code

Live demo: https://demo.waystones.cloud

The demo runs on an on-demand Cloudflare Container with no persistent state — the first API request warms up the DuckDB worker, after which requests are fast.

Implementation: https://github.com/waystones-nexus/pygeoapi-duckdb-geoparquet

I'm also presenting on this stack at FOSS4G Hiroshima in September, which feels like a natural moment to point people toward an upstream contribution if there is one.

Two questions before I write a PR

Is there appetite for this in core, or would a community plugin listing be the preferred path?
Is duckdb acceptable as an optional dependency alongside the existing optional deps?

Happy to discuss. The main design decision worth flagging upfront is the shared connection singleton — correct for Gunicorn's pre-fork model but worth documenting carefully for other deployment scenarios.

henrik716 · 2026-06-04T10:20:57Z

Update — June 2026

A few things have changed since posting this, worth flagging for anyone following the thread.

The provider now has its own repo:
https://github.com/waystones-nexus/pygeoapi-duckdb-geoparquet

Moved it out of the Waystones monorepo so it's easier to find, reference, and contribute to independently of everything else we're running.

Note on the demo: The demo URL is the same but no longer runs this provider. After running the pygeoapi stack in production, we found that cold start and response latency on remote object storage were the main constraints for our use case, so we built a standalone Go OAPIF server (oapif-go) specifically for GeoParquet on object storage. The demo now runs on that. It's not a pygeoapi replacement, since it has none of pygeoapi's provider breadth, but for the narrow GeoParquet-on-R2 path it gets cold start under 300 ms.

The DuckDB provider here remains the right approach for anyone who wants to stay on pygeoapi. The two questions from the original post still stand — happy to write a proper PR if there's appetite, or accept that a community plugin listing is the better fit. Either way the code is now in a dedicated repo and will be maintained there.
