Authors: Mike Johnson + Lynker Spatial Team
Accurate reservoir locations are essential for hydrologic modeling because reservoirs alter the natural flow regime by storing, releasing, and redistributing water across space and time. These operations directly influence downstream streamflow, flood peaks, drought severity, water availability, and ecosystem conditions. Today’s National Water Model (NWM) accounts for only ~500 reservoirs across CONUS, which is too few for many forecasting and planning applications. Extending that coverage requires reservoir locations from other data sources.
The National Inventory of Dams (NID) provides broad coverage but variable location quality (on-reservoir, on-flowline, generalized, sometimes wrong). Even small positional errors can connect a dam/reservoir to the wrong flowline or waterbody. That degrades the routing of inflows/outflows and reduces model skill for discharge, storage, and evapotranspiration, which in turn undermines flood forecasting, drought planning, and environmental flow assessments.
Other datasets often have better locations but are incomplete or inconsistent in other ways, particularly in spatial coverage. Critically, each dataset also opens doors for data assimilation, parameterization, and ML training on historic time series. By grounding our reference reservoirs in precise geographic contexts and aligning them to a shared hydrographic fabric, we get a representation of regulated flow that better reflects the coupled human–natural water cycle and supports community efforts such as Geoconnex and NOAA/NWS POIs in the NWM.
Our goal is to build a harmonized set of reference reservoirs (proxied
by dams) that are geospatially consistent with the hydrofabric used in
USGS and NOAA/NWS modeling. We treat NID as the global set to
validate and enrich, assign stable synthetic IDs (dam_id = "ls-*"),
and use multiple contexts to correct locations and enhance attributes.
Strategy (evidence aggregation):
- build candidate pairs via spatial proximity within tuned per-context radii,
- compute name similarity (Jaro–Winkler) from cleaned strings,
- rank contexts by reliability and derived evidence,
- select a best realization per dam, with diagnostics.
Per-dam output: A chosen realization (context + ID), snap distance (m), name similarity, number of supporting contexts, and offset from the original NID point.
- NID (cleaned, EPSG:5070, synthetic IDs dam_id = "ls-*"). Baseline catalog (USACE). High inclusivity; variable positional accuracy. Synthetic IDs provide stable tracking.
- Lynker Spatial hydrofabric flowlines (ref_fab_fp) + waterbodies (ref_fab_wb). National hydrographic backbone (v2.3). Consistent topology for flowlines and waterbodies aligned to modeling needs.
- OpenStreetMap (OSM): water polygons, water lines, dam lines. Volunteer geographic data adding local detail; quality and coverage vary regionally.
- GNIS. USGS naming authority for natural/cultural features (dams, lakes, reservoirs), used for robust naming comparisons.
- ResOpsUS. Reservoir operations and attributes useful for modeling and water management.
- HILARRI. Curated links among NID (2024), GRanD (v1.3), and EHA (2024), connecting dams, reservoirs, and hydropower plants (ORNL/DOE).
- GOODD. Global dam compilation (>38k) with attributes supporting large-scale analyses.
- NWM (optionally re-linked to WB IDs). NOAA’s hydrologic modeling system. Reservoir POIs can be re-indexed to hydrofabric WBs to improve geometric alignment.
Bring your own: The method is extensible. Anyone can add a dataset by specifying a unique ID, a search radius, and a rank weight; it is then harmonized with the principal data resources.
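As a rough illustration of what specifying an ID, radius, and rank could look like, the sketch below keeps contexts in a small lookup table. The `contexts` table, the `register_context()` helper, and the `state_dams` dataset are hypothetical names introduced here; the repo's actual ingest interface may differ.

```r
# Hypothetical context registry: one row per context, keyed by a unique ID.
# Radii and ranks for the three seed rows are taken from the context table.
contexts <- data.frame(
  context  = c("ref_int", "gnis", "ref_fab_fp"),
  radius_m = c(2000, 2000, 1500),
  rank     = c(0L, 1L, 2L),
  stringsAsFactors = FALSE
)

# register_context() is an illustrative helper, not a function from this repo.
register_context <- function(registry, context, radius_m, rank) {
  stopifnot(
    is.character(context), length(context) == 1L, nzchar(context),
    is.numeric(radius_m), length(radius_m) == 1L, radius_m > 0,
    is.numeric(rank), length(rank) == 1L, rank >= 0
  )
  # The ID must be unique so that per-context columns never collide.
  if (context %in% registry$context) {
    stop("context ID already registered: ", context)
  }
  rbind(
    registry,
    data.frame(context = context, radius_m = radius_m,
               rank = as.integer(rank), stringsAsFactors = FALSE)
  )
}

# Add a made-up state inventory as a curated/named (rank 1) context.
contexts <- register_context(contexts, "state_dams", radius_m = 1000, rank = 1)
```

Keeping radius and rank next to the ID means a new dataset only needs one extra row before it can take part in candidate generation and ranking.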
# stitched outputs (written by the runner)
res_rds <- "output/reference-reservoirs.rds"
res <- readRDS(res_rds) |>
  dplyr::filter(!is.na(X)) |>
  sf::st_as_sf(coords = c("X","Y"), crs = 5070, remove = FALSE)
CONUS is divided into ~100 km cells. We process only tiles that
intersect dams. Each tile runs independently (bounded memory; smaller
candidate pools). Per-tile results are written to RDS; a final pass
stitches tiles, resolving overlaps by preferring more supporting
contexts (n), then closer snaps.
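The overlap rule in the stitch pass can be sketched in base R as a sort followed by a de-duplication. This is a minimal illustration on toy data, not the repo's stitch script; `stitch_tiles()` and the `tile` column are assumptions, while `n` and `realization_snap_m` follow the column names used in the outputs.

```r
# Toy per-tile results; dams "ls-2" and "ls-3" fall in the overlap of two tiles.
tile_a <- data.frame(dam_id = c("ls-1", "ls-2", "ls-3"),
                     n = c(3L, 2L, 1L),
                     realization_snap_m = c(40, 310, 60),
                     tile = "a", stringsAsFactors = FALSE)
tile_b <- data.frame(dam_id = c("ls-2", "ls-3"),
                     n = c(4L, 1L),
                     realization_snap_m = c(520, 75),
                     tile = "b", stringsAsFactors = FALSE)

# Hypothetical helper: keep one row per dam across all tiles.
stitch_tiles <- function(tile_results) {
  all_rows <- do.call(rbind, tile_results)
  # More supporting contexts first, then the closer snap.
  ord <- order(all_rows$dam_id, -all_rows$n, all_rows$realization_snap_m)
  all_rows <- all_rows[ord, ]
  out <- all_rows[!duplicated(all_rows$dam_id), ]
  rownames(out) <- NULL
  out
}

stitched <- stitch_tiles(list(tile_a, tile_b))
```

Here `ls-2` is taken from tile b because it has more supporting contexts despite the larger snap, while `ls-3` ties on `n` and is taken from tile a because its snap is closer.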
source("R/utils_fin.R")
#> Warning in fun(libname, pkgname): GEOS versions differ: lwgeom has 3.11.0 sf
#> has 3.14.0
#> Warning in fun(libname, pkgname): PROJ versions differ: lwgeom has 9.1.0 sf has
#> 9.6.2
#> Spherical geometry (s2) switched off
conus <- AOI::aoi_get(state = "conus") |> st_transform(5070)
tiles <- make_conus_grid(st_union(conus), cell_km = 100)
if (!is.null(res)) {
  ggplot2::ggplot() +
    ggplot2::geom_sf(data = res, alpha = 0.15, size = 0.25) +
    # polygon outlines take `linewidth` (ggplot2 >= 3.4.0); `size` is deprecated for lines
    ggplot2::geom_sf(data = tiles, fill = NA, color = "brown", linewidth = 0.2) +
    ggplot2::labs(title = "Reservoirs", subtitle = "EPSG:5070",
                  x = NULL, y = NULL) +
    ggplot2::theme_minimal()
} else {
  plot.new(); title("Dam points plot skipped (no X/Y)")
}
The NID defines the global set we validate, supplement, and
standardize. Because NID IDs can be duplicated and locations imprecise,
we assign stable synthetic IDs (dam_id = "ls-*") and treat NID like
any other context in scoring, while privileging it as the anchor.
Outputs retain NID identifiers, while updated coordinates, names, and
attributes can be adopted from the best realization across contexts.
This preserves continuity with the most complete inventory while
systematically improving accuracy via GNIS names, GOODD’s footprint,
hydrofabric topology, and OSM detail. The resulting features are
Geoconnex-ready and compatible with NWS POIs.
A context is an external dataset/layer (e.g., gnis, goodd,
ref_fab_fp, osm_ww_poly) against which NID dams are compared. For
each dam and context, we:
- generate candidate pairs within a tuned search radius,
- compute snap distance and name similarity (JW), and
- filter/rank to a single best match per (dam, context).
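The three steps above can be sketched in base R for a single context. This is a simplified stand-in, not the repo's code: plain Euclidean distance on projected (EPSG:5070, metre) coordinates replaces the sf distance search, `jw_dist()` is a hand-rolled Jaro–Winkler distance (0 means identical), and the rules in `clean_name()` are assumptions, as is breaking ties by snap distance first and JW second.

```r
# Jaro-Winkler distance in base R (0 = identical, 1 = nothing in common).
jw_dist <- function(a, b, p = 0.1) {
  if (identical(a, b)) return(0)
  s1 <- strsplit(a, "")[[1]]; s2 <- strsplit(b, "")[[1]]
  n1 <- length(s1); n2 <- length(s2)
  if (n1 == 0 || n2 == 0) return(1)
  window <- max(0, floor(max(n1, n2) / 2) - 1)
  m1 <- rep(FALSE, n1); m2 <- rep(FALSE, n2)
  for (i in seq_len(n1)) {
    lo <- max(1, i - window); hi <- min(n2, i + window)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!m2[j] && s1[i] == s2[j]) { m1[i] <- TRUE; m2[j] <- TRUE; break }
    }
  }
  m <- sum(m1)
  if (m == 0) return(1)
  transpositions <- sum(s1[m1] != s2[m2]) / 2
  jaro <- (m / n1 + m / n2 + (m - transpositions) / m) / 3
  prefix <- 0
  for (i in seq_len(min(4, n1, n2))) {
    if (s1[i] == s2[i]) prefix <- prefix + 1 else break
  }
  1 - (jaro + prefix * p * (1 - jaro))
}

# Illustrative name cleaning; the repo's actual rules are not shown here.
clean_name <- function(x) {
  x <- gsub("[^A-Z0-9 ]", " ", toupper(x))
  x <- gsub("\\b(DAM|RESERVOIR|LAKE)\\b", " ", x, perl = TRUE)
  trimws(gsub(" +", " ", x))
}

# Best match per dam for one context, within a search radius in metres.
match_context <- function(dams, ctx, radius_m) {
  idx <- expand.grid(d = seq_len(nrow(dams)), c = seq_len(nrow(ctx)))
  pairs <- data.frame(
    dam_id = dams$dam_id[idx$d], ctx_id = ctx$id[idx$c],
    snap_m = sqrt((dams$X[idx$d] - ctx$X[idx$c])^2 +
                  (dams$Y[idx$d] - ctx$Y[idx$c])^2),
    dam_name = clean_name(dams$name[idx$d]),
    ctx_name = clean_name(ctx$name[idx$c]),
    stringsAsFactors = FALSE
  )
  pairs <- pairs[pairs$snap_m <= radius_m, ]
  pairs$jw <- vapply(seq_len(nrow(pairs)),
                     function(i) jw_dist(pairs$dam_name[i], pairs$ctx_name[i]),
                     numeric(1))
  pairs <- pairs[order(pairs$dam_id, pairs$snap_m, pairs$jw), ]
  best <- pairs[!duplicated(pairs$dam_id), c("dam_id", "ctx_id", "snap_m", "jw")]
  rownames(best) <- NULL
  best
}

# Toy data: two dams and three made-up GNIS-like features.
dams <- data.frame(dam_id = c("ls-1", "ls-2"),
                   name = c("Elk Creek Dam", "Mill Pond Dam"),
                   X = c(0, 10000), Y = c(0, 0), stringsAsFactors = FALSE)
gnis <- data.frame(id = c("g1", "g2", "g3"),
                   name = c("Elk Creek Reservoir", "Elm Creek Dam", "Mill Pond"),
                   X = c(300, 900, 13000), Y = c(400, 0, 0),
                   stringsAsFactors = FALSE)

best_gnis <- match_context(dams, gnis, radius_m = 2000)
```

In this toy case `ls-1` matches `g1` (500 m away, identical cleaned name), and `ls-2` gets no GNIS match because its only plausible feature lies outside the 2000 m radius.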
Two derived contexts are also created by intersecting waterbodies and flowlines in each data family:
- ref_int: intersections of ref_fab_wb × ref_fab_fp
- osm_int: intersections of osm_ww_poly × osm_ww_lines
These provide strong geometry/topology anchors.
- 0 – Intersection evidence: ref_int, osm_int (geometry + topology; strongest).
- 1 – Curated/named: gnis, resops, goodd, osm_dam_lines, hillari.
- 2 – Direct/core geometries: osm_ww_poly, osm_ww_lines, ref_fab_fp, ref_fab_wb, nwm (re-linked), nid.
- Tributary penalty: if river implies TR/OS/TRIB, add +5 to rank. Within any tier, smaller snap and smaller JW win.
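A minimal sketch of how the tiers and the tributary penalty could combine into a single ordering is shown below. The rank values come from the list above; everything else is an assumption made for illustration: the `is_tributary()` pattern, the idea that `river` is available on each candidate row, and the `select_realization()` helper itself.

```r
# Rank tiers per context, as listed above (context IDs spelled as in the source).
rank_map <- c(ref_int = 0, osm_int = 0,
              gnis = 1, resops = 1, goodd = 1, osm_dam_lines = 1, hillari = 1,
              osm_ww_poly = 2, osm_ww_lines = 2, ref_fab_fp = 2, ref_fab_wb = 2,
              nwm = 2, nid = 2)

# Assumed pattern for "river implies TR/OS/TRIB"; the repo's rule may differ.
# It matches TR, OS, or TRIB only as stand-alone tokens (e.g. "TR-ELK CREEK").
is_tributary <- function(river) {
  !is.na(river) & grepl("(^|[^A-Z])(TR|OS|TRIB)([^A-Z]|$)", toupper(river))
}

# Hypothetical helper: pick one realization per dam.
select_realization <- function(matches) {
  matches$rank <- unname(rank_map[matches$context]) +
    ifelse(is_tributary(matches$river), 5, 0)
  # Lower rank wins; within a rank, smaller snap, then smaller JW.
  ord <- order(matches$dam_id, matches$rank, matches$snap_m, matches$jw)
  matches <- matches[ord, ]
  best <- matches[!duplicated(matches$dam_id), ]
  rownames(best) <- NULL
  best
}

# Toy candidates: best match per (dam, context), already scored.
matches <- data.frame(
  dam_id  = c("ls-1", "ls-1", "ls-2", "ls-2"),
  context = c("ref_int", "gnis", "osm_ww_poly", "ref_fab_wb"),
  river   = c("TR-ELK CREEK", "ELK CREEK", NA, NA),
  snap_m  = c(120, 300, 200, 150),
  jw      = c(0.10, 0.05, 0.20, 0.30),
  stringsAsFactors = FALSE
)

realization <- select_realization(matches)
```

For `ls-1` the penalized intersection candidate ends up at rank 5, so the rank 1 `gnis` match is chosen; for `ls-2` both candidates sit in tier 2 and the smaller snap decides in favor of `ref_fab_wb`.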
- Per tile
  - Load dams (NID) and clip contexts.
  - Build representative points per context: points (identity), lines (midpoints/endpoints), polygons (point-on-surface).
  - Generate candidates via st_is_within_distance (per-context radius) with a KNN fallback gated by the same radius.
  - Score (snap distance, JW), apply tributary penalty; reduce to best per (dam, context).
  - Build a wide table of IDs (one column per context), select best realization per dam, compute QA (offset from NID), and distance to flowpath.
  - Write tile RDS and append a manifest row.
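The wide-table and QA step can be illustrated with a short base R sketch. The helper names `widen_matches()` and `offset_m()` and the toy IDs are hypothetical, and counting every non-missing context column (including nid) toward `n` is an assumption; the repo may count supporting contexts differently.

```r
# Toy best-per-(dam, context) matches in long form; all IDs are made up.
best <- data.frame(
  dam_id  = c("ls-1", "ls-1", "ls-1", "ls-2"),
  context = c("nid", "gnis", "ref_int", "nid"),
  id      = c("XX00001", "g1", "wb7-fp12", "XX00002"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row per dam, one ID column per context.
widen_matches <- function(best, contexts) {
  best <- best[best$context %in% contexts, ]
  dam_ids <- unique(best$dam_id)
  wide <- matrix(NA_character_, nrow = length(dam_ids), ncol = length(contexts),
                 dimnames = list(dam_ids, contexts))
  # Character (row name, column name) indexing fills each matched cell.
  wide[cbind(best$dam_id, best$context)] <- best$id
  out <- data.frame(dam_id = dam_ids, wide,
                    stringsAsFactors = FALSE, check.names = FALSE)
  # Assumption: n counts every context with a match, nid included.
  out$n <- rowSums(!is.na(wide))
  rownames(out) <- NULL
  out
}

# QA: planar offset (m) between the NID point and the chosen realization,
# valid for projected coordinates such as EPSG:5070.
offset_m <- function(x_nid, y_nid, x_real, y_real) {
  sqrt((x_real - x_nid)^2 + (y_real - y_nid)^2)
}

wide_tbl <- widen_matches(best, contexts = c("nid", "gnis", "ref_int"))
```

A dam with matches in only one or two contexts keeps `NA` in the remaining columns, which is what the context coverage plot counts.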
Context | Search Distance (m) | Rank | Group | Notes |
---|---|---|---|---|
ref_int | 2000 | 0 | Anchors / Derived | Intersections of ref_fab_wb × ref_fab_fp; highest-confidence geometry. |
osm_int | 2000 | 0 | Anchors / Derived | Intersections of osm_ww_poly × osm_ww_lines; strong topology signal. |
gnis | 2000 | 1 | Curated / Named | USGS names; authoritative nomenclature, variable location quality. |
resops | 2000 | 1 | Curated / Named | Reservoir ops/attributes useful for modeling. |
osm_dam_lines | 1500 | 1 | Curated / Named | OSM dam features; coverage varies. |
hillari | 2000 | 1 | Curated / Named | Links dams–reservoirs–plants (ORNL/DOE). |
goodd | 2000 | 1 | Curated / Named | Global dam footprint/attributes. |
osm_ww_lines | 1500 | 2 | Direct / Network | Dense/noisy; short radius reduces false hits. |
osm_ww_poly | 1500 | 2 | Direct / Network | Strong geometric anchors for reservoirs. |
ref_fab_fp | 1500 | 2 | Direct / Fabric | Topologically consistent flowlines. |
ref_fab_wb | 2000 | 2 | Direct / Fabric | Waterbodies as spatial anchors. |
nwm | 2000 | 2 | Direct / POIs | Often mislocated; improved when re-indexed to WBs. |
nid | 2000 | 2 | Core Dataset | Baseline set for validation & enrichment; stable synthetic IDs. |
Risk / Complexity | Why it matters | Mitigation in this workflow |
---|---|---|
Mis-snap to wrong flowline/waterbody | Broken routing; bad inflow/outflow accounting | Per-context radii; intersections (ref_int/osm_int); rank 0 |
Duplicate/ambiguous IDs & names | Double-counting or missed joins | Synthetic dam_id, string prep + JW, cross-context tallies n |
Noisy/shifted geometries (esp. NWM, NID) | High false positives; unstable matches | Rep points, short radii (750 m), KNN fallback within same gate |
Seasonal shoreline changes | Point-on-surface drift vs. dam location | Prefer dam-aligned contexts; intersections; multi-context voting |
Tile edge effects | Missed candidates near boundaries | Buffered tile search; global stitch preferring n then distance |
Nonstationarity / updates over time | Drift between versions; reproducibility | Tile manifests, context IDs, rank map documented |
Licensing & attribution (OSM) | Compliance and redistribution | Keep source IDs/contexts; document license provenance |
if (exists("res") && nrow(res)) {
  p1 <- ggplot2::ggplot(res, ggplot2::aes(x = realization_snap_m)) +
    ggplot2::geom_histogram(bins = 50) +
    ggplot2::labs(title = "Snap distance (m)") + ggplot2::theme_minimal()
  p2 <- ggplot2::ggplot(res, ggplot2::aes(x = realization_jw)) +
    ggplot2::geom_histogram(bins = 50) +
    ggplot2::labs(title = "Name similarity (JW)") + ggplot2::theme_minimal()
  p3 <- ggplot2::ggplot(res, ggplot2::aes(x = n)) +
    ggplot2::geom_histogram(binwidth = 1) +
    ggplot2::scale_x_continuous(breaks = 0:10) +
    ggplot2::labs(title = "Supporting contexts per dam (n)") + ggplot2::theme_minimal()
  print(p1); print(p2); print(p3)
}
#> Warning: Removed 54654 rows containing non-finite outside the scale range
#> (`stat_bin()`).
if (exists("res") && nrow(res)) {
  ctx_cols <- c("gnis","resops","goodd","nwm","osm_ww_poly","osm_ww_lines",
                "osm_dam_lines","ref_fab_fp","ref_fab_wb","ref_int","osm_int","nid")
  have <- intersect(ctx_cols, names(res))
  if (length(have)) {
    long <- tidyr::pivot_longer(as.data.frame(res), dplyr::all_of(have),
                                names_to = "context", values_to = "id")
    long$has <- !is.na(long$id)
    ggplot2::ggplot(long, ggplot2::aes(x = context, fill = has)) +
      ggplot2::geom_bar() +
      ggplot2::coord_flip() +
      ggplot2::labs(title = "Context coverage (count of dams with a match)", y = "count", x = NULL) +
      ggplot2::theme_minimal()
  }
}
- All data needed to use this repo is already stored, with the exception of OSM. All data, including OSM, can be downloaded by following the directions in data/data_prep.R.
- Run workflow/01_process_tiles.R. If new resources are added, be sure to include them in the ingest and to provide a rank and radius.
- 02_stich.R stitches the tiles together and adds preliminary info.
- 03_ops.R adds reservoir parameters needed for RFC-DA in the NWM using a mix of traits.
- If you want to recreate the webmap, run the make file in scripts/tiles using the latest gpkg. Output can be viewed with
pnpm dev --strictPort --port 8000