6. Evaluating HAND Performance
The Inundation Mapping repository includes functionality to evaluate the performance of HAND-derived FIMs against benchmark data derived from FEMA BLE, USGS, RAS2FIM, IFC, and NWS models. This page describes the basic evaluation methodology used in the Inundation Mapping evaluation software.
Multiple benchmark dataset types and analysis methods allow evaluation of the predictive skill of modeled flood inundation maps (FIMs). This section discusses these benchmark data types and analysis methods in detail.
Evaluation techniques can use event-based or model-based benchmark datasets as appropriate. Benchmark dataset options include, but are not limited to:
- Observed high water marks collected after a flood event (event-based).
- Remotely sensed imagery, commonly optical or synthetic aperture radar (event-based).
- Post-flood damage reports and claims (event-based).
- Model-based benchmark datasets, which can include any modeled FIM of a known quality, such as NWS- and USGS-produced HEC-RAS derived maps at river gages, FEMA Base Level Engineering (BLE) maps, Iowa Flood Center (IFC) maps, and RAS2FIM maps.
Figure 1: Benchmark datasets to evaluate the predictive skill of modeled flood inundation maps (FIMs)
Multiple methods exist for the evaluation of FIMs. Some of these methods include, but are not limited to:
- Spatial comparisons of inundated area between modeled FIM and benchmark FIM.
- Comparison of depth estimates across space between modeled FIM and benchmark FIM.
- Comparison of modeled FIM to field-collected high water marks.
- Comparison of rating curves between FIM model and benchmark models of a known quality.
While a full-scale evaluation effort using all of the aforementioned benchmark datasets and analysis methods is possible, the Inundation Mapping repository currently supports the fourth benchmark data option and the first analysis method above, i.e., a spatial comparison between a modeled FIM and model-based benchmark data. In Inundation Mapping, this spatial analysis is called the Alpha Test.
FEMA BLE studies use automated methods to develop HEC-RAS hydraulic models that produce flood inundation maps for specified flood magnitudes such as the 100- and 500-year recurrence interval floods. These BLE studies are typically performed at the Hydrologic Unit Code (HUC) 8 watershed scale. BLE models provide a foundation for more detailed modeling efforts; for example, if a community wants more detailed flood maps, it can upgrade the BLE models to include ground survey data or bridges. BLE flood data is available on the Interagency Flood Risk Management (InFRM) Base Flood Elevation viewer (Figure 2).
For this analysis, HAND-derived FIMs were compared against FEMA BLE depth grids (100-year and 500-year) and their associated flows for studies throughout Texas, Oklahoma, Arkansas, and New Mexico; the same discharge values were used to generate both the benchmark maps and the HAND-derived maps.
Figure 2: FEMA BFE viewer contains BLE datasets such as depth grids and model data across the south-central U.S.
Study results were collected from high quality flood studies performed on small river segments, typically between 1-10 miles in length, coinciding with select Advanced Hydrologic Prediction Service (AHPS) sites. These studies consisted of a static library of depth grids spanning a wide range of elevations. For example, a site may have inundation maps and grids spanning from an elevation of 100.0 ft to 120.0 ft, with an inundation product at each 1 ft interval. Typically, these static libraries spanned the flood categories (Action, Minor, Moderate, Major) for that site (Figure 3). These datasets were available from two sources: the USGS (for more information see https://fim.wim.usgs.gov/fim) and the NWS (for more information see https://water.weather.gov/ahps/inundation.php).
Generally, flows associated with the supplied depth grids were not available; therefore, they were estimated from the rating curve associated with the site. In some instances, the USGS inundation studies did provide flows associated with the supplied depth grids, and these supplied flows were used in place of the rating curve. Flood maps corresponding to the Action, Minor, Moderate, and Major flood categories were programmatically selected for each site and assessed. It should be noted that not all AHPS sites had depth grids available for evaluation for every flood category. Additionally, some USGS inundation sites and NWS inundation sites were co-located. The USGS and NWS datasets were treated independently, as there may have been differences in the associated flows as well as in the benchmark grids themselves.
Figure 3: Example of Action, Minor, Moderate, and Major flood stages at an AHPS site
In 2010, IFC researchers used light detection and ranging (LiDAR) data provided by the Iowa Department of Natural Resources (IDNR), allowing them to precisely map Iowa's river and stream network, develop computer-based flood simulations, and delineate floodplains with reasonable accuracy. The maps show the probability, extent, and depth of flooding for every Iowa stream draining more than one square mile.
Completed in 2016, these maps are a critical resource to help citizens, emergency managers, and other community decision-makers identify and communicate Iowa’s flood hazards. The maps also support informed decision-making on managing floodplain areas. Funding from the Iowa Natural Heritage Foundation allowed the team to also create maps for 2-, 5-, 10-, 25-, 50-, and 200-year floods.
The IFC-developed floodplain maps are freely accessible through the Iowa Flood Information System (IFIS).
Figure 4: Iowa Flood Information System (IFIS) that provides community-based flood conditions, forecasts, and inundation maps
The primary goal of the U.S. Army Corps of Engineers Hydrologic Engineering Center (CEIWR-HEC) is to support the nation in its water resources management responsibilities. As part of this mission, CEIWR-HEC produces the River Analysis System (HEC-RAS). FIM4 Inundation Mapping utilizes HEC-RAS source data via the RAS2FIM repository.
An area-based comparison is employed using a contingency table approach.
Figure 5: Example contingency map.
The predicted flood inundation map is evaluated against a benchmark dataset and divided into four categories (True Positive, False Positive, False Negative, and True Negative); these categories comprise a 2x2 contingency table.
Figure 6: Example contingency table.
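The mapping from a pair of binary grids to these four categories is simple to express in code. Below is a minimal sketch (not the repository's implementation), assuming two aligned NumPy arrays where 1 means wet and 0 means dry; the category codes are arbitrary.

```python
# A minimal sketch of building a contingency raster from two aligned binary
# grids (1 = wet, 0 = dry). Encoding assumed here: TN=0, FN=1, FP=2, TP=3.
import numpy as np

def contingency_raster(predicted: np.ndarray, benchmark: np.ndarray) -> np.ndarray:
    """Classify each cell into one of the four contingency categories."""
    return 2 * predicted + benchmark

predicted = np.array([[1, 1, 0], [0, 1, 0]])
benchmark = np.array([[1, 0, 0], [1, 1, 0]])
print(contingency_raster(predicted, benchmark))
# [[3 2 0]
#  [1 3 0]]
```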
Statistical metrics such as the critical success index (CSI), false alarm ratio (FAR), and probability of detection (POD), also known as the true positive rate (TPR), provide insight into how well a model performed relative to benchmark data. Sampson et al. (2015) note that POD, sometimes referred to as the "hit rate" in the literature, indicates how well the model replicates benchmark inundation without penalizing for overprediction. Wing et al. (2019) note that POD can be thought of as the proportion of benchmark flooded areas that were predicted by the model, making it a measure of the model's tendency to underpredict. Sampson et al. (2015) and Wing et al. (2019) explain that FAR measures the model's tendency to overpredict the benchmark inundation area; Wing et al. (2019) further explain that this metric can be thought of as the proportion of predicted flood areas that are dry in the benchmark. Of these three, CSI is the most discriminating metric: Sampson et al. (2015) explain that it is a combined score that extends POD and FAR because it penalizes both overprediction and underprediction. Wing et al. (2019) explain that CSI can be thought of as representing model performance over floodplain areas only, and that an average score of 0.66 indicates that about 2 in every 3 model pixels in the functional floodplain match the benchmark (Figure 7).
Figure 7: Statistical metrics calculated using a 2x2 contingency table.
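As a concrete illustration of these formulas, the following sketch computes CSI, FAR, and POD from hypothetical contingency counts; the counts are chosen so that CSI comes out to the 0.66 mentioned above, and are not taken from any actual evaluation.

```python
# Headline metrics from the 2x2 contingency table; counts are illustrative.
def csi(tp, fp, fn):   # Critical Success Index
    return tp / (tp + fp + fn)

def far(tp, fp):       # False Alarm Ratio
    return fp / (tp + fp)

def pod(tp, fn):       # Probability of Detection (True Positive Rate)
    return tp / (tp + fn)

tp, fp, fn = 660, 170, 170
print(f"CSI={csi(tp, fp, fn):.2f}  FAR={far(tp, fp):.2f}  POD={pod(tp, fn):.2f}")
# CSI=0.66  FAR=0.20  POD=0.80
```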
To accomplish the area-based evaluations, the benchmark datasets (depth grids from BLE and AHPS sites) were converted to binary rasters, with areas of inundation encoded as 1 and areas of no flooding encoded as 0. The binary rasters were reprojected and resampled to align with the FIM projection and resolution. Once a FIM is created using the input flows associated with the benchmark dataset, it is overlaid with the benchmark dataset and a contingency raster is created consisting of the four contingency table categories. A contingency table is then populated by summing the total area under each contingency category. An additional masked category is used to exclude areas where FIM is not well suited, such as lakes and levee-protected areas (as supplied by the National Levee Database), from the calculations. The processing unit for FIM as well as BLE is the HUC8 watershed, which serves as the domain extent for BLE contingency evaluations. The domain extent for AHPS studies is the maximum flood map within the static library for that site.
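The reprojection and resampling step can be sketched with rasterio as below; the file names are hypothetical, and nearest-neighbor resampling is assumed because the wet/dry coding is categorical.

```python
# Align a benchmark binary raster to the FIM grid (hypothetical file names).
import numpy as np
import rasterio
from rasterio.warp import reproject, Resampling

with rasterio.open("fim_extent.tif") as fim, \
     rasterio.open("benchmark_binary.tif") as bench:
    aligned = np.zeros((fim.height, fim.width), dtype=np.uint8)
    reproject(
        source=bench.read(1),
        destination=aligned,
        src_transform=bench.transform,
        src_crs=bench.crs,
        dst_transform=fim.transform,
        dst_crs=fim.crs,
        resampling=Resampling.nearest,  # preserves the 0/1 wet-dry coding
    )
```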
To perform comparisons on a rapid and regional basis, an automated evaluation software system was designed and used to perform the area-based comparisons. This system produces inundation maps using the supplied or estimated flows associated with the benchmark data, overlays the FIM-based inundation grids on the benchmark datasets, and produces a contingency raster consisting of all four contingency categories as well as masked areas (Figure 8). It then calculates a range of contingency metrics and records them to CSV and JSON files.
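Recording the metrics might look like the following minimal sketch; the metric keys and output paths are illustrative, not the software's actual schema.

```python
# Write a metrics dictionary to JSON and CSV (illustrative keys and paths).
import csv
import json

metrics = {"CSI": 0.66, "FAR": 0.20, "TPR": 0.80}

with open("metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)

with open("metrics.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=metrics.keys())
    writer.writeheader()
    writer.writerow(metrics)
```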
Figure 8: Example contingency raster creation for a HUC8.
- “FEMA Estimated Base Flood Elevation (estBFE) Viewer”, Federal Emergency Management Agency (FEMA), https://webapps.usgs.gov/infrm/estbfe/ (FEMA BLE datasets retrieved from this website)
- “Flood Inundation Mapper”, United States Geological Survey (USGS), https://fim.wim.usgs.gov/fim/ (USGS FIM data retrieved via personal correspondence with Nicholas Estes)
- “NOAA Inundation Mapping Locations”, National Oceanic and Atmospheric Administration (NOAA), https://water.weather.gov/ahps/inundation.php (NWS AHPS inundation maps retrieved from this website)
- Sampson, C. C., Smith, A. M., Bates, P. D., Neal, J. C., Alfieri, L., & Freer, J. E. (2015). A high-resolution global flood hazard model. Water Resources Research, 51(9), 7358-7381.
- Wing, O. E., Sampson, C. C., Bates, P. D., Quinn, N., Smith, A. M., & Neal, J. C. (2019). A flood inundation forecast of Hurricane Harvey using a continental-scale 2D hydrodynamic model. Journal of Hydrology X, 4, 100039.
BLE datasets contain spatial data, including model cross sections attributed with a flow for each cross section. These cross sections are intersected with the NWM river network (Figure A.1). The median flow for each segment is then calculated and used in flow file creation.
Figure A.1: BLE Cross sections are intersected with NWM stream segments to get a flow value for each segment.
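A sketch of this step using GeoPandas is shown below; the file, layer, and column names (flow, feature_id) are assumptions for illustration, not the repository's actual schema.

```python
# Attribute NWM segments with the median BLE cross-section flow.
import geopandas as gpd

xs = gpd.read_file("ble_cross_sections.gpkg")   # assumed to have a 'flow' column
nwm = gpd.read_file("nwm_streams.gpkg")         # assumed to have a 'feature_id' column

# Intersect cross sections with NWM segments, then take the median flow of
# all cross sections touching each segment. Unit conversion to CMS (if the
# source flows are in cfs) would happen before flow file creation (not shown).
joined = gpd.sjoin(nwm, xs.to_crs(nwm.crs), predicate="intersects")
median_flows = joined.groupby("feature_id")["flow"].median().reset_index()
```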
BLE grids were reprojected and resampled to the same resolution and coordinate system as FIM products. Inundated areas are coded as 1 and dry areas as 0, and the grids are clipped to the HUC8 boundary during the Alpha Test (Figure A.2).
Figure A.2: BLE data converted from depth to binary (wet = 1, dry = 0).
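A minimal sketch of the depth-to-binary conversion and HUC8 clipping, assuming hypothetical file names and that any positive depth counts as wet:

```python
# Convert a BLE depth grid to a binary wet/dry raster clipped to the HUC8.
import geopandas as gpd
import numpy as np
import rasterio
from rasterio.mask import mask

huc8 = gpd.read_file("huc8_boundary.gpkg")
with rasterio.open("ble_depth_grid.tif") as src:
    clipped, transform = mask(src, huc8.to_crs(src.crs).geometry, crop=True)
    depth = clipped[0]
    # Depth > 0 is wet (1); everything else is dry (0), assuming nodata is
    # encoded as a non-positive value.
    binary = np.where(depth > 0, 1, 0).astype(np.uint8)
```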
The NWM feature IDs and flow values (in CMS) were then written to a flow file. The Alpha Test then ingests the flow file and HAND data for the site and creates a contingency raster and calculates metrics (Figure A.3).
Figure A.3: BLE evaluation workflow.
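A flow file can be sketched as a simple two-column CSV; the header names below are assumptions, though the content (NWM feature IDs and flows in CMS) follows the description above.

```python
# Write a two-column flow file (illustrative values and header names).
import pandas as pd

flows = pd.DataFrame({
    "feature_id": [101, 102, 103],   # NWM reach identifiers (illustrative)
    "discharge": [12.5, 30.2, 7.8],  # flows in cubic meters per second (CMS)
})
flows.to_csv("flow_file.csv", index=False)
```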
The Alpha Test requires a flow file, which is used to find an appropriate HAND stage via the HAND synthetic rating curves. This stage value is applied to the HAND grid on a catchment basis and ultimately constitutes the FIM (Figure A.4).
Figure A.4: Flow file to map creation using the HAND synthetic rating curve.
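The flow-to-map step can be illustrated as follows: interpolate a stage from a catchment's synthetic rating curve, then threshold the HAND grid at that stage. All values below are illustrative.

```python
# Flow -> stage via a synthetic rating curve, then stage -> FIM via HAND.
import numpy as np

# Synthetic rating curve for one catchment: discharge (CMS) vs. stage (m).
src_flows = np.array([0.0, 10.0, 50.0, 200.0])
src_stages = np.array([0.0, 0.5, 1.8, 4.2])

target_flow = 30.2                                   # from the flow file
stage = np.interp(target_flow, src_flows, src_stages)

hand = np.array([[0.2, 1.5, 3.0], [0.9, 2.4, 5.1]])  # HAND values (m)
fim = (hand <= stage).astype(np.uint8)               # wet where HAND <= stage
```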
A typical AHPS dataset consisted of a static inundation library spanning a range of flood elevations, typically across all four flood categories (Action, Minor, Moderate, Major). The Alpha Test evaluation considered four benchmark grids per AHPS site, corresponding to the Action, Minor, Moderate, and Major flood categories. The elevation (NAVD88) for each benchmark grid was supplied. The benchmark grids were selected by converting the site's categorical flood stages to elevations using the site datum. For example, if a site had an Action stage of 10 ft and a datum of 100 ft, the stage value was converted to an elevation by adding the datum (100 ft + 10 ft = 110 ft); this Action elevation (NAVD88) was then used to select an appropriate benchmark grid in the AHPS library. If the site datum was not in the proper vertical datum (e.g., NGVD29), it was converted to NAVD88 using the NOAA Tidal API (https://vdatum.noaa.gov/docs/services.html) prior to converting a flood category stage to an elevation. A benchmark depth grid was selected for evaluation if its elevation was greater than or equal to the categorical flood elevation and within 1 ft of it (Figure A.5). The Alpha Test evaluation extent was the maximum modeled depth grid for each site. This approach allows for automation; however, it is possible for false positives to be neglected in areas outside the analysis extent where the FIM method produces wet areas.
Figure A.5: Selecting inundation grids for evaluation from a static AHPS library.
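One way to express this selection rule is sketched below, interpreting "within 1 ft" as choosing the lowest library grid at or above the categorical flood elevation and no more than 1 ft above it; the helper function and values are hypothetical.

```python
# Select a benchmark grid elevation from a static AHPS library.
def select_benchmark_grid(category_stage_ft, datum_ft, library_elevations_ft,
                          tolerance_ft=1.0):
    target = category_stage_ft + datum_ft   # stage -> elevation (NAVD88)
    candidates = [e for e in sorted(library_elevations_ft)
                  if target <= e <= target + tolerance_ft]
    return candidates[0] if candidates else None

library = [100.0, 101.0, 102.0, 103.0]   # 1-ft interval library (ft, NAVD88)
print(select_benchmark_grid(10.0, 91.5, library))  # -> 102.0
```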
The selected benchmark grid was transformed to a binary raster so that values of 1 were wet and 0 were dry. The maximum inundation map for the static library served as the domain extent, and the benchmark flood map and contingency metrics were only calculated within the domain extent (Figure A.6).
Figure A.6: Benchmark map creation for AHPS sites.
Once the inundation maps were selected, flows were calculated using the site rating curve. Typically there were two sources for rating curves, the National River Location Database (NRLDB) and the USGS rating curve; preference was given to the USGS rating curve. The rating curve stages were converted to elevations (NAVD88) using the site datum. A flow value was then interpolated from the rating curve using the elevation of the selected flood map (Figure A.7). This flow value was written to a flow file.
Figure A.7: Flows associated with a static flood grid are determined using the site rating curve. Benchmark grid elevations were determined from the grid name.
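The interpolation can be sketched with numpy.interp; the datum, rating curve, and map elevation below are illustrative.

```python
# Interpolate a flow from a site rating curve at a given map elevation.
import numpy as np

datum_ft = 91.5
rc_stage_ft = np.array([5.0, 10.0, 15.0, 20.0])        # rating curve stages
rc_flow = np.array([500.0, 2000.0, 6000.0, 14000.0])   # corresponding flows

rc_elev_ft = rc_stage_ft + datum_ft   # rating curve stage -> elevation (NAVD88)
map_elev_ft = 102.0                   # elevation of the selected flood grid
flow = np.interp(map_elev_ft, rc_elev_ft, rc_flow)     # -> 2400.0
```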
In order to create a FIM, a flow and a National Water Model (NWM) feature ID (attributed as feature_id) must be specified. NWM feature IDs associated with an AHPS site were determined using the OWP-WRDS API. The AHPS evaluations were assessed using the mainstems configuration, and only NWM mainstems segments were selected. To ensure that only mainstems segments were selected, a multi-pass system was developed:
- Using the WRDS API, a collection of previously defined upstream gages were traced downstream all the way to the ocean outlet.
- All AHPS locations attributed with `rfc_forecast_pt == True` were selected and traced downstream all the way to the ocean outlet.
- All AHPS sites that contained inundation libraries were selected and traced downstream all the way to the ocean outlet.
- All AHPS sites in Hawaii, Puerto Rico, and the Virgin Islands were selected and traced downstream all the way to the ocean outlet.
The unique NWM segments from the above passes were collected into a mainstems database, which constitutes all possible mainstems segments. When an individual AHPS site is assessed, it is traced upstream and downstream 10 miles in each direction using the WRDS API and then intersected against the mainstems database (Figure A.8). The WRDS API selects all NWM feature IDs upstream of a site, including all tributary segments. By intersecting against the mainstems database, these tributary segments are filtered out and FIM is only generated on mainstems.
Figure A.8: Selection of mainstem segments associated with AHPS locations.
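Conceptually, the filtering reduces to a set intersection, as in the sketch below; the feature IDs are illustrative, and the WRDS API tracing itself is not shown.

```python
# Filter a traced set of NWM segments down to mainstems only.
mainstems = {100, 200, 300, 400, 500}   # mainstems database (feature IDs)
traced = {200, 300, 301, 302, 400}      # 10-mile up/downstream trace of a site

site_segments = sorted(traced & mainstems)   # tributaries 301, 302 drop out
print(site_segments)                         # [200, 300, 400]
```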
Once the appropriate NWM feature IDs are determined, the information is written to a flow file consisting of the NWM feature ID and the flow (in CMS). The Alpha Test then ingests the flow file and HAND data for the site and creates a contingency raster and calculates metrics (Figure A.9).
Figure A.9: Evaluation workflow for AHPS locations.
A number of AHPS sites had inundation studies performed (both USGS and NWS) but were not evaluated, for reasons including:
- Missing flow data (no rating curve) or no maps within the selection window
- Missing HAND or crosswalk errors
- Some USGS studies spanned multiple AHPS sites with multiple flows specified at a site
- Entirely within a waterbody (no FIM mapping)
As Reported | Full Name | Plain English | Formula | Also Known As
---|---|---|---|---
true_negatives_count | True Negatives Count | # cells classified as true negative | ∑ TN | Correct Negatives
false_negatives_count | False Negatives Count | # cells classified as false negative | ∑ FN | Misses
true_positives_count | True Positives Count | # cells classified as true positive | ∑ TP | Hits
false_positives_count | False Positives Count | # cells classified as false positive | ∑ FP | False Alarms
contingency_tot_count | Contingency Table Total Count | total number of cells classified | TP + FP + TN + FN |
cell_area_m2 | Cell Area (m2) | area of a single cell (m2) | Xres * Yres |
TP_area_km2 | True Positive Area (km2) | total area that is true positive (km2) | TP * cell area |
FP_area_km2 | False Positive Area (km2) | total area that is false positive (km2) | FP * cell area |
TN_area_km2 | True Negative Area (km2) | total area that is true negative (km2) | TN * cell area |
FN_area_km2 | False Negative Area (km2) | total area that is false negative (km2) | FN * cell area |
contingency_tot_area_km2 | Contingency Table Total Area (km2) | total area classified by the contingency table (km2) | TP_area + FP_area + TN_area + FN_area |
predPositive_area_km2 | Predicted Positive Area (km2) | area modeled as wet | TP + FP |
predNegative_area_km2 | Predicted Negative Area (km2) | area modeled as dry | TN + FN |
obsPositive_area_km2 | Observed Positive Area (km2) | observed area that is wet | TP + FN |
obsNegative_area_km2 | Observed Negative Area (km2) | observed area that is dry | TN + FP |
positiveDiff_area_km2 | Positive Difference Area (km2) | difference between predicted positive and observed positive area | predPositive - obsPositive |
CSI | Critical Success Index | how well do predicted wet areas correspond to observed wet areas? | TP / (TP + FP + FN) |
FAR | False Alarm Ratio | what fraction of predicted wet areas did not actually occur? | FP / (TP + FP) = FP / predPositive | not to be confused with false alarm rate
TPR | True Positive Rate | what fraction of observed wet areas were correctly predicted? | TP / (TP + FN) = TP / obsPositive | Probability of Detection (POD), Hit Rate, Correct %, Recall
TNR | True Negative Rate | what fraction of observed dry areas were correctly predicted? | TN / (TN + FP) = TN / obsNegative |
PPV | Positive Predictive Value | what fraction of predicted wet areas actually occurred? | TP / (TP + FP) = TP / predPositive | Precision
NPV | Negative Predictive Value | what fraction of predicted dry areas actually occurred? | TN / (TN + FN) = TN / predNegative |
ACC | Accuracy | what fraction of the total area was correctly predicted? | (TP + TN) / (TP + FP + TN + FN) |
Bal_ACC | Balanced Accuracy | average of TPR and TNR | mean(TPR, TNR) |
MCC | Matthews Correlation Coefficient | correlation between the predicted and observed datasets; higher correlation = higher MCC | (TP × TN - FP × FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)) | Phi Coefficient
EQUITABLE_THREAT_SCORE | Equitable Threat Score | how well did predicted wet areas correspond to observed wet areas, accounting for hits due to chance? | (TP - a_ref) / (TP - a_ref + FP + FN), where a_ref = ((TP + FP) × (TP + FN)) / (TP + FP + TN + FN) | Gilbert Skill Score
PREVALENCE | Prevalence | what fraction of the total area was observed to be wet? | (TP + FN) / (TP + FP + TN + FN) = obsPositive / Total |
BIAS | Bias | ratio of predicted wet area to observed wet area; neglects accuracy | (TP + FP) / (TP + FN) = predPositive / obsPositive | Area Ratio, Frequency Bias
F1_SCORE | F1 Score | harmonic mean of TPR and PPV (both must be high for a high F1) | (2 × TP) / (2 × TP + FP + FN) |
TP_perc | True Positive Percent | percent of classified cells that are true positive | 100 × TP / (TP + FP + TN + FN) |
FP_perc | False Positive Percent | percent of classified cells that are false positive | 100 × FP / (TP + FP + TN + FN) |
TN_perc | True Negative Percent | percent of classified cells that are true negative | 100 × TN / (TP + FP + TN + FN) |
FN_perc | False Negative Percent | percent of classified cells that are false negative | 100 × FN / (TP + FP + TN + FN) |
predPositive_perc | Predicted Positive Percent | percent of classified cells modeled as wet | 100 × (TP + FP) / (TP + FP + TN + FN) |
predNegative_perc | Predicted Negative Percent | percent of classified cells modeled as dry | 100 × (TN + FN) / (TP + FP + TN + FN) |
obsPositive_perc | Observed Positive Percent | percent of classified cells observed as wet | 100 × (TP + FN) / (TP + FP + TN + FN) |
obsNegative_perc | Observed Negative Percent | percent of classified cells observed as dry | 100 × (TN + FP) / (TP + FP + TN + FN) |
positiveDiff_perc | Positive Difference Percent | difference between predicted positive and observed positive percent | predPositive_perc - obsPositive_perc |
masked_count | Masked Count | # cells excluded from the domain due to masking (e.g., lakes) | ∑ masked | note: the contingency table is developed after masked areas are removed
masked_perc | Masked Percent | percent of the total domain that is masked (e.g., lakes) | 100 × Masked / (TP + FN + FP + TN + Masked) |
masked_area_km2 | Masked Area (km2) | area of masked cells (e.g., lakes) | Masked * cell area |
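For reference, a minimal sketch computing several of the tabulated metrics from the four contingency counts (the counts below are purely illustrative):

```python
# Compute a subset of the tabulated metrics from contingency counts.
import math

def alpha_metrics(tp, fp, tn, fn):
    total = tp + fp + tn + fn
    return {
        "CSI": tp / (tp + fp + fn),
        "FAR": fp / (tp + fp),
        "TPR": tp / (tp + fn),
        "TNR": tn / (tn + fp),
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "ACC": (tp + tn) / total,
        "BIAS": (tp + fp) / (tp + fn),
        "F1_SCORE": 2 * tp / (2 * tp + fp + fn),
        "MCC": (tp * tn - fp * fn)
               / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

print(alpha_metrics(tp=660, fp=170, tn=9000, fn=170))
```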