This section of the ER discusses the experiments and results for storing large numbers of vector geometries in a single GeoPackage.
To test potential approaches to storing CDB vector data in GeoPackage, the Vector Data in GeoPackage Subgroup, hereafter referred to as the Subgroup, performed several different experiments to determine the viability of storing very large volumes of vector data in a smaller number of large GeoPackage files rather than a very large number of small Shapefiles. Some of the experiments also set out to determine whether there are upper limits on how much CDB content can be stored in a GeoPackage. This section details the Subgroup's approach to, and the results of, these experiments.
The experiments demonstrated that GeoPackage/SQLite is capable of meeting the performance and size constraints for vector data in a CDB, provided that the GeoPackage schema is properly defined. The experiments also demonstrated that, from a performance standpoint, a direct conversion of existing Shapefile content into GeoPackage(s) may not be feasible. To gain the most from using GeoPackage as a container for large vector datasets, some approaches to level of detail (LOD) and other CDB content should be reconsidered, including the way significant size is used in CDB.
The Subgroup also found that schema design, including proper index design, is very important when using GeoPackage. Shapefiles are simple for developers to use because there is little flexibility in how data is stored. Conversely, GeoPackage is very flexible. As such, how a GeoPackage is structured and indexed can have a major impact on performance. If the exact schema and indexes for GeoPackage are properly defined as part of CDB-X, users can expect very good performance, even for large datasets.
NOTE: The OGC CDB Standards Working Group will use the work performed by the participants in this Sprint, further augmented by the OGC Testbed-16 GeoPackage experiments and findings related to processing, storing, and querying extremely large vector datasets.
The purpose of these experiments was to examine performance in searching for and interacting with CDB features stored in very large unitary GeoPackage containers. The design intent of the large datasets consisted of three main considerations:
- Be large in terms of feature counts.
- Be CDB-like in terms of feature attribution.
- Be 'representative' of real-world feature density over a large geographic area.
The source data was prepared for the USA, including Alaska, Hawaii, and Puerto Rico. Testing, however, was limited to the Continental US (CONUS).
To achieve the experiment objectives, four GeoPackage containers were populated:
- cdb-usa-bldg-ssiz.gpkg
- osm-hydro-to-CDB.gpkg
- osm-roads-to-CDB.gpkg
- combined-features-conus.gpkg
These are now described.
This GeoPackage contains 39,079,781 point features and requires 5,249,290,240 bytes of storage. The image below shows the attributes of features in the attribute table.
These point features were created using the following process:
- Building footprints for the United States area were extracted from OpenStreetMap sources.
- An approximate 'significant size' for each building was calculated using the formula: SignificantSize = sqrt($area) * 1.5 (a sketch of this derivation appears below).
- The centroid of each footprint polygon was generated, preserving attributes.
- The centroid points were passed through an 'OSM to CDB Attribute Translator' to assign CDB-like attribution.
This data is contained in a layer named “GT_Models”, though the nature of the data can be considered as an agnostic representation of either GT or GS CDB models. Please note that the original OSM building footprints do not have a ‘height’ attribute, so the derived significant size is approximate and present only for illustrative and comparative purposes.
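The following Python sketch illustrates one way this footprint-to-point derivation could be implemented with the 'osgeo' module used elsewhere in these experiments. It is a minimal sketch, assuming the footprints are available in a projected (metric) layer so that areas are in square meters; the source filename, layer names, and field name are hypothetical and are not the exact ones used by the Subgroup.

```python
# Hedged sketch: derive model instancing points with an approximate
# 'significant size' from OSM building footprints. Names are illustrative.
from math import sqrt
from osgeo import ogr

src_ds = ogr.Open("osm_building_footprints.gpkg")      # hypothetical source
src_lyr = src_ds.GetLayer("building_footprints")       # hypothetical layer name

dst_ds = ogr.GetDriverByName("GPKG").CreateDataSource("cdb-usa-bldg-ssiz.gpkg")
dst_lyr = dst_ds.CreateLayer("GT_Models", src_lyr.GetSpatialRef(), ogr.wkbPoint)
dst_lyr.CreateField(ogr.FieldDefn("SignificantSize", ogr.OFTReal))

for feat in src_lyr:
    footprint = feat.GetGeometryRef()
    # OSM footprints carry no height, so this is only an approximation.
    ssiz = sqrt(footprint.GetArea()) * 1.5
    out = ogr.Feature(dst_lyr.GetLayerDefn())
    out.SetGeometry(footprint.Centroid())              # one point per footprint
    out.SetField("SignificantSize", ssiz)
    dst_lyr.CreateFeature(out)

dst_ds = None                                          # flush and close
```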
This GeoPackage contains two layers with a total of 6,378,675 features and requires 5,043,081,216 bytes of storage.
The first layer is Hydro_Network_Areal, with attributes as shown below. This layer contains 2,126,072 features.
While named a ‘network’ layer, no effort was made to conduct a topological analysis and assign junction IDs. The CDB-like attribution is merely representative. This layer was created by combining OSM hydrographic polygons based on a very simple attribute filter, and then running the results through an 'OSM to CDB Attribute Translator’ with rules set to create very generic CDB attributions.
The second layer is 'Hydro_Network_Linear', with attribution as shown below. This layer contains 4,252,603 features.
Again, no effort was made to conduct a topological analysis and assign junction IDs. The CDB-like attribution is merely representative. This layer was created by combining OSM hydrographic linears based on a simple attribute filter, and then running the results through an 'OSM to CDB Attribute Translator’ with rules set to create very generic CDB attributions.
This GeoPackage contains roads derived from worldwide OSM. The GeoPackage contains 90,425,963 features and requires 29,832,347,648 bytes of storage.
Like the hydrographic features described above, this dataset does not contain a true network. Topology was not analyzed and CDB junction IDs are not set.
The final GeoPackage container, combined-features-conus.gpkg, is simply a single container with each of the aforementioned layers copied into it. This GeoPackage requires 18,164,895,744 bytes of storage.
The layer 'Road_Network_Linear' was clipped to approximately the CONUS area from the worldwide road coverage so that this GeoPackage covers the same extents as the other three layers.
The objective of the testing was to explore a combination of spatial and attribution filtering in a CDB-like environment. To illustrate the importance of properly designing the schema when migrating from a Shapefile-based to a GeoPackage-based CDB, all of the vector data in a CDB was converted. An approach similar to "Design Approach 4" in the OGC Discussion Paper titled OGC CDB, Leveraging GeoPackage Discussion Paper was used. The conversion grouped all vector features by dataset and geocell into a single GeoPackage. Each vector feature was assigned a value for LOD, HREF, and UREF corresponding to the original Shapefile filename. A test was developed to randomly seek and read features in the CDB GeoPackage. The test script had a list of 8243 individual Shapefiles, and each file was opened and read in a randomized order. In the Shapefile case, each file was opened by filename and all of its features were read. In subsequent tests with GeoPackage, the same content was read (using the list of Shapefile filenames), but instead of opening the Shapefile, the script performed a query based on the LOD, HREF, and UREF attributes.
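The following is a minimal sketch of what such a query can look like using the 'osgeo' module. The GeoPackage filename, layer index, and the assumption that LOD, HREF, and UREF are stored as numeric columns are illustrative stand-ins for the experiment's actual schema.

```python
# Hedged sketch: the GeoPackage equivalent of "open one Shapefile and read
# all of its features", using an attribute filter on LOD, HREF, and UREF.
from osgeo import ogr

ds = ogr.Open("geocell_vector_dataset.gpkg")           # hypothetical container
lyr = ds.GetLayer(0)

def read_tile(lod, href, uref):
    # Values are assumed numeric here; quote them if the schema stores text.
    lyr.SetAttributeFilter(
        "LOD = %d AND HREF = %d AND UREF = %d" % (lod, href, uref))
    features = list(lyr)                               # read every matching feature
    lyr.SetAttributeFilter(None)                       # clear the filter for the next query
    return features
```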
In this test, reading the Shapefiles took 0:01:29 (about 1.5 minutes). With no indexes on the GeoPackage attributes, the queries took over an hour (1:01:47). Next, an index on the LOD, HREF, and UREF attributes was created and the GeoPackage test was repeated. With the index in place, finding and reading the same features took 0:00:49, roughly half the time it took to read the Shapefiles.
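Because a GeoPackage is an SQLite database, an index of this kind can be created directly with Python's sqlite3 module, as in the hedged sketch below. The table name and column names are placeholders for the experiment's actual schema.

```python
# Hedged sketch: create the attribute index that made the difference in this
# test. "vector_features" is a placeholder for the actual feature table name.
import sqlite3

with sqlite3.connect("geocell_vector_dataset.gpkg") as conn:
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_lod_href_uref "
        "ON vector_features (LOD, HREF, UREF)"
    )
```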
- The testing environment was a single Windows workstation with 16 CPU cores, 64 GB of system RAM, and very large SATA disk storage. No 'exotic' (SSD, M2, etc.) storage devices were used.
- Tests were created as Python scripts leveraging the 'osgeo' Python module. Timing information was captured using Python's 'time' module. Python 3.7.4 (64-bit) was used.
- Each timing test was performed within the approximate CONUS extents, bounded by 49 degrees North latitude on the north, 24 degrees North latitude on the south, 66 degrees West longitude on the east, and 125 degrees West longitude on the west.
- Prior to running a test, a 'step size' was defined, typically corresponding to a CDB LOD tile size. A list of every spatial filter in the entire extent was created and then randomized.
- Also prior to a test, a 'significant size' filter was set. When the layer 'GT_Model' is encountered, this additional attribute filter is applied. The intent is to mimic LOD selection in addition to the spatial filter.
- There were three timing steps (a simplified sketch of the full timing loop follows this list):
  - Timing step one is the elapsed time to apply the spatial filter.
  - Timing step two is the elapsed time to return a feature count based on the combined spatial and (if any) attribute filters.
  - Timing step three is the elapsed time to read the features from the layer into a Python list.
- At the end of processing and timing each 'tile' defined by the collection of spatial filters, a corresponding 'shape' is created and written into the test record output file.
- The output attribution is as follows:
  - count: the number of features returned after application of filters.
  - filter_t: time to complete the filtering operation(s), in seconds.
  - count_t: time to complete the feature count operation, in seconds.
  - read_t: time to complete the feature read operation, in seconds. This includes reading from the GeoPackage container and appending each feature to a Python list.
  - Sequence: the order in which the tile was processed.
  - '$geometry': the tile extents derived from the spatial filter polygon.
- Note: tiles that return zero features do not create a test output record.
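The sketch below approximates the timing loop described above, using the same 'osgeo' and 'time' modules as the experiment's scripts. The extents, step size, layer name, significant-size threshold, and output handling are illustrative assumptions; the actual scripts also wrote each tile's extent polygon to the test record output file.

```python
# Hedged sketch of the spatial/attribute filter timing loop. Values and names
# are illustrative, not the exact ones used in the Subgroup's scripts.
import random
import time
from osgeo import ogr

WEST, EAST, SOUTH, NORTH = -125.0, -66.0, 24.0, 49.0   # approximate CONUS extents
STEP = 0.25                                            # step size, roughly a CDB LOD tile
SSIZ_FILTER = 3.39718                                  # significant-size threshold (Experiment 2)

def frange(start, stop, step):
    # Simple float range helper for building the tile grid.
    value = start
    while value < stop:
        yield value
        value += step

ds = ogr.Open("combined-features-conus.gpkg")
layers = [ds.GetLayerByIndex(i) for i in range(ds.GetLayerCount())]

# Build every spatial filter in the extent, then randomize the processing order.
tiles = [(x, y, x + STEP, y + STEP)
         for x in frange(WEST, EAST, STEP)
         for y in frange(SOUTH, NORTH, STEP)]
random.shuffle(tiles)

records = []
for sequence, (minx, miny, maxx, maxy) in enumerate(tiles):
    count = 0
    filter_t = count_t = read_t = 0.0
    features = []
    for lyr in layers:
        # Timing step one: apply the spatial filter (plus the significant-size
        # attribute filter on the model instancing points).
        t0 = time.time()
        lyr.SetSpatialFilterRect(minx, miny, maxx, maxy)
        if lyr.GetName() == "GT_Models":
            lyr.SetAttributeFilter("SignificantSize > %f" % SSIZ_FILTER)
        else:
            lyr.SetAttributeFilter(None)
        filter_t += time.time() - t0

        # Timing step two: count the filtered features.
        t0 = time.time()
        count += lyr.GetFeatureCount()
        count_t += time.time() - t0

        # Timing step three: read the filtered features into a Python list.
        t0 = time.time()
        features.extend(feature for feature in lyr)
        read_t += time.time() - t0

    if count > 0:  # tiles with zero features produce no test record
        records.append((count, filter_t, count_t, read_t, sequence,
                        (minx, miny, maxx, maxy)))
```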
Test results coverage is shown in the figure below.
This test simulates retrieving point features corresponding to CDB LOD2 and only models with the corresponding lowest significant size (as defined in CDB 3.2, Table 3-1). The conclusion drawn from this test is that spatial filtering time is insignificant and does not appear to be correlated with the number of features found. The time taken to count and read the filtered features, however, appears to be directly correlated with the number of features found.
The Experiment 1 attribute table results are shown in the figures below, each filtered on a different field.
Experiment 2: Simulation of LOD4, hydro, road, and building layers, significant size (buildings) > 3.39718
Test results coverage is shown in the figure below.
This test used the combined-layers source file and simulates a CDB LOD4 data access pattern. Timing values are totals, accumulating as each layer was filtered, counted, and read. Once again, filter timing appears to be insignificant and unrelated to the number of features filtered. Data in the GT_Model layer has both a spatial and an attribute (significant size) filter applied.
The Experiment 2 attribute table results are shown in the figures below, each filtered on a different field.
- Storing large amounts of feature data in single GeoPackage containers and retrieving that data by applying spatial and attribution filters that correspond with typical CDB access patterns appears to be practical.
- Spatial filters easily mimic the existing CDB tiling scheme (see the sketch after this list).
- Storing 'significant size' as an attribute on model instancing point features, rather than storing models in the significant-size-related folder scheme, can significantly improve the model retrieval scheme. Storing and evaluating significant size on instancing points can make visual content and performance tuning much more practical.
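As an illustration of the tiling point above, the sketch below maps a CDB tile address onto an OGR spatial filter rectangle. It is a simplified sketch that assumes a 1 x 1 degree geocell and LOD >= 0; a complete implementation would account for the wider geocells used at higher latitudes.

```python
# Hedged sketch: derive a spatial filter from a CDB LOD tile address,
# assuming a 1 x 1 degree geocell. 'row' and 'col' identify the tile
# within the geocell at the given LOD.
from osgeo import ogr

def apply_cdb_tile_filter(layer, geocell_lon, geocell_lat, lod, row, col):
    tiles_per_side = 2 ** lod            # LOD 0 covers the whole geocell
    size = 1.0 / tiles_per_side          # tile size in degrees
    minx = geocell_lon + col * size
    miny = geocell_lat + row * size
    layer.SetSpatialFilterRect(minx, miny, minx + size, miny + size)
```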