Non linear running time for ogc:contains precomputation

The precomputation time of the ogc:contains currently grows non-linearly with the size of the data set, which shouldn't be the case. We are investigating the problem.

Example precomputation times

PBF files downloaded from https://download.geofabrik.de . The data for baden-wuerttemberg is around 4 times larger than that of freiburg-regbez. The data of germany is around 7.5 times larger than that of baden-wuerttemberg and around 30 times larger than that of freiburg-regbez. The time to compute the regular triples indeed grows linearly, as one would expect. But the time to compute the ogc:contains triples grows super-linearly, and quite strongly so.

Dataset	nodes, ways, rels	regular triples	contains triples	ways-contained-in-area triples
freiburg-regbez	0.6M, 1.9M, 37K	0.5min	0.4min	0.25min
baden-wuerttemberg	2.2M, 8.0M, 118K	1.5min	17min	14min
germany	16M, 59M, 742K	12min	400min	300min

Where does the time go?

Optionally, osm2rdf can be made to output "statistic triples" for each way-in-area comparison done during the ogc:contains precomputation, which can then be queried conveniently using SPARQL. An analysis of these numbers gives a very good picture of what is happening.

Dataset	#ways	#comp	#comp / way	time per comp	total time comp
freiburg-regbez	1.9M	2.5M	1.4	14µs	0.6min
baden-wuerttemberg	8.0M	11M	1.4	23µs	4.3min
germany	59M	83M	1.4	130µs	180min

Queries

Numbers that behave as expected

The number of way-in-area comparisons according to the statistic triples (column "#comp" in the table above) correlates well with the respective number reported in the osm2rdf log. For freiburg-regbez, the two numbers are identical, for baden-wuerttemberg, there is a slight deviation. Note that this number comprises all comparisons between a way w and an area a such that (1) the bounding boxes of w and a overlap (and hence the pair is contained in the result of the R-tree query for areas potentially overlapping with w), and (2) there is no area a' that is contained in a (according to our contains-area DAG) and for which we already know that w is contained a' (and hence also in a).

One way is compared to around 1.4 areas on average, independent of the size of the data. In particular, most ways are compared to a single area. That shows that our approach works as intended. Also, most of the total comparison time is spent on comparisons of ways that are only compared against very few areas (see Query 3).

Numbers that contribute to the non-linear growth of the time for the way-in-area comparisons

area size	#areas	#distinct	total time	area	all ways in area
81,447	440,664	1	59min	Jutland Peninsula	QLever query
48,864	322,828	1	25min	Vorpommern	QLever query
155,244	64,150	1	18min	Deutschland	QLever query
4	204,808	187	13min	many

The table above shows the four biggest contributors (when grouping by area size, measured in number of nodes) to the total time for the way-in-area comparisons. The three biggest contributors are single areas: these three use almost 60% of the total time. Only the fourth contributor is "normal" (many comparisons to a variety of small areas, with a moderate contribution to the total running time). See Query 6 above. After row 4 (not shown above, execute the query), there are many more single-area contributors, amounting for around a minute of the total time. These are larger regions like states (Bayern, Nordrhein-Westfalen) or others (Naturpark Bergisches Land, Umweltzone Ruhrgebiet).

Open questions from this analysis

Question 1: For baden-wuerttemberg, according to the statistic triples, only 4 minutes are spent on way-in-area comparisons, but the total time for this phase as reported by osm2rdf is 14 minutes. Where are the remaining 10 minutes spent?

Question 2: For each way-in-area comparison, there are currently two area ids osmway:<id> and osmrel:<id> because I did not know how to find out (at the point in the code, where the statistic triples are output) whether the area came from a way or from a relation. Is this information available at that point?

New figures from Patrick

Preprocessing time for "ways in areas" on ob (AMD Ryzen 7, 128 GB, 16 cores), with default settings (0.015 for inner-outer-geoms):

Dataset	#ways	R-tree	intersect	contains	total time	time per core
freiburg-regbez	1.9M	2min	3min	0.7min	7.1min	0.4min
baden-wuerttemberg	8.0M	4.6h	1.9h	0.2h	7.1h	0.4h
germany	59M	96.0h	35.8h	4.3h	141h	8.8h

Provide feedback

Saved searches

Use saved searches to filter your results more quickly