Fix performance regression in quadtree_point_in_polygon by 5x #1446
base: branch-24.12
Conversation
I think the title should be "Fix performance regression in ...". This is a bugfix, not an improvement, right? The current title sounds like a major new optimization. Also, in your benchmark, please make it clear what data size you are seeing a 5x improvement for.

Probably should have been a bug issue filed for this.
points_data = points.points
points_x = as_column(points_data.x)
points_y = as_column(points_data.y)
I think we should allow users to avoid packing points into a GeoSeries if desired. Since GeoSeries is a Union<List<List<List<Float32>>>>, each 8-byte point pair requires an additional 13 bytes (1 + 4 * 3) of indices, which is prohibitively expensive when the dataset is large.

I suggest we either change the API to accept separate x/y columns again (while still accepting and unpacking the points from a GeoSeries), or provide some sort of wrapper for a pair of x/y columns that can be accessed the same way as the GeoSeries.
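As a quick back-of-envelope check of that overhead (the exact layout is my assumption; the byte counts just restate the figures above):

coord_bytes = 2 * 4      # one float32 (x, y) pair
index_bytes = 1 + 4 * 3  # 1 type byte + 3 levels of int32 list offsets
print(coord_bytes, index_bytes, coord_bytes + index_bytes)  # 8 13 21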
At the moment, here's my workaround:
from cudf import Series
from cuspatial import quadtree_on_points
from cuspatial.core._column.geometa import Feature_Enum

class Points:
    def __init__(self, x, y):
        self.x = x
        self.y = y
        self.points = self
        self._column = self
        self._meta = self
        self.input_types = Series([Feature_Enum.POINT.value])

    def __len__(self) -> int:
        return len(self.x)

x = Series([ ... ])
y = Series([ ... ])

# Wrap x and y cols in a Points instance which duck-types
# GeoSeries' accessors a la `p.points.x` and `p.points.y`
quadtree_on_points(Points(x, y), ...)
127.406344ms is still way too long; the query in the benchmark code should take 5-10ms.
Sorry yeah, I updated the PR title and added the sizes of

@trxcllnt do you have a repro of that high performance? Are you saying we still have a 12-25x performance regression?
@harrism yes, applying the workaround in my comment brings the run time down to ~25ms (see below). I believe this is still longer than it should take, since it takes 5-10ms in node-rapids (which is still on cuspatial branch-22.12). Since node-rapids is basically straight calls into libcudf and libcuspatial, it doesn't include most of the additional overhead in Python (like DataFrame indices), so there are likely additional operations in Python contributing to the 25ms.

There is an alternative to the workaround I've described here -- we could add a C++ API that accepts the packed X/Y format in GeoSeries, then adapt it into two strided x and y iterators for the quadtree spatial join (see the minimal strided-view sketch after the benchmark script below). However, this still requires the user to construct a GeoSeries from the points (which takes 21 bytes/row instead of 8), so this would still be sub-optimal for memory-sensitive users.

$ python cuspatial-quadtree-regression.py
390.267027ms
25.687572ms

# cuspatial-quadtree-regression.py
# 1.3G points.arrow https://node-rapids-data.s3.us-west-2.amazonaws.com/spatial/168898952_points.arrow.gz
# 722K polys.arrow https://node-rapids-data.s3.us-west-2.amazonaws.com/spatial/263_tracts.arrow.gz
import cudf
import cupy
import cuspatial
import pyarrow as pa
from time import perf_counter_ns
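# Load the tract polygons and rebuild them as a cuspatial GeoSeries
# from flat interleaved xy coordinates plus the nested list offsets.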
polys = pa.ipc.open_stream("polys.arrow").read_all()
polys = cudf.DataFrame.from_arrow(polys).rename(columns={"tract": "poly"})
point = polys["poly"].list.leaves.struct.explode()
polys = cuspatial.GeoSeries.from_polygons_xy(
point.interleave_columns(),
polys["poly"]._column.elements.offsets,
polys["poly"]._column.offsets,
cupy.arange(0, len(polys) + 1)
)
point = pa.ipc.open_stream("points.arrow").read_all()
point = cudf.DataFrame.from_arrow(point)
min_x = point["x"].min()
max_x = point["x"].max()
min_y = point["y"].min()
max_y = point["y"].max()
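# Decrease max_depth until max_size = len(point) / 4**max_depth / 4
# exceeds the threshold, then derive the quadtree cell scale from the
# larger bounding-box extent.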
max_size = 0
max_depth = 16
threshold = 10
while max_size <= threshold:
max_depth -= 1
max_size = len(point) / pow(4, max_depth) / 4
scale = max(max_x - min_x, max_y - min_y) / (1 << max_depth)
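# Pack the x/y columns into an interleaved GeoSeries and build the
# point quadtree over it.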
point_xy = cuspatial.GeoSeries.from_points_xy(point.interleave_columns())
point_map, quadtree = (
cuspatial.quadtree_on_points(point_xy,
min_x,
max_x,
min_y,
max_y,
scale,
max_depth,
max_size))
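# Time the quadtree spatial join with the points packed in a GeoSeries.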
t0 = perf_counter_ns()
cuspatial.quadtree_point_in_polygon(
cuspatial.join_quadtree_and_bounding_boxes(
quadtree,
cuspatial.polygon_bounding_boxes(polys[0:1]),
min_x,
max_x,
min_y,
max_y,
scale,
max_depth
),
quadtree,
point_map,
point_xy,
polys[0:1]
)
t1 = perf_counter_ns()
print(f"{(t1 - t0) / (10 ** 6)}ms")
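# Duck-typed stand-in for the subset of the GeoSeries interface that
# quadtree_point_in_polygon touches (the workaround described above).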
from cuspatial.core._column.geometa import Feature_Enum
class Points:
def __init__(self, x, y):
self.x = x
self.y = y
self.points = self
self._column = self
self._meta = self
self.input_types = cudf.Series([Feature_Enum.POINT.value])
def __len__(self) -> int:
return len(self.x)
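# Time the same join again, passing the duck-typed x/y wrapper in
# place of the GeoSeries.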
t0 = perf_counter_ns()
cuspatial.quadtree_point_in_polygon(
cuspatial.join_quadtree_and_bounding_boxes(
quadtree,
cuspatial.polygon_bounding_boxes(polys[0:1]),
min_x,
max_x,
min_y,
max_y,
scale,
max_depth
),
quadtree,
point_map,
Points(point["x"], point["y"]),
polys[0:1]
)
t1 = perf_counter_ns()
print(f"{(t1 - t0) / (10 ** 6)}ms")
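For reference, here is a minimal sketch of the strided-view idea mentioned above: exposing an interleaved xy buffer as separate x and y columns without copying. The CuPy buffer below is toy data for illustration only; a real C++ API would wrap device iterators instead.

import cupy

# Interleaved coordinate buffer: [x0, y0, x1, y1, ...]
xy = cupy.arange(8, dtype="float32")

# Stride-2 views over the same device memory; no copies are made.
x = xy[0::2]
y = xy[1::2]

print(x)  # [0. 2. 4. 6.]
print(y)  # [1. 3. 5. 7.]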
Description
Improves on a performance regression identified by @trxcllnt when using quadtree_point_in_polygon.

Seems like the core slowdown is that the passed points is a GeoSeries, and the x/y points are generated by interleaving them in GeoColumnAccessor.xy and then slicing x/y back out.

Since these points appear to be held in a cudf.ListDtype (i.e. [x, y]), I tried avoiding the interleaving by slicing the x and y points individually out of the lists using the cudf.Series.list.get API, but this was failing for .polygons.x / .polygons.y, since some cuspatial APIs expect these points to be float (not sure if it's a cudf bug in list.get or how points are structured for polygons).

Instead, I was able to get a 5x speedup by avoiding cudf.Series (which uselessly slices an .index that is not used).
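As context for the list.get attempt above, here is a minimal toy sketch of slicing x and y out of a list column of [x, y] pairs (the data is made up for illustration; the real lists come from the GeoSeries):

import cudf

# Toy list<float64> column of [x, y] pairs
pts = cudf.Series([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

x = pts.list.get(0)  # first element of each pair -> x
y = pts.list.get(1)  # second element of each pair -> y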
which is not used)Benchmarking code
Checklist