Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix performance regression in quadtree_point_in_polygon by 5x #1446

Open
wants to merge 2 commits into
base: branch-24.12
Choose a base branch
from

Conversation

mroeschke
Copy link
Contributor

@mroeschke mroeschke commented Aug 27, 2024

Description

Improves on a performance regression identified by @trxcllnt when using quadtree_point_in_polygon

Seems like the core slowdown is that the passed points is a GeoSeries, and the x/y points are generated by interleaving them in GeoColumnAccessor.xy and then slicing x/y back out.

Since these points appear to be held in a cudf.ListDtype (i.e. [x, y]), I tried avoiding interleaving and slicing the the x and y points individually in the lists using the cudf.Series.list.get API but this was failing for x/y points for .polygons.x/.polygons.y since some cuspatial APIs expect these points to be float (not sure if it's a cudf bug in list.get or how points as structured for polygons)

Instead, I was able to get a 5x speedup by

  • Caching the interleaving step to avoid doing this multiple times
  • Slicing a cupy array of points instead of a cudf.Series (which uselessly slices an .index which is not used)
Benchmarking code
# 1.3G    points.arrow
# 722K    polys.arrow

import cudf
import cupy
import cuspatial
import pyarrow as pa
from time import perf_counter_ns

polys = pa.ipc.open_stream("polys.arrow").read_all()
polys = cudf.DataFrame.from_arrow(polys).rename(columns={"tract": "poly"})
point = polys["poly"].list.leaves.struct.explode()
polys = cuspatial.GeoSeries.from_polygons_xy(
  point.interleave_columns(),
  polys["poly"]._column.elements.offsets,
  polys["poly"]._column.offsets,
  cupy.arange(0, len(polys) + 1)
)

point = pa.ipc.open_stream("points.arrow").read_all()
point = cudf.DataFrame.from_arrow(point)
min_x = point["x"].min()
max_x = point["x"].max()
min_y = point["y"].min()
max_y = point["y"].max()

max_size = 0
max_depth = 16
threshold = 10
while max_size <= threshold:
  max_depth -= 1
  max_size = len(point) / pow(4, max_depth) / 4

scale = max(max_x - min_x, max_y - min_y) / (1 << max_depth)

point_xy = cuspatial.GeoSeries.from_points_xy(point.interleave_columns())
point_map, quadtree = (
  cuspatial.quadtree_on_points(point_xy,
                               min_x,
                               max_x,
                               min_y,
                               max_y,
                               scale,
                               max_depth,
                               max_size))

t0 = perf_counter_ns()

cuspatial.quadtree_point_in_polygon(
  cuspatial.join_quadtree_and_bounding_boxes(
      quadtree,
      cuspatial.polygon_bounding_boxes(polys[0:1]),
      min_x,
      max_x,
      min_y,
      max_y,
      scale,
      max_depth
  ),
  quadtree,
  point_map,
  point_xy,
  polys[0:1]
)

t1 = perf_counter_ns()

print(f"{(t1 - t0) / (10 ** 6)}ms")

t0 = perf_counter_ns()
poly_offsets = cudf.Series(polys[0:1].polygons.part_offset)._column
ring_offsets = cudf.Series(polys[0:1].polygons.ring_offset)._column
poly_points_x = cudf.Series(polys[0:1].polygons.x)._column
poly_points_y = cudf.Series(polys[0:1].polygons.y)._column

from cuspatial._lib import spatial_join

cudf.DataFrame._from_data(
  *spatial_join.quadtree_point_in_polygon(
      cuspatial.join_quadtree_and_bounding_boxes(
          quadtree,
          cuspatial.polygon_bounding_boxes(polys[0:1]),
          min_x,
          max_x,
          min_y,
          max_y,
          scale,
          max_depth
      ),
      quadtree,
      point_map._column,
      point["x"]._column,
      point["y"]._column,
      poly_offsets,
      ring_offsets,
      poly_points_x,
      poly_points_y,
  )
)

t1 = perf_counter_ns()

print(f"{(t1 - t0) / (10 ** 6)}ms")

# 127.406344ms  # this PR
# 644.963021ms  # branch 24.10

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mroeschke mroeschke added Python Related to Python code improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Aug 27, 2024
@mroeschke mroeschke requested a review from a team as a code owner August 27, 2024 22:50
@harrism
Copy link
Member

harrism commented Aug 27, 2024

I think the title should be "Fix performance regression in ...". This is a bugfix, not an improvement, right? The current title sounds like a major new optimization.

Also, in your benchmark, please make it clear what data size you are seeing a 5x improvement for.

@harrism
Copy link
Member

harrism commented Aug 27, 2024

Probably should have been a bug issue filed for this.

Comment on lines +217 to +219
points_data = points.points
points_x = as_column(points_data.x)
points_y = as_column(points_data.y)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should allow users to avoid packing points into a GeoSeries if desired.

Since GeoSeries is a Union<List<List<List<Float32>>>, each 8-byte point pair requires an additional 13 bytes (1 + 4 * 3 ) of indices, which is prohibitively expensive when the dataset is large.

I suggest we either change the API to accept separate x/y columns again (while still accepting and unpacking the points from a GeoSeries), or provide some sort of wrapper for a pair of x/y columns that can be accessed the same way as the GeoSeries.

At the moment, here's my workaround:

class Points:
    def __init__(self, x, y):
        self.x = x
        self.y = y
        self.points = self
        self._column = self
        self._meta = self
        self.input_types = Series([Feature_Enum.POINT.value])

    def __len__(self) -> int:
        return len(self.x)


x = Series([ ... ])
y = Series([ ... ])
# Wrap x and y cols in a Points instance which duck-types
# GeoSeries' accessors ala `p.points.x` and `p.points.y`
quadtree_on_points(Points(x, y), ...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Historically it looks like the API changed in #934 where quadtree_point_in_polygon was updated to accept a GeoSeries points and polygon inputs. I suppose @isVoid is best to make the call on whether we should change the API back to accept x/y points again.

Copy link
Contributor

@trxcllnt trxcllnt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

127.406344ms is still way too long, the query in the benchmark code should take 5-10ms.

@mroeschke mroeschke changed the title Improve quadtree_point_in_polygon performance by 5x Fix performance regression in quadtree_point_in_polygon by 5x Aug 28, 2024
@mroeschke
Copy link
Contributor Author

I think the title should be "Fix performance regression in ...". This is a bugfix, not an improvement, right? The current title sounds like a major new optimization.

Also, in your benchmark, please make it clear what data size you are seeing a 5x improvement for.

Sorry yeah I updated the PR title and added the sizes of points.arrow and polys.arrow in the benchmark example

@harrism
Copy link
Member

harrism commented Aug 28, 2024

127.406344ms is still way too long, the query in the benchmark code should take 5-10ms.

@trxcllnt do you have a repro of that high performance? Are you saying we still have a 12-25x performance regression?

@trxcllnt
Copy link
Contributor

@harrism yes, applying the workaround in my comment brings the run time down to ~25ms (see below).

I believe this is still longer than it should take, since it takes 5-10ms in node-rapids (which is still on cuspatial branch-22.12). Since node-rapids is basically straight calls into libcudf and libcuspatial, it doesn't include most of the additional overhead in Python (like DataFrame indices), so there are likely additional operations in Python contributing to the 25ms.

There is an alternative to the workaround I've described here -- we could add a C++ API that accepts the packed X/Y format in GeoSeries, then adapt it into two strided x and y iterators for the quadtree spatial join.

However, this still requires the user to construct a GeoSeries from the points (which takes 21 bytes/row instead of 8), so this would still be sub-optimal for memory-sensitive users.

$ python cuspatial-quadtree-regression.py 
390.267027ms
25.687572ms
# cuspatial-quadtree-regression.py
# 1.3G    points.arrow https://node-rapids-data.s3.us-west-2.amazonaws.com/spatial/168898952_points.arrow.gz
# 722K    polys.arrow https://node-rapids-data.s3.us-west-2.amazonaws.com/spatial/263_tracts.arrow.gz

import cudf
import cupy
import cuspatial
import pyarrow as pa
from time import perf_counter_ns

polys = pa.ipc.open_stream("polys.arrow").read_all()
polys = cudf.DataFrame.from_arrow(polys).rename(columns={"tract": "poly"})
point = polys["poly"].list.leaves.struct.explode()
polys = cuspatial.GeoSeries.from_polygons_xy(
  point.interleave_columns(),
  polys["poly"]._column.elements.offsets,
  polys["poly"]._column.offsets,
  cupy.arange(0, len(polys) + 1)
)

point = pa.ipc.open_stream("points.arrow").read_all()
point = cudf.DataFrame.from_arrow(point)
min_x = point["x"].min()
max_x = point["x"].max()
min_y = point["y"].min()
max_y = point["y"].max()

max_size = 0
max_depth = 16
threshold = 10
while max_size <= threshold:
  max_depth -= 1
  max_size = len(point) / pow(4, max_depth) / 4

scale = max(max_x - min_x, max_y - min_y) / (1 << max_depth)

point_xy = cuspatial.GeoSeries.from_points_xy(point.interleave_columns())
point_map, quadtree = (
  cuspatial.quadtree_on_points(point_xy,
                               min_x,
                               max_x,
                               min_y,
                               max_y,
                               scale,
                               max_depth,
                               max_size))

t0 = perf_counter_ns()

cuspatial.quadtree_point_in_polygon(
  cuspatial.join_quadtree_and_bounding_boxes(
      quadtree,
      cuspatial.polygon_bounding_boxes(polys[0:1]),
      min_x,
      max_x,
      min_y,
      max_y,
      scale,
      max_depth
  ),
  quadtree,
  point_map,
  point_xy,
  polys[0:1]
)

t1 = perf_counter_ns()

print(f"{(t1 - t0) / (10 ** 6)}ms")

from cuspatial.core._column.geometa import Feature_Enum

class Points:
    def __init__(self, x, y):
        self.x = x
        self.y = y
        self.points = self
        self._column = self
        self._meta = self
        self.input_types = cudf.Series([Feature_Enum.POINT.value])

    def __len__(self) -> int:
        return len(self.x)

t0 = perf_counter_ns()

cuspatial.quadtree_point_in_polygon(
  cuspatial.join_quadtree_and_bounding_boxes(
      quadtree,
      cuspatial.polygon_bounding_boxes(polys[0:1]),
      min_x,
      max_x,
      min_y,
      max_y,
      scale,
      max_depth
  ),
  quadtree,
  point_map,
  Points(point["x"], point["y"]),
  polys[0:1]
)

t1 = perf_counter_ns()

print(f"{(t1 - t0) / (10 ** 6)}ms")

@mroeschke mroeschke changed the base branch from branch-24.10 to branch-24.12 September 25, 2024 18:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Related to Python code
Projects
Status: Todo
Development

Successfully merging this pull request may close these issues.

3 participants