Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Header-only Refactor of point_in_polygon #587

Merged
merged 31 commits into from
Jul 27, 2022

Conversation

isVoid
Copy link
Contributor

@isVoid isVoid commented Jul 8, 2022

Overview

This PR closes #569. This PR rewrites the core pip test with Eric Haine's "crossings-multiply" algorithm: http://erich.realtimerendering.com/ptinpoly/

And this gives 30%~70% kernel time reduction (please ignore Status column):

['before_optim.json', 'optim_2.json']
# point_in_polygon_benchmark

## [0] Tesla V100-SXM2-32GB

|  CoordsType  |  NumTestPoints  |  NumSidesPerRing  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |          Diff |   %Diff |  Status  |
|--------------|-----------------|-------------------|------------|-------------|------------|-------------|---------------|---------|----------|
|     F32      |      1000       |         4         |  63.706 us |       6.09% |  36.483 us |       6.33% |    -27.223 us | -42.73% |   FAIL   |
|     F32      |     100000      |         4         | 113.862 us |       0.90% |  55.468 us |       1.75% |    -58.394 us | -51.29% |   FAIL   |
|     F32      |    10000000     |         4         |   9.167 ms |       0.02% |   3.785 ms |       0.14% |  -5382.554 us | -58.71% |   FAIL   |
|     F32      |      1000       |        10         | 109.911 us |       1.04% |  55.120 us |       1.82% |    -54.791 us | -49.85% |   FAIL   |
|     F32      |     100000      |        10         | 190.151 us |       0.57% |  81.772 us |       1.14% |   -108.380 us | -57.00% |   FAIL   |
|     F32      |    10000000     |        10         |  16.934 ms |       0.17% |   5.987 ms |       0.06% | -10947.199 us | -64.65% |   FAIL   |
|     F32      |      1000       |        100        | 813.854 us |       0.14% | 314.469 us |       0.30% |   -499.384 us | -61.36% |   FAIL   |
|     F32      |     100000      |        100        |   1.292 ms |       0.18% | 343.747 us |       0.37% |   -948.724 us | -73.40% |   FAIL   |
|     F32      |    10000000     |        100        | 124.471 ms |       0.09% |  29.318 ms |       0.06% | -95152.593 us | -76.45% |   FAIL   |
|     F64      |      1000       |         4         |  73.915 us |       1.40% |  46.365 us |       2.23% |    -27.550 us | -37.27% |   FAIL   |
|     F64      |     100000      |         4         | 149.814 us |       2.77% |  74.001 us |       1.42% |    -75.813 us | -50.60% |   FAIL   |
|     F64      |    10000000     |         4         |  10.541 ms |       0.08% |   5.186 ms |       0.05% |  -5354.802 us | -50.80% |   FAIL   |
|     F64      |      1000       |        10         | 133.100 us |       0.65% |  76.647 us |       1.25% |    -56.453 us | -42.41% |   FAIL   |
|     F64      |     100000      |        10         | 277.337 us |       3.51% | 117.104 us |       0.89% |   -160.233 us | -57.78% |   FAIL   |
|     F64      |    10000000     |        10         |  19.648 ms |       0.06% |   9.214 ms |       0.05% | -10434.298 us | -53.11% |   FAIL   |
|     F64      |      1000       |        100        |   1.017 ms |       0.19% | 485.950 us |       0.37% |   -531.440 us | -52.24% |   FAIL   |
|     F64      |     100000      |        100        |   1.992 ms |       0.63% | 517.869 us |       0.25% |  -1474.597 us | -74.01% |   FAIL   |
|     F64      |    10000000     |        100        | 137.110 ms |       0.08% |  40.617 ms |       0.07% | -96493.030 us | -70.38% |   FAIL   |

# Summary

- Total Matches: 18
  - Pass    (diff <= min_noise): 0
  - Unknown (infinite noise):    0
  - Failure (diff > min_noise):  18

Potential impact introducing divergence and removal of division

EDIT: there's no divergence: #587 (comment)

@zhangjianting and @harrism mentions in different occasions of the changes in

if (rise > 0 && (test_point.x - a.x) * rise < run * rise_to_point)
point_is_within = not point_is_within;
else if (rise < 0 && (test_point.x - a.x) * rise > run * rise_to_point)
point_is_within = not point_is_within;
may cause unwanted overhead due to divergence. The assumption to do so is that the cycles cost by division is greater than that of warp convergence. This is apparently dependent on many factors, such as compiling with/without --fast-math flag or how fast the hardware supports warp convergence. On a Tesla V100 the benchmark between divergence branch and division branch is as follows:

['optim_1.json', 'optim_2.json']
# point_in_polygon_benchmark

## [0] Tesla V100-SXM2-32GB

|  CoordsType  |  NumTestPoints  |  NumSidesPerRing  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |         Diff |   %Diff |  Status  |
|--------------|-----------------|-------------------|------------|-------------|------------|-------------|--------------|---------|----------|
|     F32      |      1000       |         4         |  37.351 us |       6.39% |  36.483 us |       6.33% |    -0.868 us |  -2.32% |   PASS   |
|     F32      |     100000      |         4         |  56.265 us |       1.63% |  55.468 us |       1.75% |    -0.797 us |  -1.42% |   PASS   |
|     F32      |    10000000     |         4         |   3.858 ms |       0.02% |   3.785 ms |       0.14% |   -72.843 us |  -1.89% |   FAIL   |
|     F32      |      1000       |        10         |  56.490 us |       2.04% |  55.120 us |       1.82% |    -1.370 us |  -2.43% |   FAIL   |
|     F32      |     100000      |        10         |  85.581 us |       1.08% |  81.772 us |       1.14% |    -3.810 us |  -4.45% |   FAIL   |
|     F32      |    10000000     |        10         |   6.327 ms |       0.11% |   5.987 ms |       0.06% |  -340.206 us |  -5.38% |   FAIL   |
|     F32      |      1000       |        100        | 314.513 us |       0.41% | 314.469 us |       0.30% |    -0.044 us |  -0.01% |   PASS   |
|     F32      |     100000      |        100        | 347.750 us |       0.49% | 343.747 us |       0.37% |    -4.003 us |  -1.15% |   FAIL   |
|     F32      |    10000000     |        100        |  29.822 ms |       0.07% |  29.318 ms |       0.06% |  -504.290 us |  -1.69% |   FAIL   |
|     F64      |      1000       |         4         |  49.142 us |       1.83% |  46.365 us |       2.23% |    -2.777 us |  -5.65% |   FAIL   |
|     F64      |     100000      |         4         |  78.167 us |       1.23% |  74.001 us |       1.42% |    -4.166 us |  -5.33% |   FAIL   |
|     F64      |    10000000     |         4         |   5.641 ms |       0.05% |   5.186 ms |       0.05% |  -454.642 us |  -8.06% |   FAIL   |
|     F64      |      1000       |        10         |  80.904 us |       1.13% |  76.647 us |       1.25% |    -4.257 us |  -5.26% |   FAIL   |
|     F64      |     100000      |        10         | 125.702 us |       0.65% | 117.104 us |       0.89% |    -8.598 us |  -6.84% |   FAIL   |
|     F64      |    10000000     |        10         |  10.074 ms |       0.03% |   9.214 ms |       0.05% |  -860.423 us |  -8.54% |   FAIL   |
|     F64      |      1000       |        100        | 487.959 us |       0.40% | 485.950 us |       0.37% |    -2.009 us |  -0.41% |   FAIL   |
|     F64      |     100000      |        100        | 528.070 us |       0.35% | 517.869 us |       0.25% |   -10.201 us |  -1.93% |   FAIL   |
|     F64      |    10000000     |        100        |  44.663 ms |       0.18% |  40.617 ms |       0.07% | -4045.961 us |  -9.06% |   FAIL   |

which gives visible speedups. This PR adopts the change based on this benchmark.

Minor improvement and follow ups

This PR also introduces benchmarking suites for point_in_polygon API.

Follow ups to this PR includes:

@isVoid isVoid requested review from a team as code owners July 8, 2022 23:59
@isVoid isVoid requested review from cwharris and zhangjianting July 8, 2022 23:59
@github-actions github-actions bot added cmake Related to CMake code or build configuration libcuspatial Relates to the cuSpatial C++ library labels Jul 8, 2022
@isVoid isVoid added 3 - Ready for Review Ready for review by team non-breaking Non-breaking change improvement Improvement / enhancement to an existing function cmake Related to CMake code or build configuration libcuspatial Relates to the cuSpatial C++ library and removed cmake Related to CMake code or build configuration libcuspatial Relates to the cuSpatial C++ library labels Jul 8, 2022

if (y_in_bounds && test_point.x < (run / rise) * rise_to_point + a.x) {
point_is_within = not point_is_within;
for (auto point_idx = ring_begin; point_idx < ring_end; point_idx++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to know how much of the speedup came just from eliminating the division vs. all the other changes. Because there is potential for more thread divergence in the new code.

Copy link
Contributor Author

@isVoid isVoid Jul 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will include the comparison data of division in the PR description. I will also include a comment stating the assumption we have here is that the cycles it costs on division is greater than that of warp syncs. It's apparently dependent on many factors such as whether user compiles the code with fast-math, or how fast the hardware synchronizes divergence.

That said, I don't think it's appropriate to mention benchmark results in the comments as it's hardware dependent and will get stale very quickly.

struct PointInPolygonUnsupportedChronoTypesTest : public BaseFixture {
};

TYPED_TEST_CASE(PointInPolygonUnsupportedChronoTypesTest, ChronoTypes);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't remember why we have unsupported types tests specifically for Chrono types but not for other non-numerical types (e.g. nested types). Is it necessary to test these at all? Every type we test increases compile time, so I wonder if we can just test one unsupported type as a proxy for all the others?

Don't necessarily need to decide this in the present PR!

Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice optimization!

@isVoid
Copy link
Contributor Author

isVoid commented Jul 27, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit f7d9687 into rapidsai:branch-22.08 Jul 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team cmake Related to CMake code or build configuration improvement Improvement / enhancement to an existing function libcuspatial Relates to the cuSpatial C++ library non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Refactor cuspatial::point-in-polygon() to header-only
3 participants