Explore GIL-releasing Cython code for performance improvements #111

Open · kylebarron opened this issue Mar 18, 2021 · 0 comments

Jeff mentioned something peculiar in the benchmarks.

notice at ~200 concurrent requests P99 goes through the roof, with this trend continuing to lower percentiles as you increase concurrency. But note how the number of tiles requested (request_count) stays the same, the median stays the same, etc.
this tells me something is causing a small but increasing number of requests to block as the event loop becomes saturated
and I think it’s because tiling isn’t “perfectly async”. You are often doing some IO-bound stuff, then some CPU-bound stuff, then some more IO-bound stuff, etc.

[Screenshot: benchmark plot showing P99 latency spiking at ~200 concurrent requests while request_count and the median stay flat]

I don't completely understand async, threading, etc., but one possibility for this slowdown is the GIL.

For example, Jeff was pointing to this piece of code:

def _calculate_image_tiles(
    self,
    bounds: Tuple[float, float, float, float],
    tile_width: int,
    tile_height: int,
    band_count: int,
    ovr_level: int,
    dtype: np.dtype,
) -> TileMetadata:
    """
    Internal method to calculate which image tiles need to be requested for a partial read. Also returns all of
    the required metadata about the image tiles to perform a partial read
    """
    geotransform = self.geotransform(ovr_level)
    invgt = ~geotransform
    # Project request bounds to pixel coordinates relative to geotransform of the overview
    tlx, tly = invgt * (bounds[0], bounds[3])
    brx, bry = invgt * (bounds[2], bounds[1])
    # Calculate tiles
    xmin = math.floor((tlx + 1e-6) / tile_width)
    xmax = math.floor((brx + 1e-6) / tile_width)
    ymax = math.floor((bry + 1e-6) / tile_height)
    ymin = math.floor((tly + 1e-6) / tile_height)
    tile_bounds = (
        xmin * tile_width,
        ymin * tile_height,
        (xmax + 1) * tile_width,
        (ymax + 1) * tile_height,
    )
    # Create geotransform for the fused image
    _tlx, _tly = geotransform * (tile_bounds[0], tile_bounds[1])
    fused_gt = affine.Affine(
        geotransform.a, geotransform.b, _tlx, geotransform.d, geotransform.e, _tly
    )
    inv_fused_gt = ~fused_gt
    xorigin, yorigin = [round(v) for v in inv_fused_gt * (bounds[0], bounds[3])]
    return TileMetadata(
        tlx=xorigin,
        tly=yorigin,
        width=round(brx - tlx),
        height=round(bry - tly),
        xmin=xmin,
        ymin=ymin,
        xmax=xmax,
        ymax=ymax,
        tile_width=tile_width,
        tile_height=tile_height,
        bands=band_count,
        dtype=dtype,
        ovr_level=ovr_level,
    )

It's not too complex, and timing it with %timeit in IPython shows it's pretty fast:

14.1 µs ± 221 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

However, if you're doing this 70,000 times in multithreaded code, that could still be an issue, because each thread has to hold the Python GIL while it runs this snippet: 70,000 calls × ~14 µs is roughly a second of work that gets serialized across all threads.
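To make the contention concrete, here is a minimal self-contained sketch (not code from this repo; tile_math is a made-up stand-in for the pure-Python arithmetic above). The same function run across eight threads takes about as long as running it eight times serially, because only one thread can hold the GIL at a time:

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor

def tile_math(n: int) -> int:
    # Stand-in for the pure-Python arithmetic in _calculate_image_tiles;
    # it never releases the GIL while it runs.
    total = 0
    for i in range(n):
        total += math.floor((i + 1e-6) / 512)
    return total

N, WORKERS = 500_000, 8

start = time.perf_counter()
for _ in range(WORKERS):
    tile_math(N)
serial = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    list(pool.map(tile_math, [N] * WORKERS))
threaded = time.perf_counter() - start

# Expect these to be roughly equal: the threads serialize on the GIL.
print(f"serial: {serial:.2f}s  threaded: {threaded:.2f}s")
```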

One benefit of Cython is that code that doesn't need to touch the Python interpreter can be marked as nogil, which allows that code to run in multiple threads concurrently. Pure-math code is a good candidate, because it can usually be written so that it never touches Python objects or calls pure-Python functions.

For example, consider this code: https://github.com/uber/h3-py/blob/c3593ecae5796526f9c6dc7dc140c7df2251a7f2/src/h3/_cy/unstable_vect.pyx#L22-L43. The nogil on the cpdef signature tells Cython the function can run without holding the GIL, and the with nogil block is what actually releases it. Ref uber/h3-py#166 (comment)
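As a rough sketch of that pattern (hypothetical code, not from this repo or from h3-py; the names _tile_range and calculate_tile_range and the signature are made up), the pure-math core becomes a cdef function marked nogil, and a thin Python-facing wrapper releases the GIL only around that call:

```cython
from libc.math cimport floor

cdef void _tile_range(double tlx, double tly, double brx, double bry,
                      int tile_width, int tile_height,
                      int* xmin, int* ymin, int* xmax, int* ymax) nogil:
    # Pure C arithmetic on C types: no Python objects, so no GIL required.
    xmin[0] = <int>floor((tlx + 1e-6) / tile_width)
    xmax[0] = <int>floor((brx + 1e-6) / tile_width)
    ymin[0] = <int>floor((tly + 1e-6) / tile_height)
    ymax[0] = <int>floor((bry + 1e-6) / tile_height)

def calculate_tile_range(double tlx, double tly, double brx, double bry,
                         int tile_width, int tile_height):
    cdef int xmin, ymin, xmax, ymax
    # The GIL is released only around the pure-C block; other threads can run
    # Python code (or their own nogil blocks) in the meantime.
    with nogil:
        _tile_range(tlx, tly, brx, bry, tile_width, tile_height,
                    &xmin, &ymin, &xmax, &ymax)
    return xmin, ymin, xmax, ymax
```

Whether releasing the GIL around something this cheap actually helps under load is exactly what the benchmark in the future-work item below would need to show.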

More broadly, one way to visualize which lines could run with the GIL released is Cython's “annotation mode”. From the tutorial:

Cython has a way to visualise where interaction with Python objects and Python’s C-API is taking place

[Screenshot: Cython annotation report, with lines that interact with Python highlighted in yellow]

If a line is white, it means that the code generated doesn’t interact with Python, so will run as fast as normal C code. The darker the yellow, the more Python interaction there is in that line. Those yellow lines will usually operate on Python objects, raise exceptions, or do other kinds of higher-level operations than what can easily be translated into simple and fast C code.

Lines that are completely white can be marked as OK to release the GIL.
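For reference, the annotation report can be generated a couple of ways (tiles.pyx is just a placeholder filename here): cython -a tiles.pyx writes an annotated tiles.html next to the generated C file, or it can be produced as part of the build:

```python
# setup.py sketch: annotate=True makes Cython write tiles.html next to the
# source, with the white/yellow highlighting described above.
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("tiles.pyx", annotate=True))
```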

Future work:

  • Set up a small experiment: convert a single function to Cython, rerun the existing benchmarks, and see whether there's any benefit before deciding whether it's worth converting a larger portion of the code. I don't know whether converting just one function will provide a noticeable improvement.