Explore GIL-releasing Cython code for performance improvements #111

Open · kylebarron opened this issue Mar 18, 2021 · 0 comments

Jeff mentioned something peculiar in the benchmarks.

notice at ~200 concurrent requests P99 goes through the roof, with this trend continuing to lower percentiles as you increase concurrency. But note how the number of tiles requested (request_count) stays the same, the median stays the same, etc.
this tells me something is causing a small but increasing number of requests to block as the event loop becomes saturated
and I think it’s because tiling isn’t “perfectly async”. You are often doing some IO-bound stuff, then some CPU-bound stuff, then some more IO-bound stuff, etc.

[Screenshot: benchmark plot showing P99 latency spiking at ~200 concurrent requests while request_count and the median stay flat]

I don't completely understand async, threading, etc., but one possibility for this slowdown is the GIL.

For example, Jeff was pointing to this piece of code:

def _calculate_image_tiles(
    self,
    bounds: Tuple[float, float, float, float],
    tile_width: int,
    tile_height: int,
    band_count: int,
    ovr_level: int,
    dtype: np.dtype,
) -> TileMetadata:
    """
    Internal method to calculate which image tiles need to be requested for a partial read. Also returns all of
    the required metadata about the image tiles to perform a partial read
    """
    geotransform = self.geotransform(ovr_level)
    invgt = ~geotransform
    # Project request bounds to pixel coordinates relative to geotransform of the overview
    tlx, tly = invgt * (bounds[0], bounds[3])
    brx, bry = invgt * (bounds[2], bounds[1])
    # Calculate tiles
    xmin = math.floor((tlx + 1e-6) / tile_width)
    xmax = math.floor((brx + 1e-6) / tile_width)
    ymax = math.floor((bry + 1e-6) / tile_height)
    ymin = math.floor((tly + 1e-6) / tile_height)
    tile_bounds = (
        xmin * tile_width,
        ymin * tile_height,
        (xmax + 1) * tile_width,
        (ymax + 1) * tile_height,
    )
    # Create geotransform for the fused image
    _tlx, _tly = geotransform * (tile_bounds[0], tile_bounds[1])
    fused_gt = affine.Affine(
        geotransform.a, geotransform.b, _tlx, geotransform.d, geotransform.e, _tly
    )
    inv_fused_gt = ~fused_gt
    xorigin, yorigin = [round(v) for v in inv_fused_gt * (bounds[0], bounds[3])]
    return TileMetadata(
        tlx=xorigin,
        tly=yorigin,
        width=round(brx - tlx),
        height=round(bry - tly),
        xmin=xmin,
        ymin=ymin,
        xmax=xmax,
        ymax=ymax,
        tile_width=tile_width,
        tile_height=tile_height,
        bands=band_count,
        dtype=dtype,
        ovr_level=ovr_level,
    )

It's not too complex, and timing it with %timeit in IPython shows it's pretty fast:

14.1 µs ± 221 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

However, if you're doing this 70,000 times in multithreaded code, that could still be an issue, because each thread has to hold the Python GIL while it runs this snippet: 70,000 calls × ~14 µs is roughly a second of work that gets serialized across all threads.
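To make the contention concrete, here is a minimal self-contained sketch (not code from this repo; tile_math is a made-up stand-in for the pure-Python arithmetic above). The same function run across eight threads takes about as long as running it eight times serially, because only one thread can hold the GIL at a time:

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor

def tile_math(n: int) -> int:
    # Stand-in for the pure-Python arithmetic in _calculate_image_tiles;
    # it never releases the GIL while it runs.
    total = 0
    for i in range(n):
        total += math.floor((i + 1e-6) / 512)
    return total

N, WORKERS = 500_000, 8

start = time.perf_counter()
for _ in range(WORKERS):
    tile_math(N)
serial = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    list(pool.map(tile_math, [N] * WORKERS))
threaded = time.perf_counter() - start

# Expect these to be roughly equal: the threads serialize on the GIL.
print(f"serial: {serial:.2f}s  threaded: {threaded:.2f}s")
```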

One benefit of Cython is that code that doesn't need to touch the Python interpreter can be marked as nogil, which allows that code to run in multiple threads concurrently. Pure-math code is a good candidate, because it can usually be written so that it never touches Python objects or calls pure-Python functions.

For example, consider this code: https://github.com/uber/h3-py/blob/c3593ecae5796526f9c6dc7dc140c7df2251a7f2/src/h3/_cy/unstable_vect.pyx#L22-L43. The nogil on the cpdef signature tells Cython the function can run without holding the GIL, and the with nogil block is what actually releases it. Ref uber/h3-py#166 (comment)
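As a rough sketch of that pattern (hypothetical code, not from this repo or from h3-py; the names _tile_range and calculate_tile_range and the signature are made up), the pure-math core becomes a cdef function marked nogil, and a thin Python-facing wrapper releases the GIL only around that call:

```cython
from libc.math cimport floor

cdef void _tile_range(double tlx, double tly, double brx, double bry,
                      int tile_width, int tile_height,
                      int* xmin, int* ymin, int* xmax, int* ymax) nogil:
    # Pure C arithmetic on C types: no Python objects, so no GIL required.
    xmin[0] = <int>floor((tlx + 1e-6) / tile_width)
    xmax[0] = <int>floor((brx + 1e-6) / tile_width)
    ymin[0] = <int>floor((tly + 1e-6) / tile_height)
    ymax[0] = <int>floor((bry + 1e-6) / tile_height)

def calculate_tile_range(double tlx, double tly, double brx, double bry,
                         int tile_width, int tile_height):
    cdef int xmin, ymin, xmax, ymax
    # The GIL is released only around the pure-C block; other threads can run
    # Python code (or their own nogil blocks) in the meantime.
    with nogil:
        _tile_range(tlx, tly, brx, bry, tile_width, tile_height,
                    &xmin, &ymin, &xmax, &ymax)
    return xmin, ymin, xmax, ymax
```

Whether releasing the GIL around something this cheap actually helps under load is exactly what the benchmark in the future-work item below would need to show.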

More broadly, one way to visualize which lines could run with the GIL released is Cython's “annotation mode”. From the tutorial:

Cython has a way to visualise where interaction with Python objects and Python’s C-API is taking place

[Screenshot: Cython annotation report, with lines that interact with Python highlighted in yellow]

If a line is white, it means that the code generated doesn’t interact with Python, so will run as fast as normal C code. The darker the yellow, the more Python interaction there is in that line. Those yellow lines will usually operate on Python objects, raise exceptions, or do other kinds of higher-level operations than what can easily be translated into simple and fast C code.

Lines that are completely white can be marked as OK to release the GIL.
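For reference, the annotation report can be generated a couple of ways (tiles.pyx is just a placeholder filename here): cython -a tiles.pyx writes an annotated tiles.html next to the generated C file, or it can be produced as part of the build:

```python
# setup.py sketch: annotate=True makes Cython write tiles.html next to the
# source, with the white/yellow highlighting described above.
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("tiles.pyx", annotate=True))
```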

Future work:

  • Set up a small experiment: convert a single function to Cython, rerun the existing benchmarks, and see whether there's any benefit before deciding whether it's worth converting a larger portion of the code. I don't know whether converting just one function will provide a noticeable improvement.