sjtechdev/fastgrouper

fastgrouper

Fast groupby-apply operations in Python.

Install

Users can install the package from PyPI via:

python -m pip install fastgrouper

Usage

Use the arr interface for applications that work with NumPy arrays.

import numpy as np
import fastgrouper.arr
  
def baz(x, y):
    return np.mean(x + y) - 3

# Sample arrays, to slice
xvals = np.array([1, 2, 10])
yvals = np.array([4, 5, 6])
  
# Group ids
gids  = np.array([1, -3, 1])

# Perform groupby-apply; note that keyword args are supported as well.
grpd = fastgrouper.arr.Grouped(gids)
result = grpd.apply(baz, xvals, y=yvals) # np.array([7.5, 4])

# The gids corresponding to the result above can be found via the `dedup_gids` attribute.
grpd.dedup_gids # np.array([ 1, -3])

# Users can also perform groupby-apply, and then expand results back to align with the original gids.
result = grpd.apply_expand(baz, xvals, yvals) # np.array([7.5, 4, 7.5])
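
For readers who want to see what the calls above compute, here is a slow reference sketch in plain NumPy. It is not fastgrouper's implementation; the helper name `groupby_apply_sketch` is hypothetical, and the first-appearance group ordering is an assumption inferred from the `dedup_gids` output shown above.

```python
import numpy as np

def groupby_apply_sketch(gids, func, *args, **kwargs):
    """Slow reference groupby-apply (hypothetical helper, not part of fastgrouper)."""
    # Unique ids (sorted), index of each id's first occurrence, and a group code per element
    uniq, first_idx, inverse = np.unique(gids, return_index=True, return_inverse=True)
    # Visit groups in order of first appearance (assumed to mirror `dedup_gids`)
    order = np.argsort(first_idx)
    dedup_gids = uniq[order]
    results = []
    for code in order:
        mask = inverse == code  # boolean mask keeps the original element order
        sliced_args = [a[mask] for a in args]
        sliced_kwargs = {k: v[mask] for k, v in kwargs.items()}
        results.append(func(*sliced_args, **sliced_kwargs))
    results = np.array(results)
    # Map every element back to the position of its group's result
    rank = np.empty_like(order)
    rank[order] = np.arange(len(order))
    expanded = results[rank[inverse]]
    return dedup_gids, results, expanded

def baz(x, y):
    return np.mean(x + y) - 3

dedup_gids, result, expanded = groupby_apply_sketch(
    np.array([1, -3, 1]), baz, np.array([1, 2, 10]), y=np.array([4, 5, 6])
)
# dedup_gids -> [1, -3], result -> [7.5, 4.0], expanded -> [7.5, 4.0, 7.5]
```

The Python-level loop over boolean masks is what makes this sketch slow on large inputs; it is only meant to pin down the expected outputs.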

The li interface returns the results over the groups as a list (instead of an array); this may be useful for functions that return different-sized results. Note that in all interfaces (e.g. both arr and li), the order of the group elements is preserved when the group slices are passed to the function being applied.

import numpy as np
import fastgrouper.li
  
def bop(x):
    return list(x)

# Sample arrays, to slice
xvals = np.array([2, 3, 4])
  
# Group ids
gids  = np.array([10, -20, 10])

grpd = fastgrouper.li.Grouped(gids)
grpd.apply(bop, xvals) # [[2, 4], [3]]
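
As a cross-check on the list output and the order-preservation note, the same grouping can be reproduced with a plain dictionary. This is only an illustrative sketch: `grouped_lists_sketch` is a hypothetical helper, not a fastgrouper function, and it assumes groups are reported in order of first appearance, as in the example above.

```python
import numpy as np

def grouped_lists_sketch(gids, values):
    """Collect values per group id, keeping element order (hypothetical helper)."""
    groups = {}
    # dicts keep insertion order, so groups appear in first-seen order
    for gid, val in zip(gids.tolist(), values.tolist()):
        groups.setdefault(gid, []).append(val)
    return list(groups.keys()), list(groups.values())

group_ids, group_lists = grouped_lists_sketch(
    np.array([10, -20, 10]), np.array([2, 3, 4])
)
# group_ids -> [10, -20], group_lists -> [[2, 4], [3]]
```

Because the groups have different lengths (two elements and one element), the result is naturally a list of lists rather than a rectangular array.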

For additional examples, check out the tests.

Benchmarks

Check out the benchmarks here for a sample comparison between the pandas groupby-apply and fastgrouper groupby-apply workflows. While it is difficult to compare the two perfectly, I tried to make the comparison as fair as possible.

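Results from running the benchmarks on a sample machine with an Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz (values in parentheses are ratios relative to the fastest test; by mean time, the pandas runs took roughly 5x to 7x as long as the corresponding fastgrouper runs):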

---------------------------------------------------------------------------------------------- benchmark: 4 tests ---------------------------------------------------------------------------------------------
Name (time in ms)                                  Min                Max               Mean            StdDev             Median               IQR            Outliers       OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_fastgrouper_arr_slice_apply_benchmark      5.8296 (1.0)       6.6786 (1.0)       6.0080 (1.0)      0.1165 (1.0)       6.0071 (1.0)      0.1294 (1.0)          35;5  166.4435 (1.0)         147           1
test_fastgrouper_all_steps_benchmark            7.7704 (1.33)     10.3270 (1.55)      8.0946 (1.35)     0.3171 (2.72)      8.0872 (1.35)     0.2511 (1.94)          6;2  123.5386 (0.74)        121           1
test_pure_pandas_slice_apply_benchmark         42.4697 (7.29)     46.9096 (7.02)     43.0534 (7.17)     0.9816 (8.42)     42.6915 (7.11)     0.4361 (3.37)          2;3   23.2270 (0.14)         22           1
test_pure_pandas_all_steps_benchmark           43.4275 (7.45)     45.2340 (6.77)     43.8837 (7.30)     0.4243 (3.64)     43.7748 (7.29)     0.4973 (3.84)          3;1   22.7875 (0.14)         23           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------