feat(wip): cupy histograms #1095

lgray · 2024-05-15T14:57:58Z

work in progress on using cupy in the old coffea-hist package as a demonstrator

martindurant · 2024-05-31T15:03:58Z

(cf https://docs.cupy.dev/en/stable/reference/generated/cupy.histogramdd.html , which is the standard regular-array API conforming to the numpy version - no boost or anything)

lgray · 2024-06-03T15:16:05Z

@martindurant - cool, it appears to use the same techniques (fill by atomics) so it'll be subject to the same scaling limitations I'm seeing. However, the number of calls to fill is a bit more lean so maybe it's worth backing an implementation with it. I'll have to try some benchmarks.

Otherwise - there's significant functionality missing from the cupy hists that we'll still have to add on top, if it turns out to run faster in the first place.

martindurant · 2024-06-03T15:20:16Z

Otherwise - there's significant functionality missing from the cupy hists that we'll still have to add on top, if it turns out to run faster in the first place.

Yes, I expect this is the case. dask-histogram also uses boost and adds a numpy- compatible API on top for those that expect it; and of course, it's designed to work with awkward. I expect there's an amount of sharing and refactoring that can eventually be done.

lgray · 2024-06-03T15:24:06Z

Yeah - I think it is all possible. Right now is really getting all the pipe-fittings in place. I'll let you know if there's any clear win in the benchmarks.

lgray · 2024-06-03T15:30:07Z

@Saransh-cpp: @jpivarski and I talked last Friday and it came up that you might be interested in taking this "pilot project" and turning it into a full-blow UHI compatible histogramming interface (a la scikit-hep/hist), but for cupy/cuda histograms.

What's in this PR has the necessary functionality for HEP and we can convert to scikit-hep/hist, but it would be nice to have a cohesive ecosystem and only convert to CPU memory-space at the last moment. This would grant us more or less infinite scaling.

We also has some ideas towards warpwise-distributed histograms where a (collection) of warps would tend to a sub-range of bins so that filling can be done more in parallel. This old implementation description demonstrates that if you stick to a warp (i.e. 32 bins) and replicate histograms to do filling parallel you can reach 10GB/s filling rates, because there's no use of atomics.

This also has interesting parallels to cluster-distributed histograms where a (relatively enormous) histogram could be distributed across a whole dask cluster and achieve scaling to 100s of GBs in size or more. This would effectively remove scaling limitations for histograms for the foreseeable future and is probably important for achieving precision HL-LHC analyses.

Anyway - please let us know if you are interested in turning this into a more mature package and possibly adding features to it! We're happy to answer any questions you may have.

Saransh-cpp · 2024-06-07T11:56:46Z

HI @lgray, apologies for the late reply. I have submitted a proposal for the cuda-histogram project, and my contract here has been extended for two months (15th June - 15th August). I will start working on this project from the 15th of June. I will fly to India on the 18th of June to get my CUDA laptop and return to CERN on the 1st of July. I will be working remotely for this period.

I still need to go through this PR in detail, but I will pick it up very soon and discuss the work in awkward-dask meetings. Thank you again for the ping! :)

Saransh-cpp · 2024-07-23T14:31:07Z

Hi @lgray @jpivarski, I am experimenting with inheriting hist axis to make the axis classes in this PR match the existing APIs. Could you please let me know if the axis methods and properties are supposed to return a cupy array or is a numpy array fine. For instance, right now the axis class Bin does include methods that return a cupy array. Once I inherit the hist axis class, the methods in Bin override the inherited hist.axis.Regular methods and return a cupy array, but the ones that are not present in Bin simply return a numpy array. Should I update Bin to have as many methods as hist.axis.Regular has and make them all return a cupy array, or should I eliminate the methods in Bin and just depend on the inherited (numpy array returning) methods -

In [1]: import cuda_histogram

In [2]: cuda_histogram.hist_tools.Regular(10, 0, 1)  # the repr is inherited, works!
Out[2]: Regular(10, 0, 1)

In [3]: cuda_histogram.hist_tools.Regular(10, 0, 1).extent  # inherited property, works!
Out[3]: 12

In [4]: type(cuda_histogram.hist_tools.Regular(10, 0, 1).edges)  # overriden by `Bin` to return a cupy array
Out[4]: cupy.ndarray

In [5]: type(cuda_histogram.hist_tools.Regular(10, 0, 1).widths)  # not overriden, inherited, thus a numpy array
Out[5]: numpy.ndarray

The class inheritance looks like this right now -

class Bin(hist.axis.Regular, family="cuda_histogram"):  # has common methods for Regular and Variable axis but I might split it
    ...


class Regular(Bin, family="cuda_histogram"):
    ...

lgray · 2024-07-23T22:13:41Z

@Saransh-cpp I've been mulling over this for a while, and I think that really only the filling should be done on the GPU, from the user's perspective.

I would say that for any other user facing thing like axes and such it should produce np arrays, since axes and other arrays tend to be very small and shipping them to/from GPU is inefficient. fill should be the only thing that accepts cupy arrays and should complain if receiving anything else.

What do you think?

jpivarski · 2024-07-24T01:38:03Z

If the histogram bin edges/descriptions are immutable, you could have a copy of them on both CPU and GPU. Then they'd be available on the GPU for the filling operations, and available on the CPU for user queries. The only thing that would need to be transferred is the bin data. It can be a new cudaMalloc on the GPU at the start of a fill, then copy to the CPU and added to a running total at the end of a fill. Since fill is a high level function called from Python, the overhead of 1 cudaMalloc and 1 cudaMemcpy per fill would be hidden under all the Python overhead. Outside of a fill, the source of truth for histogram data would all be on the CPU.

lgray · 2024-07-24T01:42:15Z

@jpivarski that's just about what I'm thinking, but fill shouldn't trigger an xfer to CPU (as then on multiple fills you xfer to CPU each fill which syncronizes the device). So it should xfer to CPU on any non-fill access of the bin/weight contents, IMO.

I think this is possible as well?

jpivarski · 2024-07-24T02:03:01Z

That would work. Although if fill is a Python function, the overhead is already high enough that a transfer from GPU to CPU would matter all that much.

lgray · 2024-07-24T02:07:56Z

Multiple fills being able to proceed in parallel is better than sync'ing on each call, I'd say.

Saransh-cpp · 2024-07-24T14:29:16Z

Thanks for the detailed discussion, @lgray @jpivarski! I have been working on the API and right now it looks like this -

In [1]: import cuda_histogram; import numpy as np

In [2]: a = cuda_histogram.axis.Regular(10, 0, 1, name="r")

In [3]: c = cuda_histogram.Hist(a)

In [4]: a, c
Out[4]:
(Regular(10, 0, 1, name='r', label='Axis 0'),
 Hist(Regular(10, 0, 1), storage=Double()))

In [5]: c.fill(r=np.random.normal(size=1_000_000))  # the signature of fill needs to be changed to match other libraries

In [6]: c.values()
Out[6]: array([39441., 38312., 37423., 35879., 34110., 32467., 30292.])

In [7]: type(c.values())
Out[7]: cupy.ndarray

In [8]: type(a.edges)
Out[8]: numpy.ndarray

Only the fill method executes on GPU with this design. Could you please let me know if this is the right direction or should I change something in this example?

Saransh-cpp · 2024-08-14T17:05:27Z

Hi @lgray, the flow behavior in this PR works like this right now -

In [1]: import cuda_histogram; import numpy as np; import hist

In [2]: a = cuda_histogram.axis.Bin("n", "l", 10, 0, 1)

In [3]: c = cuda_histogram.Hist("c", a)

In [4]: c.fill(n=np.random.normal(size=1_000_000))

In [5]: c.values()
Out[5]:
{(): array([59481., 62261., 63944., 65914., 67848., 69844., 70480., 71625.,
        72873., 73828.])}

In [6]: c.values("over")
Out[6]:
{(): (array([59481., 62261., 63944., 65914., 67848., 69844., 70480., 71625.,
         72873., 73828.]),
  array([59481., 62261., 63944., 65914., 67848., 69844., 70480., 71625.,
         72873., 73828.]))}

I am not sure if this is the intended behavior. The flow behaviors in other histogramming libraries for a 1D histogram just adds overflow and underflow values to the end of the array instead of duplicating the entire array. Could you please let me know if this is the intended behavior?

lgray · 2024-08-14T18:25:03Z

Yes, coffea histograms also have "NaN-flow" which is a special overflow bin for NaNs, so that counts and weights don't get affected by floating point problems. I personally like this feature, but none of the other histogram libraries care to implement it.

Insofar as the filling workflow you've shown, does it work without copies using GPU-only arrays?

Saransh-cpp · 2024-08-16T14:50:00Z

Yes! The fill method uses the exact logic in this PR and makes it compatible with other histogramming libraries -

Here is a more detailed usage example - https://github.com/Saransh-cpp/cuda-histogram?tab=readme-ov-file#usage

I have updated the repository with documentation, a few tests, and CI. The API is now very similar to boost-histogram's API. The Categorical axis is the only major thing missing (or left for refactoring).

I have requested to transfer the repository to Scikit-HEP and will create a v0.1.0 release once it is transferred.

…throughput on multi-dim histograms

lgray force-pushed the jitters branch from 89e25b8 to 238156e Compare May 15, 2024 14:58

lgray force-pushed the jitters branch 3 times, most recently from 0668e63 to a0aa63a Compare May 28, 2024 13:27

lgray mentioned this pull request Jun 3, 2024

chore: trying atomics and tree reduction for CUDA reducer kernels scikit-hep/awkward#3123

Merged

21 tasks

lgray force-pushed the jitters branch 4 times, most recently from 35285fa to 8100467 Compare August 8, 2024 21:51

lgray added 5 commits September 5, 2024 13:28

port in hist from old coffea to jitters area

bcd2f74

ressurect tests

5c5e46d

ressurect main functionality tests, bits of cruft yet to fix, decent …

2f56762

…throughput on multi-dim histograms

use precompiled package for cupy

ee0ccc6

install cuda-toolkit with actions

ec7bdf9

lgray added 6 commits September 5, 2024 13:28

cuda install is too big

7f351af

cannot test gpu code in matrix as is

a31afcc

skip hist plotting tests if no cupy

74831fc

all tests work except test_hist_plot::test_plotgrid

a16c81a

add numba-cuda as gpu dependency

73855e1

lint

cb7e494

lgray force-pushed the jitters branch from 8100467 to cb7e494 Compare September 5, 2024 18:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(wip): cupy histograms #1095

feat(wip): cupy histograms #1095

lgray commented May 15, 2024

martindurant commented May 31, 2024

lgray commented Jun 3, 2024

martindurant commented Jun 3, 2024

lgray commented Jun 3, 2024

lgray commented Jun 3, 2024 •

edited

Loading

Saransh-cpp commented Jun 7, 2024

Saransh-cpp commented Jul 23, 2024

lgray commented Jul 23, 2024

jpivarski commented Jul 24, 2024

lgray commented Jul 24, 2024

jpivarski commented Jul 24, 2024

lgray commented Jul 24, 2024

Saransh-cpp commented Jul 24, 2024

Saransh-cpp commented Aug 14, 2024

lgray commented Aug 14, 2024

Saransh-cpp commented Aug 16, 2024

feat(wip): cupy histograms #1095

Are you sure you want to change the base?

feat(wip): cupy histograms #1095

Conversation

lgray commented May 15, 2024

martindurant commented May 31, 2024

lgray commented Jun 3, 2024

martindurant commented Jun 3, 2024

lgray commented Jun 3, 2024

lgray commented Jun 3, 2024 • edited Loading

Saransh-cpp commented Jun 7, 2024

Saransh-cpp commented Jul 23, 2024

lgray commented Jul 23, 2024

jpivarski commented Jul 24, 2024

lgray commented Jul 24, 2024

jpivarski commented Jul 24, 2024

lgray commented Jul 24, 2024

Saransh-cpp commented Jul 24, 2024

Saransh-cpp commented Aug 14, 2024

lgray commented Aug 14, 2024

Saransh-cpp commented Aug 16, 2024

lgray commented Jun 3, 2024 •

edited

Loading