GPU - integrate multiple frames with one kernel call ? #2457

jonwright · 2025-02-24T14:43:40Z

jonwright
Feb 24, 2025
Collaborator

I had a quick look at cuSPARSE via cupy. At first sight, it looks quite promising:

import cupyx.scipy.sparse
csr_cuda = cupyx.scipy.sparse.csr_matrix(  csr_matrix_from_pyFAI )
sum_signal = csr_cuda.dot( data_image )

The first run is slow (compilation). The timing didn't seem to depend on csr vs csc. After warm-up it was a bit slower than pyFAI for single frames, but it appears to be quicker for a stack of 32 frames.

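import numpy as np
import cupy
frames = np.zeros( ( 32, 2162, 2068 ), dtype=np.float32 )  # placeholder stack of 32 frames
frames.shape = 32, -1  # flatten each frame into one row
sum_signals = csr_cuda.dot( cupy.asarray( frames.T ) )  # one sparse-dense product for the whole stack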

Assuming I haven't made a mistake, it drops from 600 µs per frame to about 32 µs on an L40S. This is a lot of kHz. It would mean batching the transfer and decompression as well.
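
For anyone wanting to check the batching idea without a GPU, here is a minimal CPU sketch using scipy.sparse in place of cupyx.scipy.sparse. The matrix is a made-up stand-in (each pixel assigned to one random bin), not the CSR matrix that pyFAI produces, and the sizes are toy values. It only shows that one sparse-dense product over a stack gives the same sums as a loop of per-frame products; it says nothing about GPU timings.

```python
import numpy as np
import scipy.sparse

rng = np.random.default_rng(0)
n_bins, n_pixels, n_frames = 50, 1000, 32  # toy sizes, not a real detector

# Hypothetical stand-in for the pyFAI CSR matrix: every pixel contributes
# with weight 1 to exactly one bin.
bin_of_pixel = rng.integers(0, n_bins, size=n_pixels)
csr = scipy.sparse.csr_matrix(
    (np.ones(n_pixels, dtype=np.float32), (bin_of_pixel, np.arange(n_pixels))),
    shape=(n_bins, n_pixels),
)

# Stack of flattened frames, one frame per row.
frames = rng.random((n_frames, n_pixels), dtype=np.float32)

# One frame at a time: one sparse matrix-vector product per frame.
per_frame = np.stack([csr.dot(frame) for frame in frames])

# Batched: a single sparse-dense matrix product over all frames.
batched = csr.dot(frames.T).T  # shape (n_frames, n_bins)

print(batched.shape, np.allclose(per_frame, batched, rtol=1e-4))
```

The same pattern should carry over to the GPU by swapping in the cupy equivalents as in the snippet above, but that part is not tested here.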

What do you think? Did you try making batched integrations on a GPU before?

kif · 2025-02-24T16:05:55Z

kif
Feb 24, 2025
Maintainer

Hi Jon,
Indeed, azimuthal integration on the GPU is much faster than the transfer ... Here are my results on an RTX A5000 on PCIe v4 x16:

OpenCL kernel profiling statistics in milliseconds for: OCL_CSR_Integrator
                                       Kernel name (count):      min   median      max     mean      std
                                   copy H->D image (  811):    1.440    1.450    1.586    1.454    0.011
                                         memset_ng (  811):    0.002    0.009    0.023    0.009    0.002
                                     corrections4a (  811):    0.600    0.610    0.831    0.625    0.043
                                    csr_integrate4 (  811):    1.338    1.360    2.199    1.385    0.088
                                  copy D->H avgint (  811):    0.001    0.001    0.002    0.001    0.000
                                     copy D->H std (  811):    0.001    0.001    0.002    0.001    0.000
                                     copy D->H sem (  811):    0.001    0.001    0.002    0.001    0.000
                                 copy D->H merged8 (  811):    0.003    0.003    0.004    0.003    0.000
________________________________________________________________________________
                       Total OpenCL execution time        : 2822.780ms

Making the azimuthal integration much faster (and it is possible with batching) would gain only a factor of 2 in actual speed, according to Amdahl's law.
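
The Amdahl's law estimate can be reproduced from the mean column of the profiling table. This is a small sketch: the timings are copied from the table, while the function name `overall_speedup` and the 20x factor are only illustrative choices.

```python
# Mean time per call in ms, copied from the profiling table above.
mean_ms = {
    "copy H->D image": 1.454,
    "memset_ng": 0.009,
    "corrections4a": 0.625,
    "csr_integrate4": 1.385,
    "copy D->H avgint": 0.001,
    "copy D->H std": 0.001,
    "copy D->H sem": 0.001,
    "copy D->H merged8": 0.003,
}


def overall_speedup(timings, kernel, factor):
    """Amdahl's law: total speedup when only `kernel` runs `factor` times faster."""
    total = sum(timings.values())
    accelerated = total - timings[kernel] + timings[kernel] / factor
    return total / accelerated


total_ms = sum(mean_ms.values())
speedup_20x = overall_speedup(mean_ms, "csr_integrate4", 20.0)
speedup_limit = overall_speedup(mean_ms, "csr_integrate4", float("inf"))

print(f"total per frame: {total_ms:.3f} ms")
print(f"integration 20x faster: {speedup_20x:.2f}x overall")
print(f"integration infinitely fast: {speedup_limit:.2f}x overall")
```

With these numbers the host-to-device copy dominates what is left: even an infinitely fast csr_integrate4 gives an overall gain of roughly 1.7x, the same order as the factor of 2 quoted above.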

I noticed that the GPU performance can be used for other applications, like outlier removal.

Did you submit an ESRF project on this topic? Who else (i.e. non-ESRF) is missing this feature?

Jerome

1 reply

jonwright Feb 27, 2025
Collaborator Author

Hi Jerome

There is a TDR to be written for ID11, so any projects are to be defined. The future is notoriously hard to predict. A bunch of things I have in mind:

pyFAI is the key code to process powder diffraction data from 2D detectors, and it is mature
more efficient codes offer new scientific opportunities, as well as the energy/cost savings (20x can be transformative)
integrated data could be used for things like: peak fitting, phase id, tomo reconstructions, training an AI, etc
this batching question is an example of trying to "plug-in" pyFAI to some other package/framework (cuSPARSE above)
pyFAI is using OpenCL for GPU processing
new GPU programming options keep appearing (DLPack -> JAX, CuPy, PyTorch, Numba, CUDA, etc)

In thermodynamics, there is "entropy of mixing" which is always positive in chemistry. Seems this idea applies to features inside computer programs too. If it is easy to insert pyFAI into some other framework, then the batching would happen in someone else's code and not inside pyFAI.

So, perhaps a "small scale" project/question: how hard is it to port the pure python "pyFAI.engine" code to be fed into other "GPU backend" frameworks? Doing 'import cupy as numpy' still leaves a lot of testing (and probably debugging). Adding every new language/framework to pyFAI seems like it will be hard to maintain, but maybe your engine framework is already designed for this?
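
To make the question concrete, here is a toy sketch of what "backend-agnostic" code could look like: the array module is passed in as `xp`, so the same function body can run with numpy or, in principle, with cupy. The function `integrate_1d` is hypothetical and is not part of pyFAI.engine; it is a plain histogram mean with no corrections, pixel splitting or error propagation.

```python
import numpy as np


def integrate_1d(xp, bin_index, signal, n_bins):
    """Mean signal per bin, written against a NumPy-like module `xp`."""
    sum_signal = xp.bincount(bin_index, weights=signal, minlength=n_bins)
    count = xp.bincount(bin_index, minlength=n_bins)
    # Empty bins have count 0 and sum 0: divide by 1 there so they come out as 0.
    return sum_signal / xp.maximum(count, 1)


# Four pixels falling into bins 0, 0, 1 and 3 (bin 2 stays empty).
bin_index = np.array([0, 0, 1, 3])
signal = np.array([1.0, 3.0, 5.0, 7.0])

result = integrate_1d(np, bin_index, signal, n_bins=4)
print(result)
```

Calling it with `xp=cupy` and cupy arrays should work for this particular function, since cupy provides `bincount` and `maximum`, but, as noted above, each backend would still need its own testing.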

In terms of timescales, the TDR is not due until September.

Thanks!

Jon

kif · 2025-02-24T16:08:25Z

kif
Feb 24, 2025
Maintainer

I converted this issue to a discussion, which allows polls to assess how relevant this idea is for users outside Jon's beamline.
Dear folks, please vote if you find the idea relevant for you too.
