Porting NCC to GPU
Unlike, for example, PeakToCorrelationEnergy, NCC had not yet been ported to a GPU.
NCC consists of two parts: sumSquared and computeNCC. It was first run on a GPU using a single thread block. Next, multiple thread blocks were used, resulting in a data throughput close to half the theoretical maximum memory bandwidth.
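As a point of reference, the two parts can be sketched on the CPU as follows. This is a minimal sketch, not the project's code: it assumes NCC is the dot product of the two signals divided by the square root of the product of their sums of squares (no mean subtraction), and the Python names sum_squared and compute_ncc are hypothetical stand-ins for sumSquared and computeNCC.

```python
import math

def sum_squared(x):
    """Sum of squares of one signal: a plain reduction over x[i] * x[i]."""
    return sum(v * v for v in x)

def compute_ncc(x, y, sumsq_x, sumsq_y):
    """Reduce the element-wise products, then normalize by the two sums of squares."""
    if len(x) != len(y):
        raise ValueError("signals must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sumsq_x * sumsq_y)

# Hypothetical input: y is a scaled copy of x, so the two are perfectly correlated.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
ncc = compute_ncc(x, y, sum_squared(x), sum_squared(y))
```

Under this assumed definition, both parts are a single pass over the input with one multiplication and one addition per element, which is why memory bandwidth is a natural yardstick for the GPU version.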
We experimented with the setting LARGETB = 1024 (and hence threads = 1024 in compare/NormalizedCrossCorrelation.java). We tried LARGETB = 512 and 2048, but observed no significant speedup or slowdown. We also changed reducing_thread_blocks = num_sm to reducing_thread_blocks = 2 * num_sm and 4 * num_sm, but again this had no observable effect on the processing times.
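To make the two parameters concrete, the sketch below shows how a grid-stride loop would divide n elements over reducing_thread_blocks blocks of threads threads each. That the kernels use a grid-stride loop is an assumption, as are the input size, num_sm = 16 and the helper name elements_per_thread. Only the parameter choices (512 and 1024 threads, and num_sm, 2 * num_sm and 4 * num_sm blocks) are taken from the text.

```python
def elements_per_thread(n, blocks, threads):
    """Count how many elements each thread visits in a grid-stride loop."""
    total_threads = blocks * threads
    counts = [0] * total_threads
    for tid in range(total_threads):
        # Thread tid visits tid, tid + total_threads, tid + 2 * total_threads, ...
        counts[tid] = len(range(tid, n, total_threads))
    return counts

n = 1 << 20   # hypothetical input size
num_sm = 16   # hypothetical number of streaming multiprocessors

configs = [(num_sm, 512), (num_sm, 1024), (2 * num_sm, 1024), (4 * num_sm, 1024)]
for blocks, threads in configs:
    counts = elements_per_thread(n, blocks, threads)
    # The per-thread share changes, the total number of elements read does not.
    print(blocks, threads, max(counts), sum(counts))
```

The per-thread share shrinks as more threads are launched, but every configuration still reads each element exactly once. If the kernels are limited by memory bandwidth, that would be one plausible reason the timings did not move.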
Phases 1 and 2 of the reduction algorithms are indicated above. They apply to both sumSquared and computeNCC. We considered using warpReduce in phase 2, as suggested in the Optimizing Parallel Reduction in CUDA presentation. However, phase 2 takes much less time (~1%) than phase 1, so there is little point in optimizing it. Enhancements such as warpReduce also make the code much less readable, so we did not apply them.
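The two phases can be simulated on the CPU as below. This is a sketch of the general two-phase scheme, not the project's kernels: block_tree_reduce stands in for a shared-memory tree reduction within one thread block, two_phase_sum is a hypothetical name, and the sketch assumes that the thread count is a power of two and that there are no more blocks than threads per block.

```python
def block_tree_reduce(shared):
    """Tree reduction within one block; len(shared) must be a power of two."""
    values = list(shared)
    stride = len(values) // 2
    while stride > 0:
        # On a GPU, threads 0..stride-1 do this step in parallel, then synchronize.
        for t in range(stride):
            values[t] += values[t + stride]
        stride //= 2
    return values[0]

def two_phase_sum(data, blocks, threads):
    total_threads = blocks * threads
    # Phase 1: each thread accumulates a grid-stride slice of the input,
    # then each block reduces its threads' results to one partial sum.
    partials = []
    for b in range(blocks):
        shared = [sum(data[b * threads + t::total_threads]) for t in range(threads)]
        partials.append(block_tree_reduce(shared))
    # Phase 2: a single block reduces the per-block partial sums.
    padded = partials + [0] * (threads - len(partials))
    return block_tree_reduce(padded)

data = [i % 7 for i in range(10_000)]   # hypothetical integer input
result = two_phase_sum(data, blocks=8, threads=64)
```

In this sketch, phase 1 reads the whole input while phase 2 only touches one partial sum per block, which fits the observation that phase 2 accounts for a small share of the run time.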
On both a CPU and a GPU, computeNCC and sumSquared take about the same amount of time.