@page docgpudetail GPU technical details and performance tips
This page outlines some more technical details of how the OHM library uses the GPU.
The OHM library has been specifically authored to support GPU execution of voxel occupancy algorithms using GPU threads, rather than focusing on multi-threaded CPU algorithms to update such voxel maps. While GPUs excel at solving some "embarrassingly parallel" problems, the occupancy map update does not fall into this category and the GPU adaptation of the occupancy algorithms does not represent an ideal GPU problem. However, these GPU algorithms have been demonstrated to perform significantly better than single threaded CPU algorithms.
There are several things to be aware of with respect to how the GPU algorithms operate and how the results may differ from the single threaded CPU implementations.
- GPU memory contention
- Non-determinism
- Single precision operations
- Syncing between GPU and CPU memory
The OHM GPU algorithm is specifically written to allow memory contention when updating voxels. This contention is resolved by using atomic operations, primarily compare-and-swap (CAS) and atomic increment. The general design pattern for using CAS is to attempt to update voxel memory and note whether the update was successful. Unsuccessful update attempts are retried until they succeed, up to a hard-coded limit to prevent excessive waiting.
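The retry pattern can be illustrated with a CPU-side sketch using `std::atomic`. This is purely illustrative: the actual GPU kernels use the OpenCL/CUDA atomic compare-and-swap intrinsics, and the names and retry limit below are assumptions rather than OHM code.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// Illustrative CAS retry loop. The occupancy value is treated as a 32-bit bit
// pattern so the compare-and-swap can operate on a float value.
bool updateVoxelOccupancy(std::atomic<uint32_t> &voxel_bits, float hit_adjustment)
{
  const int kMaxRetries = 20;  // hard limit to prevent excessive waiting (illustrative value)
  for (int attempt = 0; attempt < kMaxRetries; ++attempt)
  {
    uint32_t expected = voxel_bits.load();
    float current_value;
    std::memcpy(&current_value, &expected, sizeof(current_value));
    const float new_value = current_value + hit_adjustment;
    uint32_t desired;
    std::memcpy(&desired, &new_value, sizeof(desired));
    // The CAS succeeds only if no other thread has modified the voxel since the load above.
    if (voxel_bits.compare_exchange_weak(expected, desired))
    {
      return true;
    }
    // Another thread won the race: retry against the refreshed value.
  }
  return false;  // give up after the retry limit
}
```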
Such a retry loop is a less than ideal use of GPU code. There is in fact an inherent increase in the likelihood of CAS failures on GPU because all threads in a GPU warp (CUDA terminology) or work group (OpenCL terminology) execute the same instruction at the same time. Thus CAS failures are practically guaranteed in cases where multiple threads target the same voxel.
One technique used to mitigate contention is to trace rays in reverse, from the sample point back to the sensor (this is enabled by default). Consider a case where multiple rays of varying lengths originate from the same sensor location. The dense ray packing near the sensor, coupled with the discrete nature of voxels, increases contention when forward tracing from the sensor location. Reverse tracing mitigates this because the variation in ray lengths means those same voxels are reached by different threads at different times.
It is also worth noting that updating via CAS requires writing directly to GPU global memory, so we cannot gain the benefits of faster GPU local memory.
Once a workload has been queued, it is up to the GPU driver to manage execution. The driver may submit work items/warps as it sees fit. This, coupled with the non-deterministic nature of CAS, means that there is no way to generate a deterministic execution order while still maintaining parallelism. Thus GPU map results will vary and may differ from maps generated by single threaded CPU algorithms.
Additionally, the use of single precision calculations on GPU versus double precision on CPU results in different rounding errors when tracing rays near the edges of voxels.
However, the nature of occupancy maps is that they are statistical, probabilistic approximations, thus the results should be comparable.
Double precision maths has been supported in CUDA for some time. However, double precision is an extension in OpenCL. We choose to support a wider range of GPU accelerators by using single precision maths. In order to achieve this, rays are converted to single precision before uploading to GPU memory. Specifically, each ray is converted to a single precision ray in a coordinate frame centred on the voxel containing the ray sample, or end point. This ensures overall precision is not lost in large maps.
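The idea can be sketched as below. This is an illustration rather than the OHM implementation: the `LocalRay` type is hypothetical, the caller is assumed to have already resolved the sample voxel centre, and GLM vector types are used for the example.

```cpp
#include <glm/glm.hpp>

// Illustrative structure: a ray expressed in single precision relative to the
// centre of the voxel containing the sample (end) point.
struct LocalRay
{
  glm::vec3 origin;
  glm::vec3 sample;
};

LocalRay localiseRay(const glm::dvec3 &origin, const glm::dvec3 &sample,
                     const glm::dvec3 &sample_voxel_centre)
{
  LocalRay ray;
  // Subtract in double precision first, then truncate to single precision.
  // The resulting magnitudes are at most a ray length, so little precision is
  // lost even when the map extends far from the global origin.
  ray.origin = glm::vec3(origin - sample_voxel_centre);
  ray.sample = glm::vec3(sample - sample_voxel_centre);
  return ray;
}
```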
OHM uses a GPU abstraction API to wrap either CUDA or OpenCL. The runtime code is then written in OpenCL style, with support functions added to convert to CUDA. This allows the same code to compile and run with either API.
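As an illustration of the technique, a compatibility header along the lines below can map common OpenCL built-ins onto CUDA when compiling with `nvcc`. This is a simplified sketch, not OHM's actual abstraction layer, and the macro coverage shown is far from complete.

```cpp
// Simplified OpenCL-to-CUDA compatibility shim (illustrative only).
#ifdef __CUDACC__
// Map OpenCL address space and kernel qualifiers onto CUDA equivalents.
#define __global
#define __local __shared__
#define __kernel extern "C" __global__

// Map OpenCL work item queries onto CUDA thread/block indexing.
__device__ inline unsigned get_global_id(int dim)
{
  switch (dim)
  {
  case 0: return blockIdx.x * blockDim.x + threadIdx.x;
  case 1: return blockIdx.y * blockDim.y + threadIdx.y;
  default: return blockIdx.z * blockDim.z + threadIdx.z;
  }
}

// Map the OpenCL work group barrier onto CUDA's block-wide synchronisation.
#define barrier(flags) __syncthreads()
#endif  // __CUDACC__
```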
There are some differences between the two APIs and performance differences should be expected between CUDA and OpenCL running on the same hardware. CUDA generally performs better than OpenCL on the same GPU; however, NVIDIA's recent addition of OpenCL 3.0 support has reduced the difference.
OpenCL support is designed to cover a number of OpenCL standards: 1.2, 2.0 and 3.0. However, OpenCL support across GPU hardware can vary. Code which compiles for one card may fail to compile on hardware from another manufacturer. Additionally, some algorithms may not be supported on all hardware. For example, the TSDF implementation requires 64-bit atomic operations (specifically CAS). It is recommended that OHM be tested extensively on the target hardware.
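For example, 64-bit atomic support can be probed at runtime by checking the device extension string for `cl_khr_int64_base_atomics`. The snippet below uses the standard OpenCL C API directly and is not part of the OHM API.

```cpp
#include <CL/cl.h>

#include <string>
#include <vector>

// Returns true if the OpenCL device advertises 64-bit base atomics, which the
// TSDF update requires for its 64-bit compare-and-swap.
bool deviceSupports64BitAtomics(cl_device_id device)
{
  size_t size = 0;
  if (clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, 0, nullptr, &size) != CL_SUCCESS)
  {
    return false;
  }
  std::vector<char> buffer(size);
  if (clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, size, buffer.data(), nullptr) != CL_SUCCESS)
  {
    return false;
  }
  const std::string extensions(buffer.data());
  return extensions.find("cl_khr_int64_base_atomics") != std::string::npos;
}
```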
As noted above, CUDA performance is generally better than OpenCL. There are also a number of data factors which may affect performance and algorithm options which can be used to improve performance.
Long rays will always take longer to process on both CPU and GPU. However, on GPU there is the added complication of redundant threads. Consider a warp which contains a number of very short rays and one very long ray. The time to execute the warp is determined by the longest ray, with the other threads effectively idle for the additional voxels touched by the long ray.
Two options can be used to reduce the impact of inconsistent ray lengths. Below we describe the API functions used to set these options as well as the equivalent command line options for `ohmpopcuda` or `ohmpopocl`.
- `GpuMap::setRaySegmentLength()` (`--gpu-ray-segment-length`) sets a threshold above which the CPU will segment a long ray into multiple smaller GPU work items.
- `GpuMap::setRayFilter()` may be used to install a filter function which can modify rays before they are uploaded to the GPU. A number of common filter functions are available in `RayFilter.h`, such as `clipRayFilter()` (`--ray-length-max`).
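A minimal usage sketch follows. The header paths, the `setRaySegmentLength()` argument type, the filter callback signature and the `clipRayFilter()` parameters shown here are assumptions for illustration; check `GpuMap.h` and `RayFilter.h` for the actual interfaces.

```cpp
#include <ohmgpu/GpuMap.h>  // header locations may differ for your OHM installation
#include <ohm/RayFilter.h>

#include <glm/glm.hpp>

void configureRayHandling(ohm::GpuMap &gpu_map)
{
  // Segment rays longer than 20m into multiple GPU work items.
  // Roughly equivalent to --gpu-ray-segment-length=20 (argument type assumed).
  gpu_map.setRaySegmentLength(20.0);

  // Clip rays to a maximum length of 30m before upload.
  // Roughly equivalent to --ray-length-max=30; the callback signature and
  // clipRayFilter() arguments are assumptions.
  const double max_ray_length = 30.0;
  gpu_map.setRayFilter([max_ray_length](glm::dvec3 *start, glm::dvec3 *end, unsigned *filter_flags) {
    return ohm::clipRayFilter(start, end, filter_flags, max_ray_length);
  });
}
```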
The GPU maintains a fixed size cache for voxel data. Voxel regions are uploaded from CPU to GPU as required. Once the cache is full, voxel regions are returned to the CPU to make space. This incurs an IO cost. A small cache can result in cache thrashing between batches of rays and severely impact performance.
The cache size may be set when constructing a `GpuMap` via the `gpu_mem_size` argument. This is exposed to the `ohmpop` command line as `--gpu-cache-size`.
Consider the following when choosing a cache size:
- Whether a local or global map is in use. A local map requires less memory as regions outside some local spatial extent can be dropped (such as when running on a sensing payload).
- The voxel layers in use and the region size. The `MapLayout` defines the `MapLayer` definitions and the number of bytes per region is calculated by `MapLayer::layerByteSize()`.
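As a back-of-the-envelope worked example, the sketch below estimates how many regions fit in a given cache size. The region dimensions and layer sizes are illustrative assumptions, not OHM defaults; use `MapLayer::layerByteSize()` for real values.

```cpp
#include <cstddef>
#include <cstdio>

int main()
{
  // Assumed values for illustration only.
  const std::size_t region_voxels = 32 * 32 * 32;          // 32^3 voxels per region (assumed)
  const std::size_t occupancy_bytes = 4 * region_voxels;   // 4-byte occupancy value per voxel (assumed)
  const std::size_t mean_bytes = 8 * region_voxels;        // 8-byte voxel mean entry per voxel (assumed)
  const std::size_t bytes_per_region = occupancy_bytes + mean_bytes;

  const std::size_t cache_bytes = std::size_t(2) * 1024 * 1024 * 1024;  // 2 GiB GPU cache
  std::printf("regions resident in cache: %zu\n", cache_bytes / bytes_per_region);
  return 0;
}
```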