[Potential Performance Improvement] Hard voxelization could have a much faster implementation (at the cost of determinism). #894
Comments
Yes, that really makes sense. I guess the deterministic property is not very important in the overall procedure. Have you compared the performance yielded by these two ways of voxelization? If they are comparable, I think we can first add this way as an optional approach, then run the regression benchmark several times to verify that it causes no problems and indeed performs better. Finally, if necessary, we can replace the old one. Do you agree, or do you have any suggestions?
I fully agree with you.
Hi all @zhanggefan @Tai-Wang, out of curiosity, can't we use something like this (https://github.com/tianweiy/CenterPoint/blob/44ebd63a81053e6fe85cd1228b644bab9d726507/det3d/models/readers/dynamic_voxel_encoder.py#L8) to do the voxelization? It is relatively fast and doesn't require writing custom CUDA code.
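For reference, here is a minimal pure-PyTorch sketch of that idea (dynamic voxelization via torch.unique). It is only an illustration under assumed voxel sizes and ranges, not the linked CenterPoint code:

```python
import torch

def dynamic_voxelize(points, voxel_size, pc_range):
    """points: (N, 3+) tensor; assigns every in-range point to a voxel, discarding none."""
    voxel_size = torch.as_tensor(voxel_size, dtype=points.dtype, device=points.device)
    pc_min = torch.as_tensor(pc_range[:3], dtype=points.dtype, device=points.device)
    pc_max = torch.as_tensor(pc_range[3:], dtype=points.dtype, device=points.device)

    # Keep only points inside the point-cloud range.
    mask = ((points[:, :3] >= pc_min) & (points[:, :3] < pc_max)).all(dim=1)
    points = points[mask]

    # Integer voxel coordinates per point.
    coords = ((points[:, :3] - pc_min) / voxel_size).long()

    # torch.unique gives one row per occupied voxel and, for each point,
    # the index of the voxel it falls into (the "indexing" step).
    unique_coords, inverse = torch.unique(coords, dim=0, return_inverse=True)
    return points, unique_coords, inverse
```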
Hi @tianweiy, but for the hard voxelization part, things are different. Extra care must be taken to handle the discarding process when there are more points inside a voxel than the local threshold allows, or more voxels than the global threshold allows. The point/voxel selection and discarding is full of dirty logic (forgive me, but it really gives me a headache), and moving that logic into Python code would make it much less efficient.
I see, yeah. Additionally, scatter-reduce can also be implemented in PyTorch easily (at least for mean / sum).
I am actually not sure we really need hard voxelization. I feel dynamic voxelization is both faster and simpler (we can probably handle discarding of excess voxels later).
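A minimal sketch of such a scatter-reduce in plain PyTorch, assuming `inverse` maps each point to its voxel index (e.g. from the torch.unique call in the earlier sketch); the function name is made up:

```python
import torch

def scatter_mean_sum(point_feats, inverse, num_voxels):
    """point_feats: (M, C); inverse: (M,) voxel index per point; returns per-voxel sum and mean."""
    sums = point_feats.new_zeros((num_voxels, point_feats.shape[1]))
    sums.index_add_(0, inverse, point_feats)                  # per-voxel sum
    counts = torch.bincount(inverse, minlength=num_voxels)    # points per voxel
    means = sums / counts.clamp(min=1).unsqueeze(1).to(point_feats.dtype)
    return sums, means
```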
@zhanggefan Oh, don't worry. What I originally meant is that we will help you run the regression benchmark to further check it; we usually run it once before each release. @tianweiy Great discussion. I will just add one thing: as @zhanggefan points out, the current way should be the better one, especially for hard voxelization. We consider these basic ops and would like to implement them in a consistent way that can be easily migrated (maybe to mmcv in the future). In addition, I suspect dynamic voxelization is not very stable in some cases (such as when points are extremely dense), so it has not replaced the hard version. There is also room for further experiments.
I suspect dynamic voxelization is not very stable in some cases (such as when points are extremely dense), so it has not replaced the hard version. <- I think that at least on Waymo, dynamic voxelization performs about the same as hard voxelization? The two differences I know of are the cap on the number of points per voxel and the cap on the total number of voxels.
For the first one, if you want to do some averaging, then using more points (e.g. all the points, as in dynamic voxelization) shouldn't make the result worse. For the second one, if we still want to limit the max number of voxels (for speed considerations, etc.), we can do the indexing at the end. Are there any other differences that I've missed?
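Concretely, "doing the indexing at the end" could look like the hypothetical sketch below, built on the earlier torch.unique output; note that torch.unique orders voxels by sorted coordinates rather than by first appearance, so which voxels survive the cap differs from the current CUDA behavior:

```python
import torch

def cap_voxels(points, unique_coords, inverse, max_voxels):
    """Keep at most `max_voxels` voxels and only the points that fall inside them."""
    if unique_coords.shape[0] <= max_voxels:
        return points, unique_coords, inverse
    keep = inverse < max_voxels   # voxel ids follow torch.unique's sorted order
    return points[keep], unique_coords[:max_voxels], inverse[keep]
```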
I remember the previous experiments were on nuScenes (10 frames of points concatenated), and the problem appears when a voxel contains very dense points, which hurts the voxelization efficiency. Besides, using more points is fine (both in terms of efficiency and implementation complexity) when computing the mean value, but it can bring some trouble when we want to extract features in other ways (a more complicated voxel encoder).
Thanks. Efficiency doesn't seem to be a problem in my dynamic voxelization implementation above (3 ms for a 3-frame Waymo point cloud of 400k+ points, if I remember correctly). I agree it will be more complex to extract other features. I do think things like PointPillars are doable using just torch_scatter / unique (not proposing a change, just an example).
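As a rough illustration of that last point (assuming the third-party torch_scatter package; the function and names below are made up, not mmdet3d or CenterPoint code), a PointPillars-style point decoration can be written with torch.unique plus scatter_mean:

```python
import torch
from torch_scatter import scatter_mean

def pillar_decorate(points, voxel_size_xy, pc_min_xy):
    """Group points into x/y pillars and append each point's offset from its pillar's mean."""
    coords = ((points[:, :2] - points.new_tensor(pc_min_xy)) /
              points.new_tensor(voxel_size_xy)).long()
    pillar_coords, inverse = torch.unique(coords, dim=0, return_inverse=True)

    # Per-pillar cluster center via scatter_mean, as in the PointPillars x_c/y_c/z_c features.
    centers = scatter_mean(points[:, :3], inverse, dim=0)
    decorated = torch.cat([points, points[:, :3] - centers[inverse]], dim=1)
    return decorated, pillar_coords, inverse
```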
Honestly, I've never used hard voxelization for PointPillars in production code. According to my experiments, it is indeed slow, and the score is even worse compared to dynamic voxelization.
I have some more suggestions for dynamic voxelization. I suggest splitting it up: as @tianweiy just pointed out, dynamic voxelization as a whole can be divided into 3 parts.
The core thought is that the "Scatter-Reduction" itself does not cost much time according to my observation, but the "Indexing" does.
@zhanggefan Yes, I agree with you. I think a more compatible way to improve it is to split this function into two sub-functions while keeping the original one. Then, if we need only one of the sub-functions, we do not need to run both of them. Is there any example similar to pillar-od among the currently supported models? (Then I will know how much influence it has on our current code.) If you have preliminarily tested your improved code, could you please create a PR so we can help review or further enhance it? If not, maybe we will take it as a TODO item to improve in the future.
The original implementation of dynamic voxelization follows a similar strategy to hard voxelization, which still builds a big feature map with shape (num_voxels, max_num_points_per_voxel, C) for feature averaging and summation. In the case of nuScenes, there are sometimes more than 1k points in a pillar, so creating this feature map causes OOM on a 32G V100. I am actually not very satisfied with the implementation in the initial release, though we have not found a suitable time to optimize it. But thanks to @zhanggefan, we have a better implementation now. Therefore, any optimization suggestions/PRs about the voxelization part are welcome. For example, if we find that a pure torch version of dynamic voxelization has little memory cost and is faster yet simpler, we can consider that.
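For a rough sense of scale (the numbers below are illustrative assumptions, not measurements from this issue), a dense buffer of that shape grows large quickly:

```python
# Hypothetical sizes: ~30k pillars, up to 1k points per pillar, 64 feature channels.
num_voxels, max_points_per_voxel, channels = 30_000, 1_000, 64
bytes_fp32 = num_voxels * max_points_per_voxel * channels * 4
print(f"{bytes_fp32 / 1024**3:.1f} GiB")  # ~7.2 GiB for a single float32 activation tensor
```

With gradients and intermediate activations on top of that, a 32G card can run out of memory.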
When using dynamic voxelization, voxels or pillars around the vehicle will inevitably contain more points.
Hi,
I noticed that the CUDA code for hard voxelization wraps some kernel code that is computationally inefficient:
- a single-thread kernel launch (mmdetection3d/mmdet3d/ops/voxel/src/voxelization_cuda.cu, line 276 at commit d112308)
- an O(n^2) loop inside a kernel (mmdetection3d/mmdet3d/ops/voxel/src/voxelization_cuda.cu, line 156 at commit d112308)
From my experience (see PR #318), voxelization can be viewed as a sparse hashing problem. Getting the sparse index can be solved by torch.unique with O(n log n) complexity, while the hash-reduce part can be implemented with atomicMax in O(n), just like the implementation in Apollo: https://github.com/ApolloAuto/apollo/blob/master/modules/perception/lidar/lib/detector/point_pillars_detection/preprocess_points_cuda.cu
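To make the sparse-hashing view concrete, here is a rough PyTorch sketch of hard voxelization built on torch.unique (illustrative only, with made-up names; a CUDA version would typically replace the slot assignment below with an atomic per-voxel counter, which is where the non-determinism discussed later comes from):

```python
import torch

def hard_voxelize_sketch(points, coords, max_points, max_voxels):
    """points: (N, C) features; coords: (N, 3) integer voxel coordinates per point."""
    unique_coords, inverse = torch.unique(coords, dim=0, return_inverse=True)
    unique_coords = unique_coords[:max_voxels]   # cap the number of voxels
    keep = inverse < max_voxels                  # points that fall into kept voxels

    # Assign each kept point a slot inside its voxel. A CUDA kernel would use an
    # atomic counter per voxel, so the slot order there depends on thread scheduling;
    # the stable sort below keeps the input point order and stays deterministic.
    sorted_inv, order = torch.sort(inverse[keep], stable=True)
    first = torch.searchsorted(sorted_inv, sorted_inv)   # first position of each voxel group
    slot = torch.arange(sorted_inv.numel(), device=coords.device) - first

    voxels = points.new_zeros((unique_coords.shape[0], max_points, points.shape[1]))
    valid = slot < max_points                    # drop overflow points in over-full voxels
    voxels[sorted_inv[valid], slot[valid]] = points[keep][order][valid]
    return voxels, unique_coords
```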
From my observation, this part is the performance bottleneck for many lightweight detectors on many datasets, for example CenterPoint with a pillar encoder on the nuScenes dataset:
https://github.com/open-mmlab/mmdetection3d/blob/d1123084973dd3a910760e77c499afad76d16cea/configs/centerpoint/centerpoint_02pillar_second_secfpn_4x8_cyclic_20e_nus.py
I managed to reimplement this part, and it reduces the training iteration time by more than half, which is a huge improvement.
I also understand that your implementation is designed for determinism. Because hard voxelization involves discarding points/voxels, the ordering between points matters: it guarantees that we always discard the same portion of the input points. In your implementation, earlier points/voxels take precedence over later ones when discarding, so the result is deterministic.
My reimplementation is non-deterministic because the hash-reduce kernel may execute in any order, with no ordering between input points at runtime. Different runs on the same data can result in different portions of the points being discarded.
If you think the performance gain is worth more than keeping determinism, my reimplementation is ready for a PR.
Looking forward to your reply!