Update geoformer to use cuda 11.3, pytorch 1.11.0, and spconv 2.3.6 #2

adidier17 · 2024-05-26T01:46:51Z

Hello, I am interested in using GeoFormer, but my GPU does not support CUDA 10.2, so I had to update the dependencies to try it out. I ran into some issues along the way, so this is part PR, part issue. I hope you can give me some guidance so that I can get GeoFormer running and complete the PR.

The error

When I run test.py or test_fs.py, no detections are returned, and then at scene 77 the following error is raised:

[2024-05-25 20:34:42,976  INFO  test_fs.py  line 272  81243]  Num points: 76966 | Num instances of 10 runs: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[2024-05-25 20:34:43,503  INFO  test_fs.py  line 269  81243]  Test scene 19/310: scene0064_00 | Elapsed time: 15s | Remaining time: 236s
[2024-05-25 20:34:43,503  INFO  test_fs.py  line 272  81243]  Num points: 230672 | Num instances of 10 runs: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[2024-05-25 20:34:43,785  INFO  test_fs.py  line 269  81243]  Test scene 20/310: scene0064_01 | Elapsed time: 15s | Remaining time: 228s
[2024-05-25 20:34:43,785  INFO  test_fs.py  line 272  81243]  Num points: 195252 | Num instances of 10 runs: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Failed on input scene: {'voxel_locs': tensor([[  0,  25,  20,   9],
        [  0,  25,  19,   8],
        [  0,  25,  20,   8],
        ...,
        [  0, 228,  42,   2],
        [  0, 228,  41,   2],
        [  0, 229,  41,   2]], device='cuda:0'), 'p2v_map': tensor([    0,     1,     2,  ..., 64958, 64960, 57075], device='cuda:0',
       dtype=torch.int32), 'v2p_map': tensor([[    1,     0,     0,  ...,     0,     0,     0],
        [    2,     1,   106,  ...,     0,     0,     0],
        [    2,     2,    77,  ...,     0,     0,     0],
        ...,
        [    3, 92799, 92801,  ...,     0,     0,     0],
        [    1, 92802,     0,  ...,     0,     0,     0],
        [    2, 92803, 92805,  ...,     0,     0,     0]], device='cuda:0',
       dtype=torch.int32), 'locs': tensor([[  0,  25,  20,   9],
        [  0,  25,  19,   8],
        [  0,  25,  20,   8],
        ...,
        [  0, 228,  42,   2],
        [  0, 229,  41,   2],
        [  0, 227,  44,   2]], device='cuda:0'), 'locs_float': tensor([[-1.4505, -0.7623, -0.8699],
        [-1.4424, -0.7747, -0.8870],
        [-1.4458, -0.7634, -0.8874],
        ...,
        [ 2.6209, -0.3197, -1.0005],
        [ 2.6300, -0.3277, -0.9942],
        [ 2.6002, -0.2715, -1.0065]], device='cuda:0'), 'feats': tensor([[-0.9765, -0.9765, -0.9922],
        [-0.9843, -0.9843, -0.9922],
        [-0.9765, -0.9765, -0.9922],
        ...,
        [-0.8353, -0.8667, -0.9059],
        [-0.8353, -0.8667, -0.9059],
        [-0.8275, -0.8745, -0.8980]], device='cuda:0'), 'spatial_shape': array([231, 171, 139]), 'batch_offsets': tensor([    0, 92807], device='cuda:0', dtype=torch.int32), 'pc_mins': tensor([[-1.9537, -1.1645, -1.0521]], device='cuda:0', dtype=torch.float64), 'pc_maxs': tensor([[2.6526, 2.2472, 1.7230]], device='cuda:0', dtype=torch.float64), 'labels': tensor([2, 2, 1,  ..., 1, 1, 1], device='cuda:0')} with output_feats: tensor([], device='cuda:0', size=(0, 16))
Traceback (most recent call last):
  File "test_fs.py", line 352, in <module>
    do_test(
  File "test_fs.py", line 198, in do_test
    outputs = model(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/model/geoformer/geoformer_fs.py", line 565, in forward
    raise e
  File "/workspace/model/geoformer/geoformer_fs.py", line 558, in forward
    mask_features_ = self.mask_tower(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/util/warpper.py", line 145, in forward
    x = super().forward(x)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 302, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 298, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Calculated padded input size per channel: (0). Kernel size: (1). Kernel size can't be greater than actual input size
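
The traceback ends in `F.conv1d` receiving an input with zero points (`output_feats` has size `(0, 16)`). As a minimal sketch, a guard like the one below would let the test loop continue past such a scene. The helper name `run_tower_or_empty`, the `(batch, channels, num_points)` layout, and the `out_channels` argument are assumptions for illustration, not GeoFormer's actual code, and the guard only hides the symptom: the empty detections themselves still need fixing.

```python
import torch
import torch.nn as nn


def run_tower_or_empty(tower: nn.Module, feats: torch.Tensor, out_channels: int) -> torch.Tensor:
    """Run a Conv1d-style tower, skipping it when there are no points.

    Hypothetical helper. `feats` is assumed to be (batch, channels, num_points).
    """
    if feats.shape[-1] == 0:
        # No points to convolve: return an empty tensor with the expected
        # channel count instead of calling the convolution at all.
        return feats.new_zeros((feats.shape[0], out_channels, 0))
    return tower(feats)


# Stand-in for the real mask tower: a single 1x1 convolution.
tower = nn.Sequential(nn.Conv1d(16, 8, kernel_size=1))

empty_out = run_tower_or_empty(tower, torch.zeros(1, 16, 0), out_channels=8)
full_out = run_tower_or_empty(tower, torch.zeros(1, 16, 5), out_channels=8)
print(tuple(empty_out.shape), tuple(full_out.shape))
```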

I've used a breakpoint to trace the issue: at some point the scores are all too low, so an empty list is returned. According to the spconv documentation, the weight layout in 1.x is RSCK, while 2.x uses RSKC or KRSC. With the 2.x default, the model weights only load if I uncomment your weight permutation in the loading script, but then I get the empty detections and the error described above. I have also tried re-commenting those lines and setting the weight layout to "RSCK", "RSKC", and "KRSC"; all three result in shape mismatches between the model definition and the loaded weights.

I suspect the empty detections come from the weights loading in the wrong order. Perhaps the input and output channels are swapped, or the weight for the x coordinate is swapped with the weight for z. That would give correct layer sizes but spurious model results. Could you print out the weight values of one of the model layers as it loads for you with spconv 1.x and CUDA 10.2? If I know the values, I can check whether the weights are loading in the correct order.
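
To illustrate the concern, here is a minimal sketch of why a shape check alone cannot catch a wrong permutation. It assumes that RSCK means a 3D weight of shape `[kz, ky, kx, C_in, C_out]` and KRSC means `[C_out, kz, ky, kx, C_in]`; please check that against the spconv documentation. The function names are hypothetical, and the tensor is random rather than a real checkpoint.

```python
import torch


def rsck_to_krsc(w: torch.Tensor) -> torch.Tensor:
    # Assumed layouts: [kz, ky, kx, C_in, C_out] -> [C_out, kz, ky, kx, C_in]
    return w.permute(4, 0, 1, 2, 3).contiguous()


def swapped_in_out(w: torch.Tensor) -> torch.Tensor:
    # Deliberately wrong: moves the C_in axis to the front instead of C_out.
    return w.permute(3, 0, 1, 2, 4).contiguous()


def fingerprint(w: torch.Tensor, n: int = 5):
    # A few leading values plus the sum, cheap to print and compare by eye.
    return w.flatten()[:n].tolist(), float(w.sum())


torch.manual_seed(0)
# Random stand-in for a layer where C_in == C_out, the ambiguous case.
w_old = torch.randn(3, 3, 3, 16, 16)

good = rsck_to_krsc(w_old)
bad = swapped_in_out(w_old)

print("same shape:", good.shape == bad.shape)
print("same values:", torch.equal(good, bad))
# Element-level check that the intended permutation kept the channel pairing.
print("pairing kept:", bool(good[5, 0, 1, 2, 7] == w_old[0, 1, 2, 7, 5]))
print("fingerprint:", fingerprint(good))
```

When the input and output channel counts match, both permutations pass the shape check while holding different values, so comparing a printed fingerprint or a few indexed elements against a reference from the spconv 1.x setup would be one way to confirm the order.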

Changes Made

I've added a Dockerfile that uses CUDA 11.3, spconv 2.3.6 for CUDA 11.3, and PyTorch 1.11.0. PyTorch was upgraded to this version to avoid this bug with MinkowskiEngine. If you use my Dockerfile, note that I installed pointgroup_ops and pointnet2 from within the container rather than during the image build. You may not encounter this problem, but I could not get Docker to recognize CUDA during the build to install those packages, whereas CUDA was recognized inside the running container.

I removed THC and updated the imports for spconv. Those are the relevant changes; the other changes were made automatically by my Python linter.

Please let me know if you have any other thoughts on why the detections returned are empty.

Update geoformer to use cuda 11.3, pytorch 1.11.0, and spconv 2.3.6

22992b7

adidier17 mentioned this pull request May 26, 2024

Empty detections returned from test script #3

Open
