Update geoformer to use cuda 11.3, pytorch 1.11.0, and spconv 2.3.6 #2


adidier17

Hello, I am interested in using GeoFormer, but my GPU does not support CUDA 10.2, so I had to update the dependencies to try it out. I ran into some issues along the way, so this is part PR and part issue. I hope you can give me some guidance so that I can get GeoFormer running and complete the PR.

The error

When I run test.py or test_fs.py, there are no detections returned, and then at scene 77 the following error is raised:

[2024-05-25 20:34:42,976  INFO  test_fs.py  line 272  81243]  Num points: 76966 | Num instances of 10 runs: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[2024-05-25 20:34:43,503  INFO  test_fs.py  line 269  81243]  Test scene 19/310: scene0064_00 | Elapsed time: 15s | Remaining time: 236s
[2024-05-25 20:34:43,503  INFO  test_fs.py  line 272  81243]  Num points: 230672 | Num instances of 10 runs: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[2024-05-25 20:34:43,785  INFO  test_fs.py  line 269  81243]  Test scene 20/310: scene0064_01 | Elapsed time: 15s | Remaining time: 228s
[2024-05-25 20:34:43,785  INFO  test_fs.py  line 272  81243]  Num points: 195252 | Num instances of 10 runs: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Failed on input scene: {'voxel_locs': tensor([[  0,  25,  20,   9],
        [  0,  25,  19,   8],
        [  0,  25,  20,   8],
        ...,
        [  0, 228,  42,   2],
        [  0, 228,  41,   2],
        [  0, 229,  41,   2]], device='cuda:0'), 'p2v_map': tensor([    0,     1,     2,  ..., 64958, 64960, 57075], device='cuda:0',
       dtype=torch.int32), 'v2p_map': tensor([[    1,     0,     0,  ...,     0,     0,     0],
        [    2,     1,   106,  ...,     0,     0,     0],
        [    2,     2,    77,  ...,     0,     0,     0],
        ...,
        [    3, 92799, 92801,  ...,     0,     0,     0],
        [    1, 92802,     0,  ...,     0,     0,     0],
        [    2, 92803, 92805,  ...,     0,     0,     0]], device='cuda:0',
       dtype=torch.int32), 'locs': tensor([[  0,  25,  20,   9],
        [  0,  25,  19,   8],
        [  0,  25,  20,   8],
        ...,
        [  0, 228,  42,   2],
        [  0, 229,  41,   2],
        [  0, 227,  44,   2]], device='cuda:0'), 'locs_float': tensor([[-1.4505, -0.7623, -0.8699],
        [-1.4424, -0.7747, -0.8870],
        [-1.4458, -0.7634, -0.8874],
        ...,
        [ 2.6209, -0.3197, -1.0005],
        [ 2.6300, -0.3277, -0.9942],
        [ 2.6002, -0.2715, -1.0065]], device='cuda:0'), 'feats': tensor([[-0.9765, -0.9765, -0.9922],
        [-0.9843, -0.9843, -0.9922],
        [-0.9765, -0.9765, -0.9922],
        ...,
        [-0.8353, -0.8667, -0.9059],
        [-0.8353, -0.8667, -0.9059],
        [-0.8275, -0.8745, -0.8980]], device='cuda:0'), 'spatial_shape': array([231, 171, 139]), 'batch_offsets': tensor([    0, 92807], device='cuda:0', dtype=torch.int32), 'pc_mins': tensor([[-1.9537, -1.1645, -1.0521]], device='cuda:0', dtype=torch.float64), 'pc_maxs': tensor([[2.6526, 2.2472, 1.7230]], device='cuda:0', dtype=torch.float64), 'labels': tensor([2, 2, 1,  ..., 1, 1, 1], device='cuda:0')} with output_feats: tensor([], device='cuda:0', size=(0, 16))
Traceback (most recent call last):
  File "test_fs.py", line 352, in <module>
    do_test(
  File "test_fs.py", line 198, in do_test
    outputs = model(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/model/geoformer/geoformer_fs.py", line 565, in forward
    raise e
  File "/workspace/model/geoformer/geoformer_fs.py", line 558, in forward
    mask_features_ = self.mask_tower(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/util/warpper.py", line 145, in forward
    x = super().forward(x)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 302, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 298, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Calculated padded input size per channel: (0). Kernel size: (1). Kernel size can't be greater than actual input size
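
The traceback above ends with a size-(0, 16) `output_feats` tensor reaching a kernel-size-1 convolution, which rejects zero-length input. A minimal sketch of a guard that skips the layer when nothing survives is shown below. It uses NumPy arrays as stand-ins for tensors, and `run_if_nonempty` and `pointwise_layer` are hypothetical names, not GeoFormer code. A guard like this would only avoid the crash; it would not explain why the detections are empty in the first place.

```python
import numpy as np


def run_if_nonempty(layer, feats, out_channels):
    """Call `layer` only when `feats` has at least one row.

    Hypothetical helper, not part of GeoFormer. `feats` is an (N, C) array;
    for N == 0 an empty (0, out_channels) array is returned instead of
    calling the layer.
    """
    if feats.shape[0] == 0:
        return np.zeros((0, out_channels), dtype=feats.dtype)
    return layer(feats)


# Stand-in for a kernel-size-1 convolution: a per-row map from 16 to 8 channels.
weight = np.ones((16, 8), dtype=np.float32)


def pointwise_layer(x):
    # Mimics the behaviour in the traceback: refuse zero-length input.
    if x.shape[0] == 0:
        raise RuntimeError("zero-length input")
    return x @ weight


empty = np.zeros((0, 16), dtype=np.float32)  # same shape as output_feats in the log
full = np.ones((4, 16), dtype=np.float32)

print(run_if_nonempty(pointwise_layer, empty, 8).shape)
print(run_if_nonempty(pointwise_layer, full, 8).shape)
```

With a guard of this kind, a scene with zero candidates would produce an empty result rather than stopping the whole evaluation run.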

I've used a breakpoint to trace the issue: at some point the scores are all too low, so an empty list is returned.

According to the spconv documentation, the weight layout in 1.x is RSCK, while 2.x uses RSKC or KRSC. With the 2.x default, the model weights only load if I uncomment your weight permutation in the loading script, but then I get the empty detections and the error described above. I also tried re-commenting those lines and setting the weight layout to "RSCK", "RSKC", and "KRSC" in turn, but each of these produces shape mismatches between the model definition and the loaded weights.

I suspect the empty detections come from the weights being loaded in the wrong order. Perhaps the input and output channels are swapped, or the weights for the x coordinate are swapped with those for z. Either would give correct layer sizes but spurious model results. Could you print the weight values of one model layer as it loads for you with spconv 1.x and CUDA 10.2? With those values I can check whether the weights are loading in the correct order.
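
To illustrate the suspicion, here is a small sketch of the axis permutations between the layouts, written with NumPy on a toy 3x3x3 kernel (the same axis orders would apply to `torch.Tensor.permute`). In the layout names, R and S are the kernel axes, C is input channels, and K is output channels. The helpers `convert_rsck` and `weight_summary` and the toy sizes are assumptions for illustration and do not come from the GeoFormer loading script.

```python
import numpy as np

# Axis orders for a 3D kernel stored as RSCK: (k0, k1, k2, C, K).
RSCK_TO_KRSC = (4, 0, 1, 2, 3)  # -> (K, k0, k1, k2, C)
RSCK_TO_RSKC = (0, 1, 2, 4, 3)  # -> (k0, k1, k2, K, C)


def convert_rsck(weight, order):
    """Hypothetical helper: reorder an RSCK weight array into another layout."""
    return np.ascontiguousarray(np.transpose(weight, order))


def weight_summary(weight, n=5):
    """Hypothetical helper: values that are easy to compare across setups.

    `abs_max` does not depend on the layout; `head` does, so two setups
    that load the same checkpoint correctly should report the same `head`.
    """
    flat = weight.ravel()
    return {
        "shape": tuple(weight.shape),
        "abs_max": float(np.abs(flat).max()),
        "head": flat[:n].tolist(),
    }


# Toy kernel: 3x3x3, 16 input channels, 16 output channels.
rng = np.random.default_rng(0)
w_rsck = rng.standard_normal((3, 3, 3, 16, 16)).astype(np.float32)

w_krsc = convert_rsck(w_rsck, RSCK_TO_KRSC)
w_rskc = convert_rsck(w_rsck, RSCK_TO_RSKC)

# With C == K, swapping the two channel axes keeps the shape, so a shape
# check alone cannot tell that input and output channels were exchanged.
same_shape = w_rskc.shape == w_rsck.shape
same_values = np.array_equal(w_rskc, w_rsck)
print(w_krsc.shape, same_shape, same_values)
print(weight_summary(w_krsc))
```

The point of the sketch is that any layer with equal input and output channel counts, or with a cubic kernel, can accept a wrongly permuted weight without a shape error. Comparing a summary like `head` for one layer between the spconv 1.x and 2.x setups is one way to check the order directly.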

Changes Made

I've added a Dockerfile that uses CUDA 11.3, spconv 2.3.6 for CUDA 11.3, and PyTorch 1.11.0. PyTorch was upgraded to this version to avoid this bug with MinkowskiEngine. If you use my Dockerfile, note that I installed pointgroup_ops and pointnet2 from within the container rather than during the image build. You may not encounter this problem, but I could not get Docker to recognize CUDA during the build, so those packages would not install at that stage; they did install inside the running container.
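
When comparing results between the original setup and this one, it may help to record exactly which versions are installed in each. The following is a small sketch using only the standard library; the distribution names in `EXPECTED` are assumptions and may need adjusting to match what the Dockerfile actually installs.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed distribution names; adjust them to match the environment.
EXPECTED = ("torch", "spconv-cu113", "MinkowskiEngine")

if __name__ == "__main__":
    for name, version in installed_versions(EXPECTED).items():
        print(f"{name}: {version or 'not installed'}")
```

Inside the container, `torch.cuda.is_available()` and `torch.version.cuda` additionally report whether PyTorch can see the GPU and which CUDA version it was built against.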

I removed THC and updated the imports for spconv. Those are the relevant changes. Other changes were made automatically by my Python linter.

Please let me know if you have any other thoughts on why the detections returned are empty.
