GPU usage slower than CPU #53

GilesStrong · 2021-08-26T10:16:57Z

PR #52 fixes the code so that it runs on GPU; however, runtimes are about twice as slow as on CPU.
Possible causes:

Data isn't moved optimally to the GPU
Class attributes are repeatedly moved to CPU, and should instead be cached once
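
As a rough illustration of the "cache once" idea, the sketch below keeps a constant tensor as a module buffer so that a single `.to(device)` call moves it, rather than converting it inside every forward pass. `PanelGeometry`, `xy_span`, and `muon_xy` are hypothetical names chosen for the example; they are not taken from the code in PR #52.

```python
import torch
from torch import nn


class PanelGeometry(nn.Module):
    """Hypothetical module holding constant geometry used in every forward pass."""

    def __init__(self, xy_span):
        super().__init__()
        # A buffer travels with the module when .to(device) is called,
        # so the transfer happens once rather than on every call.
        self.register_buffer("xy_span", torch.as_tensor(xy_span, dtype=torch.float32))

    def forward(self, xy):
        # No .to(...) here: both tensors are expected to be on the same device already.
        return (xy >= 0).all(dim=-1) & (xy <= self.xy_span).all(dim=-1)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
geometry = PanelGeometry([1.0, 1.0]).to(device)  # single transfer of the buffer
muon_xy = torch.rand(100, 2, device=device)      # created directly on the target device
inside = geometry(muon_xy)
```

Creating inputs with `device=device` avoids a separate host-to-device copy for them as well.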

Need to examine code. The new PyTorch profiler might help highlight the slow parts of the code.
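
A minimal sketch of how a labelled region can be profiled with `torch.profiler` follows. The `toy_step` label and the toy computation are placeholders for illustration only; the real code would wrap its own steps in `record_function` instead.

```python
import torch
from torch.profiler import ProfilerActivity, profile, record_function

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Always record host-side activity; add CUDA kernels only when a GPU is present.
activities = [ProfilerActivity.CPU]
if device.type == "cuda":
    activities.append(ProfilerActivity.CUDA)

x = torch.rand(10, 3, device=device, requires_grad=True)

with profile(activities=activities) as prof:
    # record_function gives the enclosed ops a named row in the report.
    with record_function("toy_step"):
        loss = (x.tan() * x).sum()
        loss.backward()

# Summarise by operator, slowest (by total CPU time) first.
report = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(report)
```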

GilesStrong · 2021-08-27T08:43:00Z

Profiling suggests that all parts of the forward/backward loop are slower on GPU. This was for a batch of 10 muons, since profiling is really slow to run (typically we run 100 per batch). I would guess that for our kinds of operations we can't really benefit from the GPU, because each operation is relatively light (compared to multiplying large matrices in a DNN) and the muon batch sizes are small. For each section below, the first table is the CPU run and the second table (with the CUDA columns) is the GPU run.
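
One way to quantify "relatively light" is to compare the "Self CUDA time total" with the "Self CPU time total" that the profiler reports for each GPU run in this comment. The sketch below does that arithmetic on the totals copied from the complete GPU tables; treating the ratio as the share of host time during which the GPU was actually executing kernels is an approximation, not something the profiler states directly.

```python
# (Self CPU time total, Self CUDA time total) in milliseconds,
# copied from the GPU tables reported in this comment.
gpu_totals_ms = {
    "volume_propagation": (3030.0, 4.487),
    "scatter_inference": (800.977, 0.838),
    "location_unc": (1648.0, 144.841),
    "dtheta_unc": (742.216, 72.860),
    "theta_in_unc": (261.543, 18.180),
    "theta_out_unc": (223.912, 18.177),
    "x0_inf": (2281.0, 0.788),
}


def cuda_fraction(cpu_ms, cuda_ms):
    """Ratio of time spent inside CUDA kernels to host-side time."""
    return cuda_ms / cpu_ms


fractions = {name: cuda_fraction(*totals) for name, totals in gpu_totals_ms.items()}

for name, fraction in fractions.items():
    print(f"{name:20s} {100 * fraction:6.2f}% of host time covered by CUDA kernels")
```

In every section the ratio stays below 10%, which is consistent with the run being dominated by host-side overhead (kernel launches, synchronisation, indexing bookkeeping) rather than by GPU compute.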

volume_propagation
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
           volume_propagation        33.04%      11.829ms        99.93%      35.774ms      35.774ms             1  
                  aten::index         7.23%       2.587ms        11.19%       4.005ms      31.289us           128  
                    aten::mul         6.65%       2.381ms        10.71%       3.833ms      12.445us           308  
                 aten::select         6.76%       2.421ms         7.67%       2.745ms       4.725us           581  
                  aten::slice         6.73%       2.410ms         7.55%       2.703ms       5.941us           455  
                     aten::to         3.60%       1.288ms         6.37%       2.280ms       4.560us           500  
                     aten::ge         1.71%     613.000us         4.79%       1.716ms      26.812us            64  
              aten::unsqueeze         4.76%       1.705ms         4.77%       1.709ms     213.625us             8  
             aten::index_put_         0.87%     310.000us         4.44%       1.590ms      28.393us            56  
       aten::_index_put_impl_         2.09%     747.000us         3.58%       1.280ms      22.857us            56  
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 35.800ms

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                     volume_propagation         0.40%      12.071ms       100.00%        3.030s        3.030s       0.000us         0.00%       4.487ms       4.487ms             1  
                                       cudaLaunchKernel        98.26%        2.977s        98.26%        2.977s       1.849ms       0.000us         0.00%       0.000us       0.000us          1610  
                                              aten::tan         0.02%     568.000us        97.98%        2.969s      74.222ms     160.000us         3.57%     160.000us       4.000us            40  
                                            aten::index         0.11%       3.190ms         0.52%      15.645ms     122.227us     512.000us        11.41%       1.462ms      11.422us           128  
                                          aten::nonzero         0.20%       6.159ms         0.49%      14.925ms      88.839us       1.432ms        31.91%       1.432ms       8.524us           168  
                                       aten::index_put_         0.01%     360.000us         0.22%       6.553ms     117.018us       0.000us         0.00%     706.000us      12.607us            56  
                                 aten::_index_put_impl_         0.03%     974.000us         0.20%       6.193ms     110.589us     224.000us         4.99%     706.000us      12.607us            56  
                                              aten::mul         0.10%       3.131ms         0.19%       5.612ms      18.221us     616.000us        13.73%     616.000us       2.000us           308  
                                            aten::slice         0.13%       4.038ms         0.14%       4.328ms       9.450us       0.000us         0.00%       0.000us       0.000us           458  
                                        cudaMemcpyAsync         0.10%       2.976ms         0.10%       2.976ms      15.830us       0.000us         0.00%       0.000us       0.000us           188  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 3.030s
Self CUDA time total: 4.487ms
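
In the GPU table above, `cudaLaunchKernel` accounts for 2.977s of the 3.030s total, while the kernels themselves run for only 4.487ms. One possible explanation, which is an assumption and has not been verified here, is that a single profiled pass also pays one-off CUDA start-up costs. A hedged sketch of a timing helper that runs a few warm-up passes and synchronises before reading the clock is given below; `timed` and `toy_step` are hypothetical names for the example.

```python
import time

import torch


def timed(fn, device, n_warmup=3, n_repeat=10):
    """Average wall time per call of fn, excluding warm-up passes."""
    for _ in range(n_warmup):
        fn()  # absorbs any one-off initialisation cost
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for queued kernels before starting the clock
    start = time.perf_counter()
    for _ in range(n_repeat):
        fn()
    if device.type == "cuda":
        torch.cuda.synchronize()  # kernel launches are asynchronous; wait for them to finish
    return (time.perf_counter() - start) / n_repeat


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.rand(10, 3, device=device)


def toy_step():
    # Placeholder workload standing in for one propagation step.
    return (x.tan() * x).sum()


seconds_per_call = timed(toy_step, device)
print(f"{seconds_per_call * 1e6:.1f} us per call on {device.type}")
```

Comparing this figure between `cpu` and `cuda` gives a steady-state number that is easier to interpret than a single cold pass.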

scatter_inference
-----------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-----------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                  scatter_inference        29.13%       2.163ms        99.56%       7.393ms       7.393ms             1  
                        aten::slice         9.87%     733.000us        10.75%     798.000us       5.285us           151  
                          aten::cat         2.09%     155.000us         9.52%     707.000us      22.806us            31  
                         aten::_cat         5.04%     374.000us         7.43%     552.000us      17.806us            31  
                 aten::linalg_solve         1.83%     136.000us         7.22%     536.000us     268.000us             2  
                        aten::index         4.52%     336.000us         6.54%     486.000us      27.000us            18  
                        aten::stack         3.33%     247.000us         6.44%     478.000us      31.867us            15  
                          aten::mul         4.93%     366.000us         6.33%     470.000us      13.824us            34  
                          aten::div         4.92%     365.000us         6.07%     451.000us      13.667us            33  
                           aten::to         2.11%     157.000us         4.15%     308.000us       4.053us            76  
-----------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 7.426ms

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                     aten::linalg_solve         0.02%     130.000us       193.30%        1.548s     774.158ms       0.000us         0.00%      84.000us      42.000us             2  
                                      scatter_inference         0.50%       3.967ms       100.00%     800.940ms     800.940ms       0.000us         0.00%     838.000us     838.000us             1  
                        aten::_linalg_solve_out_helper_         0.71%       5.670ms        96.61%     773.861ms     773.861ms      37.000us         4.42%      37.000us      37.000us             1  
                                               cudaFree        95.62%     765.895ms        95.62%     765.895ms      85.099ms       0.000us         0.00%       0.000us       0.000us             9  
                                            aten::index         0.07%     572.000us         1.50%      12.050ms     669.444us      72.000us         8.59%     123.000us       6.833us            18  
                                          aten::nonzero         0.04%     295.000us         1.39%      11.110ms       1.852ms      51.000us         6.09%      51.000us       8.500us             6  
                                        cudaMemcpyAsync         1.34%      10.701ms         1.34%      10.701ms     629.471us       0.000us         0.00%       0.000us       0.000us            17  
                                               aten::ge         0.02%     155.000us         0.51%       4.119ms       1.030ms       7.000us         0.84%      14.000us       3.500us             4  
                                       cudaLaunchKernel         0.47%       3.726ms         0.47%       3.726ms      16.634us       0.000us         0.00%       0.000us       0.000us           224  
                                          cudaHostAlloc         0.18%       1.458ms         0.18%       1.458ms     364.500us       0.000us         0.00%       0.000us       0.000us             4  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 800.977ms
Self CUDA time total: 838.000us

location_unc
-----------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-----------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                       location_unc         8.34%      26.715ms        99.99%     320.397ms     320.397ms             1  
                 aten::linalg_solve        10.17%      32.598ms        62.59%     200.545ms     113.946us          1760  
                LinalgSolveBackward         0.16%     500.000us        34.90%     111.825ms       6.989ms            16  
             aten::_index_put_impl_         8.25%      26.441ms        23.06%      73.879ms      41.227us          1792  
                       aten::matmul         2.81%       8.996ms        19.14%      61.333ms     136.904us           448  
                      IndexBackward         0.15%     475.000us        16.37%      52.457ms     819.641us            64  
                       aten::select        11.07%      35.460ms        12.01%      38.483ms       5.200us          7401  
                        aten::cross         3.72%      11.905ms        10.52%      33.698ms      37.609us           896  
                          aten::mul         4.54%      14.544ms         9.66%      30.960ms      20.342us          1522  
                     SWhereBackward         0.04%     116.000us         8.15%      26.101ms       1.631ms            16  
-----------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 320.417ms

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                           location_unc        50.68%     835.124ms        51.18%     843.378ms     843.378ms       0.000us         0.00%     514.000us     514.000us             1  
                                     aten::linalg_solve         2.95%      48.610ms        43.75%     720.973ms     409.644us       0.000us         0.00%     115.370ms      65.551us          1760  
                                 aten::_index_put_impl_         6.41%     105.677ms        39.54%     651.595ms     363.613us      63.937ms        44.14%     162.434ms      90.644us          1792  
                                          IndexBackward         0.03%     492.000us        21.11%     347.898ms       5.436ms       0.000us         0.00%      81.335ms       1.271ms            64  
                                    LinalgSolveBackward         0.03%     517.000us        17.64%     290.750ms      18.172ms       0.000us         0.00%      40.756ms       2.547ms            16  
                                       cudaLaunchKernel         7.65%     126.053ms         7.65%     126.053ms       5.854us       0.000us         0.00%       0.000us       0.000us         21534  
                        aten::_linalg_solve_out_helper_         1.79%      29.491ms         6.68%     110.164ms     127.505us      32.146ms        22.19%      32.146ms      37.206us           864  
                                  cudaStreamSynchronize         5.22%      86.089ms         5.22%      86.089ms      16.607us       0.000us         0.00%       0.000us       0.000us          5184  
                                            aten::copy_         2.02%      33.290ms         4.98%      82.087ms      16.391us      11.632ms         8.03%      11.632ms       2.323us          5008  
                                              aten::mul         1.52%      25.117ms         3.84%      63.328ms      19.486us       8.046ms         5.56%      12.171ms       3.745us          3250  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.648s
Self CUDA time total: 144.841ms

dtheta_unc
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                 dtheta_unc        11.24%      16.426ms        99.98%     146.081ms     146.081ms             1  
     aten::_index_put_impl_        12.36%      18.066ms        37.11%      54.217ms      44.586us          1216  
              IndexBackward         0.31%     447.000us        27.75%      40.552ms     633.625us            64  
                  aten::mul         8.19%      11.964ms        17.27%      25.233ms      17.026us          1482  
               aten::select        13.13%      19.188ms        14.03%      20.501ms       6.218us          3297  
               DivBackward0         2.51%       3.665ms        12.98%      18.971ms      91.207us           208  
              SliceBackward         0.85%       1.238ms        11.52%      16.838ms      27.694us           608  
             SWhereBackward         0.08%     114.000us        10.88%      15.902ms     993.875us            16  
       aten::slice_backward         2.83%       4.132ms        10.68%      15.600ms      25.658us           608  
           aten::zeros_like         2.46%       3.601ms         7.95%      11.614ms      38.204us           304  
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 146.110ms

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                 aten::_index_put_impl_         9.87%      73.241ms        58.62%     435.071ms     357.789us      42.624ms        58.50%     108.288ms      89.053us          1216  
                                             dtheta_unc        51.44%     381.809ms        52.44%     389.254ms     389.254ms       0.000us         0.00%     481.000us     481.000us             1  
                                          IndexBackward         0.06%     466.000us        31.06%     230.560ms       3.603ms       0.000us         0.00%      54.272ms     848.000us            64  
                                       cudaLaunchKernel         8.44%      62.627ms         8.44%      62.627ms       5.575us       0.000us         0.00%       0.000us       0.000us         11234  
                                              aten::mul         2.80%      20.801ms         6.96%      51.653ms      19.610us       6.604ms         9.06%      10.449ms       3.967us          2634  
                                  cudaStreamSynchronize         5.83%      43.296ms         5.83%      43.296ms      18.792us       0.000us         0.00%       0.000us       0.000us          2304  
                                          SliceBackward         0.17%       1.244ms         4.67%      34.679ms      57.038us       0.000us         0.00%       2.821ms       4.640us           608  
                                   aten::slice_backward         0.51%       3.772ms         4.52%      33.514ms      55.122us       0.000us         0.00%       2.821ms       4.640us           608  
                                           DivBackward0         0.46%       3.437ms         3.88%      28.823ms     138.572us       0.000us         0.00%       6.158ms      29.606us           208  
                                            aten::copy_         1.79%      13.275ms         3.74%      27.744ms      15.078us       4.261ms         5.85%       4.261ms       2.316us          1840  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 742.216ms
Self CUDA time total: 72.860ms

theta_in_unc
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
               theta_in_unc        10.42%       3.644ms        99.92%      34.937ms      34.937ms             1  
     aten::_index_put_impl_        12.47%       4.359ms        33.65%      11.766ms      38.704us           304  
              IndexBackward         0.35%     121.000us        24.24%       8.476ms     529.750us            16  
                  aten::mul         8.33%       2.913ms        17.52%       6.126ms      17.064us           359  
               DivBackward0         2.43%     851.000us        13.14%       4.593ms      88.327us            52  
             SWhereBackward         0.08%      29.000us        12.81%       4.479ms       1.120ms             4  
              SliceBackward         0.94%     330.000us        12.55%       4.388ms      28.868us           152  
       aten::slice_backward         2.79%     976.000us        11.61%       4.058ms      26.697us           152  
               aten::select         9.70%       3.390ms        10.61%       3.711ms       4.531us           819  
                aten::where         3.96%       1.385ms         9.37%       3.276ms      43.105us            76  
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 34.964ms

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                 aten::_index_put_impl_         8.43%      22.054ms        56.82%     148.600ms     488.816us      10.656ms        58.61%      27.072ms      89.053us           304  
                                           theta_in_unc        51.18%     133.864ms        52.73%     137.921ms     137.921ms       0.000us         0.00%     131.000us     131.000us             1  
                                          IndexBackward         0.07%     187.000us        30.19%      78.969ms       4.936ms       0.000us         0.00%      13.568ms     848.000us            16  
                                       cudaLaunchKernel         9.82%      25.671ms         9.82%      25.671ms       9.185us       0.000us         0.00%       0.000us       0.000us          2795  
                                              aten::mul         3.36%       8.800ms         7.89%      20.628ms      31.883us       1.633ms         8.98%       2.581ms       3.989us           647  
                                          SliceBackward         0.16%     420.000us         4.70%      12.302ms      80.934us       0.000us         0.00%     706.000us       4.645us           152  
                                   aten::slice_backward         0.54%       1.425ms         4.54%      11.882ms      78.171us       0.000us         0.00%     706.000us       4.645us           152  
                                           DivBackward0         0.46%       1.206ms         4.09%      10.699ms     205.750us       0.000us         0.00%       1.553ms      29.865us            52  
                                  cudaStreamSynchronize         4.03%      10.537ms         4.03%      10.537ms      18.293us       0.000us         0.00%       0.000us       0.000us           576  
                                            aten::copy_         1.73%       4.528ms         3.84%      10.051ms      21.850us       1.066ms         5.86%       1.066ms       2.317us           460  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 261.543ms
Self CUDA time total: 18.180ms

theta_out_unc
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
              theta_out_unc        10.14%       3.758ms        99.93%      37.037ms      37.037ms             1  
     aten::_index_put_impl_        12.60%       4.671ms        33.69%      12.488ms      41.079us           304  
              IndexBackward         0.29%     107.000us        24.11%       8.935ms     558.438us            16  
                  aten::mul         8.32%       3.082ms        17.17%       6.365ms      17.730us           359  
              SliceBackward         0.81%     302.000us        13.36%       4.952ms      32.579us           152  
             SWhereBackward         0.12%      46.000us        13.14%       4.870ms       1.218ms             4  
       aten::slice_backward         2.85%       1.056ms        12.57%       4.659ms      30.651us           152  
               DivBackward0         2.38%     882.000us        12.24%       4.538ms      87.269us            52  
                aten::where         5.26%       1.950ms        10.09%       3.740ms      49.211us            76  
               aten::select         9.14%       3.389ms         9.98%       3.700ms       4.442us           833  
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 37.063ms

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                 aten::_index_put_impl_         8.35%      18.687ms        57.92%     129.694ms     426.625us      10.656ms        58.62%      27.072ms      89.053us           304  
                                          theta_out_unc        51.26%     114.777ms        52.78%     118.173ms     118.173ms       0.000us         0.00%     132.000us     132.000us             1  
                                          IndexBackward         0.06%     144.000us        30.71%      68.758ms       4.297ms       0.000us         0.00%      13.568ms     848.000us            16  
                                       cudaLaunchKernel         9.86%      22.072ms         9.86%      22.072ms       7.897us       0.000us         0.00%       0.000us       0.000us          2795  
                                              aten::mul         3.52%       7.892ms         7.95%      17.811ms      27.529us       1.633ms         8.98%       2.580ms       3.988us           647  
                                  cudaStreamSynchronize         4.73%      10.587ms         4.73%      10.587ms      18.380us       0.000us         0.00%       0.000us       0.000us           576  
                                          SliceBackward         0.16%     363.000us         4.55%      10.177ms      66.954us       0.000us         0.00%     704.000us       4.632us           152  
                                   aten::slice_backward         0.50%       1.130ms         4.39%       9.821ms      64.612us       0.000us         0.00%     704.000us       4.632us           152  
                                           DivBackward0         0.46%       1.027ms         4.00%       8.951ms     172.135us       0.000us         0.00%       1.553ms      29.865us            52  
                                            aten::copy_         1.69%       3.791ms         3.73%       8.359ms      18.172us       1.064ms         5.85%       1.064ms       2.313us           460  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 223.912ms
Self CUDA time total: 18.177ms

x0_inf
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                       x0_inf        49.17%       4.702ms        99.76%       9.540ms       9.540ms             1  
                    aten::mul        11.12%       1.063ms        17.38%       1.662ms      11.230us           148  
                    aten::eye         3.10%     296.000us         6.05%     579.000us      32.167us            18  
                 MulBackward0         1.62%     155.000us         6.04%     578.000us      24.083us            24  
                 PowBackward0         0.88%      84.000us         5.56%     532.000us      59.111us             9  
                    aten::pow         2.55%     244.000us         4.42%     423.000us      26.438us            16  
                  CosBackward         0.43%      41.000us         4.12%     394.000us      65.667us             6  
           ReciprocalBackward         0.73%      70.000us         3.99%     382.000us      42.444us             9  
                     aten::to         1.92%     184.000us         3.86%     369.000us       8.200us            45  
                    aten::sum         2.82%     270.000us         3.80%     363.000us      19.105us            19  
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 9.563ms

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                 x0_inf         0.51%      11.650ms        99.75%        2.276s        2.276s       0.000us         0.00%     419.000us     419.000us             1  
                                       cudaLaunchKernel        99.08%        2.261s        99.08%        2.261s       9.380ms       0.000us         0.00%       0.000us       0.000us           241  
                                              aten::sum         0.04%     975.000us        99.05%        2.260s     118.938ms     142.000us        18.02%     146.000us       7.684us            19  
                                            aten::copy_         0.01%     251.000us        99.01%        2.259s     118.892ms      49.000us         6.22%      49.000us       2.579us            19  
                                               aten::to         0.01%     117.000us        99.01%        2.259s      90.356ms       0.000us         0.00%      31.000us       1.240us            25  
                                              aten::mul         0.08%       1.719ms         0.19%       4.295ms      29.020us     294.000us        37.31%     471.000us       3.182us           148  
                                           MulBackward0         0.01%     190.000us         0.06%       1.374ms      57.250us       0.000us         0.00%     102.000us       4.250us            24  
                                           PowBackward0         0.01%     117.000us         0.06%       1.366ms     151.778us       0.000us         0.00%      75.000us       8.333us             9  
                                            aten::index         0.01%     285.000us         0.05%       1.170ms     146.250us      32.000us         4.06%      98.000us      12.250us             8  
                                     ReciprocalBackward         0.00%     113.000us         0.05%       1.043ms     115.889us       0.000us         0.00%      72.000us       8.000us             9  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.281s
Self CUDA time total: 788.000us

pred_passive
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                 pred_passive        28.01%     212.066ms       100.00%     756.965ms     756.965ms             1  
                    aten::mul        13.79%     104.377ms        14.61%     110.589ms      15.040us          7353  
                    aten::add         5.03%      38.063ms         9.10%      68.878ms      16.357us          4211  
                     aten::eq         5.11%      38.680ms         8.08%      61.143ms       8.478us          7212  
                    aten::sub         6.27%      47.478ms         6.88%      52.086ms       9.642us          5402  
                    aten::all         5.65%      42.780ms         6.64%      50.283ms      13.944us          3606  
                    aten::erf         5.23%      39.577ms         5.60%      42.405ms      11.779us          3600  
                    aten::div         4.86%      36.804ms         5.25%      39.743ms      10.961us          3626  
             aten::reciprocal         4.74%      35.866ms         5.13%      38.815ms      10.779us          3601  
                     aten::to         2.29%      17.343ms         4.13%      31.286ms       8.506us          3678  
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 756.980ms

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                           pred_passive         7.15%     258.415ms        99.85%        3.609s        3.609s       0.000us         0.00%     118.013ms     118.013ms             1  
                                       cudaLaunchKernel        68.51%        2.476s        68.51%        2.476s      68.155us       0.000us         0.00%       0.000us       0.000us         36330  
                                              aten::sum         0.03%     919.000us        61.56%        2.225s     105.952ms     165.000us         0.14%     169.000us       8.048us            21  
                                            aten::copy_         0.01%     452.000us        61.55%        2.225s      63.559ms      90.000us         0.08%      90.000us       2.571us            35  
                                               aten::to         0.00%     173.000us        61.55%        2.225s      41.973ms       0.000us         0.00%      57.000us       1.075us            53  
                                               aten::eq         2.16%      78.021ms         6.05%     218.711ms      30.326us       7.215ms         6.09%      14.430ms       2.001us          7212  
                                              aten::mul         3.25%     117.597ms         5.22%     188.565ms      25.645us      14.712ms        12.43%      14.891ms       2.025us          7353  
                                              aten::sub         2.11%      76.109ms         3.66%     132.262ms      24.484us      14.404ms        12.17%      14.404ms       2.666us          5402  
                                       aten::is_nonzero         0.32%      11.414ms         3.24%     116.959ms      32.426us       0.000us         0.00%       3.608ms       1.000us          3607  
                                              aten::all         1.76%      63.781ms         3.16%     114.288ms      31.694us      21.965ms        18.55%      21.965ms       6.091us          3606  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 3.614s
Self CUDA time total: 118.384ms

loss_calc
-----------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-----------------------  ------------  ------------  ------------  ------------  ------------  ------------  
              loss_calc        34.70%     566.000us        98.47%       1.606ms       1.606ms             1  
              aten::pow         7.66%     125.000us        10.36%     169.000us      33.800us             5  
              aten::sum         9.01%     147.000us        10.18%     166.000us      18.444us             9  
              aten::mul         4.84%      79.000us         7.73%     126.000us      25.200us             5  
             aten::relu         4.97%      81.000us         7.66%     125.000us      15.625us             8  
               aten::to         3.86%      63.000us         7.42%     121.000us       7.118us            17  
              aten::div         4.60%      75.000us         7.05%     115.000us      23.000us             5  
            aten::stack         1.53%      25.000us         5.46%      89.000us      89.000us             1  
            aten::expm1         4.54%      74.000us         4.78%      78.000us      19.500us             4  
              aten::add         3.80%      62.000us         4.29%      70.000us       8.750us             8  
-----------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.631ms

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                              loss_calc        14.16%     586.000us        99.13%       4.102ms       4.102ms       0.000us         0.00%     147.000us     147.000us             1  
                                            aten::stack         8.12%     336.000us        46.09%       1.907ms       1.907ms       0.000us         0.00%       7.000us       7.000us             1  
                                       cudaLaunchKernel        43.84%       1.814ms        43.84%       1.814ms      39.435us       0.000us         0.00%       0.000us       0.000us            46  
                                              aten::cat         0.43%      18.000us        37.12%       1.536ms       1.536ms       0.000us         0.00%       7.000us       7.000us             1  
                                             aten::_cat         1.06%      44.000us        36.68%       1.518ms       1.518ms       7.000us         4.76%       7.000us       7.000us             1  
                                              aten::pow         3.94%     163.000us         7.23%     299.000us      59.800us      10.000us         6.80%      10.000us       2.000us             5  
                                              aten::sum         4.20%     174.000us         6.55%     271.000us      33.875us      43.000us        29.25%      43.000us       5.375us             8  
                                             aten::relu         2.46%     102.000us         6.43%     266.000us      33.250us       0.000us         0.00%      16.000us       2.000us             8  
                                              aten::div         2.56%     106.000us         4.08%     169.000us      33.800us      13.000us         8.84%      13.000us       2.600us             5  
                                              aten::mul         2.51%     104.000us         4.06%     168.000us      33.600us      10.000us         6.80%      10.000us       2.000us             5  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 4.138ms
Self CUDA time total: 147.000us

loss_backward
-----------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-----------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                      loss_backward        16.44%     193.080ms       100.00%        1.174s        1.174s             1  
                          aten::mul        19.70%     231.344ms        24.92%     292.601ms       9.115us         32101  
                        ErfBackward         4.87%      57.221ms        20.34%     238.791ms      66.331us          3600  
                       SubBackward0         2.89%      33.880ms        18.72%     219.782ms      40.603us          5413  
                          aten::neg        10.31%     121.036ms        15.59%     183.065ms       7.231us         25316  
                      ProdBackward1         2.39%      28.118ms        14.68%     172.434ms     287.390us           600  
                       MulBackward0         4.75%      55.750ms        11.43%     134.164ms      18.359us          7308  
                 ReciprocalBackward         2.47%      29.028ms         9.36%     109.865ms      30.476us          3605  
                          aten::pow         3.93%      46.104ms         6.13%      71.954ms      19.893us          3617  
                           aten::to         2.50%      29.352ms         4.27%      50.171ms       4.106us         12220  
-----------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.174s

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                          loss_backward        53.99%        2.010s        54.04%        2.011s        2.011s       0.000us         0.00%       2.000us       2.000us             1  
                                              aten::mul         9.88%     367.569ms        17.10%     636.426ms      19.800us      65.215ms        38.96%      65.215ms       2.029us         32143  
                                              aten::neg         4.25%     158.093ms        12.96%     482.496ms      19.059us      25.321ms        15.13%      50.642ms       2.000us         25316  
                                            ErfBackward         1.05%      39.209ms        11.91%     443.370ms     123.158us       0.000us         0.00%      42.940ms      11.928us          3600  
                                       cudaLaunchKernel        11.49%     427.551ms        11.49%     427.551ms       5.972us       0.000us         0.00%       0.000us       0.000us         71591  
                                     ReciprocalBackward         0.72%      26.927ms         8.21%     305.638ms      84.782us       0.000us         0.00%      21.422ms       5.942us          3605  
                                          ProdBackward1         0.78%      28.872ms         7.65%     284.835ms     474.725us       0.000us         0.00%      32.327ms      53.878us           600  
                                           MulBackward0         1.33%      49.435ms         6.84%     254.730ms      34.856us       0.000us         0.00%      22.005ms       3.011us          7308  
                                           SubBackward0         0.62%      22.928ms         6.73%     250.341ms      46.248us       0.000us         0.00%      21.594ms       3.989us          5413  
                                            aten::empty         3.90%     145.136ms         3.90%     145.157ms       2.432us       0.000us         0.00%       0.000us       0.000us         59679  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 3.722s
Self CUDA time total: 167.402ms
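
As a rough way to read the totals above, the sketch below compares the host-side time of the CPU and GPU runs for each stage, and works out how much of the GPU run's host time the device was actually busy. The figures are copied from the "Self CPU time total" and "Self CUDA time total" lines of the tables. The names `TOTALS_MS` and `summarise` are purely illustrative and not part of the project. Profiler overhead inflates the host times, so the ratios should be taken as indicative only.

```python
# Totals in milliseconds, copied from the profiler summaries above.
# Each entry: (host time of CPU run, host time of GPU run, device time of GPU run)
TOTALS_MS = {
    "pred_passive": (756.980, 3614.0, 118.384),
    "loss_calc": (1.631, 4.138, 0.147),
    "loss_backward": (1174.0, 3722.0, 167.402),
}


def summarise(totals):
    """Return per-stage slowdown (GPU-run host time / CPU-run host time)
    and the fraction of the GPU run's host time spent executing kernels."""
    rows = {}
    for stage, (cpu_host, gpu_host, gpu_device) in totals.items():
        rows[stage] = {
            "slowdown": gpu_host / cpu_host,
            "device_busy": gpu_device / gpu_host,
        }
    return rows


SUMMARY = summarise(TOTALS_MS)

for stage, row in SUMMARY.items():
    print(
        f"{stage:>13}: GPU run {row['slowdown']:.1f}x slower on the host, "
        f"device busy for {row['device_busy']:.1%} of that time"
    )
```

With these numbers, every stage is more than twice as slow under profiling on GPU, while the device is busy for under 5% of the host time. That is consistent with the runtime being dominated by per-operation launch overhead (`cudaLaunchKernel`) rather than by the kernels themselves.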

GilesStrong added the "enhancement" and "medium priority" labels Aug 26, 2021

GilesStrong self-assigned this Aug 26, 2021

GilesStrong mentioned this issue Nov 10, 2022

refactor: make the CPU the default device, even if CUDA available unt… #151

Merged
