ZLUDA Wave64 implementation may have issue on GFX8/9 #208

Closed
wfjsw opened this issue Apr 20, 2024 · 16 comments · Fixed by #216

wfjsw commented Apr 20, 2024

This is purely based on deduction.

What's known

When using ZLUDA with Stable Diffusion, a Vega 20 user got this sort of image:

(attached screenshot of the garbled output image)

It is currently known that gfx803 / gfx900 / gfx906 users all get similar output.

The exact reason behind this image is unknown. However, it is known that the problem exists somewhere in the CLIP-UNet stage, while VAE Encode/Decode shows no issue. I have also tried several basic PyTorch operators and they all succeed with correct results.
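
For reference, a minimal sketch of the kind of CPU-vs-CUDA operator sanity check mentioned above (illustrative only, not the exact script that was run):

import torch

# Compare a few basic operators on the CPU and on the CUDA (ZLUDA) device.
torch.manual_seed(0)

def check(name, fn, *cpu_args):
    cpu_out = fn(*cpu_args)
    gpu_out = fn(*(t.cuda() for t in cpu_args)).cpu()
    ok = torch.allclose(cpu_out, gpu_out, rtol=1e-3, atol=1e-4)
    print(f"{name}: {'OK' if ok else 'MISMATCH'}")

a = torch.rand(64, 64)
b = torch.rand(64, 64)
check("matmul", torch.matmul, a, b)
check("add", torch.add, a, b)
check("softmax", lambda x: torch.softmax(x, dim=-1), a)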

What's not causing the issue

I tried to mitigate the issue in several ways and found out:

  • The same workload works on Linux, with the same ROCm version (5.7) and MIOpen disabled.
  • Tensile is not causing the issue.
    • I have tried to tune a fresh TensileLibrary with the corresponding Tensile version, and the result does not change.
    • I have tried to remove all tuned assembly kernels and only use source kernel fallback (as suggested by xuhuisheng/rocm-build), and the result does not change.
  • rocBLAS is not causing the issue.
    • I asked the user to run a full rocblas-test suite on Windows, and all tests passed.

My deduction

At this point, there are only two components that could cause this issue: one is ZLUDA, and the other is the ROCm driver. I'm not sure what's happening on the driver side, as it is closed-source, and I'm not seeing many similar issues on Windows. (Notable issues: ollama/ollama#2453 (comment) and ROCm/rocBLAS#1218 (comment), but the situation there seems somewhat different, as gfx9 is mostly issue-free in those reports.)

The one big difference between gfx8/9 and gfx10/11 is support for Wave32. While DoubleWave32OnWave64 has this sort of issue, I asked an RX 580 user to turn on ZLUDA_WAVE64_SLOW_MODE=1 for Wave32OnWave64, and he got this error constantly:

Exception in thread Thread-31 (load_model):
Traceback (most recent call last):
  File "D:\sd-webui-aki-v4.7\python\lib\threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "D:\sd-webui-aki-v4.7\python\lib\threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "D:\sd-webui-aki-v4.7\modules\initialize.py", line 153, in load_model
    devices.first_time_calculation()
  File "D:\sd-webui-aki-v4.7\modules\devices.py", line 271, in first_time_calculation
    conv2d(x)
  File "D:\sd-webui-aki-v4.7\python\lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\sd-webui-aki-v4.7\python\lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\sd-webui-aki-v4.7\extensions-builtin\Lora\networks.py", line 515, in network_Conv2d_forward
    return originals.Conv2d_forward(self, input)
  File "D:\sd-webui-aki-v4.7\python\lib\site-packages\torch\nn\modules\conv.py", line 460, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "D:\sd-webui-aki-v4.7\python\lib\site-packages\torch\nn\modules\conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: 
CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

At this point, I suspect the implementation of Wave64 in ZLUDA has something to do with this issue. Hopefully someone can point me in the right direction on how to get this fixed.

wfjsw changed the title from "zluda could be the source of gfx8/9 user misery" to "ZLUDA Wave64 implementation may have issue on GFX8/9" on Apr 20, 2024
vosen (Owner) commented Apr 20, 2024

Good analysis. I don't have a wave64 GPU at hand (neither pre-RDNA nor CDNA), but I can explain what you are seeing and offer some pointers.

I think you are correct in suspecting ZLUDA's wave64 mode. Only a handful of functions require special treatment in wave64, but they are fairly tricky and there may well be bugs, especially since I did not even write it for pre-RDNA desktop cards, but for CDNA. I'm always surprised by all the complex workloads (well, except this one) that reportedly work on pre-RDNA.

You would need to figure out which kernel is the first one that produces a different result. If I had a repro (the PTX module with the kernel, the kernel name, the input, and the good & bad outputs), I could figure out what's wrong. The tricky part is getting there. There is no single simple solution, but here is how to get there:

  • If you are using something open source, then the best path is to change that application's source code to save the outputs of the kernels and, from that, figure out which kernel is problematic (a rough PyTorch-hook sketch follows this list). You can then extract the PTX during the normal compilation process (pass -save-temps to nvcc).
  • If you don't have the source code, it gets much harder. You'd start by running under the dumper as outlined in the troubleshooting guide (https://github.com/vosen/ZLUDA/blob/master/TROUBLESHOOTING.md#zluda-dumper) and then modify the zluda_dump source to save the outputs of the kernels. ZLUDA used to have a mode where you could do this generically by setting a kernel name and/or index, but it was removed because it worked only with really simple workloads. Developers are very creative when it comes to passing data into kernels: pointers, pointers in structs, pointers to pointers, texrefs, texture objects, surfrefs, surface objects, etc. I should probably bring it back for cases like this.
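
For a PyTorch workload, one way to implement the first option is to register forward hooks on every module and compare the GPU outputs against a CPU reference run. A rough sketch (names and tolerances are hypothetical; module granularity narrows things down to an operator rather than a specific kernel):

import copy
import torch

def find_first_divergence(model, example_input, rtol=1e-3, atol=1e-4):
    # Run identical copies of the model on CPU and GPU and record every module's output.
    cpu_model = copy.deepcopy(model).cpu().eval()
    gpu_model = copy.deepcopy(model).cuda().eval()
    cpu_outs, gpu_outs = {}, {}

    def make_hook(store, name):
        def hook(module, args, output):
            if torch.is_tensor(output):
                store[name] = output.detach().float().cpu()
        return hook

    for (name, m_cpu), (_, m_gpu) in zip(cpu_model.named_modules(), gpu_model.named_modules()):
        m_cpu.register_forward_hook(make_hook(cpu_outs, name))
        m_gpu.register_forward_hook(make_hook(gpu_outs, name))

    with torch.no_grad():
        cpu_model(example_input.cpu())
        gpu_model(example_input.cuda())

    # Dicts preserve insertion (i.e. execution) order, so the first mismatch is the
    # earliest module whose GPU output drifts from the CPU reference.
    for name, ref in cpu_outs.items():
        if name in gpu_outs and not torch.allclose(ref, gpu_outs[name], rtol=rtol, atol=atol):
            print("first divergent module:", name)
            return name
    print("no divergence found at module granularity")
    return None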

BTW, why even use ZLUDA with Stable Diffusion, especially with pre-RDNA? Is there no better path there? I am asking because I never even touched SD and want to focus on workloads that are currently impossible with AMD cards. It was not on the list of potential ZLUDA workloads when I was with AMD; the thinking was that nod.ai had it covered.

wfjsw (Author) commented Apr 20, 2024

I do have the source code, but that is PyTorch, which is several abstraction layers higher. It seems I will eventually have to step through all the operators involved first.


There are few options for AMD cards with PyTorch on Windows, given that MIOpen for Windows is still not available and PyTorch is not yet set up to build without it. DirectML works, but it is buggy, slow, and no longer maintained. ZLUDA is one of the best ways to get there right now, as ROCm 6.1 again comes without Windows support.

wfjsw (Author) commented Apr 21, 2024

I'm going to log some of my progress here so it doesn't get lost.


For the code below, the outputs go bad as soon as magic >= 256.

import torch

magic = 256

ln = torch.nn.LayerNorm((magic, ), eps=1e-05, elementwise_affine=True)
ln_cuda = torch.nn.LayerNorm((magic, ), eps=1e-05, elementwise_affine=True).cuda()

weight_values = torch.ones(magic)
bias_values = torch.zeros(magic)

ln.weight.data = weight_values
ln.bias.data = bias_values

ln_cuda.weight.data = weight_values.cuda()
ln_cuda.bias.data = bias_values.cuda()

input = torch.rand(1, 1, magic)

with torch.no_grad():
    output_cpu = ln(input)
    output_gpu = ln_cuda(input.cuda())

print(torch.sum(output_cpu))
print(torch.sum(output_gpu.cpu()))
Output:

tensor(-4.1008e-05)
tensor(-18.8131)

wfjsw (Author) commented Apr 21, 2024

Hmm... zluda_dump gives me a single PTX file and 9 ELF files... not sure what I can do with them.

But anyway, here are the files:

python.exe.zip

wfjsw (Author) commented Apr 21, 2024

Some more discoveries...

PyTorch has a vectorized layer norm optimization that applies when magic % 4 == 0. The code above only has a problem when that optimization is applied and magic > 128. The branch is here:

https://github.com/pytorch/pytorch/blob/07cecf4168503a5b3defef9b2ecaeb3e075f4761/aten/src/ATen/native/cuda/layer_norm_kernel.cu#L787-L800
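
A quick sweep over layer widths can confirm that boundary; here is a sketch derived from the repro above (on an affected GPU, the expectation is that only sizes with magic % 4 == 0 and magic > 128 show a large error):

import torch

# LayerNorm defaults to weight=1, bias=0, so the CPU and CUDA modules start identical.
for magic in [120, 124, 126, 128, 130, 132, 136, 192, 255, 256]:
    ln = torch.nn.LayerNorm((magic,), eps=1e-05, elementwise_affine=True)
    ln_cuda = torch.nn.LayerNorm((magic,), eps=1e-05, elementwise_affine=True).cuda()
    x = torch.linspace(-1., 1., steps=magic).unsqueeze(0).unsqueeze(0)
    with torch.no_grad():
        err = (ln(x) - ln_cuda(x.cuda()).cpu()).abs().max().item()
    print(f"magic={magic:4d}  max abs diff={err:.4f}")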


I'm making the code deterministic so the problem can be spotted more easily:

import torch

magic = 132

ln = torch.nn.LayerNorm((magic, ), eps=1e-05, elementwise_affine=True)
ln_cuda = torch.nn.LayerNorm((magic, ), eps=1e-05, elementwise_affine=True).cuda()

weight_values = torch.ones(magic)
bias_values = torch.zeros(magic)

ln.weight.data = weight_values
ln.bias.data = bias_values

ln_cuda.weight.data = weight_values.cuda()
ln_cuda.bias.data = bias_values.cuda()

input = torch.linspace(-1., 1., steps=magic).unsqueeze(0).unsqueeze(0)

with torch.no_grad():
    output_cpu = ln(input)
    output_gpu = ln_cuda(input.cuda())

print(output_cpu)
print(output_gpu.cpu())

Example output when magic = 132:

tensor([[[-1.7190, -1.6927, -1.6665, -1.6402, -1.6140, -1.5877, -1.5615,
          -1.5352, -1.5090, -1.4828, -1.4565, -1.4303, -1.4040, -1.3778,
          -1.3515, -1.3253, -1.2991, -1.2728, -1.2466, -1.2203, -1.1941,
          -1.1678, -1.1416, -1.1154, -1.0891, -1.0629, -1.0366, -1.0104,
          -0.9841, -0.9579, -0.9316, -0.9054, -0.8792, -0.8529, -0.8267,
          -0.8004, -0.7742, -0.7479, -0.7217, -0.6955, -0.6692, -0.6430,
          -0.6167, -0.5905, -0.5642, -0.5380, -0.5117, -0.4855, -0.4593,
          -0.4330, -0.4068, -0.3805, -0.3543, -0.3280, -0.3018, -0.2756,
          -0.2493, -0.2231, -0.1968, -0.1706, -0.1443, -0.1181, -0.0919,
          -0.0656, -0.0394, -0.0131,  0.0131,  0.0394,  0.0656,  0.0919,
           0.1181,  0.1443,  0.1706,  0.1968,  0.2231,  0.2493,  0.2756,
           0.3018,  0.3280,  0.3543,  0.3805,  0.4068,  0.4330,  0.4593,
           0.4855,  0.5117,  0.5380,  0.5642,  0.5905,  0.6167,  0.6430,
           0.6692,  0.6955,  0.7217,  0.7479,  0.7742,  0.8004,  0.8267,
           0.8529,  0.8792,  0.9054,  0.9316,  0.9579,  0.9841,  1.0104,
           1.0366,  1.0629,  1.0891,  1.1154,  1.1416,  1.1678,  1.1941,
           1.2203,  1.2466,  1.2728,  1.2991,  1.3253,  1.3515,  1.3778,
           1.4040,  1.4303,  1.4565,  1.4828,  1.5090,  1.5352,  1.5615,
           1.5877,  1.6140,  1.6402,  1.6665,  1.6927,  1.7190]]])
tensor([[[-1.7452, -1.7177, -1.6902, -1.6627, -1.6353, -1.6078, -1.5803,
          -1.5528, -1.5253, -1.4978, -1.4704, -1.4429, -1.4154, -1.3879,
          -1.3604, -1.3329, -1.3055, -1.2780, -1.2505, -1.2230, -1.1955,
          -1.1680, -1.1406, -1.1131, -1.0856, -1.0581, -1.0306, -1.0031,
          -0.9757, -0.9482, -0.9207, -0.8932, -0.8657, -0.8382, -0.8108,
          -0.7833, -0.7558, -0.7283, -0.7008, -0.6733, -0.6459, -0.6184,
          -0.5909, -0.5634, -0.5359, -0.5084, -0.4810, -0.4535, -0.4260,
          -0.3985, -0.3710, -0.3435, -0.3161, -0.2886, -0.2611, -0.2336,
          -0.2061, -0.1786, -0.1512, -0.1237, -0.0962, -0.0687, -0.0412,
          -0.0137,  0.0137,  0.0412,  0.0687,  0.0962,  0.1237,  0.1512,
           0.1786,  0.2061,  0.2336,  0.2611,  0.2886,  0.3161,  0.3435,
           0.3710,  0.3985,  0.4260,  0.4535,  0.4810,  0.5084,  0.5359,
           0.5634,  0.5909,  0.6184,  0.6459,  0.6733,  0.7008,  0.7283,
           0.7558,  0.7833,  0.8108,  0.8382,  0.8657,  0.8932,  0.9207,
           0.9482,  0.9757,  1.0031,  1.0306,  1.0581,  1.0856,  1.1131,
           1.1406,  1.1680,  1.1955,  1.2230,  1.2505,  1.2780,  1.3055,
           1.3329,  1.3604,  1.3879,  1.4154,  1.4429,  1.4704,  1.4978,
           1.5253,  1.5528,  1.5803,  1.6078,  1.6353,  1.6627,  1.6902,
           1.7177,  1.7452,  1.7727,  1.8002,  1.8276,  1.8551]]])

Example output when magic = 256 (the output is really far off):

tensor([[[-1.7253, -1.7117, -1.6982, -1.6847, -1.6711, -1.6576, -1.6441,
          -1.6306, -1.6170, -1.6035, -1.5900, -1.5764, -1.5629, -1.5494,
          -1.5358, -1.5223, -1.5088, -1.4952, -1.4817, -1.4682, -1.4546,
          -1.4411, -1.4276, -1.4140, -1.4005, -1.3870, -1.3735, -1.3599,
          -1.3464, -1.3329, -1.3193, -1.3058, -1.2923, -1.2787, -1.2652,
          -1.2517, -1.2381, -1.2246, -1.2111, -1.1975, -1.1840, -1.1705,
          -1.1569, -1.1434, -1.1299, -1.1164, -1.1028, -1.0893, -1.0758,
          -1.0622, -1.0487, -1.0352, -1.0216, -1.0081, -0.9946, -0.9810,
          -0.9675, -0.9540, -0.9404, -0.9269, -0.9134, -0.8998, -0.8863,
          -0.8728, -0.8593, -0.8457, -0.8322, -0.8187, -0.8051, -0.7916,
          -0.7781, -0.7645, -0.7510, -0.7375, -0.7239, -0.7104, -0.6969,
          -0.6833, -0.6698, -0.6563, -0.6427, -0.6292, -0.6157, -0.6022,
          -0.5886, -0.5751, -0.5616, -0.5480, -0.5345, -0.5210, -0.5074,
          -0.4939, -0.4804, -0.4668, -0.4533, -0.4398, -0.4262, -0.4127,
          -0.3992, -0.3856, -0.3721, -0.3586, -0.3451, -0.3315, -0.3180,
          -0.3045, -0.2909, -0.2774, -0.2639, -0.2503, -0.2368, -0.2233,
          -0.2097, -0.1962, -0.1827, -0.1691, -0.1556, -0.1421, -0.1285,
          -0.1150, -0.1015, -0.0880, -0.0744, -0.0609, -0.0474, -0.0338,
          -0.0203, -0.0068,  0.0068,  0.0203,  0.0338,  0.0474,  0.0609,
           0.0744,  0.0880,  0.1015,  0.1150,  0.1285,  0.1421,  0.1556,
           0.1691,  0.1827,  0.1962,  0.2097,  0.2233,  0.2368,  0.2503,
           0.2639,  0.2774,  0.2909,  0.3045,  0.3180,  0.3315,  0.3451,
           0.3586,  0.3721,  0.3856,  0.3992,  0.4127,  0.4262,  0.4398,
           0.4533,  0.4668,  0.4804,  0.4939,  0.5074,  0.5210,  0.5345,
           0.5480,  0.5616,  0.5751,  0.5886,  0.6022,  0.6157,  0.6292,
           0.6427,  0.6563,  0.6698,  0.6833,  0.6969,  0.7104,  0.7239,
           0.7375,  0.7510,  0.7645,  0.7781,  0.7916,  0.8051,  0.8187,
           0.8322,  0.8457,  0.8593,  0.8728,  0.8863,  0.8998,  0.9134,
           0.9269,  0.9404,  0.9540,  0.9675,  0.9810,  0.9946,  1.0081,
           1.0216,  1.0352,  1.0487,  1.0622,  1.0758,  1.0893,  1.1028,
           1.1164,  1.1299,  1.1434,  1.1569,  1.1705,  1.1840,  1.1975,
           1.2111,  1.2246,  1.2381,  1.2517,  1.2652,  1.2787,  1.2923,
           1.3058,  1.3193,  1.3329,  1.3464,  1.3599,  1.3735,  1.3870,
           1.4005,  1.4140,  1.4276,  1.4411,  1.4546,  1.4682,  1.4817,
           1.4952,  1.5088,  1.5223,  1.5358,  1.5494,  1.5629,  1.5764,
           1.5900,  1.6035,  1.6170,  1.6306,  1.6441,  1.6576,  1.6711,
           1.6847,  1.6982,  1.7117,  1.7253]]])
tensor([[[-2.4301, -2.3919, -2.3536, -2.3153, -2.2771, -2.2388, -2.2005,
          -2.1622, -2.1240, -2.0857, -2.0474, -2.0092, -1.9709, -1.9326,
          -1.8944, -1.8561, -1.8178, -1.7795, -1.7413, -1.7030, -1.6647,
          -1.6265, -1.5882, -1.5499, -1.5117, -1.4734, -1.4351, -1.3969,
          -1.3586, -1.3203, -1.2820, -1.2438, -1.2055, -1.1672, -1.1290,
          -1.0907, -1.0524, -1.0142, -0.9759, -0.9376, -0.8993, -0.8611,
          -0.8228, -0.7845, -0.7463, -0.7080, -0.6697, -0.6315, -0.5932,
          -0.5549, -0.5166, -0.4784, -0.4401, -0.4018, -0.3636, -0.3253,
          -0.2870, -0.2488, -0.2105, -0.1722, -0.1339, -0.0957, -0.0574,
          -0.0191,  0.0191,  0.0574,  0.0957,  0.1339,  0.1722,  0.2105,
           0.2488,  0.2870,  0.3253,  0.3636,  0.4018,  0.4401,  0.4784,
           0.5166,  0.5549,  0.5932,  0.6315,  0.6697,  0.7080,  0.7463,
           0.7845,  0.8228,  0.8611,  0.8993,  0.9376,  0.9759,  1.0142,
           1.0524,  1.0907,  1.1290,  1.1672,  1.2055,  1.2438,  1.2820,
           1.3203,  1.3586,  1.3969,  1.4351,  1.4734,  1.5117,  1.5499,
           1.5882,  1.6265,  1.6647,  1.7030,  1.7413,  1.7795,  1.8178,
           1.8561,  1.8944,  1.9326,  1.9709,  2.0092,  2.0474,  2.0857,
           2.1240,  2.1622,  2.2005,  2.2388,  2.2771,  2.3153,  2.3536,
           2.3919,  2.4301,  2.4684,  2.5067,  2.5449,  2.5832,  2.6215,
           2.6598,  2.6980,  2.7363,  2.7746,  2.8128,  2.8511,  2.8894,
           2.9276,  2.9659,  3.0042,  3.0425,  3.0807,  3.1190,  3.1573,
           3.1955,  3.2338,  3.2721,  3.3103,  3.3486,  3.3869,  3.4252,
           3.4634,  3.5017,  3.5400,  3.5782,  3.6165,  3.6548,  3.6930,
           3.7313,  3.7696,  3.8079,  3.8461,  3.8844,  3.9227,  3.9609,
           3.9992,  4.0375,  4.0757,  4.1140,  4.1523,  4.1906,  4.2288,
           4.2671,  4.3054,  4.3436,  4.3819,  4.4202,  4.4584,  4.4967,
           4.5350,  4.5733,  4.6115,  4.6498,  4.6881,  4.7263,  4.7646,
           4.8029,  4.8411,  4.8794,  4.9177,  4.9560,  4.9942,  5.0325,
           5.0708,  5.1090,  5.1473,  5.1856,  5.2238,  5.2621,  5.3004,
           5.3386,  5.3769,  5.4152,  5.4535,  5.4917,  5.5300,  5.5683,
           5.6065,  5.6448,  5.6831,  5.7213,  5.7596,  5.7979,  5.8362,
           5.8744,  5.9127,  5.9510,  5.9892,  6.0275,  6.0658,  6.1040,
           6.1423,  6.1806,  6.2189,  6.2571,  6.2954,  6.3337,  6.3719,
           6.4102,  6.4485,  6.4867,  6.5250,  6.5633,  6.6016,  6.6398,
           6.6781,  6.7164,  6.7546,  6.7929,  6.8312,  6.8694,  6.9077,
           6.9460,  6.9843,  7.0225,  7.0608,  7.0991,  7.1373,  7.1756,
           7.2139,  7.2521,  7.2904,  7.3287]]])

vosen (Owner) commented Apr 21, 2024

> Hmm... zluda_dump gives me a single PTX file and 9 ELF files... not sure what I can do with them.
>
> But anyway, here are the files:
>
> python.exe.zip

That is perfectly normal and expected. nvcc compiles your code into multiple code modules (module = kernels + globals). Each module then gets compiled to a single fat binary. A fat binary contains multiple variants of the same module: usually one for each target GPU architecture, plus PTX for unknown architectures. What you see is a single fat binary split into those architecture-specific variants and the PTX.

The log is slightly weird: it contains only a single kernel dispatch.
Anyway, can you try running this ZLUDA test (execute in the main ZLUDA directory):

cargo test -p zluda --test shuffle -- --test-threads=1 zluda

Expected output is this:

running 4 tests
test shuffle_bfly_zluda ... ok
test shuffle_down_zluda ... ok
test shuffle_idx_zluda ... ok
test shuffle_up_zluda ... ok

test result: ok. 4 passed; 0 failed; 0 ignored; 0 measured; 4 filtered out; finished in 1.31s

wfjsw (Author) commented Apr 22, 2024

running 4 tests
test shuffle_bfly_zluda ... ok
test shuffle_down_zluda ... ok
test shuffle_idx_zluda ... ok
test shuffle_up_zluda ... ok

test result: ok. 4 passed; 0 failed; 0 ignored; 0 measured; 4 filtered out; finished in 2.14s

It looks good

wfjsw (Author) commented Apr 22, 2024

> The log is slightly weird: it contains only a single kernel dispatch.

The log is from the minimal reproduction I posted above. It only contains a layer norm operation.

vosen (Owner) commented Apr 22, 2024

Hmmm, I have a suspicion about what specifically went wrong. Can you try using ZLUDA from the bpermute branch in this repo (git pull and git switch bpermute)?

vosen (Owner) commented Apr 22, 2024

Additionally, could you run this test (on the version you already have, and if it fails, then on the bpermute branch):

cargo test -p ptx -- --test-threads=1 test::spirv_run::shfl_hip_wave32onwave64 test::spirv_run::shfl_hip_doublewave32onwave64 --exact

vosen (Owner) commented Apr 22, 2024

ZLUDA on RDNA can be forced (by changing the source code) to run wave64. I've tried that and I can't reproduce the issue:

:3:C:\constructicon\builds\gfx\two\23.30\drivers\compute\clr\hipamd\src\hip_module.cpp:434 : 447741741711 us: [pid:251428 tid:0x3a9b8]  hipModuleLaunchKernel ( 0x0000017CB679F960, 1, 1, 1, 32, 4, 1, 24, stream:<null>, 00000057E51ED250, char array:<null> )
:3:C:\constructicon\builds\gfx\two\23.30\drivers\compute\clr\hipamd\src\hip_module.cpp:456 : 447741742049 us: [pid:251428 tid:0x3a9b8] hipModuleLaunchKernel: Returned hipSuccess :
:3:C:\constructicon\builds\gfx\two\23.30\drivers\compute\clr\rocclr\device\pal\palvirtual.cpp:2576: 447741742059 us: [pid:251428 tid:0x3e5d8] ! kernel : _ZN2at6native53_GLOBAL__N__dc8387eb_20_layer_norm_kernel_cu_3ff0b71f28vectorized_layer_norm_kernelIffEEviT0_PKT_S6_S6_PS3_S7_PS4_

tensor([[[-1.7253, -1.7117, -1.6982, -1.6847, -1.6711, -1.6576, -1.6441,
          -1.6306, -1.6170, -1.6035, -1.5900, -1.5764, -1.5629, -1.5494,
          -1.5358, -1.5223, -1.5088, -1.4952, -1.4817, -1.4682, -1.4546,
          -1.4411, -1.4276, -1.4140, -1.4005, -1.3870, -1.3735, -1.3599,
          -1.3464, -1.3329, -1.3193, -1.3058, -1.2923, -1.2787, -1.2652,
          -1.2517, -1.2381, -1.2246, -1.2111, -1.1975, -1.1840, -1.1705,
          -1.1569, -1.1434, -1.1299, -1.1164, -1.1028, -1.0893, -1.0758,
          -1.0622, -1.0487, -1.0352, -1.0216, -1.0081, -0.9946, -0.9810,
          -0.9675, -0.9540, -0.9404, -0.9269, -0.9134, -0.8998, -0.8863,
          -0.8728, -0.8593, -0.8457, -0.8322, -0.8187, -0.8051, -0.7916,
          -0.7781, -0.7645, -0.7510, -0.7375, -0.7239, -0.7104, -0.6969,
          -0.6833, -0.6698, -0.6563, -0.6427, -0.6292, -0.6157, -0.6022,
          -0.5886, -0.5751, -0.5616, -0.5480, -0.5345, -0.5210, -0.5074,
          -0.4939, -0.4804, -0.4668, -0.4533, -0.4398, -0.4262, -0.4127,
          -0.3992, -0.3856, -0.3721, -0.3586, -0.3451, -0.3315, -0.3180,
          -0.3045, -0.2909, -0.2774, -0.2639, -0.2503, -0.2368, -0.2233,
          -0.2097, -0.1962, -0.1827, -0.1691, -0.1556, -0.1421, -0.1285,
          -0.1150, -0.1015, -0.0880, -0.0744, -0.0609, -0.0474, -0.0338,
          -0.0203, -0.0068,  0.0068,  0.0203,  0.0338,  0.0474,  0.0609,
           0.0744,  0.0880,  0.1015,  0.1150,  0.1285,  0.1421,  0.1556,
           0.1691,  0.1827,  0.1962,  0.2097,  0.2233,  0.2368,  0.2503,
           0.2639,  0.2774,  0.2909,  0.3045,  0.3180,  0.3315,  0.3451,
           0.3586,  0.3721,  0.3856,  0.3992,  0.4127,  0.4262,  0.4398,
           0.4533,  0.4668,  0.4804,  0.4939,  0.5074,  0.5210,  0.5345,
           0.5480,  0.5616,  0.5751,  0.5886,  0.6022,  0.6157,  0.6292,
           0.6427,  0.6563,  0.6698,  0.6833,  0.6969,  0.7104,  0.7239,
           0.7375,  0.7510,  0.7645,  0.7781,  0.7916,  0.8051,  0.8187,
           0.8322,  0.8457,  0.8593,  0.8728,  0.8863,  0.8998,  0.9134,
           0.9269,  0.9404,  0.9540,  0.9675,  0.9810,  0.9946,  1.0081,
           1.0216,  1.0352,  1.0487,  1.0622,  1.0758,  1.0893,  1.1028,
           1.1164,  1.1299,  1.1434,  1.1569,  1.1705,  1.1840,  1.1975,
           1.2111,  1.2246,  1.2381,  1.2517,  1.2652,  1.2787,  1.2923,
           1.3058,  1.3193,  1.3329,  1.3464,  1.3599,  1.3735,  1.3870,
           1.4005,  1.4140,  1.4276,  1.4411,  1.4546,  1.4682,  1.4817,
           1.4952,  1.5088,  1.5223,  1.5358,  1.5494,  1.5629,  1.5764,
           1.5900,  1.6035,  1.6170,  1.6306,  1.6441,  1.6576,  1.6711,
           1.6847,  1.6982,  1.7117,  1.7253]]])
:3:C:\constructicon\builds\gfx\two\23.30\drivers\compute\clr\hipamd\src\hip_memory.cpp:1467: 447741744571 us: [pid:251428 tid:0x3a9b8]  hipMemcpyDtoHAsync ( 0000031746003000, 0000000400200C00, 1024, stream:<null> )
:3:C:\constructicon\builds\gfx\two\23.30\drivers\compute\clr\hipamd\src\hip_memory.cpp:1475: 447741745819 us: [pid:251428 tid:0x3a9b8] hipMemcpyDtoHAsync: Returned hipSuccess : : duration: 1248 us
:3:C:\constructicon\builds\gfx\two\23.30\drivers\compute\clr\hipamd\src\hip_stream.cpp:451 : 447741745893 us: [pid:251428 tid:0x3a9b8]  hipStreamSynchronize ( stream:<null> )
:3:C:\constructicon\builds\gfx\two\23.30\drivers\compute\clr\hipamd\src\hip_stream.cpp:452 : 447741746340 us: [pid:251428 tid:0x3a9b8] hipStreamSynchronize: Returned hipSuccess :
tensor([[[-1.7253, -1.7117, -1.6982, -1.6847, -1.6711, -1.6576, -1.6441,
          -1.6306, -1.6170, -1.6035, -1.5900, -1.5764, -1.5629, -1.5494,
          -1.5358, -1.5223, -1.5088, -1.4952, -1.4817, -1.4682, -1.4546,
          -1.4411, -1.4276, -1.4140, -1.4005, -1.3870, -1.3735, -1.3599,
          -1.3464, -1.3329, -1.3193, -1.3058, -1.2923, -1.2787, -1.2652,
          -1.2517, -1.2381, -1.2246, -1.2111, -1.1975, -1.1840, -1.1705,
          -1.1569, -1.1434, -1.1299, -1.1164, -1.1028, -1.0893, -1.0758,
          -1.0622, -1.0487, -1.0352, -1.0216, -1.0081, -0.9946, -0.9810,
          -0.9675, -0.9540, -0.9404, -0.9269, -0.9134, -0.8998, -0.8863,
          -0.8728, -0.8593, -0.8457, -0.8322, -0.8187, -0.8051, -0.7916,
          -0.7781, -0.7645, -0.7510, -0.7375, -0.7239, -0.7104, -0.6969,
          -0.6833, -0.6698, -0.6563, -0.6427, -0.6292, -0.6157, -0.6022,
          -0.5886, -0.5751, -0.5616, -0.5480, -0.5345, -0.5210, -0.5074,
          -0.4939, -0.4804, -0.4668, -0.4533, -0.4398, -0.4262, -0.4127,
          -0.3992, -0.3856, -0.3721, -0.3586, -0.3451, -0.3315, -0.3180,
          -0.3045, -0.2909, -0.2774, -0.2639, -0.2503, -0.2368, -0.2233,
          -0.2097, -0.1962, -0.1827, -0.1691, -0.1556, -0.1421, -0.1285,
          -0.1150, -0.1015, -0.0880, -0.0744, -0.0609, -0.0474, -0.0338,
          -0.0203, -0.0068,  0.0068,  0.0203,  0.0338,  0.0474,  0.0609,
           0.0744,  0.0880,  0.1015,  0.1150,  0.1285,  0.1421,  0.1556,
           0.1691,  0.1827,  0.1962,  0.2097,  0.2233,  0.2368,  0.2503,
           0.2639,  0.2774,  0.2909,  0.3045,  0.3180,  0.3315,  0.3451,
           0.3586,  0.3721,  0.3856,  0.3992,  0.4127,  0.4262,  0.4398,
           0.4533,  0.4668,  0.4804,  0.4939,  0.5074,  0.5210,  0.5345,
           0.5480,  0.5616,  0.5751,  0.5886,  0.6022,  0.6157,  0.6292,
           0.6427,  0.6563,  0.6698,  0.6833,  0.6969,  0.7104,  0.7239,
           0.7375,  0.7510,  0.7645,  0.7781,  0.7916,  0.8051,  0.8187,
           0.8322,  0.8457,  0.8593,  0.8728,  0.8863,  0.8998,  0.9134,
           0.9269,  0.9404,  0.9540,  0.9675,  0.9810,  0.9946,  1.0081,
           1.0216,  1.0352,  1.0487,  1.0622,  1.0758,  1.0893,  1.1028,
           1.1164,  1.1299,  1.1434,  1.1569,  1.1705,  1.1840,  1.1975,
           1.2111,  1.2246,  1.2381,  1.2517,  1.2652,  1.2787,  1.2923,
           1.3058,  1.3193,  1.3329,  1.3464,  1.3599,  1.3735,  1.3870,
           1.4005,  1.4140,  1.4276,  1.4411,  1.4546,  1.4682,  1.4817,
           1.4952,  1.5088,  1.5223,  1.5358,  1.5494,  1.5629,  1.5764,
           1.5900,  1.6035,  1.6170,  1.6306,  1.6441,  1.6576,  1.6711,
           1.6847,  1.6982,  1.7117,  1.7253]]])

Well, there's only one way to know now. I've just ordered a Vega 10 GPU. It will arrive in a few days and then I'll be able to actually debug this.

wfjsw (Author) commented Apr 22, 2024

Tried the bpermute branch and nothing seems to change.

wfjsw (Author) commented Apr 22, 2024

The ptx test is not failing.

wfjsw (Author) commented Apr 23, 2024

By the way, while _ZN2at6native53_GLOBAL__N__8b2fdba7_20_layer_norm_kernel_cu_3ff0b71f28vectorized_layer_norm_kernelIffEEviT0_PKT_S6_S6_PS3_S7_PS4_ is broken, the regular layer norm, which calls _ZN2at6native53_GLOBAL__N__8b2fdba7_20_layer_norm_kernel_cu_3ff0b71f24RowwiseMomentsCUDAKernelIffEEvxT0_PKT_PS3_S7_ and _ZN2at6native53_GLOBAL__N__8b2fdba7_20_layer_norm_kernel_cu_3ff0b71f26LayerNormForwardCUDAKernelIffEEvxPKT_PKT0_S8_S5_S5_PS3_, does return valid results.

I tried to figure out which special instructions are involved:

Instructions in vectorized_layer_norm_kernel.ptx but not in unoptimized.ptx: {'ld.global.nc.v4.f32', 'setp.lt.u32', 'ld.param.u32', 'setp.ge.u32', 'and.pred', 'cvt.u32.u64', 'cvt.s64.s32', 'rcp.rn.f32', 'not.b32', 'setp.lt.s32', 'st.global.v4.f32', 'shfl.sync.idx.b32', 'div.u32', 'mul.wide.s32', 'mul.lo.s32', 'setp.ne.s64', 'or.b64', 'cvt.rn.f32.s32', 'setp.eq.s32', 'setp.gt.s32', 'or.pred', 'setp.leu.f32'}
Instructions in unoptimized.ptx but not in vectorized_layer_norm_kernel.ptx: {'setp.eq.f32', 'cvt.rn.f32.s64', 'setp.ge.s64', 'st.shared.v4.u8', 'ld.shared.u64', 'setp.lt.s64', 'st.shared.u64', 'mul.lo.s64', 'max.f32', 'mov.b64', 'mov.u64'}
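
For reference, a rough sketch of how such an opcode diff can be produced from the two dumped PTX files (the parsing below is a simple heuristic, not the exact script that was used):

import re

def ptx_opcodes(path):
    ops = set()
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines, directives (.reg, .visible, ...), comments, labels and braces.
            if not line or line[0] in ".{}/$":
                continue
            # Drop an optional guard predicate such as "@%p1 " or "@!%p2 ".
            line = re.sub(r"^@!?%\w+\s+", "", line)
            if not line:
                continue
            ops.add(line.split()[0].rstrip(";"))
    return ops

vec = ptx_opcodes("vectorized_layer_norm_kernel.ptx")
ref = ptx_opcodes("unoptimized.ptx")
print("Instructions in vectorized but not in unoptimized:", sorted(vec - ref))
print("Instructions in unoptimized but not in vectorized:", sorted(ref - vec))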

vosen (Owner) commented Apr 29, 2024

Should be fixed as of #216. At least the repro posted by @wfjsw works correctly.
The problem wasn't in the compiler, but rather that the ZLUDA host code (incorrectly) reported the warp size as 64 while the kernel compiler compiled with warp size 32. This greatly confused PyTorch.
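
For context, PyTorch's host-side code can query the warp size reported by the device, while device-side reduction code assumes the warp size it was compiled for, so a mismatch between the two silently breaks warp-level reductions. A small illustration (the device-property query is standard PyTorch; the reduction loop is a pure-Python simulation of the general hazard, not ZLUDA or PyTorch code):

import torch

# What the host reports; after #216 this should be consistent with what the kernels are compiled for.
print("reported warp size:", torch.cuda.get_device_properties(0).warp_size)

# Simulate a shuffle-based tree reduction over 64 lanes when the shuffle can
# only actually exchange values within groups of `real_warp` lanes.
def simulated_warp_sum(values, reported_warp, real_warp):
    lanes = list(values)
    offset = reported_warp // 2
    while offset >= 1:
        for lane in range(len(lanes)):
            src = lane + offset
            # A real_warp-wide shuffle cannot cross a real_warp-lane boundary.
            if src < len(lanes) and src // real_warp == lane // real_warp:
                lanes[lane] += lanes[src]
        offset //= 2
    return lanes[0]

data = list(range(64))
print("correct sum:", sum(data))                                   # 2016
print("reported=64, real=64:", simulated_warp_sum(data, 64, 64))   # 2016
print("reported=64, real=32:", simulated_warp_sum(data, 64, 32))   # 496: half the lanes never contribute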

DrENDzZ commented Apr 29, 2024

> Should be fixed as of #216. At least the repro posted by @wfjsw works correctly. The problem wasn't in the compiler, but rather that the ZLUDA host code (incorrectly) reported the warp size as 64 while the kernel compiler compiled with warp size 32. This greatly confused PyTorch.

Thanks! It is fixed for me, verified with gfx906 Vega 20 (Radeon Pro VII).
