
[Bug][Vulkan] SPIR-V backend generates sequentially consistent OpControlBarriers and OpMemberDecorate DecorationOffset on non-interface blocks #18915

@kistenklaus

Description


Expected behavior

I was trying to get TVM to work, just to benchmark it against other frameworks. Specifically, I am interested in DNN compilers, which compile networks down to the instruction level.
That's why I was interested in TVM, but it didn't work out of the box as expected, even for very simple networks that only contain a convolution and a ReLU layer.

I also think I have already found the bug; more on that in the last section.

Actual behavior

The network didn't compile; see the error log below.

Environment

System:

OS: Manjaro Linux x86_64 
Host: B650I AORUS ULTRA -CF 
Kernel: 6.6.126-1-MANJARO 
Uptime: 51 mins 
Packages: 1948 (pacman) 
Shell: bash 5.3.9 
Resolution: 1920x1080 
WM: i3 
Theme: Breeze [GTK2/3] 
Icons: breeze [GTK2/3] 
Terminal: Neovim Terminal 
CPU: AMD Ryzen 7 7800X3D (16) @ 5.050GHz 
GPU: NVIDIA GeForce RTX 4070 
GPU: AMD ATI 10:00.0 Raphael 
Memory: 16369MiB / 31200MiB               

TVM directly from main (c9fb8cd)

Steps to reproduce

Build TVM with USE_VULKAN=ON and set(USE_KHRONOS_SPIRV /usr) in the CMake config.
Then run this script:

import tvm
from tvm import relax
from tvm.relax.frontend import nn
from tvm.s_tir import dlight as dl


class TinyConv(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2D(
            in_channels=3,
            out_channels=32,
            kernel_size=3,
            stride=1,
            padding=1,
            bias=True,
            dtype="float16",
            data_layout="NCHW",
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.conv(x))


def main():
    input_shape = (1, 3, 1088, 1920)
    input_dtype = "float16"

    model = TinyConv()

    mod, _ = model.export_tvm(
        {"forward": {"x": nn.spec.Tensor(input_shape, input_dtype)}}
    )

    dev = tvm.vulkan(0)
    print("Vulkan exists:", tvm.vulkan().exist)
    print("Device:", dev)
    if not tvm.vulkan().exist:
        raise RuntimeError("TVM does not see a Vulkan device")

    target = tvm.target.Target.from_device(dev)

    with target:
        mod = tvm.ir.transform.Sequential(
            [
                relax.get_pipeline("zero"),
                dl.ApplyDefaultSchedule(
                    dl.gpu.Matmul(),
                    dl.gpu.GEMV(),
                    dl.gpu.Reduction(),
                    dl.gpu.GeneralReduction(),
                    dl.gpu.Fallback(),
                ),
            ]
        )(mod)

    print("Starting compile...")
    ex = relax.build(mod, target)
    print("Compile OK")
    print(type(ex))


if __name__ == "__main__":
    main()

This fails with:

[kistenklaus@kiste tvm-fork]$ python bench.py 
Vulkan exists: True
Device: vulkan:0
Starting compile...
Traceback (most recent call last):
  File "/home/kistenklaus/Documents/tvm-fork/bench.py", line 65, in <module>
    main()
  File "/home/kistenklaus/Documents/tvm-fork/bench.py", line 59, in main
    ex = relax.build(mod, target)
         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kistenklaus/Documents/tvm-fork/.venv/lib/python3.11/site-packages/tvm/relax/vm_build.py", line 262, in build
    return _vmlink(
           ^^^^^^^^
  File "/home/kistenklaus/Documents/tvm-fork/.venv/lib/python3.11/site-packages/tvm/relax/vm_build.py", line 157, in _vmlink
    lib = tvm.tir.build(tir_mod, target=target, pipeline=tir_pipeline)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kistenklaus/Documents/tvm-fork/.venv/lib/python3.11/site-packages/tvm/tir/build.py", line 238, in build
    return tir_to_runtime(host_mod, device_mod_dict, target_host)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kistenklaus/Documents/tvm-fork/.venv/lib/python3.11/site-packages/tvm/tir/build.py", line 146, in tir_to_runtime
    device_modules.append(codegen_build(device_mod, target))
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kistenklaus/Documents/tvm-fork/.venv/lib/python3.11/site-packages/tvm/tir/build.py", line 130, in codegen_build
    return bf(mod, target)
           ^^^^^^^^^^^^^^^
  File "python/tvm_ffi/cython/function.pxi", line 929, in tvm_ffi.core.Function.__call__
  File "<unknown>", line 0, in tvm::codegen::BuildSPIRV(tvm::IRModule, tvm::Target)
  File "<unknown>", line 0, in tvm::codegen::LowerToSPIRV[abi:cxx11](tvm::IRModule, tvm::Target)
  File "/home/kistenklaus/Documents/tvm-fork/src/target/spirv/spirv_utils.cc", line 101, in void tvm::codegen::SPIRVTools::ValidateShader(const std::vector<unsigned int>&)
    TVM_FFI_ICHECK_EQ(res, SPV_SUCCESS)

tvm.error.InternalError: Check failed: res == SPV_SUCCESS (-14 vs. 0) :  index=270 error:[VUID-StandaloneSpirv-MemorySemantics-10866] ControlBarrier: Memory Semantics with SequentiallyConsistent memory order must not be used in the Vulkan API
  OpControlBarrier %int_2 %int_2 %int_272
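To make the rejected instruction concrete, the operands of the failing OpControlBarrier can be decoded by hand. The bit values below are my reading of the SPIR-V MemorySemantics and Scope enums; this is just an illustration of why the validator rejects %int_272:

```python
# Decode the operands of: OpControlBarrier %int_2 %int_2 %int_272
# Constants below are taken from the SPIR-V specification enums.

SCOPE_WORKGROUP = 2              # Scope: Workgroup -> the two %int_2 operands

ACQUIRE_RELEASE = 0x8            # MemorySemantics bit 3
SEQUENTIALLY_CONSISTENT = 0x10   # MemorySemantics bit 4
WORKGROUP_MEMORY = 0x100         # MemorySemantics bit 8

emitted = 272                    # the %int_272 from the error message
assert emitted == SEQUENTIALLY_CONSISTENT | WORKGROUP_MEMORY

# Vulkan forbids SequentiallyConsistent; the equivalent operand with
# acquire-release semantics would be:
valid = ACQUIRE_RELEASE | WORKGROUP_MEMORY
print(hex(emitted), hex(valid))  # 0x110 0x108
```

So the barrier TVM emits is exactly a workgroup barrier with SequentiallyConsistent | WorkgroupMemory semantics, which matches the code path discussed below.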

Possible Layman Fix

I grepped through the codebase a bit, and the only place where spv::MemorySemanticsSequentiallyConsistentMask is used is in src/target/spirv/codegen_spirv.cc, in the following function:

spirv::Value CodeGenSPIRV::CreateStorageSync(const CallNode* op) {
  const std::string& sync = op->args[0].as<StringImmNode>()->value;
  spirv::Value value;

  uint32_t vulkan_api_version = spirv_support_.vulkan_api_version;

  int64_t sync_scope;
  int64_t memory_semantics = spv::MemorySemanticsSequentiallyConsistentMask;
  if ((sync == "warp") && (vulkan_api_version >= VK_API_VERSION_1_1)) {
    // Synchronize control at the Subgroup level, but memory at the
    // Workgroup level.  This is because different invocations in a
    // subgroup may have each modified memory that exists at the
    // workgroup scope.  This should be changed if/when tir exposes
    // more information as to which memory access needs to be
    // synchronized.
    sync_scope = spv::ScopeSubgroup;
    memory_semantics |=
        spv::MemorySemanticsSubgroupMemoryMask | spv::MemorySemanticsWorkgroupMemoryMask;

  } else if ((sync == "shared") || (sync == "warp")) {
    sync_scope = spv::ScopeWorkgroup;
    memory_semantics |= spv::MemorySemanticsWorkgroupMemoryMask;
  } else {
    TVM_FFI_THROW(InternalError) << "Do not support sync " << sync;
  }

  auto type_int = builder_->GetSType(DataType::Int(32));
  builder_->MakeInst(spv::OpControlBarrier, builder_->IntImm(type_int, sync_scope),
                     builder_->IntImm(type_int, sync_scope),
                     builder_->IntImm(type_int, memory_semantics));

  return value;
}

Now, I am not very familiar with the rest of the TVM codebase, so I might be wrong and this might break correctness somewhere, but in Vulkan a normal workgroup or subgroup barrier has acquire-release semantics, not sequentially consistent ones.
My suspicion is that this came from code related to OpenCL, because OpenCL requires OpControlBarriers that are sequentially consistent (no idea why; it doesn't make sense to me).

I tried to verify this in the SPIR-V spec, but the spec pages are broken yet again =^(.
Regardless, I am quite confident that sequential consistency for OpControlBarrier within a Vulkan context is a bug.

To further motivate this claim: https://github.com/KhronosGroup/GLSL/blob/main/extensions/khr/GL_KHR_memory_scope_semantics.txt also defaults to acquire-release semantics for barrier().
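The change I have in mind is just the initializer of memory_semantics in CreateStorageSync: start from AcquireRelease instead of SequentiallyConsistent and OR in the same memory masks. A sketch of the resulting semantics computation, written in Python for illustration (mask and scope constants are from the SPIR-V spec; the function name is mine, not TVM's):

```python
# Sketch of the proposed semantics computation, mirroring CreateStorageSync
# in src/target/spirv/codegen_spirv.cc with the default mask swapped.

SCOPE_WORKGROUP = 2
SCOPE_SUBGROUP = 3

ACQUIRE_RELEASE = 0x8            # proposed default (Vulkan-legal)
SUBGROUP_MEMORY = 0x80
WORKGROUP_MEMORY = 0x100

def barrier_semantics(sync: str, vulkan_1_1: bool = True):
    """Return (sync_scope, memory_semantics) for a storage sync."""
    semantics = ACQUIRE_RELEASE  # instead of SequentiallyConsistent
    if sync == "warp" and vulkan_1_1:
        # Subgroup control sync, but workgroup-level memory visibility.
        return SCOPE_SUBGROUP, semantics | SUBGROUP_MEMORY | WORKGROUP_MEMORY
    if sync in ("shared", "warp"):
        return SCOPE_WORKGROUP, semantics | WORKGROUP_MEMORY
    raise ValueError(f"Do not support sync {sync}")

print(barrier_semantics("shared"))  # (2, 264), i.e. 0x108 instead of 0x110
```

With this change the "shared" barrier would be emitted as OpControlBarrier with semantics 0x108 (AcquireRelease | WorkgroupMemory), which the validator accepts for Vulkan.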

This doesn't make it run out of the box yet; I think I found another bug as well, but I will split that into a separate issue.

Once I get a reply confirming that this is a bug, I can create a quick PR fixing it.

