MLP in Warp with wp.tile_matmul #817
-
(GPU: NVIDIA RTX 6000 Ada, 48 GB memory) I am trying out the examples that use Warp's tile primitives. The goal is to learn how to build a multilayer perceptron in Warp that can be used as an actor network and trained directly in Warp, without having to stream data to PyTorch.
Moreover, when I write custom code with matmul, the result seems incorrect, perhaps because I did something wrong.
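A simple way to sanity-check `wp.tile_matmul` is to start with matrices small enough that each operand fits in a single tile. The sketch below is only illustrative (the sizes are made up) and assumes a recent Warp release where `wp.tile_load`/`wp.tile_store` take `shape`/`offset` keywords and tile kernels are launched with `wp.launch_tiled`; older releases used different signatures.

```python
import numpy as np
import warp as wp

wp.init()

# Made-up sizes, chosen small enough that each operand fits in one tile.
# Module-level ints are captured as compile-time constants by the kernel.
M, K, N = 16, 32, 8


@wp.kernel
def matmul_single_tile(A: wp.array2d(dtype=float),
                       B: wp.array2d(dtype=float),
                       C: wp.array2d(dtype=float)):
    a = wp.tile_load(A, shape=(M, K), offset=(0, 0))  # cooperative load into shared memory
    b = wp.tile_load(B, shape=(K, N), offset=(0, 0))
    c = wp.tile_zeros(shape=(M, N), dtype=float)
    wp.tile_matmul(a, b, c)                           # c += a @ b
    wp.tile_store(C, c, offset=(0, 0))


A_np = np.random.randn(M, K).astype(np.float32)
B_np = np.random.randn(K, N).astype(np.float32)
A, B = wp.array(A_np), wp.array(B_np)
C = wp.zeros((M, N), dtype=float)

# A single cooperative block computes the whole product.
wp.launch_tiled(matmul_single_tile, dim=1, inputs=[A, B, C], block_dim=64)
print(np.max(np.abs(C.numpy() - A_np @ B_np)))
```

If a check like this matches NumPy, the problem is more likely in how the larger arrays are partitioned into tiles (offsets and loop bounds) than in `wp.tile_matmul` itself.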
-
When you get a message about graph capture failing, I normally turn off graph capture and run again. If the error is still unclear, you can turn on the appropriate debugging option to synchronize and check for CUDA errors after every operation.
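In case a concrete sketch helps, this is roughly the workflow described above. The flag and helper names are taken from `warp.config` and the graph-capture API in recent Warp releases, so they may differ on older versions; as far as I know the per-launch error check cannot be used while a graph is being captured, which is another reason to disable capture while debugging.

```python
import warp as wp

# Synchronize and check for CUDA errors after every launch and memory operation.
# Debug-only: this adds significant synchronization overhead.
wp.config.verify_cuda = True
wp.init()


@wp.kernel
def scale(x: wp.array(dtype=float), s: float):
    i = wp.tid()
    x[i] = x[i] * s


x = wp.zeros(1024, dtype=float)

use_graph = False  # hypothetical switch: debug eagerly first, re-enable capture once it runs cleanly

if use_graph:
    # Graph-capture path (turn verify_cuda back off before using this).
    with wp.ScopedCapture() as capture:
        wp.launch(scale, dim=x.shape[0], inputs=[x, 2.0])
    wp.capture_launch(capture.graph)
else:
    # Eager launches make it easier to localize the failing kernel.
    wp.launch(scale, dim=x.shape[0], inputs=[x, 2.0])

wp.synchronize()
```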
-
In addition, when I run the code below on my other computer, it often takes about a minute (presumably compiling), and then I get the following output, which reports an error about failing to configure the kernel's dynamic shared memory.
The code is
-
warp/warp/examples/benchmarks/benchmark_gemm.py, lines 45 to 49 at ab88f0f
-
Thank you for your response. I managed to make it work. The trick was to load both the input and the (weight, bias) in small tiles, as you suggested. However, I noticed that there are some minor differences between the results computed by Warp and NumPy. Why is this the case? Below is the code.
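For readers following along, a minimal sketch of that kind of layer (not the exact code from this reply) might look like the following. It assumes a recent Warp tile API (`shape`/`offset` keywords, `wp.launch_tiled`) and made-up sizes, and for simplicity it does the bias add and ReLU in a separate element-wise kernel rather than in tiles.

```python
import numpy as np
import warp as wp

wp.init()

# Made-up layer sizes; the tiles are kept small to stay within the
# per-block shared-memory budget.
BATCH, DIM_IN, DIM_OUT = 128, 64, 32
TILE_B, TILE_K, TILE_O = 16, 16, 16
K_TILES = DIM_IN // TILE_K


@wp.kernel
def linear_tiled(x: wp.array2d(dtype=float),   # (BATCH, DIM_IN)
                 w: wp.array2d(dtype=float),   # (DIM_IN, DIM_OUT)
                 y: wp.array2d(dtype=float)):  # (BATCH, DIM_OUT)
    i, j = wp.tid()
    acc = wp.tile_zeros(shape=(TILE_B, TILE_O), dtype=float)
    for k in range(K_TILES):
        xt = wp.tile_load(x, shape=(TILE_B, TILE_K), offset=(i * TILE_B, k * TILE_K))
        wt = wp.tile_load(w, shape=(TILE_K, TILE_O), offset=(k * TILE_K, j * TILE_O))
        wp.tile_matmul(xt, wt, acc)  # acc += xt @ wt
    wp.tile_store(y, acc, offset=(i * TILE_B, j * TILE_O))


@wp.kernel
def bias_relu(y: wp.array2d(dtype=float), b: wp.array(dtype=float)):
    i, j = wp.tid()
    y[i, j] = wp.max(y[i, j] + b[j], 0.0)


x_np = np.random.randn(BATCH, DIM_IN).astype(np.float32)
w_np = np.random.randn(DIM_IN, DIM_OUT).astype(np.float32)
b_np = np.random.randn(DIM_OUT).astype(np.float32)

x, w, b = wp.array(x_np), wp.array(w_np), wp.array(b_np)
y = wp.zeros((BATCH, DIM_OUT), dtype=float)

# One cooperative block per (TILE_B x TILE_O) output tile.
wp.launch_tiled(linear_tiled, dim=(BATCH // TILE_B, DIM_OUT // TILE_O),
                inputs=[x, w, y], block_dim=64)
wp.launch(bias_relu, dim=(BATCH, DIM_OUT), inputs=[y, b])

ref = np.maximum(x_np @ w_np + b_np, 0.0)
print(np.max(np.abs(y.numpy() - ref)))  # small float32-level differences are expected
```

On the mismatch question: small discrepancies are generally expected, since the GPU result is computed in float32 with a different accumulation order (and possibly tensor-core math) than NumPy's; comparing against a float32 NumPy reference with a tolerance, e.g. `np.allclose(..., atol=1e-5)`, is usually the fairer check.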
The result