Block-wise Scatter-Reduce : __getitem__ / __setitem__ errors #5639
olivier-peltre asked this question in Q&A
While trying to implement a scatter-add reduction inside a kernel, I get errors related to `tensor.__{get,set}item__`. As far as I understand, indexing features in Triton are either limited or new, and I'd like some advice on how to find a satisfactory workaround 🙂
Here is some sample code and a description of the errors:
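(Schematically, with placeholder pointer names and block sizes rather than the exact snippet; the two failing lines are marked.)

```python
import triton
import triton.language as tl

@triton.jit
def scatter_add_sketch(src_ptr, idx_ptr, out_ptr,
                       IN_BLOCK: tl.constexpr, OUT_BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * IN_BLOCK + tl.arange(0, IN_BLOCK)
    vals = tl.load(src_ptr + offs)      # one block of input values
    idx = tl.load(idx_ptr + offs)       # matching (sorted) output indices

    offset = idx[0]                     # (1) error: integer indexing of a tl.tensor
    out = tl.zeros([OUT_BLOCK], dtype=tl.float32)
    out[idx - offset] += vals           # (2) fails: tl.tensor has no __setitem__ (indexed assignment unsupported)
    tl.store(out_ptr + offset + tl.arange(0, OUT_BLOCK), out)
```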
When trying to access indices, e.g. at the line `offset = idx[0]`, I get an error. Is this intended? I can obviously replace this with `tl.load(idx_ptr + j (+k))` as a workaround, as shown below.
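For instance, assuming each program handles a contiguous block of `IN_BLOCK` indices (placeholder names), the scalar can be read straight from global memory:

```python
# instead of offset = idx[0]: load the first index of this program's block as a scalar
offset = tl.load(idx_ptr + pid * IN_BLOCK)
```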
I further get an error upon calling `__setitem__`. Looking over the source, I guess this is because `tensor` doesn't have a `__setitem__` method, though throwing an `AttributeError` would be more informative. I assume indexed assignment into the accumulator would fail for the same reason, though I could probably work around this with `tl.where`, as in the sketch further below.
I'm also unsure whether the shared-memory (?) accumulator `out` shouldn't be transposed for performance before the final `store` instruction.
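Concretely, the block-wise workaround I have in mind looks roughly like this (a sketch only, with the same placeholder names; it assumes the indices handled by one program are sorted and fall into a window of at most `OUT_BLOCK` consecutive bins, and it omits bounds masking):

```python
import triton
import triton.language as tl

@triton.jit
def blockwise_scatter_add(src_ptr, idx_ptr, out_ptr,
                          IN_BLOCK: tl.constexpr, OUT_BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * IN_BLOCK + tl.arange(0, IN_BLOCK)
    vals = tl.load(src_ptr + offs)
    idx = tl.load(idx_ptr + offs)
    offset = tl.load(idx_ptr + pid * IN_BLOCK)       # scalar load instead of idx[0]

    # Emulate out[idx - offset] += vals without __setitem__: broadcast-compare
    # every input index against every local bin, mask with tl.where, then
    # reduce over the input axis.
    bins = tl.arange(0, OUT_BLOCK)
    hits = (idx[None, :] - offset) == bins[:, None]            # [OUT_BLOCK, IN_BLOCK]
    out = tl.sum(tl.where(hits, vals[None, :], 0.0), axis=1)   # [OUT_BLOCK]

    # one atomic per output bin (not per element), in case the windows of
    # neighbouring programs overlap
    tl.atomic_add(out_ptr + offset + bins, out)
```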
But more generally, what would you recommend? Maybe this has been done somewhere else?
Any form of advice greatly appreciated 🙏
I'm tagging you here @apgoucher and @Mogball, as I wonder whether your recent #5262 PR could help on this subject.
Note
I'm looking into whether Triton could help scale up scatter-add kernels, after having noticed large runtime discrepancies between jax and torch implementations (torch ~4x faster), though both scale poorly to large input sizes. I'm hoping that aligning loads to 128B sector sizes while putting more work on each warp (as opposed to a naive implementation relying solely on atomics, sketched below for reference) could improve the scaling in the large-input-size limit.
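For reference, the naive atomics-only version I'm comparing against is essentially the following (placeholder names; one `tl.atomic_add` per input element):

```python
import triton
import triton.language as tl

@triton.jit
def scatter_add_atomic(src_ptr, idx_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    vals = tl.load(src_ptr + offs, mask=mask, other=0.0)
    idx = tl.load(idx_ptr + offs, mask=mask, other=0)
    # one atomic per element; contention grows with index collisions
    tl.atomic_add(out_ptr + idx, vals, mask=mask)

def scatter_add(src, idx, out, BLOCK=1024):
    grid = (triton.cdiv(src.numel(), BLOCK),)
    scatter_add_atomic[grid](src, idx, out, src.numel(), BLOCK=BLOCK)
    return out
```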
I'm also wondering whether leveraging CUDA `__shfl_sync` semantics wouldn't be necessary to get satisfactory results on this (which would probably fall out of Triton's scope). Though I'm still looking for a good Triton implementation first!