Memodo is a linear attention solution that combines the advantages of both RWKV and DeltaNet.
Just use memodo.MemodoLayer, which is a subclass of torch.nn.Module.
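A minimal usage sketch; the constructor arguments (`d_model`, `n_heads`) and the input/output shapes are assumptions, not the actual signature:

```python
import torch
import memodo

# Hypothetical constructor arguments -- check the real MemodoLayer signature.
layer = memodo.MemodoLayer(d_model=512, n_heads=8)

x = torch.randn(2, 128, 512)   # (batch, seq_len, d_model)
y = layer(x)                   # expected to return a tensor of the same shape as x
```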
Memodo uses the General Delta Rule directly:
S -> S * diag(i) + S * a^T * b + c^T * d
return r * S
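A minimal, unoptimized per-step sketch of this recurrence in PyTorch; the square state matrix and row-vector convention are assumptions, and the actual layer presumably uses a fused or chunked kernel rather than a Python loop:

```python
import torch

def memodo_step(S, i, a, b, c, d, r):
    """One step of  S -> S * diag(i) + S * a^T * b + c^T * d,  output r * S.

    S: (d_model, d_model) state; i, a, b, c, d, r: (d_model,) per-token vectors.
    """
    # Decay columns of S by i, apply the rank-1 state transition a^T b,
    # and write the new rank-1 association c^T d.
    S = S * i.unsqueeze(0) + S @ torch.outer(a, b) + torch.outer(c, d)
    # Read out with the receptance-style vector r.
    out = r @ S
    return S, out
```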
With Dynamic Token Shift:
d[t] = sigmoid(silu(lerp(x[t], x[t - 1], w1) * w2) * w3)
x[t] = lerp(x[t], x[t - 1], d[t])
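A sketch of this token-shift step over a whole sequence; the weight shapes (in particular the inner hidden size connecting w2 and w3) are assumptions:

```python
import torch
import torch.nn.functional as F

def dynamic_token_shift(x, w1, w2, w3):
    # x: (batch, seq, dim); w1: (dim,) mix weight; w2: (dim, hidden); w3: (hidden, dim)
    x_prev = F.pad(x, (0, 0, 1, 0))[:, :-1]      # x[t - 1], zeros for the first token
    mixed = torch.lerp(x, x_prev, w1)            # static shift with learned weight w1
    d = torch.sigmoid(F.silu(mixed @ w2) @ w3)   # data-dependent mix amount d[t]
    return torch.lerp(x, x_prev, d)              # dynamic token shift
```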
And a gated residual connection:
R -> R + Block(x) * sigmoid(silu(LayerNorm(R) * w1) * w2)
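A sketch of how such a gated residual wrapper could look; the gate's hidden size and the use of bias-free Linear layers for w1 and w2 are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedResidual(nn.Module):
    """Wraps a sub-layer (e.g. a MemodoLayer) with the gated residual above."""

    def __init__(self, block, dim, gate_hidden=None):
        super().__init__()
        self.block = block
        self.norm = nn.LayerNorm(dim)
        gate_hidden = gate_hidden or dim
        self.w1 = nn.Linear(dim, gate_hidden, bias=False)
        self.w2 = nn.Linear(gate_hidden, dim, bias=False)

    def forward(self, residual, x):
        # R -> R + Block(x) * sigmoid(silu(LayerNorm(R) * w1) * w2)
        gate = torch.sigmoid(self.w2(F.silu(self.w1(self.norm(residual)))))
        return residual + self.block(x) * gate
```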