The difference you're seeing here isn't actually from the allocation; it's overhead from the optimizations applied by autotune and fusion. That overhead doesn't really pay off for your minimal use case: if you simply disable these default features, you'll get similar timings.
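For context, a hedged sketch of what running without those defaults could look like; the crate and type paths (and the `JitBackend` generics) are assumptions about the current crate layout, not confirmed by this thread:

```rust
// Hypothetical aliases: the default "fusion<jit>" backend name suggests the
// JIT backend is wrapped in burn_fusion::Fusion, which captures the graph on
// the first iteration. Benchmarking the inner backend directly skips that
// warm-up cost (type paths and generics are assumptions).
type RawJit = burn_jit::JitBackend<burn_cuda::CudaRuntime, f32, i32>;
type FusedJit = burn_fusion::Fusion<RawJit>;
```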
```rust
let mut sum: f32 = 0.0;
for t in out_vec {
    sum += t.sum().into_scalar().to_f32();
}
```
You're syncing the backend in a loop; maybe `Tensor::cat(out_vec).sum()` would be faster? The memory allocation algorithm might be slow, but I wouldn't expect it to slow down only the first iteration if it were the bottleneck. Also, with fusion enabled, we capture the graph during the first iteration to perform some optimizations. We plan to add caching to reduce cold-start lag, but right now we're primarily optimizing for throughput.
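For illustration, a minimal sketch of that single-sync variant; the concatenation dimension and the assumption that all tensors in `out_vec` share a shape are mine, not from the thread:

```rust
// One concatenation, one reduction, one readback: a single device sync
// instead of one per tensor (assumes the tensors in out_vec can be
// concatenated along dim 0).
let sum: f32 = Tensor::cat(out_vec, 0).sum().into_scalar().to_f32();
```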
Describe the bug
The Cuda backend seems to have a performance issue when keeping results in VMEM.
The following example is very slow on the first iteration (about 9 seconds).
Subsequent iterations are fast (about 0.2 seconds).
Copying the VMEM tensor into `TensorData` and back into a VMEM tensor speeds up the first iteration
(`hack = true` in the example).
This was tested on the current main branch.
To Reproduce
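(The original example isn't included above, so the following is a hedged reconstruction from the fragments quoted in the comments: `out_vec`, the per-tensor summing loop, and the `hack = true` round-trip through `TensorData`. The workload itself, shapes, op, and iteration counts, is an assumption, and API details like `B::name()` may differ across burn versions.)

```rust
use std::time::Instant;

use burn::prelude::*;
use burn::tensor::{cast::ToElement, Distribution};

// Hedged reconstruction of the benchmark loop; only `hack` and the
// per-tensor summing loop come from the report, the rest is assumed.
fn bench<B: Backend>(device: &B::Device, hack: bool) {
    println!("testing backend: {:?}, device: {:?}", B::name(), device);
    for i in 0..8 {
        let start = Instant::now();

        // Assumed workload: build a Vec of result tensors on the device.
        let a = Tensor::<B, 2>::random([1024, 1024], Distribution::Default, device);
        let b = Tensor::<B, 2>::random([1024, 1024], Distribution::Default, device);
        let mut out_vec = Vec::new();
        for _ in 0..100 {
            let mut out = a.clone().matmul(b.clone());
            if hack {
                // Round-trip through TensorData; per the report this avoids
                // the slow first iteration on the fused CUDA backend.
                out = Tensor::from_data(out.into_data(), device);
            }
            out_vec.push(out);
        }

        // Per-tensor sum as quoted above (one device sync per tensor).
        let mut sum: f32 = 0.0;
        for t in out_vec {
            sum += t.sum().into_scalar().to_f32();
        }

        println!("i: {i}, sum: {sum}, elapsed: {}", start.elapsed().as_secs_f64());
    }
}
```

Calling this once per backend, e.g. `bench::<burn::backend::NdArray>(&Default::default(), false)` (backend type alias assumed), would produce logs shaped like the output below.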
Expected behavior
Expected less than one second for iteration i: 0 of backend: "fusion<jit>", device: Cuda(0).
```
testing backend: "ndarray", device: Cpu
i: 0, sum: 115605500, elapsed: 0.55636384
i: 1, sum: 115605500, elapsed: 0.462195338
i: 2, sum: 115605500, elapsed: 0.454074468
i: 3, sum: 115605500, elapsed: 0.395695947
i: 4, sum: 115605500, elapsed: 0.39472438
i: 5, sum: 115605500, elapsed: 0.394821323
i: 6, sum: 115605500, elapsed: 0.39547931
i: 7, sum: 115605500, elapsed: 0.395147096
testing backend: "candle", device: Cuda(CudaDevice { device: CudaDevice(DeviceId(1)), index: 0 })
i: 0, sum: 115605500, elapsed: 0.304088321
i: 1, sum: 115605500, elapsed: 0.244612769
i: 2, sum: 115605500, elapsed: 0.243556794
i: 3, sum: 115605500, elapsed: 0.24344835
i: 4, sum: 115605500, elapsed: 0.243947939
i: 5, sum: 115605500, elapsed: 0.243708218
i: 6, sum: 115605500, elapsed: 0.244127436
i: 7, sum: 115605500, elapsed: 0.243139218
testing backend: "tch", device: Cuda(0)
i: 0, sum: 115605500, elapsed: 0.206799282
i: 1, sum: 115605500, elapsed: 0.169857142
i: 2, sum: 115605500, elapsed: 0.168459104
i: 3, sum: 115605500, elapsed: 0.168022733
i: 4, sum: 115605500, elapsed: 0.168296829
i: 5, sum: 115605500, elapsed: 0.171444608
i: 6, sum: 115605500, elapsed: 0.171398742
i: 7, sum: 115605500, elapsed: 0.168834901
testing backend: "fusion<jit>", device: Cuda(0)
i: 0, sum: 115605500, elapsed: 9.259544625
i: 1, sum: 115605500, elapsed: 0.221510606
i: 2, sum: 115605500, elapsed: 0.218899816
i: 3, sum: 115605500, elapsed: 0.219641511
i: 4, sum: 115605500, elapsed: 0.220051723
i: 5, sum: 115605500, elapsed: 0.218108528
i: 6, sum: 115605500, elapsed: 0.218188098
i: 7, sum: 115605500, elapsed: 0.220182168
```
Desktop (please complete the following information):
Fedora 41, NVIDIA CUDA 12.6