
Cuda memory allocation slow #2772

Open
rztz opened this issue Feb 3, 2025 · 3 comments

rztz commented Feb 3, 2025

Describe the bug
The CUDA backend seems to have a performance issue when keeping results in GPU memory (VMEM).

The following example is very slow in the first iteration (about 9 seconds); subsequent iterations are fast (about 0.2 seconds).

Copying the VMEM tensor into TensorData and back into a VMEM tensor speeds up the first iteration (set hack = true in the example).

This was tested on the current main branch.

To Reproduce

// Assumed imports (not shown in the original snippet): `use burn::prelude::*;`
// or the equivalent `burn::tensor::{backend::Backend, Tensor, TensorData}` items,
// plus the trait providing `.to_f32()` on scalar elements.
fn test_memory_alloc() {
    // Load libtorch up front so the LibTorch backend below can find it.
    unsafe { libloading::Library::new("libtorch.so").unwrap() };

    // number of loops
    let loops = 8;

    test_device_memory_alloc::<burn::backend::NdArray>(
        &burn::backend::ndarray::NdArrayDevice::Cpu,
        loops,
    );

    test_device_memory_alloc::<burn::backend::Candle>(
        &burn::backend::candle::CandleDevice::cuda(0),
        loops,
    );

    test_device_memory_alloc::<burn::backend::LibTorch>(
        &burn::backend::libtorch::LibTorchDevice::Cuda(0),
        loops,
    );

    test_device_memory_alloc::<burn::backend::Cuda>(
        &burn::backend::cuda::CudaDevice::new(0),
        loops,
    );
}

fn test_device_memory_alloc<B: Backend>(device: &<B as Backend>::Device, loops: usize) {
    println!("testing backend: {:?}, device: {:?}", B::name(), device);
    const DIM: usize = 4;
    let shape = [32, 4, 84, 84];
    let l = shape.iter().product();
    let data_v = vec![0; l];

    let steps = 128;
    // Set to true to apply the TensorData round-trip workaround described above.
    let hack = false;

    for i in 0..loops {
        let start = std::time::Instant::now();

        let mut in_vec = vec![];
        for _ in 0..steps {
            let td = TensorData::new(data_v.clone(), shape);
            let t: Tensor<B, DIM> = Tensor::from_data(td, device);
            in_vec.push(t);
        }

        let mut out_vec = vec![];
        for mut t in in_vec {
            t = t.add_scalar(1.0);
            if hack {
                // Workaround: round-trip the tensor through host memory (TensorData)
                // and back onto the device.
                let td = t.into_data();
                t = burn::tensor::Tensor::from_data(td, device);
            }
            out_vec.push(t);
        }

        let mut sum: f32 = 0.0;
        for t in out_vec {
            // Each `into_scalar()` call synchronizes the device, once per tensor.
            sum += t.sum().into_scalar().to_f32();
        }

        println!(
            "i: {}, sum: {}, elapsed: {}",
            i,
            sum,
            start.elapsed().as_secs_f64()
        );
    }
}

Expected behavior
Less than one second for iteration i: 0 of backend "fusion<jit>", device Cuda(0).

testing backend: "ndarray", device: Cpu
i: 0, sum: 115605500, elapsed: 0.55636384
i: 1, sum: 115605500, elapsed: 0.462195338
i: 2, sum: 115605500, elapsed: 0.454074468
i: 3, sum: 115605500, elapsed: 0.395695947
i: 4, sum: 115605500, elapsed: 0.39472438
i: 5, sum: 115605500, elapsed: 0.394821323
i: 6, sum: 115605500, elapsed: 0.39547931
i: 7, sum: 115605500, elapsed: 0.395147096
testing backend: "candle", device: Cuda(CudaDevice { device: CudaDevice(DeviceId(1)), index: 0 })
i: 0, sum: 115605500, elapsed: 0.304088321
i: 1, sum: 115605500, elapsed: 0.244612769
i: 2, sum: 115605500, elapsed: 0.243556794
i: 3, sum: 115605500, elapsed: 0.24344835
i: 4, sum: 115605500, elapsed: 0.243947939
i: 5, sum: 115605500, elapsed: 0.243708218
i: 6, sum: 115605500, elapsed: 0.244127436
i: 7, sum: 115605500, elapsed: 0.243139218
testing backend: "tch", device: Cuda(0)
i: 0, sum: 115605500, elapsed: 0.206799282
i: 1, sum: 115605500, elapsed: 0.169857142
i: 2, sum: 115605500, elapsed: 0.168459104
i: 3, sum: 115605500, elapsed: 0.168022733
i: 4, sum: 115605500, elapsed: 0.168296829
i: 5, sum: 115605500, elapsed: 0.171444608
i: 6, sum: 115605500, elapsed: 0.171398742
i: 7, sum: 115605500, elapsed: 0.168834901
testing backend: "fusion<jit>", device: Cuda(0)
i: 0, sum: 115605500, elapsed: 9.259544625
i: 1, sum: 115605500, elapsed: 0.221510606
i: 2, sum: 115605500, elapsed: 0.218899816
i: 3, sum: 115605500, elapsed: 0.219641511
i: 4, sum: 115605500, elapsed: 0.220051723
i: 5, sum: 115605500, elapsed: 0.218108528
i: 6, sum: 115605500, elapsed: 0.218188098
i: 7, sum: 115605500, elapsed: 0.220182168

Desktop:
Fedora 41, NVIDIA CUDA 12.6

laggui (Member) commented Feb 3, 2025

The difference you're seeing here is not actually from the allocation, but from the overhead of the optimizations applied by autotune and fusion. They don't really make sense for your minimal use case; if you simply disable these default features you'll get similar timings.
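For reference, disabling those features is a Cargo-level change. A rough sketch, assuming the relevant burn feature names (check the crate's Cargo manifest for the authoritative list and version):

# Hypothetical Cargo.toml sketch: opt out of burn's default features (which include
# the fusion and autotune optimizations mentioned above) and re-enable only what the
# repro needs. Feature names here are assumptions, not confirmed by this thread.
[dependencies]
burn = { version = "*", default-features = false, features = ["std", "cuda", "ndarray", "candle", "tch"] }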

rztz (Author) commented Feb 3, 2025

If steps is increased from 128 to 256, the first iteration takes 75 seconds. The slowdown appears to grow exponentially with the number of steps.

nathanielsimard (Member) commented

Maybe it's this part that is slow:

let mut sum: f32 = 0.0;
for t in out_vec {
    sum += t.sum().into_scalar().to_f32();
}

You're syncing the backend in a loop; maybe Tensor::cat(out_vec).sum() would be faster? The memory allocation algorithm might be slow, but I wouldn't expect it to slow down only the first iteration if it were the bottleneck. Also, if fusion is enabled, we're capturing the graph during the first iteration to perform some optimizations. We plan to add caching to reduce the cold-start lag, but right now we're really optimizing for throughput.
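For illustration, a minimal sketch of that single-sync reduction, assuming the same imports and generic setup as the repro above (Tensor::cat takes the concatenation dimension explicitly, so dim 0 is used here):

// Sketch: concatenate all result tensors along dim 0 and reduce once, so the
// backend is synchronized a single time instead of once per tensor in the loop.
fn sum_all<B: Backend>(out_vec: Vec<Tensor<B, 4>>) -> f32 {
    Tensor::cat(out_vec, 0).sum().into_scalar().to_f32()
}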
