Performance issue with mindspore.ops.normal #192

@fr30

Description

Environment

Hardware Environment(Ascend/GPU/CPU):

/device gpu

Software Environment:

  • MindSpore version (source or binary): 1.8.0
  • Python version (e.g., Python 3.7.5): 3.7.10
  • OS platform and distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
  • GCC/Compiler version (if compiled from source): 7.5.0

Describe the current behavior

mindspore.ops.normal and mindspore.ops.StandardNormal are very slow on GPU. Generating a single random tensor of shape 100x100x100 takes around 6 seconds, which is unacceptable. Other random ops are probably affected as well.

Describe the expected behavior

Random ops should be much faster.

Steps to reproduce the issue

Simply run the code:

from typing import List, Tuple
import time

import mindspore
from mindspore import Tensor, dtype
from mindspore.ops import normal

mindspore.set_seed(2137)


def run_random_calculation(iters: List[int], shapes: List[Tuple[int, ...]], prnt: bool = False):
    """Call normal() iters[i] times for each shapes[i]."""
    assert len(iters) == len(shapes)

    mean = Tensor(0.0, dtype.float32)
    std = Tensor(1.0, dtype.float32)

    for iter_no, shape in zip(iters, shapes):
        for _ in range(iter_no):
            x = normal(shape, mean, std)

            if prnt:
                # Print a small corner of the generated tensor.
                print(x[:2, :2, :1])


warmup_iters = [
    1,
    1,
    0
]
benchmark_iters = [
    0,
    1,
    0
]
shapes = [
    (10, 10, 10),
    (100, 100, 100),
    (500, 500, 100)
]

run_random_calculation(warmup_iters, shapes)

start = time.time()

run_random_calculation(benchmark_iters, shapes)

end = time.time()

print(f'Result \nshapes: {shapes}\niters: {benchmark_iters}\ntime {end - start}')

Related log / screenshot

Result
shapes: [(10, 10, 10), (100, 100, 100), (500, 500, 100)]
iters: [0, 1, 0]
time 5.9976
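
To tell a one-off cost from a per-call cost, it may help to time each call separately after an explicit warm-up. The helper below is a minimal sketch; `time_call` is a hypothetical name, not part of MindSpore, and the workload in the demo is a plain-Python stand-in.

```python
import time
from typing import Callable, List


def time_call(fn: Callable[[], object], warmup: int = 1, iters: int = 3) -> List[float]:
    """Return wall-clock seconds for each of `iters` calls, after `warmup` untimed calls."""
    for _ in range(warmup):
        fn()

    timings = []
    for _ in range(iters):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # Stand-in workload; replace with the op under test.
    timings = time_call(lambda: sum(range(100_000)), warmup=1, iters=3)
    print([f"{t:.6f}" for t in timings])
```

With the script above, the callable could be something like `lambda: normal((100, 100, 100), mean, std).asnumpy()`; calling `.asnumpy()` should force the result to be materialized on the host, assuming execution is otherwise asynchronous.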

Special notes for this issue

The problem lies in the file mindspore/ccsrc/plugin/device/gpu/kernel/cuda_impl/cuda_ops/random_op_impl.cu. The random generation kernels call curand_init() on every iteration, which is an expensive operation. Instead, they could initialize the state once and reuse it, since curand_normal() advances the state passed as its argument.
The problem is described in https://docs.nvidia.com/cuda/curand/device-api-overview.html#performance-notes, which also includes a snippet that helps solve it.
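
The pattern can be illustrated on the CPU with Python's standard `random` module. This is only an analogy, not the CUDA kernel: `random.Random(seed)` stands in for curand_init(), `rng.gauss()` stands in for curand_normal(), and both function names are hypothetical.

```python
import random
from typing import List


def normal_reinit_each_sample(seed: int, n: int) -> List[float]:
    """Build and seed a fresh generator for every sample (the slow pattern)."""
    out = []
    for i in range(n):
        rng = random.Random(seed + i)  # state is set up again on every iteration
        out.append(rng.gauss(0.0, 1.0))
    return out


def normal_reuse_state(seed: int, n: int) -> List[float]:
    """Seed once, then keep drawing; each draw advances the same state."""
    rng = random.Random(seed)  # state is set up a single time
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


if __name__ == "__main__":
    print(normal_reinit_each_sample(2137, 3))
    print(normal_reuse_state(2137, 3))
```

Both functions return n samples, but the second pays the setup cost only once. In the GPU case the equivalent would be to keep the per-thread state between draws instead of rebuilding it.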
