Performance issue with mindspore.ops.normal

## Environment
### Hardware Environment(`Ascend`/`GPU`/`CPU`): 

`/device gpu`

### Software Environment:
- **MindSpore version (source or binary)**: 1.8.0
- **Python version (e.g., Python 3.7.5)**: 3.7.10
- **OS platform and distribution (e.g., Linux Ubuntu 16.04)**: Ubuntu 18.04
- **GCC/Compiler version (if compiled from source)**: 7.5.0

## Describe the current behavior
`mindspore.ops.normal` and `mindspore.ops.StandardNormal` are extremely slow on GPU. Generating a single random tensor of shape 100x100x100 takes around 6 seconds, which is unacceptable. The same problem probably affects other random ops as well.

## Describe the expected behavior
Random ops should be much faster.

## Steps to reproduce the issue
Run the following code:

```python
from mindspore import Tensor, dtype
from typing import List, Tuple
from mindspore.ops import normal
import time
import mindspore

mindspore.set_seed(2137)


def run_random_calculation(iters: List[int], shapes: List[Tuple[int, ...]], prnt=False):
    # iters[i] is the number of tensors to generate with shape shapes[i].
    assert len(iters) == len(shapes)

    mean = Tensor(0.0, dtype.float32)
    std = Tensor(1.0, dtype.float32)

    for iter_no, shape in zip(iters, shapes):
        for _ in range(iter_no):
            x = normal(shape, mean, std)

            if prnt:
                # Print a small corner of the generated tensor.
                print(x[:2, :2, :1])


warmup_iters = [
    1,
    1,
    0
]
benchmark_iters = [
    0,
    1,
    0
]
shapes = [
    (10, 10, 10),
    (100, 100, 100),
    (500, 500, 100)
]

run_random_calculation(warmup_iters, shapes)

start = time.time()

run_random_calculation(benchmark_iters, shapes)

end = time.time()

print(f'Result \nshapes: {shapes}\niters: {benchmark_iters}\ntime {end - start}')
```

## Related log / screenshot
Result
shapes: [(10, 10, 10), (100, 100, 100), (500, 500, 100)]
iters: [0, 1, 0]
time 5.9976

## Special notes for this issue
The problem lies in the file `mindspore/ccsrc/plugin/device/gpu/kernel/cuda_impl/cuda_ops/random_op_impl.cu`. The random generation kernels call `curand_init()` on every iteration, which is an expensive operation. Instead, they could initialize the state once and reuse it, since `curand_normal()` advances the state passed as its argument.
The problem is described in https://docs.nvidia.com/cuda/curand/device-api-overview.html#performance-notes, which also includes a snippet that helps solve it.
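
To illustrate the suggested direction, here is a minimal standalone sketch of the "initialize once, reuse the state" pattern from the cuRAND device API. It is not the MindSpore code: the kernel names `SetupStates` and `GenerateNormal`, the launch configuration, and the buffer handling are all hypothetical, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// One-time setup: each thread initializes its own generator state.
// This is the expensive step, so it runs once rather than per element.
__global__ void SetupStates(curandState *states, unsigned long long seed) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  // Same seed, a distinct subsequence per thread, offset 0.
  curand_init(seed, tid, 0, &states[tid]);
}

// Generation: reuse the stored state; curand_init() is not called here.
__global__ void GenerateNormal(curandState *states, float *out, size_t n) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  size_t stride = static_cast<size_t>(gridDim.x) * blockDim.x;
  curandState local = states[tid];  // work on a local copy of the state
  for (size_t i = tid; i < n; i += stride) {
    out[i] = curand_normal(&local);  // each call advances `local`
  }
  states[tid] = local;  // persist so a later launch continues the sequence
}

int main() {
  const size_t n = 100 * 100 * 100;  // same element count as the 100x100x100 case
  const int threads = 256;           // hypothetical launch configuration
  const int blocks = 64;

  curandState *states = nullptr;
  float *out = nullptr;
  cudaMalloc(&states, sizeof(curandState) * threads * blocks);
  cudaMalloc(&out, sizeof(float) * n);

  SetupStates<<<blocks, threads>>>(states, 2137ULL);
  GenerateNormal<<<blocks, threads>>>(states, out, n);

  std::vector<float> host(n);
  cudaMemcpy(host.data(), out, sizeof(float) * n, cudaMemcpyDeviceToHost);
  printf("first sample: %f\n", host[0]);

  cudaFree(out);
  cudaFree(states);
  return 0;
}
```

The sketch should build with `nvcc` alone, since the cuRAND device API is header-only. Whether the states are best kept alive across operator calls or set up once per launch inside MindSpore is a design decision for the maintainers.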


