Guidelines for converting a CUDA (CuPy) kernel to an OpenCL (ClPy) kernel

Thread, Grid, Block -> Work Group, Work Item

threadIdx.{x,y,z} -> get_local_id({0, 1, 2})

blockDim.{x,y,z} -> get_local_size({0, 1, 2})

blockIdx.{x,y,z} -> get_group_id({0, 1, 2})
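
As a rough illustration of how these three mappings combine, the common CUDA idiom `blockIdx.x * blockDim.x + threadIdx.x` becomes `get_group_id(0) * get_local_size(0) + get_local_id(0)`, which OpenCL also exposes directly as `get_global_id(0)` when the launch uses no global offset. The sketch below models that arithmetic in plain host-side C. The struct `work_item_1d` and the function `flat_index` are hypothetical names used only for illustration; they are not part of CUDA, OpenCL, CuPy, or ClPy.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side model of one work item in 1D (illustrative only, not kernel code). */
typedef struct {
    size_t group_id;   /* CUDA blockIdx.x  -> OpenCL get_group_id(0)   */
    size_t local_size; /* CUDA blockDim.x  -> OpenCL get_local_size(0) */
    size_t local_id;   /* CUDA threadIdx.x -> OpenCL get_local_id(0)   */
} work_item_1d;

/* Flat index of a work item: blockIdx.x * blockDim.x + threadIdx.x in CUDA terms. */
static size_t flat_index(work_item_1d w)
{
    assert(w.local_id < w.local_size); /* a local id always lies inside its group */
    return w.group_id * w.local_size + w.local_id;
}
```

For example, the work item with local id 5 in group 3 of a launch with 32 work items per group has flat index 3 * 32 + 5 = 101.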

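CUDA and OpenCL describe a launch differently. CUDA specifies the number of blocks in the grid (grid size) and the number of threads per block (block size), whereas OpenCL specifies the total number of work items (global work size) and the number of work items per work group (local work size).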

To launch a total of 1024 threads in groups of 32 in 1D:

| CUDA | OpenCL |
| --- | --- |
| `blocksize = (32, 1, 1)`, `gridsize = (32, 1, 1)` | `global_work_size = (1024, 1, 1)`, `local_work_size = (32, 1, 1)` |
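
The conversion in this example can be sketched as a small host-side C helper: the local work size is the CUDA block size, and the global work size is the grid size multiplied by the block size in each dimension. The function name `cuda_to_opencl_sizes` is hypothetical and is not taken from CuPy or ClPy.

```c
#include <assert.h>
#include <stddef.h>

/* Convert a CUDA launch configuration into OpenCL NDRange sizes.
 * Hypothetical helper for illustration; not part of CuPy or ClPy. */
static void cuda_to_opencl_sizes(const size_t gridsize[3],
                                 const size_t blocksize[3],
                                 size_t global_work_size[3],
                                 size_t local_work_size[3])
{
    for (int d = 0; d < 3; ++d) {
        assert(gridsize[d] > 0 && blocksize[d] > 0);
        /* One OpenCL work group corresponds to one CUDA block. */
        local_work_size[d] = blocksize[d];
        /* OpenCL counts all work items, not the number of groups. */
        global_work_size[d] = gridsize[d] * blocksize[d];
    }
}
```

With `gridsize = (32, 1, 1)` and `blocksize = (32, 1, 1)`, this yields `global_work_size = (1024, 1, 1)` and `local_work_size = (32, 1, 1)`.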

__syncthreads()
-> barrier(CLK_LOCAL_MEM_FENCE)

If ultima is applied, these changes are not necessary.

CArray<T, N> arr
-> __global T* arr, CArray_N arr_info

arr.size()
-> arr_info.size_

arr[I]
-> arr[get_CArrayIndexI_N(&arr_info, I)/sizeof(<type of arr[0]>)]
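
The division by `sizeof` in the last mapping suggests that the index helper returns an offset in bytes, while subscripting a `__global T*` counts in elements. That reading is an assumption, not something stated above. Under that assumption, the conversion looks roughly like the plain C sketch below, where `element_index` is a hypothetical stand-in and not ClPy's actual helper.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in (not ClPy code): turn a byte offset into an
 * element index, assuming the offset is a multiple of the element size. */
static size_t element_index(size_t byte_offset, size_t elem_size)
{
    assert(elem_size > 0);
    assert(byte_offset % elem_size == 0); /* offset must land on an element boundary */
    return byte_offset / elem_size;
}
```

For a `float` array, a byte offset of `3 * sizeof(float)` corresponds to element index 3.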