Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/sycl support #146

Open
wants to merge 3 commits into
base: develop
Choose a base branch
from
Open

Conversation

slgogar
Copy link

@slgogar slgogar commented Oct 7, 2021

Creating a pull request for adding a SYCL/DPC++ port for ZFP compression Library. Steps to use the compression library are same as the development branch.

git checkout feature/sycl-support
mkdir build
cmake -DZFP_WITH_SYCL=1 ..
make -j
make test

@slgogar slgogar closed this Oct 7, 2021
@slgogar slgogar reopened this Oct 7, 2021
@Michoumichmich
Copy link

Michoumichmich commented Oct 7, 2021

Hello @slgogar, I've worked on porting the back-end to SYCL too. I'm getting 80% of the CUDA performance. Where are you at ? Porting the code using DPCT yields bad performance on CUDA

@lindstro
Copy link
Member

lindstro commented Oct 7, 2021

@slgogar I really appreciate your contribution. I've been in contact with @Michoumichmich about his SYCL port for a few months now. It would be great to hear from both of you regarding performance and which features are currently supported in each implementation (e.g., compression modes, scalar types, array dimensionalities).

As both of you probably know, we have experimental CUDA and HIP implementations of variable-rate compression as well as a separate implementation that addresses some of the issues with parallel decompression. We're still weeding out some (mostly benign) bugs in these implementations, which currently do not pass all tests. I'd be curious how your SYCL implementations do on the latest CUDA unit tests (see the feature/cuda-variable-rate branch) and the testexec example that's on the feature/hip-support branch.

Regarding PR submissions, we're getting close to releasing another version of zfp that's currently being worked on on the develop branch, so I would prefer that PRs from either of you be submitted on a separate feature branch as we won't have time to incorporate this PR into the upcoming release. I will set up a new branch but am not yet sure whose implementation to merge.

@slgogar
Copy link
Author

slgogar commented Oct 8, 2021

Hi @lindstro I started porting the current development branch using Intel Data Parallel C++ Compatibility Tool. So the feature set for accelerator device is not very different then what is supported in that branch. After your comment i tested and added slightly modified version (enabling ZFP_WITH_SYCL) of testexec.cpp and as expected i noticed that the tests for 1,2 and 3 dimensionality passes for floating point and double precision data for Fixed rate mode only (i.e. all 192 and 384 tests passed with SYCL backend).
@Michoumichmich I wasn't aware of this ongoing effort to enable SYCL backend. Can you please share a link to the referred branch? Is there a particular test you are using to compare performance? I am happy to collaborate as needed and do not mind which implementation moves forward as long the overall goal is accomplished - which is to enable SYCL backend for ZFP.

@Michoumichmich
Copy link

Michoumichmich commented Oct 12, 2021

@Michoumichmich I wasn't aware of this ongoing effort to enable SYCL backend. Can you please share a link to the referred branch? Is there a particular test you are using to compare performance? I am happy to collaborate as needed and do not mind which implementation moves forward as long the overall goal is accomplished - which is to enable SYCL backend for ZFP.

Hello @slgogar, You may find my version of the implementation here: https://github.com/Michoumichmich/zfp/tree/feature/sycl-variable-rate. To print the compression/decompression bandwidth you can use the macro SYCL_ZFP_RATE_PRINT in both of the implementations. I also ported the variable-rate version of the implementation. It took way more time than expected because it raised several bugs in DPC++ which were tricky to track down. The biggest performance improvements I had were on GPU because DPC++ is very conservative on loop unrolling. Benchmarking a simple port from CUDA to SYCL (just changing the API calls) lead to a 5x slowdown!

You will find the datasets used to benchmark ZFP here: https://sdrbench.github.io

@lindstro
Copy link
Member

@slgogar, @Michoumichmich: I'd like to make sure we make progress on supporting SYCL.

Now that you are aware of your parallel efforts, can you please give me a status report? Would be nice to get a sense of relative performance, features supported, limitations, correctness (via testing), etc.

@Michoumichmich
Copy link

Hello @lindstro,
I will follow-up with the benchmarks. Regarding the tests, for the fixed-rate compression everything passes on the GPU as well as the new tests from feature/cuda-variable-rate branch. However, I still have failures (3D fp32/fp64 decompression) when running the same code on the OpenCL/CPU back-end. I will take a look at this code to see if I can find something, after ensuring that this code passes the tests too

@bassantmagdy900
Copy link

@Michoumichmich could you please tell me the oneApi version that is used in your branch?

@Michoumichmich
Copy link

Michoumichmich commented Dec 30, 2021

@Michoumichmich could you please tell me the oneApi version that is used in your branch?

Hi, I worked on the tip of the Intel/llvm as bugs got fixed (one was that dpc++ was crashing when building zfp with the variable rate implementation on CUDA). It works with all versions released since september

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants