SM90 Support #126

Closed
Changes from all commits (25 commits)
63782d4 - initial enablement of SM90 macros (AD2605, Aug 20, 2024)
bb1a825 - resolve remaining compilation issues (AD2605, Aug 20, 2024)
d5975e8 - enable a few more macros (AD2605, Aug 20, 2024)
9ebafdc - try by setting __CUDA_ARCH_FEAT_SM90_ALL via cmake (AD2605, Aug 20, 2024)
d49a52e - enable path in gmma_builder.inl (AD2605, Aug 20, 2024)
d31c736 - minor bug fix (AD2605, Aug 21, 2024)
546934c - set __CUDA_ARCH_FEAT_SM90_ALL via CMakeLists.txt (AD2605, Aug 21, 2024)
e1f5516 - fix mistake in FindDPCPP.cmake (AD2605, Aug 21, 2024)
9495689 - remove manual setting of sm_90a macro after bug-fix in FindDPCPP.cmake (AD2605, Aug 21, 2024)
21fc904 - reduce scope for warpgroup_reg_alloc/dealloc (AD2605, Aug 22, 2024)
628ecb4 - add event from cluster launch to event manager (AD2605, Aug 23, 2024)
3206cea - enable some pathways in epilogue folders (AD2605, Aug 23, 2024)
2ca58ae - enable error reporting in cuda path (AD2605, Aug 23, 2024)
5e3802f - remove .cpp example file, use .cu instead (AD2605, Aug 23, 2024)
e0f5a3e - add a forgotten __SYCL_CUDA_ARCH__ in static_tile_scheduler.hpp (AD2605, Aug 23, 2024)
f88ea00 - some missed cuda archs in cutlass.h (AD2605, Aug 25, 2024)
0c9d5e1 - add a runtime error, and a TODO comment (AD2605, Aug 26, 2024)
d1badb7 - fix issues with pvc build (AD2605, Aug 26, 2024)
d3d97ab - use __PTX_VERSION__ macro, add comments in cmake, and remove noise fr… (AD2605, Aug 27, 2024)
c9e395e - Revert "use __PTX_VERSION__ macro, add comments in cmake, and remove …" (AD2605, Aug 27, 2024)
73b6c7d - move initialization out of timing iterations (AD2605, Aug 27, 2024)
dc56737 - selectively use __PTX_VERSION__ macro (AD2605, Aug 27, 2024)
3798c3a - disable cluster launch if using icpx (AD2605, Aug 27, 2024)
83f0547 - remove mistakenly added spaces and new lines from the PR (AD2605, Aug 27, 2024)
386b14b - Add comment as to why cuda is included in sm_90 SYCL path for now (AD2605, Aug 27, 2024)
4 changes: 3 additions & 1 deletion cmake/FindDPCPP.cmake
@@ -38,7 +38,7 @@ find_library(DPCPP_LIB_DIR NAMES sycl sycl6 PATHS "${DPCPP_BIN_DIR}/../lib")

add_library(DPCPP::DPCPP INTERFACE IMPORTED)

-set(DPCPP_FLAGS "-fsycl;")
+set(DPCPP_FLAGS "-fsycl;-mllvm;-enable-global-offset=false;")
Collaborator:
This flag was moved to line 58:

 list(APPEND DPCPP_COMPILE_ONLY_FLAGS; "-mllvm;-enable-global-offset=false;")

AD2605 (author), Aug 26, 2024:
There was a TODO comment, which I thought I had added as part of 0c9d5e1 here, about investigating why this line is needed.
I was aware of this change, but for some reason I was still seeing a *_with_offset kernel, hence I added this as a temporary fix. This is also partly the reason why this PR is a draft.

set(DPCPP_COMPILE_ONLY_FLAGS "")

if(NOT "${DPCPP_SYCL_TARGET}" STREQUAL "")
@@ -51,6 +51,8 @@ endif()

if(NOT "${DPCPP_SYCL_ARCH}" STREQUAL "")
if("${DPCPP_SYCL_TARGET}" STREQUAL "nvptx64-nvidia-cuda")
+    list(APPEND DPCPP_FLAGS "-fno-sycl-decompose-functor;") # To enable GRID_CONSTANT-like behaviour
+    list(APPEND DPCPP_FLAGS "-fgpu-inline-threshold=1000000;")
list(APPEND DPCPP_FLAGS "-Xsycl-target-backend")
list(APPEND DPCPP_FLAGS "--cuda-gpu-arch=${DPCPP_SYCL_ARCH}")
list(APPEND DPCPP_COMPILE_ONLY_FLAGS; "-mllvm;-enable-global-offset=false;")
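For context on the GRID_CONSTANT remark in the hunk above: in native CUDA C++, a TMA descriptor is typically passed to the kernel as a const __grid_constant__ by-value parameter, and -fno-sycl-decompose-functor presumably keeps the SYCL functor intact so the descriptor can be handled the same way. A minimal sketch with a hypothetical kernel (illustrative only, not code from this PR):

#include <cuda.h>  // CUtensorMap (CUDA 12.x driver API)

// __grid_constant__ (nvcc 11.7+, sm_70+) places the const by-value parameter
// in grid-constant memory, where cp.async.bulk.tensor expects its descriptor.
__global__ void tma_consumer(const __grid_constant__ CUtensorMap tensor_map) {
  // ... issue TMA copies referencing &tensor_map ...
}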
@@ -76,12 +76,20 @@
#include "cutlass/util/tensor_view_io.h"
#include "cutlass/util/reference/device/gemm.h"
#include "cutlass/util/reference/device/tensor_compare.h"
+#if defined(SYCL_NVIDIA_TARGET)
+#include "cutlass/util/reference/device/sycl_tensor_fill.h"
+#else
#include "cutlass/util/reference/device/tensor_fill.h"
+#endif

#include "helper.h"

using namespace cute;

+#if defined(SYCL_NVIDIA_TARGET)
+using namespace cutlass;
+#endif
+
Comment on lines +89 to +92
Collaborator:
Why is this needed?

AD2605 (author):
Because types like cudaError_t and cudaSuccess are defined in the cutlass namespace in the non-CUDA path.
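A minimal sketch of what this presumably amounts to (the definitions below are illustrative approximations, not the SYCL port's actual code):

// Hypothetical approximation of the non-CUDA path: the CUDA-style error
// enum lives in namespace cutlass instead of the global namespace.
namespace cutlass {
enum cudaError_t { cudaSuccess = 0, cudaErrorUnknown = 1 };
}

// With `using namespace cutlass;`, the example's existing CUDA-style checks,
// such as `if (err != cudaSuccess) { ... }`, compile unchanged in the SYCL build.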

#if defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED)

/////////////////////////////////////////////////////////////////////////////////////////////////
@@ -379,7 +387,11 @@ bool verify(const Options &options) {
ref_D);

// Wait for kernel to finish
-CUDA_CHECK(cudaDeviceSynchronize());
+#if defined(SYCL_NVIDIA_TARGET)
+syclcompat::wait_and_throw();
+#else
+CUDA_CHECK(cudaDeviceSynchronize());
+#endif

// Check if output from CUTLASS kernel and reference kernel are equal or not
bool passed = cutlass::reference::device::BlockCompareEqual(block_ref_D.get(), block_D.get(), block_D.size());
@@ -427,10 +439,10 @@ int run(Options &options)
// Run profiling loop
if (options.iterations > 0)
{
+CUTLASS_CHECK(gemm.initialize(arguments, workspace.get()));
GpuTimer timer;
timer.start();
for (int iter = 0; iter < options.iterations; ++iter) {
-CUTLASS_CHECK(gemm.initialize(arguments, workspace.get()));
CUTLASS_CHECK(gemm.run());
}
timer.stop();
@@ -466,6 +478,7 @@ int main(int argc, char const **args) {

// CUTLASS must be compiled with CUDA 12.0 Toolkit to run this example
// and must have compute capability at least 90.
+#if !defined(SYCL_NVIDIA_TARGET)
if (__CUDACC_VER_MAJOR__ < 12) {
std::cerr << "This example requires CUDA 12 or newer.\n";
// Returning zero so this test passes on older Toolkits. Its actions are no-op.
@@ -483,6 +496,7 @@ int main(int argc, char const **args) {
<< "later (compute capability 90 or greater).\n";
return 0;
}
+#endif
//
// Parse options
//
6 changes: 6 additions & 0 deletions examples/CMakeLists.txt
@@ -51,6 +51,11 @@ function(cutlass_example_add_executable NAME)
if (NOT CUTLASS_ENABLE_SYCL)
SET(ADD_CUDA ON)
endif()

+if (DPCPP_SYCL_ARCH STREQUAL "sm_90a")
+  # This is needed to call the function that initialises the TMA descriptor
+  SET(ADD_CUDA ON)
+endif()

target_link_libraries(
${NAME}
@@ -155,6 +160,7 @@ if (NOT CUTLASS_ENABLE_SYCL)
else()
foreach(EXAMPLE
14_ampere_tf32_tensorop_gemm
+48_hopper_warp_specialized_gemm
cute
sycl
)
14 changes: 8 additions & 6 deletions include/cute/arch/cluster_sm90.hpp
@@ -34,11 +34,13 @@

// Config
#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900) && \
-((__CUDACC_VER_MAJOR__ >= 12) || ((__CUDACC_VER_MAJOR__ == 11) && (__CUDACC_VER_MINOR__ >= 8))))
+((__CUDACC_VER_MAJOR__ >= 12) || ((__CUDACC_VER_MAJOR__ == 11) && (__CUDACC_VER_MINOR__ >= 8)))) || \
+(defined(__SYCL_CUDA_ARCH__) && (__SYCL_CUDA_ARCH__ >= 900) && defined(__PTX_VERSION__) && (__PTX_VERSION__ >= 80))
# define CUTE_ARCH_CLUSTER_SM90_ENABLED
#endif

-#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900) && (__CUDACC_VER_MAJOR__ >= 12))
+#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900) && (__CUDACC_VER_MAJOR__ >= 12)) || \
+(defined(__SYCL_CUDA_ARCH__) && (__SYCL_CUDA_ARCH__ >= 900) && defined(__PTX_VERSION__) && (__PTX_VERSION__ >= 80))
# define CUTE_ARCH_ELECT_ONE_SM90_ENABLED
#endif

@@ -192,8 +194,8 @@ CUTE_HOST_DEVICE uint32_t elect_one_sync()
: "+r"(laneid), "+r"(pred)
: "r"(0xFFFFFFFF));
return pred;
-#elif defined(__CUDA_ARCH__)
-return (threadIdx.x % 32) == 0;
+#elif defined(__CUDA_ARCH__) || defined(__SYCL_CUDA_ARCH__)
+return (ThreadIdxX() % 32) == 0;
#else
return true;
#endif
@@ -222,8 +224,8 @@
: "+r"(laneid), "+r"(pred)
: "r"(0xFFFFFFFF));
return {pred, laneid};
#elif defined(__CUDA_ARCH__)
return {(threadIdx.x % 32) == 0, 0};
#elif defined(__CUDA_ARCH__) || defined(__SYCL_CUDA_ARCH__)
return {(ThreadIdxX() % 32) == 0, 0};
#else
return {true, 0};
#endif
Expand Down
6 changes: 4 additions & 2 deletions include/cute/arch/copy_sm90.hpp
@@ -35,13 +35,15 @@
#include <cute/arch/copy.hpp>

// Config
-#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900) && (__CUDACC_VER_MAJOR__ >= 12))
+#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900) && (__CUDACC_VER_MAJOR__ >= 12)) || \
+(defined(__SYCL_CUDA_ARCH__) && (__SYCL_CUDA_ARCH__ >= 900) && defined(__PTX_VERSION__) && (__PTX_VERSION__ >= 80))
# define CUTE_ARCH_STSM_SM90_ENABLED
# define CUTE_ARCH_TMA_SM90_ENABLED
#endif

#if defined(CUTE_ARCH_TMA_SM90_ENABLED) && \
-((__CUDACC_VER_MAJOR__ > 12) || ((__CUDACC_VER_MAJOR__ == 12) && (__CUDACC_VER_MINOR__ >= 3)))
+(((__CUDACC_VER_MAJOR__ > 12) || ((__CUDACC_VER_MAJOR__ == 12) && (__CUDACC_VER_MINOR__ >= 3))) || \
+defined(__PTX_VERSION__) && (__PTX_VERSION__ >= 80))
# define CUTE_ARCH_DEVICE_MODIFIABLE_TMA_SM90_ENABLED
#endif

4 changes: 2 additions & 2 deletions include/cute/arch/copy_sm90_desc.hpp
@@ -30,7 +30,7 @@
**************************************************************************************************/
#pragma once

-#if !defined(__CUDACC_RTC__) && !defined(CUTLASS_ENABLE_SYCL)
+#if (!defined(__CUDACC_RTC__) && !defined(CUTLASS_ENABLE_SYCL)) || defined(SYCL_NVIDIA_TARGET)
#include <cuda.h>
#include <cinttypes>
#endif
@@ -176,7 +176,7 @@ enum class CacheHintSm90 : uint64_t {
EVICT_LAST = 0x14F0000000000000,
};

-#if (__CUDACC_VER_MAJOR__ >= 12)
+#if (__CUDACC_VER_MAJOR__ >= 12) || defined(SYCL_NVIDIA_TARGET)

#if !defined(__CUDACC_RTC__)
/// @return The TMA descriptor datatype enum corresponding to T.
4 changes: 2 additions & 2 deletions include/cute/arch/copy_sm90_tma.hpp
@@ -355,7 +355,7 @@ struct SM90_TMA_LOAD_IM2COL_3D
uint64_t gmem_int_desc = reinterpret_cast<uint64_t>(desc_ptr);
uint32_t smem_int_mbar = cast_smem_ptr_to_uint(mbar_ptr);
uint32_t smem_int_ptr = cast_smem_ptr_to_uint(smem_ptr);
-// Copy from global to shared::cluster.
+// Copy from global to shared::cluster
Collaborator:
Revert?

asm volatile (
"cp.async.bulk.tensor.3d.shared::cluster.global.im2col.mbarrier::complete_tx::bytes"
" [%0], [%1, {%3, %4, %5}], [%2], {%6};"
@@ -1113,7 +1113,7 @@ CUTE_HOST_DEVICE static void
tma_store_fence() {
#if defined(CUTE_ARCH_TMA_SM90_ENABLED)
asm volatile ("fence.proxy.async.shared::cta;");
-#elif defined(__CUDA_ARCH__)
+#elif defined(__CUDA_ARCH__) || defined(__SYCL_CUDA_ARCH__)
CUTE_INVALID_CONTROL_PATH("Trying to use tma without CUTE_ARCH_TMA_SM90_ENABLED.");
#endif
}
3 changes: 2 additions & 1 deletion include/cute/arch/mma_sm90.hpp
@@ -36,7 +36,8 @@
#include <cute/arch/mma.hpp>

// Config
-#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
+#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)) || \
+(defined(__SYCL_CUDA_ARCH__) && (__SYCL_CUDA_ARCH__ >= 900))
# define CUTE_ARCH_MMA_SM90_ENABLED
# define CUTE_ARCH_MMA_F64_SM90_ENABLED
#endif
4 changes: 3 additions & 1 deletion include/cute/arch/mma_sm90_desc.hpp
@@ -36,7 +36,9 @@
#include <cute/arch/mma.hpp>

// Config
-#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900) && defined(__CUDA_ARCH_FEAT_SM90_ALL))
+#if ((defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)) || \
+(defined(__SYCL_CUDA_ARCH__) && (__SYCL_CUDA_ARCH__ >= 900))) && \
+defined(__CUDA_ARCH_FEAT_SM90_ALL)
# define CUTE_ARCH_MMA_SM90A_ENABLED
#endif

8 changes: 5 additions & 3 deletions include/cute/arch/mma_sm90_gmma.hpp
@@ -33,7 +33,9 @@
#include <cute/config.hpp>
#include <cute/arch/mma.hpp>
// Config
-#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900) && defined(__CUDA_ARCH_FEAT_SM90_ALL))
+#if ((defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)) || \
+(defined(__SYCL_CUDA_ARCH__) && (__SYCL_CUDA_ARCH__ >= 900))) && \
+defined(__CUDA_ARCH_FEAT_SM90_ALL)
# define CUTE_ARCH_MMA_SM90A_ENABLED
#endif

@@ -84,15 +86,15 @@ warpgroup_fence_operand(uint32_t& reg) {
// MSVC emits a build error for 'asm volatile'
// even if it only occurs in a __device__ function.
// This prevents the error.
-#if defined(__CUDA_ARCH__)
+#if defined(__CUDA_ARCH__) || defined(__SYCL_CUDA_ARCH__)
Collaborator:
This SYCL_CUDA_ARCH seems to create a lot of noise in the code. Can we wrap it up with CUDA_ARCH?

AD2605 (author):
No, we cannot do that yet. There is a CUDA compatibility flag planned (-fsycl-cuda-compatibility) which will define __CUDA_ARCH__ and more, but those will be somewhat more involved changes: a lot of nvcc-specific code, namely nvcc intrinsics, is still shielded by __CUDA_ARCH__. That does not pertain to the functionality added in this PR and would come as part of a later PR.
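One way such a wrapper could eventually look, sketched from this discussion (the macro name is hypothetical and this is not part of the PR):

// Hypothetical consolidation once -fsycl-cuda-compatibility lands:
#if defined(__CUDA_ARCH__)
#  define CUTE_ANY_CUDA_ARCH __CUDA_ARCH__
#elif defined(__SYCL_CUDA_ARCH__)
#  define CUTE_ANY_CUDA_ARCH __SYCL_CUDA_ARCH__
#endif

// Guards like the ones in this file would then collapse to a single form:
//   #if defined(CUTE_ANY_CUDA_ARCH) && (CUTE_ANY_CUDA_ARCH >= 900)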

asm volatile("" : "+r"(reg) :: "memory");
#endif
}

CUTE_HOST_DEVICE
void
warpgroup_fence_operand(float& reg) {
-#if defined(__CUDA_ARCH__)
+#if defined(__CUDA_ARCH__) || defined(__SYCL_CUDA_ARCH__)
asm volatile("" : "+f"(reg) :: "memory");
#endif
}
2 changes: 1 addition & 1 deletion include/cute/atom/copy_atom.hpp
@@ -762,7 +762,7 @@ print_latex_copy(LayoutS const& S, ThrIDS const& TS, // (m,n) -> (tid,vid) and
#include <cute/atom/copy_traits_sm90.hpp>

// Config
-#if (__CUDACC_VER_MAJOR__ >= 12)
+#if (__CUDACC_VER_MAJOR__ >= 12) || defined(SYCL_NVIDIA_TARGET)
Collaborator:
Can we use the PTX version for SYCL instead of SYCL_NVIDIA_TARGET? SYCL_NVIDIA_TARGET is more generic than versioning.
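A sketch of the suggested guard, assuming the __PTX_VERSION__ pattern used elsewhere in this PR carries over (not the committed code):

#if (__CUDACC_VER_MAJOR__ >= 12) || \
    (defined(__SYCL_CUDA_ARCH__) && (__SYCL_CUDA_ARCH__ >= 900) && \
     defined(__PTX_VERSION__) && (__PTX_VERSION__ >= 80))
# define CUTE_COPY_ATOM_TMA_SM90_ENABLED
#endif

This would mirror the guards the PR adds to cluster_sm90.hpp and copy_sm90.hpp.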

# define CUTE_COPY_ATOM_TMA_SM90_ENABLED
#endif

21 changes: 11 additions & 10 deletions include/cute/atom/copy_traits_sm90_tma.hpp
@@ -78,8 +78,8 @@ struct TMA_LOAD_Unpack
#if 0
auto [c0,c1,c2,c3,c4] = append<5>(src_coord, 0);
printf("THR (%d,%d,%d) BLK (%d,%d,%d) TMACRD (%d,%d,%d,%d,%d) SMEMADDR (%p)\n",
-threadIdx.x, threadIdx.y, threadIdx.z,
-blockIdx.x, blockIdx.y, blockIdx.z,
+ThreadIdxX(), ThreadIdxY(), ThreadIdxZ(),
+BlockIdxX(), BlockIdxY(), BlockIdxZ(),
int32_t(c0), int32_t(c1), int32_t(c2), int32_t(c3), int32_t(c4), dst_ptr);
#endif
return detail::explode_tuple(detail::CallCOPY<CopyOp>{},
@@ -314,8 +314,8 @@ struct TMA_STORE_Unpack
#if 0
auto [c0,c1,c2,c3,c4] = append<5>(dst_coord, 0);
printf("THR (%d,%d,%d) BLK (%d,%d,%d) TMACRD (%d,%d,%d,%d,%d) SMEMADDR (%p)\n",
-threadIdx.x, threadIdx.y, threadIdx.z,
-blockIdx.x, blockIdx.y, blockIdx.z,
+ThreadIdxX(), ThreadIdxY(), ThreadIdxZ(),
+BlockIdxX(), BlockIdxY(), BlockIdxZ(),
int32_t(c0), int32_t(c1), int32_t(c2), int32_t(c3), int32_t(c4), src_ptr);
#endif
return detail::explode_tuple(detail::CallCOPY<SM90_TMA_STORE>{},
@@ -375,8 +375,8 @@ struct Copy_Traits<SM90_TMA_STORE, NumBitsPerTMA, AuxParams_>
#if 0
auto [c0,c1,c2,c3,c4] = append<5>(dst_coord, 0);
printf("THR (%d,%d,%d) BLK (%d,%d,%d) TMACRD (%d,%d,%d,%d,%d) SMEMADDR (%p)\n",
-threadIdx.x, threadIdx.y, threadIdx.z,
-blockIdx.x, blockIdx.y, blockIdx.z,
+ThreadIdxX(), ThreadIdxY(), ThreadIdxZ(),
+BlockIdxX(), BlockIdxY(), BlockIdxZ(),
int32_t(c0), int32_t(c1), int32_t(c2), int32_t(c3), int32_t(c4), src_ptr);
#endif
return detail::explode_tuple(detail::CallCOPY<SM90_TMA_STORE>{},
@@ -457,8 +457,8 @@ struct Copy_Traits<SM90_TMA_REDUCE_ADD, NumBitsPerTMA, AuxParams_>
#if 0
auto [c0,c1,c2,c3,c4] = append<5>(dst_coord, 0);
printf("THR (%d,%d,%d) BLK (%d,%d,%d) TMACRD (%d,%d,%d,%d,%d) SMEMADDR (%p)\n",
-threadIdx.x, threadIdx.y, threadIdx.z,
-blockIdx.x, blockIdx.y, blockIdx.z,
+ThreadIdxX(), ThreadIdxY(), ThreadIdxZ(),
+BlockIdxX(), BlockIdxY(), BlockIdxZ(),
int32_t(c0), int32_t(c1), int32_t(c2), int32_t(c3), int32_t(c4), src_ptr);
#endif

@@ -974,7 +974,8 @@ make_tma_copy_desc(Tensor<GEngine,GLayout> const& gtensor, // The origin
// TMA general info
//

-#if (__CUDACC_VER_MAJOR__ >= 12) && !defined(__CUDACC_RTC__)
+#if ((__CUDACC_VER_MAJOR__ >= 12) && !defined(__CUDACC_RTC__)) || \
+defined(SYCL_NVIDIA_TARGET)

CUtensorMapDataType tma_format = TMA::to_CUtensorMapDataType<TmaInternalType>();
CUtensorMapInterleave tma_interleave = CU_TENSOR_MAP_INTERLEAVE_NONE;
Expand All @@ -984,7 +985,7 @@ make_tma_copy_desc(Tensor<GEngine,GLayout> const& gtensor, // The origin
// TMA smem swizzle type
CUtensorMapSwizzle smem_swizzle = TMA::to_CUtensorMapSwizzle(get_tma_swizzle_bits(swizzle));
CUresult result = cuTensorMapEncodeTiled(
-&tma_desc,
+reinterpret_cast<CUtensorMap*>(&tma_desc),
Comment on lines -987 to +988
Collaborator:
Is this needed?

AD2605 (author):
Yes, cuTensorMapEncodeTiled accepts a pointer to CUtensorMap; it otherwise leads to a compilation error.

AD2605 (author):
I must clarify that this change is only temporary, until we have tensor map initialization via SYCL support.

tma_format,
tma_dim,
gmem_address,
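For reference, the driver API entry point that forces this cast, as declared in cuda.h for CUDA 12.x; the first parameter is a CUtensorMap*:

CUresult cuTensorMapEncodeTiled(
    CUtensorMap* tensorMap,
    CUtensorMapDataType tensorDataType,
    cuuint32_t tensorRank,
    void* globalAddress,
    const cuuint64_t* globalDim,
    const cuuint64_t* globalStrides,
    const cuuint32_t* boxDim,
    const cuuint32_t* elementStrides,
    CUtensorMapInterleave interleave,
    CUtensorMapSwizzle swizzle,
    CUtensorMapL2promotion l2Promotion,
    CUtensorMapOOBfill oobFill);

In the SYCL build, tma_desc is presumably CuTe's own TmaDescriptor type rather than a direct typedef of CUtensorMap, hence the reinterpret_cast until tensor map initialization is available through SYCL.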
3 changes: 2 additions & 1 deletion include/cutlass/arch/barrier.h
@@ -36,7 +36,8 @@

#include <cutlass/arch/memory_sm75.h>
#include <cute/arch/cluster_sm90.hpp>
-#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900 && (__CUDACC_VER_MAJOR__ >= 12)
+#if (defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900 && (__CUDACC_VER_MAJOR__ >= 12)) || \
+(defined(__SYCL_CUDA_ARCH__) && (__SYCL_CUDA_ARCH__ >= 900) && defined(__PTX_VERSION__) && (__PTX_VERSION__ >= 80))
#define CUDA_BARRIER_ENABLED 1
#else
#define CUDA_BARRIER_ENABLED 0
5 changes: 3 additions & 2 deletions include/cutlass/arch/memory.h
@@ -61,9 +61,10 @@ struct global_load;

/////////////////////////////////////////////////////////////////////////////////////////////////

-#if (((__CUDACC_VER_MAJOR__ == 11) && (__CUDACC_VER_MINOR__ >= 4)) || \
+#if ((((__CUDACC_VER_MAJOR__ == 11) && (__CUDACC_VER_MINOR__ >= 4)) || \
(__CUDACC_VER_MAJOR__ > 11)) && \
-defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 750)
+defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 750)) || \
+(defined(__SYCL_CUDA_ARCH__) && (__SYCL_CUDA_ARCH__ >= 750))
#define CUTLASS_ENABLE_L2_PREFETCH 1
#else
#define CUTLASS_ENABLE_L2_PREFETCH 0
16 changes: 11 additions & 5 deletions include/cutlass/arch/mma_sm90.h
@@ -46,25 +46,31 @@

////////////////////////////////////////////////////////////////////////////////

-#if ((__CUDACC_VER_MAJOR__ > 11) || (__CUDACC_VER_MAJOR__ == 11 && __CUDACC_VER_MINOR__ >= 8))
+#if ((__CUDACC_VER_MAJOR__ > 11) || (__CUDACC_VER_MAJOR__ == 11 && __CUDACC_VER_MINOR__ >= 8)) || \
+defined(SYCL_NVIDIA_TARGET)
#define CUTLASS_ARCH_MMA_SM90_F64_MMA_SUPPORTED
#if (!defined(CUTLASS_ARCH_MMA_SM90_F64_MMA_ENABLED))
-#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900))
+#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)) || \
+(defined(__SYCL_CUDA_ARCH__) && (__SYCL_CUDA_ARCH__ >= 900) && \
+defined(__PTX_VERSION__) && (__PTX_VERSION__ >= 80))
#define CUTLASS_ARCH_MMA_SM90_F64_MMA_ENABLED
#endif
#endif
#endif

-#if (__CUDACC_VER_MAJOR__ >= 12)
+#if (__CUDACC_VER_MAJOR__ >= 12) || defined(SYCL_NVIDIA_TARGET)
#define CUTLASS_ARCH_MMA_SM90_SUPPORTED
#if (!defined(CUTLASS_ARCH_MMA_SM90_ENABLED))
-#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900))
+#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)) || \
+(defined(__SYCL_CUDA_ARCH__) && (__SYCL_CUDA_ARCH__ >= 900) && \
+defined(__PTX_VERSION__) && (__PTX_VERSION__ >= 80))
#define CUTLASS_ARCH_MMA_SM90_ENABLED
#endif
#endif
#endif

-#if ((__CUDACC_VER_MAJOR__ > 12) || ((__CUDACC_VER_MAJOR__ == 12) && (__CUDACC_VER_MINOR__ >= 3)))
+#if ((__CUDACC_VER_MAJOR__ > 12) || ((__CUDACC_VER_MAJOR__ == 12) && (__CUDACC_VER_MINOR__ >= 3))) || \
+defined(SYCL_NVIDIA_TARGET)
Collaborator:
Here as well, SYCL_NVIDIA_TARGET covers a wide range of targets, including SM80. We need to use the PTX version here, or at least make sure that the NVIDIA target is >= 900.
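One possible implementation of that suggestion, sketched with a hypothetical SYCL_NVIDIA_ARCH macro that the build system would have to define (e.g. -DSYCL_NVIDIA_ARCH=900); this is not code from the PR:

#if ((__CUDACC_VER_MAJOR__ > 12) || ((__CUDACC_VER_MAJOR__ == 12) && (__CUDACC_VER_MINOR__ >= 3))) || \
    (defined(SYCL_NVIDIA_TARGET) && defined(SYCL_NVIDIA_ARCH) && (SYCL_NVIDIA_ARCH >= 900))
#define CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90_SUPPORTED
#endif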

#define CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90_SUPPORTED
#endif

7 changes: 5 additions & 2 deletions include/cutlass/arch/reg_reconfig.h
@@ -37,8 +37,11 @@

#include "cutlass/cutlass.h"

-#if (defined(__CUDA_ARCH__) && \
-(__CUDA_ARCH__ >= 900) && (__CUDACC_VER_MAJOR__ >= 12) && defined(__CUDA_ARCH_FEAT_SM90_ALL))
+#if (((defined(__CUDA_ARCH__) && \
+(__CUDA_ARCH__ >= 900) && (__CUDACC_VER_MAJOR__ >= 12)) || \
+(defined(__SYCL_CUDA_ARCH__) && (__SYCL_CUDA_ARCH__ >= 900) && \
+defined(__PTX_VERSION__) && (__PTX_VERSION__ >= 80))) \
+&& defined(__CUDA_ARCH_FEAT_SM90_ALL))
#define CUDA_CTA_RECONFIG_ACTIVATED 1
#endif

2 changes: 1 addition & 1 deletion include/cutlass/barrier.h
@@ -97,7 +97,7 @@ struct GenericBarrier {
{
int state = 0;

-#if (__CUDA_ARCH__ >= 700)
+#if (__CUDA_ARCH__ >= 700) || (__SYCL_CUDA_ARCH__ >= 700)
/// SM70 and newer use memory consistency qualifiers

// Acquire pattern using acquire modifier