
Cudastf #794 (Draft)

wants to merge 17 commits into main

Conversation

@sidelnik commented Nov 5, 2024:

Initial updates to the build system to get MatX working with CUDASTF.

@@ -83,6 +88,10 @@ int main([[maybe_unused]] int argc, [[maybe_unused]] char **argv)
(maxn = matx::max(sqrt(norm))).run(exec);

exec.sync();
#if 1
ctx.finalize();
Collaborator:

What is finalize() used for vs. sync()? Could you hide the context in the executor so the user doesn't need it, and have exec.sync() call finalize()?

Reply:

finalize() terminates everything in the STF context: it waits for asynchronous tasks, deletes internal resources, etc. You can only call it once. sync() is closer to ctx.task_fence(), which is a non-blocking fence (it returns a CUDA stream, and waiting on that stream means everything is done).

I'd like to move finalize() to the dtor of the executor, but there are some caveats if you define the executor as a static variable. Is that allowed? The caveat might be the usual inappropriate unload ordering of the CUDA and STF libraries...

Collaborator:

Sounds good. I think the destructor is the right place, but does sync() work as expected?

Reply:

@sidelnik is it doing a task fence with a stream sync?

Author:

@caugonnet, sync() should be calling ctx.task_fence() now. I agree; I think we should place the ctx.finalize() inside the STF executor dtor.
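
A minimal sketch of what that could look like, assuming the executor owns a cuda::experimental::stf::context (member names here are illustrative, not this PR's actual code):

```cpp
#include <cuda/experimental/stf.cuh>

// Sketch only: finalize() moves to the dtor, sync() maps to a task fence.
class stfExecutor {
public:
  stfExecutor(cudaStream_t stream) : stream_(stream) {}

  ~stfExecutor() {
    // finalize() may only be called once; it drains asynchronous work and
    // releases internal STF resources.
    ctx_.finalize();
  }

  void sync() {
    // task_fence() returns a CUDA stream; waiting on that stream means all
    // prior asynchronous STF work has completed.
    cudaStreamSynchronize(ctx_.task_fence());
  }

private:
  cudaStream_t stream_;
  cuda::experimental::stf::context ctx_;
};
```

The static-variable caveat above still applies: a static stfExecutor would run this dtor at program teardown, potentially after the CUDA/STF libraries have begun unloading.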

@@ -129,18 +138,30 @@ int main([[maybe_unused]] int argc, [[maybe_unused]] char **argv)

}

#if 0
cudaEventRecord(stop, stream);
Collaborator:

Eventually we should mask these events behind the executor as well so the timing is the same regardless of the executor.

Reply:

Yes, this makes the code look very different for the two executors, but timing is the sole reason, especially if finalize() is moved to the dtor.
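
A hypothetical sketch of what masking the events could look like at the call site; start_timer/stop_timer/get_time_ms do not exist in MatX and are illustrative only:

```cpp
// Hypothetical executor-level timing API: each executor implements it
// natively (cudaEventRecord for the CUDA executor, a task fence for STF),
// so the example code stays identical for both.
exec.start_timer();
(maxn = matx::max(sqrt(norm))).run(exec);
exec.stop_timer();
printf("elapsed: %f ms\n", exec.get_time_ms());
```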

@@ -107,7 +108,7 @@ class tensor_t : public detail::tensor_impl_t<T,RANK,Desc> {
* @param rhs Object to copy from
*/
__MATX_HOST__ tensor_t(tensor_t const &rhs) noexcept
: detail::tensor_impl_t<T, RANK, Desc>{rhs.ldata_, rhs.desc_}, storage_(rhs.storage_)
: detail::tensor_impl_t<T, RANK, Desc>{rhs.ldata_, rhs.desc_, rhs.stf_ldata_}, storage_(rhs.storage_)
Collaborator:

It would be good to understand why this extra data member is needed, because this pointer exists on the device potentially many times, so it can increase the size of the operator.

Reply:

That's where a careful review of the design is needed... Our logical data class tracks the use of a specific piece of data. Your tensor seems to be a view into some data (with shapes and so on), so it's OK for it to hold just the pointer and shapes, but in STF we need to keep track of the internal state of the data (who owns a copy, which tasks depend on it, etc.). This is what the logical data does on your behalf, and it is something your tensors cannot do by merely using the pointer.

One conservative take is to say that if you slice a tensor, the slice is the SAME logical data, so that further concurrent write accesses are serialized. This is sub-optimal when you have non-overlapping slices, but we cannot do better with a simple strategy. It ensures correctness, but not optimal concurrency.
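
A sketch of that conservative strategy using the task/add_deps pattern from this PR, assuming stf_ldata_ points at an already-populated optional:

```cpp
// Two tasks writing through different slices of the same tensor carry the
// SAME logical data, so CUDASTF serializes them even when the slices do
// not overlap.
auto &ld = **stf_ldata_;  // shared logical data of the parent tensor

auto t1 = ctx.task();
t1.add_deps(ld.rw());
t1->*[&](cudaStream_t s) { /* kernel writing slice 1 on stream s */ };

auto t2 = ctx.task();
t2.add_deps(ld.rw());     // same logical data => ordered after t1
t2->*[&](cudaStream_t s) { /* kernel writing slice 2 on stream s */ };
```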

Reply:

@cliffburdick you say it exists many times on the device, but isn't this a host-only class?

*
* @param stream CUDA stream
*/
stfExecutor(cudaStream_t stream) : stream_(stream) {
Collaborator:

What does a stream do here? I thought STF had its own internal streams?

Author:

@cliffburdick In STF you can create nested/localized contexts and streams from existing (non-STF-created) streams. This allows STF mechanisms to be correctly synchronized within the existing stream ecosystem. @caugonnet, correct me if I am wrong.

@@ -177,6 +180,16 @@ class tensor_t : public detail::tensor_impl_t<T,RANK,Desc> {
this->SetLocalData(storage_.data());
}

template <typename S2 = Storage, typename D2 = Desc,
std::enable_if_t<is_matx_storage_v<typename remove_cvref<S2>::type> && is_matx_descriptor_v<typename remove_cvref<D2>::type>, bool> = true>
tensor_t(S2 &&s, D2 &&desc, T* ldata, std::optional<stf_logicaldata_type > *stf_ldata_) :

Comment:

We need to do something about that type... std::optional<stf_logicaldata_type> *stf_ldata_

The rationale is to be able to define a tensor before it is associated with an executor, so the logical data might be set lazily.
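
A sketch of that lazy association, assuming the executor populates the optional the first time the tensor participates in a task (matching the logical_data(void_interface()) hunk later in this PR):

```cpp
// Executor-side sketch: create the logical data on first use.
if (!stf_ldata_->has_value()) {
  *stf_ldata_ = ctx.logical_data(cuda::experimental::stf::void_interface());
}
auto &ld = **stf_ldata_;  // safe to build task deps from here on
```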

}

/**
* Constructor for a rank-0 tensor (scalar).
*/
tensor_impl_t() {

auto ldptr = new std::optional<stf_logicaldata_type>();

Comment:

this feels bad

Collaborator:

This won't compile anymore since we don't allow std:: types on the device. It might work with cuda::std::optional, but we don't use that anywhere currently.

template <typename DescriptorType, std::enable_if_t<is_matx_descriptor_v<typename remove_cvref<DescriptorType>::type>, bool> = true>
__MATX_INLINE__ __MATX_DEVICE__ __MATX_HOST__ tensor_impl_t(T *const ldata,
DescriptorType &&desc, std::optional<stf_logicaldata_type > *stf_ldata)
: ldata_(ldata), desc_{std::forward<DescriptorType>(desc)}, stf_ldata_(stf_ldata)

Comment:

::std::move(stf_ldata) ?

#endif

if (perm == 0) {
task.add_deps(ld.write());

Comment:

We could directly build a task_dep in CUDASTF, matching the perm value with the access type... but it seems there is no clean way to do this!
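
One option is a small helper that centralizes the mapping; a sketch, assuming perm == 0 means write (as in the hunk above) and anything else means read:

```cpp
// Map MatX's integer perm onto a CUDASTF access mode in one place.
template <typename Task, typename LData>
void add_perm_dep(Task &task, LData &ld, int perm) {
  if (perm == 0) {
    task.add_deps(ld.write());
  } else {
    task.add_deps(ld.read());
  }
}
```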

place = getDataPlace(Data());
#endif

*stf_ldata_ = ctx.logical_data(cuda::experimental::stf::void_interface());
@caugonnet (Nov 14, 2024):

Some comment would be welcome here :) This is creating a logical data with a void data interface because we don't rely on CUDASTF for transfers/allocation; it's just for sync.

Putting a value here, rather than a shape of a void interface, means we don't have to issue a "write" task in CUDASTF.

@@ -45,6 +45,30 @@

namespace matx {
namespace detail {

#if 0
__MATX_INLINE__ cuda::experimental::stf::data_place getDataPlace(void *ptr) {

Comment:

Why don't we keep it? Note that for the void data interface it's not super critical, but still...

return data_place::current_device();
case MATX_INVALID_MEMORY:
//std::cout << "Data kind is invalid: assuming managed memory\n";
return data_place::managed;

Comment:

this seems like an error

}
else {
//std::cout << " RANK 0 not on LHS operator = " << op.str() << '\n';
detail::matxOpT0Kernel<<<blocks, threads, 0, stream_>>>(op);
@caugonnet (Nov 14, 2024):

Why do we sometimes launch a kernel without a task? Is it coherent with STF tasks?


bool stride = detail::get_grid_dims<Op::Rank()>(blocks, threads, sizes, 256);

if constexpr (Op::Rank() == 1) {

Comment:

It looks like we could factor out that whole constexpr cascade and move the constexpr tests into the lambda?
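
A sketch of that factorization; the kernel names and argument lists follow the existing cascade and are assumed here:

```cpp
// Build the task once; dispatch on rank inside a single lambda.
tsk->*[&](cudaStream_t s) {
  if constexpr (Op::Rank() == 0) {
    detail::matxOpT0Kernel<<<blocks, threads, 0, s>>>(op);
  } else if constexpr (Op::Rank() == 1) {
    detail::matxOpT1Kernel<<<blocks, threads, 0, s>>>(op, sizes[0]);
  } else if constexpr (Op::Rank() == 2) {
    detail::matxOpT2Kernel<<<blocks, threads, 0, s>>>(op, sizes[0], sizes[1]);
  }
  // ...higher ranks follow the same pattern
};
```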

@@ -54,6 +54,9 @@ namespace matx
return f_(pp_get<Dim>(indices...));
}

template <typename Task>
__MATX_INLINE__ void apply_dep_to_task([[maybe_unused]] Task &&task, [[maybe_unused]] int perm=1) const noexcept { }

Comment:

So this method is defined per operator and is STF-specific? It's not part of the executor, nor does it rely on overloads/traits?
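
A sketch of the traits alternative: detect the member via SFINAE and fall back to a no-op, instead of hand-writing an empty apply_dep_to_task in every operator (helper names are hypothetical):

```cpp
#include <type_traits>
#include <utility>

// True if Op has apply_dep_to_task(Task&, int).
template <typename Op, typename Task, typename = void>
struct has_apply_dep : std::false_type {};

template <typename Op, typename Task>
struct has_apply_dep<Op, Task,
    std::void_t<decltype(std::declval<const Op&>().apply_dep_to_task(
        std::declval<Task&>(), 0))>> : std::true_type {};

// Call it when present; silently no-op otherwise.
template <typename Op, typename Task>
void apply_dep_if_present(const Op &op, Task &tsk, int perm) {
  if constexpr (has_apply_dep<Op, Task>::value) {
    op.apply_dep_to_task(tsk, perm);
  }
}
```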

b_.apply_dep_to_task(tsk, 1);

tsk->*[&](cudaStream_t s) {
auto exec = cudaExecutor(s);

Comment:

So this creates a nested MatX executor; is that legal?

Collaborator:

I think it should be fine. The cache is ultimately what could possibly have side effects.

if constexpr (is_cuda_executor_v<Executor>) {
return;
}
else if constexpr (!is_cuda_executor_v<Executor>) {
Collaborator:

just else?
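
i.e., since the second constexpr test is just the negation of the first:

```cpp
if constexpr (is_cuda_executor_v<Executor>) {
  return;
}
else {
  // non-CUDA executor path
}
```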

@@ -1094,6 +1165,9 @@ IGNORE_WARNING_POP_GCC
protected:
T *ldata_;
Desc desc_;

public:
mutable std::optional<stf_logicaldata_type > *stf_ldata_;
Collaborator:

As discussed before, this won't work since we can't use std:: objects on the device. It might work with cuda::std::optional, but we'd likely need to justify the overhead vs. other options.
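
If the member survives review, a device-compatible variant could use libcu++'s optional (a sketch; the overhead trade-off is unevaluated):

```cpp
#include <cuda/std/optional>

// Same member, but with cuda::std::optional, which is usable in device code.
mutable cuda::std::optional<stf_logicaldata_type> *stf_ldata_;
```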

@@ -55,6 +55,9 @@ namespace matx
__MATX_INLINE__ __MATX_DEVICE__ __MATX_HOST__ T operator()(Is...) const {
return v_; };

template <typename Task>
__MATX_INLINE__ void apply_dep_to_task([[maybe_unused]] Task &&task, [[maybe_unused]] int perm) const noexcept { }
Collaborator:

Operator members typically use camel-case format.

tsk.set_symbol("all_task");

output.PreRun(out_dims_, std::forward<Executor>(ex));
output.apply_dep_to_task(tsk, 0);
Collaborator:

Why isn't apply_dep_to_task just part of PreRun? It looks like it's called in the same place.

if constexpr (std::is_same_v<FFTType, fft_t>) {
fft_impl(permute(cuda::std::get<0>(out), perm_), permute(a_, perm_), fft_size_, norm_, ex);
// stfexecutor case
if constexpr (!is_cuda_executor_v<Executor>) {
Collaborator:

Do you want this to run for the host executor too?
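
One way to avoid that: test for the STF executor explicitly rather than negating the CUDA one (is_stf_executor_v is a hypothetical trait):

```cpp
if constexpr (is_stf_executor_v<Executor>) {
  // STF task path; the host executor no longer falls in here
}
```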

output.apply_dep_to_task(tsk, 0);
a_.apply_dep_to_task(tsk, 1);

tsk->*[&](cudaStream_t s) {
Collaborator:

Rather than checking that this is not a CUDA executor and then creating one inside, can it somehow pull a stream from STF and just use that here?
