Refactor to use thrust::reduce on any. #685

Open · wants to merge 6 commits into base: main
Changes from 1 commit
15 changes: 13 additions & 2 deletions include/matx/operators/any.h
@@ -32,6 +32,8 @@

#pragma once

+ #include <thrust/reduce.h>
+ #include <thrust/device_ptr.h>

#include "matx/core/type_utils.h"
#include "matx/operators/base_operator.h"
@@ -71,8 +73,17 @@ namespace detail {
};

template <typename Out, typename Executor>
- void Exec(Out &&out, Executor &&ex) const {
-   any_impl(cuda::std::get<0>(out), a_, ex);
+ void Exec(Out &&out, Executor) const {
+   auto output_tensor = cuda::std::get<0>(out);
+   using out_tensor_t = decltype(output_tensor);
+   auto inp_ptr = thrust::device_pointer_cast(a_.Data());
+   auto result_ptr = output_tensor.Data();
+   auto op = detail::reduceOpAny<typename out_tensor_t::value_type>();
+   auto result = thrust::reduce(inp_ptr,
+                                inp_ptr + a_.TotalSize(),
+                                op.Init(),
+                                op);
+   *result_ptr = result;

@ZelboK (author) commented on the new Exec body:

So I'm not really that familiar with MatX. On line 79 I am doing a device_pointer_cast; IIUC the Executor is what determines whether it runs on the host or the device. I changed Executor ex to just Executor to get past the warning-as-error that the parameter wasn't used.
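
As a side note on that warning, here is a minimal sketch of the C++17 [[maybe_unused]] alternative, which keeps the parameter name while still satisfying -Werror. The types and the Exec signature below are hypothetical stand-ins, not MatX's actual code.

#include <cstdio>

// Hypothetical stand-in for an executor type; illustration only.
struct CudaExecutor {};

// [[maybe_unused]] keeps the parameter name available for later use while
// silencing -Werror=unused-parameter, instead of dropping the name entirely.
template <typename Executor>
void Exec(int &out, [[maybe_unused]] Executor &&ex) {
  out = 1;  // placeholder body that does not touch ex yet
}

int main() {
  int out = 0;
  Exec(out, CudaExecutor{});
  std::printf("out = %d\n", out);
  return 0;
}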

@cliffburdick (collaborator) replied:

Hi @ZelboK, the Exec functions take any generic operator and call the transform function. In this case there's a separate transform overload for a CUDA executor and a host executor. Since the input can be any type of operator and not just a tensor, there may not be a Data() method, because the operator doesn't have to have memory backing it. For example, a user could do:

(a = any(ones<int>({4,4}))).run();

ones has no Data function since it has no memory backing it. So for this to work properly it would have to use thrust's iterator interface and wrap the operator in it, as we do in other transforms. This is not a trivial change and can be a bit difficult if you're not familiar with the library.
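
To make the memoryless-operator point concrete, here is a minimal standalone sketch using Thrust's fancy iterators rather than MatX's own iterator classes from iterator.h. The OnesGenerator and AnyOp functors are hypothetical stand-ins, not MatX code.

#include <thrust/execution_policy.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/reduce.h>
#include <cstdio>

// Stand-in for a memoryless "ones" operator: maps any flat index to 1.
struct OnesGenerator {
  __host__ __device__ int operator()(long /*idx*/) const { return 1; }
};

// Binary op with "any" semantics; note the const call operator.
struct AnyOp {
  __host__ __device__ int operator()(int a, int b) const {
    return (a != 0) || (b != 0);
  }
};

int main() {
  const long total_size = 4 * 4;  // a 4x4 operator, flattened

  // Wrap the generator in Thrust's iterator interface: counting_iterator
  // produces flat indices and transform_iterator maps them through the
  // generator, so no device memory ever backs the input.
  auto first = thrust::make_transform_iterator(
      thrust::counting_iterator<long>(0), OnesGenerator{});

  int any = thrust::reduce(thrust::device, first, first + total_size,
                           /*init=*/0, AnyOp{});

  std::printf("any = %d\n", any);  // prints 1
  return 0;
}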

@ZelboK (author) replied on Jul 28, 2024:

@cliffburdick I've contributed a bit to CCCL actually, so I think I should be fine in that regard. That makes sense to me so far; I'll have to actually implement it to see how it goes. I presume you are referring to make_reduce_iterator, for example. Curious to know why this isn't trivial, though; is there something I'm missing?

Tangentially, is there a way to drastically reduce compile times? The feedback loop right now takes quite a long time.

I run with these options

cmake -DMATX_BUILD_TESTS=ON -DMATX_BUILD_BENCHMARKS=OFF -DMATX_BUILD_EXAMPLES=OFF -DMATX_BUILD_DOCS=OFF ..

and have tried commenting out tests, but it still takes a long time before I actually get to see errors from the compiler. I have a pretty decent CPU (an i9-12900K), too.

@cliffburdick (collaborator) replied:

Hi @ZelboK, it might not be too hard then if you're familiar with CCCL and their iterators. We have our own iterator classes in iterator.h. Typically the way we use them is like this:

https://github.com/NVIDIA/MatX/blob/main/include/matx/transforms/cub.h#L720

We write a lambda to perform the function (CUB in this case), and ReduceInput wraps it in iterators and collapses it. The tricky part might be that we have not done this with Thrust, though, so I don't know whether our iterators are missing something needed to make it work there.

To reduce compile times, you shouldn't build everything each time. What I do is take an existing example like fft_conv.cu (or make a new one), put the code I'm testing in there, and compile just that target with something like make fft_conv. That should compile in about 10 seconds on most machines, whereas compiling everything can approach an hour on weaker ones.
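
For reference, here is a standalone sketch of the CUB path described above, using raw device pointers and CUB's usual two-phase call rather than MatX's ReduceInput wrapper; the AnyOp functor is a hypothetical stand-in.

#include <cub/device/device_reduce.cuh>
#include <thrust/device_vector.h>
#include <cstdio>

// Binary "any" op for CUB; as with Thrust, the call operator is const.
struct AnyOp {
  __host__ __device__ int operator()(int a, int b) const {
    return (a != 0) || (b != 0);
  }
};

int main() {
  thrust::device_vector<int> in(16, 0);
  in[5] = 3;  // a single non-zero element
  thrust::device_vector<int> out(1);

  // CUB's two-phase pattern: the first call only computes the temporary
  // storage size, the second call performs the reduction.
  void  *d_temp = nullptr;
  size_t temp_bytes = 0;
  cub::DeviceReduce::Reduce(d_temp, temp_bytes, in.data().get(),
                            out.data().get(), static_cast<int>(in.size()),
                            AnyOp{}, /*init=*/0);

  thrust::device_vector<unsigned char> temp(temp_bytes);
  d_temp = temp.data().get();
  cub::DeviceReduce::Reduce(d_temp, temp_bytes, in.data().get(),
                            out.data().get(), static_cast<int>(in.size()),
                            AnyOp{}, /*init=*/0);

  std::printf("any = %d\n", static_cast<int>(out[0]));  // prints 1
  return 0;
}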

@ZelboK (author) replied:

Thank you @cliffburdick, I'm testing my code in an example now and the feedback loop is much better. Much appreciated.

}

static __MATX_INLINE__ constexpr __MATX_HOST__ __MATX_DEVICE__ int32_t Rank()
19 changes: 13 additions & 6 deletions include/matx/transforms/reduce.h
@@ -798,17 +798,24 @@ template <typename T> class reduceOpMax {
* Performs a reduction of two values of type T by returning 1 if either
* of the values are non-zero.
*/
- template <typename T> class reduceOpAny {
+ template <typename T>
+ class reduceOpAny {
public:
+ using type = T; // This type is for Thrust
using matx_reduce = bool;
using matx_no_cub_reduce = bool; // Don't use CUB for this reduction type
- __MATX_HOST__ __MATX_DEVICE__ __MATX_INLINE__ T Reduce(const T &v1, const T &v2)
- {
+
+ __MATX_HOST__ __MATX_DEVICE__ __MATX_INLINE__ T operator()(const T &v1, const T &v2) const {

@ZelboK (author) commented on the operator() above:

needs to be const for Thrust. (A standalone sketch of this requirement follows this diff.)


return (v1 != 0) || (v2 != 0);
}
- __MATX_HOST__ __MATX_DEVICE__ __MATX_INLINE__ T operator()(T &v1, T &v2) { v1 = ((v1 != 0) || (v2 != 0)); return v1; }
- __MATX_HOST__ __MATX_DEVICE__ __MATX_INLINE__ T Init() { return (T)(0); }
- __MATX_DEVICE__ __MATX_INLINE__ void atomicReduce(T *addr, T val) { atomicAny(addr, val); }
+
+ __MATX_HOST__ __MATX_DEVICE__ __MATX_INLINE__ T Init() const {
+   return static_cast<T>(0);
+ }
+
+ __MATX_DEVICE__ __MATX_INLINE__ void atomicReduce(T *addr, T val) const {
+   atomicAny(addr, val);
+ }
};

/**
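
Regarding the const-qualification noted in the inline comment above, here is a minimal standalone sketch of a reduceOpAny-style functor with const operator() and Init() driving thrust::reduce through device_pointer_cast, mirroring the refactored Exec. The AnyOp name and the use of thrust::device_vector are illustrative, not MatX code.

#include <thrust/device_ptr.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <cstdio>

// Minimal stand-in for reduceOpAny with the const-qualified members the
// Thrust path needs: operator() and Init() are both callable on a const object.
template <typename T>
struct AnyOp {
  __host__ __device__ T operator()(const T &v1, const T &v2) const {
    return (v1 != 0) || (v2 != 0);
  }
  __host__ __device__ T Init() const { return static_cast<T>(0); }
};

int main() {
  thrust::device_vector<int> data(16, 0);
  data[7] = 42;  // one non-zero element

  // Mirror the refactored Exec: wrap the raw device pointer and hand the
  // functor to thrust::reduce.
  auto op = AnyOp<int>{};
  auto inp_ptr = thrust::device_pointer_cast(data.data().get());
  int any = thrust::reduce(inp_ptr, inp_ptr + data.size(), op.Init(), op);

  std::printf("any = %d\n", any);  // prints 1
  return 0;
}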