
Refactor to use thrust::reduce on any. #685

Open · wants to merge 6 commits into base: main

Conversation


@ZelboK (Author) commented Jul 27, 2024

Saw the issue open for this and did this on a whim. I'm not too comfortable deleting code from this repository, so for brevity I'll leave things as is. All this does is use thrust::reduce instead of the defined kernels; a minimal sketch of the idea follows the list below.

Some clarifying points:

  1. This PR is likely not complete. I figured it would be better to get feedback early in case I am doing something wrong.
  2. Tests seem to pass on my 3080 WSL2 build, but some of them segfault on main, so I can't replicate a perfect test scenario.
  3. I am leaving the dispatching to thrust; I didn't spend much time studying the Executors.
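
A minimal, self-contained sketch of that idea (not MatX code; the AnyOp functor and the test data are illustrative, and it assumes compilation with nvcc):

#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/reduce.h>

// Replaces a hand-written "any" reduction kernel with a single thrust::reduce call.
struct AnyOp {
  __host__ __device__ int operator()(int a, int b) const {
    return (a != 0) || (b != 0); // "any": 1 if any element seen so far is non-zero
  }
};

int main() {
  int h[4] = {0, 0, 3, 0};
  thrust::device_vector<int> d(h, h + 4);
  // The identity element for "any" is 0 (false); thrust selects the CUDA backend.
  int any = thrust::reduce(thrust::device, d.begin(), d.end(), 0, AnyOp{});
  return any ? 0 : 1; // d contains a non-zero element, so any == 1
}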

inp_ptr + a_.TotalSize(),
op.Init(),
op);
*result_ptr = result;
@ZelboK (Author)

So I'm not really that familiar with MatX. On line 79 I am doing a device_pointer_cast - IIUC the Executor is what determines whether it's on host or device. I changed Executor ex to Executor to get past the warning-as-error that the parameter wasn't used.
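
For context, a reconstruction of the hunk above (hedged: everything before the inp_ptr + a_.TotalSize() line is inferred from the names in this diff and the discussion, not copied from the PR; note the next comment, which points out that a_ may not have a Data() method at all):

// inp_ptr / result_ptr wrap raw device memory so thrust can consume it;
// thrust::device_pointer_cast is the standard thrust API for that.
auto inp_ptr    = thrust::device_pointer_cast(a_.Data());
auto result_ptr = thrust::device_pointer_cast(output_.Data());
auto result = thrust::reduce(inp_ptr,
                             inp_ptr + a_.TotalSize(),
                             op.Init(),
                             op);
*result_ptr = result;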

@cliffburdick (Collaborator)

Hi @ZelboK, the Exec functions take any generic operator and call the transform function. In this case there's a separate transform overload for a CUDA executor and a host executor. Since the input can be any type of operator and not just a tensor, there may not be a Data() method, since it doesn't have to have memory backing it. For example, a user could do:

(a = any(ones<int>({4,4}))).run();

ones has no Data() function since it has no memory backing it. So for this to work properly it would have to use thrust's iterator interface and wrap the operator in that, as we do in other transforms. This is not a trivial change and can be a bit difficult if you're not familiar with the library.
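
The general shape of that idiom, sketched with plain thrust iterators rather than MatX's own iterator.h classes (the Ones functor below is a hypothetical stand-in for an operator with no memory backing it):

#include <thrust/execution_policy.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/reduce.h>

struct Ones { // stand-in for an operator with no Data() backing it
  __host__ __device__ int operator()(long idx) const { return 1; }
};

int main() {
  // counting_iterator supplies indices; transform_iterator evaluates the
  // "operator" at each index, so no memory ever backs the input range.
  auto first = thrust::make_transform_iterator(
      thrust::make_counting_iterator<long>(0), Ones{});
  int sum = thrust::reduce(thrust::device, first, first + 16, 0);
  return sum == 16 ? 0 : 1;
}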

@ZelboK (Author) commented Jul 28, 2024

@cliffburdick I've contributed a bit to CCCL, actually, so I think I should be fine in that regard. That makes sense to me so far; I'll have to actually implement it to see how it goes. I presume you are referring to make_reduce_iterator, for example. Curious to know why this isn't trivial, though - is there something I'm missing?

Tangentially, is there a way to drastically reduce compile times? The feedback loop right now takes quite a long time.

I run with these options

cmake -DMATX_BUILD_TESTS=ON -DMATX_BUILD_BENCHMARKS=OFF -DMATX_BUILD_EXAMPLES=OFF -DMATX_BUILD_DOCS=OFF ..

and have tried commenting out tests, but it still takes a long time before I actually see errors from the compiler. I have a pretty decent CPU (i9 12900K), too.

@cliffburdick (Collaborator)

Hi @ZelboK, it might not be too hard then if you're familiar with CCCL and their iterators. We have our own iterator classes in iterator.h. Typically the way we use them is like this:

https://github.com/NVIDIA/MatX/blob/main/include/matx/transforms/cub.h#L720

We write a lambda to perform the function (CUB in this case), and ReduceInput wraps the operands in iterators and collapses them. The tricky part might be that we have not done this with thrust, though, so I don't know if our iterators are missing something to get it to work there.
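
Sketched from how ReduceInput is used later in this thread, the pattern looks roughly like this (illustrative only: d_temp, temp_bytes, and stream are assumed to exist in the surrounding transform, and the exact arguments MatX passes may differ; cub::DeviceReduce::Reduce itself is the real CUB entry point):

// fn receives iterator-wrapped, collapsed views of the operator plus offsets.
auto fn = [&](auto &&out, auto &&in, auto &&begin, auto &&end) {
  // "CUB in this case": a device-wide reduction over the wrapped input.
  cub::DeviceReduce::Reduce(d_temp, temp_bytes, in, out,
                            static_cast<int>(*end - *begin),
                            op, op.Init(), stream);
};
ReduceInput(fn, output, input); // wraps the operands in iterators and collapses dims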

To reduce compile times, you shouldn't build everything each time. What I do is take an existing example like fft_conv.cu (or make a new one), put the code I'm testing in there, and compile just that with something like make fft_conv. That should compile in roughly 10 seconds on most machines, whereas compiling everything can approach an hour on weaker machines.
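
Concretely, one possible loop, reusing the flags from the cmake invocation earlier in this thread (the example binary path is an assumption and may differ per setup):

cmake -DMATX_BUILD_EXAMPLES=ON -DMATX_BUILD_TESTS=OFF -DMATX_BUILD_BENCHMARKS=OFF -DMATX_BUILD_DOCS=OFF ..
make fft_conv
./examples/fft_conv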

@ZelboK (Author)

Thank you @cliffburdick, I'm testing my code in an example now and the feedback loop is much better. Much appreciated.

__MATX_HOST__ __MATX_DEVICE__ __MATX_INLINE__ T Reduce(const T &v1, const T &v2)
{

__MATX_HOST__ __MATX_DEVICE__ __MATX_INLINE__ T operator()(const T &v1, const T &v2) const {
@ZelboK (Author)

operator() needs to be const for Thrust (thrust may invoke the functor through a const object, so a non-const call operator fails to compile there).
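
A minimal illustration of the requirement (ReduceOpAnyLike is hypothetical, modeled on the reduceOpAny functor in this PR):

struct ReduceOpAnyLike {
  __MATX_HOST__ __MATX_DEVICE__ bool Init() const { return false; }
  // const-qualified so thrust can invoke it on a const copy of the functor:
  __MATX_HOST__ __MATX_DEVICE__ bool operator()(bool v1, bool v2) const {
    return v1 || v2;
  }
};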

@cliffburdick (Collaborator) left a comment

Hi @ZelboK , thanks for the PR! I left a comment that outlines the basic summary. This change may require significant work since it has to use iterators and we currently have no existing operator that uses thrust as an example.

ZelboK added 3 commits July 29, 2024 21:18
…od. Refactor so that the operator_utils is capable of dispatching to thrust
output_t out_base = output_;
auto op = detail::reduceOpAny<value_t>();

auto rv = ReduceInputThrust(std::forward<OpA>(a_), std::forward<out_t>(output_), std::forward<decltype(op)>(op));
@ZelboK (Author)

We construct the op, then send the inputs and outputs in their original form to the new function ReduceInputThrust, which collapses them and dispatches to thrust accordingly.

if (in_base.IsContiguous()) {
// the conversion is already handled for us by RandomOperatorIterator
thrust::reduce(
iter + *begin, iter + *end, op.Init(), op
@ZelboK (Author)

@cliffburdick

So this function is responsible for dealing with the fact that In may not necessarily be a tensor_t but rather an operator of some sort. It gets the respective offsets and constructs an iterator for thrust to use, which, thankfully, does seem to be perfectly compatible.

Consequently, the code has also become potentially simpler than its counterpart on main. Thrust is now responsible for deciding what to use from CUB, rather than MatX. I might be missing some context, though, on whether MatX needs to be responsible for choosing which CUB/Thrust function to use.

@ZelboK (Author)

In hindsight, I realize there might be issues with this approach. Does matxBinaryOp, for example, need to be utilized? I see that it has methods like PreRun and PostRun that make use of the Executor (which I have used here).

@cliffburdick (Collaborator)

Hi @ZelboK, since a tensor is an operator, the iterator wrapper can turn any MatX operator type into an iterator. However, we do the contiguous check to give CUB/thrust an optimization when the input is a flat pointer with contiguous strides.

Regarding your second comment, the ReduceInput function shouldn't, and doesn't, need to know whether it's CUB or thrust. That's determined by the lambda you pass to the function. I'm saying this all without actually trying it, but if the iterator type is compatible between both libraries, then there may not be any changes needed to that code. This is how the example I pointed to previously worked (just search for ReduceInput in cub.h).

I'm not sure what your question about matxBinaryOp is, but that's a wrapper class for any binary type and should be completely separate from this. matxBinaryOp, like most of our types, is an operator, and you can pass it to thrust/CUB and have the iterator pull from it.

@ZelboK (Author) commented Jul 30, 2024

@cliffburdick I see. I should have clarified - I actually did try to follow the example you showed me, like this (but I did it incorrectly). If you're curious, here's what I did.

auto output_ = cuda::std::get<0>(out);
using out_t = decltype(output_);
using value_t = typename out_t::value_type;
using input_t = typename detail::base_type_t<OpA>; // incorrect, this is not any.h responsibility
using output_t = typename detail::base_type_t<out_t>; // incorrect
input_t in_base = a_;
output_t out_base = output_;
auto op = detail::reduceOpAny<value_t>();

auto fn = [&](input_t &&input,
              output_t &&output,
              BeginOffset<input_t> &&begin,
              EndOffset<input_t> &&end) {
  return thrust::reduce(
    input + *begin, input + *end, op.Init(), op
  );
};
auto rv = ReduceInput(fn, out_base, in_base);

which brought me to my question about matxBinaryOp. This led to errors like

has no member "type"
    typedef typename thrust::iterator_system<InputIterator>::type System;

where InputIterator was of type

matx::detail::matxBinaryOp<matx::detail::ConstVal<int, cuda::std::__4::array<matx::index_t, 2UL>>, matx::index_t, matx::detail::BinOp<int, matx::index_t, matx::detail::AddF<int, matx::index_t>>>

I should not be passing the base types, though - the raw operator is not an iterator, so thrust::iterator_system cannot deduce an execution system for it. Let me push the fix. Thanks for the help.

input + *begin, input + *end, op.Init(), op
);
};
auto rv = ReduceInput(fn, output_, a_);
@ZelboK (Author)

The iterators are thankfully compatible.

@ZelboK (Author) commented Aug 5, 2024

@cliffburdick Do you need anything done from my end? I imagine if so it would be relating to the executors?

@cliffburdick (Collaborator)

> @cliffburdick Do you need anything done from my end? I imagine if so it would be relating to the executors?

Hi @ZelboK, is this ready to test?

@cliffburdick (Collaborator)

/build

@ZelboK (Author) commented Aug 5, 2024

> Hi @ZelboK, is this ready to test?

Yeah, I recall running the tests, which seemed to pass. The only thing is that (even on main) some unrelated tests segfault, so I'm not sure I can reproduce a good testing environment locally.

@cliffburdick (Collaborator)

Hi @ZelboK, it's running, but we require all of our tests to pass. Unless it's an environment issue on your end, we'll have to debug it.

@cliffburdick (Collaborator) commented Aug 7, 2024

Hi @ZelboK, it looks like we're getting an exception in the tests:

[----------] 3 tests from ReductionTestsNumericNonComplexAllExecs/0, where TypeParam = cuda::std::__4::tuple<unsigned int, matx::cudaExecutor>
[ RUN      ] ReductionTestsNumericNonComplexAllExecs/0.Any
matxException (matxCudaError: ) - /home/jenkins/workspace/unit-tests/include/matx/operators/any.h:87

Stack Trace:
 test/matx_test : ()+0x1918f16
 test/matx_test : ()+0x9087ea6
 test/matx_test : ()+0x89fa02c
 /home/jenkins/workspace/unit-tests/build/lib/libgtestd.so.1.11.0 : void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)+0x69
 /home/jenkins/workspace/unit-tests/build/lib/libgtestd.so.1.11.0 : void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)+0x5e
 /home/jenkins/workspace/unit-tests/build/lib/libgtestd.so.1.11.0 : testing::Test::Run()+0x104
 /home/jenkins/workspace/unit-tests/build/lib/libgtestd.so.1.11.0 : testing::TestInfo::Run()+0x146
 /home/jenkins/workspace/unit-tests/build/lib/libgtestd.so.1.11.0 : testing::TestSuite::Run()+0x140
 /home/jenkins/workspace/unit-tests/build/lib/libgtestd.so.1.11.0 : testing::internal::UnitTestImpl::RunAllTests()+0x40a
 /home/jenkins/workspace/unit-tests/build/lib/libgtestd.so.1.11.0 : bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*)+0x69
 /home/jenkins/workspace/unit-tests/build/lib/libgtestd.so.1.11.0 : bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*)+0x5e
 /home/jenkins/workspace/unit-tests/build/lib/libgtestd.so.1.11.0 : testing::UnitTest::Run()+0xcd
 test/matx_test : ()+0x18eda91
 test/matx_test : ()+0x18ed0b1
 /usr/lib/x86_64-linux-gnu/libc.so.6 : ()+0x29d90
 /usr/lib/x86_64-linux-gnu/libc.so.6 : __libc_start_main()+0x80
 test/matx_test : _start()+0x25
rv: Error in any(1 != 0)

Have you tried running them?

@ZelboK (Author) commented Aug 8, 2024

> Hi @ZelboK, it looks like we're getting an exception in the tests: [stack trace above]
>
> Have you tried running them?

Hi,

Let me take a look when I have time later, around Friday. Unfortunately I have less free time now than I'd like. I'll report back if I have trouble running the relevant tests. Curious to know: is the test pipeline only available through the NVIDIA VPN to employees? I am not able to access the link.

@cliffburdick (Collaborator)

> Let me take a look when I have time later, around Friday. [...] Curious to know: is the test pipeline only available through the NVIDIA VPN to employees? I am not able to access the link.

Hi @ZelboK, unfortunately no, there's no way you can view it, but I can look into that. You should be able to reproduce the exact same error by compiling the unit tests, then from your build directory run:

test/matx_test --gtest_filter="*0.Any*"

@ZelboK (Author) commented Aug 11, 2024

> You should be able to reproduce the exact same error by compiling the unit tests, then from your build directory run:
>
> test/matx_test --gtest_filter="*0.Any*"

@cliffburdick Let me take a look now. By any chance, are you ever on the NVIDIA developer Discord? I know a lot of other teams, like CCCL and CUTLASS, participate.

@cliffburdick (Collaborator)

> By any chance, are you ever on the NVIDIA developer Discord? I know a lot of other teams, like CCCL and CUTLASS, participate.

I didn't even know that existed, but I will try it out today or tomorrow.

@tmartin-gh (Collaborator)

Please check out PR #772 and PR #777, which make any() and all() use CUB's DeviceReduce/DeviceSegmentedReduce with custom compare operators. I think this PR can be closed now that those PRs are merged.
