Refactor to use thrust::reduce on any. #685
Open

ZelboK wants to merge 6 commits into NVIDIA:main from ZelboK:feat_any_reduce
Commits (6)
d0f3789  Refactor to use thrust::reduce on any. (ZelboK)
6dc560a  Account for the fact that some operators will not have a .Data() meth… (ZelboK)
15358ec  Clean up (ZelboK)
350b292  minor clean up (ZelboK)
55d78c0  revert changes to reduce.h and simplify design in any.h to use thrust (ZelboK)
b8ea6c2  remove changes to example (ZelboK)
@@ -798,17 +798,24 @@ template <typename T> class reduceOpMax {
  * Performs a reduction of two values of type T by returning 1 if either
  * of the values are non-zero.
  */
-template <typename T> class reduceOpAny {
+template <typename T>
+class reduceOpAny {
 public:
   using type = T; // This type is for Thrust
   using matx_reduce = bool;
   using matx_no_cub_reduce = bool; // Don't use CUB for this reduction type
-  __MATX_HOST__ __MATX_DEVICE__ __MATX_INLINE__ T Reduce(const T &v1, const T &v2)
-  {
+  __MATX_HOST__ __MATX_DEVICE__ __MATX_INLINE__ T operator()(const T &v1, const T &v2) const {
     return (v1 != 0) || (v2 != 0);
   }
-  __MATX_HOST__ __MATX_DEVICE__ __MATX_INLINE__ T operator()(T &v1, T &v2) { v1 = ((v1 != 0) || (v2 != 0)); return v1; }
-  __MATX_HOST__ __MATX_DEVICE__ __MATX_INLINE__ T Init() { return (T)(0); }
-  __MATX_DEVICE__ __MATX_INLINE__ void atomicReduce(T *addr, T val) { atomicAny(addr, val); }
+  __MATX_HOST__ __MATX_DEVICE__ __MATX_INLINE__ T Init() const {
+    return static_cast<T>(0);
+  }
+  __MATX_DEVICE__ __MATX_INLINE__ void atomicReduce(T *addr, T val) const {
+    atomicAny(addr, val);
+  }
 };

 /**

Inline review comment on operator(): needs to be const for Thrust.
ZelboK: So I'm not really that familiar with MatX. On line 79 I am doing a device_pointer_cast. IIUC, the Executor is what determines whether it's on host or device. I changed Executor ex to Executor to get past the warning-as-error that the parameter wasn't used.
cliffburdick: Hi @ZelboK, the Exec functions take any generic operator and call the transform function. In this case there's a separate transform overload for a CUDA executor and a host executor. Since the input can be any type of operator and not just a tensor, there may not be a Data() method, because the operator doesn't have to have memory backing it. For example, a user could pass ones; ones has no Data function since it has no memory backing it. So for this to work properly, it would have to use thrust's iterator interface and wrap the operator in that, as we do in other transforms. This is not a trivial change and can be a bit difficult if you're not familiar with the library.
ZelboK: @cliffburdick I've contributed a bit to CCCL actually, so I think I should be fine in that regard. That makes sense to me so far; I'll have to actually try implementing it to see how it goes. I presume you are referring to make_reduce_iterator, for example. Curious to know why this isn't trivial, though; is there something I'm missing?

Tangentially, is there a way to drastically reduce compile times? The feedback loop right now takes quite a long time. I run with these options and have tried commenting out tests, but it still takes a long time before I actually get to see errors from the compiler. I have a pretty decent CPU (i9-12900K) too.
cliffburdick: Hi @ZelboK, it might not be too hard then if you're familiar with CCCL and their iterators. We have our own iterator classes in iterator.h. Typically the way we use them is like this:
https://github.com/NVIDIA/MatX/blob/main/include/matx/transforms/cub.h#L720
We write a lambda to perform the function (CUB in this case), and ReduceInput wraps it in iterators and collapses it. The tricky part might be that we have not done this with thrust, though, so I don't know if our iterators are missing something needed to make it work there.

To reduce compile times, you shouldn't build everything each time. What I do is take an existing example like fft_conv.cu (or make a new one), put the code I'm testing in there, and compile just that with something like make fft_conv. That should compile in about 10 seconds on most machines, whereas compiling everything can approach an hour on weaker machines.

ZelboK: Thank you @cliffburdick, I test out my code in an example now and the feedback loop is much better. Much appreciated.