-
Notifications
You must be signed in to change notification settings - Fork 122
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make vectors trivially relocatable. #337
base: main
Are you sure you want to change the base?
Make vectors trivially relocatable. #337
Conversation
Signed-off-by: Yuya Asano <[email protected]>
@AlexMWells Sorry to drag you into this, but I know that in the past you have gone super deep on how tweaks to these classes can help or inhibit vectorization in loops. Is there anything here that would concern you? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These look fine to me, though I pinged Alex Wells for an opinion as well, I want to make sure that these changes aren't going to disrupt any SIMD vectorization that we depend on in OSL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm.
Historically with vectorizing compilers, a data type can have its members turned into scalars with a "scalar replacement of aggregates" optimization pass. Once a data type has become scalars, its members can live in registers vs. being forced to standard data layout and being kept on the stack. This is important to minimize the movement between stack and registers when vectorizing as that is when an AOS/SOA transformation might be required and generate extra instructions. One thing that can disqualify SROA is passing the address of the structure to a function call, as that would require the standard data layout and the object exist on the stack vs registers. Now the wrinkle is the "default" copy constructor and possibly other default operators is that a compiler could implement a memcpy(*this, &src, sizeof(T). And the taking the address of "src" could disqualify SROA and end up forcing it to exist on the stack in standard data layout. Perhaps another optimization pass would convert the memcpy to assignments before SROA is applied, but those can often be specialized (just for 2 or 3 data members in a specific pattern). So defensive programming technique was to explicitly provide those constructors and operators with member-wise assignment/copies vs. letting the default possibly use memcpy/memset. Now fast forward many years and std::is_trivially_copyable was introduced in c++11 and then compilers/libraries start adding it to memcpy and other places with warnings, etc. In short I think you might want to keep both ways of having those default or not controllable by a define. |
Thank you for your advices. |
Signed-off-by: Yuya Asano <[email protected]>
config/ImathConfig.h.in
Outdated
@@ -174,4 +174,7 @@ | |||
# endif | |||
#endif | |||
|
|||
// Whether the copy and move constructors of Vec2, Vec3 and Vec4 are default or not. | |||
#cmakedefine IMATH_VEC_USE_DEFAULT_CONSTRUCTOR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think instead we want
#cmakedefine IMATH_VEC_USE_DEFAULT_CONSTRUCTOR | |
#ifndef IMATH_VEC_USE_DEFAULT_CONSTRUCTOR | |
#define IMATH_VEC_USE_DEFAULT_CONSTRUCTOR 1 | |
#endif |
Because it's easy to imagine that a particular downstream app needs to make sure it's the other way, and this would allow it to be overridden if the downstream modules defines the symbol as 0 before including any imath headers.
Then the various places where you have
#ifdef IMATH_VEC_USE_DEFAULT_CONSTRUCTOR
should be
#if IMATH_VEC_USE_DEFAULT_CONSTRUCTOR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fly in the ointment, if we go both ways, are we inviting an recurring support burden along these lines?
I can't make my build work with MegaDCC",
oh right, did you match IMATH_VEC_USE_DEFAULT_CONSTRUCTOR to what they picked?
but that's in conflict to our internal build
yeah but did you namespace?
we don't want to have to have two copies
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The OPENEXR_DLL thing was sort of a harbinger, I'm worried about a matrix of macros people have to solve for compatibility
or maybe I'm misunderstanding completely, and it's all inlined and completely doesn't matter which you pick
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think because it's inlined it's all ok? There shouldn't be any link issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that's the case, great. Then my only request is that we document why we have it both ways, even if it is a copy and paste of @AlexMWells thorough notes above. This is strongly in the realm of "gosh, this is rather esoteric, why is this a choice, and how do I make it?" I'd prefer the doc is inline in the CMakeLists.txt where you are faced with making a choice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An example, well with an older compiler, but just wanted to show the behavior.
https://godbolt.org/z/rEY4Ef5GM
With #define VEC_ENABLE_DEFAULT_CONSTRUCTOR 1
The optimization report shows:
remark #15518: Structure assignment was serialized [ <source>(145,14) ]
remark #15518: Structure assignment was serialized [ <source>(147,14) ]
remark #15518: Structure assignment was serialized [ <source>(149,14) ]
remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
remark #15450: unmasked unaligned unit stride loads: 11
remark #15451: unmasked unaligned unit stride stores: 4
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 367
remark #15477: vector cost: 125.870
remark #15478: estimated potential speedup: 2.490
remark #15488: --- end vector cost summary ---
...
<source>(145,14):remark #34000: call to memcpy implemented inline with loads and stores with proven source
Structure assignment was serialized really kills performance, and we can see the compiler by "default" decided to generate a memcpy.
By providing the copy constructor and assignment operator explicitly we avoid the memcpy or memset (for initializations) and with #define VEC_ENABLE_DEFAULT_CONSTRUCTOR 0
The optimization report shows:
remark #15309: vectorization support: normalized vectorization overhead 0.352
remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
remark #15450: unmasked unaligned unit stride loads: 11
remark #15451: unmasked unaligned unit stride stores: 4
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 201
remark #15477: vector cost: 44.370
remark #15478: estimated potential speedup: 3.350
Now this is all very dependent on compiler and order of optimizations, and perhaps isn't as required on newer compilers. But this is just a simple example, real world kernels could have much larger state being kept on the stack and any nested aggregate that happens to use a memcpy or memset by default could prevent the SROA pass from fully succeeding.
Thus it would be nice for projects like OSL to choose to disable the use of default constructors. And if the supported compiler matrix has moved on to where it is not a problem then it could chose to enable them. Problem is examining optimization reports before and after is required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So that sounds like explicitly defining copy and assignment, and not using default, we get the behavior that we want, which is that things optimize as we wish. "we" includes OSL. I can't think of a reason to provide this as one option, and default constructors and copy another option.
I am rereading this argument which suggests maybe we need the option,
but I'm not understanding a case to preserve the default option. Sorry if I'm being dense....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason why I made this PR was because I wanted to copy Vec3
sequences between host and device using thrust::async::copy
. It requires Vec3
to be trivially relocatable, so default constructors and assignment operators for copy and move are needed for me.
I'm sorry for not telling it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My apologies, I didn't mean to shut down your effort. Maybe I am misunderstanding? I thought that adding the explicit copy and move gives us the trivial relocatability you are searching for, and also satisfies the SROA concern that @AlexMWells raises? I thought we can solve both your need, and also the SIMD concern at the same time, without needing to have a special conditional. My intent wasn't to stop your effort, I was hoping we can find one solution that solves both problems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At first, I apologize for my quick but rough decision.
I misunderstood I couldn't solve the assertion error, but I noticed that it was caused by ODR violation. Though I'm running CI in my forked repository, it will be OK.
I think that the trivial relocatability and the SROA are incompatible because the trivial relocatability requires Vec3
not to have user-provided copy constructor and assignment operator(even though Vec3::Vec3(const Vec3& v) : x(v.x), y(v.y), z(v.z) {}
), but on the other hand, providing the copy constructor and assignment operator increases the performance as @AlexMWells showed earlier.
So, we have to switch default
or not in some way.
Thank you :)
Signed-off-by: Yuya Asano <[email protected]>
This reverts commit abc7c5e. Signed-off-by: Yuya Asano <[email protected]>
Signed-off-by: Yuya Asano <[email protected]>
Signed-off-by: Yuya Asano <[email protected]>
I fixed codes following an advice, but an assertion error occured. One of the causes seems that |
Signed-off-by: Yuya Asano <[email protected]>
This reverts commit 5d4ed11. Signed-off-by: Yuya Asano <[email protected]>
Signed-off-by: Yuya Asano <[email protected]>
I found that only changing copy constructor from user-defined to default caused an assertion error, so I give up making |
Signed-off-by: Yuya Asano <[email protected]>
Here is the summary as I understand it: Using But there are a few specialized situations where it could sabotage certain auto-vectorization cases. Those situations can be locally fixed if there's a way to include the header files in such a way that it doesn't use the =default versions in a module-local way (say, by defining a symbol before including the Imath header). OSL is one of these. If Imath itself does not allow this, we might have to do something even uglier like replicate the Imath headers inside OSL and modify them to allow it. |
I'm sorry, but what is OSL? |
Open Shading Language |
Thank you! |
We discussed this in today's technical steering committee meeting. We support this addition in principle, the ultimate objective seems sound, but we're concerned about adding additional complexity through another batch of #if's in the code without a thorough analysis of the benefits. As @AlexMWells points out, the actual performance impact is complicated to assess for all cases. @sukeya, can you describe more about the motivation for the change? Was this motivated by a specific use case? Did you do a performance analysis that quantifies the impact? We also discussed briefly whether the same objective could be achieved by adding the trivially_relocatable attribute to the class definition itself, rather than adding new class methods. We are planning a v3.2 release in mid-August, which will be cut from the main branch, so we'd like to wait until after that before merging the PR. That will provide more time for evaluation before official release. Thanks for the contribution and thanks for your patience. |
I see. I will do a performance analysis. I talk about the motivation for the change. I have been writing a snow simulator which reads polygon meshes from an Alembic file exported by Blender. I think that the exported file generally stores a pair of a transform matrix ( Thank you. |
Adding the trivially_relocatable attribute is good for me. |
I have done a performance analysis. The result is the following:
The environment is the following:
The test data is a pair of 1 million 3 dimension vector generated random and a 4 dimension matrix. Let The programs needs oneTBB(2021.5.0 or higher) and CUDA(12.2), and must be cloned recursively. I didn't calculate the deviation of each test case, but felt that each sample in the test cases using CPU and |
Can you clarify what exactly is the difference between "default" and "user defined"? |
I see.
"user defined" means that vectors use current implementation of said functions. |
The performance difference on GPU is striking. |
// Whether the copy and move constructors of Vec2, Vec3 and Vec4 are default or not. | ||
#ifndef IMATH_VEC_BE_TRIVIALLY_RELOCATABLE | ||
# define IMATH_VEC_BE_TRIVIALLY_RELOCATABLE 0 | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Author of P1144 here. :) I think it's great you're adopting the terminology — and orthogonally great that you're =default
'ing all the things — but, looking at this PR, I think the focus on the word "relocatable" is actually a distraction. You should just say IMATH_VEC_RULE_OF_ZERO
or something like that. After your changes, it is true that a Vec2<T>
will be trivially relocatable if T
is trivially relocatable; but it will also be trivially copyable if T
is trivially copyable; trivially copy-assignable if T
is trivially copy-assignable; and so on. There's nothing relocation-specific about this PR as far as I can tell. (Am I missing something?)
Followup: Why is this gated on a macro at all? The =default
changes look like a strict improvement to me. If the non-defaulted implementations were in a .cpp file, so they were only compiled once, then there might be a reason to keep them; but nope, they're inline constexpr
in the header! I think you should consider very hard whether there's any reason to ever set the macro to anything but 1
. (...Is it that you need to have an opt-in way to avoid the ABI break that comes with Vec2 suddenly being passed in registers?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI @sukeya @cary-ilm, there is right now this week a PR to add fully-P1144-compliant __is_trivially_relocatable(T)
to Clang (fixing a long-standing false-positive bug on Windows and also making it respect user-defined special members including operator=
so as to avoid false positives on all platforms). If you're interested (professionally, financially, or just plain interested) in P1144 trivial relocatability, I'd really appreciate it if you'd take a look at llvm/llvm-project#84621 and weigh in with your thoughts. (Full disclosure: my impression is that the false-positives being fixed don't actually matter to @sukeya's use-case of thrust::async::copy
. But I'd still really appreciate your taking a look, if you have interest in moving P1144 forward at all. It [and I mean that PR, specifically!] needs to see some customer demand in order to progress.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your advice! I commited your patches.
I'm busy now, but I'll see that PR when I have a time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your advice! I committed your patches.
I'm busy now, but I'll see that PR when I have a time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sukeya Thank you! I'd really appreciate it if you leave a 👍 reaction now, so you'll receive notifications that remind you to add a fuller comment when you have the time.
@lgritz Since you & I debated it earlier in this thread, a half year ago, and pursuant to @Quuxplusone's comments, I'm still of the opinion that we shouldn't guard default with |
Co-authored-by: Arthur O'Dwyer <[email protected]> Signed-off-by: Yuya Asano <[email protected]>
rm IMATH_CONSTEXPR14 and IMATH_NOEXCEPT from default constructor and assignment. Co-authored-by: Arthur O'Dwyer <[email protected]> Signed-off-by: Yuya Asano <[email protected]>
rm IMATH_NOEXCEPT from default destructor. Co-authored-by: Arthur O'Dwyer <[email protected]> Signed-off-by: Yuya Asano <[email protected]>
@meshula Do we not still have the problems that @AlexMWells detailed, where the |
The website check failed because of a change in required versions at readthedocs. This is fixed in the main branch. @sukeya , could you possibly merge the main branch with yours to pick up those fixes? |
@Quuxplusone and @sukeya, I need some more time to refresh my understanding of the proposal, but I think the consensus is it's headed in the right direction. We'd ideally like to have an analysis like @AlexMWells describes above. Some of the current constructs were set in place after such a careful expert analysis, although that was a long time ago with different compilers. It's definitely worth modernizing, but also good to confirm there aren't unintended consequences. |
FWIW, I believe I see now the scenario that @AlexMWells is worried about: Imagine a compiler that does escape analysis after inlining, and also optimizes a trivial constructor into memcpy, and also its escape analysis does not understand the semantics of memcpy (i.e. that its pointer arguments do not escape). Then we'd have this: https://godbolt.org/z/YjKcd1Pf1 Notice that Clang optimizes the return; but if you change I don't understand precisely how this kind of thing would interact with vectorization, but based on these observations I'd be surprised if memcpy interacted poorly with anything (in a recent enough compiler). I know another surprising pitfall where if a library skips running a trivial (pseudo-)destructor, lifetime analysis might not realize the object is dead; but my test case for that got fixed in GCC 11, and only ever happened on GCC AFAIK anyway. I bring it up merely as another case where making something more trivial can cause worse performance — but I'm quite confident that I wouldn't let that case stop me from making anything more trivial. Anyway, my main contribution to this PR is to say that "trivially relocatable" should just be "trivial" everywhere it appears in this PR; and to invite people interested in relocatability to comment in support of the Clang implementation llvm/llvm-project#84621 . :) |
Hello,
I'm sorry for making the same issue three times.
I made Vec2, Vec3 and Vec4 trivially relocatable.
I didn't change the behavior of them.
Could you please merge this PR?