'VFT dispatching' to call into SIMD-ISA-specific code #2364
Thanks for sharing your result, which is of course welcome :)
Thank you! It has actually been a few years since we designed this. You might currently have the best understanding of dispatching and how the pieces fit together.
Yes, that's fair. Maybe HWY_TARGET_NAME, though we did want to discourage using it for anything other than a namespace. If I understand your new system correctly, the derived class calls the hn:: implementation directly, so once you have your

Very cool, congrats! I'd encourage you to also write up an intro on this system, it looks useful for when there are many functions to dispatch.

FYI some time ago we expanded in an unrelated direction: enabling dispatch of function templates, at least with
Claiming that would be presumptuous - I figured out how to use highway's mechanism, not quite how it actually does what it does. But I did notice that a lot of it works with the preprocessor, rather than relying on C++ language features like templates.
I have something in the pipeline, I'll post a link to a draft version once I have it online.
Dispatching this way is indeed very efficient - it's really just the load-the-vptr plus indirect call. The level at which this dispatch happens should not be too performance-critical, though: the stuff which really needs to be fast should all be inlined inside the performance-critical chunks of code, and the function-level dispatch should only happen when you call a routine which processes an entire array of data or such - as zimt does when it works over nD arrays of *xel data.

The lookup of the current target is an issue, though. I think we discussed this once already, but my memory fails me. The question is when such a lookup may occur at run-time. If there is a possibility that the available SIMD ISA changes from one call into the SIMD-ISA-specific code to the next, my method is not applicable without a dispatch pointer refetch - the code would crash with an illegal instruction if the current ISA does not provide the instructions. But I think this is a rare exception, and it would be hard to handle anyway: what if you've just figured out you can use AVX3, and when you proceed to call the AVX3 code the system puts your thread on a core which only has AVX2? You'd have to make sure the code proceeds without such switches until it's done, and this may be hard to nail down.

I started this thread to also get feedback from you guys on potential stumbling blocks, and this is certainly one - but you're in a better position to know whether it is indeed relevant, so I'd be glad to get advice beyond a bare 'it does happen'. When and on which platforms does it happen? Since I intend this level of dispatch for entire blocks of code rather than for individual SIMD operations or small sequences, it would be unproblematic to re-fetch the dispatch pointer before using it (losing the direct VFT call speed advantage) if the ISA switch can happen in mid-run. But if the ISA switch can occur between the acquisition of the dispatch pointer and its use immediately afterwards, it's a problem - though I doubt, even without investigating deeply, that your dispatching code is shielded against such extreme disruption.
Here's the text I've written on the topic: https://github.com/kfjahnke/zimt/blob/multi_isa/examples/multi_isa_example/multi_simd_isa.md

The 'multi_isa_example' folder also has example code: a program using the dispatch mechanism I proposed in my initial post. The .md file has a lengthy text which starts out by describing highway's dispatch mechanism and the layer of code I have added on top to use 'VFT dispatching'. The code and text in this folder describe the general how-to - zimt's own use of VFT dispatching is more involved (there's the zimt namespace to deal with as well) and it's still only half-finished as of this writing, with only the core of the zimt functionality accessible via VFT dispatching. I intend the multi_isa branch to evolve so that all zimt code can be dispatched this way while keeping the option of using the other SIMD library back-ends (Vc, std::simd, zimt's own 'goading' code) as an alternative. Once that's done, I'll merge it back to master.

If we can settle the open question about the ISA switching while a program is running (if that ever occurs), I think my method should be generally viable. I think my example code and the text will clarify precisely how VFT dispatching works - it's much more elaborate than my initial post here. Again, comments welcome!
I agree, but it's surprising, nice, and rare that more convenience actually comes with more speed.
I wouldn't worry too much about this. Intel has warned for 10+ years that CPUID info might become heterogeneous, but no one bothers to check. The most common case is where someone disables targets at runtime using a flag. This happens early on in main(), and users can arrange to call get_dispatch after that, so no problem.

Excellent article, thanks for putting this together! If you want to go into a bit more detail about the mechanism, we could expand on "deems most suitable at the time". Bonus: who initializes this bitfield? We don't want to get into the init order fiasco by setting it in a ctor. Instead we arrange that the first entry in the table is a special version of the user's code, which first sets the bitfield, then calls the user code. Subsequent dispatch will go straight to the user code.

Some typos:

Personally I'd stop before the SIMD_REGISTER step - some people like to minimize macros, and it might be useful to
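In case a code picture helps, here is a bare-bones model of that 'first table entry initializes' trick. This is not Highway's actual code - all names (table, chosen, detect_best_index, user_code_*) are made up purely to show the control flow described above:

```cpp
#include <cstdio>

using UserFn = void (*)(int);

// Stand-ins for the per-target compilations of the user's function.
void user_code_avx2(int x) { std::printf("AVX2 path: %d\n", x); }
void user_code_sse4(int x) { std::printf("SSE4 path: %d\n", x); }

// Index of the chosen entry; 0 means "not yet initialized".
static int chosen = 0;

void init_then_call(int x);

// Slot 0 is the special initializing version; the real targets follow.
static UserFn table[] = { init_then_call, user_code_avx2, user_code_sse4 };

int detect_best_index() { return 2; }  // stand-in for the real CPU query

void init_then_call(int x)
{
  chosen = detect_best_index();  // first set the field ...
  table[chosen](x);              // ... then call the user's code
}

// What the dispatch macro conceptually does on every call:
void dispatch(int x) { table[chosen](x); }

int main()
{
  dispatch(1);  // first call: runs init_then_call, then the chosen target
  dispatch(2);  // later calls: go straight to the chosen target
}
```

No static constructor is involved, so there is no init-order problem: the field is filled in lazily by the first dispatched call, and every later call only pays the table lookup.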
FYI Highway will remain open source. Google has a policy of not deleting open source projects.
Want to add a link to the code? We'd welcome adding this to the Highway g3doc. Or would you prefer a link in the readme?
That's a relief. I did suspect this was a no-go - it would be just too disruptive and make people even less likely to invest in coding with SIMD.
Fixed, thanks @jan-wassenberg!
I'm glad you approve of my tedious repetitive technical outpourings :-)
I'll think about it. I used this to good effect in lux. But of course it's an extra frill which isn't strictly necessary - maybe I'll reduce it to a hint that it can be done, rather than 'going all the way' in the example code. You're right that less may be more, and it should really be about getting the concept across, to advance SIMD acceptance.
Do you mean to the code in lux? lux' single cmake file is here, and the code about 'flavours' starts in line 318, as of this writing. You can see that's quite a mouthful. I'll be glad to get rid of all this cmake code once I have moved lux to use zimt with automatic ISA dispatching. If you mean the code for the article, it's here
Thanks for the offer. I think a link in the README would be more appropriate for now - I feel my text doesn't really qualify as documentation, it's more in the style of a blog post. But of course you can quote me if you like.

I do have another concrete question. In my example code, I haven't used HWY_PREFIX, HWY_BEFORE_NAMESPACE or HWY_AFTER_NAMESPACE. While the code seems to run okay on my machine, this may be problematic elsewhere. Can you shed some light on these macros? If they are necessary, I'd like to add them to my code.
Sounds good!
👍
Yes, so readers can see the complexity there.
We'll add a link for now. But no worries, I would not be shy about calling this documentation - there is certainly a place for an introduction.
Do you mean HWY_ATTR? These are definitely necessary: they are the mechanism by which pragma target is applied. Without that, you might only get baseline SIMD code. From the README:
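A minimal sketch of the two alternatives as I understand them (the project namespace, times_two and the buffers are placeholders): either bracket the per-target code with HWY_BEFORE_NAMESPACE()/HWY_AFTER_NAMESPACE(), which apply the target pragma to everything in between, or prefix each individual function with HWY_ATTR.

```cpp
#include <cstddef>
#include "hwy/highway.h"

HWY_BEFORE_NAMESPACE();  // enables the target's instructions for what follows
namespace project {
namespace HWY_NAMESPACE {

namespace hn = hwy::HWY_NAMESPACE;

void times_two(const float* in, float* out, std::size_t n)
{
  const hn::ScalableTag<float> d;
  const std::size_t N = hn::Lanes(d);
  for (std::size_t i = 0; i + N <= n; i += N)  // remainder handling omitted
  {
    const auto v = hn::LoadU(d, in + i);
    hn::StoreU(hn::Add(v, v), d, out + i);
  }
}

// The per-function alternative, used without the BEFORE/AFTER pair:
// HWY_ATTR void times_two(const float* in, float* out, std::size_t n) { ... }

}  // namespace HWY_NAMESPACE
}  // namespace project
HWY_AFTER_NAMESPACE();
```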
I've gone over the text some more and just committed an updated version. On this occasion, I had a look at the link you put in your README. Thanks for placing it so prominently! Work on zimt's multi_isa branch is continuing, I have a good example set up and documented in the text. Two observations:
The latter one is puzzling - the code is compute-heavy (repeated 1M evaluations of a 2D b-spline), so I would have thought that using AVX2 should speed things up. I haven't looked at the machine code yet to see if I've maybe made a mistake and my dispatch isn't working properly. Have you seen this happen in your tests?

Running b-spline code with zimt makes for good benchmarking code. Doing stuff like evaluating a b-spline of multi-channel data at 1M 2D coordinates and writing the results to memory is a 'realistic' workload, testing memory access, de/interleaving, gather/scatter and raw arithmetic speed due to many additions and multiplications. The addition of b-spline evaluation code to zimt is quite recent and wraps up my porting effort from the vspline library to zimt.
:) AVX2 being slower is surprising. Possible causes could be that the memory is only 16-byte aligned, or heavy use of shuffles, which are more expensive (3-cycle latency) when crossing 128-bit blocks, whereas they are still single-cycle on SSSE3.
It's a mixed picture I get. At times, with specific compiler flags, back-end and workload, g++ can produce a binary which outperforms everything else. But it doesn't do so consistently, and I usually see clang++ coming out on top. That's why I prefer it - and because its error messages are more human-friendly.

I think I've figured out why the better targets didn't run faster: I only optimized with -O2 and used a large-ish spline degree. I tried this morning with -O3 and cubic splines and got the expected performance increase going from SSE2 to AVX2 (I don't have a machine with AVX3). This also brought the results for tests compiled with clang++ and g++ closer together. With larger spline degrees the 'better' ISAs tend to perform worse, and I don't have a good idea why that would be.
That's a good hint - my test code uses splines of three-channel xel data, to mimic a typical image processing workload. Such xel data need to be de/interleaved to/from SoA configuration for SIMD processing, which likely uses shuffles.

There is one thing I notice with my back-ends which might merit a separate 'excursion issue': zimt uses C++ classes to 'harness' the 'naked' vectors. This goes beyond simply wrapping single vectors in objects - the objects contain several vectors (or their equivalent in memory), e.g. two or four. I find that this significantly influences performance, and here I have a good guess at what happens: I think the resulting machine code, which performs several equal operations in sequence, hides the latency needed to set up the SIMD instructions and/or makes it possible for the CPU to use pipelining more effectively. Using this technique, I do get performance gains, and I've been using it for years now to good effect in lux. It may help to squeeze even more performance out of a given CPU. Give it a shot - using zimt, or simply by processing small batches of vectors rather than individual ones.
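To make 'processing small batches of vectors' concrete, here is a bare-bones sketch (not zimt code - the function name and the factor-of-four batch are arbitrary choices): each iteration issues four independent load/multiply/store chains, which gives an out-of-order core independent work to hide latency with.

```cpp
#include <cstddef>
#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

void scale_batched(const float* in, float* out, std::size_t n, float factor)
{
  const hn::ScalableTag<float> d;
  const std::size_t N = hn::Lanes(d);
  const auto f = hn::Set(d, factor);
  std::size_t i = 0;
  // four independent vector operations per iteration
  for (; i + 4 * N <= n; i += 4 * N)
  {
    const auto a = hn::LoadU(d, in + i);
    const auto b = hn::LoadU(d, in + i + N);
    const auto c = hn::LoadU(d, in + i + 2 * N);
    const auto e = hn::LoadU(d, in + i + 3 * N);
    hn::StoreU(hn::Mul(a, f), d, out + i);
    hn::StoreU(hn::Mul(b, f), d, out + i + N);
    hn::StoreU(hn::Mul(c, f), d, out + i + 2 * N);
    hn::StoreU(hn::Mul(e, f), d, out + i + 3 * N);
  }
  for (; i + N <= n; i += N)  // leftover full vectors, one at a time
    hn::StoreU(hn::Mul(hn::LoadU(d, in + i), f), d, out + i);
  // scalar remainder omitted for brevity
}
```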
Yes, Load/StoreInterleaved3 does involve quite a few shuffles. Might be worth considering using 4 channels just for the faster interleaving :) I agree unrolling is often helpful. One concern about storing vectors in classes is that it's harder to guarantee alignment.
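To illustrate the 3- vs. 4-channel point with a rough sketch (made-up names, not code from Highway or zimt): the 4-channel variant trades roughly a third more memory traffic for the faster interleaving mentioned above.

```cpp
#include <cstddef>
#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

// nv = number of full vectors to process; remainder handling omitted.
void sum_channels_rgb(const float* rgb, float* out, std::size_t nv)
{
  const hn::ScalableTag<float> d;
  const std::size_t N = hn::Lanes(d);
  for (std::size_t i = 0; i < nv; ++i)
  {
    auto r = hn::Undefined(d), g = hn::Undefined(d), b = hn::Undefined(d);
    hn::LoadInterleaved3(d, rgb + 3 * i * N, r, g, b);  // needs more shuffles
    hn::StoreU(hn::Add(hn::Add(r, g), b), d, out + i * N);
  }
}

void sum_channels_rgba(const float* rgba, float* out, std::size_t nv)
{
  const hn::ScalableTag<float> d;
  const std::size_t N = hn::Lanes(d);
  for (std::size_t i = 0; i < nv; ++i)
  {
    auto r = hn::Undefined(d), g = hn::Undefined(d), b = hn::Undefined(d),
         a = hn::Undefined(d);                          // a is just padding
    hn::LoadInterleaved4(d, rgba + 4 * i * N, r, g, b, a);
    hn::StoreU(hn::Add(hn::Add(r, g), b), d, out + i * N);
  }
}
```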
Hi again!
I modified the examples.sh shell script, which compiles all examples, so that example files which use foreach_target and zimt's dispatch mechanism are compiled in an 'incarnation' using dynamic dispatch (once with clang++ and once with g++), while other examples, which still rely on picking a specific ISA by passing appropriate compiler arguments, are only compiled with the four zimt back-ends. If you have all back-ends installed, you can simply run
hm, hard to say - there's a lot of code. It can be that unrolling or branch prediction differs depending on code structure. Performance counters might be useful to narrow down where the difference lies.
It turned out that some of my code was not placed correctly into the ISA-specific nested namespaces, which resulted in sub-optimal performance. I've now managed to get the zimt library to fully cooperate with highway's foreach_target mechanism, and dynamically-dispatched versions run just as fast as single-ISA compiles.
Thanks for the updates, and congrats on the result that dynamic == single ISA speed :)
Yes, that was it, thanks for the pointer!
Hi! This is more of a little excursion than a 'true' issue, but it's about a technique which I've found useful and would like to share. The occasion is that I'm extending my library zimt to use highway's foreach_target mechanism.
My first remark - before I start out on the issue proper - is about this mechanism. I knew it was there, I thought it might be a good idea to use it, but the documentation was thin and I had a working solution already. Before I turned my attention to zimt again this autumn, I did a lot of reading in the SIMD literature, and I also decided to have a closer look at highway's foreach_target mechanism. Lacking extensive documentation, I sat myself down and read the code. Only then did I realize just how well-thought-out and useful it actually is. Yes, using it is slightly intrusive to the client code, but you've really done a good job of hiding the complexity and making it easy to 'suck' code into the SIMD-ISA-specific nested namespaces and dispatch to it.

But here I do actually have some criticism: to figure that out, I had to read and understand the code! It would have been much easier had there been some sort of technical outline, paper or such to explain the concept. This criticism goes beyond this specific topic - I think you'd be well advised to improve the documentation, to address a wider user base.
My first step in introducing highway's multi-ISA capability into zimt was to introduce 'corresponding' nested namespaces in both my library's 'zimt' namespace and in the 'project' namespace (let's use this one for user-side code). I initially had a hard time figuring out the namespace scheme, probably because of the name of the central macro, HWY_NAMESPACE. The naming is unfortunate - of course it's a symbol for a namespace, but a name like HWY_SIMD_ISA would have hinted at its semantics, not at its syntax.

With the namespaces set up, I could use foreach_target.h and dynamic dispatch. But I found the way of introducing the ISA-specific code via a free function verbose, so I tried to figure out a way to make this more concise and manageable. I 'bent' some of the code I used in lux to the purpose, and this is where I come to 'VFT dispatching'. The concept is quite simple:
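Roughly, a self-contained sketch looks like this (the file name vft_dispatch.cc and the identifiers dispatch_base, dispatch, payload and get_dispatch are placeholders for this illustration, not zimt's actual names): an abstract base class declares one pure virtual function per operation needing ISA-specific code, each SIMD-ISA-specific nested namespace derives a class overriding them, and a per-target get_dispatch() hands the derived object out as a base-class pointer.

```cpp
// vft_dispatch.cc - placeholder names throughout
#include <cstddef>

// The base class must be visible to every re-inclusion pass, but defined
// only once, hence the hand-written guard.
#ifndef DISPATCH_BASE_DEFINED
#define DISPATCH_BASE_DEFINED
struct dispatch_base
{
  virtual void payload(const float* in, float* out, std::size_t n) const = 0;
  virtual ~dispatch_base() {}
};
#endif

#undef HWY_TARGET_INCLUDE
#define HWY_TARGET_INCLUDE "vft_dispatch.cc"  // this very file
#include "hwy/foreach_target.h"  // re-includes this file once per target
#include "hwy/highway.h"

HWY_BEFORE_NAMESPACE();
namespace project {
namespace HWY_NAMESPACE {

namespace hn = hwy::HWY_NAMESPACE;

// One derived class per SIMD-ISA-specific nested namespace.
struct dispatch : public dispatch_base
{
  void payload(const float* in, float* out, std::size_t n) const override
  {
    const hn::ScalableTag<float> d;
    const std::size_t N = hn::Lanes(d);
    const auto two = hn::Set(d, 2.0f);
    for (std::size_t i = 0; i + N <= n; i += N)  // remainder omitted
      hn::StoreU(hn::Mul(hn::LoadU(d, in + i), two), d, out + i);
  }
};

// Hands out this ISA's dispatcher as a base-class pointer.
const dispatch_base* get_dispatch()
{
  static const dispatch d;
  return &d;
}

}  // namespace HWY_NAMESPACE
}  // namespace project
HWY_AFTER_NAMESPACE();

#if HWY_ONCE
namespace project {

HWY_EXPORT(get_dispatch);

// Compiled once: picks the best target's get_dispatch at run time.
const dispatch_base* get_dispatch()
{
  return HWY_DYNAMIC_DISPATCH(get_dispatch)();
}

}  // namespace project
#endif  // HWY_ONCE
```

The part under #if HWY_ONCE is compiled only once; it uses HWY_EXPORT/HWY_DYNAMIC_DISPATCH merely to pick which nested namespace's get_dispatch to call, and from then on everything goes through the base class.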
So this is where the 'VFT' in 'VFT dispatching' comes from: it uses the virtual function table of a class with virtual functions. The language guarantees that a call through the base class pointer lands in the override of the actual derived class - in practice, the VFTs of all classes derived from the base share the same layout (otherwise the mechanism could not function). What do I gain? The base class pointer is a uniform handle to a - possibly large - set of functions I want to keep ISA-specific versions of. Dispatching is as simple as calling through the dispatcher base class pointer, so once I have obtained it, it serves as a conduit:
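Continuing with the placeholder names from the sketch above, the calling side might look like this (it could sit in the #if HWY_ONCE section, or in a separate TU that only sees the base class and a declaration of project::get_dispatch):

```cpp
#include <cstdio>

int main()
{
  float in[16], out[16];
  for (int i = 0; i < 16; ++i) in[i] = float(i);

  // Fetch the base-class pointer once ...
  const dispatch_base* dp = project::get_dispatch();
  // ... then every call is just a vptr load plus an indirect call.
  dp->payload(in, out, 16);
  std::printf("out[3] = %f\n", out[3]);
  return 0;
}
```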
This is more or less it, with one more little twist which I also found useful. In a first approach, I wrote out the declaration of the pure virtual member function in the base class, and again the declaration (now no longer pure) in the derived, ISA-specific, class. This is error-prone, so I now use an interface header, introducing the member functions via a macro. In a header 'interface.h' I put macro invocations only:
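As an illustration with made-up names (the macro and the functions are not zimt's), such an interface header might look like this - deliberately without an include guard, because it is meant to be included several times:

```cpp
// interface.h - one macro invocation per dispatched member function:
// name, return type, parenthesized argument list.
VFT_FUNC(payload, void, (const float* in, float* out, std::size_t n))
VFT_FUNC(sum, float, (const float* in, std::size_t n))
```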
Then I can #include this header into the class declarations, #defining the macro differently:
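Again as a sketch with the same placeholder names:

```cpp
// In the base class: every entry becomes a pure virtual declaration.
struct dispatch_base
{
  #define VFT_FUNC(NAME, RET, ARGS) virtual RET NAME ARGS const = 0;
  #include "interface.h"
  #undef VFT_FUNC

  virtual ~dispatch_base() {}
};

// In each ISA-specific derived class (inside HWY_NAMESPACE), the same
// entries become override declarations; the definitions follow per target.
struct dispatch : public dispatch_base
{
  #define VFT_FUNC(NAME, RET, ARGS) RET NAME ARGS const override;
  #include "interface.h"
  #undef VFT_FUNC
};
```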
This ensures that the declarations are consistent. For the actual implementation, the signature has to be written out once more, but since there is a declaration, providing a definition with a different signature is an error, and when providing the implementation, coding with the signature 'in sight' is advisable anyway - especially when the argument list becomes long. The 'interface.h' header provides a good reference to the set of functions using the dispatcher, and additional dispatch-specific functionality can be coded for the lot. I think it makes a neat addition to VFT dispatching.
To wrap up, I'd like to point out that this mechanism is generic and can be used to good effect for all sorts of dispatches - if appropriate specific derived dispatch classes are coded, along with a mechanism to pick a specific one, it can function quite independently of highway's dispatch. It can also be used to 'pull in' code which doesn't even use highway - e.g. code with vector-friendly small loops ('goading') relying on autovectorization, which will still benefit from being re-compiled several times with ISA-specific flags, be it with highway's foreach_target or by setting up separate TUs with externally supplied ISA-specific compiler flags. The latter is what I currently do in lux, but it requires quite a bit of 'scaffolding' code in cmake.
Comments welcome! I hope you find this useful - I intended to share useful bits here every now and then and it's been a while since the last one (about goading), but better late than never. If you're interested, you can have a peek into zimt's new multi_isa branch, where I have a first working example using the method (see linspace.cc and driver.cc in the examples section). If you don't approve of my intruding into your issue space, let me know.