
'VFT dispatching' to call into SIMD-ISA-specific code #2364

Open
kfjahnke opened this issue Oct 28, 2024 · 15 comments

@kfjahnke

Hi! This is more of a little excursion than a 'true' issue, but it's about a technique which I've found useful and would like to share. The occasion is that I'm extending my library zimt to use highway's foreach_target mechanism.

My first remark - before I start out on the issue proper - is about this mechanism. I knew it was there, I thought it might be a good idea to use it, but the documentation was thin and I had a working solution already. Before I turned my attention to zimt again this autumn, I did a lot of reading in the SIMD literature, and I also decided to have a closer look at highway's foreach_target mechanism. Lacking extensive documentation, I sat myself down and read the code. Only then did I realize just how well-thought-out and useful it actually is. Yes, using it is slightly intrusive to the client code, but you've really done a good job of hiding the complexity and making it easy to 'suck' code into the SIMD-ISA-specific nested namespaces and dispatch to it. But here I do have some criticism: to figure that out, I had to read and understand the code! It would have been much easier had there been some sort of technical outline, paper, or such, to explain the concept. This criticism goes beyond this specific topic - I think you'd be well advised to improve the documentation, to address a wider user base.

My first step in introducing highway's multi-ISA capability into zimt was to introduce 'corresponding' nested namespaces in both my library's 'zimt' namespace and in the 'project' namespace (let's use this one for user-side code). I had a hard time initially figuring out the namespace scheme, probably because of the name of the central macro, HWY_NAMESPACE. The naming is unfortunate - of course it's a symbol for a namespace, but a name like HWY_SIMD_ISA would have hinted at its semantics, not at its syntax. With the namespaces set up, I could use foreach_target.h and dynamic dispatch. But I found the way of introducing the ISA-specific code via a free function verbose, so I tried to figure out a way to make this more concise and manageable. I 'bent' some of the code I used in lux to the purpose, and this is where I come to 'VFT dispatching'. The concept is quite simple:

  • Instead of using free functions, I work with virtual member functions in a 'dispatch' object
  • The dispatch object's base class declares these as pure virtual member functions
  • In each ISA-specific nested namespace I inherit from the dispatch base class and provide the implementations
  • I have a 'get_dispatch' routine (using highway's dynamic dispatch) which yields a dispatch base class pointer, which in fact points to the derived class with the ISA-specific implementations

So this is where the 'VFT' in 'VFT dispatching' comes from: it uses the virtual function table of a class with virtual functions. The language guarantees that the VFTs of all classes derived from the base have the same layout (otherwise the mechanism could not function). What do I gain? The base class pointer is a uniform handle to a - possibly large - set of functions I want to keep ISA-specific versions of. Dispatching is as simple as calling through the dispatcher base class pointer, so once I have obtained it, it serves as a conduit:

// inside code submitted to foreach_target I have:

struct _dispatch
: public dispatch // inherit from dispatch base class
{
   ...
} ;

static const _dispatch local_dispatch ; // instantiate the derived class

const dispatch * _get_dispatch() // install the ISA-specific fetch routine
{
  return & local_dispatch ; // return type will be cast to base class, as declared
}

// in the main body of code (in HWY_ONCE) I use highway's dynamic dispatch:

HWY_EXPORT(_get_dispatch);

const dispatch * get_dispatch()
{
  return HWY_DYNAMIC_DISPATCH(_get_dispatch)() ;
}

// with a member function 'foo' in the dispatcher, I can now dispatch like this:

auto dp = get_dispatch() ;
bar = dp->foo() ; // calls the ISA-specific variant

This is more or less it, with one more little twist which I also found useful. In a first approach, I wrote out the declaration of the pure virtual member function in the base class, and again the declaration (now no longer pure) in the derived, ISA-specific, class. This is error-prone, so I now use an interface header, introducing the member functions via a macro. In a header 'interface.h' I put macro invocations only:

// register int dummy ( float z ) for dispatch:

ZIMT_REGISTER(int, dummy, float z)

Then I can #include this header into the class declarations, #defining the macro differently:

// the dispatch base class is coded like this:
struct dispatch
{
#define ZIMT_REGISTER(RET,NAME,...) virtual RET NAME ( __VA_ARGS__ ) const = 0 ;
#include "interface.h"
#undef ZIMT_REGISTER
} ;
// the ISA-specific derived class (inside code submitted to foreach_target):
struct _dispatch
: public dispatch
{
#define ZIMT_REGISTER(RET,NAME,...) RET NAME ( __VA_ARGS__ ) const override ;
#include "interface.h"
#undef ZIMT_REGISTER
} ;
// followed, inside the same nested namespace, by the definition(s)
int _dispatch::dummy (float z) const
{
  return 23 ;
}

This ensures that the declarations are consistent. For the actual implementation, the signature has to be written out once again, but since there is a declaration, providing a definition with a different signature is an error, and when providing the implementation, coding with the signature 'in sight' is advisable anyway - especially when the argument list becomes long. The 'interface.h' header provides a good reference to the set of functions using the dispatcher, and additional dispatch-specific functionality can be coded for the lot. I think it makes a neat addition to VFT dispatching.

To wrap up, I'd like to point out that this mechanism is generic and can be used to good effect for all sorts of dispatches - if appropriate specific derived dispatch classes are coded along with a mechanism to pick a specific one, it can function quite independently of highway's dispatch. It can also be used to 'pull in' code which doesn't even use highway - e.g. code with vector-friendly small loops ('goading') relying on autovectorization, which will still benefit from being re-compiled several times with ISA-specific flags, be it with highway's foreach_target or by setting up separate TUs with externally supplied ISA-specific compiler flags - this is what I currently do in lux, but it requires quite some 'scaffolding' code in cmake.

Comments welcome! I hope you find this useful - I intended to share useful bits here every now and then and it's been a while since the last one (about goading), but better late than never. If you're interested, you can have a peek into zimt's new multi_isa branch, where I have a first working example using the method (see linspace.cc and driver.cc in the examples section). If you don't approve of my intruding into your issue space, let me know.

@jan-wassenberg
Member

Thanks for sharing your result, which is of course welcome :)

I sat myself down and read the code. Only then I realized just how well-thought-out and useful it actually is. Yes, using it is slightly intrusive to the client code, but you've really done a good job to hide the complexity and make it easy to ..

Thank you! It has actually been a few years since we designed this. You might currently have the best understanding of dispatching and how the pieces fit together.
If you'd be willing to write an overview in g3doc/, in Markdown format, I think this would be useful for others and believe you'd be great at it because you write thorough comments.

a name like HWY_SIMD_ISA would have hinted at it's semantics

Yes, that's fair. Maybe HWY_TARGET_NAME, though we did want to discourage using it for anything other than a namespace.

If I understand your new system correctly, the derived class calls the hn:: implementation directly, so once you have your
base-class pointer, the dispatch is just one load of the vptr, plus one indirect call through it. This may actually
be faster than our dispatch, which looks up the current enabled targets (with the advantage that users can actually
change that at runtime, which does happen).

Very cool, congrats! I'd encourage you to also write up an intro on this system, it looks useful for when there are many functions to dispatch.

FYI, some time ago we expanded in an unrelated direction: enabling dispatch of function templates, at least with
one template argument (due to MSVC macro limitations that could be worked around if there is demand for it).
These are HWY_EXPORT_AND_DYNAMIC_DISPATCH_T and related macros.

@kfjahnke
Author

You might currently have the best understanding of dispatching and how the pieces fit together.

Claiming that would be presumptuous - I figured out how to use highway's mechanism, not quite how it actually does what it does. But I did notice that a lot of it works with the preprocessor, rather than relying on C++ language features like templates.

If you'd be willing to write an overview in g3doc/, in Markdown format, I think this would be useful for others and believe you'd be great at it because you write thorough comments.

I have something in the pipeline, I'll post a link to a draft version once I have it online.

If I understand your new system correctly, the derived class calls the hn:: implementation directly, so once you have your
base-class pointer, the dispatch is just one load of the vptr, plus one indirect call through it. This may actually
be faster than our dispatch, which looks up the current enabled targets (with the advantage that users can actually
change that at runtime, which does happen).

Dispatching this way is indeed very efficient, it's really just the load-the-vptr plus indirect call. The level at which this dispatch happens should not be too performance-critical, though - the stuff which really needs to be fast should all be inlined inside the chunks of code which contain performance-critical code, and the function-level dispatch should only happen when you do stuff like call a routine which processes an entire array of data or such - as zimt does when it works over nD arrays of *xel data.

The lookup of the current target is an issue, though. I think we discussed this once already, but my memory fails me. The question is when such a lookup may occur at run-time. If there is a possibility that the available SIMD ISA changes from one call into the SIMD-ISA-specific code to the next, my method is not applicable without a dispatch pointer refetch - the code would crash with an illegal instruction if the current ISA does not provide the instructions. But I think this is a rare exception, and it would be hard to handle anyway: what if you've just figured out you can use AVX3, and when you proceed to call the AVX3 code the system puts your thread on a core which only has AVX2? You'd have to make sure the code proceeds without such switches until it's done, and this may be hard to 'nail down'. I started this thread to also get feedback from you guys on potential stumbling blocks, and this is certainly one - but you're in a better position to know whether this is indeed relevant, so I'd be glad to get some advice, beyond 'which does happen'. When and on which platforms does it happen?

Since I intend this level of dispatch for entire blocks of code rather than for individual SIMD operations or small sequences, it would be unproblematic to re-fetch the dispatch pointer before using it (losing the direct VFT call speed advantage), if the ISA switch can happen in mid-run-time. But if the ISA switch can occur between the acquisition of the dispatch pointer and its use immediately afterwards, it's a problem - I doubt, though - even without investigating deeply - that your dispatching code is shielded against such extreme disruption.

@kfjahnke
Author

Here's the text I've written on the topic:

https://github.com/kfjahnke/zimt/blob/multi_isa/examples/multi_isa_example/multi_simd_isa.md

The 'multi_isa_example' folder also has example code, a program using the dispatch mechanism I have proposed in my initial post. The .md file has a lengthy text which starts out by describing highway's dispatch mechanism and the layer of code I have added on top to use 'VFT dispatching'. The code and text in this folder describe the general how-to - zimt's own use of VFT dispatching is more involved (there's the zimt namespace to deal with as well) and it's still only half-finished as of this writing, with only the core of the zimt functionality accessible via VFT dispatching. I intend the multi_isa branch to evolve so that all zimt code can be dispatched this way but keeps the option of using the other SIMD library back-ends (Vc, std::simd, zimt's own 'goading' code) as an alternative. Once that's done, I'll merge it back to master.

If we can settle the open issue about the ISA switching while a program is running (if that ever occurs), I think my method should be generally viable. I think my example code and the text will clarify precisely how VFT dispatching works - it's much more elaborate than my initial post here. Again, comments welcome!

@jan-wassenberg
Member

The level at which this dispatch happens should not be too performance-critical, though

I agree, but it's surprising, nice, and rare that more convenience actually comes with more speed.

If there is a possibility that the available SIMD ISA changes from one call into the SIMD-ISA-specific code to the next, my method is not applicable without a dispatch pointer refetch

I wouldn't worry too much about this. Intel has warned for 10+ years that CPUID info might become heterogeneous, but no one bothers to check. The most common case is where someone disables targets at runtime using a flag. This happens early on in main(), and users can arrange to call get_dispatch after that, so no problem.

Excellent article, thanks for putting this together!
It builds up nicely and makes each step clear.

If you want to go into a bit more detail about the mechanism, we could expand on "deems most suitable at the time".
Under the hood, we have a bitfield of targets, which by convention has lowest=best, so we can use the bit-scan instruction
to find the index of the first/best bit, and use that as an index into a table of function pointers. Minor wrinkle: because
some targets' indices start above zero, we subtract the per-platform starting index so that the tables can be smaller.

Bonus: who initializes this bitfield? We don't want to get into the init order fiasco by setting it in a ctor. Instead we arrange that the first entry in the table is a special version of the user's code, which first sets the bitfield, then calls the user code. Subsequent dispatch will go straight to the user code.

Some typos:
dacades -> decades
get a bit -> see some speedup
sepcific -> specific
witout -> without
RISCV -> RISC-V
i86 -> x86?
hwy::SSE2, hwy::AVX2 -> FWIW they are actually called N_SSE2, N_AVX2.
HWY_NAMEPSACE -> HWY_NAMESPACE (2x)
lat's -> let's
oreach_target -> foreach_target (2x)
paylod -> payload
is it's -> if it's

Personally I'd stop before the SIMD_REGISTER step - some people like to minimize macros, and it might be useful to
see the declarations. The compiler will anyway complain if there is a mismatch.

libraries come and go or change license, so an exit strategy is a good idea

FYI Highway will remain open source. Google has a policy of not deleting open source projects.

and a lot of cmake scaffolding

Want to add a link to the code?

We'd welcome adding this to the Highway g3doc. Or would you prefer a link in the readme?
Consider also posting this to Hacker News and/or Reddit r/simd?

@kfjahnke
Author

I wouldn't worry too much about this. Intel has warned for 10+ years that CPUID info might become heterogeneous, but no one bothers to check. The most common case is where someone disables targets at runtime using a flag. This happens early on in main(), and users can arrange to call get_dispatch after that, so no problem.

That's a relief. I did suspect this was a no-go - it would be just too disruptive and make people even less likely to invest in coding with SIMD.

Some typos:
...

Fixed, thanks @jan-wassenberg!

Excellent article, thanks for putting this together!
It builds up nicely and makes each step clear.

I'm glad you approve of my tedious repetitive technical outpourings :-)

Personally I'd stop before the SIMD_REGISTER step - some people like to minimize macros, and it might be useful to
see the declarations. The compiler will anyway complain if there is a mismatch.

I'll think about it. I used this to good effect in lux. But of course it's an extra frill which isn't strictly necessary - maybe I'll reduce it to a hint that it can be done, rather than 'going all the way' in the example code. You're right that less may be more, and it should really be about transporting the concept, to advance SIMD acceptance.

Want to add a link to the code?

Do you mean the code in lux? lux's single cmake file is here, and the code about 'flavours' starts in line 318, as of this writing. You can see that's quite a mouthful. I'll be glad to get rid of all this cmake code once I have moved lux to use zimt with automatic ISA dispatching. If you mean the code for the article, it's here.

We'd welcome adding this to the Highway g3doc. Or would you prefer a link in the readme?

Thanks for the offer. I think a link in the README would be more appropriate for now - I feel my text doesn't really qualify as documentation, it's more in style with a blog post. But of course you can quote me if you like.

I do have another concrete question. In my example code, I haven't used HWY_PREFIX, HWY_BEFORE_NAMESPACE or HWY_AFTER_NAMESPACE. While the code seems to run okay on my machine, this may be problematic elsewhere. Can you shed some light on these macros? If they are necessary, I'd like to add them to my code.

copybara-service bot pushed a commit that referenced this issue Nov 1, 2024
PiperOrigin-RevId: 692102588
@jan-wassenberg
Member

maybe I'll reduce it to a hint that it can be done, rather than 'going all the way' in the example code.

Sounds good!

You're right that less may be more, and it should really be about transporting the concept, to advance SIMD acceptance.

👍

Do you mean to the code in lux?

Yes, so readers can see the complexity there.

Thanks for the offer. I think a link in the README would be more appropriate for now - I feel my text doesn't really qualify as documentation, it's more in style with a blog post.

We'll add a link for now, but no worries, I would not be shy about calling this documentation, there is certainly a place for an introduction.

I do have another concrete question. In my example code, I haven't used HWY_PREFIX, HWY_BEFORE_NAMESPACE or HWY_AFTER_NAMESPACE. While the code seems to run okay on my machine, this may be problematic elsewhere. Can you shed some light on these macros?

Do you mean HWY_ATTR? These are definitely necessary; they are the mechanism by which pragma target is applied. Without them, you might only get baseline SIMD code. From the README:

Additionally, each function that calls Highway ops (such as `Load`) must either
be prefixed with `HWY_ATTR`, OR reside between `HWY_BEFORE_NAMESPACE()` and
`HWY_AFTER_NAMESPACE()`. Lambda functions currently require `HWY_ATTR` before
their opening brace.

copybara-service bot pushed a commit that referenced this issue Nov 1, 2024
PiperOrigin-RevId: 692107963
@kfjahnke
Author

I've gone over the text some more and just committed an updated version. On this occasion, I had a look at the link you put in your README. Thanks for placing it so prominently! Work on zimt's multi_isa branch is continuing, I have a good example set up and documented in the text. Two observations:

  • the binaries I get with clang++ are (much) faster than what g++ produces with my compiler flags.
  • when the highway 'flavour' dispatches to AVX2, that's slower on my machine than SSE2 or SSSE3.

The latter is puzzling - the code is compute-heavy (repeated 1M evaluations of a 2D b-spline), so I would have thought that using AVX2 should speed things up. I haven't looked at the machine code yet to see if I've maybe made a mistake and my dispatch isn't working properly. Have you seen this happen in your tests?

Running b-spline code with zimt makes for good benchmarking code. Doing stuff like evaluating a b-spline of multi-channel data at 1M 2D coordinates and writing the results to memory is a 'realistic' workload testing memory access, de/interleaving, gather/scatter and raw arithmetic speed due to many additions and multiplications. The addition of b-spline evaluation code to zimt is quite recent and wraps up my porting effort from the vspline library to zimt.

@jan-wassenberg
Member

:)
I agree with your observation that clang code is better than gcc's, have often seen the same.

AVX2 being slower is surprising. Possible causes could be that the memory is only 16-byte aligned, or heavy use of shuffles which are more expensive (3-cycle latency) when crossing 128-bit blocks, whereas they are still single-cycle on SSSE3.
If it's possible to get your code on Godbolt, that's also a good way to quickly look at what's happening.

@kfjahnke
Author

I agree with your observation that clang code is better than gcc's, have often seen the same.

It's a mixed picture I get. At times, with specific compiler flags, back-end and workload, g++ can produce a binary which outperforms everything else. But it doesn't do so consistently, and I usually see clang++ coming out on top. That's why I prefer it - and because its error messages are more human-friendly.

I think I've figured out why the better targets didn't run faster: I only optimized with -O2 and used a large-ish spline degree. I tried this morning with -O3 and cubic splines and got the expected performance increase going from SSE2 to AVX2 (I don't have a machine with AVX3). This also brought the results for tests compiled with clang++ and g++ closer together. With larger spline degrees the 'better' ISAs tend to perform worse, and I don't have a good idea why this would be.

AVX2 being slower is surprising. Possible causes could be that the memory is only 16-byte aligned, or heavy use of shuffles which are more expensive (3-cycle latency) when crossing 128-bit blocks, whereas they are still single-cycle on SSSE3.

That's a good hint - my test code uses splines of three-channel xel data, to mimic a typical image processing workload. Such xel data need to be de/interleaved to/from SoA configuration for SIMD processing, which likely uses shuffles.

There is one thing I notice with my back-ends which might merit a separate 'excursion issue': zimt uses C++ classes to 'harness' the 'naked' vectors. This goes beyond simply wrapping single vectors in objects - the objects contain several vectors (or their equivalent in memory), e.g. two or four. I find that this significantly influences performance, and here I have a good guess at what happens: I think that the resulting machine code, which performs several equal operations in sequence, hides the latency of setting up the SIMD instructions and/or makes it possible for the CPU to use pipelining more effectively. Using this technique, I do get performance gains, and I've been using it for years now to good effect in lux. It may help to squeeze even more performance out of a given CPU. Give it a shot - using zimt, or simply by processing small batches of vectors rather than individual ones.

@jan-wassenberg
Member

de/interleaved to/from SoA configuration for SIMD processing which is likely using shuffles.

Yes, Load/StoreInterleaved3 does involve quite a few shuffles. Might be worth considering using 4 channels just for the faster interleaving :)

I agree unrolling is often helpful. One concern about storing vectors in classes is that it's harder to guarantee alignment.

mazimkhan pushed a commit to cambridgeconsultants/aeroway that referenced this issue Nov 18, 2024
mazimkhan added a commit to cambridgeconsultants/aeroway that referenced this issue Nov 18, 2024
mazimkhan added a commit to cambridgeconsultants/aeroway that referenced this issue Nov 18, 2024
@kfjahnke
Author

Hi again!
I've converted the zimt code base to become compatible with highway's foreach_target mechanism and merged the multi_isa branch back to main. The examples now all work with the modified headers, and there are some modified examples which can use internal multi-SIMD-ISA dispatch:

  • bsp_eval.cc repeatedly evaluates a 1M knot-point b-spline
  • convolve.cc applies a convolution kernel to an image (uses vigra)
  • linspace.cc uses zimt's linspace_t and zimt::process

I modified the examples.sh shell script, which compiles all examples, so that example files which use foreach_target and zimt's dispatch mechanism are compiled in an 'incarnation' using dynamic dispatch (once with clang++ and once with g++), while other examples, which still rely on picking a specific ISA by passing appropriate compiler arguments, are only compiled with the four zimt back-ends. If you have all back-ends installed, you can simply run
./examples.sh *.cc
in the examples folder, and get a large collection of binaries to play with. Some of them time the payload code, and it's interesting to see how the execution times vary with the compiler, back-end and SIMD ISA. I was surprised to see (on my AVX2 machine) that my Vc back-end does at time outperform all other variants, with the highway back-end performing almost as well. I'm not indicating that Vc is 'better' than highway - it's probably due to my implementation of the zimt back-ends which was derived from using Vc::SimdArray, so the Vc back-end does need little extra code for it's implementation, whereas employing highway for the task required quite some coding. The std::simd back-end performs quite well for some of the examples, while SIMD emulation with 'goading' usually comes out last.
I'd be interested in investigating why bsp_eval.cc in particular comes out fastest when using the Vc back-end. My implementation of b-spline evaluation uses a recursive function at it's core (to make the code dimension-agnostic) which uses (vectorized) data on the stack and return values. This core function seems to be what's slower with my highway back-end, data gathering, weight generation and data storing coming out roughly equal. I'd be grateful for any hints to tune the code - maybe these hints already ring a bell!

@jan-wassenberg
Member

hm, hard to say - there's a lot of code. It can be that unrolling or branch prediction differs depending on code structure. Performance counters might be useful to narrow down where the difference lies.

@kfjahnke
Author

It turned out that some of my code was not placed correctly into the ISA-specific nested namespaces, which resulted in sub-optimal performance. I've now managed to get the zimt library to fully cooperate with highway's foreach_target mechanism, and dynamically-dispatched versions run just as fast as single-ISA compiles.
With this success, I moved on to try and use the new zimt code with a 'real' application. I have a project going on which relies entirely on zimt for mass data processing: the envutil program. That's a utility to do geometrical reprojections of images, as they are needed e.g. in panorama viewing. I also use it as a test-bed for different interpolation methods, and to compare them to the ready-made methods coming with OpenImageIO - envutil can use either. Following the VFT dispatch method outlined in the examples in zimt I managed to get envutil to use highway's foreach_target mechanism, resulting in a binary with dynamic dispatch. I've put a debian package up for download if anyone is interested in trying it out - this package is for debian12 on x86_64 and may not install on other linuxes. Building envutil from source isn't hard, either - it's a standard cmake build and just requires a few dependencies.
So far, everything seems to work just as expected - I haven't been able to test on ISAs better than AVX2, though, because I don't have the hardware.
Since I use a scheme which can either exploit highway's foreach_target (for both the highway and 'goading' back-ends) or build as a single-ISA build for all zimt back-ends (goading, highway, Vc or std::simd) there's a variety of binary variants to play with - and different compilers do make a difference as well - as we discussed earlier, clang++ seems to come out on top for this kind of code. The cmake build is easy to configure for all of these variants. The debian package is built with the highway back-end.
I managed to switch to foreach_target-managed code with little modification of the code base - one pitfall was #includes happening somewhere inside the code, which had to be moved out of the nested namespaces.
One thing I noticed is that it seems to be problematic to instantiate templates from 'outside' - e.g. originating from headers of other template libraries - inside the foreach_target-managed code. It looks as if the instantiations only use some lowest-common-denominator SIMD ISA and do not compile according to the #pragma directives set up by HWY_BEFORE_NAMESPACE(). I think you mentioned something about dispatching to templated code, but I can't find it just now - it's probably addressing the same issue. I noticed it when I tried to use Imath's quaternion code with simdized data - replacing the use of Imath's template with equivalent 'handwritten' code brought the dynamically-dispatched code up to the same speed as a single-ISA build, where the significant discrepancy had me puzzled for a while.

@jan-wassenberg
Member

Thanks for the updates, and congrats on the result that dynamic == single ISA speed :)
For the template, I think you are referring to HWY_EXPORT_AND_DYNAMIC_DISPATCH_T?

@kfjahnke
Author

I think you are referring to HWY_EXPORT_AND_DYNAMIC_DISPATCH_T

Yes, that was it, thanks for the pointer!
