Skip to content

Conversation

@ssh4net
Copy link
Contributor

@ssh4net ssh4net commented Jan 7, 2026

Optional SIMD optimizations for selected ImageBufAlgo operations using the Google Highway library: • add/sub
• mul/div
• mad
• resample
Adds CMake and build system support, new implementation helpers, and developer documentation.

Checklist:

  • I have read the guidelines on contributions and code review procedures.
  • I have updated the documentation if my PR adds features or changes
    behavior.
  • I am sure that this PR's changes are tested somewhere in the
    testsuite
    .
  • I have run and passed the testsuite in CI before submitting the
    PR, by pushing the changes to my fork and seeing that the automated CI
    passed there. (Exceptions: If most tests pass and you can't figure out why
    the remaining ones fail, it's ok to submit the PR and ask for help. Or if
    any failures seem entirely unrelated to your change; sometimes things break
    on the GitHub runners.)
  • My code follows the prevailing code style of this project and I
    fixed any problems reported by the clang-format CI test.
  • If I added or modified a public C++ API call, I have also amended the
    corresponding Python bindings. If altering ImageBufAlgo functions, I also
    exposed the new functionality as oiiotool options.

Optional SIMD optimizations for selected ImageBufAlgo operations using the Google Highway library:
• add/sub
• mul/div
• mad
• resample
Adds CMake and build system support, new implementation helpers, and developer documentation.

Signed-off-by: Vlad (Kuzmin) Erium <libalias@gmail.com>
@lgritz
Copy link
Collaborator

lgritz commented Jan 7, 2026

I suspect you used LLM for some of this? Which is fine, but I think you should document in the PR description (commit comment) which tool you used and for what parts.

#endif

#ifdef _MSC_VER
# include <malloc.h> // for alloca
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't make sense to me. We already use alloca extensively without trouble, why did this need to be added now?

Comment on lines +316 to +317
#elif defined(_MSC_VER)
# define OIIO_ALLOCA(type, size) (assert(size < (1<<20)), (size) != 0 ? ((type*)_alloca((size) * sizeof(type))) : nullptr)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? I think this definition is identical to the #else clause below, so the MS case would already be handled properly, right?

Comment on lines +13 to +16
#if defined(_WIN32)
# include <malloc.h> // for alloca
#endif

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why? It always worked before

Comment on lines +127 to +131
template<class Rtype, class Atype, class Btype>
static bool
add_impl_hwy(ImageBuf& R, const ImageBuf& A, const ImageBuf& B, ROI roi,
int nthreads)
{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't done a line-by-line comparison, but it seems to me that the only difference between add_impl_hwy, sub_impl_hwy, and mul_impl_hwy is likely going to be

[](auto d, auto a, auto b) { return hn::Add(a, b); }

versus that one lambda changing for Sub and Mul.

I would love for even the initial commit to reduce this whole thing to a shared hwy_binary_perpixel_op() template that takes the lambda housing the op kernel as a templated parameter.

Comment on lines +155 to +161
// Process pixel by pixel (scalar fallback for strided channels)
for (int x = roi.xbegin; x < roi.xend; ++x) {
Rtype* r_ptr = ChannelPtr<Rtype>(Rv, x, y, roi.chbegin);
const Atype* a_ptr = ChannelPtr<Atype>(Av, x, y,
roi.chbegin);
const Btype* b_ptr = ChannelPtr<Btype>(Bv, x, y,
roi.chbegin);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should benchmark the strided case and see how it compares to the contiguous case and the full scalar fallback that we've always had. If there is no big speed gain, I would be in favor of eliminating this whole clause and let non-contiguous strides use the old scalar path, then there is much less template expansion for hwy in the cases where there is not a large gain to be had. Note that this means that the "to hwy or not to hwy" test would need to test contiguity in addition to just localpixels().

ssh4net and others added 5 commits January 13, 2026 10:47
Signed-off-by: Vlad (Kuzmin) Erium <libalias@gmail.com>
Co-authored-by: Larry Gritz <lg@larrygritz.com>
Signed-off-by: Vlad Erium <shaamaan@gmail.com>
Co-authored-by: Larry Gritz <lg@larrygritz.com>
Signed-off-by: Vlad Erium <shaamaan@gmail.com>
Co-authored-by: Larry Gritz <lg@larrygritz.com>
Signed-off-by: Vlad Erium <shaamaan@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants