x86: Implement fma intrinsic #21118

BradleyWood · 2025-02-12T06:06:49Z

This PR implements Math.fma intrinsics on x86 with vfmadd instructions.

BradleyWood · 2025-02-12T06:15:18Z

Requires fma, extensions.

Here are the performance numbers:

| Benchmark | Compared to Baseline | Compared to Hotspot |
|---|---|---|
| FMABench.benchFloatFMA | 552x | 1.19x |
| FMABench.benchDoubleFMA | 1482x | 1.16x |

BradleyWood · 2025-02-12T06:47:33Z

FYI @0xdaryl, @JamesKingdon

@hzongaro Would you mind reviewing?

JamesKingdon · 2025-02-12T14:08:20Z

Thanks Brad, what a result!

hzongaro

Thanks for pulling this together so quickly! I just have a few comments and questions.

runtime/compiler/x/codegen/J9TreeEvaluator.cpp

runtime/compiler/x/codegen/J9TreeEvaluator.hpp

runtime/compiler/x/codegen/J9CodeGenerator.cpp

runtime/compiler/x/codegen/J9TreeEvaluator.cpp

Signed-off-by: Bradley Wood <[email protected]>

BradleyWood · 2025-02-14T20:59:27Z

See the force-push

hzongaro

I think the changes look good. I will wait for @0xdaryl to review before running tests.

0xdaryl

I think the logic behind choosing the best instruction format to use is sound. I just have a few issues with the approach behind that.

0xdaryl · 2025-02-21T14:02:39Z

runtime/compiler/x/codegen/J9TreeEvaluator.cpp

+         {
+         // b * c + a
+         TR::MemoryReference *rhsMR = generateX86MemoryReference(thirdChild, cg);
+         generateRegMemInstruction(TR::InstOpCode::MOVSRegMem(is64Bit), node, result, rhsMR, cg);


These two lines are equivalent to: result = cg->evaluate(thirdChild)

0xdaryl · 2025-02-21T14:12:15Z

runtime/compiler/x/codegen/J9TreeEvaluator.cpp

+         TR::MemoryReference *midMR = generateX86MemoryReference(secondChild, cg);
+         rhsReg = cg->evaluate(thirdChild);
+
+         generateRegMemInstruction(TR::InstOpCode::MOVSRegMem(is64Bit), node, result, lhsMR, cg);


This is essentially equivalent to: result = cg->evaluate(firstChild) had you not evaluated the lhs into a memref earlier (and prematurely).

0xdaryl · 2025-02-21T14:15:42Z

runtime/compiler/x/codegen/J9TreeEvaluator.cpp

+   // Choose fma instruction carefully, based on operand form, to reduce number of copies
+   if (memLoadLhs)
+      {
+      TR::InstOpCode::Mnemonic opcode = is64Bit ? TR::InstOpCode::VFMADD231SDRegRegMem : TR::InstOpCode::VFMADD231SSRegRegMem;


This should move to the else block below where this particular opcode (231) is actually employed.

BradleyWood · 2025-02-21

It's used in 2 of the 3 branches.

0xdaryl · 2025-02-25

Oh, I missed the use in the first branch.

0xdaryl · 2025-02-21T14:19:38Z

runtime/compiler/x/codegen/J9TreeEvaluator.cpp

+         {
+         // fma = b * c + a
+         midReg = cg->evaluate(secondChild);
+         rhsReg = cg->evaluate(thirdChild);


Allocate the result register here, or clobberEvaluate the third child.

0xdaryl · 2025-02-21T14:54:17Z

runtime/compiler/x/codegen/J9TreeEvaluator.cpp

+
+      if (memLoadRhs)
+         {
+         // b * c + a


This comment (and the ones like it below) seems to suggest that the FMA operation itself changes, which is confusing to a reader. Math.fma() must always compute a * b + c. What changes in your analysis is the order in which these variables appear as instruction operands. Maybe something like this would be clearer:
a (3) * b (1) + c (2), or even a (3M) * b (1) + c (2), where M means memref.
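To make the operand-numbering notation concrete, here is a minimal sketch that models the three x86 FMA operand forms as plain C++ functions and shows that each one can compute the same a * b + c. The function names (vfmadd132, fmaVia132, and so on) are illustrative only and are not part of the PR or of the compiler; the operand semantics are those documented for the VFMADD132/213/231 instructions, where operand 1 is the destination and only operand 3 may be a memory reference.

```cpp
#include <cmath>

// Scalar models of the three FMA operand forms (hypothetical helper names).
// op1 is the destination register; in the real instructions only op3 may be
// a memory operand.
double vfmadd132(double op1, double op2, double op3) { return std::fma(op1, op3, op2); }
double vfmadd213(double op1, double op2, double op3) { return std::fma(op2, op1, op3); }
double vfmadd231(double op1, double op2, double op3) { return std::fma(op2, op3, op1); }

// Math.fma(a, b, c) is always a * b + c; only the operand slots differ.
double fmaVia132(double a, double b, double c) { return vfmadd132(b, c, a); } // a (3) * b (1) + c (2)
double fmaVia213(double a, double b, double c) { return vfmadd213(a, b, c); } // a (1) * b (2) + c (3)
double fmaVia231(double a, double b, double c) { return vfmadd231(c, a, b); } // a (2) * b (3) + c (1)
```

Read this way, choosing a form is just a question of which value should be overwritten (slot 1) and which value, if any, is still in memory (slot 3).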

0xdaryl · 2025-02-21T15:15:34Z

runtime/compiler/x/codegen/J9TreeEvaluator.cpp

+   TR::Register *lhsReg = NULL;
+   TR::Register *midReg = NULL;
+   TR::Register *rhsReg = NULL;
+   TR::Register *result = cg->allocateRegister(TR_FPR);


A general comment on your approach. I think you are performing some operations manually below that are more naturally (and idiomatically) handled by other functions. You allocate a result register here and explicitly manage its updates, but this isn't required in a number of cases (if not all of them).

There are situations below where you create a memory reference for a node and manually copy into this register. Simply evaluating the node will produce resultReg naturally.

In the situations where you can't re-use a register, rather than managing the copy yourself you should consider floatClobberEvaluate or doubleClobberEvaluate on the node to evaluate and return a copy if necessary. The copy instruction used in those functions may need to be reconsidered (MOVAPS/D) as it may not be the best on today's architectures.

There may be legitimate reasons for manually copying into a register, but I would like to think those cases are few.

To that end, I don't think you need to allocate the result register upfront; you can instead use the natural mechanisms in the codegen for producing it.

BradleyWood · 2025-02-21

Unfortunately that isn't so simple. You can't reuse the register from a node evaluation as the result register, especially when the reference count is 1. Clobber evaluation is the same as regular evaluation when the reference count is 1, meaning that it simply returns the node's register after evaluation. When you go to decrement the reference counts for each node, its register is marked as dead if its reference count is 1. This means the result register will not be live.

So in cases where we load a node from memory directly into the result register, we cannot replace that with node evaluation. The node's reference count is guaranteed to be 1.
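The liveness concern described above can be sketched with a toy model. This is only an illustration of the behaviour as stated in this comment, under simplifying assumptions: Register, Node, evaluate, clobberEvaluate, and decReferenceCount here are hypothetical stand-ins written for this example, not the real compiler classes or their actual implementations.

```cpp
#include <memory>

// Toy stand-ins for illustration only; NOT the real codegen classes.
struct Register { bool live = true; };

struct Node {
    int refCount;
    std::shared_ptr<Register> reg;   // set once the node has been evaluated
};

// Regular evaluation: the node's value ends up in the node's own register.
std::shared_ptr<Register> evaluate(Node &n) {
    if (!n.reg) n.reg = std::make_shared<Register>();
    return n.reg;
}

// Clobber evaluation: return a register the caller may overwrite.
// With a single reference no copy is made, so the node's register is returned.
std::shared_ptr<Register> clobberEvaluate(Node &n) {
    auto r = evaluate(n);
    if (n.refCount > 1) return std::make_shared<Register>(); // models a copy into a fresh register
    return r;
}

// Releasing the last reference marks the node's register dead.
void decReferenceCount(Node &n) {
    if (--n.refCount == 0 && n.reg) n.reg->live = false;
}
```

In this model, a register obtained from a node whose reference count is 1 is dead as soon as that count is decremented, whereas a register allocated separately (or copied because the count was higher) is unaffected.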

BradleyWood · 2025-02-24

@0xdaryl Did you have a chance to read my comment?

0xdaryl · 2025-02-25

OK, it's unfortunate you have to manage these copies manually, but you're right that setting the result reg on the call node would be a problem for clobber evaluation (which is intended for scratch usage where the reg won't escape).

BradleyWood force-pushed the fma branch 4 times, most recently from ff7bb16 to 229e619, February 12, 2025 06:34

hzongaro self-requested a review February 12, 2025 15:48

hzongaro self-assigned this Feb 12, 2025

hzongaro reviewed Feb 14, 2025


x86: Implement fma intrinsic

744b61b

Signed-off-by: Bradley Wood <[email protected]>

BradleyWood force-pushed the fma branch from 229e619 to 744b61b, February 14, 2025 20:57

hzongaro approved these changes Feb 20, 2025


0xdaryl requested changes Feb 21, 2025
