Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x86: Implement fma intrinsic #21118

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

BradleyWood
Copy link
Member

@BradleyWood BradleyWood commented Feb 12, 2025

This PR implements Math.fma intrinsics on x86 with vfmadd instructions.

Issue: #7474

@BradleyWood
Copy link
Member Author

Requires fma, extensions.

Here are performance numbers,

Benchmark Compared to Baseline Compared to Hotspot
FMABench.benchFloatFMA 552x 1.19x
FMABench.benchDoubleFMA 1482x 1.16x

@BradleyWood BradleyWood force-pushed the fma branch 4 times, most recently from ff7bb16 to 229e619 Compare February 12, 2025 06:34
@BradleyWood
Copy link
Member Author

FYI @0xdaryl, @JamesKingdon

@hzongaro Would you mind reviewing?

@JamesKingdon
Copy link
Contributor

Thanks Brad, what a result!

@hzongaro hzongaro self-requested a review February 12, 2025 15:48
@hzongaro hzongaro self-assigned this Feb 12, 2025
Copy link
Member

@hzongaro hzongaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pulling this together so quickly! I just have a few comments and questions.

Signed-off-by: Bradley Wood <[email protected]>
@BradleyWood
Copy link
Member Author

See the force-push

Copy link
Member

@hzongaro hzongaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the changes look good. I will wait for @0xdaryl to review before running tests.

Copy link
Contributor

@0xdaryl 0xdaryl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the logic behind choosing the best instruction format to use is sound. I just have a few issues with the approach behind that.

{
// b * c + a
TR::MemoryReference *rhsMR = generateX86MemoryReference(thirdChild, cg);
generateRegMemInstruction(TR::InstOpCode::MOVSRegMem(is64Bit), node, result, rhsMR, cg);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two lines are equivalent to: result = cg->evaluate(thirdChild)

TR::MemoryReference *midMR = generateX86MemoryReference(secondChild, cg);
rhsReg = cg->evaluate(thirdChild);

generateRegMemInstruction(TR::InstOpCode::MOVSRegMem(is64Bit), node, result, lhsMR, cg);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is essentially equivalent to: result = cg->evaluate(firstChild) had you not evaluated the lhs into a memref earlier (and prematurely).

// Choose fma instruction carefully, based on operand form, to reduce number of copies
if (memLoadLhs)
{
TR::InstOpCode::Mnemonic opcode = is64Bit ? TR::InstOpCode::VFMADD231SDRegRegMem : TR::InstOpCode::VFMADD231SSRegRegMem;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should move to the else block below where this particular opcode (231) is actually employed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its used in 2 of the 3 branches

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I missed the use in the first branch.

{
// fma = b * c + a
midReg = cg->evaluate(secondChild);
rhsReg = cg->evaluate(thirdChild);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

allocate result here, or clobberEvaluate the third child


if (memLoadRhs)
{
// b * c + a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment (and the ones like it below) seem to suggest that the FMA operation changes, and is confusing to a reader. Math.fma() must always compute a * b + c. What is changing in your analysis is the order of these variables on the instruction operands. Maybe something like this would be clearer:
a (3) * b (1) + c (2) or even a (3M) * b (1) + c (2) (where M means memref).

TR::Register *lhsReg = NULL;
TR::Register *midReg = NULL;
TR::Register *rhsReg = NULL;
TR::Register *result = cg->allocateRegister(TR_FPR);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A general comment on your approach. I think you are performing some operations manually below that are more naturally (and idiomatically) handled by other functions. You allocate a result register here and explicitly manage its updates, but this isn't required in a number of cases (if not all of them).

There are situations below where you create a memory reference for a node and manually copy into this register. Simply evaluating the node will produce resultReg naturally.

In the situations where you can't re-use a register, rather than managing the copy yourself you should consider floatClobberEvaluate or doubleClobberEvaluate on the node to evaluate and return a copy if necessary. The copy instruction used in those functions may need to be reconsidered (MOVAPS/D) as it may not be the best on today's architectures.

There may be legitimate reasons for manually copying into a register, but I would like to think those cases are few.

To that end, I don't think you need to allocate the result register upfront, and use the natural mechanisms in the codegen for producing it.

Copy link
Member Author

@BradleyWood BradleyWood Feb 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately that isn't so simple. You can't reuse the register from a node evaluation as the result register, especially when the reference count is 1. Clobber evaluation is the same as regular evaluation when the reference count is 1, meaning that, it simply returns the nodes register after evaluation. When you go to decrement the reference counts for each node, its register is marked as dead if its reference count is 1. This means the result register will not be live.

So, if in cases where we try to load a node from memory directly into the result register, we cannot replace that with node evaluation. The nodes reference count is guaranteed to be 1.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@0xdaryl Did you have a chance to read my comment?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, it's unfortunate you have to manage these copies manually but you're right that setting the result reg on the call node would be a problem for clobber evaluation (which is intended for scratch usage where the reg won't escape).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants