-
Notifications
You must be signed in to change notification settings - Fork 738
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
x86: Implement fma intrinsic #21118
base: master
Are you sure you want to change the base?
x86: Implement fma intrinsic #21118
Conversation
Requires fma, extensions. Here are performance numbers,
|
ff7bb16
to
229e619
Compare
FYI @0xdaryl, @JamesKingdon @hzongaro Would you mind reviewing? |
Thanks Brad, what a result! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pulling this together so quickly! I just have a few comments and questions.
Signed-off-by: Bradley Wood <[email protected]>
See the force-push |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the changes look good. I will wait for @0xdaryl to review before running tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the logic behind choosing the best instruction format to use is sound. I just have a few issues with the approach behind that.
{ | ||
// b * c + a | ||
TR::MemoryReference *rhsMR = generateX86MemoryReference(thirdChild, cg); | ||
generateRegMemInstruction(TR::InstOpCode::MOVSRegMem(is64Bit), node, result, rhsMR, cg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two lines are equivalent to: result = cg->evaluate(thirdChild)
TR::MemoryReference *midMR = generateX86MemoryReference(secondChild, cg); | ||
rhsReg = cg->evaluate(thirdChild); | ||
|
||
generateRegMemInstruction(TR::InstOpCode::MOVSRegMem(is64Bit), node, result, lhsMR, cg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is essentially equivalent to: result = cg->evaluate(firstChild)
had you not evaluated the lhs into a memref earlier (and prematurely).
// Choose fma instruction carefully, based on operand form, to reduce number of copies | ||
if (memLoadLhs) | ||
{ | ||
TR::InstOpCode::Mnemonic opcode = is64Bit ? TR::InstOpCode::VFMADD231SDRegRegMem : TR::InstOpCode::VFMADD231SSRegRegMem; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should move to the else
block below where this particular opcode
(231) is actually employed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its used in 2 of the 3 branches
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I missed the use in the first branch.
{ | ||
// fma = b * c + a | ||
midReg = cg->evaluate(secondChild); | ||
rhsReg = cg->evaluate(thirdChild); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
allocate result
here, or clobberEvaluate
the third child
|
||
if (memLoadRhs) | ||
{ | ||
// b * c + a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment (and the ones like it below) seem to suggest that the FMA operation changes, and is confusing to a reader. Math.fma()
must always compute a * b + c
. What is changing in your analysis is the order of these variables on the instruction operands. Maybe something like this would be clearer:
a (3) * b (1) + c (2)
or even a (3M) * b (1) + c (2)
(where M
means memref).
TR::Register *lhsReg = NULL; | ||
TR::Register *midReg = NULL; | ||
TR::Register *rhsReg = NULL; | ||
TR::Register *result = cg->allocateRegister(TR_FPR); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A general comment on your approach. I think you are performing some operations manually below that are more naturally (and idiomatically) handled by other functions. You allocate a result register here and explicitly manage its updates, but this isn't required in a number of cases (if not all of them).
There are situations below where you create a memory reference for a node and manually copy into this register. Simply evaluating the node will produce resultReg naturally.
In the situations where you can't re-use a register, rather than managing the copy yourself you should consider floatClobberEvaluate
or doubleClobberEvaluate
on the node to evaluate and return a copy if necessary. The copy instruction used in those functions may need to be reconsidered (MOVAPS/D
) as it may not be the best on today's architectures.
There may be legitimate reasons for manually copying into a register, but I would like to think those cases are few.
To that end, I don't think you need to allocate the result register upfront, and use the natural mechanisms in the codegen for producing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately that isn't so simple. You can't reuse the register from a node evaluation as the result register, especially when the reference count is 1. Clobber evaluation is the same as regular evaluation when the reference count is 1, meaning that, it simply returns the nodes register after evaluation. When you go to decrement the reference counts for each node, its register is marked as dead if its reference count is 1. This means the result register will not be live.
So, if in cases where we try to load a node from memory directly into the result register, we cannot replace that with node evaluation. The nodes reference count is guaranteed to be 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@0xdaryl Did you have a chance to read my comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, it's unfortunate you have to manage these copies manually but you're right that setting the result reg on the call node would be a problem for clobber evaluation (which is intended for scratch usage where the reg won't escape).
This PR implements Math.fma intrinsics on x86 with vfmadd instructions.
Issue: #7474