Optimise logic in ReorderBuffer::commitMicroOps #366
Conversation
  }
  if (!validForCommit) return;
  size_t index = 0;
  uint8_t ins_cnt = 0;
@jj16791 Can the number of micro-ops generated per macro-op exceed 256?
I think this is the number of micro-ops for a single macro-op rather than the total micro-ops in the ROB.
In that case, the maximum number of micro-ops generated per macro-op is implementation defined. The only candidate I can think of among currently supportable instructions is the AArch64 gather/scatter loads, IF the vector length is 2048 bits and you are doing a byte load (i.e. loading 256 byte elements).
However, checking the spec, the gather/scatter instructions which load/store single bytes only permit the source/destination register to be of the .S or .D variant, meaning there would be a maximum of 64 or 32 elements per vector respectively to scatter or gather.
In summary: no, the number of micro-ops per macro-op is very, very unlikely to exceed 256.
Sorry, I should've specified: I did mean the number of micro-ops per macro-op.
In summary: no, the number of micro-ops per macro-op is very, very unlikely to exceed 256.
When you say very, very unlikely, that means it could exceed 256, right? If that's the case, the data type above needs to be widened, otherwise the logic is broken.
Currently it could never exceed 256, but in the future there is an extremely small chance it could. For now I would keep it the same.
After a full review, there would be no harm in making this a uint16 to "future proof" this.
Please add a description, assignee, and labels, and ensure the status on the associated GitHub project is correct.
I've added labels and an assignee. I don't think the description is necessary as the title communicates everything that is done. I've just changed one function.
All looks fine, just a few comments made.
Could you add to the description of this PR the performance improvement seen on your local system and on the Mac Studio (in Release mode)? Looking at the Jenkins performance pipeline results there doesn't seem to be much improvement (with the exception of triad_gcc_a64fx). There are also 2 regressions...
@@ -2,6 +2,7 @@
.vscode
.idea
.DS_Store
.cache
This is duplicated from PR #353, please remove it from this PR or close the other PR
  }
  if (!validForCommit) return;
  size_t index = 0;
  uint8_t ins_cnt = 0;
To avoid confusion akin to the other comment left on this variable, perhaps rename this to uopCount or microOpCount?
  if (mop_id == insnId) {
    if (!uop->isWaitingCommit()) return;
    ins_cnt++;
  } else if (ins_cnt && mop_id != insnId) {
The mop_id != insnId check in this else if seems redundant, given that this else if can only be reached when it is true.
  while (index < bsize) {
    auto& uop = buffer_[index];
    uint64_t mop_id = uop->getInstructionId();

    if (mop_id == insnId) {
      if (!uop->isWaitingCommit()) return;
      ins_cnt++;
    } else if (ins_cnt && mop_id != insnId) {
      break;
    }
    index++;
  }

  index = index - ins_cnt;
  for (int x = 0; x < ins_cnt; x++) {
    buffer_[index]->setCommitReady();
    index++;
  }
Generally looks good. I think some comments akin to the previous implementation, plus one explaining that the setCommitReady logic is only reached iff all micro-ops of a macro-op are waitingCommit, would be beneficial.
  }

  index = index - ins_cnt;
  for (int x = 0; x < ins_cnt; x++) {
Could x be given the same type as ins_cnt?
Hmm, that seems strange; my access to the Mac Studio is a bit messed up given my laptop broke. Let me see what I can do to address this. It might be compiler dependent, though, but I will investigate.
This seems like a very "minimal gains" optimisation, e.g. no fundamental algorithmic change, rather fewer repeated array loads/getter calls in the high-level language (which may not happen once compiled). I would need to see actual performance gains before approval. Some comments explaining what is going on are also needed (it took me a bit of time to work it all through and understand).
#rerun tests
Closing as stale / not planned
No description provided.