
Optimise logic in ReorderBuffer::commitMicroOps #366

Closed · wants to merge 3 commits

Conversation

rahahahat (Contributor):

No description provided.

}
if (!validForCommit) return;
size_t index = 0;
uint8_t ins_cnt = 0;

rahahahat (Contributor, Author) commented on Dec 22, 2023:

@jj16791 Can the number of micro-ops generated per macro-op exceed 256?

Contributor:

I think this is the number of micro-ops for a single macro-op rather than total micro-ops in the ROB.

In such a case, the max number of micro-ops generated per macro-op is implementation defined. The only candidates I can think of among currently supportable instructions are the AArch64 gather/scatter loads, IF the vector length is 2048 bits and you are doing a byte load (i.e. loading 256 byte elements).

However, checking the spec, the gather/scatter instructions which load/store single bytes only permit the source/destination register to be of the .S or .D variant, meaning that there would be a max of 64 or 32 elements per vector respectively to scatter or gather.

In summary - no, the number of micro-ops per macro-op is very very unlikely to exceed 256.
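
To make the arithmetic explicit: the element count is just the vector length divided by the element width, so even the widest case above stays far below 256. A minimal illustration of the numbers quoted (illustrative values only, not SimEng code):

    // Worked example, assuming SVE's maximum 2048-bit vector length.
    constexpr unsigned vlBits = 2048;
    constexpr unsigned elemsS = vlBits / 32;  // .S (32-bit) elements: 64
    constexpr unsigned elemsD = vlBits / 64;  // .D (64-bit) elements: 32
    static_assert(elemsS < 256 && elemsD < 256, "fits in a uint8_t counter");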

rahahahat (Contributor, Author) commented on Dec 22, 2023:

Sorry, I should've specified: I did mean the number of micro-ops per macro-op.

> In summary - no, the number of micro-ops per macro-op is very very unlikely to exceed 256.

When you say very very unlikely, that means it could exceed 256, right? If that's the case, the data type above needs to be widened, otherwise the logic is broken.
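
For reference, a minimal standalone demonstration of the wraparound being worried about here (a hypothetical example, not SimEng code):

    #include <cstdint>
    #include <cstdio>

    int main() {
      uint8_t count = 255;         // the largest value a uint8_t can hold
      count++;                     // unsigned overflow is well-defined: wraps to 0
      std::printf("%d\n", count);  // prints 0; a macro-op with 257 uops would count as 1
      return 0;
    }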

Contributor:

Currently it could never exceed 256, but in the future there is an extremely small chance it could. For now I would keep it the same.

Contributor:

After a full review, there would be no harm in making this a uint16 to "future proof" this.

FinnWilkinson (Contributor):

Please add a description, assignee, and labels, and ensure the status on the associated GitHub project is correct.

@rahahahat rahahahat self-assigned this Dec 22, 2023
@rahahahat rahahahat added the performance Performance optimisation label Dec 22, 2023

rahahahat (Contributor, Author):

> Please add a description, assignee, and labels, and ensure the status on the associated GitHub project is correct.

I've added labels and an assignee. I don't think the description is necessary, as the title communicates everything that is done; I've just changed one function.

FinnWilkinson (Contributor) left a review:

All looks fine, just a few comments made.

Could you add to the description of this PR the performance improvement seen on your local system and on the Mac Studio (in Release mode)? Looking at the Jenkins performance pipeline results there doesn't seem to be much improvement (with the exception of triad_gcc_a64fx). There are also 2 regressions...

@@ -2,6 +2,7 @@
.vscode
.idea
.DS_Store
.cache

Contributor:

This is duplicated from PR #353; please remove it from this PR or close the other PR.

}
if (!validForCommit) return;
size_t index = 0;
uint8_t ins_cnt = 0;

Contributor:

To avoid confusion akin to the other comment left on this variable, perhaps rename this to uopCount or microOpCount?
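
A possible rendering of this suggestion, folding in the uint16 widening proposed earlier in the thread (a sketch of the suggested change, not the code as merged):

    size_t index = 0;
    uint16_t uopCount = 0;  // micro-ops seen so far for the target macro-op;
                            // 16 bits leaves headroom beyond 255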

if (mop_id == insnId) {
if (!uop->isWaitingCommit()) return;
ins_cnt++;
} else if (ins_cnt && mop_id != insnId) {

Contributor:

The mop_id != insnId check in this else if seems redundant, given that this branch can only be reached when that condition is true.
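
That is, the branch could reduce to checking the counter alone (a sketch of the suggested simplification, mirroring the hunk above):

    if (mop_id == insnId) {
      if (!uop->isWaitingCommit()) return;
      ins_cnt++;
    } else if (ins_cnt) {  // mop_id != insnId is implied: the branch above failed
      break;
    }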

Comment on lines +44 to +62
while (index < bsize) {
auto& uop = buffer_[index];
uint64_t mop_id = uop->getInstructionId();

if (mop_id == insnId) {
if (!uop->isWaitingCommit()) return;
ins_cnt++;
} else if (ins_cnt && mop_id != insnId) {
break;
}
index++;
}

index = index - ins_cnt;
for (int x = 0; x < ins_cnt; x++) {
buffer_[index]->setCommitReady();
index++;
}

Contributor:

Generally looks good. I think some comments akin to the previous implementation, plus one explaining that the setCommitReady logic is only reached iff all micro-ops of a macro-op are waitingCommit, would be beneficial.
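
One possible commenting of the new loop along the lines requested (a sketch only; the contiguity assumption is inferred from the code's structure rather than stated in this thread):

    // Scan the ROB for the micro-ops of macro-op `insnId`. Micro-ops of a
    // macro-op occupy contiguous ROB slots, so the group ends at the first
    // non-matching entry after a match.
    while (index < bsize) {
      auto& uop = buffer_[index];
      uint64_t mop_id = uop->getInstructionId();

      if (mop_id == insnId) {
        // If any micro-op of the group is not yet waiting to commit, the
        // macro-op cannot commit as a whole; leave everything untouched.
        if (!uop->isWaitingCommit()) return;
        ins_cnt++;
      } else if (ins_cnt) {
        break;  // passed the end of the contiguous group
      }
      index++;
    }

    // Only reached if all ins_cnt micro-ops are waitingCommit: rewind to the
    // start of the group and flag each micro-op as ready to commit.
    index = index - ins_cnt;
    for (int x = 0; x < ins_cnt; x++) {
      buffer_[index]->setCommitReady();
      index++;
    }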

}

index = index - ins_cnt;
for (int x = 0; x < ins_cnt; x++) {

Contributor:

Could x be given the same type as ins_cnt?
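
i.e., something along these lines (a sketch; uint16_t instead if ins_cnt is widened as suggested above):

    for (uint8_t x = 0; x < ins_cnt; x++) {
      buffer_[index]->setCommitReady();
      index++;
    }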

rahahahat (Contributor, Author):

> All looks fine, just a few comments made.
>
> Could you add to the description of this PR the performance improvement seen on your local system and on the Mac Studio (in Release mode)? Looking at the Jenkins performance pipeline results there doesn't seem to be much improvement (with the exception of triad_gcc_a64fx). There are also 2 regressions...

Hmm, that seems strange. My access to the Mac Studio is a bit messed up given my laptop broke; let me see what I can do to address this. It might be compiler dependent, though I will investigate.
Thanks.

dANW34V3R (Contributor) left a review:

This seems like a very "minimal gains" optimisation, i.e. no fundamental algorithmic change, rather fewer repeated array loads/getter calls in the high-level language (which may not happen once compiled). I would need to see actual performance gains before approval. Some comments explaining what is going on are also needed (it took me a bit of time to work it all through and understand).

jj16791 (Contributor) commented Jan 30, 2024:

#rerun tests

FinnWilkinson (Contributor):

Closing as stale / not planned

Labels: performance (Performance optimisation)
Project status: Done
4 participants