Labels: backend:X86, Scheduler Models (Accuracy of X86 scheduler models), tools:llvm-mca
Description
Take the small snippet:
incq %r15
addq $0x4, %r13
cmpq $0x3f, %r15
Running this through MCA on skylake/skylake-avx512 produces the following:
Iterations: 100
Instructions: 300
Total Cycles: 104
Total uOps: 300
Dispatch Width: 6
uOps Per Cycle: 2.88
IPC: 2.88
Block RThroughput: 0.8
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.25                        incq  %r15
 1      1     0.25                        addq  $4, %r13
 1      1     0.25                        cmpq  $63, %r15
Resources:
[0] - SKXDivider
[1] - SKXFPDivider
[2] - SKXPort0
[3] - SKXPort1
[4] - SKXPort2
[5] - SKXPort3
[6] - SKXPort4
[7] - SKXPort5
[8] - SKXPort6
[9] - SKXPort7
Resource pressure per iteration:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
- - 0.75 0.75 - - - 0.75 0.75 -
Resource pressure by instruction:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9] Instructions:
- - 0.24 0.25 - - - 0.26 0.25 - incq %r15
- - 0.25 0.25 - - - 0.25 0.25 - addq $4, %r13
- - 0.26 0.25 - - - 0.24 0.25 - cmpq $63, %r15
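The summary numbers above are internally consistent; the following is a quick sanity check of the arithmetic (a sketch, not llvm-mca internals):

```python
# Sanity-check the llvm-mca summary. All three instructions are
# single-uop ALU ops, so IPC and uOps-per-cycle coincide.
instructions = 300
total_cycles = 104
print(round(instructions / total_cycles, 2))   # 2.88, matching IPC

# Each iteration's 3 ALU uops spread evenly across the 4 SKX integer
# ALU ports (ports 0, 1, 5, 6), giving the 0.75 pressure per port.
uops_per_iteration = 3
alu_ports = 4
print(uops_per_iteration / alu_ports)          # 0.75
```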
However, running this within llvm-exegesis (llvm-exegesis -snippets-file=/tmp/test.s --mode=latency) produces the following:
---
mode: latency
key:
instructions:
- 'INC64r R15 R15'
- 'ADD64ri8 R13 R13 i_0x4'
- 'CMP64ri8 R15 i_0x3f'
config: ''
register_initial_values:
- 'R15=0x123456'
- 'R13=0x123456'
cpu_name: skylake-avx512
llvm_triple: x86_64-pc-linux-gnu
min_instructions: 10000
measurements:
- { key: latency, value: 0.4234, per_snippet_value: 1.26995, validation_counters: {} }
error: ''
info: ''
assembled_snippet: 4157415549BF563412000000000049BD563412000000000049FFC74983C5044983FF3F49FFC74983C5044983FF3F49FFC74983C5044983FF3F49FFC74983C5044983FF3F415D415FC3
...
The predicted throughput from llvm-mca is almost 40% less than the experimental value. UICA seems to agree with the experimental value, predicting 1.25 cycles/iteration as the reciprocal throughput.
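The "almost 40%" figure follows from comparing llvm-mca's Block RThroughput (0.8 cycles/iteration) against the measured per-snippet value, assuming those are the two numbers being compared:

```python
mca_rthroughput = 0.8      # Block RThroughput from llvm-mca above
measured = 1.26995         # per_snippet_value from llvm-exegesis
uica = 1.25                # UICA's predicted cycles/iteration

# Relative underestimate by llvm-mca versus the measurement.
gap = (measured - mca_rthroughput) / measured
print(f"llvm-mca underestimates by {gap:.0%}")       # 37%, i.e. "almost 40%"
print(f"UICA vs. measurement: {abs(uica - measured):.3f} cycles")
```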
llvmbot commented on Jul 17, 2024
@llvm/issue-subscribers-tools-llvm-mca
Author: Aiden Grossman (boomanaiden154)
topperc commented on Jul 17, 2024
MCA is reporting iterations/cycle of nearly 1.0 right? 100 iterations / 104 cycles.
Naively that seems right to me.
The incq on each iteration is dependent on the previous one
The addq on each iteration is dependent on the previous one.
The cmpq depends on the inc. Nothing depends on the cmpq.
There are 4 ALUs available to each operation.
On the first cycle the ALUs can do one incq and one addq.
On the second cycle the ALUs can do the incq and addq from the second iteration and the cmpq from the first iteration.
On the third cycle the ALUs can do the incq and addq from the third iteration and the cmpq from the second iteration.
etc.
...
At the very end we need to do one cmpq by itself.
The iterations per cycle for that should be very nearly 1.
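The cycle-by-cycle argument above can be sketched with a toy greedy scheduler (a hypothetical model with 4 ALU slots per cycle, 1-cycle latency, and true dependencies only; it ignores front-end and port-assignment effects):

```python
from collections import Counter

def cycles_for(iterations, num_alus=4, latency=1):
    """Greedy model: each uop issues at the earliest cycle where its
    operands are ready and one of the ALU slots is still free."""
    slots = Counter()          # number of uops issued in each cycle
    r15_ready = r13_ready = 0  # cycle at which each register value is available

    def issue(earliest):
        c = earliest
        while slots[c] >= num_alus:
            c += 1
        slots[c] += 1
        return c + latency     # cycle when the result becomes available

    last_done = 0
    for _ in range(iterations):
        r15_ready = issue(r15_ready)                  # incq %r15
        r13_ready = issue(r13_ready)                  # addq $4, %r13
        last_done = max(last_done, issue(r15_ready))  # cmpq reads the new %r15
    return last_done

print(cycles_for(100))  # 101 cycles for 100 iterations: ~1 iteration/cycle
```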
boomanaiden154 commented on Jul 17, 2024
Yes. I misunderstood what the reciprocal throughput field was representing.
Right. That all makes sense to me. Looking at all the scheduling information for these instructions, they seem correct to me.
CMP64ri8 seems to use the default scheduling class (https://gist.github.com/boomanaiden154/6417e88d67a0facf7995447be74cf7bc), which seems odd to me, but other than that, everything looks good. However, the benchmark clearly shows 1.25 cycles/iteration, and UICA supports that. I still haven't figured out why UICA is reporting numbers that are so different.
boomanaiden154 commented on Jul 18, 2024
I spoke with Andreas Abel about this issue, and the main bottleneck is non-optimal port assignment by the renamer. Looking at the UICA trace, there is an instruction about every iteration that is waiting to be dispatched as it gets assigned to the same port as another uop and their dispatch cycle would otherwise overlap. This bumps the reciprocal throughput up to 1.25 cycles per iteration.
Given llvm-mca only models instruction dispatch rather than predecode/uop issue, I don't think this is a trivial issue to fix.