[MCA] Inaccuracy in small snippet #99395

Description

boomanaiden154 (Contributor)

Take the small snippet:

incq %r15
addq $0x4, %r13
cmpq $0x3f, %r15
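
(Assuming the snippet is saved as /tmp/test.s, as in the llvm-exegesis run below, an invocation along these lines with llvm-mca's default of 100 iterations reproduces the analysis:)

llvm-mca -mcpu=skylake-avx512 /tmp/test.s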

Running this through MCA on skylake/skylake-avx512 produces the following:

Iterations:        100
Instructions:      300
Total Cycles:      104
Total uOps:        300

Dispatch Width:    6
uOps Per Cycle:    2.88
IPC:               2.88
Block RThroughput: 0.8


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.25                        incq	%r15
 1      1     0.25                        addq	$4, %r13
 1      1     0.25                        cmpq	$63, %r15


Resources:
[0]   - SKXDivider
[1]   - SKXFPDivider
[2]   - SKXPort0
[3]   - SKXPort1
[4]   - SKXPort2
[5]   - SKXPort3
[6]   - SKXPort4
[7]   - SKXPort5
[8]   - SKXPort6
[9]   - SKXPort7


Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]
 -      -     0.75   0.75    -      -      -     0.75   0.75    -

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    Instructions:
 -      -     0.24   0.25    -      -      -     0.26   0.25    -     incq	%r15
 -      -     0.25   0.25    -      -      -     0.25   0.25    -     addq	$4, %r13
 -      -     0.26   0.25    -      -      -     0.24   0.25    -     cmpq	$63, %r15

However, running this within llvm-exegesis (llvm-exegesis -snippets-file=/tmp/test.s --mode=latency) produces the following:

---
mode:            latency
key:
  instructions:
    - 'INC64r R15 R15'
    - 'ADD64ri8 R13 R13 i_0x4'
    - 'CMP64ri8 R15 i_0x3f'
  config:          ''
  register_initial_values:
    - 'R15=0x123456'
    - 'R13=0x123456'
cpu_name:        skylake-avx512
llvm_triple:     x86_64-pc-linux-gnu
min_instructions: 10000
measurements:
  - { key: latency, value: 0.4234, per_snippet_value: 1.26995, validation_counters: {} }
error:           ''
info:            ''
assembled_snippet: 4157415549BF563412000000000049BD563412000000000049FFC74983C5044983FF3F49FFC74983C5044983FF3F49FFC74983C5044983FF3F49FFC74983C5044983FF3F415D415FC3
...
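
(Decoding assembled_snippet: it is a prologue that pushes r15 and r13 and loads each with 0x123456 via movabs, the three-instruction inc/add/cmp body repeated four times, then pop r13, pop r15, ret.)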

The block reciprocal throughput predicted by llvm-mca is almost 40% lower than the experimental value. UICA agrees with the experimental value, predicting 1.25 cycles/iteration as the reciprocal throughput.
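
(For scale: per_snippet_value appears to be the per-instruction measurement scaled by the three instructions in the snippet, 0.4234 × 3 ≈ 1.27 cycles per iteration, versus the 0.8 Block RThroughput predicted above.)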

Activity


topperc (Collaborator) commented on Jul 17, 2024

MCA is reporting iterations/cycle of nearly 1.0 right? 100 iterations / 104 cycles.

Naively that seems right to me.
The incq on each iteration is dependent on the previous one.
The addq on each iteration is dependent on the previous one.
The cmpq depends on the incq. Nothing depends on the cmpq.
There are 4 ALUs available for each operation.

On the first cycle the ALUs can do one incq and one addq.
On the second cycle the ALUs can do the incq and addq for the second iteration and the cmpq from the first iteration.
On the third cycle the ALUs can do the incq and addq for the third iteration and the cmpq from the second iteration.
etc.
At the very end we need to do one cmpq by itself.

The iterations per cycle for that should be very nearly 1.
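
A rough sketch of that schedule (iteration numbers in parentheses, assuming 1-cycle ALU latency and no port conflicts):

Cycle 1:   incq(1)   addq(1)
Cycle 2:   incq(2)   addq(2)   cmpq(1)
Cycle 3:   incq(3)   addq(3)   cmpq(2)
...
Cycle 100: incq(100) addq(100) cmpq(99)
Cycle 101:                     cmpq(100)

100 iterations finish in roughly 101 cycles, i.e. just about 1 cycle per iteration.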

boomanaiden154 (Contributor, Author) commented on Jul 17, 2024

MCA is reporting iterations/cycle of nearly 1.0 right? 100 iterations / 104 cycles.

Yes. I misunderstood what the reciprocal throughput field was representing.

Naively that seems right to me.
The incq on each iteration is dependent on the previous one.
The addq on each iteration is dependent on the previous one.
The cmpq depends on the incq. Nothing depends on the cmpq.
There are 4 ALUs available for each operation.

On the first cycle the ALUs can do one incq and one addq.
On the second cycle the ALUs can do the incq and addq for the second iteration and the cmpq from the first iteration.
On the third cycle the ALUs can do the incq and addq for the third iteration and the cmpq from the second iteration.
etc.
At the very end we need to do one cmpq by itself.

The iterations per cycle for that should be very nearly 1.

Right, that all makes sense to me. Looking at the scheduling information for these instructions, it seems correct. CMP64ri8 appears to use the default scheduling class (https://gist.github.com/boomanaiden154/6417e88d67a0facf7995447be74cf7bc), which seems odd, but other than that everything looks good.

However, the benchmark clearly shows about 1.25 cycles/iteration, and UICA supports that. I still haven't figured out why UICA reports numbers so different from MCA's.

boomanaiden154 (Contributor, Author) commented on Jul 18, 2024

I spoke with Andreas Abel about this issue, and the main bottleneck is non-optimal port assignment by the renamer. Looking at the UICA trace, roughly once per iteration an instruction has to wait to be dispatched because the renamer assigned it to the same port as another uop whose dispatch would otherwise happen in the same cycle. This bumps the reciprocal throughput up to 1.25 cycles per iteration.

Given that llvm-mca only models instruction dispatch rather than predecode/uop issue, I don't think this is a trivial issue to fix.
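
(In round numbers, 1.25 cycles per iteration works out to 5 cycles for every 4 iterations, i.e. on average one cycle lost to such a stall per four iterations relative to the ideal 1 cycle/iteration schedule.)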
