Appendix:

Graviton Performance Runbook toplevel

This Appendix contains additional information for engineers that want to go deeper on a particular topic, such as using different PMU counters to understand how the code is executing on the hardware, discussion on load generators, and additional tools to help with code observability.

Useful Graviton PMU Events and ratios

The following list of counter ratios has been curated to list events useful for performance debugging. The more extensive list of counters is contained in the following references:

Arm ARM
Neoverse N1 TRM
Neoverse N1 PMU Guide
Neoverse V1 TRM
Neoverse V1 PMU Guide
Neoverse V2 TRM
Neoverse V2 PMU Guide

METRIC	Counter #1	Counter #2	Formula	Description
IPC	0x8 (INST_RETIRED)	0x11 (CPU_CYCLES)	#1 / #2	Instructions per cycle, metric to understand how much parallelism in the core is left unused.
Stalls Frontend PKC	0x23 (STALL_FRONTEND)	0x11 (CPU_CYCLES)	(#1 / #2) * 1000	Stalls per kilo-cycle caused by frontend not having an instruction available for dispatch stage to issue.
Stalls Backend PKC	0x24 (STALL_BACKEND)	0x11 (CPU_CYCLES)	(#1 / #2) * 1000	Stalls per kilo-cycle caused by backend preventing instruction dispatch from frontend due to ALUs IQ full, LSQ full, MCQ full.
Core Mem BW	0x37		(#1 * 64) / (sample period)	Read demand memory bandwidth estimate (excludes writes, prefetch, tablewalks) in Bytes / s from a core.
LLC MPKI	0x37 (LLC_CACHE_MISS_RD)	0x8 (INST_RETIRED)	(#1 / #2) * 1000	From the cores perspective, is the LLC operating well, infer if it should be increased
BMPKI	0x10 (BR_MIS_PRED)	0x8 (INST_RETIRED)	(#1 / #2) * 1000	Branch Miss per Kilo Instruction, metric to understand if branch prediction logic is on average performing well or not. Branches exist every ~4 instructions, BMPKI > 10 is not good.
L2C Inst MPKI	0x108 (L2_REFILL_IFETCH)	0x8	(#1 / #2) * 1000	Instruction fetch miss from L2 MPKI, infer if code size and fetch misses are stalling core
L1C Inst MPKI	0x1 (L1I_CACHE_REFILL)	0x8	(#1 / #2) * 1000	Instruction fetch miss L1 MPKI, >20 is generally an indicator of code size pressure
L2 MPKI	0x17 (L2D_CACHE_REFILL)	0x8	(#1 / #2) * 1000	L2 Cache overall MPKI,
L1C Data MPKI	0x4 (L1_DATA_REFILL)	0x8	(#1 / #2) * 1000	L1 Data cache overall MPKI.
Old Atomics failures	0x6E (STREX_FAIL_SPEC)	0x8	(#1 / #2) * 1000	PKI of store-exclusive failures, indicates software not using LSE in production.
L2 TLB MPKI	0x2D	0x8	(#1 / #2) *1000	L2 TLB MPKI
L1I TLB MPKI	0x2	0x8	(#1 / #2) * 1000	Instruction L1 TLB MPKI
L1I Page Walk PKI	0x35	0x8	(#1 / #2) * 1000	Instruction PKI that requires a page-table walk
L1D TLB MPKI	0x5	0x8	(#1 / #2) * 1000	Data L1 TLB MPKI
L1D Page Walk PKI	0x34	0x8	(#1 / #2) * 1000	Data PKI that requires a page-table walk

Load generators

There are many load generators out there, and most if not all allow for the same basic configurations such as number of simulated clients (or number of connections), number of outstanding requests, target throughput, programmable way to define request mix etc. Pick one that is easy to work with, offers the features needed, is extensible for creating the tests desired, and allows fine grained control over how load is generated. After finding the ideal load generator, it is also important to understand how the load generator creates load to avoid easy measurement mistakes. Load generators fall into two main types: closed and open loop.

Closed loop

A closed loop load generator uses simple, synchronous IO calls for load generation in a similar fashion to below:

void generate_load(socket_t sock)
{
   int wret, rret = 0;
   const char *msg = "GENERATE SOME LOAD";
   char *buf = (char *)malloc(128);
   do {
     wret = write(sock, msg, strlen(msg));
     rret = read(sock, buf, 128);
   } while(wret >= 0 && rret);
}

This code is simple, and load is increased by creating more threads and connections that sit in a tight loop making requests to the SUT (System Under Test). Closed loop testers can also be implemented with event loops that multiplex sockets among a few threads, but allow only 1 outstanding request per socket, which requires more connections to drive more concurrent load. The Wrk load generator is an example of this. Some pitfalls with closed-loop testing:

Maximum load achievable is coupled to the observed round-trip network latency between the driver and the SUT, if one SUT is further away than another from the load generator, it will see less throughput with the same number of connections.
Not good for testing non-client facing services, like middle-layer web-caches, that leverage request pipelining on small number of connections.
Can quickly saturate the CPU on the load generator because high loads require many threads and/or connections.

Advantages of closed loop testers are: many to choose from, highly configurable and approximate client behavior well. Closed loop load generator suggestions (we highly recommend the Wrk2 load generator):

Wrk2
Wrk
JMeter

Open loop

Open-loop generators generate load in an asynchronous fashion. They use non-blocking IO to send requests onto a small set of connections regardless of whether a response was returned for the previous request or not. Since open-loop generators send requests regardless of responses received, they can approximate a greater variety of load scenarios because the sending side will send requests according to a specified time schedule (i.e. every 100us on average following an exponential distribution) instead of waiting until responses from the SUT before generating new requests. This allows an open-loop tester to load the SUT more completely and be less sensitive to latency of the network connection to the SUT.

Some pitfalls of open-loop testing:

Not common
Less configurable
Can not be used with SUTs that only allow synchronous connections. For instance, many DBMS engines like Postgresql only allow a single command to be issued per connection.

Advantages of open loop testers are they are more accurate in measuring latency of an application than closed-loop testers.

Open loop load generator suggestions:

Lancet, Usenix paper
Mutilate
MCPerf
TRex - This is for high performance stateless network testing

eBPF (extended Berkley Packet Filter) — “Inspecting fine-grained code behavior”

If you find a specific function that is mis-behaving, either putting the CPU to sleep more often, or taking longer to execute, but manual code inspection and flamegraphs are not yielding results, you can use eBPF to obtain detailed statistics on the function’s behavior. This can be leveraged for investigating reasons a function causes large latency spikes. Or measuring if it behaves differently on one machine compared to the other. These tools are useful for measuring a very specific issue, are extremely powerful and flexible. Warning: these tools can also have high overhead as they measure an event every time it happens compared to the sampling methods used in the previous sections. If measuring something that happens millions of times a second, any injected eBPF will also run at that rate.

Install the BCC tools

# AL2 
%> sudo `amazon-linux-extras enable BCC
%> sudo yum install bcc
  
# Ubuntu
%> sudo apt-get install linux-headers-$(uname -r) bpfcc-tools`

Write an eBPF program in C that can be embedded in a python file
Run the eBPF program using the python to interface via the BCC tools and collect the statistics
Review the statistics and compare against the x86 SUT to compare.

Example eBPF programs can be found: https://github.com/iovisor/bcc.

Capturing Coherent Mesh Network hardware event counters

The CMN PMU counts events such as requests to DRAM (memory bandwidth), virtual memory manangement events, I/O bus requests and coherence snoop events. These metrics can be used to assess an application's utilization of such system level resources and if resources are used efficiently. CMN counters are only accessible on metal-type instances and certain OSes and kernels.

Distro	Kernel	Graviton2 (6g)	Graviton3 (7g)	Graviton4 (8g)
Ubuntu-20.04	5.15	yes	no	no
Ubuntu-20.04	>=5.19	yes	yes	no
Ubuntu-22.04	>=5.19	yes	yes	no
Ubuntu-24.04	>=6.8.0	yes	yes	yes
AL2023	6.1.2	yes	yes	no

General procedure on Ubuntu

sudo apt install linux-modules-extra-aws
sudo modprobe arm-cmn
ls /sys/devices/arm_cmn_0/events

On AL2023/AL2:

sudo modprobe arm_cmn
ls /sys/devices/arm_cmn_0/events

Examples for capturing events:

sudo perf stat -e /arm_cmn_0/hnf_mc_reqs/ sleep 15 #count of memory request
sudo perf stat -e /arm_cmn_0/rnid_rxdat_flits/ sleep 15 #count AXI 'master' read requests
sudo perf stat -e /arm_cmn_0/rnid_txdat_flits/ sleep 15 #count AXI 'master' write requests

For further information about specific events and useful ratios, please refer to:

ARM documentation for Graviton2's CMN-600

ARM documentation for Graviton3's CMN-650

ARM documentation for Graviton4's CMN-700

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

appendix.md

appendix.md

Appendix:

Useful Graviton PMU Events and ratios

Load generators

Closed loop

Open loop

eBPF (extended Berkley Packet Filter) — “Inspecting fine-grained code behavior”

Capturing Coherent Mesh Network hardware event counters

Files

appendix.md

Latest commit

History

appendix.md

File metadata and controls

Appendix:

Useful Graviton PMU Events and ratios

Load generators

Closed loop

Open loop

eBPF (extended Berkley Packet Filter) — “Inspecting fine-grained code behavior”

Capturing Coherent Mesh Network hardware event counters