fixup! Add profiling with perf chapter OCaml manual
tmcgilchrist committed Dec 11, 2024
1 parent 9451889 commit d8fbfcc
Showing 1 changed file with 14 additions and 14 deletions.
28 changes: 14 additions & 14 deletions manual/src/cmds/profile-perf.etex
@@ -3,7 +3,7 @@

This chapter describes how to use \texttt{perf} to profile OCaml programs.

Linux Performance Events (\texttt{perf (1)}) is a suite of tools for performance observability. The main features covered here are \texttt{perf-record(1)} for recording events and \texttt{perf-report(1)} for printing and visualising recorded events. \texttt{perf} has many additional profiling and visualising features. For more comprehensive documentation, see (\texttt{perf(1)}, \href{https://perfwiki.github.io/main/}{\texttt{perf} wiki} or \href{https://www.brendangregg.com/perf.html}{Brendan Gregg's Blog}).
Linux Performance Events (\texttt{perf(1)}) is a suite of tools for performance observability. The main features covered here are \texttt{perf-record(1)} for recording events, and \texttt{perf-report(1)} for printing and visualising recorded events. \texttt{perf} has many additional profiling and visualising features. For more comprehensive documentation, see \texttt{perf(1)}, the \href{https://perfwiki.github.io/main/}{\texttt{perf} wiki} or \href{https://www.brendangregg.com/perf.html}{Brendan Gregg's Blog}.

\section{s:ocamlperf-call-graph}{Background}

@@ -14,19 +14,19 @@ CPU profiling is typically performed by sampling the CPU call graph at frequent
\item Hardware Last Branch Record (LBR).
\end{itemize}

Of these options, frame pointers are recommended for profiling OCaml code for the following:
\begin{itemize}
\item Unwinding is faster and uses less CPU.
\item Trace files produced are smaller.
\item Frame pointers provide more accurate call graphs, particularly when used with a Linux distribution that supports them.
\item Frame pointers work better with OCaml 5's non-contiguous stacks.
\end{itemize}
Frame pointer based call graphs use a convention where a register called the frame pointer holds the address of the beginning of the current stack frame, and the previous value of the frame pointer is stored at a known offset from the current frame pointer. Using this information, the call graph for the current function can be calculated purely from the current stack, a process called unwinding. On x86_64 the register used for storing the frame pointer is \$rbp, while ARM64 uses the register x29. OCaml 5 introduced non-contiguous stacks as part of the implementation of effects, see \href{https://dl.acm.org/doi/10.1145/3453483.3454039}{Retrofitting effect handlers onto OCaml} (Section 5.5); frame pointers work better with these stacks and with the copying nature of \texttt{perf}.

Frame pointer based call graphs use a convention where the head of the linked list of stack frames can be found in a register called the frame pointer (e.g., \$rbp on x86_64), and two pointers to the previous stack frame and the return address are saved at a known offset from the frame pointer. This linked list of stack frames is then used to walk the stack of called functions. OCaml 5 features non-contiguous stacks as part of the implementation of effects, see \href{https://dl.acm.org/doi/10.1145/3453483.3454039}{Retrofitting effect handlers onto OCaml} (Section 5.5).
DWARF based call graphs use the DWARF CFI information to perform unwinding. However, this interacts poorly with the copying nature of \texttt{perf}, often leading to truncated call graphs when \texttt{perf} has not copied enough of the stack. It also produces larger trace files that are more costly to capture and process. Finally, it requires including CFI debugging information in your program, resulting in larger binaries.

DWARF based call graphs use the DWARF CFI information to perform unwinding. However this produces larger trace files that are more costly to capture and are often truncated because \texttt{perf} has not copied enough of the call stack. It also requires including CFI debugging information in your program resulting in larger binaries.
Hardware Last Branch Record (LBR) uses a processor-provided method to record call graphs. Its limitations are restricted availability (only on certain Intel CPUs) and a limited stack depth (16 on Haswell and 32 since Skylake).

Hardware Last Branch Record (LBR) uses a processor provided method to record call graphs. This has the dual limitations of restricted availability (only on certain Intel CPUS) and a limited stack depth. The stack depth is 16 on Haswell and 32 since Skylake.
Of these options, frame pointers are recommended for profiling OCaml code because they have the following advantages:
\begin{itemize}
\item Unwinding is faster to calculate.
\item Tracing data produced is smaller.
\item Frame pointers provide more complete call graphs, particularly when used with a Linux distribution that supports them.
\item Frame pointers work better with perf's copying nature and OCaml 5's non-contiguous stacks.
\end{itemize}
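
As a rough sketch, the three unwinding methods above correspond to the \texttt{fp}, \texttt{dwarf} and \texttt{lbr} values of the \texttt{--call-graph} option of \texttt{perf record}. The helper name \texttt{perf_record_cmd} and the executable \texttt{./a.out} are assumptions introduced for illustration only; the helper prints a command line rather than running it.

```shell
# Hypothetical helper: print (do not run) a perf record command line
# for one of the three call-graph methods described above.
perf_record_cmd() {
  method="$1"
  shift
  case "$method" in
    fp|dwarf|lbr)
      # -F 99 samples at 99Hz; --call-graph selects the unwinding method.
      printf 'perf record -F 99 --call-graph %s -- %s\n' "$method" "$*"
      ;;
    *)
      echo "unknown call-graph method: $method" >&2
      return 1
      ;;
  esac
}

# ./a.out is a placeholder for the program being profiled.
perf_record_cmd fp ./a.out
```

Printing the command first makes it easy to compare the methods side by side before committing to a recording run.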

\section{s:ocamlperf-compiling}{Compiling for Profiling}

@@ -38,7 +38,7 @@ To enable frame pointers, configure the compiler with \texttt{--enable-frame-poi
opam switch create <YOUR-SWITCH-NAME-HERE> ocaml-option-fp
\end{verbatim}
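
One way to check that the resulting compiler was built with frame pointers is to inspect the output of \texttt{ocamlopt -config}, which reports a \texttt{with_frame_pointers} field. The sketch below assumes \texttt{ocamlopt} from the new switch is on your \texttt{PATH}; the function name \texttt{has_frame_pointers} is hypothetical.

```shell
# Hypothetical helper: succeeds when the configuration text on stdin
# reports that frame pointers are enabled.
has_frame_pointers() {
  grep -q '^with_frame_pointers: true$'
}

# Only query the compiler if one is installed.
if command -v ocamlopt >/dev/null 2>&1; then
  if ocamlopt -config | has_frame_pointers; then
    echo "frame pointers enabled"
  else
    echo "frame pointers NOT enabled"
  fi
fi
```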

Frame pointer support for OCaml is available on x86_64 architecture on Linux starting with OCaml 4.12 and on macOS from OCaml 5.3. ARM64 architecture is supported on Linux and macOS from OCaml 5.4, while other Tier-1 architectures (POWER, RISC-V, and s390x) are currently unsupported.
Frame pointer support for OCaml is available on the x86_64 architecture for Linux starting with OCaml 4.12 and for macOS from OCaml 5.3. The ARM64 architecture is supported on Linux and macOS from OCaml 5.4, while other Tier-1 architectures (POWER, RISC-V, and s390x) are currently unsupported.

\section{s:ocamlperf-profiling}{Profiling an Execution}

@@ -51,7 +51,7 @@ The \texttt{-F 99} option sets \texttt{perf} to sample at 99Hz, reducing excessi

The \texttt{perf record} command works by copying a segment of the call stack at each sample and recording this data into a \texttt{perf.data} file. These samples can then be processed after recording using \texttt{perf report} to reconstruct the profiled program’s call stack at every sample.

\texttt{perf} uses the symbols in an OCaml executable, so it helps to understand OCaml's name mangling scheme to map names to OCaml source locations. Before OCaml 5.1, \texttt{ocamlopt} mangled names used the \texttt{camlModule__identifier_stamp} format; from 5.1 onwards, the separator is a dot \texttt{camlModule.identifier_stamp}. Both formats are supported by \texttt{perf}.
\texttt{perf} uses the symbols present in an OCaml executable, so it helps to understand OCaml's name mangling scheme to map names to OCaml source locations. Before OCaml 5.1, names mangled by \texttt{ocamlopt} used the \texttt{camlModule__identifier_stamp} format; from 5.1 onwards, the separator is a dot: \texttt{camlModule.identifier_stamp}. Both formats are supported by \texttt{perf}.
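
To make the two formats concrete, here is a minimal sketch that maps a mangled symbol back to a \texttt{Module.identifier} name. It handles only the two simple shapes described above (a single module, a single identifier and a numeric stamp); the function name \texttt{demangle_ocaml} is hypothetical, and real symbols, for example those from nested modules, may need more care.

```shell
# Hypothetical helper: recover Module.identifier from a mangled symbol.
# Assumes the simple shapes camlModule__identifier_stamp (before 5.1)
# and camlModule.identifier_stamp (5.1 onwards), with a numeric stamp.
demangle_ocaml() {
  printf '%s\n' "$1" | sed -E \
    -e 's/^caml//' \
    -e 's/_[0-9]+$//' \
    -e 's/__/./'
}

# Both calls print the same name.
demangle_ocaml 'camlModule__identifier_123'
demangle_ocaml 'camlModule.identifier_123'
```

The three \texttt{sed} expressions strip the \texttt{caml} prefix, drop the trailing stamp, and turn the first double-underscore separator into a dot.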

Consider the following program:

@@ -81,7 +81,7 @@ The basic \texttt{perf report} command is:
perf report -f --no-children -i perf.data
\end{verbatim}

This command provides an interactive interface where you can navigate through the accumulated call graphs and select functions and threads for detailed information. Alternatively \texttt{--stdio} will output similar data using a text based report writing to stdout. Note that if stack traces appear broken, it may be due to software not having frame pointer support.
This command provides an interactive interface where you can navigate through the accumulated call graphs and select functions and threads for detailed information. Alternatively, the \texttt{--stdio} option writes a similar text-based report to stdout. Note that if stack traces appear broken, the software involved may have been built without frame pointer support enabled.

Consider the following program which calculates the Tak function.
\begin{caml_example*}{verbatim}