This directory contains documentation on the internal architecture of HHVM, targeted at C++ developers looking to hack on HHVM itself. If you're a Hack developer looking for documentation on using HHVM, that can be found here.
HHVM is a virtual machine that executes Hack programs using a bytecode interpreter and a JIT compiler (the latter is vastly more complex, and will get much more airtime here). PHP is also currently supported for historical reasons.
You should already be comfortable reading and writing C++ (specifically, HHVM is written in C++14), as well as navigating around large codebases using grep, ctags, or any similar tool of your choice. Prior experience with compilers and runtimes will help but is not strictly necessary.
Since this guide is intended to help familiarize readers with the HHVM codebase,
it naturally contains a number of links to the code. Some of these links are to
the current version of a file, and others are to specific lines in specific
versions of files. If you find links of either kind that are out-of-date with
current master or are otherwise misleading, please let us know.
Instructions for building HHVM and running our primary test suite can be found here. The initial build may take an hour or longer, depending on how fast your machine is. Subsequent builds should be faster, as long as you're not touching core header files.
HHVM, like most compilers, is best thought of as a pipeline with many different stages. Source code goes in one end, and after a number of different transformations, executable machine code comes out the other end. Some stages are optional, controlled by runtime options. Each stage is described in detail in the articles listed below, but here's a quick-and-dirty overview:
Source code enters the compiler frontend and is converted to a token stream by the lexer. The parser reads this token stream and converts it into an abstract syntax tree, or AST. The bytecode emitter then converts this AST into a stream of bytecode instructions. Everything up to this point is written in OCaml; the rest of HHVM is written in C++ and assembly.
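To make the lexer, parser, and emitter stages concrete, here is a minimal sketch of the same three-stage shape for a toy language of integer sums and products. It is written in C++ purely for illustration (the real frontend is not), and every name in it (`lex`, `Parser`, `emit`, `compile`, the `Op` enum) is hypothetical rather than taken from HHVM or HHBC.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Toy frontend: lexer -> parser -> emitter. All names are hypothetical.
enum class TokKind { Int, Plus, Star, End };
struct Token { TokKind kind; int64_t value; };

// Stage 1: the lexer turns source text into a token stream.
std::vector<Token> lex(const std::string& src) {
  std::vector<Token> out;
  size_t i = 0;
  while (i < src.size()) {
    unsigned char c = static_cast<unsigned char>(src[i]);
    if (std::isspace(c)) { ++i; continue; }
    if (std::isdigit(c)) {
      int64_t v = 0;
      while (i < src.size() &&
             std::isdigit(static_cast<unsigned char>(src[i]))) {
        v = v * 10 + (src[i] - '0');
        ++i;
      }
      out.push_back({TokKind::Int, v});
    } else if (c == '+') { out.push_back({TokKind::Plus, 0}); ++i; }
    else if (c == '*') { out.push_back({TokKind::Star, 0}); ++i; }
    else throw std::runtime_error("unexpected character");
  }
  out.push_back({TokKind::End, 0});  // sentinel so the parser never overruns
  return out;
}

// Stage 2: the parser turns the token stream into an AST.
struct Node {
  char op;        // 'n' for an integer literal, otherwise '+' or '*'
  int64_t value;  // only meaningful when op == 'n'
  std::unique_ptr<Node> lhs, rhs;
};

struct Parser {
  const std::vector<Token>& toks;
  size_t pos;

  std::unique_ptr<Node> primary() {
    if (toks[pos].kind != TokKind::Int)
      throw std::runtime_error("expected integer");
    auto n = std::make_unique<Node>();
    n->op = 'n';
    n->value = toks[pos++].value;
    return n;
  }
  std::unique_ptr<Node> binary(TokKind kind, char op,
                               std::unique_ptr<Node> (Parser::*next)()) {
    auto lhs = (this->*next)();
    while (toks[pos].kind == kind) {
      ++pos;
      auto rhs = (this->*next)();
      auto n = std::make_unique<Node>();
      n->op = op;
      n->lhs = std::move(lhs);
      n->rhs = std::move(rhs);
      lhs = std::move(n);
    }
    return lhs;
  }
  // '*' binds tighter than '+', so terms are parsed below expressions.
  std::unique_ptr<Node> term() { return binary(TokKind::Star, '*', &Parser::primary); }
  std::unique_ptr<Node> expr() { return binary(TokKind::Plus, '+', &Parser::term); }
};

// Stage 3: the emitter walks the AST and produces stack bytecode.
enum class Op : uint8_t { PushInt, Add, Mul };
struct Instr { Op op; int64_t imm; };

void emit(const Node& n, std::vector<Instr>& out) {
  if (n.op == 'n') { out.push_back({Op::PushInt, n.value}); return; }
  emit(*n.lhs, out);  // operands first, operator last (postorder)
  emit(*n.rhs, out);
  out.push_back({n.op == '+' ? Op::Add : Op::Mul, 0});
}

std::vector<Instr> compile(const std::string& src) {
  std::vector<Token> toks = lex(src);
  Parser p{toks, 0};
  std::unique_ptr<Node> ast = p.expr();
  if (toks[p.pos].kind != TokKind::End)
    throw std::runtime_error("trailing input");
  std::vector<Instr> code;
  emit(*ast, code);
  return code;
}
```

Calling `compile("1 + 2 * 3")` produces five instructions: three `PushInt`s followed by `Mul` and then `Add`. Each stage consumes only the output of the previous one, which is what lets a pipeline like this be tested and replaced one stage at a time.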
After the bytecode and associated metadata are created, our bytecode optimizer,
HHBBC, is optionally run. Finally, the bytecode, optimized or not, is stored
into a `.hhbc` file we call a "repo".
If the frontend was invoked directly by HHVM, the bytecode also lives in-memory
in the `hhvm` process, and execution can begin right away. If the frontend was
invoked as an ahead-of-time build step, the bytecode will be loaded from the
repo by `hhvm` when it eventually starts. If the JIT is disabled, the bytecode
interpreter steps through the code one instruction at a time, decoding and
executing each one. Otherwise, the JIT is tasked with compiling the bytecode
into machine code.
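The decode-and-execute loop described above can be sketched with a toy stack machine. This is not HHVM's interpreter: the opcodes (`kPushInt`, `kAdd`, `kMul`, `kRet`), the byte encoding, and the helper `appendPushInt` are all assumptions made up for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Toy bytecode: one opcode byte, and for kPushInt an 8-byte immediate.
// These opcodes are hypothetical and unrelated to HHBC.
enum Opcode : uint8_t { kPushInt = 0, kAdd = 1, kMul = 2, kRet = 3 };

// Helper for building programs: appends a kPushInt and its immediate.
void appendPushInt(std::vector<uint8_t>& code, int64_t v) {
  code.push_back(kPushInt);
  uint8_t buf[sizeof v];
  std::memcpy(buf, &v, sizeof v);
  code.insert(code.end(), buf, buf + sizeof v);
}

// Steps through the bytecode one instruction at a time.
int64_t interpret(const std::vector<uint8_t>& code) {
  std::vector<int64_t> stack;
  size_t pc = 0;
  auto pop = [&stack]() {
    if (stack.empty()) throw std::runtime_error("stack underflow");
    int64_t v = stack.back();
    stack.pop_back();
    return v;
  };
  while (pc < code.size()) {
    uint8_t op = code[pc++];  // decode the opcode
    switch (op) {             // execute it
      case kPushInt: {
        if (pc + sizeof(int64_t) > code.size())
          throw std::runtime_error("truncated immediate");
        int64_t imm;
        std::memcpy(&imm, &code[pc], sizeof imm);  // decode the immediate
        pc += sizeof imm;
        stack.push_back(imm);
        break;
      }
      case kAdd: {
        int64_t r = pop();
        int64_t l = pop();
        stack.push_back(l + r);
        break;
      }
      case kMul: {
        int64_t r = pop();
        int64_t l = pop();
        stack.push_back(l * r);
        break;
      }
      case kRet:
        return pop();
      default:
        throw std::runtime_error("unknown opcode");
    }
  }
  throw std::runtime_error("fell off the end of the bytecode");
}
```

A program built from `appendPushInt` for 1, 2, and 3 followed by `kMul`, `kAdd`, and `kRet` evaluates 1 + 2 * 3. The per-instruction decode and dispatch in this loop is the overhead that a JIT avoids by compiling a run of bytecode into machine code ahead of execution.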
The first step in the JIT is region selection, which decides how many bytecode instructions to compile at once, based on a number of complicated heuristics. The chosen bytecode region is then lowered to HHIR, our primary intermediate representation. A series of optimizations are run on the HHIR program, then it is lowered into vasm, our low-level IR. More optimizations are run on the vasm program, followed by register allocation, and finally, code generation. Once various metadata is recorded and the code is relocated to the appropriate place, it is ready to be executed.
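As a rough illustration of what "lowering" and "running an optimization" mean in this pipeline, the sketch below lowers a region of toy stack bytecode into a toy SSA-style IR and then runs a single constant-folding pass over it. Nothing here reflects the actual HHIR or vasm data structures: the types `Bc` and `Ir` and the functions `lower` and `foldConstants` are hypothetical, and region selection, register allocation, and code generation are left out entirely.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Toy stack bytecode (hypothetical, not HHBC).
enum class BcOp { PushInt, Add, Mul };
struct Bc { BcOp op; int64_t imm; };

// Toy SSA-style IR (hypothetical, not HHIR). Instruction i defines
// temporary i; lhs and rhs name earlier temporaries, or -1 if unused.
enum class IrOp { Const, Add, Mul };
struct Ir { IrOp op; int64_t imm; int lhs; int rhs; };

// Lowering: simulate the operand stack at compile time, tracking which
// temporary holds each stack slot, so the IR needs no stack at all.
std::vector<Ir> lower(const std::vector<Bc>& region) {
  std::vector<Ir> ir;
  std::vector<int> stack;  // temporaries currently on the virtual stack
  for (const Bc& bc : region) {
    if (bc.op == BcOp::PushInt) {
      ir.push_back({IrOp::Const, bc.imm, -1, -1});
    } else {
      if (stack.size() < 2) throw std::runtime_error("stack underflow");
      int r = stack.back(); stack.pop_back();
      int l = stack.back(); stack.pop_back();
      ir.push_back({bc.op == BcOp::Add ? IrOp::Add : IrOp::Mul, 0, l, r});
    }
    stack.push_back(static_cast<int>(ir.size()) - 1);
  }
  return ir;
}

// One optimization pass: constant folding. Operands always refer to
// earlier instructions, so a single forward pass is enough.
void foldConstants(std::vector<Ir>& ir) {
  for (Ir& inst : ir) {
    if (inst.op == IrOp::Const) continue;
    const Ir& l = ir[inst.lhs];
    const Ir& r = ir[inst.rhs];
    if (l.op != IrOp::Const || r.op != IrOp::Const) continue;
    int64_t v = inst.op == IrOp::Add ? l.imm + r.imm : l.imm * r.imm;
    inst = {IrOp::Const, v, -1, -1};
  }
}
```

Lowering the bytecode for 1 + 2 * 3 yields five IR instructions, and after `foldConstants` the last one is a `Const` holding 7. The earlier constants are now unused; in a fuller pipeline a dead-code elimination pass would remove them before the program is lowered further.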
If you're not sure where to start, skimming these articles is a good first step:
The articles in this section go into more detail about their respective components:
- HHBC spec
- Frontend
  - Parser
  - Emitter
  - HHBBC
  - ...
- VM Runtime
  - Core data structures
  - Memory management
  - Execution Context
  - Bytecode interpreter
  - Unwinder
  - Treadmill
  - Debugger
  - ...
- JIT Compiler