Hacker's Guide to HHVM

This directory contains documentation on the internal architecture of HHVM, targeted at C++ developers looking to hack on HHVM itself. If you're a Hack developer looking for documentation on using HHVM, that can be found here.

HHVM is a virtual machine that executes Hack programs using a bytecode interpreter and a JIT compiler (the latter is vastly more complex, and will get much more airtime here). PHP is also currently supported for historical reasons.

You should already be comfortable reading and writing C++ (specifically, HHVM is written in C++14), as well as navigating around large codebases using grep, ctags, or any similar tool of your choice. Prior experience with compilers and runtimes will help but is not strictly necessary.

Code References

Since this guide is intended to help familiarize readers with the HHVM codebase, it naturally contains a number of links to the code. Some of these links point to the current version of a file, and others point to specific lines in specific versions of files. If you find links of either kind that are out of date with current master or are otherwise misleading, please let us know.

Building HHVM

Instructions for building HHVM and running our primary test suite can be found here. The initial build may take an hour or longer, depending on how fast your machine is. Subsequent builds should be faster, as long as you're not touching core header files.

Architecture Overview

HHVM, like most compilers, is best thought of as a pipeline with many different stages. Source code goes in one end, and after a number of different transformations, executable machine code comes out the other end. Some stages are optional, controlled by runtime options. Each stage is described in detail in the articles listed below, but here's a quick-and-dirty overview:

Source code enters the compiler frontend and is converted to a token stream by the lexer. The parser reads this token stream and converts it into an abstract syntax tree, or AST. The bytecode emitter then converts this AST into a stream of bytecode instructions. Everything up to this point is written in OCaml; the rest of HHVM is written in C++ and assembly.
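To make the first stage concrete, here is a minimal sketch of what "converted to a token stream" means. It is a toy written in C++ purely for consistency with the rest of this guide; the real frontend is written in OCaml, and the names `TokKind`, `Token`, and `lex` are hypothetical and do not correspond to anything in the HHVM codebase.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token categories; the real lexer distinguishes far more.
enum class TokKind { Number, Ident, Punct };

struct Token {
  TokKind kind;
  std::string text;
};

// Split a source string into tokens, skipping whitespace.
std::vector<Token> lex(const std::string& src) {
  std::vector<Token> out;
  std::size_t i = 0;
  auto at = [&](std::size_t k) { return static_cast<unsigned char>(src[k]); };
  while (i < src.size()) {
    if (std::isspace(at(i))) { ++i; continue; }
    const std::size_t start = i;
    if (std::isdigit(at(i))) {
      // Run of digits becomes one Number token.
      while (i < src.size() && std::isdigit(at(i))) ++i;
      out.push_back({TokKind::Number, src.substr(start, i - start)});
    } else if (std::isalpha(at(i)) || src[i] == '_' || src[i] == '$') {
      // Identifier or $variable: a start character, then letters, digits, '_'.
      ++i;
      while (i < src.size() && (std::isalnum(at(i)) || src[i] == '_')) ++i;
      out.push_back({TokKind::Ident, src.substr(start, i - start)});
    } else {
      // Anything else is treated as a single-character punctuation token.
      ++i;
      out.push_back({TokKind::Punct, std::string(1, src[start])});
    }
  }
  return out;
}
```

For example, calling `lex("$x = 40 + 2;")` on this toy yields six tokens: `$x`, `=`, `40`, `+`, `2`, and `;`. A parser would then consume that sequence to build the AST.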

After the bytecode and associated metadata are created, our bytecode optimizer, HHBBC, is optionally run. Finally, the bytecode, optimized or not, is stored into a .hhbc file we call a "repo".

If the frontend was invoked directly by HHVM, the bytecode also lives in-memory in the hhvm process, and execution can begin right away. If the frontend was invoked as an ahead-of-time build step, the bytecode will be loaded from the repo by hhvm when it eventually starts. If the JIT is disabled, the bytecode interpreter steps through the code one instruction at a time, decoding and executing each one. Otherwise, the JIT is tasked with compiling the bytecode into machine code.
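The "decode and execute one instruction at a time" loop is easiest to picture with a toy example. The sketch below is an assumption-laden illustration, not HHVM's interpreter: the opcodes (`PushInt`, `Add`, `Mul`, `Ret`), the `Instr` struct, and the `interpret` function are all hypothetical, and real HHVM bytecode is considerably richer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy opcodes; these are NOT HHVM's real bytecodes.
enum class Op : std::uint8_t { PushInt, Add, Mul, Ret };

// One decoded instruction: an opcode plus an immediate (unused by most ops).
struct Instr {
  Op op;
  std::int64_t imm;
};

// Step through the program one instruction at a time, dispatching on the
// opcode and executing it against an evaluation stack.
std::int64_t interpret(const std::vector<Instr>& code) {
  std::vector<std::int64_t> stack;
  for (std::size_t pc = 0; pc < code.size(); ++pc) {
    const Instr& in = code[pc];
    switch (in.op) {
      case Op::PushInt:
        stack.push_back(in.imm);
        break;
      case Op::Add: {
        assert(stack.size() >= 2);
        const std::int64_t rhs = stack.back();
        stack.pop_back();
        stack.back() += rhs;
        break;
      }
      case Op::Mul: {
        assert(stack.size() >= 2);
        const std::int64_t rhs = stack.back();
        stack.pop_back();
        stack.back() *= rhs;
        break;
      }
      case Op::Ret:
        assert(!stack.empty());
        return stack.back();
    }
  }
  return 0;  // Fell off the end without a Ret.
}
```

Running this toy on the sequence `PushInt 2, PushInt 3, Add, PushInt 4, Mul, Ret` returns 20. Every instruction pays the cost of the dispatch `switch`, which is one reason a JIT that compiles many instructions at once can be much faster.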

The first step in the JIT is region selection, which decides how many bytecode instructions to compile at once, based on a number of complicated heuristics. The chosen bytecode region is then lowered to HHIR, our primary intermediate representation. A series of optimizations is run on the HHIR program, which is then lowered into vasm, our low-level IR. More optimizations are run on the vasm program, followed by register allocation, and finally, code generation. Once various metadata is recorded and the code is relocated to the appropriate place, it is ready to be executed.
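As a rough mental model, the JIT stages just described can be pictured as an ordered list of passes applied to one unit of code. The sketch below only captures the ordering stated in this overview; `Unit`, `Stage`, `jitStages`, and `runJit` are hypothetical names, and each stage body is a placeholder that records its name rather than doing any real work.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for the code flowing through the JIT.
struct Unit {
  std::vector<std::string> trace;  // Names of the stages that have run.
};

struct Stage {
  std::string name;
  std::function<void(Unit&)> run;
};

// Stage order follows the overview; the bodies are placeholders.
std::vector<Stage> jitStages() {
  auto stub = [](const std::string& name) {
    return Stage{name, [name](Unit& u) { u.trace.push_back(name); }};
  };
  return {
    stub("region selection"),
    stub("lower to HHIR"),
    stub("HHIR optimizations"),
    stub("lower to vasm"),
    stub("vasm optimizations"),
    stub("register allocation"),
    stub("code generation"),
  };
}

// Run every stage in order over a fresh unit.
Unit runJit() {
  Unit unit;
  for (const Stage& stage : jitStages()) {
    stage.run(unit);
  }
  return unit;
}
```

In the real system each stage consumes and produces a different representation (bytecode, HHIR, vasm, machine code) rather than mutating one shared structure; the flat list here is only meant to help you keep the order straight while reading the articles on each component.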

Getting Started

If you're not sure where to start, skimming these articles is a good first step:

HHVM Internals

The articles in this section go into more detail about their respective components: