Skip to content

Latest commit

 

History

History
325 lines (268 loc) · 15 KB

README.md

File metadata and controls

325 lines (268 loc) · 15 KB

Abstract Syntax Tree (AST)

Table of Contents

Description

The Abstract Syntax Tree (AST) is a critical component in the OCaml compilation process. It represents the structure of the source code in a tree-like format, allowing for advanced code manipulations and transformations. This guide explores the importance of the AST, how it is used in preprocessing, and the different methods available for working with it through PPX (PreProcessor eXtensions).

Preprocessing in OCaml

Unlike some programming languages that have built-in preprocessing features—such as C's preprocessor or Rust's macro system, OCaml lacks an integrated macro system. Instead, it relies on standalone preprocessors.

The OCaml Platform officially supports a library for creating these preprocessors, which can operate at two levels:

  • Source Level: Preprocessors work directly on the source code.
  • AST Level: Preprocessors manipulate the AST, offering more powerful and flexible transformations. (Covered in this guide)

⚠️ Warning
One of the key challenges with working with the Parsetree (the AST in OCaml) is that its API is not stable. For instance, in the OCaml 4.13 release, significant changes were made to the Parsetree type, which can impact the compatibility of your preprocessing tools. Read more about it in The Future of PPX

AST Guide

This guide will concentrate on AST-level preprocessing using PPX (PreProcessor eXtensions), providing a comprehensive overview of the following topics:

  1. AST Construction: Learning how to build and manipulate ASTs.
  2. AST Destructuring: Breaking down ASTs into manageable components for advanced transformations.

Why Should I Understand the AST?

OCaml's Parsetree can be confusing, verbose, and hard to understand, but it's a powerful tool that can help you write better code, understand how the compiler works, and develop your own PPXs.

You don't need to be an expert on it knowing all the tree possibilities, but you should know how to read it. For this, I'm going to use the AST Explorer throughout the repository to help you understand the AST.

A simple example of learning more about the OCaml compiler is that types are recursive by default, while values are non-recursive. With the AST, we can see this clearly:

type name = string
let name = "John Doe"
// AST Tree
{
  "type": "structure",
  "structure": [
    // type name = string
    {
      "type": "structure_item",
      "pstr_desc": {
        "type": "Pstr_type",
        "rec_flag": {
          "type": "Recursive"
        },
        "type_declarations": [
          {
            "type": "type_declaration",
            "ptype_name": {/* ... */},
          }
        ]
      },
      "pstr_loc": {/* ... */},
    },
    // let name = "John Doe"
    {
      "type": "structure_item",
      "pstr_desc": {
        "type": "Pstr_value",
        "rec_flag": {
          "type": "Nonrecursive"
        },
        "value_bindings": [
          {
            "type": "value_binding",
            "pvb_pat": {/* ... */},
          }
        ]
      },
      "pstr_loc": {/* ... */},
    }
  ]
}

First Look

By comparing code snippets with their AST representations, you'll better understand how OCaml interprets your code, which is essential for working with PPXs or delving into the compiler's internals. The AST Explorer tool will help make these concepts clearer and more accessible.

Let's take a quick look at the JSON AST representation of a simple OCaml expression:

(* Foo.ml *)
let name = "john doe"
// AST Tree
{
  "type": "structure",
  "structure": [
    {
      "type": "structure_item",
      "pstr_desc": {
        "type": "Pstr_value",
        "rec_flag": {
          /* ... */
        },
        "value_bindings": [
          {
            "type": "value_binding",
            "pvb_pat": {
              "type": "pattern",
              "ppat_desc": {
                "type": "Ppat_var",
                "string_loc": {
                  "type": "string Asttypes.loc",
                  "txt": "name",
                  "loc": {
                    /* ... */
                  }
                }
              },
              "ppat_loc": {
                /* ... */
              }
            },
            "pvb_expr": {
              "type": "expression",
              "pexp_desc": {
                "type": "Pexp_constant",
                "constant": {
                  "type": "Pconst_string",
                  "string": "john doe",
                  "quotation_delimiter": {
                    /* ... */
                  }
                }
              },
              "pexp_loc": {
                /* ... */
              }
            },
            "pvb_loc": {
              /* ... */
            }
          }
        ]
      },
      "pstr_loc": {
        /* ... */
      }
    }
  ]
}

As you can see, it's a little bit verbose. Don't be scared; we are going to learn how to read it, which is the most important thing.

Structure

// AST Tree
{
  "type": "structure",
  "structure": [
    /* ... */
  ]
}

In OCaml, a module serves as a container for grouping related definitions, such as types, values, functions, and even other modules, into a single cohesive unit. This modular approach helps organize your code, making it more manageable, reusable, and easier to understand.

A structure refers to the content within a module. It is composed of various declarations, known as structure items, which include:

  • Type definitions (e.g., type t = ...)
  • let bindings (e.g., let x = 1)
  • Function definitions
  • Exception declarations
  • Other nested modules

The structure represents the body of the module, where all these items are defined and implemented. Since each .ml file is implicitly a module, the entire content of a file can be viewed as the structure of that module.

💡 Tip
Every module in OCaml creates a new structure, and nested modules create nested structures.

Consider the following example:

(* Bar.ml *)
let name = "john doe"

module GameEnum = struct
  type t = Rock | Paper | Scissors

  let to_string = function
    | Rock -> "Rock"
    | Paper -> "Paper"
    | Scissors -> "Scissors"

  let from_string = function
    | "Rock" -> Rock
    | "Paper" -> Paper
    | "Scissors" -> Scissors
    | _ -> failwith "Invalid string"
end
// AST Tree
{
  "type": "structure",
  "structure": [
    {
      "type": "structure_item",
      "pstr_desc": {
        /* ... */
      },
      "pstr_loc": {
        /* ... */
      }
    },
    {
      "type": "structure_item",
      "pstr_desc": {
        "type": "Pstr_module",
        "module_binding": {
          "type": "module_binding",
          "pmb_name": {
            /* ... */
          },
          "pmb_expr": {
            "type": "module_expr",
            "pmod_desc": {
              "type": "Pmod_structure",
              "structure": [
                {
                  "type": "structure_item",
                  "pstr_desc": {
                    /* ... */
                  }
                  /* ... */
                }
              ]
              /* ... */
            }
            /* ... */
          }
          /* ... */
        }
        /* ... */
      }
    }
  ]
}

As you can see, Bar.ml and GameEnum are modules, and their content is a structure that contain a list of structure items.

📝 Note
A structure item can either represent a top-level expression, a type definition, a let definition, etc.

I'm not going to be able to cover all structure items, but you can find more about it in the OCaml documentation. I strongly advise you to take a look at the AST Explorer and play with it; it will help you a lot. Here is a sample.

Language Extensions and Attributes

As the AST represents the structure of the source code in a tree-like format, it also represents the Extension nodes and Attributes. It is mostly from the extension and attributes that the PPXs are built, so it's important to understand that they are part of the AST and have their own structure.

  • Extension nodes are generic placeholders in the syntax tree. They are rejected by the type-checker and are intended to be “expanded” by external tools such as -ppx rewriters. On AST, it is represented as string Ast_414.Asttypes.loc * payload.

    So, as extension nodes are placeholders for a code to be added, adding a new extension node with no extender declared should break the compilation. For example, in the code let name = [%name "John Doe"]. See a demo here

    There are 2 forms of extension nodes:

    • For “algebraic” categories: [%name "John Doe"]
    • For structures and signatures: [%%name "John Doe"]

In the code let name = [%name "John Doe"], [%name "John Doe"] is the extension node, where name is the extension name (string Ast_414.Asttypes.loc) and "John Doe" is the payload. For the entire item let name = "John Doe", you must use %%: [%%name "John Doe].

Don't worry much about creating a new extension node; we'll cover it in the Writing PPXs section.

  • Attributes are “decorations” of the syntax tree, which are mostly ignored by the type-checker but can be used by external tools. Decorators must be attached to a specific node in the syntax tree. (Check it breaking on this AST sample)

    As attributes are just “decorations”, you can add a new attribute without breaking the compilation. For example, in the code, let name = "John Doe" [@print]. See a demo here

    There are 3 forms of attributes:

    • Attached to on “algebraic” categories: [@name]
    • Attached to “blocks”: [@@name]
    • Stand-alone of signatures or structures modules: [@@@name]

In the code let name = "John Doe" [@print expr], [@print expr] is the attribute of the "John Doe" node, where print is the attribute name (string Ast_414.Asttypes.loc) and expr is the payload. To be an attribute of the entire item let name = "John Doe", you must use @@: [@@print]. If it is an stand-alone attribute of a module, you must use @@@: [@@@print].

Don't worry much about creating a new attributes node; we'll cover it in the Writing PPXs section.

I know that it can be a lot, but don't worry; we are going step by step, and you are going to understand it.

Samples

To help you undestand a little bit more about the AST, let's show it with some highlighted examples:

Code Playgrond AST
Code let name = "john doe" with let name = "john doe" highlighted Link AST representation of: let name = "john doe"
Code let name = "john doe" with name highlighted Link AST representation of: name
Code let name = [%name "John Doe"] with [%name "John Doe"] highlighted Link AST representation of: name
Code let name = [%name "John Doe"] with name highlighted Link AST representation of: name
Code let name = [%name "John Doe"] with "John Doe" highlighted Link AST representation of: name
Code let name = "John Doe" [@print expr] with "John Doe" highlighted Link AST representation of: name
Code let name = "John Doe" [@print expr] with print highlighted Link AST representation of: name
Code let name = "John Doe" [@print expr] with expr highlighted Link AST representation of: name
Code module GameEnum = struct (* ... *) end with "GameEnum" highlighted Link AST representation of: name
Code module GameEnum = struct (* ... *) end with struct highlighted Link AST representation of: name
GameEnum module GameEnum = struct (* ... *) end with type t = Rock | Paper | Scissors highlighted Link AST representation of: name