Skip to content

Demo code with examples for educational purpose

License

Notifications You must be signed in to change notification settings

capybara1/CompilerDemo

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Compiler Demo

Terminology

Derivation

A derivation of a string for a grammar is a sequence of grammar rule applications that transform the start symbol into the string. A derivation proves that the string belongs to the grammar's language. (Source)

A strategy is followed that deterministically determines the next nonterminal to rewrite:

  • in a leftmost derivation, it is always the leftmost nonterminal;
  • in a rightmost derivation, it is always the rightmost nonterminal. (Source)

Concrete Syntax Tree (CST)

  • Also known as parse tree
  • Output of a parser
  • Complete representation of the input in the leafs
  • Names of Rules in the interior nodes
    • How it is matched

Abstract Syntax Tree (AST)

  • Also known as syntax tree
  • Output of a parser
  • Free of inessential nodes
    • e.g. grouping parentheses are implicit in the tree structure
  • Incomplete representation of the input in the leafs
  • Names of Rules in the interior nodes
    • How it is matched

Architecture

Front end

  • Preprocessing
  • Lexical analysis (lexing)
  • Syntax analysis (parsing)
    • leads to an CST or AST
  • Semantic analysis
  • Generation of intermediate representation

Middle end

  • Analysis
  • Target independent optimizations

Back end

  • Target specific optimizations
  • Code generation or interpretation
  • Assembly of object files (in case the generated code belongs to an assembly language)
  • Linking

Strategies

Using an Existing Toolchain

Often an existing toolchain for a language such as C/C++ is used as middle and back end. In this case the front end would transpile the custom language into C/C++.

This approach might provide a simple solution in case the custom language is close to C/C++. Otherwise the mapping might turn out to be complex (e.g. if memory management by garbage collection is required), which is a downside of the approach.

This problem is a motivation for the LLVM project.

Parsers

Types

Top-down parser

first looks at the highest level of the parse tree and works down the parse tree by using the rewriting rules of a formal grammar Top-down parsing can be viewed as an attempt to find left-most derivations (Source)

Bottom-up parser

recognizes the text's lowest-level small details first, before its mid-level structures, and leaving the highest-level overall structure to last. (Source)

Shift-reduce parser

  • Kind of bottom-up parser

Recursive descent parser

  • Kind of top-down parser
  • Uses mutually-recursive functions rather than tables
  • Lower complexity (in many cases, no parser generator is required)
  • Faster

Recursive ascent parser

  • Kind of bottom-up parser
  • Uses mutually-recursive functions rather than tables
  • Lower complexity (in many cases, no parser generator is required)
  • Faster

Table based parser

  • Lower memory footprint
  • Higher complexity (usually parser generator is required)

Packrat parser

Supported Grammar

Motivation

  • Placing restrictions on a context-free language grammar keeps runtime complexity at bay

LL(k)

  • Top-Down
  • Processes input left to right
  • Leftmost derivation
  • k is a lookahead to amke a decision

LL

  • Synonym for LL(1)

LL(*)

  • LL(k) which is not restricted to a finite number k

LF(k)

  • Also known as strong LL-parser
  • Top-Down

LL(k)

  • Top-Down
  • Processes input left to right
  • Leftmost derivation
  • k is a lookahead

LR(k)

  • Bottom-Up-Parser
  • Processes input left to right
  • Rightmost derivation
  • k is a lookahead to amke a decision

LR

  • Synonym for LR(1)

LALR(1)

  • Look-Ahead LR parser

Parser Generation

  • Also known as compiler compiler

  • Usually supports generation of a lexer and parser based on a given grammar

  • Among the most popular generators are Flex+Bison (LALR) and ANTLR (LL(*))

  • Comparison of parser generators

Lex/Flex

  • Generates a lexer
  • Supported target language is C++

License

Tool License
Lex various
Flex BSD

Yacc/Bison

  • Generates a parser
  • Supported target language is C++
  • Table driven (smaller memory footprint)

License

Tool License
Yacc various
Bison GPL

ANTLR

  • Generates a lexer, parser, syntax analyzer, visitor (tree walker) and/or listener based on a given LL(*) grammer (.g4 format)
  • Recursive descent parser
  • Requires Java Runtime
  • Supported target languages: Java, C#, Python, JavaScript, Go, C++, Swift
  • An IDE tool, called ANTLRworks, is available
  • May be used along with StringTemplates as shown in this article

References

.g4 format

  • Supports labels for rules (rule # label)
  • Supports labels for rule elements (label=token)

Resources

Visitors and Listeners

Feature Visitors Listeners
Explicit call required to include children Yes No
Use a return value Yes No
Support for large parse trees No Yes

References

Installation

Follow instuctions on antlr.org

Demos

Show Syntax-Tree
antlr4 -o src Demo.g4
javac -cp lib/antlr.jar src/*.java
grun src/Demo expr
1+2*3

Emmit EOF (Linux: Ctrl+D, win: Ctrl+Z)

Target C++

antlr4 -o src -Dlanguage=Cpp -package theNamespace Demo.g4

Resources

Optimization and Code Generation

LLVM

  • Initially started by Chris Latner
  • Motivated by the way JIT compilers worked at the time, which performed optimization during run-time. This would therefore be repeatet each time the application was executed unless the result is cached. Thus the working hypothesis was to move as much of the optimization to compile-time.
  • Since typically programs are optimized by manipulating an intermediate representation (IR), the term Low Level Virtual Machine (LLVM) was coined.
  • Since then LLVM developed to an umbrella project, featuring A core project as well as other subprojects concerned with the development of the clang front-end and various tools

Resources

Features

  • Optimizer
  • JIT
  • GC
Intermediate Language

The IL can be provided in 2 representations:

  • human readable ascii representation
  • binary bitcode representation

Bidirectional transformation between representations is possible using the following commands:

  • llvm-as (transforms IL in ascii representation to bitcode representation)
  • llvm-dis (transforms IL in bitcode representation to ascii representation)

Language aspects:

In order to compile for a specific target additional information must be provided:

  • a datalayout which specifies
    • Byte order (endianness)
    • Type specific allocation sizes
    • Type specific memory alignments
    • Style of name mangling
  • a target triple which specifies
    • Architecture
    • Vendor
    • Operating system
    • optional: Environment
  • The source filename

A module may also specify flags for upcoming stages in the toolchain such as e.g. linker options and magic lobal variables.

Usage

  • A front-end that outputs the IL is required
    • This may be achieved by utilization of the core libraries (C++) whilch provide classes such as e.g. an IRBuilder

References

Commands (excerpt):

  • llvm-link (linker for IL in bitcode representation)
  • lli (executes IL in bitcode representation using JIT compilation)
  • lldb (debugger)
  • opt (optimizer)
  • llc (compiles IL in any representation to assembly code)

See also LLVM Command Guide for a full listing

Installation

Build from source: Requirements: Make, CMake, Compiler Repository: https://github.com/llvm/llvm-project

Pre-build: http://releases.llvm.org/download.html

Anatomy of the Optimizer

Based on multiple passes of different types:

  • Module Pass
  • Call Graph Pass
  • Function Pass
  • Basic Block Pass
  • Immutable Pass, Region Pass, MachineFunction Pass

Within each of the mentioned passes:

  • Analysis Pass
  • Transform Pass

Demos

Get IL from C++ source using clang++
clang++ -S -emit-llvm source.cpp

Observations

Running a build-in pass
opt -dot-cfg-only input.ll > /dev/null
xdot cfg.main.dot
Running custom pass
#include "llvm/Pass.h"
#include "llvm/IR/Function.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

namespace {
    struct DemoPass : public FunctionPass {
        static char ID;
        DemoPass() : FunctionPass(ID) {}
        bool runOnFunction(Function &F) override {
            errs().write_escaped(F.getName()) << '\n';
            return false;
        }
    }
}
char DemoPass::ID = 0;
static RegisterPass<Hello> X("name_of_the_pass",
                             "Description of the pass",
                             false,
                             false);
opt -load path/to/DemoPass.so -name_of_the_pass < input.bc

Resources

Demo Compiler

Build demo

  1. Install ANTLR into demo directory
  2. Install LLVM from source
  3. Build demo using the following commands
cd demo
cmake .
cmake --build .

About

Demo code with examples for educational purpose

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published