Parsing Expression Grammar (PEG) for when Regular Expressions just aren't enough!
Pegger is a Parsing Expression Grammar (PEG) implementation. It lets you create text parsers by building up a tree of simple matching rules.
Advanced parsing options let you look ahead with predicates, and throw errors to fail fast.
Pegger has been used (by Fantom Factory) to parse HTML, CSS, Markdown, and ANBF - to name but a few. The general strategy is usually:
- Create a structure of Fantom data classes
- Create a grammar to parse text documents into a node tree
- Trim the tree by removing labels and branches, so the tree resembles the structure of the Fantom classes
- Walk the tree, recursively creating corresponding data classes
Pegger was inspired by Mouse, Parboiled for Java, and pegs for NIM.
Install Pegger
with the Fantom Pod Manager ( FPM ):
C:\> fpm install afPegger
Or install Pegger
with fanr:
C:\> fanr install -r http://eggbox.fantomfactory.org/fanr/ afPegger
To use in a Fantom project, add a dependency to build.fan
:
depends = ["sys 1.0", ..., "afPegger 1.1"]
Full API & fandocs are available on the Eggbox - the Fantom Pod Repository.
using afPegger::Peg
class Example {
Void main() {
input := "<<<Hello Mum>>>"
rule := "'<'+ name:[a-zA-Z ]+ '>'+"
match := Peg(input, rule).match
name := match["name"] // --> "Hello Mum"
}
}
The quick start example saw a lot of crazy symbols... so woah, what just happened?
Pegger attempts to match a rule
against a given string. In the example the string was <<<Hello Mum>>>
and the rule was that mixed bag of crazy characters. Rules can be written in PEG notation (see mixed bag of crazy characters) or they can be created programmatically via Fantom code using the Rules mixin:
// PEG Notation:
// '<'+ ([a-zA-Z] / " ")+ '>'+
// Fantom code:
rule := sequence {
oneOrMore(char('<')),
oneOrMore(firstOf { alphaChar, spaceChar }),
oneOrMore(char('>')),
}
The Fantom code can be a lot simpler to read and understand, but is also a lot more verbose.
Once you run a match, the result is a tree. Use Match.dump()
to see it:
rule := sequence {
oneOrMore(char('<')),
oneOrMore(firstOf { alphaChar, spaceChar }),
oneOrMore(char('>')),
}
input := "<<<Hello Mum>>>"
match := Peg(input, rule).match
match.dump
// --> root : "<<<Hello Mum>>>"
Okay, so that single line starting with root
is not much of a tree. To create a tree, we need to give parts of our rule a label:
rule := sequence {
oneOrMore(char('<')) .withLabel("start"),
oneOrMore(firstOf { alphaChar, spaceChar }).withLabel("name"),
oneOrMore(char('>')) .withLabel("end"),
}
We can do the same in PEG notation by using a label:
prefix:
// PEG Notation:
// start:'<'+ name:[a-zA-Z ]+ end:'>'+
match.dump()
now gives:
root
├─ start : "<<<"
├─ name : "Hello Mum"
└─ end : ">>>"
Each part of the match may be retrieved using the Match.get()
operator:
match["start"].toStr // -> "<<<"
match["name"].toStr // -> "Hello Mum"
match["end"].toStr // -> ">>>"
The same could also be written as PEG grammar. Grammar defines multiple PEG rules. Grammars may be coded programmatically but are usually created from a string. There should always be the one top-level or root rule which is responsible for matching everything:
root = start name end
start = '<'+
name = [a-zA-Z ]+
end = '>'+
Definitions may span multiple lines but proceeding lines must contain leading whitespace to distinguish it from a new rule definition.
root = start
name
end
start = '<'+
name = [a-zA-Z ]+
end = '>'+
When run, the same result is given:
grammarStr := "..."
grammar := Peg.parseGrammar(grammarStr)
input := "<<<Hello Mum>>>"
match := grammar["root"].match(input).dump
// root
// ├─ start : "<<<"
// ├─ name : "Hello Mum"
// └─ end : ">>>"
Once a grammar (or rule) has been parsed and validated, it may be cached for future re-use.
Rules may be omitted from the result tree by prefixing the definitions with a -
:
root = start name end
-start = '<'+
name = [a-zA-Z ]+
-end = '>'+
// root
// └─ name : "Hello Mum"
Writing grammar files can be a lot easier to understand than the verbose programatic method. Here's your guide.
Pegger is primarily concerned with parsing displayable 7-bit ASCII characters, but where mentioned also provides support for 16-bit Unicode characters. Non-visible / non-printable characters are beyond the remit of Pegger; largely because Pegger uses strings as input! See Design notes for details.
A rule is defined with a name followed by =
. The more formal definition of <-
may be used in place of =
.
ruleDef1 = rule
ruleDef2 <- rule
Advanced Note: Prefixing a rule with -
will remove it from the result tree.
-ruleDef1 = rule
-ruleDef2 <- rule
Advanced Note: Suffixing a rule with -
will remove it from debug output.
ruleDef1- = rule
ruleDef2- <- rule
Rules may have both a -
prefix and suffix: -ruleDef- = rule
Ordered sequences of rules are expressed by separating them with a space.
ruleDef = rule1 rule2 rule3
When given a choices, Pegger will match the first rule that passes. Beware: order of choices can be important.
ruleDef = rule1 / rule2 / rule3
Choice rules have a higher operator precedence than Sequence rules, but it is better to group rules with brackets to avoid ambiguity.
ruleDef = (rule1 rule2) / (rule3 rule4)
Rules may be specified to be matched by different amounts.
rule1? // optional
rule1+ // one or more
rule1* // zero or more
rule1{4} // exactly 4
rule1{2,6} // between 2 and 6 (inclusive)
rule1{,5} // at most 5 - same as {0,5}
rule1{3,} // at least 3
Literal strings may be matched with either single or double quotes.
ruleDef1 = "literal 1"
ruleDef2 = 'literal 2'
Use backslash to escape the usual \b \f \n \r \t
characters and to escape quotes. (Note that Fantom normalises all new line characters to just \n
.)
Use an i
suffix to indicate the match should be case-insensitive.
ruleDef3 = "new\nline\n"i
Individual characters, and ranges thereof, may be matched with a regex-esque character class.
ruleDef1 = [a] // matches a
ruleDef2 = [abc] // matches 'a', 'b', or 'c'
ruleDef3 = [a-z] // matches any character in the range from 'a' to 'z' inclusive
ruleDef4 = [a-z]i // as above, but case-insensitive
ruleDef5 = [0-9A-F]i // matches any hex digit
ruleDef6 = [\n\] \t] // backslash escapes
The hat ^
character will match any character BUT the chosen ones.
ruleDef6 = [^0-9] // match any char BUT not digits
Use .
to match any character.
ruleDef = . . . // matches any 3 characters
Use the ampersand &
to look ahead for a match, but NOT consume any characters.
ruleDef = "something " &"good" .
Use exclamation !
to look ahead for a negative match, but again, NOT consume any characters.
ruleDef = "something " !"bad" .
As per the example above, predicates should always be used in conjunction with a character consuming rule.
Use \uXXXX
(hexadecimal) notation to match a 16-bit unicode character. May be used in string literals and character classes.
crlf = [\u000D] "\u000A"
Unicode escapes MUST contain exactly 4 hex digits; 4 to ease parsing and to match Java string characters which are 16 bits.
Pegger introduces macros for useful extensions. These may be used individually as rules.
table:
name function
---------- ---------------------------------
\sos Matches Start-Of-Stream (or Start-Of-String!?) - does not consume chars
\eos Matches End-Of-Stream (or End-Of-String!?) - does not consume chars
\eol Matches Start-Of-Line (also matches \sos) - does not consume chars
\sol Matches End-Of-Line (also matches \eos) - does not consume chars
\lower Matches a lowercase character in the current locale
\upper Matches an uppercase character in the current locale
\alpha Matches a character in the current locale
\pass Always passes the Match
\fail Always fails the Match
\err(xxx) Throws a PegParseErr with msg 'xxx' when processed
Hash #
comments are allowed in grammar, as are double slash //
comments.
# Hash comments are allowed in grammar
// as are double slash comments
a = . // end-of-line comments also allowed
b = [cd]
// comments may also appear inbetween rules
[de]
Interestingly, PEG grammar may itself be expressed as PEG grammar. And indeed, Pegger does use itself to parse your PEG definitions!
PEG grammar:
grammar <- (!\eos (commentLine / ruleDef / \err(FAIL)))+
ruleDef <- exclude:"-"? ruleName debugOff:"-"? cwsp* ("=" / "<-") cwsp* rule commentLine?
ruleName <- [a-zA-Z] (("-" [a-zA-Z0-9_]) / [a-zA-Z0-9_])*
rule <- firstOf / \err(FAIL)
firstOf <- sequence (cwsp* "/" cwsp* sequence)*
sequence <- expression (cwsp* expression)*
expression <- predicate? (label ":")? type multiplicity?
label <- [a-zA-Z] [a-zA-Z0-9_\-]*
type <- group / ruleName / literal / chars / macro / dot
-group <- "(" cwsp* rule cwsp* ")"
predicate <- "!" / "&"
multiplicity <- "*" / "+" / "?" / ("{" min:[0-9]* (com:"," max:[0-9]*)? "}")
literal <- singleQuote / doubleQuote
-singleQuote <- "'" (unicode / ("\\" .) / [^'])+ "'" "i"?
-doubleQuote <- "\"" (unicode / ("\\" .) / [^"])+ "\"" "i"?
chars <- "[" (unicode / ("\\" .) / [^\]])+ "]" "i"?
macro <- "\\" [a-zA-Z]+ ("(" [^)\n]* ")")?
unicode <- "\\" "u" [0-9A-F]i [0-9A-F]i [0-9A-F]i [0-9A-F]i
dot <- "."
-commentLine- <- sp* (nl / comment)
-comment- <- ("#" / "//") (!\eos [^\n])* nl
-cwsp- <- sp / (!\eos cnl (sp / &("#" / "//")))
-cnl- <- nl / comment
-sp- <- [ \t]
-nl- <- \eos / "\n"
A well known limitation of regular expressions is that they can not match nested patterns, such as HTML. (See StackOverflow for explanation.) Pegger to the rescue!
Because PEGs contain rules that may reference themselves in a circular fashion, it is possible to create recursive parsers.
Below is an example that parses nested HTML tags. You can see the recursion from the element
definition which references itself:
grammar := Peg.parseGrammar("element = startTag (element / text)* endTag
startTag = '<' name:[a-z]i+ '>'
endTag = '</' name:[a-z]i+ '>'
text = [^<]+")
html := "<html><head><title>Pegger Example</title></head><body><p>Parsing is Easy!</p></body></html>"
grammar["element"].match(html).dump
Peg(html, element).match.dump
Which outputs the following result tree:
element
├─ startTag
│ └─ name : "html"
├─ element
│ ├─ startTag
│ │ └─ name : "head"
│ ├─ element
│ │ ├─ startTag
│ │ │ └─ name : "title"
│ │ ├─ text : "Pegger Example"
│ │ └─ endTag
│ │ └─ name : "title"
│ └─ endTag
│ └─ name : "head"
├─ element
│ ├─ startTag
│ │ └─ name : "body"
│ ├─ element
│ │ ├─ startTag
│ │ │ └─ name : "p"
│ │ ├─ text : "Parsing is Easy!"
│ │ └─ endTag
│ │ └─ name : "p"
│ └─ endTag
│ └─ name : "body"
└─ endTag
└─ name : "html"
Parsing has never been easier!
Note that only Chuck Norris can parse HTML with regular expressions.
Converting the match results into XML is left as an exercise for the user, but there are a couple of options open to you:
This is usually the easiest way to convert your match results, but not always the cleanest.
It involves looping over the match results the same as you would with any other list, and recursively calling yourself to convert inner data.
Create a walk()
method that recursively calls a given function every time it steps into, or out of, a match:
Void walk(Match match, |Match, Str startOrEnd| fn) {
fn?.call(match, "start")
match.matches.each { walk(it, fn) }
fn?.call(match, "end")
}
Custom rules may be created by subclassing the Rule class.
There (currently) is no support for introducing custom rules in PEG grammar, so use of custom rules limits their use to inclusion in programmatic Fantom based grammars.
By enabling debug logging, Pegger
will spew out a lot of debug / trace information. (Possibly more than you can handle!) But note it will only emit debug information for rules with names or labels.
Enable debug logging with the line:
Peg.debugOn
// or
Log.get("afPegger").level = LogLevel.debug
Which, for the above html parsing example, will generate the following:
[afPegger] [ 1] --> element - Processing: startTag (element / text)* endTag with: <html><title>Pegger Ex...
[afPegger] [ 2] --> startTag - Processing: "<" [a-zA-Z]+ ">" with: <html><title>Pegger Ex...
[afPegger] [ 2] > startTag - Passed!
[afPegger] [ 2] > startTag - Matched: "<html>"
[afPegger] [ 4] --> element - Processing: startTag (element / text)* endTag with: <title>Pegger Example<...
[afPegger] [ 5] --> startTag - Processing: "<" [a-zA-Z]+ ">" with: <title>Pegger Example<...
[afPegger] [ 5] > startTag - Passed!
[afPegger] [ 5] > startTag - Matched: "<title>"
[afPegger] [ 7] --> element - Processing: startTag (element / text)* endTag with: Pegger Example</title>...
[afPegger] [ 8] --> startTag - Processing: "<" [a-zA-Z]+ ">" with: Pegger Example</title>...
[afPegger] [ 8] > startTag - Did not match "<".
[afPegger] [ 8] > startTag - Failed. Rolling back.
[afPegger] [ 7] > element - Did not match startTag.
[afPegger] [ 7] > element - Failed. Rolling back.
[afPegger] [ 7] --> text - Processing: (!"<" .)+ with: Pegger Example</title>...
[afPegger] [ 7] > text - Rule was successfully processed 14 times
[afPegger] [ 7] > text - Passed!
[afPegger] [ 7] > text - Matched: "Pegger Example"
...
...
...
Pegger was purposefully designed to match strings or more specifically, Unicode character data (of variable byte length dependant on encoding) - not binary data. Hence it lacks notation to match bytes and byte ranges as seen in some RFC documents.
If required, hexadecimal Unicode escape sequences ( \uXXXX
) may be used to represent 8-bit binary data.