# # Example: Faster parsing with lexers

# One disadvantage of pika-style parsers is the large number of redundant
# intermediate matches produced by the right-to-left parsing process. These
# generally pollute the match table and cause inefficiency.
#
# PikaParser supports greedily pre-lexing the parser input using the terminals
# in the grammar, which allows you to produce much more precise terminal
# matches, thus a more compact match table and, as a result, a much **faster**
# and more robust parser.
#
# In this example, we simply rewrite the Scheme grammar from [the Scheme
# tutorial](scheme.md) to use [`PikaParser.scan`](@ref) (which allows you to
# match many interesting kinds of tokens quickly) and then
# [`PikaParser.parse_lex`](@ref) (which runs the greedy lexing and uses the
# result for more efficient parsing).
#
# As the main change, we removed the "simple" matches of `:digit` and `:letter`
# from the grammar, and replaced them with manual matchers of whole tokens.
#
# ## Writing scanners
#
# First, let's make a very useful helper function that lets us convert any
# `Char`-matching function into a scanner. This neatens the grammar code later.
#
# When constructing the scanner functions, remember that it is important to use
# the overloaded indexing functions (`nextind`, `prevind`, `firstindex`,
# `lastindex`) instead of manually computing the integer indexes. Consider what
# happens with Unicode strings if you try to get an index like `"kůň"[3]`!
# Compute indexes manually only if you are *perfectly* certain that the input
# indexing is flat.

takewhile1(f) = (input) -> begin
    isempty(input) && return 0
    for i in eachindex(input)
        if !f(input[i])
            return prevind(input, i)
        end
    end
    return lastindex(input)
end;
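
# As a quick check (not part of the grammar, just an illustration), the
# produced scanner returns the string index of the last matching character, or
# 0 when the very first character fails to match -- which is exactly why we
# used `prevind` and `lastindex` above instead of counting characters:

(
    takewhile1(isdigit)("123 456"),   # index of the last leading digit
    takewhile1(isdigit)("abc"),       # no match at all
    takewhile1(isletter)("kůň 1"),    # a string index, not a character count!
)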

# The situation for matching `:ident` is a little more complicated -- we need a
# different match for the first letter, and there are a few extra characters to
# think about. So we just make a specialized function for that:

function take_ident(input)
    isempty(input) && return 0
    i = firstindex(input)
    isletter(input[i]) || return 0
    i = nextind(input, i)
    while i <= lastindex(input)
        c = input[i]
        if !(isletter(c) || isdigit(c) || c == '-')
            return prevind(input, i)
        end
        i = nextind(input, i)
    end
    return lastindex(input)
end;
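
# Again, a few quick sanity checks -- an identifier must start with a letter
# and may continue with letters, digits and dashes; anything else (including an
# invalid first character) ends the match:

(
    take_ident("make-data) ..."),     # stops at the closing parenthesis
    take_ident("id3nt1f13r5 αβγδ"),   # stops at the space
    take_ident("1d3n7"),              # does not start with a letter
)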

# ## Using scanners in a grammar
#
# The grammar becomes slightly simpler than in the original version:

import PikaParser as P

rules = Dict(
    :ws => P.first(:spaces => P.scan(takewhile1(isspace)), P.epsilon),
    :popen => P.seq(P.token('('), :ws),
    :pclose => P.seq(P.token(')'), :ws),
    :sexpr => P.seq(:popen, :insexpr => P.many(:scheme), :pclose),
    :scheme => P.seq(
        :basic => P.first(
            :number => P.seq(P.scan(takewhile1(isdigit)), P.not_followed_by(:ident)),
            :ident => P.scan(take_ident),
            :sexpr,
        ),
        :ws,
    ),
    :top => P.seq(:ws, :sexpr), # support leading blanks
);

# ## Using the scanners for lexing the input
#
# Let's try the lexing on the same input as in the Scheme example:

input = """
(plus 1 2 3)
(minus 1 2(plus 3 2) ) woohoo extra parenthesis here )
(complex
  id3nt1f13r5 αβγδ भरत kůň)
(invalid 1d3n7)
(something
  1
  2
  valid)
(straight (out (missing(parenthesis error))
(apply (make-function) (make-data))
""";
grammar = P.make_grammar([:top], P.flatten(rules, Char));

P.lex(grammar, input)

# The result is a vector of possible terminals that can be matched at given
# input positions. As a minor victory, you may see that no terminals are
# matched inside the initial `plus` token.
#
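# Since the result is indexed by input position, we can also peek at the
# terminals that were discovered at the very first position (the opening
# parenthesis of the first expression):

P.lex(grammar, input)[1]
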
# Now, the lexed input could be used via the argument `fast_match` of
# [`PikaParser.parse`](@ref), but usually it is much simpler to have the
# combined function [`PikaParser.parse_lex`](@ref) do everything:

p = P.parse_lex(grammar, input);

# The rest is now essentially the same as with the [previous Scheme example](scheme.md):

fold_scheme(m, p, s) =
    m.rule == :number ? parse(Int, m.view) :
    m.rule == :ident ? Symbol(m.view) :
    m.rule == :insexpr ? Expr(:call, :S, s...) :
    m.rule == :sexpr ? s[2] :
    m.rule == :top ? s[2] :
    length(s) > 0 ? s[1] : nothing;
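
# Before processing the whole input, we can try the folding on a single match
# -- assuming the first expression parses at position 1, this should already
# yield a folded value:

first_match = P.find_match_at!(p, :top, 1)
first_match == 0 ? nothing : P.traverse_match(p, first_match, fold = fold_scheme)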

next_pos = 1
while next_pos <= lastindex(p.input)
    global next_pos
    pos = next_pos
    mid = 0
    while pos <= lastindex(p.input) # try to find a match
        mid = P.find_match_at!(p, :top, pos)
        mid != 0 && break
        pos = nextind(p.input, pos)
    end
    pos > next_pos && # if we skipped something, report it
        println("Problems with: $(p.input[next_pos:prevind(p.input, pos)])")
    if mid == 0
        break # if we skipped all the way to the end, quit
    else # we have an actual match, print it.
        value = P.traverse_match(p, mid, fold = fold_scheme)
        println("Parsed OK: $value")
        m = p.matches[mid] # skip the whole match and continue
        next_pos = nextind(p.input, m.last)
    end
end
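
# Finally, to get a rough idea of what the pre-lexing saves, we can compare the
# size of the match table with the one produced by the plain, non-lexing
# [`PikaParser.parse`](@ref) -- the pre-lexed table should be noticeably
# smaller:

(length(P.parse(grammar, input).matches), length(p.matches))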