
Commit f90a08d

Merge pull request #17 from LCSB-BioCore/mk-doc-lex
document the scanning&lexing use
2 parents 1f28460 + dcbc9c5 commit f90a08d

8 files changed: +184, -11 lines changed

Project.toml
Lines changed: 1 addition & 1 deletion

@@ -1,7 +1,7 @@
 name = "PikaParser"
 uuid = "3bbf5609-3e7b-44cd-8549-7c69f321e792"
 authors = ["The developers of PikaParser.jl"]
-version = "0.5.1"
+version = "0.5.2"
 
 [deps]
 DocStringExtensions = "ffbed154-4ef7-542d-bbb7-c09d3a79fcae"

docs/make.jl
Lines changed: 2 additions & 1 deletion

@@ -1,6 +1,7 @@
 using Documenter, Literate, PikaParser
 
-examples = filter(x -> endswith(x, ".jl"), readdir(joinpath(@__DIR__, "src"), join = true))
+examples =
+    sort(filter(x -> endswith(x, ".jl"), readdir(joinpath(@__DIR__, "src"), join = true)))
 
 for example in examples
     Literate.markdown(

docs/src/json.jl
Lines changed: 7 additions & 1 deletion

@@ -14,6 +14,8 @@
 # to remove unnecessary spaces)
 # - support for numbers is very ad-hoc, `Float64`-only
 # - the escape sequences allowed in strings are rather incomplete
+#
+# ## Preparing the grammar
 
 import PikaParser as P
 
@@ -43,6 +45,8 @@ rules = Dict(
     :json => P.first(:obj, :array, :string, :number, :t, :f, :null),
 );
 
+# ## Making the "fold" function
+#
 # To manage the folding easily, we keep the fold functions in a data structure
 # with the same order as `rules`:
 folds = Dict(
@@ -69,12 +73,14 @@ default_fold(v, subvals) = isempty(subvals) ? nothing : subvals[1]
 
 g = P.make_grammar([:json], P.flatten(rules, Char));
 
+# ## Parsing JSON
+#
 # Let's parse a simple JSONish string that demonstrates most of the rules:
 input = """{"something":123,"other":false,"refs":[1,-2.345,[],{},true,false,null,[1,2,3,"haha"],{"is\\"Finished\\"":true}]}""";
 
 p = P.parse(g, input);
 
-# Let's build a Julia JSON-like structure:
+# From the result we can build a Julia JSON-like structure:
 result = P.traverse_match(
     p,
     P.find_match_at!(p, :json, 1),
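
For inputs that might not be valid JSON, it helps that `find_match_at!` returns zero when no match of the given rule starts at the position (the Scheme examples below rely on this). A minimal sketch of a guard, with `mid` as a purely illustrative name:

    mid = P.find_match_at!(p, :json, 1)  # a match id, or 0 if :json does not match at position 1
    mid == 0 && error("the input did not parse as :json")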

docs/src/scheme.jl
Lines changed: 10 additions & 2 deletions

@@ -12,6 +12,8 @@
 # We choose not to implement any of the Scheme data types except numbers and
 # identifiers; also all top-level expressions must be parenthesized "command"
 # S-expressions.
+#
+# ## Implementing the grammar
 
 import PikaParser as P
 
@@ -39,6 +41,8 @@ rules = Dict(
 # spaces. This way prevents unnecessary checking (and redundant matching) of
 # the tokens, and buildup of uninteresting entries in the memo table.
 
+# ## Parsing input
+#
 # Let's test the grammar on a piece of source code that contains lots of
 # whitespace and some errors.
 
@@ -67,6 +71,8 @@ fold_scheme(m, p, s) =
     m.rule == :insexpr ? Expr(:call, :S, s...) :
     m.rule == :sexpr ? s[2] : m.rule == :top ? s[2] : length(s) > 0 ? s[1] : nothing;
 
+# ## Recovering from errors and showing partial parses
+#
 # We can run through all `top` matches, tracking the position where we would
 # expect the next match:
 
@@ -81,12 +87,14 @@ while next_pos <= lastindex(p.input)
         pos = nextind(p.input, pos)
     end
     pos > next_pos && # if we skipped something, report it
-        @error "Got parsing problems" p.input[next_pos:prevind(p.input, pos)]
+        println(
+            "Got problems understanding this: $(p.input[next_pos:prevind(p.input, pos)])",
+        )
     if mid == 0
         break # if we skipped all the way to the end, quit
     else # we have an actual match, print it.
         value = P.traverse_match(p, mid, fold = fold_scheme)
-        @info "Got a toplevel value" value
+        println("Got a good value: $value")
         m = p.matches[mid] # skip the whole match and continue
         next_pos = nextind(p.input, m.last)
     end

docs/src/scheme_lex.jl
Lines changed: 144 additions & 0 deletions (new file)

# # Example: Faster parsing with lexers

# One disadvantage of pika-style parsers is the large amount of redundant
# intermediate matches that are produced in the right-to-left parsing process.
# These generally pollute the match table and cause inefficiency.
#
# PikaParser supports greedily pre-lexing the parser input using the terminals
# in the grammar, which lets you produce much more precise terminal matches,
# thus a more compact match table and, as a result, a much **faster** and more
# robust parser.
#
# In this example, we simply rewrite the Scheme grammar from [the Scheme
# tutorial](scheme.md) to use [`PikaParser.scan`](@ref) (which allows you to
# match many interesting kinds of tokens quickly) and then
# [`PikaParser.parse_lex`](@ref) (which runs the greedy lexing and uses the
# result for more efficient parsing).
#
# As the main change, we removed the "simple" matches of `:digit` and `:letter`
# from the grammar, and replaced them with manual matchers of whole tokens.
#
# ## Writing scanners
#
# First, let's make a very useful helper function that lets us convert any
# `Char`-matching function into a scanner. This neatens the grammar code later.
#
# When constructing the scanner functions, remember that it is important to use
# the overloaded indexing functions (`nextind`, `prevind`, `firstindex`,
# `lastindex`) instead of manually computing the integer indexes. Consider what
# happens with Unicode strings if you try to get an index like `"kůň"[3]`!
# Compute indexes manually only if you are *perfectly* certain that the input
# indexing is flat.

takewhile1(f) = (input) -> begin
    isempty(input) && return 0
    for i in eachindex(input)
        if !f(input[i])
            return prevind(input, i)
        end
    end
    return lastindex(input)
end;
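
For instance, applying the helper to `isdigit` gives a scanner that matches a maximal run of digits at the start of its input; the return values below follow directly from the definition above:

    takewhile1(isdigit)("123 456")   # == 3, the index of the last digit of "123"
    takewhile1(isdigit)("abc")       # == 0, no digit at the very start
    takewhile1(isletter)("kůň 1")    # == 4: String indexes are byte-based, hence not contiguous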
# The situation for matching `:ident` is a little more complicated -- we need a
# different match on the first letter and there are extra characters to think
# about. So we just make a specialized function for that:

function take_ident(input)
    isempty(input) && return 0
    i = firstindex(input)
    isletter(input[i]) || return 0
    i = nextind(input, i)
    while i <= lastindex(input)
        c = input[i]
        if !(isletter(c) || isdigit(c) || c == '-')
            return prevind(input, i)
        end
        i = nextind(input, i)
    end
    return lastindex(input)
end;
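
Again, a few values computed from the definition: identifiers must start with a letter and may continue with letters, digits, and dashes:

    take_ident("id3nt-1 rest")  # == 7, matching all of "id3nt-1"
    take_ident("1d3n7")         # == 0, a digit cannot start an identifier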
# ## Using scanners in a grammar
#
# The grammar becomes slightly simpler than in the original version:

import PikaParser as P

rules = Dict(
    :ws => P.first(:spaces => P.scan(takewhile1(isspace)), P.epsilon),
    :popen => P.seq(P.token('('), :ws),
    :pclose => P.seq(P.token(')'), :ws),
    :sexpr => P.seq(:popen, :insexpr => P.many(:scheme), :pclose),
    :scheme => P.seq(
        :basic => P.first(
            :number => P.seq(P.scan(takewhile1(isdigit)), P.not_followed_by(:ident)),
            :ident => P.scan(take_ident),
            :sexpr,
        ),
        :ws,
    ),
    :top => P.seq(:ws, :sexpr), # support leading blanks
);
# ## Using the scanners for lexing the input
#
# Let's try the lexing on the same input as in the Scheme example:

input = """
(plus 1 2 3)
(minus 1 2(plus 3 2) ) woohoo extra parenthesis here )
(complex
id3nt1f13r5 αβγδ भरत kůň)
(invalid 1d3n7)
(something
1
2
valid)
(straight (out (missing(parenthesis error))
(apply (make-function) (make-data))
""";

grammar = P.make_grammar([:top], P.flatten(rules, Char));

P.lex(grammar, input)

# The result is a vector of possible terminals that can be matched at the given
# input positions. As a minor victory, you may see that no terminals are
# matched inside the initial `plus` token.
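
To eyeball that, one might print just the positions where the lexer pre-matched something; a minimal sketch relying only on the documented return type of `lex` (the name `lexemes` is illustrative):

    lexemes = P.lex(grammar, input)  # Vector of (rule, last-index) pairs per position
    for (pos, terminals) in enumerate(lexemes)
        isempty(terminals) || println(pos, " => ", terminals)
    end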
#
# Now, the lexed input could be used via the argument `fast_match` of
# [`PikaParser.parse`](@ref), but usually it is much simpler to have the
# combined function [`PikaParser.parse_lex`](@ref) do everything:

p = P.parse_lex(grammar, input);

# The rest is now essentially the same as in the [previous Scheme example](scheme.md):

fold_scheme(m, p, s) =
    m.rule == :number ? parse(Int, m.view) :
    m.rule == :ident ? Symbol(m.view) :
    m.rule == :insexpr ? Expr(:call, :S, s...) :
    m.rule == :sexpr ? s[2] : m.rule == :top ? s[2] : length(s) > 0 ? s[1] : nothing;

next_pos = 1
while next_pos <= lastindex(p.input)
    global next_pos
    pos = next_pos
    mid = 0
    while pos <= lastindex(p.input) # try to find a match
        mid = P.find_match_at!(p, :top, pos)
        mid != 0 && break
        pos = nextind(p.input, pos)
    end
    pos > next_pos && # if we skipped something, report it
        println("Problems with: $(p.input[next_pos:prevind(p.input, pos)])")
    if mid == 0
        break # if we skipped all the way to the end, quit
    else # we have an actual match, print it.
        value = P.traverse_match(p, mid, fold = fold_scheme)
        println("Parsed OK: $value")
        m = p.matches[mid] # skip the whole match and continue
        next_pos = nextind(p.input, m.last)
    end
end

src/frontend.jl
Lines changed: 2 additions & 2 deletions

@@ -17,8 +17,8 @@ Build a [`Scan`](@ref) clause. Translate to strongly typed grammar with [`flatten`](@ref).
 
 # Example
 
-    # rule to match a pair of equal tokens
-    scan(m -> m[1] == m[2] ? 2 : -1)
+    # a rule to match any pair of equal tokens
+    scan(m -> (length(m) >= 2 && m[1] == m[2]) ? 2 : 0)
 """
 scan(f::Function) = Scan{Any,Any}(f)
 
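
Such a scanner slots into a grammar like any other terminal; a minimal sketch assuming `Char` input, with the rule name `:pair` chosen purely for illustration:

    import PikaParser as P
    rules = Dict(:pair => P.scan(m -> (length(m) >= 2 && m[1] == m[2]) ? 2 : 0))
    g = P.make_grammar([:pair], P.flatten(rules, Char))
    p = P.parse(g, "aab")
    P.find_match_at!(p, :pair, 1)  # nonzero: the scanner matches "aa" at position 1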

src/parse.jl
Lines changed: 15 additions & 2 deletions

@@ -203,8 +203,21 @@
 """
 $(TYPEDSIGNATURES)
 
-Greedily find terminals in the input sequence, while avoiding any attempts at
-parsing terminals where another terminal was already parsed successfully.
+Greedily find terminals in the input sequence. For performance and uniqueness
+purposes, terminals are only looked for at stream indexes that follow the final
+indexes of terminals found previously. That allows the lexing process to skip
+many redundant matches that could never be found by the grammar.
+
+As the main outcome, this prevents the typical pika-parser behavior when
+matching sequences using [`many`](@ref), where e.g. an identifier like `abcd`
+also produces redundant (and often invalid) matches for `bcd`, `cd` and `d`.
+Collaterally, greedy lexing also creates fewer tokens in the match table, which
+results in faster parsing.
+
+To produce good terminal matches quickly, use [`scan`](@ref).
+
+In typical use, this function is best called indirectly via
+[`parse_lex`](@ref).
 """
 function lex(g::Grammar{G,T}, input::I)::Vector{Vector{Tuple{G,Int}}} where {G,T,I}
     q = PikaQueue(lastindex(input))
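
The effect on the match table can be observed directly; a hedged sketch, reusing the `grammar` and `input` from the lexer tutorial above and the `matches` field shown in the examples:

    p_plain = P.parse(grammar, input);
    p_lexed = P.parse_lex(grammar, input);
    (length(p_plain.matches), length(p_lexed.matches))  # the pre-lexed table is typically much smaller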

src/structs.jl
Lines changed: 3 additions & 2 deletions

@@ -51,8 +51,9 @@ $(TYPEDEF)
 A single terminal, possibly made out of multiple input tokens.
 
 Given the input stream view, the `match` function scans the input forward and
-returns the position of the last item of the terminal starting at the beginning
-of the stream. In case there's no match, it returns a zero.
+returns the position of the last item of the matched terminal (which is assumed
+to start at the beginning of the stream view). In case there's no match, it
+returns zero.
 
 # Fields
 $(TYPEDFIELDS)
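
A minimal sketch of a match function that follows this contract, assuming a `Char` stream view and the illustrative name `scan_let`: it matches the fixed keyword `let` and returns the index of its last character, or zero when the view does not start with it:

    function scan_let(input)
        i = firstindex(input)
        for c in "let"
            (i <= lastindex(input) && input[i] == c) || return 0
            i = nextind(input, i)
        end
        return prevind(input, i)  # index of the last matched item
    end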
