Commit e4c91a3

Parsers determine whether grammar rules start with an identifier followed by a symbol, and dynamically set the terminal regular expressions accordingly. Subsequent rules must either all begin with an identifier, or all begin with only a symbol.
1 parent 2ed698f commit e4c91a3

15 files changed, with 73 additions and 52 deletions.

README.md

Lines changed: 5 additions & 6 deletions
@@ -26,10 +26,9 @@ As LL(1) grammars operate using `alt` and `seq` primitives, allowing for a match
2626
* Transform `a ::= b+` into `a ::= b b*`
2727
* Transform `a ::= b*` into `a ::= _empty | (b a)`
2828
* Transform `a ::= op1 (op2)` into two rules:
29-
```
30-
a ::= op1 _a_1
31-
_a_1_ ::= op2
32-
```
29+
30+
a ::= op1 _a_1
31+
_a_1_ ::= op2
3332

3433
Of note in this implementation is that the tokenizer and parser are streaming, so that they can process inputs of arbitrary size.
3534

@@ -96,7 +95,7 @@ The {EBNF::Writer} class can be used to write parsed grammars out, either as for
9695
The formatted HTML results are designed to be appropriate for including in specifications.
9796

9897
### Parser Errors
99-
On a parsing failure, and exception is raised with information that may be useful in determining the source of the error.
98+
On a parsing failure, an exception is raised with information that may be useful in determining the source of the error.
10099

101100
## EBNF Grammar
102101
The [EBNF][] variant used here is based on [W3C](https://w3.org/) [EBNF][]
@@ -116,7 +115,7 @@ which can also be proceeded by an optional number enclosed in square brackets to
116115

117116
[1] symbol ::= expression
118117

119-
(Note, this can introduce an ambiguity if the previous rule ends in a range or enum and the current rule has no number. In this case, enclosing `expression` within parentheses, or adding intervening comments can resolve the ambiguity.)
118+
(Note, introduces an ambiguity if the previous rule ends in a range or enum and the current rule has no number. The parsers dynamically determine the terminal rules for the `LHS` (the identifier, symbol, and `::=`) and `RANGE`).
120119

121120
Symbols are written in CAPITAL CASE if they are the start symbol of a regular language (terminals), otherwise with they are treated as non-terminal rules. Literal strings are quoted.
122121
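
The ambiguity described in the README note can be reproduced in isolation. The sketch below borrows the scanning pattern that appears in the `LHS` handler of `lib/ebnf/parser.rb` in this commit and anchors it at the start of the input; the constant `LHS_PATTERN` and the helper `leading_lhs` are hypothetical names used only for illustration, and the gem's actual `LHS` terminal may be stricter than this pattern.

```ruby
# Simplified pattern, taken from the scan in the LHS handler shown in this commit.
# It is an illustration only and is not the gem's LHS terminal definition.
LHS_PATTERN = /(?:\[([^\]]+)\])?\s*<?(\w+)>?\s*::=/

# Hypothetical helper: returns [identifier, symbol] when the text starts with
# something that looks like a rule left-hand side, otherwise nil.
def leading_lhs(text)
  md = text.match(/\A#{LHS_PATTERN}/)
  md && md.captures
end

# A numbered rule: the bracketed part is an identifier.
p leading_lhs("[1] rule ::= 'x'")

# The tail of an unnumbered rule ending in a range, followed by the next rule.
# The range "[abc]" is picked up as if it were the identifier of rule "b".
p leading_lhs("[abc] b ::= 'x'")

# A range that is not followed by "symbol ::=" is not mistaken for an LHS.
p leading_lhs("[abc] 'x'")
```

The second call is the problematic case: without further context, a range at the end of one rule and the symbol of the next rule together look like a numbered left-hand side.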

etc/ebnf.ebnf

Lines changed: 3 additions & 4 deletions
@@ -5,9 +5,8 @@
55

66
# Use the LHS terminal to match the identifier, rule name and assignment due to
77
# confusion between the identifier and RANGE.
8-
# Note, for grammars not using identifiers, it is still possible to confuse
9-
# a rule ending with a range the next rule, as it may be interpreted as an identifier.
10-
# In such case, best to enclose the rule in '()'.
8+
# The PEG parser has special rules for matching LHS and RANGE
9+
# so that RANGE is not confused with LHS.
1110
[3] rule ::= LHS expression
1211

1312
[4] expression ::= alt
@@ -40,7 +39,7 @@
4039

4140
[13] HEX ::= '#x' ([a-f] | [A-F] | [0-9])+
4241

43-
[14] RANGE ::= '[' ((R_CHAR '-' R_CHAR) | (HEX '-' HEX) | R_CHAR | HEX)+ '-'? ']' - LHS
42+
[14] RANGE ::= '[' ((R_CHAR '-' R_CHAR) | (HEX '-' HEX) | R_CHAR | HEX)+ '-'? ']'
4443

4544
[15] O_RANGE ::= '[^' ((R_CHAR '-' R_CHAR) | (HEX '-' HEX) | R_CHAR | HEX)+ '-'? ']'
4645

etc/ebnf.html

Lines changed: 1 addition & 1 deletion
@@ -95,7 +95,7 @@
9595
<td>[14]</td>
9696
<td><code>RANGE</code></td>
9797
<td>::=</td>
98-
<td>'<code class="grammar-literal">[</code>' <code class="grammar-paren">(</code><code class="grammar-paren">(</code><a href="#grammar-production-R_CHAR">R_CHAR</a> '<code class="grammar-literal">-</code>' <a href="#grammar-production-R_CHAR">R_CHAR</a><code class="grammar-paren">)</code> <code class="grammar-alt">|</code> <code class="grammar-paren">(</code><a href="#grammar-production-HEX">HEX</a> '<code class="grammar-literal">-</code>' <a href="#grammar-production-HEX">HEX</a><code class="grammar-paren">)</code> <code class="grammar-alt">|</code> <a href="#grammar-production-R_CHAR">R_CHAR</a> <code class="grammar-alt">|</code> <a href="#grammar-production-HEX">HEX</a><code class="grammar-paren">)</code><code class="grammar-plus">+</code> '<code class="grammar-literal">-</code>'<code class="grammar-opt">?</code> <code class="grammar-paren">(</code>'<code class="grammar-literal">]</code>' <code class="grammar-diff">-</code> <a href="#grammar-production-LHS">LHS</a><code class="grammar-paren">)</code></td>
98+
<td>'<code class="grammar-literal">[</code>' <code class="grammar-paren">(</code><code class="grammar-paren">(</code><a href="#grammar-production-R_CHAR">R_CHAR</a> '<code class="grammar-literal">-</code>' <a href="#grammar-production-R_CHAR">R_CHAR</a><code class="grammar-paren">)</code> <code class="grammar-alt">|</code> <code class="grammar-paren">(</code><a href="#grammar-production-HEX">HEX</a> '<code class="grammar-literal">-</code>' <a href="#grammar-production-HEX">HEX</a><code class="grammar-paren">)</code> <code class="grammar-alt">|</code> <a href="#grammar-production-R_CHAR">R_CHAR</a> <code class="grammar-alt">|</code> <a href="#grammar-production-HEX">HEX</a><code class="grammar-paren">)</code><code class="grammar-plus">+</code> '<code class="grammar-literal">-</code>'<code class="grammar-opt">?</code> '<code class="grammar-literal">]</code>'</td>
9999
</tr>
100100
<tr id="grammar-production-O_RANGE">
101101
<td>[15]</td>

etc/ebnf.ll1.sxp

Lines changed: 1 addition & 4 deletions
@@ -104,10 +104,7 @@
104104
(terminal O_SYMBOL "12a" (plus (alt (range "a-z") (range "A-Z") (range "0-9") '_' '.')))
105105
(terminal HEX "13" (seq '#x' (plus (alt (range "a-f") (range "A-F") (range "0-9")))))
106106
(terminal RANGE "14"
107-
(seq '['
108-
(plus (alt (seq R_CHAR '-' R_CHAR) (seq HEX '-' HEX) R_CHAR HEX))
109-
(opt '-')
110-
(diff ']' LHS)) )
107+
(seq '[' (plus (alt (seq R_CHAR '-' R_CHAR) (seq HEX '-' HEX) R_CHAR HEX)) (opt '-') ']'))
111108
(terminal O_RANGE "15"
112109
(seq '[^' (plus (alt (seq R_CHAR '-' R_CHAR) (seq HEX '-' HEX) R_CHAR HEX)) (opt '-') ']'))
113110
(terminal STRING1 "16" (seq '"' (star (diff CHAR '"')) '"'))

etc/ebnf.peg.rb

Lines changed: 5 additions & 6 deletions
@@ -38,13 +38,12 @@ module EBNFMeta
3838
EBNF::Rule.new(:_HEX_3, "13.3", [:range, "a-f"], kind: :terminal).extend(EBNF::PEG::Rule),
3939
EBNF::Rule.new(:_HEX_4, "13.4", [:range, "A-F"], kind: :terminal).extend(EBNF::PEG::Rule),
4040
EBNF::Rule.new(:_HEX_5, "13.5", [:range, "0-9"], kind: :terminal).extend(EBNF::PEG::Rule),
41-
EBNF::Rule.new(:RANGE, "14", [:seq, "[", :_RANGE_1, :_RANGE_2, :_RANGE_3], kind: :terminal).extend(EBNF::PEG::Rule),
42-
EBNF::Rule.new(:_RANGE_1, "14.1", [:plus, :_RANGE_4], kind: :terminal).extend(EBNF::PEG::Rule),
43-
EBNF::Rule.new(:_RANGE_4, "14.4", [:alt, :_RANGE_5, :_RANGE_6, :R_CHAR, :HEX], kind: :terminal).extend(EBNF::PEG::Rule),
44-
EBNF::Rule.new(:_RANGE_5, "14.5", [:seq, :R_CHAR, "-", :R_CHAR], kind: :terminal).extend(EBNF::PEG::Rule),
45-
EBNF::Rule.new(:_RANGE_6, "14.6", [:seq, :HEX, "-", :HEX], kind: :terminal).extend(EBNF::PEG::Rule),
41+
EBNF::Rule.new(:RANGE, "14", [:seq, "[", :_RANGE_1, :_RANGE_2, "]"], kind: :terminal).extend(EBNF::PEG::Rule),
42+
EBNF::Rule.new(:_RANGE_1, "14.1", [:plus, :_RANGE_3], kind: :terminal).extend(EBNF::PEG::Rule),
43+
EBNF::Rule.new(:_RANGE_3, "14.3", [:alt, :_RANGE_4, :_RANGE_5, :R_CHAR, :HEX], kind: :terminal).extend(EBNF::PEG::Rule),
44+
EBNF::Rule.new(:_RANGE_4, "14.4", [:seq, :R_CHAR, "-", :R_CHAR], kind: :terminal).extend(EBNF::PEG::Rule),
45+
EBNF::Rule.new(:_RANGE_5, "14.5", [:seq, :HEX, "-", :HEX], kind: :terminal).extend(EBNF::PEG::Rule),
4646
EBNF::Rule.new(:_RANGE_2, "14.2", [:opt, "-"], kind: :terminal).extend(EBNF::PEG::Rule),
47-
EBNF::Rule.new(:_RANGE_3, "14.3", [:diff, "]", :LHS], kind: :terminal).extend(EBNF::PEG::Rule),
4847
EBNF::Rule.new(:O_RANGE, "15", [:seq, "[^", :_O_RANGE_1, :_O_RANGE_2, "]"], kind: :terminal).extend(EBNF::PEG::Rule),
4948
EBNF::Rule.new(:_O_RANGE_1, "15.1", [:plus, :_O_RANGE_3], kind: :terminal).extend(EBNF::PEG::Rule),
5049
EBNF::Rule.new(:_O_RANGE_3, "15.3", [:alt, :_O_RANGE_4, :_O_RANGE_5, :R_CHAR, :HEX], kind: :terminal).extend(EBNF::PEG::Rule),

etc/ebnf.peg.sxp

Lines changed: 5 additions & 6 deletions
@@ -35,13 +35,12 @@
3535
(terminal _HEX_3 "13.3" (range "a-f"))
3636
(terminal _HEX_4 "13.4" (range "A-F"))
3737
(terminal _HEX_5 "13.5" (range "0-9"))
38-
(terminal RANGE "14" (seq '[' _RANGE_1 _RANGE_2 _RANGE_3))
39-
(terminal _RANGE_1 "14.1" (plus _RANGE_4))
40-
(terminal _RANGE_4 "14.4" (alt _RANGE_5 _RANGE_6 R_CHAR HEX))
41-
(terminal _RANGE_5 "14.5" (seq R_CHAR '-' R_CHAR))
42-
(terminal _RANGE_6 "14.6" (seq HEX '-' HEX))
38+
(terminal RANGE "14" (seq '[' _RANGE_1 _RANGE_2 ']'))
39+
(terminal _RANGE_1 "14.1" (plus _RANGE_3))
40+
(terminal _RANGE_3 "14.3" (alt _RANGE_4 _RANGE_5 R_CHAR HEX))
41+
(terminal _RANGE_4 "14.4" (seq R_CHAR '-' R_CHAR))
42+
(terminal _RANGE_5 "14.5" (seq HEX '-' HEX))
4343
(terminal _RANGE_2 "14.2" (opt '-'))
44-
(terminal _RANGE_3 "14.3" (diff ']' LHS))
4544
(terminal O_RANGE "15" (seq '[^' _O_RANGE_1 _O_RANGE_2 ']'))
4645
(terminal _O_RANGE_1 "15.1" (plus _O_RANGE_3))
4746
(terminal _O_RANGE_3 "15.3" (alt _O_RANGE_4 _O_RANGE_5 R_CHAR HEX))

etc/ebnf.sxp

Lines changed: 1 addition & 4 deletions
@@ -16,10 +16,7 @@
1616
(terminal O_SYMBOL "12a" (plus (alt (range "a-z") (range "A-Z") (range "0-9") '_' '.')))
1717
(terminal HEX "13" (seq '#x' (plus (alt (range "a-f") (range "A-F") (range "0-9")))))
1818
(terminal RANGE "14"
19-
(seq '['
20-
(plus (alt (seq R_CHAR '-' R_CHAR) (seq HEX '-' HEX) R_CHAR HEX))
21-
(opt '-')
22-
(diff ']' LHS)) )
19+
(seq '[' (plus (alt (seq R_CHAR '-' R_CHAR) (seq HEX '-' HEX) R_CHAR HEX)) (opt '-') ']'))
2320
(terminal O_RANGE "15"
2421
(seq '[^' (plus (alt (seq R_CHAR '-' R_CHAR) (seq HEX '-' HEX) R_CHAR HEX)) (opt '-') ']'))
2522
(terminal STRING1 "16" (seq '"' (star (diff CHAR '"')) '"'))

lib/ebnf/parser.rb

Lines changed: 31 additions & 4 deletions
@@ -11,6 +11,12 @@ class Parser
1111
# @return [Array<EBNF::Rule>]
1212
attr_reader :ast
1313

14+
# Set on first rule
15+
attr_reader :lhs_includes_identifier
16+
17+
# Regular expression to match a [...] range, which may be distinguisehd from an LHS
18+
attr_reader :range
19+
1420
# ## Terminals
1521
# Define rules for Terminals, placing results on the input stack, making them available to upstream non-Terminal rules.
1622
#
@@ -28,7 +34,22 @@ class Parser
2834
#
2935
# [11] LHS ::= ('[' SYMBOL+ ']' ' '+)? <? SYMBOL >? ' '* '::='
3036
terminal(:LHS, LHS) do |value, prod|
31-
value.to_s.scan(/(?:\[([^\]]+)\])?\s*<?(\w+)>?\s*::=/).first
37+
md = value.to_s.scan(/(?:\[([^\]]+)\])?\s*<?(\w+)>?\s*::=/).first
38+
if @lhs_includes_identifier.nil?
39+
@lhs_includes_identifier = !md[0].nil?
40+
@range = md[0] ? RANGE_NOT_LHS : RANGE
41+
elsif @lhs_includes_identifier && !md[0]
42+
error("LHS",
43+
"Rule does not begin with a [xxx] identifier, which was established on the first rule",
44+
production: :LHS,
45+
rest: value)
46+
elsif !@lhs_includes_identifier && md[0]
47+
error("LHS",
48+
"Rule begins with a [xxx] identifier, which was not established on the first rule",
49+
production: :LHS,
50+
rest: value)
51+
end
52+
md
3253
end
3354

3455
# Match `SYMBOL` terminal
@@ -48,9 +69,10 @@ class Parser
4869
end
4970

5071
# Terminal for `RANGE` is matched as part of a `primary` rule.
72+
# Note that this won't match if rules include identifiers.
5173
#
52-
# [14] RANGE ::= '[' ((R_CHAR '-' R_CHAR) | (HEX '-' HEX) | R_CHAR | HEX)+ '-'? ']' - LHS
53-
terminal(:RANGE, RANGE) do |value|
74+
# [14] RANGE ::= '[' ((R_CHAR '-' R_CHAR) | (HEX '-' HEX) | R_CHAR | HEX)+ '-'? ']'
75+
terminal(:RANGE, proc {@range}) do |value|
5476
[:range, value[1..-2]]
5577
end
5678

@@ -130,7 +152,9 @@ class Parser
130152
# Invoke callback
131153
id, sym = value[:LHS]
132154
expression = value[:expression]
133-
callback.call(:rule, EBNF::Rule.new(sym.to_sym, id, expression))
155+
rule = EBNF::Rule.new(sym.to_sym, id, expression)
156+
progress(:rule, rule.to_sxp)
157+
callback.call(:rule, rule)
134158
nil
135159
end
136160

@@ -274,6 +298,9 @@ def initialize(input, **options, &block)
274298
tap {|x| x.formatter = lambda {|severity, datetime, progname, msg| "#{severity} #{msg}\n"}}
275299
end
276300

301+
# This is established on the first rule.
302+
self.class.instance_variable_set(:@lhs_includes_identifier, nil)
303+
277304
# Read input, if necessary, which will be used in a Scanner.
278305
@input = input.respond_to?(:read) ? input.read : input.to_s
279306

lib/ebnf/peg/parser.rb

Lines changed: 5 additions & 6 deletions
@@ -68,10 +68,9 @@ def terminal_options; (@terminal_options ||= {}); end
6868
#
6969
# @param [Symbol] term
7070
# The terminal name.
71-
# @param [Regexp] regexp (nil)
72-
# Pattern used to scan for this terminal,
73-
# defaults to the expression defined in the associated rule.
74-
# If unset, the terminal rule is used for matching.
71+
# @param [Regexp, Proc] regexp
72+
# Pattern used to scan for this terminal.
73+
# Passing a Proc will evaluate that proc to retrieve a regular expression.
7574
# @param [Hash] options
7675
# @option options [Boolean] :unescape
7776
# Cause strings and codepoints to be unescaped.
@@ -83,8 +82,8 @@ def terminal_options; (@terminal_options ||= {}); end
8382
# @yieldparam [Proc] block
8483
# Block passed to initialization for yielding to calling parser.
8584
# Should conform to the yield specs for #initialize
86-
def terminal(term, regexp = nil, **options, &block)
87-
terminal_regexps[term] = regexp if regexp
85+
def terminal(term, regexp, **options, &block)
86+
terminal_regexps[term] = regexp
8887
terminal_handlers[term] = block if block_given?
8988
terminal_options[term] = options.freeze
9089
end

lib/ebnf/peg/rule.rb

Lines changed: 2 additions & 0 deletions
@@ -49,6 +49,7 @@ def parse(input, **options)
4949
# use that to match the input,
5050
# otherwise,
5151
if regexp = parser.terminal_regexp(sym)
52+
regexp = regexp.call() if regexp.is_a?(Proc)
5253
term_opts = parser.terminal_options(sym)
5354
if matched = input.scan(regexp)
5455
# Optionally map matched
@@ -290,6 +291,7 @@ def rept(input, min, max, prod, string_regexp_opts, **options)
290291
def terminal_also_matches(input, prod, string_regexp_opts)
291292
str_regex = Regexp.new(Regexp.quote(prod), string_regexp_opts)
292293
input.match?(str_regex) && parser.class.terminal_regexps.any? do |sym, re|
294+
re = re.call() if re.is_a?(Proc)
293295
(match_len = input.match?(re)) && match_len > prod.length
294296
end
295297
end
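
The two one-line changes above let a terminal be registered with either a `Regexp` or a `Proc` that returns one, so the pattern can be chosen after parsing has started. The following sketch shows that idea with Ruby's standard `StringScanner`; `TinyLexer` and its methods are hypothetical and are not part of the gem, and the two patterns are simplified stand-ins for the real terminals.

```ruby
require 'strscan'

# Hypothetical lexer that accepts either a Regexp or a Proc per terminal.
class TinyLexer
  # Allows the range pattern to be chosen late, for example after the first rule.
  attr_writer :range

  def initialize
    @range = nil
    @terminals = {
      HEX: /#x[0-9a-fA-F]+/,   # fixed pattern
      RANGE: proc { @range }   # resolved each time the terminal is tried
    }
  end

  # Scans one terminal at the scanner's current position.
  # Returns the matched text, or nil if the terminal does not match.
  def scan(scanner, name)
    pattern = @terminals.fetch(name)
    # Same check as in the diff: call the Proc to obtain the Regexp.
    pattern = pattern.call if pattern.is_a?(Proc)
    scanner.scan(pattern)
  end
end

lexer = TinyLexer.new
lexer.range = /\[[^\]]+\]/   # must be set before RANGE is scanned

scanner = StringScanner.new("[a-z]#x20")
p lexer.scan(scanner, :RANGE)
p lexer.scan(scanner, :HEX)
```

Resolving the `Proc` at match time, rather than at registration time, is what allows the pattern to depend on state that is only known once input has been read.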
