- Production rules
- Regex pattern
- RuleType
- ReduceAction
- Accessing token data in ReduceAction
- Exclamation mark !
- %tokentype
- %token
- %start
- %eof
- %userdata
- %left, %right, %precedence
- %err, %error
- %glr
- %lalr
- %nooptim
RustyLR's grammar syntax is inspired by parser generators like Yacc and Bison. Grammars are defined using a combination of directives, token definitions, and production rules.
In procedural macros, the grammar is defined using the lr1! macro.
In build script files, the grammar section is separated from Rust code using %%. Everything before %% is treated as regular Rust code and is copied as-is to the generated output.
Each production rule defines how a non-terminal symbol can be derived from a sequence of patterns.
NonTerminalName
: Pattern1 Pattern2 ... PatternN %prec OpName { ReduceAction }
| Pattern1 Pattern2 ... PatternN { ReduceAction }
...
;
- NonTerminalName: The name of the non-terminal symbol being defined.
- PatternX: A terminal or non-terminal symbol, or a pattern as defined below.
- ReduceAction: Optional Rust code executed when the rule is reduced.
- OpName: The symbol to use as the operator for this production rule. OpName can be a terminal defined with %token, a literal, or any unique identifier used only for this rule. See ReduceType for more details.
Patterns define the structure of the input that matches a production rule.
- name: Non-terminal or terminal symbol name defined in the grammar.
- [term1 term_start-term_last], [^term1 term_start-term_last]: Set of terminal symbols. eof will be automatically removed from the terminal set.
- P*: Zero or more repetitions of P.
- P+: One or more repetitions of P.
- P?: Zero or one repetition of P.
- (P1 P2 P3): Grouping of patterns.
- 'a' or b'a': Single character literal or byte literal. This is only supported if the %tokentype is char or u8.
- "abcd" or b"abcd": String literal or byte string literal. This is only supported if the %tokentype is char or u8.
Note: When using range patterns like [first-last], the range is determined by the order of %token directives, not by the actual values of the tokens.
If you define tokens in the following order:
%token one '1';
%token two '2';
...
%token zero '0';
%token nine '9';
The range [zero-nine] will be ['0', '9'], not ['0'-'9'].
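To make the declaration-order rule concrete, here is a minimal sketch in plain Rust. It is not RustyLR's implementation: `expand_range` is a hypothetical helper, the four-token declaration list is made up for illustration, and returning `None` for a reversed range is an assumption of this sketch.

```rust
/// Expand a range pattern `[first-last]` by position in the declaration list,
/// not by the underlying token values.
/// `declared` holds (name, value) pairs in the order of hypothetical `%token` directives.
fn expand_range(declared: &[(&str, char)], first: &str, last: &str) -> Option<Vec<char>> {
    let start = declared.iter().position(|(name, _)| *name == first)?;
    let end = declared.iter().position(|(name, _)| *name == last)?;
    if start > end {
        // Assumption of this sketch: a reversed range yields nothing.
        return None;
    }
    Some(declared[start..=end].iter().map(|(_, value)| *value).collect())
}

fn main() {
    // Hypothetical order: `zero` and `nine` are declared next to each other.
    let declared = [("one", '1'), ("two", '2'), ("zero", '0'), ("nine", '9')];
    // Only the tokens declared between `zero` and `nine` are included.
    println!("{:?}", expand_range(&declared, "zero", "nine")); // Some(['0', '9'])
}
```

Declaring the digit tokens in ascending order is what makes [zero-nine] cover all ten digits.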
Assigning a type to a non-terminal allows the parser to carry semantic information.
E(MyType): ... ;
MyType: The Rust type associated with the non-terminal E.
The value of E is the result of evaluating the ReduceAction.
A ReduceAction is Rust code executed when a production rule is reduced.
- If a RuleType is defined, the ReduceAction must evaluate to that type.
- If no RuleType is defined and only one token holds a value, the ReduceAction can be omitted.
- A ReduceAction can return Result<(), ErrorType> to handle errors during parsing.
- A ReduceAction is written in Rust code. It is executed when the rule is matched and reduced.
%err String;
E(i32): A div a2=A {
if a2 == 0 {
return Err("Division by zero".to_string());
}
A / a2 // new value of E
};
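The division rule above can be read as an ordinary fallible Rust function. The sketch below is a standalone stand-in, not code generated by RustyLR: `reduce_div` is a hypothetical name, and wrapping the final value in `Ok` is an assumption about how such an action could be modeled outside the macro.

```rust
/// Stand-in for the reduce action of `E(i32): A div a2=A { ... }`.
/// `a` and `a2` play the role of the values bound to the two `A` patterns;
/// `String` matches the `%err String;` declaration.
fn reduce_div(a: i32, a2: i32) -> Result<i32, String> {
    if a2 == 0 {
        // Early return reports the error instead of producing a value for E.
        return Err("Division by zero".to_string());
    }
    Ok(a / a2) // new value of E
}

fn main() {
    println!("{:?}", reduce_div(6, 3)); // Ok(2)
    println!("{:?}", reduce_div(1, 0)); // Err("Division by zero")
}
```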
Within a ReduceAction, you can access the data associated with tokens and non-terminals:
- Named Patterns: Assign names to patterns to access their values.
E(i32): left=A '+' right=A { left + right };
- Or using their default names if obvious.
E(i32): A '+' right=A { A + right }; // use A directly
- User Data: Access mutable user-defined data passed to the parser.
E(i32): A '+' right=A {
*data += 1; // data: &mut UserData
A + right
};
- Lookahead Token: Inspect the next token without consuming it.
match *lookahead { // lookahead: &TerminalType
'+' => { /* ... */ },
_ => { /* ... */ },
}
- Shift Control: Control whether to perform a shift operation. (for GLR parser)
*shift = false; // Prevent shift action
For some regex patterns, the type of the variable is modified as follows:
- P*: Vec<P>
- P+: Vec<P>
- P?: Option<P>
You can still access the Vec or Option by using the base name of the pattern.
E(i32) : A* {
println!( "Value of A: {:?}", A ); // Vec<A>
};
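As a rough illustration of the shapes a ReduceAction receives for repeated and optional patterns, the sketch below uses plain Rust values. It assumes `A` holds an `i32`; `reduce_sum` and its parameter names are hypothetical and only model the `Vec` and `Option` wrappers.

```rust
/// `many` models the `Vec` bound for `A*` or `A+`;
/// `maybe` models the `Option` bound for `A?`.
fn reduce_sum(many: Vec<i32>, maybe: Option<i32>) -> i32 {
    let total: i32 = many.iter().sum();
    // A missing optional value contributes nothing.
    total + maybe.unwrap_or(0)
}

fn main() {
    println!("{}", reduce_sum(vec![1, 2, 3], Some(4))); // 10
    println!("{}", reduce_sum(Vec::new(), None)); // 0
}
```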
For terminal sets [term1 term_start-term_end] and [^term1 term_start-term_end], there is no predefined variable name. You must explicitly define the variable name.
E: digit=[zero-nine] {
println!( "Value of digit: {:?}", digit ); // %tokentype
};
For group (P1 P2 P3):
- If none of the patterns holds a value, the group itself will not hold any value.
- If only one of the patterns holds a value, the group will hold the value of that pattern, and the variable name will be the same as the pattern's. (i.e. if P1 holds a value and the others don't, then (P1 P2 P3) will hold the value of P1, which can be accessed via the name P1)
- If multiple patterns hold values, the group will hold a tuple of the values. There is no default variable name for the group; you must define the variable name explicitly with the = operator.
NoRuleType: ... ;
I(i32): ... ;
// I will be chosen
A: (NoRuleType I NoRuleType) {
println!( "Value of I: {:?}", I ); // can access by 'I'
I
};
// ( i32, i32 )
B: i2=( I NoRuleType I ) {
println!( "Value of I: {:?}", i2 ); // must explicitly define the variable name
};
An exclamation mark ! can be used right after a token to ignore the value of that token.
The token will be treated as if it does not hold any value.
A(i32) : ... ;
// A in the middle will be chosen, since other A's are ignored
E(i32) : A! A A!;
%tokentype <RustType> ;
Define the type of terminal symbols. <RustType> must be accessible at the point where the macro is called.
enum MyTokenType<Generic> {
    Digit,
    Ident,
    ...
    // a variant carrying a value of the generic type
    VariantWithGeneric(Generic),
}
lr1! {
    ...
    %tokentype MyTokenType<i32>;
}
%token name <RustExpr> ;
Map the terminal symbol name to the actual value <RustExpr>. <RustExpr> must be accessible at the point where the macro is called.
%tokentype MyToken;
%token zero MyToken::Zero;
%token one MyToken::One;
...
// 'zero' and 'one' will be replaced by MyToken::Zero and MyToken::One respectively
E: zero one;
Note: If %tokentype is either char or u8, you can't use this directive. You must use literal values in the grammar directly.
%start NonTerminalName ;
Set the start symbol of the grammar to NonTerminalName.
%start E;
// this internally generates the augmented rule <Augmented> -> E eof
E: ... ;
%eof <RustExpr> ;
Define the eof terminal symbol. <RustExpr> must be accessible at the point where the macro is called.
The eof terminal symbol will be automatically added to the grammar.
%eof b'\0';
// you can access eof terminal symbol by 'eof' in the grammar
// without %token eof ...;
%userdata <RustType> ;
Define the type of userdata passed to the feed() function.
struct MyUserData { ... }
...
%userdata MyUserData;
...
fn main() {
...
let mut userdata = MyUserData { ... };
parser.feed( ..., token, &mut userdata); // <-- userdata is passed here
}
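The flow of user data into reduce actions can be sketched without the parser itself. In the standalone example below, `reduce_add` is a hypothetical stand-in for the action of `E(i32): A '+' right=A { ... }`, and the `additions` field is an invented counter; only the idea of a `&mut` user data value reaching the action comes from the text above.

```rust
/// Hypothetical user data: counts how many additions were reduced.
struct MyUserData {
    additions: u32,
}

/// `data` models the mutable user data that a reduce action can access.
fn reduce_add(a: i32, right: i32, data: &mut MyUserData) -> i32 {
    data.additions += 1; // side effect visible to the caller after parsing
    a + right
}

fn main() {
    let mut userdata = MyUserData { additions: 0 };
    // Models reducing `1 + 2` and then `(1 + 2) + 3` with the same user data.
    let first = reduce_add(1, 2, &mut userdata);
    let second = reduce_add(first, 3, &mut userdata);
    println!("value = {}, additions = {}", second, userdata.additions);
    // prints "value = 6, additions = 2"
}
```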
// reduce first
%left term1 term2 term3 ...;
// shift first
%right term1 ;
%right term1 term2 term3 ... ;
// only precedence
%precedence term1 term2 term3 ... ;
%left can be abbreviated as %reduce or %l, and %right as %shift or %r.
These directives define the associativity and precedence of operators.
As in yacc and bison, the order of precedence is determined by the order in which %left, %right, or %precedence directives appear.
// left-associative binary operator '+' (reduce first)
%left '+';
// right-associative binary operator '^' (shift first)
%right '^';
%left '+';
%left '*';
%left UnaryMinus; // << highest priority
E: E '+' E { E + E }
| E '*' E { E * E }
| '-' E %prec UnaryMinus { -E } // use `UnaryMinus` as the operator for this production rule
;
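The following sketch shows the Yacc/Bison-style decision that these directives drive when a shift/reduce conflict arises. It is an illustration under assumptions, not RustyLR's code: `resolve`, `Assoc`, and `Resolution` are hypothetical names, levels are numbered by declaration order (later means higher), and treating an equal-level %precedence conflict as an error follows Bison's behavior, which RustyLR may or may not share.

```rust
use std::cmp::Ordering;

#[derive(Clone, Copy, Debug, PartialEq)]
enum Assoc {
    Left,     // %left: reduce first
    Right,    // %right: shift first
    NonAssoc, // %precedence: level only, no associativity
}

#[derive(Debug, PartialEq)]
enum Resolution {
    Shift,
    Reduce,
    Error,
}

/// Compare the level of the rule's operator with the level of the lookahead
/// operator; `assoc` is consulted only when the two levels are equal.
fn resolve(rule_level: usize, lookahead_level: usize, assoc: Assoc) -> Resolution {
    match rule_level.cmp(&lookahead_level) {
        Ordering::Greater => Resolution::Reduce,
        Ordering::Less => Resolution::Shift,
        Ordering::Equal => match assoc {
            Assoc::Left => Resolution::Reduce,
            Assoc::Right => Resolution::Shift,
            Assoc::NonAssoc => Resolution::Error,
        },
    }
}

fn main() {
    // Levels taken from the grammar above: '+' = 0, '*' = 1, UnaryMinus = 2.
    println!("{:?}", resolve(0, 1, Assoc::Left)); // E '+' E, next '*': Shift
    println!("{:?}", resolve(1, 0, Assoc::Left)); // E '*' E, next '+': Reduce
    println!("{:?}", resolve(0, 0, Assoc::Left)); // E '+' E, next '+': Reduce
    println!("{:?}", resolve(2, 1, Assoc::Left)); // '-' E, next '*': Reduce
    println!("{:?}", resolve(0, 0, Assoc::Right)); // right-associative: Shift
    println!("{:?}", resolve(0, 0, Assoc::NonAssoc)); // no associativity: Error
}
```

Under this scheme, `-a * b` groups as `(-a) * b` because the `%prec UnaryMinus` rule sits at the highest level.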
%err <RustType> ;
%error <RustType> ;
Define the type of the Err variant in Result<(), Err> returned from a ReduceAction. If not defined, DefaultReduceActionError will be used.
enum MyErrorType<T> {
ErrVar1,
ErrVar2,
ErrVar3(T),
}
...
%err MyErrorType<GenericType> ;
...
match parser.feed( ... ) {
Ok(_) => {}
Err(err) => {
match err {
ParseError::ReduceAction( err ) => {
// do something with err
}
_ => {}
}
}
}
%lalr;
Switch the generated parser table to LALR.
%glr;
Switch to GLR parser generation.
If you want to generate a GLR parser, add the %glr; directive to the grammar.
With this directive, Shift/Reduce and Reduce/Reduce conflicts will not be treated as errors.
See GLR Parser section for more details.
%nooptim;
Disable grammar optimization.