## Description

Since commit 2f2d8cf, efforts have begun to completely redesign the existing lexer and parser, with the new code currently residing in the `nix-parser2` directory.
## Existing pain points

### Lexer
- The lexer is not fully lazy when it comes to string interpolations: it eagerly collects all of the interpolated tokens into a nested `Vec`. This is both a design kludge and a performance footgun. Ideally, the lexer should be a dumb iterator yielding a flat, lazy sequence of tokens which map directly to spans in the source text.
- Each variant of the `Token` enum carries its own duplicate `Span` information. This should be factored out into a `Token` struct which contains both a `Span` and a `TokenKind`, eliminating the need for implementing `ToSpan` on `Token`.
- The lexer does some work which the parser should be doing, such as unescaping strings, trimming the leading whitespace from `''` strings, converting paths into `PathBuf` values, converting booleans into `bool` values, etc. There should be a stronger separation of concerns, where the produced tokens carry only span information and little else, and the parser is free to slice and retrieve the source text based on those spans, if it so desires.
- The lexer should be 100% lossless. Currently, if a path token contains a trailing slash or a string literal is left unterminated, the lexer yields a `Token::Error`. This confounds the parser at later stages, causing it to emit perplexing errors such as ``expected path, found `/foo/bar/` `` and ``unexpected token `'` ``, respectively. Instead, we should still emit `Token::Path` and `Token::String` so parsing can continue, but mark them as invalid. This matches what `rustc_lexer` does.
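The flat `Token` design described above could be sketched roughly as follows. The names here (`TokenKind::Path`, the `trailing_slash`/`terminated` flags) are illustrative assumptions, not the actual `nix-parser2` definitions; the point is that a token is just a kind plus a span, and invalid forms keep a real kind instead of collapsing into `Token::Error`:

```rust
/// A half-open byte range into the source text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

impl Span {
    /// Slice the original source; the token itself stores no text.
    pub fn text<'a>(&self, source: &'a str) -> &'a str {
        &source[self.start..self.end]
    }
}

/// Invalid forms (unterminated strings, paths with trailing slashes)
/// still get a real kind so the parser can keep going, mirroring the
/// `rustc_lexer` approach.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TokenKind {
    Ident,
    Path { trailing_slash: bool },
    String { terminated: bool },
    Whitespace,
}

/// One flat token: a kind plus a span, nothing else. No nested `Vec`s
/// for interpolations, no eagerly unescaped payloads.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Token {
    pub kind: TokenKind,
    pub span: Span,
}

fn main() {
    let source = "/foo/bar/";
    let token = Token {
        kind: TokenKind::Path { trailing_slash: true },
        span: Span { start: 0, end: source.len() },
    };
    // The parser can always recover the exact source text from the span.
    assert_eq!(token.span.text(source), "/foo/bar/");
}
```

With this shape, `impl ToSpan for Token` reduces to returning the `span` field, and the parser decides for itself whether (and when) to unescape or convert the underlying text.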
### Parser
- Error management with `nom` is very hard to get right. It leads to very convoluted code as well as performance footguns if you're not careful. Producing high-quality diagnostics from `nom` combinators is tedious as well.
- Writing an incremental parser with `nom` is very difficult compared to a hand-written parser. `nom` is hard-wired to return `Incomplete(n)` when receiving incomplete input, which is decidedly not what we want when processing a programming language.
- Threading state through a `nom` parser, e.g. an arena allocator or string interner, is very cumbersome due to the restrictive `Fn` bounds on combinators.
- The parser should be 100% lossless. It should not produce an AST, but rather a CST (concrete syntax tree) which preserves all whitespace, comments, and invalid tokens. This would be usable in a code formatter or autocompletion engine, for example. The CST can be simplified into an AST at a later stage. See the Roslyn architecture for more details.
- The parser currently can't correctly handle spaces inside string interpolations, nor escaped `'` characters inside indented `''` strings.
- Implementing overflow arguments, trailing commas, and ellipses in formal function arguments is tricky. Implementing a strategy which yields high-quality diagnostics is even trickier.
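The lossless-CST property can be illustrated with a minimal hand-written tree in the Roslyn/rowan spirit. This is a sketch under assumed names (`CstNode`, string-typed kinds), not the planned `nix-parser2` representation; the invariant it demonstrates is that every byte of the source, trivia included, lives in some leaf, so the tree can reproduce the input exactly:

```rust
/// A minimal lossless CST: interior nodes group children, leaves hold
/// the exact source text of one token (whitespace and comments included).
#[derive(Debug)]
enum CstNode {
    Node { kind: &'static str, children: Vec<CstNode> },
    Leaf { kind: &'static str, text: String },
}

impl CstNode {
    /// Losslessness check: concatenating the leaves yields the original
    /// source verbatim, which is what a formatter or IDE needs.
    fn to_source(&self) -> String {
        match self {
            CstNode::Leaf { text, .. } => text.clone(),
            CstNode::Node { children, .. } => {
                children.iter().map(|c| c.to_source()).collect()
            }
        }
    }
}

fn main() {
    // CST for the Nix lambda `x :  y`, odd double space and all.
    let cst = CstNode::Node {
        kind: "lambda",
        children: vec![
            CstNode::Leaf { kind: "ident", text: "x".into() },
            CstNode::Leaf { kind: "whitespace", text: " ".into() },
            CstNode::Leaf { kind: "colon", text: ":".into() },
            CstNode::Leaf { kind: "whitespace", text: "  ".into() },
            CstNode::Leaf { kind: "ident", text: "y".into() },
        ],
    };
    assert_eq!(cst.to_source(), "x :  y");
}
```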
## Current state
The new lexer present in the `master` branch still relies on `nom` because it works very well for building small and efficient lexers. Since the new lexer is lossless and will never produce an error, there's no need to struggle with custom `nom` error management, and the code looks much simpler as a result. The redesigned lexer is currently in an MVP state and is essentially feature-complete, though a few pieces are still missing, most notably a function for checking delimiter balancing.
The new parser will ditch `nom` and will most likely be hand-written for the sake of better error recovery and higher-quality diagnostics. The precise design hasn't been fully fleshed out yet. Instead of producing an AST, the new parser will produce a CST (possibly based on `rowan`, but it could also be hand-written) whose nodes cannot be constructed by hand. However, the CST can be simplified down into an AST, whose nodes can be constructed by hand.
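The CST-to-AST simplification step might look something like the sketch below. All names here (`Cst`, `Ast`, the `"apply"` node kind) are hypothetical; the idea is just that lowering discards trivia and punctuation, leaving a small tree whose nodes are ordinary enums that tests can build by hand:

```rust
/// A lossless CST sketch: nodes group children, tokens hold exact text.
#[derive(Debug)]
enum Cst {
    Node { kind: &'static str, children: Vec<Cst> },
    Token { kind: &'static str, text: String },
}

/// The AST keeps only semantically meaningful structure; unlike CST
/// nodes, these values can be freely constructed by hand.
#[derive(Debug, PartialEq)]
enum Ast {
    Ident(String),
    Apply(Box<Ast>, Box<Ast>),
}

/// Lowering drops whitespace, comments, and punctuation tokens, keeping
/// just identifiers and application structure.
fn lower(cst: &Cst) -> Option<Ast> {
    match cst {
        Cst::Token { kind: "ident", text } => Some(Ast::Ident(text.clone())),
        Cst::Token { .. } => None, // trivia disappears here
        Cst::Node { kind: "apply", children } => {
            let mut parts = children.iter().filter_map(lower);
            let func = parts.next()?;
            let arg = parts.next()?;
            Some(Ast::Apply(Box::new(func), Box::new(arg)))
        }
        Cst::Node { children, .. } => children.iter().filter_map(lower).next(),
    }
}

fn main() {
    // CST for the application `f  x`, double space preserved.
    let cst = Cst::Node {
        kind: "apply",
        children: vec![
            Cst::Token { kind: "ident", text: "f".into() },
            Cst::Token { kind: "whitespace", text: "  ".into() },
            Cst::Token { kind: "ident", text: "x".into() },
        ],
    };
    let ast = lower(&cst).unwrap();
    assert_eq!(
        ast,
        Ast::Apply(Box::new(Ast::Ident("f".into())), Box::new(Ast::Ident("x".into())))
    );
}
```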