## Description

Since commit 2f2d8cf, efforts have begun to completely redesign the existing lexer and parser, with the new code currently residing in the `nix-parser2` directory.
## Existing pain points

### Lexer
- The lexer is not fully lazy when it comes to string interpolations: it eagerly collects all of the interpolated tokens into a nested `Vec`. This is both a design kludge and a performance footgun. Ideally, the lexer should be a dumb iterator yielding a flat, lazy sequence of tokens which map directly to spans in the source text.
- Each variant of the `Token` enum carries its own duplicate `Span` information. This should be factored out into a `Token` struct which contains both a `Span` and a `TokenKind`, eliminating the need for implementing `ToSpan` on `Token`.
- The lexer does some work which the parser should be doing, such as unescaping strings, trimming the leading whitespace from `''` strings, converting paths into `PathBuf` values, converting booleans into `bool` values, etc. There should be a stronger separation of concerns, where the produced tokens carry only span information and little else, and the parser is free to slice and retrieve the source text based on those spans, if it so desires.
- The lexer should be 100% lossless. Currently, if a path token contains a trailing slash or a string literal is left unterminated, the lexer yields a `Token::Error`. This confounds the parser at later stages, causing it to emit perplexing errors such as ``expected path, found `/foo/bar/` `` and ``unexpected token `'` ``, respectively. Instead, we should still emit `Token::Path` and `Token::String` so parsing can continue, but mark them as invalid. This matches what `rustc_lexer` does.
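The flat `Token` design described above could be sketched roughly as follows. The names here (`TokenKind::Path`, the `trailing_slash`/`terminated` flags) are illustrative assumptions, not the actual `nix-parser2` definitions; the point is that a token is just a kind plus a span, and invalid forms keep a real kind instead of collapsing into `Token::Error`:

```rust
/// A half-open byte range into the source text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

impl Span {
    /// Slice the original source; the token itself stores no text.
    pub fn text<'a>(&self, source: &'a str) -> &'a str {
        &source[self.start..self.end]
    }
}

/// Invalid forms (unterminated strings, paths with trailing slashes)
/// still get a real kind so the parser can keep going, mirroring the
/// `rustc_lexer` approach.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TokenKind {
    Ident,
    Path { trailing_slash: bool },
    String { terminated: bool },
    Whitespace,
}

/// One flat token: a kind plus a span, nothing else. No nested `Vec`s
/// for interpolations, no eagerly unescaped payloads.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Token {
    pub kind: TokenKind,
    pub span: Span,
}

fn main() {
    let source = "/foo/bar/";
    let token = Token {
        kind: TokenKind::Path { trailing_slash: true },
        span: Span { start: 0, end: source.len() },
    };
    // The parser can always recover the exact source text from the span.
    assert_eq!(token.span.text(source), "/foo/bar/");
}
```

With this shape, `impl ToSpan for Token` reduces to returning the `span` field, and the parser decides for itself whether (and when) to unescape or convert the underlying text.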
### Parser
- Error management with `nom` is very hard to get right. It leads to very convoluted code as well as performance footguns if you're not careful. Producing high-quality diagnostics from `nom` combinators is tedious as well.
- Writing an incremental parser with `nom` is very difficult compared to a hand-written parser. `nom` is hard-wired to return `Incomplete(n)` when receiving incomplete input, which is decidedly not what we want when processing a programming language.
- Threading state through a `nom` parser, e.g. an arena allocator or string interner, is very cumbersome due to the restrictive `Fn` bounds on combinators.
- The parser should be 100% lossless. It should not produce an AST, but rather a CST (concrete syntax tree) which preserves all whitespace, comments, and invalid tokens. This would be usable in a code formatter or autocompletion engine, for example. The CST can be simplified into an AST at a later stage. See the Roslyn architecture for more details.
- The parser currently can't correctly handle spaces inside string interpolations, nor escaped `'` characters inside indented `''` strings.
- Implementing overflow arguments, trailing commas, and ellipses in formal function arguments is tricky. Implementing a strategy which yields high-quality diagnostics is even trickier.
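The lossless-CST property can be illustrated with a minimal hand-written tree in the Roslyn/rowan spirit. This is a sketch under assumed names (`CstNode`, string-typed kinds), not the planned `nix-parser2` representation; the invariant it demonstrates is that every byte of the source, trivia included, lives in some leaf, so the tree can reproduce the input exactly:

```rust
/// A minimal lossless CST: interior nodes group children, leaves hold
/// the exact source text of one token (whitespace and comments included).
#[derive(Debug)]
enum CstNode {
    Node { kind: &'static str, children: Vec<CstNode> },
    Leaf { kind: &'static str, text: String },
}

impl CstNode {
    /// Losslessness check: concatenating the leaves yields the original
    /// source verbatim, which is what a formatter or IDE needs.
    fn to_source(&self) -> String {
        match self {
            CstNode::Leaf { text, .. } => text.clone(),
            CstNode::Node { children, .. } => {
                children.iter().map(|c| c.to_source()).collect()
            }
        }
    }
}

fn main() {
    // CST for the Nix lambda `x :  y`, odd double space and all.
    let cst = CstNode::Node {
        kind: "lambda",
        children: vec![
            CstNode::Leaf { kind: "ident", text: "x".into() },
            CstNode::Leaf { kind: "whitespace", text: " ".into() },
            CstNode::Leaf { kind: "colon", text: ":".into() },
            CstNode::Leaf { kind: "whitespace", text: "  ".into() },
            CstNode::Leaf { kind: "ident", text: "y".into() },
        ],
    };
    assert_eq!(cst.to_source(), "x :  y");
}
```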
## Current state
The new lexer present in the `master` branch still relies on `nom` because it works very well for building small and efficient lexers. Since the new lexer is lossless and will never produce an error, there's no need to struggle with custom `nom` error management, and the code looks much simpler as a result. The redesigned lexer is currently in an MVP state and is essentially feature-complete, though a few pieces are still missing, most notably a function for checking delimiter balancing.
The new parser will ditch `nom` and will most likely be hand-written for the sake of better error recovery and higher-quality diagnostics. The precise design hasn't been fully fleshed out yet. Instead of producing an AST, the new parser will produce a CST (possibly based on `rowan`, but it could also be hand-written) whose nodes cannot be constructed by hand. However, the CST can be simplified down into an AST, whose nodes can be constructed by hand.
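The CST-to-AST simplification step might look something like the sketch below. All names here (`Cst`, `Ast`, the `"apply"` node kind) are hypothetical; the idea is just that lowering discards trivia and punctuation, leaving a small tree whose nodes are ordinary enums that tests can build by hand:

```rust
/// A lossless CST sketch: nodes group children, tokens hold exact text.
#[derive(Debug)]
enum Cst {
    Node { kind: &'static str, children: Vec<Cst> },
    Token { kind: &'static str, text: String },
}

/// The AST keeps only semantically meaningful structure; unlike CST
/// nodes, these values can be freely constructed by hand.
#[derive(Debug, PartialEq)]
enum Ast {
    Ident(String),
    Apply(Box<Ast>, Box<Ast>),
}

/// Lowering drops whitespace, comments, and punctuation tokens, keeping
/// just identifiers and application structure.
fn lower(cst: &Cst) -> Option<Ast> {
    match cst {
        Cst::Token { kind: "ident", text } => Some(Ast::Ident(text.clone())),
        Cst::Token { .. } => None, // trivia disappears here
        Cst::Node { kind: "apply", children } => {
            let mut parts = children.iter().filter_map(lower);
            let func = parts.next()?;
            let arg = parts.next()?;
            Some(Ast::Apply(Box::new(func), Box::new(arg)))
        }
        Cst::Node { children, .. } => children.iter().filter_map(lower).next(),
    }
}

fn main() {
    // CST for the application `f  x`, double space preserved.
    let cst = Cst::Node {
        kind: "apply",
        children: vec![
            Cst::Token { kind: "ident", text: "f".into() },
            Cst::Token { kind: "whitespace", text: "  ".into() },
            Cst::Token { kind: "ident", text: "x".into() },
        ],
    };
    let ast = lower(&cst).unwrap();
    assert_eq!(
        ast,
        Ast::Apply(Box::new(Ast::Ident("f".into())), Box::new(Ast::Ident("x".into())))
    );
}
```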