Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Thinking About Parser Safety #23

Open
purefunctor opened this issue Mar 6, 2024 · 2 comments
Open

Thinking About Parser Safety #23

purefunctor opened this issue Mar 6, 2024 · 2 comments

Comments

@purefunctor
Copy link
Owner

The current iteration of the analyzer's parser is a hand-written, recursive descent parser that "recovers" on errors instead of failing at the very first one that it encounters, which is a different approach from what the mainline compiler as well as the PureScript-based parser is doing. Recovery basically allows the parser to parse as much information as possible such that the semantic analyzer core can have as much surface area from the source file to work with.

One of the challenges we're seeing in the parser currently is loop safety, specifically for nodes that require repeated parsing of elements. For instance, as of a9cb011, the following declaration fails to parse with an infinite loop error (explanation):

b | if 0 else 1 then 2 = 0

For a moment, let's take a step back and think about what the parser does:

  1. It consumes atoms (i.e. tokens)
  2. It can build nodes (i.e. groups of tokens and nodes)
  3. For unexpected tokens, it can yield an Error node which consumes the incorrect token and marks it with an error message saying what was expected. This is effectively recovery at the atom level.
  4. For nodes, we currently don't have a concrete solution. I'll be talking more about possible ones in the next section.

Nodes can be further subdivided into two kinds:

  1. Composite nodes
  2. Container nodes

Container nodes are composite nodes but composite nodes are not container nodes. More specifically, composite nodes are collections of tokens and more nodes, while container nodes have the additional property of having an unbounded (read: bounded by RAM) number of children.

IfThenElse is an example of a composite node: it contains the tokens for the keywords, as well as the accompanying nodes for the expression.

Import/export lists, declaration lists, variable lists, arrays expressions, record expressions, etc. are examples of container nodes. They all involve repeated parsing of nodes until a condition is met, which is usually when a "terminating" token is seen (e.g. ] for arrays, etc.)

One of the problems we currently have with container nodes is how they handle "recovery" for when items fail to parse. The example I mentioned earlier demonstrates this: while parsing the list of guarded expressions, the parser fails to parse the ternary, which puts the parser in a state where it gets stuck on a single token.

An ideal approach to recovery here would involve "attempting" to parse the item. If the attempt succeeds, then no further action is needed; if the attempt fails, then the parser consumes tokens until the next nearest position where it can confidently begin parsing again.

In summary:

  1. The parser has the mechanisms to perform recovery for unexpected tokens and march on with parsing.
  2. Container nodes are currently parsed aggressively without recovery for nodes.
  3. A "safe" parser would involve some degree of backtracking and recovery for items in container nodes.
@FredTheDino
Copy link
Collaborator

FredTheDino commented Mar 6, 2024

Great write up! I think the elm parser does something similar - it jumps to the next "start of line" expression and carries on from there. So:

a = [ 1, 2, 3, 4 ] ] ] ] ]
                          ^ - error, we recover by skipping until we find another top-level declaration
           # another function
           >>= and stuff

b = 1
^ - Great! Now we know we're at the start of a definition so we can continue from here.

I find this to be a nice read: https://craftinginterpreters.com/parsing-expressions.html#panic-mode-error-recovery

@purefunctor
Copy link
Owner Author

I believe the language-cst-parser also does something similar with statements specifically!

https://github.com/natefaubion/purescript-language-cst-parser/blob/67323ef7038ee6514c1912684398e3da5a3fd207/src/PureScript/CST/Parser.purs#L1172-L1212

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants