Thinking About Parser Safety

The current iteration of the analyzer's parser is a hand-written, recursive descent parser that "recovers" on errors instead of failing at the very first one that it encounters, which is a different approach from what the mainline compiler as well as the PureScript-based parser is doing. Recovery basically allows the parser to parse as much information as possible such that the semantic analyzer core can have as much surface area from the source file to work with.

One of the challenges we're seeing in the parser currently is loop safety, specifically for nodes that require repeated parsing of elements. For instance, as of a9cb01148f54a39c7980c0e4c606091fedfda25c, the following declaration fails to parse with an infinite loop error ([explanation](https://github.com/purefunctor/purescript-analyzer/pull/22#discussion_r1513818803)):
```rs
b | if 0 else 1 then 2 = 0
```

For a moment, let's take a step back and think about what the parser does:
1. It consumes atoms (i.e. tokens)
2. It can build nodes (i.e. groups of tokens and nodes)
3. For unexpected tokens, it can yield an `Error` node which consumes the incorrect token and marks it with an error message saying what was expected. This is effectively recovery at the atom level.
4. For nodes, we currently don't have a concrete solution. I'll be talking more about possible ones in the next section.

Nodes can be further subdivided into two kinds:
1. Composite nodes
2. Container nodes

Container nodes are composite nodes but composite nodes are not container nodes. More specifically, composite nodes are collections of tokens and more nodes, while container nodes have the additional property of having an unbounded (read: bounded by RAM) number of children.

`IfThenElse` is an example of a composite node: it contains the tokens for the keywords, as well as the accompanying nodes for the expression.

Import/export lists, declaration lists, variable lists, arrays expressions, record expressions, etc. are examples of container nodes. They all involve repeated parsing of nodes until a condition is met, which is usually when a "terminating" token is seen (e.g. `]` for arrays, etc.)

One of the problems we currently have with container nodes is how they handle "recovery" for when items fail to parse. The example I mentioned earlier demonstrates this: while parsing the list of guarded expressions, the parser fails to parse the ternary, which puts the parser in a state where it gets stuck on a single token.

An ideal approach to recovery here would involve "attempting" to parse the item. If the attempt succeeds, then no further action is needed; if the attempt fails, then the parser consumes tokens until the next nearest position where it can confidently begin parsing again.

In summary:
1. The parser has the mechanisms to perform recovery for unexpected tokens and march on with parsing.
2. Container nodes are currently parsed aggressively without recovery for nodes.
3. A "safe" parser would involve some degree of backtracking and recovery for items in container nodes.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Thinking About Parser Safety #23

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Thinking About Parser Safety #23

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions