Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recipe: indentation-sensitive languages #246

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

aabounegm
Copy link
Contributor

This is a guide on how to use the new IndentationAwareTokenBuilder, added to Langium in eclipse-langium/langium#1578.
It should probably not be published before v3.2.0 (in which the token builder is expected to ship).

@aabounegm aabounegm temporarily deployed to pull-request-preview July 18, 2024 12:21 — with GitHub Actions Inactive
Copy link

github-actions bot commented Jul 18, 2024

PR Preview Action v1.4.6
🚀 Deployed preview to https://eclipse-langium.github.io/langium-previews/pr-previews/pr-246/
on branch previews at 2024-08-29 14:46 UTC

Copy link
Contributor

@Lotes Lotes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice recipe. I have some comments about the insights to the implementation. I think we should add links or detailed information how this token builder/lexer works.


The content you choose for these 3 terminals doesn't matter since it will overridden by `IndentationAwareTokenBuilder` anyway. However, you might still want to choose tokens that don't overlap with other terminals for easier use in the playground.

### Playground compatibility
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mh not sure if we need this. It should be clear for users that the playground is limited.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, removed this section

@aabounegm
Copy link
Contributor Author

Thanks for your comments! I will address them as soon as a decision is reached about eclipse-langium/langium#1608 to edit the last section as well

@aabounegm aabounegm temporarily deployed to pull-request-preview August 22, 2024 14:28 — with GitHub Actions Inactive
@aabounegm aabounegm temporarily deployed to pull-request-preview August 25, 2024 20:12 — with GitHub Actions Inactive
@aabounegm
Copy link
Contributor Author

aabounegm commented Aug 25, 2024

@Lotes Thanks for waiting so long, all comments should be addressed now.
Note that the last commit adds a subsection that depends on eclipse-langium/langium#1647 getting merged first

Copy link
Contributor

@Lotes Lotes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some minor things and one big question. The most of my thoughts were already resolved. Thanks.

return true
```

the lexer will output the following sequence of tokens: `if`, `BOOLEAN`, `INDENT`, `return`, `BOOLEAN`, `DEDENT`, `else`, `INDENT`, `if`, `BOOLEAN`, `INDENT`, `return`, `BOOLEAN`, `DEDENT`, `DEDENT`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The with capital T

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was intended as a continuation of the sentence before it, only interrupted by the code snippet. Not sure if it makes sense or if it counts as a separate sentence 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then, I would suggest to add 3 dots at the end of the first phase and at the beginning of the second phrase.

@@ -0,0 +1,135 @@
---
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question about indentation in the implementation: How do you distinguish between spaces and tabs?
This could be an interesting point of the configuration to show here. Maybe in an own section or as appendix.
How to align this with an editor-config, see https://editorconfig.org or other approaches?

I guess you choose only spaces or tabs for the WS token, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question!
I thought a lot about the best approach here, and in the end decided not to discriminate between them, which is the simpler way. Alternatives included allowing only one or the other through a config parameter, or treating a tab as n spaces (again, for a configurable n). I thought these 2 alternatives were a bit too strict (though that's how Python behaves, for example, by prohibiting mixing them), and I thought that ideally I could issue a warning, but I couldn't find a way to accept a token and still issue a warning/error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, now that I think about it, I could add some payload to the returned token and then in the lexer check for the payload and add to the errors array, but then there would still be no way of making it a warning rather than an error. Perhaps LexerResult should be augmented to allow warnings?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think resolution of my question would block this recipe. I mean we can still change something afterwards. Extending the LexerResult sounds too much for this change.

Another question could be: How to write an indention-aware formatter? Is it even applicable or doable? How is it done for Python?
We do not have to answer this now. I was just interested about some consequences or follow-up tasks.

Copy link
Contributor Author

@aabounegm aabounegm Aug 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not yet have experience writing formatters in Langium, but I don't see why it would be difficult to do. Generally, there are 2 approaches: formatting and pretty printing. One way to implement a formatter is to search for some (anti-)patterns in the code and issue TextEdits just for them. Pretty printers normally use the AST/CST (or some other intermediate representation) and transform them back into code, regardless of how it initially looked like before parsing. (or at least that's how I understand the difference between them)
Both approaches seem possible with the indentation-aware tokens, though the second one (pretty printer) is probably easier to implement, assuming we want the formatter to ensure consistent indentation characters and sizes.

For Python, one of the most popular formatter is black, and it uses the pretty printing approach. Not sure how other formatters handle inconsistent indentation, tbh.

Copy link
Contributor

@Lotes Lotes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have nothing more to add right now. Let's wait for a second opinion on your recipe :-) .

return true
```

the lexer will output the following sequence of tokens: `if`, `BOOLEAN`, `INDENT`, `return`, `BOOLEAN`, `DEDENT`, `else`, `INDENT`, `if`, `BOOLEAN`, `INDENT`, `return`, `BOOLEAN`, `DEDENT`, `DEDENT`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then, I would suggest to add 3 dots at the end of the first phase and at the beginning of the second phrase.

@@ -0,0 +1,135 @@
---
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think resolution of my question would block this recipe. I mean we can still change something afterwards. Extending the LexerResult sounds too much for this change.

Another question could be: How to write an indention-aware formatter? Is it even applicable or doable? How is it done for Python?
We do not have to answer this now. I was just interested about some consequences or follow-up tasks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants