Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ambiguous comment token definition in ODIN grammar #31

Open
serefarikan opened this issue Sep 22, 2017 · 9 comments
Open

Ambiguous comment token definition in ODIN grammar #31

serefarikan opened this issue Sep 22, 2017 · 9 comments
Labels

Comments

@serefarikan
Copy link

serefarikan commented Sep 22, 2017

The base patterns file here: https://github.com/openEHR/adl-antlr/blob/master/src/main/antlr/base_patterns.g4 which is used in odin grammar defines comment tokens as follows:

H_CMT_LINE : '--------' '-'? '\n' ; // special type of comment for splitting template overlays
CMT_LINE : '--' .
? '\n' -> skip ; // (increment line count)

This introduces a problem with antlr 4.7, but interestingly, 4.5 manages to parse input, which I think should not be possible: probably Antlr's smart parsing algorithms are relaxing the rules a bit too much.

Since Antlr considers the declaration of tokens in the grammar files during parsing, the H_CMT_LINE token overrides the CMT_LINE token in the following examples, taken from openEHR RM bmm (CIMI bmms have same types of comments too):

------------------------------------------------------
-- BMM version on which these schemas are based.
------------------------------------------------------
bmm_version = <"2.1">

------------------------------------------------------
-- schema identification
-- (schema_id computed as <rm_publisher>_<schema_name>_<rm_release>)
------------------------------------------------------ 

The long comments above are meant to be comments and they don't imply the special template overlay comment. However, based on the grammar file above, the lexer is bound to match these long comment lines to H_CMT_LINE which I'd say is the correct behaviour. The result, using antlr 4.7 (the latest version) is that the parse tree never goes beyond the long comment lines.

Changing the order of CMT_LINE and H_CMT_LINE fixes the issue but I'm sure it'd break the behaviour within archetype files where template overlays are expressed.

I'd suggest changing the template overlay comment token to something else. Maybe something like:

H_CMT_LINE : '-/' ''*? '\n'  ;  // special type of comment for splitting template overlays

@wolandscat wolandscat added the bug label Sep 22, 2017
@wolandscat
Copy link
Member

It seems to me that a simpler approach is to move the H_CMT_LINE to one of the ADL Antlr files, because that special long line is only special in between whole ADL texts. So the ODIN Antlr file would go back to just matching with the CMT_LINE rule.

@pieterbos
Copy link
Contributor

I don't think that will work - the lexer rules are the same for all the files and not specific for one of them, since you use Odin within adl. So you can move them but it will not be handled like two different levers.

In fact, having all the lexer rules in one antlr file makes sense because of this.

Unless you have a way in the lexer to know when Odin parts start and end, then you can define different lexer modes.

But perhaps there is some way in the parser to define when the special comment lines are comments and when they are used to separate overlays.

@wolandscat
Copy link
Member

wolandscat commented Sep 23, 2017

Yep, you are right - that won't work in Antlr. We need a solution based on modes. Or else we do what Seref said, and define a different kind of comment line. Maybe an easy one would be:

//------------------------------

@serefarikan
Copy link
Author

serefarikan commented Sep 23, 2017

I guess upper layers of software that go beyond bmm processing (which is what I'm doing) would need to use modes anyway but it is how we know we're supposed to use the mode_x that is the problem, when the token level ambiguity exits. So I'd stick to my different type of comment suggestion. Your suggestion for syntax is just fine Tom.

@pieterbos
Copy link
Contributor

for lexical modes you need to be able to switch modes in the lexer. I don't think you can, because there's no unambiguous token for when Odin starts and ends - not in the lexer. It's not like we have a CDATA tag like in XML or a /* to start a comment. It's why I could not use lexical modes when I tried to unambiguously fix both regular expressions and path parsing, and had to resort to handling them as a single token in Archie.

With some custom target language code it's probably possible - but I think that has some serious drawbacks since you would need a version per target language.

Handling this in the parser would be hard because these comments can appear literally everywhere.

I don't see an easy solution in the current syntax.

@serefarikan
Copy link
Author

I could not find an example of the template overlay use case for the special comment token so I'm a bit in the dark re its use in that context but it my ignorant guess is switching to a CDATA-ish or /* ... */ comment format would help, based on my understanding of your comments @pieterbos

This is a step further than what @wolandscat is suggesting but it sounds like a more future proof approach.

@wolandscat
Copy link
Member

Example of a single file template.

The original attraction in using long lines was that they visually separate the overlays nicely and that they are already ADL comments anyway.

@pieterbos
Copy link
Contributor

Oh, some other ways to solve this:

  • do not require comment lines in the parser. Of course, this means they are no longer required which can be a drawback
  • treat the comment lines, the whitespace and template_overlay as one single token

I don't think these are very elegant ways though, but both should work.

@serefarikan
Copy link
Author

@pieterbos just to clarify: Are you suggesting that the parser uses only template_overlays keyword to identify t. overlays and not use the long comment line in your first suggestion?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants