Improve Vocabular configurability #3161

jgremmen · 2021-04-21T09:27:01Z


Summary

Based on the grammar files, ANTLR generates lexer and parser classes that contain a vocabulary providing literal names for each token. The generated vocabulary has several issues that should be addressed in future releases:

the literal names in the vocabulary are less complete than one would expect from the grammar they are based on
there's no easy way to customize the literal names
for lexer/parser combinations, the vocabulary is generated twice (equally bad), once in each class

Incomplete literal names

Let's take a look at the following lexer definitions:

NE
        :  '<>'  |  '!'  | '!='
        ;
LT
        :  '<'
        ;
LTE
        : '<='
        ;

The generated lexer class only provides '<=' for the LTE token. For the other tokens in this example, the literal name is null. For token LT there's no apparent reason why it resorts to null (is this a bug?). As for token NE, there should be a reasonable default literal name such as '<>', '!' or '!='. Essentially, if a token does not represent a single string, its literal name defaults to null, and even in some other cases (e.g. LT in the example above) the behavior is unexpected.

Customizing the vocabulary

As the default vocabulary is incomplete, the need for customization becomes an issue.

Back in the day (ANTLR 3) it was possible to subclass the generated lexer/parser and override the literal names in a static part of the class. With ANTLR 4 everything is either private or final or both. The only way I've found to overcome this problem is to subclass the generated class and override getVocabulary() with a custom Vocabulary implementation.

If this were an exceptional requirement, it would be a valid solution. But as the generated vocabulary is useless for almost every grammar, I've had the need to do this in every project.
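
The workaround described above can be sketched roughly as follows. This is a minimal illustration, not the author's actual code: TokenVocabulary is a hypothetical stand-in for org.antlr.v4.runtime.Vocabulary (reduced to two lookups so the sketch compiles without the ANTLR runtime), GeneratedVocabularyStub only mimics the behavior reported in this post, and the token types 1 to 3 are assumed from the order of the rules. In a real project the wrapper would presumably implement the real Vocabulary interface and be returned from an overridden getVocabulary() in a subclass of the generated lexer or parser.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for org.antlr.v4.runtime.Vocabulary (subset of its lookups).
interface TokenVocabulary {
    String getLiteralName(int tokenType);
    String getSymbolicName(int tokenType);
}

// Wraps a generated vocabulary and fills in literal names it leaves as null.
final class OverlayVocabulary implements TokenVocabulary {
    private final TokenVocabulary generated;
    private final Map<Integer, String> literalOverrides;

    OverlayVocabulary(TokenVocabulary generated, Map<Integer, String> literalOverrides) {
        this.generated = generated;
        this.literalOverrides = new HashMap<>(literalOverrides);
    }

    @Override
    public String getLiteralName(int tokenType) {
        // An override wins; otherwise fall back to whatever was generated.
        String override = literalOverrides.get(tokenType);
        return override != null ? override : generated.getLiteralName(tokenType);
    }

    @Override
    public String getSymbolicName(int tokenType) {
        return generated.getSymbolicName(tokenType);
    }
}

// Mimics the vocabulary as described in this post: only LTE has a literal name.
final class GeneratedVocabularyStub implements TokenVocabulary {
    private static final String[] LITERALS = { null, null, null, "'<='" };
    private static final String[] SYMBOLS = { null, "NE", "LT", "LTE" };

    @Override
    public String getLiteralName(int tokenType) {
        return tokenType >= 0 && tokenType < LITERALS.length ? LITERALS[tokenType] : null;
    }

    @Override
    public String getSymbolicName(int tokenType) {
        return tokenType >= 0 && tokenType < SYMBOLS.length ? SYMBOLS[tokenType] : null;
    }
}

final class OverlayVocabularyDemo {
    public static void main(String[] args) {
        Map<Integer, String> overrides = new HashMap<>();
        overrides.put(1, "'<>', '!' or '!='"); // NE
        overrides.put(2, "'<'");               // LT
        TokenVocabulary vocabulary = new OverlayVocabulary(new GeneratedVocabularyStub(), overrides);
        for (int type = 1; type <= 3; type++) {
            System.out.println(vocabulary.getSymbolicName(type) + " -> " + vocabulary.getLiteralName(type));
        }
    }
}
```

Wrapping instead of copying the generated arrays keeps the overrides in one small map, but it still has to be wired into every generated class separately, which is the burden this post is about.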

Possible improvements:

provide a way to specify the literal name in the grammar itself
provide an easy way to customize the vocabulary without the need for subclassing the generated lexer/parser

Vocabulary generated twice for lexer/parser combinations

For lexer/parser combinations the generated vocabularies in both classes are identical. This is inefficient and, with respect to the customization part described above, there's now the need for subclassing both generated classes.

Basically I don't think it should be required to subclass the generated classes at all. The requirement to do so shows that the generated classes lack customization abilities.

Possible improvements:

take care of the issues described in "Incomplete literal names" so there's no need for customization
remove all grammar specific vocabulary code from the lexer/parser classes and generate a vocabulary class (similar to XYZListener and XYZBaseListener) which can be customized once and used for both lexer and parser

skef · 2021-05-11T01:19:09Z


+1 for this.

It seems to me that on the lexer side a lot of the customization burden could be cleanly handled with an additional literal() lexer command to supply the string. This isn't all that different in spirit from the current type() command. Given that the generated strings have single quotes it would be nice to have some way of specifying alternatives that didn't require constant escapes (e.g. literal('\'foo\' or \'bar\'')) but even without that the option would be a substantial improvement.

1 reply

jgremmen Mar 12, 2022
Author

@parrt you might take a look at this discussion as well ;-)

parrt · 2022-03-12T19:12:00Z

Maintainer

Everything looks fine to me. For your grammar above as L.g4, I get:

$ cat L.tokens
NE=1
LT=2
LTE=3
'<'=2
'<='=3

I understand your interest in customizing token names, but I don't think I will be going down this path.

1 reply

jgremmen Mar 14, 2022
Author

It would be fine if there were a literal for token 1 (e.g. '<>, !, !='=1), but unfortunately there isn't ;-)

In the end there is never going to be an automatic solution for token 1 that pleases everyone. So omitting the literal as soon as it is not just a simple piece of text is an understandable approach.
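
To check mechanically which tokens end up without a literal, the L.tokens listing quoted in the previous comment can be scanned with a few lines of Java. This is only a sketch: the class TokensFile and its methods are hypothetical names, and the parser assumes nothing beyond the two line shapes visible in that listing (NAME=type and 'literal'=type).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper that reads lines in the format of the L.tokens listing.
final class TokensFile {
    final Map<String, Integer> symbolicTypes = new LinkedHashMap<>();
    final Map<Integer, String> literalsByType = new LinkedHashMap<>();

    static TokensFile parse(List<String> lines) {
        TokensFile result = new TokensFile();
        for (String rawLine : lines) {
            String line = rawLine.trim();
            // Split at the last '=' so literals that contain '=' (such as '<=') stay intact.
            int split = line.lastIndexOf('=');
            if (split <= 0) {
                continue; // blank or malformed line
            }
            String name = line.substring(0, split);
            int type = Integer.parseInt(line.substring(split + 1));
            if (name.startsWith("'")) {
                result.literalsByType.put(type, name);
            } else {
                result.symbolicTypes.put(name, type);
            }
        }
        return result;
    }

    // Symbolic token names whose type has no quoted literal entry.
    List<String> tokensWithoutLiteral() {
        List<String> missing = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : symbolicTypes.entrySet()) {
            if (!literalsByType.containsKey(entry.getValue())) {
                missing.add(entry.getKey());
            }
        }
        return missing;
    }
}

final class TokensFileDemo {
    public static void main(String[] args) {
        TokensFile tokens = TokensFile.parse(
                Arrays.asList("NE=1", "LT=2", "LTE=3", "'<'=2", "'<='=3"));
        System.out.println(tokens.tokensWithoutLiteral()); // prints [NE]
    }
}
```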

@skef suggested a more generic approach which is actually quite nice:

NE
      :  '<>'  |  '!'  | '!=' -> literal('<>, ! or !=')
      ;

It fits perfectly in the lexer concept and every token can be customized individually.

parrt · 2022-03-14T22:05:00Z

Maintainer

Well, one could argue those all should be different tokens.

6 replies

skef Mar 14, 2022

There are too many circumstances now where errors dump long token lists into a user's lap, many of which may have more to do with internal details of the grammar implementation rather than the problem domain.

ericvergnaud Mar 14, 2022
Maintainer

To Ter's point, using different tokens makes it much easier to achieve the goal of providing meaningful error messages, since each of them would come with its own literal.

jgremmen Mar 14, 2022
Author

Yes, that's a valid argument, but let's take a look at a not-so-randomly picked lexer rule from the Java grammar:

FLOAT_LITERAL:      (Digits '.' Digits? | '.' Digits) ExponentPart? [fFdD]?
             |       Digits (ExponentPart [fFdD]? | [fFdD])
             ;

One can argue a lot but this will never be something readable for a non-technical user.

skef Mar 14, 2022

I think @parrt and @ericvergnaud are both over-focusing on the details of the supplied example. You can give each of those separate "bottom-level" tokens and then express the disjunction more abstractly but that doesn't solve the problem of dumping all of those bottom-level tokens when some input is unrecognized.

More generally, it's desirable to have control at a given point in the grammar over whether that point is represented as a literal in a message, as opposed to gathering up all the leaf nodes below it into a list.

(Ninja'd by @jgremmen )

parrt Mar 14, 2022
Maintainer

Which is why it reports FLOAT_LITERAL, not the grammatical structure on the right. I'm not going to alter the tool for this purpose. Sorry. Why not just change the token name to be "float literal" or something like that at runtime? There's an array of these names, right? Or just override the error handler. It's better to let people use the runtime API than it is to complicate the tool for all sorts of cases. This is the first request for this I've ever seen for ANTLR v4.
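
The runtime approach suggested here might look roughly like the sketch below. It is self-contained and does not use the ANTLR runtime: DisplayNames is a hypothetical class, the arrays are illustrative values for the tokens discussed in this thread (FLOAT_LITERAL is given an assumed type of 4), and the lookup order (display name, then literal, then symbolic name) is an assumption modeled on how the runtime's VocabularyImpl appears to resolve display names. With the real runtime, a similar array could presumably be passed as the displayNames argument of VocabularyImpl.

```java
// Hypothetical helper: resolves a user-facing name per token type and formats a message.
final class DisplayNames {
    private final String[] displayNames;
    private final String[] literalNames;
    private final String[] symbolicNames;

    DisplayNames(String[] displayNames, String[] literalNames, String[] symbolicNames) {
        this.displayNames = displayNames.clone();
        this.literalNames = literalNames.clone();
        this.symbolicNames = symbolicNames.clone();
    }

    private static String at(String[] names, int tokenType) {
        return tokenType >= 0 && tokenType < names.length ? names[tokenType] : null;
    }

    // Assumed fallback order: custom display name, literal, symbolic name, numeric type.
    String displayName(int tokenType) {
        String name = at(displayNames, tokenType);
        if (name == null) {
            name = at(literalNames, tokenType);
        }
        if (name == null) {
            name = at(symbolicNames, tokenType);
        }
        return name != null ? name : Integer.toString(tokenType);
    }

    // Builds "expecting A, B or C" from the expected token types.
    String expecting(int... expectedTypes) {
        StringBuilder message = new StringBuilder("expecting ");
        for (int i = 0; i < expectedTypes.length; i++) {
            if (i > 0) {
                message.append(i == expectedTypes.length - 1 ? " or " : ", ");
            }
            message.append(displayName(expectedTypes[i]));
        }
        return message.toString();
    }
}

final class DisplayNamesDemo {
    public static void main(String[] args) {
        // Index = token type; index 0 is unused.
        String[] display = { null, null, null, null, "float literal" };
        String[] literal = { null, null, "'<'", "'<='", null };
        String[] symbolic = { null, "NE", "LT", "LTE", "FLOAT_LITERAL" };
        DisplayNames names = new DisplayNames(display, literal, symbolic);
        System.out.println(names.expecting(4, 3)); // prints: expecting float literal or '<='
    }
}
```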
