Porting a flex lexer to antlr... #4335

petasis · 2023-06-26T14:17:30Z

petasis
Jun 26, 2023

Hi all,

I have a lex/flex scanner, that I want to re-implement in python (and perhaps in C++) through antlr4.
I have been working on this project for almost a week now, I have some success (i.e. I am close to pass the first test of the original implementation), but there are some features in lex, that are difficult to port:

The {n,m} operator, that repeats a token from n to m times. The solution is to define a recursive rule, like:

fragment DIGIT_1_3                  : [0-9] ([0-9] [0-9]?)? ;
fragment DIGIT_1_4                  : [0-9] ([0-9] ([0-9] [0-9]?)?)? ;
fragment DIGIT_2_4                  : [0-9] [0-9] ([0-9] [0-9]?)? ;

The original implementation uses them quite frequently, and the conversion is slow.

I am using a lot of positive lookahead. This is by far my most puzzling issue. I don't know how to handle this. I have seen that these are solved through semantic actions, and for the time being I am postponing these, through a class method call:

TOKEN_EM1 : EMOTICONS {LOOKAHEAD("({TOKEN_END_EOF}|{EMOTICONS}|{APOSTROPH})")}? ;

The initial thought was to use python's regular expressions to handle these (thus I have the "{}" notation, to substitute them in python with actual regular expressions), but I am not so sure.
Lex/flex allow the full use of the lexer in lookahead, and since the lookahead patterns may be "complex" (i.e. resulting in a fairly large regular expression), somehow I feel I need to recursively apply the antlr lexer (somehow).

There are a couple of cases, where I intentionally apply the lex/flex scanner on the text matched by the lexer.

Is there a better solution for any of the issues above?
A way to handle lex/flex style lookahead, I don't know about?

jimidle · 2023-06-27T03:44:15Z

jimidle
Jun 27, 2023

The ANTLR lexer will select the longest token and should not need the lookahead at all - just try ignoring them. However, we would need to see the flex definition and the lexer grammar you have so far.

Also, try to avoid being too specific in the lexer. After all, what are you going to do error wise with DIGIT_1_3? You lexer rules are overlapping. Just have a NUMBER: [0-9]+ ; and move the error detection as high as possible up the information tree. So, check the number of digits in your tree walk, where you can give a good semantic error about the wrong number of digits (or range, or whatever).

Don't let your lexer have any errors by having a final rule:

NUMBER: [0-9]+ :
ERRC: . ;

Which will move invalid characters up to the level of a syntax error at the ERRC token that is generated.

Also remember that lexing is not driven by the parser. The lexer runs first and creates ALL the tokens. The parser then runs against the tokens that have already been created.

I think that your issues are arising from trying to do too much in the lexer and trying to copy the flex/bison grammar too strictly. As I say, be looser in what you accept and detect and report errors higher up the chain.

0 replies

petasis · 2023-06-27T18:44:06Z

petasis
Jun 27, 2023
Author

Thank you for your comments. What I have not made very obvious, is that there is no bison file. There is no grammar on top of the lexer. The initial flex implementation is just a scanner that chops input into tokens.
There is no language I want to implement, I am tokenising natural language.

Thus, there are points in the input that the lexer must take a decision, like "I found a period. Is it a sentence end, is it an abbreviation, or is it a numbered list it?". The current flex scanner uses lookahead for this, which can use the rule fragments to match what is following (and take a decision).

I have reached to the point that everything works, except in the decision points that need lookahead. And because right now my semantic action method just returns true, it makes wrong decisions.

Right now, my antlr grammar is a very simple one: all lexer rules in a huge OR single rule (with an action on each token). I am not sure I can do anything at the rule level, somehow I feel that its the lexer that needs to choose the correct tokens.

I am still trying to find a solution how to re-use my lexer rule fragments in the semantic actions.

0 replies

petasis · 2023-06-28T19:14:49Z

petasis
Jun 28, 2023
Author

I implemented a way to do lookahead, through a second lexer that implements the lookahead rules as tokens.

Now, the problem is, that lookahead match is not counted in matched token length (as lex), and antlr simply selects other rules, that are wrong...

1 reply

lppedd Nov 23, 2023

So the only way to implement lookahead is through instantiating a new lexer and passing it a stream chunk? Damn that's bad...

Could you post a small snippet just to get the general idea? I'm coming from JFlex and ANTLR is a PITA for lexing.

petasis · 2023-07-05T12:09:49Z

petasis
Jul 5, 2023
Author

Now that I have largely done the port from lex to antlr, and at least the tokenisation test suit passes, I did a comparison of time needed to process the test suite.

The original test suite was gnu flex interfacing to tcl. It needs 0.27 seconds.
The antlr grammar was used to generate a python parser. For the same test suite it needs 14.8 seconds (yes, its unusable).
The antlr grammar was used to generate a C++ parser, which interfaces to python through boost::python. For the same test suite it needs 1.27 seconds.

The test suit involves unicode (greek) chars and emojis. And flex seems blazing fast.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Porting a flex lexer to antlr... #4335

{{title}}

Replies: 4 comments 1 reply

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Porting a flex lexer to antlr... #4335

petasis Jun 26, 2023

Replies: 4 comments · 1 reply

jimidle Jun 27, 2023

petasis Jun 27, 2023 Author

petasis Jun 28, 2023 Author

lppedd Nov 23, 2023

petasis Jul 5, 2023 Author

petasis
Jun 26, 2023

Replies: 4 comments 1 reply

jimidle
Jun 27, 2023

petasis
Jun 27, 2023
Author

petasis
Jun 28, 2023
Author

petasis
Jul 5, 2023
Author