Replies: 4 comments 1 reply
-
The ANTLR lexer will select the longest token and should not need the lookahead at all - just try ignoring it. However, we would need to see the flex definition and the lexer grammar you have so far. Also, try to avoid being too specific in the lexer. After all, what are you going to do, error-wise, with DIGIT_1_3? Your lexer rules are overlapping. Don't let your lexer raise any errors: keep the rules general, e.g. NUMBER: [0-9]+ ;, and add a final catch-all rule ERRC: . ;, which moves invalid characters up to the level of a syntax error at the generated ERRC token (see the sketch below). Also remember that lexing is not driven by the parser. The lexer runs first and creates ALL the tokens; the parser then runs against the tokens that have already been created. I think your issues arise from trying to do too much in the lexer and from copying the flex/bison grammar too strictly. As I say, be looser in what you accept, and detect and report errors higher up the chain.
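For example, a minimal lexer grammar of that shape (the rule names besides NUMBER and ERRC are just illustrative):

```
lexer grammar LooseLexer;

// Be general: the ANTLR lexer always picks the longest match,
// and ties are broken by rule order.
NUMBER : [0-9]+ ;
WORD   : [a-zA-Z]+ ;
WS     : [ \t\r\n]+ -> skip ;

// Final catch-all: any character no other rule matched becomes an
// ERRC token, so bad input surfaces as a syntax error downstream
// instead of a lexer error.
ERRC   : . ;
```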
-
Thank you for your comments. What I have not made very obvious is that there is no bison file, and no grammar on top of the lexer. The initial flex implementation is just a scanner that chops input into tokens. Thus, there are points in the input where the lexer must take a decision, like "I found a period. Is it a sentence end, is it an abbreviation, or is it a numbered list?". The current flex scanner uses lookahead for this, which can use the rule fragments to match what follows (and take a decision). I have reached the point where everything works except at the decision points that need lookahead, and because my semantic action method currently just returns true, it makes wrong decisions. Right now my ANTLR grammar is a very simple one: all lexer rules in a single huge OR rule (with an action on each token). I am not sure I can do anything at the rule level; somehow I feel that it's the lexer that needs to choose the correct tokens. I am still trying to find a way to re-use my lexer rule fragments in the semantic actions.
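One direction that might work for the period example is a semantic predicate whose helper peeks at the upcoming characters without consuming them. A minimal sketch, assuming the Python target (the helper and rules below are made up for illustration, not the real grammar):

```
lexer grammar PeriodLexer;

@lexer::members {
def sentence_ends_here(self):
    # Peek at the two characters after the '.' without consuming them.
    c1, c2 = self._input.LA(1), self._input.LA(2)
    return c1 == ord(' ') and c2 != -1 and chr(c2).isupper()
}

// '.' ends a sentence only when followed by " X" (space + uppercase);
// otherwise it falls through to the plain DOT rule.
SENT_END : '.' {self.sentence_ends_here()}? ;
DOT      : '.' ;
WORD     : [a-zA-Z]+ ;
WS       : [ \t\r\n]+ -> skip ;
```

The predicate is evaluated during matching, so SENT_END only wins when the helper returns True, and the peeked characters stay in the stream.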
-
I implemented a way to do lookahead, through a second lexer that implements the lookahead rules as tokens. Now the problem is that the lookahead match is not counted in the matched token length (as it is in lex), and ANTLR simply selects other rules, which are wrong...
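One possible workaround is to match the trailing context inside the rule, so that it does count toward the longest match, and then rewind the input in the action so the lookahead characters are re-lexed. A sketch, assuming the Python target (the rule itself is hypothetical):

```
// Hypothetical rule: digits are a list number only when '.' + space follow.
// The '.' and ' ' are matched too, so they count for longest-match
// disambiguation; the action then gives the space back to the stream.
LIST_NUM : [0-9]+ '.' ' ' { self._input.seek(self._input.index - 1) } ;
```

This relies on the action running before the token is emitted, so the emitted LIST_NUM stops at the dot and the space is lexed again. As far as I can tell that holds in the current Python runtime, but it is a hack, not a documented feature.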
-
Now that I have largely done the port from lex to ANTLR, and at least the tokenisation test suite passes, I compared the time needed to process the test suite.
The test suite involves Unicode (Greek) characters and emojis, and flex seems blazing fast in comparison.
-
Hi all,
I have a lex/flex scanner that I want to re-implement in Python (and perhaps in C++) through ANTLR4.
I have been working on this project for almost a week now and have had some success (i.e. I am close to passing the first test of the original implementation), but there are some lex features that are difficult to port: definitions (the {name} substitution mechanism) and trailing-context lookahead.
The original implementation uses them quite frequently, and the conversion is slow.
My initial thought was to use Python's regular expressions to handle these (hence the "{}" notation, which I substitute in Python with actual regular expressions), but I am not so sure about it.
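To illustrate with a made-up definition: the closest ANTLR analogue to a flex definition that I have found is a fragment rule, although ANTLR4 has no {m,n} quantifier, so bounded repetition must be spelled out:

```
// flex:
//   DIGIT    [0-9]
//   NUMBER   {DIGIT}{1,3}
// ANTLR4 equivalent:
fragment DIGIT : [0-9] ;
NUMBER : DIGIT DIGIT? DIGIT? ;
```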
Lex/flex allows the full power of the lexer in lookahead, and since the lookahead patterns may be "complex" (i.e. they result in a fairly large regular expression), I feel I somehow need to apply the ANTLR lexer recursively.
Is there a better solution for any of the issues above?
Is there a way to handle lex/flex-style lookahead that I don't know about?