-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Postgresql] Grammar is not idiomatic ANTLR #4291
Comments
Unfortunately, what gets implemented often isn't checked very well. The empty-alternative seems to be an easy thing to detect and fix via script. I'll write a script to detect this and add it to the build. I'll also write a script to fix the problem. This stuff should not be done by hand. The other problem is an ambiguity. It is detected in my recent changes for detecting ambiguity over the grammars in the repo. This script finds the smallest example in the test suite with the ambiguity.
But, as you can see from the ambig.txt file mentioned in #4276, there's a lot of ambiguity in this grammar. It's not clear exactly where this grammar came from, but https://github.com/postgres/postgres has many .y grammars, including src/backend/parser/gram.y, but it doesn't look like this grammar. But, it may have been from an earlier version. Again, there should be links to the exact versions of the files that were the source for this Antlr grammar. |
@kaby76 Thanks! I'd love to see what your script is able to make of this. |
Update... This is an initial version of the Bash/Trash script to detect empty-alternatives in a rule, e.g.,
The script to delete the empty-alternative is a little harder to write because the script must determine which "OR" (i.e.,
The fix script does change the grammar, refactoring it apparently so that the second application of the "detect.sh" script above doesn't find any empty-alternatives. I have tested it, and all example files parse, but there are still a few problems.
|
Wow, that |
When one defines an "expression," usually it's done as an ordered list of simple, recursive alternatives. That way, operator precedence can be implemented easily. But it's not done that way because the rule originally came from nonterminal PLpgSQL_Expr in gram.y. PLpgSQL_Expr is used in a couple of places, but it doesn't check whether the expression is actually empty, e.g., in an assignment statement. But at this point, the order of sub-expressions is now baked into the language, and any other order would be considered a "syntax error." I.e., "FROM table" must come before "WHERE constraint." Also, the names of the nonterminals are terrible. from_clause has two alts, the first not deriving empty, and the second being empty. But the name is not "opt_from_clause". In comparison, opt_sort_clause has the same structure, but it is named with prefix "opt_." The naming is carried over to the Antlr grammar with from_clause and opt_sort_clause. Seeing that gram.y is the "authoritative" source for the syntax for Postgres, how much should the Antlr grammar deviate from that? |
Update...
|
Looks pretty good overall, though it would appear that the script to rename This would help, at least a little bit, with taming the unwieldy monster that is this grammar and its generated parser. (Close to 100,000 lines of C# code!) |
Check out these four PRs: #4299 #4301 #4303 #4307. The renaming is in #4299. They should all be completely independent changes from each other.
What you want is an "unfold" of rules that have a RHS of one symbol. This is easy to do with Trash, but would require a little extra programming to test for the condition. It would also have to delete the now-unused rules, and likely remove useless parentheses because the unfold code adds parentheses at the point of the unfold ( The plan is to continue to remove all the ambiguity in the grammar. I don't think that will fix the performance issues in the grammar because there are fallbacks to full context (SLL, "strong LL"). I can fix this with trclonereplace, but I will need to play around with the points of where to perform the rule cloning. I don't want to force this grammar to SLL using an Antlr
Unfortunately, this is SQL, and all SQL grammars are just crazy. For sure, unfolding will help, but if I add rule duplication to work around SLL slowness, the parser size will explode. I never understood why the LL-parser structure just isn't encoded as a table. Also, I'm not happy with the Floyd-style implementation of the operator precedence. It should be in a more natural Antlr style. That would make the parser C# code much smaller as well. |
Please check the current version of the grammar. I made a few changes, including some for this. If okay then close this issue. https://github.com/antlr/grammars-v4/tree/5ed2a3a3b9c2944a0fc60eaee26ccc1da2e7e26b/sql/postgresql It doesn't have useless productions removed yet, but will fix that and other issues sometime this week. |
Update... The postgresql grammar has been trimmed back to a state equivalent to the Bison grammar for the PostgreSQL server. The grammar is now much, much faster (5x speed up of group parsing of the ~200 .sql files in the examples/ directory, from 42s to now 7s). But, function bodies are not parsed--not that they were the cause of the bad performance, adding only 3s to the overall parse of the ~200 .sql files. The problem was the ambiguity added by merging the grammar for pl/pgsql into the postgresql grammar. The plan is to re-implement that in a way that does not kill performance. #2849 |
This looks very nice! A few notes:
This could be written more simply as:
You can get rid of half of these alternates by using
Likewise, the entire
Also:
|
If I had to guess, it looks like this Postgres parser grammar was produced by someone taking a grammar designed for a different parser generator and translating it as literally as possible. But it's got a lot of content that is very bad ANTLR.
For example, you see this pattern a lot:
Having a rule end in
| ;
means it will always be considered a valid match, even if it contains no content. This is painful to work with, because now in the visitor, on a rule that contains this as a sub-rule, you can't simply sayif (context.from_clause() != null)
to see if you have a real match; the code to check for it is significantly messier.The idomatic way to do this in ANTLR is to define it as:
and then have any rule that uses it invoke it as
from_clause?
.Also, some things are just really weird. For example, the following:
Unless I'm overlooking some crucial detail, the
non_ansi_join
rule here is entirely superfluous because the alternative branch offrom_list
is a strict superset of it. Again, this feels like it was translated overly-literally from some other parser generator's grammar.Would it be possible to clean this grammar up a bit?
The text was updated successfully, but these errors were encountered: