Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[mysql/Oracle] Grammar is ambiguous. #4279

Open
kaby76 opened this issue Oct 10, 2024 · 2 comments
Open

[mysql/Oracle] Grammar is ambiguous. #4279

kaby76 opened this issue Oct 10, 2024 · 2 comments

Comments

@kaby76
Copy link
Contributor

kaby76 commented Oct 10, 2024

I've ported the sql/mysql/Oracle grammar to CSharp. I found that the grammar is ambiguous.

$ dotnet trparse ../examples/expressions.sql --ambig | dotnet trtree -a
CSharp 0 ../examples/expressions.sql success 0.139722
(query (simpleStatement (selectStatement (queryExpression (queryExpressionBody (queryPrimary (querySpecification (SELECT_SYMBOL "select") (selectItemList (selectItem (expr (boolPri (predicate (bitExpr (bitExpr (bitExpr (simpleExpr (literalOrNull (literal (numLiteral (int64Literal (INT_NUMBER "1"))))))) (PLUS_OPERATOR "+") (bitExpr (bitExpr (simpleExpr (literalOrNull (literal (numLiteral (int64Literal (INT_NUMBER "2"))))))) (MULT_OPERATOR "*") (bitExpr (simpleExpr (literalOrNull (literal (numLiteral (int64Literal (INT_NUMBER "3"))))))))) (MINUS_OPERATOR "-") (bitExpr (simpleExpr (literalOrNull (literal (numLiteral (int64Literal (INT_NUMBER "4"))))))))))))))))))) (SEMICOLON_SYMBOL ";"))
(query (simpleStatement (selectStatement (queryExpression (queryExpressionBody (queryPrimary (querySpecification (SELECT_SYMBOL "select") (selectItemList (selectItem (expr (boolPri (predicate (bitExpr (bitExpr (bitExpr (bitExpr (simpleExpr (literalOrNull (literal (numLiteral (int64Literal (INT_NUMBER "1"))))))) (PLUS_OPERATOR "+") (bitExpr (simpleExpr (literalOrNull (literal (numLiteral (int64Literal (INT_NUMBER "2")))))))) (MULT_OPERATOR "*") (bitExpr (simpleExpr (literalOrNull (literal (numLiteral (int64Literal (INT_NUMBER "3")))))))) (MINUS_OPERATOR "-") (bitExpr (simpleExpr (literalOrNull (literal (numLiteral (int64Literal (INT_NUMBER "4"))))))))))))))))))) (SEMICOLON_SYMBOL ";"))

(The main reason I did the CSharp port was to check for ambiguities in this grammar. All tools in Trash are in C#.)

Another example comes from the MySQL Reference manual https://dev.mysql.com/doc/refman/8.4/en/select.html.

$ dotnet trparse --ambig -i "SELECT CONCAT(last_name,', ',first_name) AS full_name FROM mytable ORDER BY full_name;" | dotnet trtree
-a
CSharp 0 string success 0.2263846
(queries (query (simpleStatement (selectStatement (queryExpression (queryExpressionBody (queryPrimary (querySpecification (SELECT_SYMBOL "SELECT") (selectItemList (selectItem (expr (boolPri (predicate (bitExpr (simpleExpr (functionCall (pureIdentifier (IDENTIFIER "CONCAT")) (OPEN_PAR_SYMBOL "(") (udfExprList (udfExpr (expr (boolPri (predicate (bitExpr (simpleExpr (columnRef (fieldIdentifier (qualifiedIdentifier (identifier (pureIdentifier (IDENTIFIER "last_name")))))))))))) (COMMA_SYMBOL ",") (udfExpr (expr (boolPri (predicate (bitExpr (simpleExpr (literalOrNull (literal (textLiteral (textStringLiteral (SINGLE_QUOTED_TEXT "', '"))))))))))) (COMMA_SYMBOL ",") (udfExpr (expr (boolPri (predicate (bitExpr (simpleExpr (columnRef (fieldIdentifier (qualifiedIdentifier (identifier (pureIdentifier (IDENTIFIER "first_name"))))))))))))) (CLOSE_PAR_SYMBOL ")"))))))) (selectAlias (AS_SYMBOL "AS") (identifier (pureIdentifier (IDENTIFIER "full_name")))))) (fromClause (FROM_SYMBOL "FROM") (tableReferenceList (tableReference (tableFactor (singleTable (tableRef (qualifiedIdentifier (identifier (pureIdentifier (IDENTIFIER "mytable"))))))))))))) (orderClause (ORDER_SYMBOL "ORDER") (BY_SYMBOL "BY") (orderList (orderExpression (expr (boolPri (predicate (bitExpr (simpleExpr (columnRef (fieldIdentifier (qualifiedIdentifier (identifier (pureIdentifier (IDENTIFIER "full_name"))))))))))))))))) (SEMICOLON_SYMBOL ";")) (EOF ""))
(queries (query (simpleStatement (selectStatement (queryExpression (queryExpressionBody (queryPrimary (querySpecification (SELECT_SYMBOL "SELECT") (selectItemList (selectItem (expr (boolPri (predicate (bitExpr (simpleExpr (functionCall (qualifiedIdentifier (identifier (pureIdentifier (IDENTIFIER "CONCAT")))) (OPEN_PAR_SYMBOL "(") (exprList (expr (boolPri (predicate (bitExpr (simpleExpr (columnRef (fieldIdentifier (qualifiedIdentifier (identifier (pureIdentifier (IDENTIFIER "last_name"))))))))))) (COMMA_SYMBOL ",") (expr (boolPri (predicate (bitExpr (simpleExpr (literalOrNull (literal (textLiteral (textStringLiteral (SINGLE_QUOTED_TEXT "', '")))))))))) (COMMA_SYMBOL ",") (expr (boolPri (predicate (bitExpr (simpleExpr (columnRef (fieldIdentifier (qualifiedIdentifier (identifier (pureIdentifier (IDENTIFIER "first_name")))))))))))) (CLOSE_PAR_SYMBOL ")"))))))) (selectAlias (AS_SYMBOL "AS") (identifier (pureIdentifier (IDENTIFIER "full_name")))))) (fromClause (FROM_SYMBOL "FROM") (tableReferenceList (tableReference (tableFactor (singleTable (tableRef (qualifiedIdentifier (identifier (pureIdentifier (IDENTIFIER "mytable"))))))))))))) (orderClause (ORDER_SYMBOL "ORDER") (BY_SYMBOL "BY") (orderList (orderExpression (expr (boolPri (predicate (bitExpr (simpleExpr (columnRef (fieldIdentifier (qualifiedIdentifier (identifier (pureIdentifier (IDENTIFIER "full_name"))))))))))))))))) (SEMICOLON_SYMBOL ";")) (EOF ""))

Ambiguity is a problem because it causes performance issues for Antlr4.

@mike-lischke
Copy link
Member

I'm surprised you didn't find more ambiguities. The language itself is ambiguous, and I did my best to resolve it.

@kaby76
Copy link
Contributor Author

kaby76 commented Nov 6, 2024

I'm surprised you didn't find more ambiguities. The language itself is ambiguous, and I did my best to resolve it.

Yes, the grammar has a bit of ambiguity. The information I gave are the ambiguous parse trees for one specific decision in the DFA. But, there are others, including these rules.

bitExpr
boolPri
functionCall
grantIdentifier
insertQueryExpression
leadLagInfo
literalOrNull
lvalueVariable
optionValueNoOptionType
qualifiedIdentifier
queryExpressionBody
queryExpressionParens
selectStatement
signedLiteralOrNull
simpleExpr
tableReference
tableReferenceListParens
textLiteral
tsDataFileName

This information is computed via ls *.sql | trperf -x -c ar | grep -v '^0' | awk '{print $2}' | sort -u on every commit to the repo. The -c option is used to set what to display. Here, 'a' of 'ar' means the # of ambiguities, and 'r' of 'ar' the rule name, all values for a particular decision state in the DFA.

trparse --ambig constructs all possible ambiguity trees, for every decision state that is ambiguous. It fails if there are just too many trees to compute, as in the case with trparse --ambig bitrix_queries_cut.sql. Instead, I try to first pick a decision state with a low ambiguity, and then display the tree for that decision. The issue is that some input hit the same problem over and over. See how I apply it in this comment.

I haven't really checkout out SLL fallback problems. It turns out to be a huge problem with the PostgreSQL grammar. Antlr can have its issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants