Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore(sql_parse): Provide more meaningful SQLGlot errors #27858

Conversation

john-bodley
Copy link
Member

@john-bodley john-bodley commented Apr 2, 2024

SUMMARY

#26767 replaced the non-validating sqlparse parser with the pseudo-validating SQLGlot parser to aid with SQL parsing. The main use case for SQL parsing—from a security perspective—is to identify/extract the underlying tables/views referenced in the statement and checking whether the user is authorized to access them.

Previously, given the non-validating nature of sqlparse, virtually all SQL statements could be parsed—irrespective of whether they were valid—which meant that any syntax errors were reported by the underlying engine in a pseudo-meaningful manner. Now however, given that SQLGlot is more opinionated, syntax errors are detected by SQLGlot during the table/view extraction phase meaning that the query is never run and thus the engine's parser is never run resulting in a non-actionable error, i.e., it merely stated that the SQL could not be parsed. This is especially problematic in SQL Lab where freeform SQL is the norm and can be rather lengthy.

The TL;DR is now SQL statements which contain a syntax error are likely going to be captured by SQLGlot upstream of being executed by the engine.

This PR proposes leveraging the messaging from SQLGlot's ParseError or TokenError (example) to provide a more meaningful (and thus actionable) error message to the user.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

PREVIOUS

Prior to #26767 sqlparse was able to "parse" the SQL statement—missing an AND after the key = 'foobar' clause—which was then executed by the Trino engine (given that the user had access to the datainfra.one_row table) resulting in the following Trino error,

Screenshot 2024-04-02 at 10 47 11 AM

which detected the error was around the 1 token.

CURRENT

After #26767 SQLGlot was (rightfully) unable to parse the SQL statement and thus could not extract the underlying tables which halted the execution flow. The resulting error message isn't overly helpful as it doesn't inform the user where the actual syntax error is.

Screenshot 2024-04-02 at 12 09 58 PM

PROPOSED

Leveraging SQLGlot's ParseError et al. exceptions to provide a more meaningful error message,

Screenshot 2024-04-02 at 3 42 47 PM

which correctly detected the error was around the 1 token. The messaging is loosely based on how MySQL presents errors.

TESTING INSTRUCTIONS

CI.

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@john-bodley john-bodley changed the title chore: Provide more meaningful SQLGlot error chore: Provide a more meaningful SQLGlot error Apr 2, 2024
@john-bodley john-bodley force-pushed the john-bodley--fix-sqlglot-error-messaging branch from a2d566e to 3f7cee3 Compare April 2, 2024 22:12
raise SupersetSecurityException(
SupersetError(
error_type=SupersetErrorType.QUERY_SECURITY_ACCESS_ERROR,
message=__(f"Unable to parse SQL ({dialect}): {self.sql}"),
message=__(f"You may have an error in your SQL syntax. {suffix}"),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I included the work may as SQLGlot may result in false positives whilst parsing. See here as an example.


if isinstance(ex, ParseError):
error = ex.errors[0]
suffix = "Check the {dialect}manual for the right syntax to use near '{highlight}' at line {line}".format( # pylint: disable=consider-using-f-string,line-too-long
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simply doing str(ex) doesn't work as SQLGlot includes ANSI codes which are helpful when printing in the terminal, but simply render as [4m and [0m in HTML.

@john-bodley john-bodley force-pushed the john-bodley--fix-sqlglot-error-messaging branch from 3f7cee3 to d70dcfa Compare April 2, 2024 22:26
@john-bodley john-bodley force-pushed the john-bodley--fix-sqlglot-error-messaging branch from d70dcfa to 3593acf Compare April 2, 2024 22:44
@mistercrunch
Copy link
Member

It feels to me that the SQL IDE should really just pass the SQL to the underlying connection and return the results or error message in most cases, and not get in the way of things. Now I'm wondering if _extract_tables_from_sql needs to run at all times or only if there are schema/tables restrictions in place (?)

@john-bodley
Copy link
Member Author

john-bodley commented Apr 2, 2024

@mistercrunch that was the other option I considered and had discussed internally, i.e., the error being somewhat context aware. Rather than SQLGlot's error being wrapped in a SupersetSecurityException it would simply bubble up and be handled in a context aware case-by-case basis.

Thus in the context of SQL Lab—which doesn't leverage a cache and where institutions may have a third party security manager, e.g. Apache Ranger—then allowing the SqlglotError error to pass through to the engine (irrespective of if the user has access to the underlying tables/views) is likely rather safe—especially given that the engine's parser will most likely detect the error.

The challenge arrises when:

  1. There is no underlying authorization checks at the engine level.
  2. SQLGlot returns a false positive (example) which can be exploited by an actor if (1) is true.

Furthermore the current implement of raise_for_access() isn't context aware. Given the enumerated challenges it might be best for individual institutions to augment their security manager as they see fit. For example at Airbnb if we feel that either,

  1. SQLGlot has a non-acceptable false positive rate, or
  2. SQLGlot's error messaging is too terse (when compared with the underlying engine)

then we may augment our security manager to simply not raise when the raise_for_access() method results in a SqlglotError when invoked from SQL Lab (leveraging either the referrer or call stack) as we have the peace of mind that Apache Ranger will prevent unauthorized access.

@john-bodley john-bodley changed the title chore: Provide a more meaningful SQLGlot error chore: Provide more meaningful SQLGlot errors Apr 2, 2024
@john-bodley john-bodley force-pushed the john-bodley--fix-sqlglot-error-messaging branch 2 times, most recently from 74cce94 to 2f8d080 Compare April 2, 2024 23:28
@john-bodley john-bodley changed the title chore: Provide more meaningful SQLGlot errors chore(sql_parse_: Provide more meaningful SQLGlot errors Apr 3, 2024
@john-bodley john-bodley changed the title chore(sql_parse_: Provide more meaningful SQLGlot errors chore(sql_parse): Provide more meaningful SQLGlot errors Apr 3, 2024
@betodealmeida
Copy link
Member

Now I'm wondering if _extract_tables_from_sql needs to run at all times or only if there are schema/tables restrictions in place.

That's a good idea! Part of the problem today is that _extract_tables_from_sql is called by simply accessing the .tables property in the parsed query, we could make it more explicit so it's only called whenever really needed and not by mistake.

Copy link
Member

@betodealmeida betodealmeida left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

superset/sql_parse.py Outdated Show resolved Hide resolved
@john-bodley john-bodley force-pushed the john-bodley--fix-sqlglot-error-messaging branch from 4544dde to f3219c7 Compare April 3, 2024 03:12
@john-bodley john-bodley force-pushed the john-bodley--fix-sqlglot-error-messaging branch from f3219c7 to 43f6f84 Compare April 3, 2024 03:47
Copy link
Member

@michael-s-molina michael-s-molina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow! That's what I call a well-defined PR description 👏🏼

Thanks for the improvement and explanations.

@john-bodley john-bodley merged commit c385297 into apache:master Apr 3, 2024
40 checks passed
@john-bodley john-bodley deleted the john-bodley--fix-sqlglot-error-messaging branch April 3, 2024 14:11
@michael-s-molina michael-s-molina added v4.0 Label added by the release manager to track PRs to be included in the 4.0 branch v3.1 Label added by the release manager to track PRs to be included in the 3.1 branch labels Apr 3, 2024
michael-s-molina pushed a commit that referenced this pull request Apr 3, 2024
michael-s-molina pushed a commit that referenced this pull request Apr 3, 2024
jzhao62 pushed a commit to jzhao62/superset that referenced this pull request Apr 4, 2024
EandrewJones pushed a commit to UMD-ARLIS/superset that referenced this pull request Apr 5, 2024
michael-s-molina pushed a commit that referenced this pull request Apr 8, 2024
EnxDev pushed a commit to EnxDev/superset that referenced this pull request Apr 12, 2024
qleroy pushed a commit to qleroy/superset that referenced this pull request Apr 28, 2024
@mistercrunch mistercrunch added 🍒 3.1.3 🍒 4.0.1 🍒 4.0.2 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels labels Jul 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels size/S v3.1 Label added by the release manager to track PRs to be included in the 3.1 branch v4.0 Label added by the release manager to track PRs to be included in the 4.0 branch 🍒 3.1.3 🍒 4.0.1 🍒 4.0.2
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants