Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(SIP-85): OAuth2 for databases #27631

Merged
merged 10 commits into from
Apr 3, 2024
Merged

feat(SIP-85): OAuth2 for databases #27631

merged 10 commits into from
Apr 3, 2024

Conversation

betodealmeida
Copy link
Member

@betodealmeida betodealmeida commented Mar 23, 2024

SUMMARY

This PR introduces a new table called database_user_oauth2_tokens. The table is used for storing personal user tokens associated with a given database:

# \d database_user_oauth2_tokens
                                               Table "public.database_user_oauth2_tokens"
         Column          |            Type             | Collation | Nullable |                         Default
-------------------------+-----------------------------+-----------+----------+---------------------------------------------------------
 created_on              | timestamp without time zone |           |          |
 changed_on              | timestamp without time zone |           |          |
 id                      | integer                     |           | not null | nextval('database_user_oauth2_tokens_id_seq'::regclass)
 user_id                 | integer                     |           | not null |
 database_id             | integer                     |           | not null |
 access_token            | bytea                       |           |          |
 access_token_expiration | timestamp without time zone |           |          |
 refresh_token           | bytea                       |           |          |
 created_by_fk           | integer                     |           |          |
 changed_by_fk           | integer                     |           |          |
Indexes:
    "database_user_oauth2_tokens_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "database_user_oauth2_tokens_changed_by_fk_fkey" FOREIGN KEY (changed_by_fk) REFERENCES ab_user(id)
    "database_user_oauth2_tokens_created_by_fk_fkey" FOREIGN KEY (created_by_fk) REFERENCES ab_user(id)
    "database_user_oauth2_tokens_database_id_fkey" FOREIGN KEY (database_id) REFERENCES dbs(id)
    "database_user_oauth2_tokens_user_id_fkey" FOREIGN KEY (user_id) REFERENCES ab_user(id)

Whenever a SQLAlchemy engine is instantiated, the personal user token (or None) will be passed to the get_url_for_impersonation method in the DB engine spec, so that a custom URL can be built for the user. For example, for GSheets:

def get_url_for_impersonation(
    cls,
    url: URL,
    impersonate_user: bool,
    username: str | None,
    access_token: str | None,  # <== here
) -> URL:
    if not impersonate_user:
        return url

    if username is not None:
        user = security_manager.find_user(username=username)
        if user and user.email:
            url = url.update_query_dict({"subject": user.email})

    if access_token:
        url = url.update_query_dict({"access_token": access_token})

    return url

The change allows users to login to databases like BigQuery, Snowflake, Dremio, Databricks, Google Sheets, etc. using their own credentials. This makes it easier to set up databases, since service accounts are no longer required, and provides better isolation of data between users. Only support for Google Sheets is implemented in this PR, and it's considered the reference implementation. Note that a newer version of Shillelagh is required, since a change in the Google Auth API introduced a regression.

In order to populate the table with personal access tokens, the DB engine spec checks for a specific exception that signals that OAuth2 should start:

class BaseEngineSpec:
    def execute(...) -> None:
        try:
            cursor.execute(query)
        except cls.oauth2_exception as ex:  # <== here
            if cls.is_oauth2_enabled() and g.user:
                cls.start_oauth2_dance(database_id)
            raise cls.get_dbapi_mapped_exception(ex) from ex
        except Exception as ex:
            raise cls.get_dbapi_mapped_exception(ex) from ex

When called, the start_oauth2_dance method will return the error OAUTH2_REDIRECT to the frontend. The error is captured by the ErrorMessageWithStackTrace component, which provides a link to the user so they can start the OAuth2 authentication. Since this is implemented at the DB engine spec level, any query will trigger it — in SQL Lab, Explore, or dashboards — see the screenshots below for the UX.

Note that while the current implementation triggers OAuth2 when a query needs authorization, we could also implement affordances in the database UI to manually trigger OAuth2 to store the personal access tokens. This could be done in the future.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

SQL Lab. Note that the query runs automatically once OAuth is completed:

SIP-85.Sql.Lab.mov

Explore. Note that the chart is automatically updated after OAuth:

SIP-85.Explore.mov

Same thing for dashboards:

SIP-85.Dashboard.mov

TESTING INSTRUCTIONS

  1. Create a Google Sheets database.
  2. Create a Google OAuth2 application at https://console.cloud.google.com/apis/credentials/oauthclient/ of type "Web application"
  3. Edit superset_config.py and add the client ID and secret:
DATABASE_OAUTH2_CREDENTIALS = {
    "Google Sheets": {
       "CLIENT_ID": "XXX.apps.googleusercontent.com",
       "CLIENT_SECRET": "GOCSPX-YYY",
    },
}
  1. In SQL Lab, try to query a sheet that is not shared publicly. It should trigger OAuth2.
  2. Add the sheet as a dataset and create a chart.
  3. Delete the tokens from the database:
DELETE FROM database_user_oauth2_tokens;
  1. Reload the chart. It should trigger OAuth2.
  2. Add the chart to a dashboard, delete the tokens, and reload the dashboard. It should trigger OAuth2.

ADDITIONAL INFORMATION

  • Has associated issue: [SIP-85] OAuth2 for databases #20300
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@github-actions github-actions bot added risk:db-migration PRs that require a DB migration api Related to the REST API labels Mar 23, 2024
Copy link

codecov bot commented Mar 23, 2024

Codecov Report

Attention: Patch coverage is 89.96416% with 28 lines in your changes are missing coverage. Please review.

Project coverage is 69.96%. Comparing base (883e455) to head (59d4daf).
Report is 28 commits behind head on master.

Files Patch % Lines
superset/db_engine_specs/base.py 68.29% 13 Missing ⚠️
superset/utils/lock.py 86.48% 5 Missing ⚠️
.../components/ErrorMessage/OAuth2RedirectMessage.tsx 92.10% 0 Missing and 3 partials ⚠️
superset/utils/oauth2.py 96.42% 2 Missing ⚠️
superset-frontend/src/setup/setupErrorMessages.ts 0.00% 1 Missing ⚠️
superset/databases/api.py 95.45% 1 Missing ⚠️
superset/db_engine_specs/gsheets.py 97.14% 1 Missing ⚠️
superset/models/core.py 93.75% 1 Missing ⚠️
superset/sql_lab.py 75.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #27631      +/-   ##
==========================================
+ Coverage   69.89%   69.96%   +0.07%     
==========================================
  Files        1911     1916       +5     
  Lines       75024    75377     +353     
  Branches     8355     8403      +48     
==========================================
+ Hits        52435    52741     +306     
- Misses      20539    20571      +32     
- Partials     2050     2065      +15     
Flag Coverage Δ
hive 48.98% <57.08%> (+0.04%) ⬆️
javascript 57.54% <89.74%> (+0.06%) ⬆️
mysql 77.76% <60.00%> (-0.15%) ⬇️
postgres 77.90% <59.58%> (-0.13%) ⬇️
presto 53.66% <57.91%> (+0.01%) ⬆️
python 83.20% <90.00%> (+0.05%) ⬆️
sqlite 77.34% <59.58%> (-0.12%) ⬇️
unit 57.18% <87.50%> (+0.42%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@betodealmeida betodealmeida marked this pull request as ready for review March 24, 2024 00:50
@betodealmeida betodealmeida requested a review from a team as a code owner March 24, 2024 00:50
@john-bodley john-bodley self-requested a review March 25, 2024 16:53
Copy link
Member

@mistercrunch mistercrunch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall seems solid. My most important point is probably around adding a database index on the new model, the rest are comments/notes.

superset/connectors/sqla/models.py Show resolved Hide resolved
superset/exceptions.py Show resolved Hide resolved
superset/utils/oauth2.py Show resolved Hide resolved
superset/db_engine_specs/hive.py Show resolved Hide resolved
@john-bodley
Copy link
Member

@betodealmeida do we perceive there could/would be other authorization frameworks other than OAuth 2.0? If so I was wondering if there was merit in renaming database_user_oauth2_tokens to be something more generic and adding additional column(s) which define said frameworks.

Copy link
Member

@craig-rueda craig-rueda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good so far! I think there's quite a few cases that could be tested. (Token delete/insert/ etc., the updated engine spec definitions, etc...)

@betodealmeida
Copy link
Member Author

@betodealmeida do we perceive there could/would be other authorization frameworks other than OAuth 2.0? If so I was wondering if there was merit in renaming database_user_oauth2_tokens to be something more generic and adding additional column(s) which define said frameworks.

@john-bodley I'm not sure, to be honest. The beauty of OAuth2 is that the same flow is shared across multiple providers, all you need is an access token and a refresh token, so the same foundation works for BigQuery/GSheets/Snowflake/Dremio/Databricks.

The one case I can think of is if at some point we'd want users to be able to input their own username/password, but I think that trying to address potential future uses would increase the complexity without clear benefits.

@betodealmeida
Copy link
Member Author

Looks good so far! I think there's quite a few cases that could be tested. (Token delete/insert/ etc., the updated engine spec definitions, etc...)

@craig-rueda I did test inserting/deleting the token in the API test. I'll add tests for the new DB engine spec methods.

superset/utils/oauth2.py Outdated Show resolved Hide resolved
superset/utils/oauth2.py Outdated Show resolved Hide resolved
Copy link
Member

@mistercrunch mistercrunch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments but overall LGTM. Stamping my approval but we may want another stamp from @villebro since he got deep in here already and seemed likely to push this further in the future.

@@ -542,6 +543,70 @@ The method `get_url_for_impersonation` updates the SQLAlchemy URI before every q

Alternatively, it's also possible to impersonate users by implementing the `update_impersonation_config`. This is a class method which modifies `connect_args` in place. You can use either method, and ideally they [should be consolidated in a single one](https://github.com/apache/superset/issues/24910).

### OAuth2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: feels like this belongs on the documentation website, probably as a new section "Connecting Users to Databases using OAuth2" under the "Installation and Configuration" section. This README is buried, maybe the guideline for using this readme would be for things that speak to developers working on the db_engine_specs package as opposed to admins looking to install/configure Superset. But the content is great! :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: personally I like to break lines at 80-100 in docs/md, but there's not standard enforced there at the moment, so fine either way.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, this README documents all the functionality of the DB engine specs and is targeting developers, explaining the methods that are needed. I'm happy to write additional docs about OAuth2 in the main website, I'll do that.

Copy link
Member

@villebro villebro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with a request for making it possible to define clients per database in the future, preferably via a dedicated client model.

@betodealmeida betodealmeida merged commit 9022f5c into master Apr 3, 2024
43 checks passed
jzhao62 pushed a commit to jzhao62/superset that referenced this pull request Apr 4, 2024
EandrewJones pushed a commit to UMD-ARLIS/superset that referenced this pull request Apr 5, 2024
EnxDev pushed a commit to EnxDev/superset that referenced this pull request Apr 12, 2024
@rusackas rusackas deleted the sip-85 branch April 16, 2024 16:52
qleroy pushed a commit to qleroy/superset that referenced this pull request Apr 28, 2024
@joaoferrao
Copy link

joaoferrao commented Aug 26, 2024

@betodealmeida I've been trying to adapt this to work with Trino but I'm a bit stuck. I got the OAuth2RedirectError to be raised and the obligatory tab_id, redirect_uri and oauth_url as well.

This happens just for opening the sqllab menu (Trino is the first selection so the

│ superset Traceback (most recent call last):                                                                                                                                                                                                       
│ superset   File "/app/superset/models/core.py", line 569, in get_raw_connection                                                                                                                                                                   
│ superset     yield conn                                                                                                                                                                                                                           
│ superset   File "/app/superset/models/core.py", line 678, in get_df                                                                                                                                                                               
│ superset     self.db_engine_spec.execute(cursor, sql_, self)                                                                                                                                                                                      
│ superset   File "/app/superset/db_engine_specs/base.py", line 1863, in execute                                                                                                                                                                    
│ superset     cls.start_oauth2_dance(database)                                                                                                                                                                                                     
│ superset   File "/app/superset/db_engine_specs/base.py", line 493, in start_oauth2_dance                                                                                                                                                          
│ superset     raise OAuth2RedirectError(oauth_url, tab_id, default_redirect_uri)                                                                                                                                                                   
│ superset superset.exceptions.OAuth2RedirectError: You don't have permission to access the data.                                                                                                                                                  

Buuut, the FE doesn't show any interaction, no new tabs, no request for permissions for the website missing, etc. It only shows "There was an error loading the schemas".

If I do try to execute a query, I get another (unrelated, hopefuly) error which I can't get around:

image

The question: any tip on what I could look into?

@betodealmeida
Copy link
Member Author

@joaoferrao yeah, I think this is because we run the query in a separate thread for Trino. I'm away from my computer right now and traveling tomorrow, do you want to hop on a call on Wednesday to discuss this?

@betodealmeida
Copy link
Member Author

Basically, I think we need this fix: #29981

@joaoferrao
Copy link

joaoferrao commented Aug 26, 2024

@joaoferrao yeah, I think this is because we run the query in a separate thread for Trino. I'm away from my computer right now and traveling tomorrow, do you want to hop on a call on Wednesday to discuss this?

Thanks for the offer, I dropped you an email.

Basically, I think we need this fix: #29981

I will take a look!

@joaoferrao
Copy link

joaoferrao commented Aug 29, 2024

@betodealmeida let me know if my email didn't reach you, sent it to the one you have on your profile.

In parallel, good news: I was able to make it work, but I will need some suggestions for a couple of points. I will open a PR and mention you, in case you have the time to provide your input.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api Related to the REST API preset-io risk:db-migration PRs that require a DB migration size/XXL
Projects
None yet
Development

Successfully merging this pull request may close these issues.