feat(CLI command): Apache Superset "Factory Reset" CLI command #27207 #27221

Merged

Conversation

mknadh
Contributor

@mknadh mknadh commented Feb 23, 2024

feat(CLI command): Apache Superset "Factory Reset" CLI command #27207

SUMMARY

Over time, Apache Superset instances can accumulate large amounts of data, including charts, dashboards, saved queries, and other artifacts. There might be scenarios where users want to start fresh with a clean slate, removing all existing data.

Testing and Development: Developers and administrators often require a quick and efficient way to reset Apache Superset instances to a default state for testing purposes or when setting up development environments.

Data Privacy and Security: In some cases, there might be sensitive or confidential data stored within Apache Superset that needs to be wiped out completely to ensure data privacy and security compliance.

What is the proposal?

Implement a Factory Reset CLI command: Develop a CLI command that allows users to trigger a factory reset of Apache Superset. This command should be designed to delete all existing data, including charts, dashboards, saved queries, databases, and other related artifacts.

Authorization and Confirmation: Ensure that the CLI requires appropriate authorization to prevent unauthorized access. Additionally, consider implementing a confirmation mechanism to prevent accidental data loss.

Documentation and Best Practices: Provide comprehensive documentation on how to use the Factory Reset CLI, along with best practices for when and how to perform a reset. Include warnings about data loss and the irreversible nature of the operation.

Error Handling and Logging: Implement robust error handling mechanisms within the CLI to handle any unexpected errors gracefully. Additionally, log all reset operations for auditing purposes.

Integration with Configuration Management: Optionally, provide integration with configuration management tools or scripts to automate the process of resetting Apache Superset instances in a controlled and reproducible manner.

ADDITIONAL INFORMATION

  • [x] Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@github-actions github-actions bot added the api Related to the REST API label Feb 23, 2024
@mknadh mknadh marked this pull request as ready for review February 23, 2024 04:13
@john-bodley
Member

Thanks @mknadh for the PR. Any reason you didn't opt for a CLI command for removing all assets? Additionally, if you're after a factory reset, why not just drop all the tables in the database and start afresh?

@michael-s-molina michael-s-molina requested a review from a team February 23, 2024 17:59
@mistercrunch
Member

mistercrunch commented Feb 23, 2024

Airflow used to have a "convenient" resetdb CLI subcommand, as in airflow resetdb (maybe it still does? I think we removed it). I thought it was a good idea early on to make my dev life easier to reset environments. Hell I think it even asked for confirmation. Well I changed my mind pretty quickly when someone internally at one of the companies I worked at used it against the wrong environment (PROD).

My take on this would be to document how to do it more at the DBA level -> point to a new schema, init that db, ...

codecov bot commented Feb 23, 2024

Codecov Report

Attention: Patch coverage is 0% with 79 lines in your changes missing coverage. Please review.

Project coverage is 83.69%. Comparing base (76d897e) to head (5a12441).
Report is 405 commits behind head on master.

| Files | Patch % | Lines |
| superset/commands/security/reset.py | 0.00% | 50 Missing ⚠️ |
| superset/cli/reset.py | 0.00% | 29 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff             @@
##           master   #27221       +/-   ##
===========================================
+ Coverage   60.48%   83.69%   +23.20%     
===========================================
  Files        1931      521     -1410     
  Lines       76236    37488    -38748     
  Branches     8568        0     -8568     
===========================================
- Hits        46114    31376    -14738     
+ Misses      28017     6112    -21905     
+ Partials     2105        0     -2105     
| Flag | Coverage Δ |
| hive | 48.98% <0.00%> (-0.18%) ⬇️ |
| javascript | ? |
| mysql | 77.00% <0.00%> (?) |
| postgres | 77.10% <0.00%> (?) |
| presto | 53.60% <0.00%> (-0.20%) ⬇️ |
| python | 83.69% <0.00%> (+20.20%) ⬆️ |
| sqlite | 76.58% <0.00%> (?) |
| unit | 59.64% <0.00%> (+2.02%) ⬆️ |

Member

@craig-rueda craig-rueda left a comment

Looks good for local testing, but probably want to lock this down by default, and allow Superset admins to opt-in

    except ValidationError as error:
        return self.response_400(message=error.messages)
    try:
        if item.get("all") or item.get("datasets"):
Member

Can we make a command for this?

@@ -221,6 +222,7 @@ def init_views(self) -> None:
        appbuilder.add_api(QueryRestApi)
        appbuilder.add_api(ReportScheduleRestApi)
        appbuilder.add_api(ReportExecutionLogRestApi)
        appbuilder.add_api(ResetRestApi)
Member

Can we put this add behind a boolean config that defaults to False? Although useful for local dev work, this feels a little dangerous to have enabled by default.

Contributor Author

Sure, this is a good idea. I will add it behind a feature flag that disables this API by default. That way, no one can run it against PROD unless the flag is explicitly enabled.

CC: @john-bodley , @mistercrunch , @craig-rueda , @michael-s-molina

@craig-rueda
Member

One more comment - I agree with @john-bodley that this would probably be better done as a CLI.

@michael-s-molina
Member

I also agree this shouldn't be a new API. If we create a command, it should require many levels of confirmation to avoid the problem @mistercrunch mentioned.

@michael-s-molina michael-s-molina added the review-after-release Indicates that a release is in progress and that this PR might be reviewed later label Feb 23, 2024
@giftig
Contributor

giftig commented Feb 23, 2024

+1 for the suggestions to make it a cli command if anything, for reasons already stated. But what would this do which isn't already achieved by simply dropping the database?

@mistercrunch
Member

mistercrunch commented Feb 23, 2024

Could we have a config setting in superset/config.py that disallows this by default or maybe based on DEBUG, something like ALLOW_RESET = DEBUG or straight up ALLOW_RESET = False
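
A minimal sketch of the config gate suggested here, assuming a hypothetical `ALLOW_RESET` key (the name comes from this comment and is not an existing Superset setting) and a hypothetical `ensure_reset_allowed` helper that a reset command would call first. The config is treated as a plain mapping, which is how Flask exposes `app.config`:

```python
# Hypothetical config values; ALLOW_RESET is only the name floated in this thread.
DEBUG = False
ALLOW_RESET = DEBUG  # or, stricter: ALLOW_RESET = False


class ResetNotAllowedError(RuntimeError):
    """Raised when a reset is attempted while the config gate is closed."""


def ensure_reset_allowed(config: dict) -> None:
    """Refuse to continue unless the config explicitly opts in."""
    # A missing key counts as False, so the safe behaviour is the default.
    if not config.get("ALLOW_RESET", False):
        raise ResetNotAllowedError(
            "Factory reset is disabled; set ALLOW_RESET = True to opt in."
        )
```

With this shape, an environment that never mentions the key stays protected, and enabling the command is a deliberate change to the deployment's config.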

@mknadh
Contributor Author

mknadh commented Feb 24, 2024

Thanks @mknadh for the PR. Any reason you didn't opt for a CLI command for removing all assets? Additionally, if you're after a factory reset, why not just drop all the tables in the database and start afresh?

#27207 (reply in thread)

@mknadh
Contributor Author

mknadh commented Feb 24, 2024

+1 for the suggestions to make it a cli command if anything, for reasons already stated. But what would this do which isn't already achieved by simply dropping the database?

#27207 (reply in thread)

@mknadh
Contributor Author

mknadh commented Feb 24, 2024

One more comment - I agree with @john-bodley that this would probably be better done as a CLI.

Here are some reasons for implementing this as a REST API rather than a CLI command:

  1. Better access control: REST APIs allow leveraging OAuth and complex permission policies to control access. CLI commands are harder to lock down.

  2. Automation and integration: REST APIs lend themselves better to automation, configuration management, and integration with other systems.

  3. Consistency: Adding to the existing CRUD REST APIs for other Superset model objects maintains a consistent interface for clients.

  4. Reporting: REST API usage can be better tracked, monitored, and audited for something as sensitive as a data reset.

  5. Error handling: REST APIs provide better mechanisms for structured error handling compared to CLIs.

CC: @john-bodley, @mistercrunch, @craig-rueda, @michael-s-molina

@mknadh
Contributor Author

mknadh commented Feb 24, 2024

I also agree this shouldn't be a new API. If we create a command, it should require many levels of confirmation to avoid the problem @mistercrunch mentioned.

#27221 (comment)

@mknadh
Contributor Author

mknadh commented Feb 24, 2024

Could we have a config setting in superset/config.py that disallows this by default or maybe based on DEBUG, something like ALLOW_RESET = DEBUG or straight up ALLOW_RESET = False

#27221 (comment)

@giftig
Contributor

giftig commented Feb 24, 2024

@mknadh I don't think I can agree with any of the points you listed above.

  1. CLI commands are inherently more locked down as you're running a command on the box running superset and this is typically very much locked down in a production environment, only available to devops engineers. Bear in mind, as well, anyone who could run a CLI command could also get a flask shell and simply perform the same operation by writing the code themselves.

  2. CLI commands can be automated, and it should be fairly straightforward in any context where you're likely to want to wipe out a whole database. Can you provide examples where you'd want to wipe out the database via REST API and a cli command would not be a viable option?

  3. I don't think this argument is valid; it's unreasonable and dangerous to expect Superset to provide a "wipe out all data" option as a production API.

  4. Monitoring and auditing can be applied to an environment in which a cli command could be run, and I'd make the same arguments as point (1).

  5. That's not true. CLI tools have structured error handling and reporting mechanisms too.

In my opinion an operation like this is too destructive and shouldn't be provided at all.

I'd be a little more comfortable with it provided as a cli command to be run in development environments only, and (1) warn very clearly about the destructive nature of the operation and (2) do some sanity checks of your config to make sure it's not applied in a production-like environment (e.g. what @mistercrunch suggested)

In terms of providing a rest API to do it, no, I don't see any reason why this could be desirable. You'd never want to run this in a production environment. And who should be able to perform this operation via a rest operation? Any admin user? Even in a large deployment where there are half a dozen administration staff of varying levels of familiarity with Superset or technical ability?

@zhaoyongjie
Member

zhaoyongjie commented Feb 26, 2024

@mknadh Hey, I don't believe a call to the "reset" REST API can succeed in a real production environment, since the database would lock a table while other users are reading from or writing to it at the same time.

BTW, if you really need a fresh Superset metadata database, please create a new database instance and change SQLALCHEMY_DATABASE_URI in config.py.
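
A minimal sketch of that alternative, assuming the setting is overridden in a `superset_config.py` file and that an empty PostgreSQL database already exists. The host, credentials, and database name below are placeholders, not values from this thread:

```python
# superset_config.py
# Placeholder connection string: user, password, host, port and the
# database name "superset_fresh" are assumptions for illustration only.
SQLALCHEMY_DATABASE_URI = (
    "postgresql://superset:superset@localhost:5432/superset_fresh"
)
```

After pointing Superset at the empty database, the usual bootstrap steps (`superset db upgrade`, `superset fab create-admin`, `superset init`) would recreate the schema, an admin user, and the default roles, while the old metadata database stays untouched.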

Member

@dpgaspar dpgaspar left a comment

+1 on converting this into a CLI command

@rusackas
Member

Agreed on the CLI approach, and either (a) locking that down by default in config, or (b) having some VERY thorough confirmations before it executes ("are you sure you want to drop your database? There's no going back. Solve this math problem to prove you're serious").
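
A minimal sketch of the "solve this math problem" style of confirmation, using only the standard library. The helper names `make_challenge` and `answer_is_correct` are hypothetical and nothing here reflects what the PR actually implements:

```python
import random
from typing import Tuple


def make_challenge(rng: random.Random) -> Tuple[str, int]:
    """Return a prompt and the answer the operator must type."""
    a, b = rng.randint(10, 99), rng.randint(10, 99)
    return f"To continue, solve {a} + {b}: ", a + b


def answer_is_correct(raw: str, expected: int) -> bool:
    """Accept only an exact integer match; blank or junk input fails."""
    try:
        return int(raw.strip()) == expected
    except ValueError:
        return False
```

In a CLI, the prompt would be shown with `input()` and the reset aborted unless `answer_is_correct` returns True, so pressing Enter by accident never proceeds.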

@mknadh
Contributor Author

mknadh commented Mar 7, 2024

Thank you all.

Agreed. I am now planning to convert the current REST API into a Command-Line Interface (CLI) command with the following specifications. Does this specification look fine?

Command-Line Interface (CLI) for Factory Reset

Specifications

  1. Default State: By default, the factory-reset CLI command is disabled. It can be enabled by setting the boolean feature flag ENABLE_FACTORY_RESET_COMMAND to true.

  2. Authentication: When the feature flag is enabled and the command is executed, the CLI will prompt the user to enter their username and password.

  3. Execution and Data Removal: After validating the provided credentials and ensuring the user belongs to the "Admin" role, the command will perform the following actions (the order may vary):

    • Delete all users, except those with the "Admin" role.
    • Delete all roles, except "Admin", "Public", "Gamma", "Alpha", and "sql_lab".
    • Delete all dashboards.
    • Delete all slices/charts.
    • Delete all datasets.
    • Delete all databases.
    • Delete all favorites (dashboards/slices).
    • Delete all logs.
    • Delete all KeyValue entries (superset metastore cache).
  4. Audit Log: A new audit log entry will be created, containing the username of the admin who executed the CLI command and the timestamp of execution.
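
The "delete everything except" rules in step 3 can be sketched as pure selection functions, kept separate from any database access. The `User` type and both helper names are hypothetical; only the preserved role names are taken from the specification above:

```python
from dataclasses import dataclass

# Role names the specification says must survive the reset.
PRESERVED_ROLES = frozenset({"Admin", "Public", "Gamma", "Alpha", "sql_lab"})


@dataclass(frozen=True)
class User:
    """Hypothetical stand-in for a user record and its role names."""
    username: str
    roles: frozenset


def users_to_delete(users):
    """Select every user who does not hold the Admin role."""
    return [u for u in users if "Admin" not in u.roles]


def roles_to_delete(role_names):
    """Select every role that is not in the preserved set, sorted by name."""
    return sorted(set(role_names) - PRESERVED_ROLES)
```

Keeping the selection logic free of side effects makes it easy to unit test and to print as a dry-run summary before anything is deleted.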

@rusackas
Member

rusackas commented Mar 7, 2024

Sounds reasonable to me. It might just be good to add a confirmation (with a safe default) in the CLI to the effect of:

Please confirm that you want to:

• Delete all users, except those with the "Admin" role.
• Delete all roles, except "Admin", "Public", "Gamma", "Alpha", and "sql_lab".
• Delete all dashboards.
• Delete all slices/charts.
• Delete all datasets.
• Delete all databases.
• Delete all favorites (dashboards/slices).
• Delete all logs.
• Delete all KeyValue entries (superset metastore cache).

Proceed? [y/N]
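
A minimal sketch of how a `[y/N]` reply with a safe default could be interpreted. The function name `confirmed` is hypothetical; a Click-based CLI could get similar behaviour from `click.confirm(..., default=False)`:

```python
def confirmed(answer: str) -> bool:
    """Interpret a [y/N] reply; anything but an explicit yes means no."""
    return answer.strip().lower() in {"y", "yes"}
```

Because an empty reply is treated as "no", pressing Enter at the prompt leaves the instance untouched.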

@michael-s-molina
Member

  1. Default State: By default, the factory-reset CLI command is disabled. It can be enabled by setting the boolean feature flag ENABLE_FACTORY_RESET_COMMAND to true.

Assuming that this can only be executed by admins after layers of confirmations, I don't think we need another feature flag.

  2. Authentication: When the feature flag is enabled and the command is executed, the CLI will prompt the user to enter their username and password.

I'm curious here. Do we currently have authentication for the CLI commands? Or are we assuming that only admins can execute CLI commands? The reason I'm asking is that we already have CLI commands that require admin privileges but don't require authentication. One example is superset db downgrade, which can break the application.

  3. Execution and Data Removal: After validating the provided credentials and ensuring the user belongs to the "Admin" role, the command will perform the following actions (the order may vary):

There are many more entities, such as saved queries, CSS templates, native filters, etc. Should we change the message to say that this is just a subset of the entities that will be deleted? Or even reverse the message, saying that we'll reset to the factory state and listing what we're going to preserve (the "except" parts)?

Are we going to preserve the example dashboards?

@michael-s-molina
Member

michael-s-molina commented Mar 7, 2024

I believe the number of questions and replies to this PR clearly indicates that this should have been a SIP. If the cat is out of the bag already, maybe we could still send something to the @dev list referencing this PR to give other members of the community a chance to evaluate the proposal.

@rusackas
Member

rusackas commented Mar 7, 2024

I don't think it's too late for this to be a SIP. The PR isn't yet doing what's discussed on the thread, and even if the PR were ready, I think it's fine to put a hold label on it, and use it as a point of discussion, and merge it upon consensus.

@mknadh mknadh changed the title feat(REST API): Apache Superset "Factory Reset" REST API #27207 feat(CLI command): Apache Superset "Factory Reset" CLI command #27207 May 31, 2024
@mknadh mknadh force-pushed the features/27207-Factory-Reset-REST-API branch from 3fe790f to dacae1b Compare May 31, 2024 10:43
@github-actions github-actions bot removed the api Related to the REST API label May 31, 2024
@mknadh
Contributor Author

mknadh commented May 31, 2024

Hi @craig-rueda, @dpgaspar - Updated the pull-request by converting the superset factory-reset REST API to a CLI command. Please review and merge, when you get a chance. Thanks.

@rusackas
Member

rusackas commented Jun 4, 2024

@mknadh you'll need to run the pre-commit hook (https://superset.apache.org/docs/contributing/development/#git-hooks-1) to run the linter(s) and get this to pass CI.

@mknadh mknadh force-pushed the features/27207-Factory-Reset-REST-API branch from dacae1b to 6319d4f Compare July 1, 2024 09:09
@rusackas
Member

rusackas commented Jul 1, 2024

Looks like you'll need to run that pre-commit hook again. Then hopefully we can get this across the finish line :D

@mknadh mknadh force-pushed the features/27207-Factory-Reset-REST-API branch from 6319d4f to 5a12441 Compare July 2, 2024 03:52
@mknadh
Contributor Author

mknadh commented Jul 2, 2024

Looks like you'll need to run that pre-commit hook again. Then hopefully we can get this across the finish line :D

Pre-commit hook is successful. Please check now.

CC: @rusackas

@craig-rueda craig-rueda merged commit 6b73b69 into apache:master Jul 3, 2024
37 checks passed
Labels
review-after-release Indicates that a release is in progress and that this PR might be reviewed later size/L
10 participants