Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SIP-141] Global Async Queries 2.0 #29515

Closed
villebro opened this issue Jul 8, 2024 · 9 comments
Closed

[SIP-141] Global Async Queries 2.0 #29515

villebro opened this issue Jul 8, 2024 · 9 comments
Labels
design:proposal Design proposals sip Superset Improvement Proposal

Comments

@villebro
Copy link
Member

villebro commented Jul 8, 2024

[SIP-141] Proposal for Global Async Queries 2.0

Motivation

With [SIP-39] Global Async Query Support (GAQ for short) still being behind an experimental feature flag, and not actively maintained, I've been thinking about ways we could simplify the architecture, and finally make this feature generally available in a forthcoming Superset release. I feel the following issues have all done their part to keep this feature from gaining wide community traction:

  • No deduplication support: In my experience, the lack of deduplication of heavy queries is one of the main bottlenecks of Superset, and tends to cause major issues when many people try to access the same charts/dashboards. There was an effort to add deduplication support to GAQ, but the proposal went stale. Without this functionality, I feel GAQ fails to address the most critical issue that can be fixed by moving from sync to async queries.
  • Design complexity: There was heated discussion during the SIP process about websockets potentially being too heavy handed for this particular task, with channels, reconnection functionality etc. In retrospect, I feel most of the raised concerns comments turned out to be true - while the websocket implementation may be more elegant than a simple polling solution, very few committers or community members ended up fully understanding the minutiae of the implementation, causing it to fall out of active maintenance.
  • New components: Using the websocket feature required adding a new component to the Superset deployment, along with tightly coupling with the Redis Streams protocol. A simpler polling solution would likely have received more adoption from the community, leading to the feature stabilizing more quickly.

Having said all this, the feature is still as relevant today as it was when the original SIP was opened, and I think stabilizing this feature is very important because Superset's current synchronous query execution model causes lots of issues:

  • If many people open the same chart/dashboard at the same time, they will all start a query to the underlying database, due to no locking of queries
  • if a user refreshes a dashboard multiple times, they can quickly congest the downstream database with heavy queries, both eating up webserver threads and database resources.
  • There's no way to cancel queries that get orphaned by closed browsers.
  • In some cases, the web worker threads/processes get blocked waiting for long running queries to complete executing, making it impossible to effectively scale web worker replica sets based on CPU consumption. By moving queries to async workers it should be possible to get by with much slimmer webworker replica sets. Furthermore, async workers could be scaled up/down based on the queue depth.
    It's also worth noting, that we've had async query support in SQL Lab for a very long time, and it tends to work very well, with a much simpler architecture than that proposed in SIP-39. Therefore, a GAQ framework doesn't necessarily require websockets or Redis Streams.

Proposed Change

To simplify the architecture and reuse existing functionality, I propose the following:

  • The websocket architecture is removed, as it adds a lot of complexity to the architecture - in the future only polling would be supported.
  • The concept of a "query context cache key" is removed in favor of only a single cache key, i.e. the one we already use for chart data.
  • A new model is introduced for async queries. If data for a particular cache key isn't found, an entry is added to the new model, which tracks the query progress. The model will get a menu in the UI, from which users will be able to cancel their own queries (Admins will see all queries). Ultimately, this entry gets deleted once the data request is completed.
  • When requesting chart data, if the data exists in the cache, the data is returned normally.
    When chart data isn't available in the cache, only the cache_key is returned, along with additional details: when the most recent chart data request has been submitted, status (pending, executing), last heartbeat from the async worker etc.

The async execution flow is changed to be similar to SQL Lab async execution, with the following changes:

  • To support automatic cancellation of queries, we add a new optional field poll_ttl to the query context, which makes it possible to automatically cancel queries that are not being actively polled. Every time the cache key is polled, the latest poll time is updated on the metadata object. While executing, the worker periodically checks the metadata object, and if the poll_ttl is defined, and if the last poll time exceeds the TTL, the query is cancelled. This ensures that if a person closes a dashboard with lots of long running queries, the queries are automatically cancelled if nobody is actively waiting for the results. By default, frontend requests have poll_ttl set to whichever value is set in the config (DEFAULT_CHART_DATA_POLL_TTL). Cache warmup requests would likely not have a poll_ttl set, so as to avoid unnecessary polling.
  • To limit hammering the polling endpoint, we introduce a customizable backoff function in superset_config.py, which makes it possible to define how polling backoff should be implemented. The default behavior would be some sort of exponential backoff, where freshly started queries are polled more actively, and queries that have been pending/running for a long time are polled less frequently. When the frontend requests chart data, the backend provides the recommended wait time in the response based on the backoff function. Note, that backoff will be based on time passed since query submission time; this means, that if I open a dashboard with a chart that has a query that's been running for 10 minutes, the browser will repoll much slower than it would if the query would have been dispatched to the async workers right away

Some random thoughts:

  • Currently multi-query query contexts get executed serially. With this new approach the queries can be executed in parallel, as each query is dispatched separately.
  • I feel synchronous execution is very problematic in the context of Superset due to the problems described in the intro of this post. Originally I thought about proposing making async queries the only supported query mechanism in Superset. However, as @betodealmeida pointed out, certain databases that are expected to return data at sub second latencies are better suited to a synchronous flow, as dispatching Celery tasks can add a few seconds of extra overhead to the process. For this reason, we should probably keep async execution an optional feature.

New or Changed Public Interfaces

  • New config flag for providing a backoff function for chart data polling. This will used to tell the frontend when it should poll for completed chart data the next time.
  • New config flag for setting the default chart data poll TTL.
  • New FAB model + DAO for async queries, along with a new menu for managing currently queued/executing queries.
  • New optional fields added to chart data request:
    • poll_ttl: if set, the query will be cancelled unless a client has asked for data within the TTL bounds. This will ensure that dashboards that are closed don't leave orphaned chart data requests.
    • execution_mode: the client can ask the query to be executed sync, async or using the default mode. This is specifically added for programmatic integrations, where implementing the polling mechanism may sometimes add unnecessary complexity.
  • New fields added to the chart data response:
    • poll_delay: how many seconds should the client wait before checking if the query has completed. This value will be calculated by the backend based on the backoff function.
    • status and start_dttm: is the query queued or started, and when did the query start executing. This information can be used by the chart component to give the user information of the state of the query, similar to how we currently display how stale the cached data is. So in the future, we may display the following: "Query is queued", or "Query executing for 2 minutes".

New dependencies

In the base proposal, I suggest not adding any new dependencies, and simply supporting polling. However, we may consider using Server-sent events as noted by @betodealmeida .

Migration Plan and Compatibility

  1. Remove feature flag + related code, Websocket worker code
  2. Remove references to websocket deployment/service from Helm chart
  3. Add new feature flag SIMPLIFIED_GLOBAL_ASYNC_QUERIES

Rejected Alternatives

  • For now I suggest not implementing Server-sent events to keep the implementation simple.
  • Not do anything, and hope that GAQ stabilizies by itself. I feel this is a bad solution, as GAQ currently doesn't solve the query duplication problem, doesn't have a mechanism for cancelling queries, nor is it being actively used or maintained.
@villebro villebro added the sip Superset Improvement Proposal label Jul 8, 2024
@dosubot dosubot bot added the design:proposal Design proposals label Jul 8, 2024
@nytai
Copy link
Member

nytai commented Jul 8, 2024

To support automatic cancellation of queries, we add a new optional field poll_ttl to the query context, which makes it possible to automatically cancel queries that are not being actively polled. Every time the cache key is polled, the latest poll time is updated on the metadata object. While executing, the worker periodically checks the metadata object, and if the poll_ttl is defined, and if the last poll time exceeds the TTL, the query is cancelled. This ensures that if a person closes a dashboard with lots of long running queries, the queries are automatically cancelled if nobody is actively waiting for the results. By default, frontend requests have poll_ttl set to whichever value is set in the config (DEFAULT_CHART_DATA_POLL_TTL). Cache warmup requests would likely not have a poll_ttl set, so as to avoid unnecessary polling.

I'd suggest making this configurable or to respect the "cancel queries on window onload event" that exists in the db connection settings. Some workflows involve dashboard being accessed multiple times a day, so there is some benefit to running and caching the queries even though there isn't someone actively waiting for them. Thinking of a case where someone opens a dashboard, it takes too long to load so they pack up and commute to the office, and open up the dashboard once they're at their desk.

@rusackas rusackas changed the title [SIP] Simplified Global Async Queries [SIP-141] Simplified Global Async Queries Jul 10, 2024
@michael-s-molina
Copy link
Member

michael-s-molina commented Jul 11, 2024

I've been thinking.... the problems of the sync model are clearly described by this SIP. The proposed solution is to use an async model with polling, but @betodealmeida and @rusackas have a valid point about latency for real-time data and also possible push features. Server-side events could be used but it wouldn't be sufficient as we clearly need bi-directional communication. Considering all requirements, shouldn't we use this opportunity to actually fully adopt WebSockets as the single solution to reduce the complexity of many modes of operation? It seems it would address all our requirements and avoid a situation where we add async polling now to later introduce another method for push features.

@villebro
Copy link
Member Author

I feel the issue ultimately comes down to maintainability of the feature. Currently, the main issues that we're facing with synchronous queries are as follows:

  1. no deduplication of queries
  2. no mechanism for cancelling running queries
  3. queries locking webserver threads, making it difficult to scale infra appropriately

While the current WebSocket solution addresses the third point above, the other points are not addressed by the current solution, and in my experience, they're far more critical for enterprise deployments. I'm all for hardening and even expanding the use of WebSocket solution if we can find people to maintain and develop it further. However, given the low community involvement on the superset-websocket codebase, I feel the maintenance burden of this component exceeds the benefit of it in the current architecture.

If we do feel WebSockets are critical for the future roadmap, I feel we still need to simplify the architecture, and support query deduplication and cancellation, as they are not within the scope of the current GAQ solution.

@michael-s-molina
Copy link
Member

If we do feel WebSockets are critical for the future roadmap, I feel we still need to simplify the architecture, and support query deduplication and cancellation, as they are not within the scope of the current GAQ solution.

@villebro Exactly! This 👆🏼 what I meant by fully adopting WebSockets. We would need to simplify its architecture, add missing features and use it as the primary solution in Superset to ensure maintainability.

@villebro
Copy link
Member Author

A quick summary from the discussion on this SIP in the Town Hall meeting on 2024-07-12:

  • The issue of query deduplication, and inability to manage running queries, is confirmed to be a major issue for deployments running at scale. By being able to move query execution from the webservers into async workers we stand to gain a lot in terms of infra efficiency, performance and general usability.
  • On the topic of polling vs WebSockets (WS), there was hesitation to decommission the existing WS server, as WS will likely be very useful for future features, too. For this reason it was decided to attempt to simplify the architecture by removing query context keys in favor of the normal cache keys, and in general making the codebase more approachable, while still retaining the WS server and support for it in the main backend.

The SIP will shortly be updated to reflect these changes. Furthermore, a weekly meeting has been setup for coordinating the effort to redesign the WS implementation to bring the core features and changes proposed in this SIP, most notably query deduplication, management UI for queued/running queries, and architectural simplification. The meeting will take place on Mondays at noon PST (all interested are welcome to join!)

@harshit2283
Copy link
Contributor

To add to the general consensus, WebSockets are crucial for large-scale deployments, especially when handling a significant number of active users. Long polling is inefficient for this purpose, and horizontal scaling isn't ideal since it requires scaling the supporting infrastructure as well.

I'd be happy to contribute towards making WebSockets mainstream.

@gempir
Copy link

gempir commented Jul 15, 2024

I think the biggest hinderance to the adoption of the new WebSocket server is mostly the missing documentation.

The most I could find this in the README

Enable the `GLOBAL_ASYNC_QUERIES` feature flag:

And overall it feels like a feature that sounds very much in development if there is nothing about it in the official documentation.

That's why at least at my company we did not deploy the websocket server.

@villebro
Copy link
Member Author

villebro commented Aug 2, 2024

We are closing this SIP and will be opening a new one to reflect a new direction that was established during discussions. But in summary:

  • WebSockets will remain as is, polling will be removed
  • A new framework for async tasks will be introduced, and GAQ, Alerts & Reports, SQL Lab async queries and Thumbnails will be moved to that framework so we have a single mechanism for managing async tasks.

@villebro villebro closed this as completed Aug 2, 2024
@villebro villebro changed the title [SIP-141] Simplified Global Async Queries [SIP-141] Global Async Queries 2.0 Aug 2, 2024
@villebro
Copy link
Member Author

villebro commented Aug 3, 2024

Follow-up SIP here: #29839

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
design:proposal Design proposals sip Superset Improvement Proposal
Projects
Status: Denied / Closed / Discarded
Development

No branches or pull requests

5 participants