perf(filters): cache distinct column values to avoid re-running virtual dataset query per filter#40839
Conversation
…al dataset query per filter
Code Review Agent Run #ebc25cActionable Suggestions - 0Filtered by Review RulesBito filtered these suggestions based on rules created automatically for your feedback. Manage rules.
Review Details
Bito Usage GuideCommands Type the following command in the pull request comment and save the comment.
Refer to the documentation for additional commands. Configuration This repository uses Documentation & Help |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #40839 +/- ##
==========================================
- Coverage 64.14% 64.11% -0.04%
==========================================
Files 2652 2653 +1
Lines 143482 143754 +272
Branches 33108 33154 +46
==========================================
+ Hits 92037 92163 +126
- Misses 49836 49972 +136
- Partials 1609 1619 +10
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
The flagged issue is correct. The current implementation of the cache key in To resolve this, you should incorporate the effective RLS context into the cache key. You can achieve this by including the result of Here is the minimal fix for "col": column_name,
"limit": row_limit,
"denorm": denormalize_column,
"user": get_user_id(),
"rls": security_manager.get_rls_cache_key(datasource),
"changed_on": str(getattr(datasource, "changed_on", "")),
},This change ensures that the cache key is unique to the RLS context, preventing cross-session data leakage for guest users. superset/datasource/api.py |
…sessions don't share entries
|
Bot review summary (addressed in Thanks to the automated reviewers ( ✅ Resolved
⚪ Deferred (already noted in the PR description)
Other automated checks
I'll resolve the two addressed inline threads to keep the review focus on what's open. |
Code Review Agent Run #0a1989Actionable Suggestions - 0Review Details
Bito Usage GuideCommands Type the following command in the pull request comment and save the comment.
Refer to the documentation for additional commands. Configuration This repository uses Documentation & Help |
There was a problem hiding this comment.
Pull request overview
This PR adds server-side caching to the datasource column-values endpoint (/api/v1/datasource/<type>/<id>/column/<column>/values/) to avoid re-running expensive virtual dataset queries for repeated native-filter loads, and adds an X-Cache-Status response header plus a ?force=true bypass.
Changes:
- Cache
datasource.values_for_column(...)results incache_manager.data_cachekeyed by datasource/column/limit/denormalization/user/RLS/changed_on. - Add
?force=trueto bypass the cache and addX-Cache-Status: HIT|MISSfor debugging. - Add integration tests covering cache hit/bypass/isolation and header behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
superset/datasource/api.py |
Adds caching logic, cache key composition (incl. RLS fingerprint), force-bypass, and cache status header for column values endpoint. |
tests/integration_tests/datasource/api_tests.py |
Adds integration tests validating cache hit/miss behavior, key isolation, and X-Cache-Status header. |
| from superset import db, security_manager | ||
| from superset.connectors.sqla.models import SqlaTable | ||
| from superset.daos.exceptions import DatasourceTypeNotSupportedError | ||
| from superset.extensions import cache_manager | ||
| from superset.utils import json |
| "limit": row_limit, | ||
| "denorm": denormalize_column, | ||
| "user": get_user_id(), | ||
| "rls": security_manager.get_rls_cache_key(datasource), | ||
| "changed_on": str(getattr(datasource, "changed_on", "")), |
|
Thanks for the review @sadpandajoe! @Copilot — addressed both points in
Also removed the now-obsolete |
✅ Deploy Preview for superset-docs-preview ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
Code Review Agent Run #69ba90Actionable Suggestions - 0Additional Suggestions - 1
Filtered by Review RulesBito filtered these suggestions based on rules created automatically for your feedback. Manage rules.
Review Details
Bito Usage GuideCommands Type the following command in the pull request comment and save the comment.
Refer to the documentation for additional commands. Configuration This repository uses Documentation & Help |
SUMMARY
Fixes #39342.
The
/api/v1/datasource/<type>/<id>/column/<column>/values/endpoint had no caching. When a dashboard has multiple native filters backed by the same virtual dataset, each filter fires its own HTTP request to this endpoint, and each request re-executes the wrapping query:If the virtual dataset has joins or aggregations that take a few seconds, every additional filter multiplied the cost — even a simple page reload re-ran every filter query from scratch.
This PR adds a result cache around the call to
datasource.values_for_column(...), mirroring the exact pattern already in use atsuperset/datasource/api.py:417for theget_compatible_metricsendpoint.What this delivers
changed_onin the cache key)?force=truequery param bypasses the cacheX-Cache-Status: HIT/MISSresponse header for operator debuggingFILTER_VALUES_CACHE_WARN_THRESHOLD(defaults to 100k rows) — early signal for cache-memory pressureCache key
Security note: Row-Level Security filters are per-user, so the
userfield in the cache key is required to prevent user A's RLS-restricted values from leaking to user B.Correctness note:
changed_onbusts the cache when someone edits the underlying virtual dataset SQL — no manual invalidation needed.TTL
Uses the same logic as the existing
get_compatible_metricsendpoint at api.py:417:Per-dataset
cache_timeoutoverrides; otherwise the configured default (5 minutes).Performance impact
The performance claim is theoretical — based on cache-hit avoidance of the wrapping query, not benchmarked against a production-scale virtual dataset. The functional correctness is validated by tests; the operational impact will depend on the deployment's filter cardinality and query cost.
What this does NOT deliver (and why)
TESTING INSTRUCTIONS
Six tests cover the cache behavior:
test_get_column_values_cache_hit_skips_queryvalues_for_columncalled exactly oncetest_get_column_values_force_bypasses_cache?force=truere-runs the query even with a hot cache entrytest_get_column_values_cache_isolated_per_columntest_get_column_values_cache_isolated_per_usertest_get_column_values_cache_busts_on_changed_onchanged_oninvalidates the cache keytest_get_column_values_response_advertises_cache_statusX-Cache-Statusheader reports MISS then HITAll 14 pre-existing endpoint tests still pass — no regression.
Manual verification:
/api/v1/datasource/.../values/requests in browser devtools — second-load responses should carryX-Cache-Status: HIT?force=true— response should carryX-Cache-Status: MISSADDITIONAL INFORMATION