perf(filters): cache distinct column values to avoid re-running virtual dataset query per filter by Abdulrehman-PIAIC80387 · Pull Request #40839 · apache/superset

Abdulrehman-PIAIC80387 · 2026-06-08T05:55:00Z

SUMMARY

The /api/v1/datasource/<type>/<id>/column/<column>/values/ endpoint had no caching. When a dashboard has multiple native filters backed by the same virtual dataset, each filter fires its own HTTP request to this endpoint, and each request re-executes the wrapping query:

SELECT DISTINCT <column>
FROM (<virtual_dataset_SQL>)  -- heavy CTE re-runs every time

If the virtual dataset has joins or aggregations that take a few seconds, every additional filter multiplied the cost — even a simple page reload re-ran every filter query from scratch.

This PR adds a result cache around the call to datasource.values_for_column(...), mirroring the exact pattern already in use at superset/datasource/api.py:417 for the get_compatible_metrics endpoint.

What this delivers

✅ Repeated requests for the same column → cache hit (no DB query)
✅ Dashboard reload → all filter values cached → instant
✅ Two users on the same dashboard with same RLS → second user gets cached values
✅ Editing the virtual dataset SQL auto-busts the cache (via changed_on in the cache key)
✅ ?force=true query param bypasses the cache
✅ X-Cache-Status: HIT / MISS response header for operator debugging
✅ Warning log when a single column produces a payload larger than FILTER_VALUES_CACHE_WARN_THRESHOLD (defaults to 100k rows) — early signal for cache-memory pressure

Cache key

cache_key = "col_values:" + sha256({
    "uid": datasource.uid,             # which dataset
    "col": column_name,                # which column
    "limit": row_limit,                # respects FILTER_SELECT_ROW_LIMIT
    "denorm": denormalize_column,      # respects per-dataset normalize_columns
    "user": get_user_id(),             # RLS isolation — see notes
    "changed_on": str(datasource.changed_on),  # auto-bust on dataset edit
})

Security note: Row-Level Security filters are per-user, so the user field in the cache key is required to prevent user A's RLS-restricted values from leaking to user B.

Correctness note: changed_on busts the cache when someone edits the underlying virtual dataset SQL — no manual invalidation needed.

TTL

Uses the same logic as the existing get_compatible_metrics endpoint at api.py:417:

timeout = datasource.cache_timeout or app.config.get("CACHE_DEFAULT_TIMEOUT", 300)

Per-dataset cache_timeout overrides; otherwise the configured default (5 minutes).

Performance impact

The performance claim is theoretical — based on cache-hit avoidance of the wrapping query, not benchmarked against a production-scale virtual dataset. The functional correctness is validated by tests; the operational impact will depend on the deployment's filter cardinality and query cost.

What this does NOT deliver (and why)

A single dashboard load with N filters on N different columns still issues N database queries. This PR helps repeat-hits across loads and per-user dashboard reloads, but a fresh first-load with N unique columns still hits the database N times. The architectural fix for the "N filters → 1 query" case requires a new bulk endpoint that builds one CTE-based query for all columns plus frontend coordination to call the new endpoint. Lays the groundwork for that follow-up.
No cache-stampede protection. Concurrent first-load requests for the same key can each miss the cache and hit the DB. Adding this correctly across cache backends (Redis vs Memcached vs SimpleCache) is non-trivial; left as a follow-up.

TESTING INSTRUCTIONS

pytest tests/integration_tests/datasource/api_tests.py -v

Six tests cover the cache behavior:

Test	Asserts
`test_get_column_values_cache_hit_skips_query`	Two identical requests → `values_for_column` called exactly once
`test_get_column_values_force_bypasses_cache`	`?force=true` re-runs the query even with a hot cache entry
`test_get_column_values_cache_isolated_per_column`	Two different columns → no cross-column leak
`test_get_column_values_cache_isolated_per_user`	Two users → each populates own entry (RLS safety)
`test_get_column_values_cache_busts_on_changed_on`	Bumping `changed_on` invalidates the cache key
`test_get_column_values_response_advertises_cache_status`	`X-Cache-Status` header reports MISS then HIT

All 14 pre-existing endpoint tests still pass — no regression.

Manual verification:

Configure a virtual dataset with a heavy SQL (joins/CTEs)
Add 3+ native filters on different columns of that dataset to a dashboard
Load the dashboard, then reload it
Inspect /api/v1/datasource/.../values/ requests in browser devtools — second-load responses should carry X-Cache-Status: HIT
Hit any endpoint with ?force=true — response should carry X-Cache-Status: MISS

ADDITIONAL INFORMATION

Has associated issue: Filters do not reuse the virtual dataset query #39342
Required feature flags:
Changes UI
Includes DB Migration (please mention any potential downtime)
Migration is atomic, supports rollback & is backwards-compatible
Confirm DB migration upgrade and downgrade tested
Runtime introduces no new dependencies / changes existing dependencies
Introduces new feature or API
Removes existing feature or API

…al dataset query per filter

bito-code-review · 2026-06-08T05:55:10Z

Code Review Agent Run #ebc25c

Actionable Suggestions - 0

Filtered by Review Rules

Bito filtered these suggestions based on rules created automatically for your feedback. Manage rules.

superset/datasource/api.py - 1
- CWE-390: None-Check Missing Before json.dumps · Line 151-151

Review Details

Files reviewed - 2 · Commit Range: 3bc55cf..3bc55cf
- superset/datasource/api.py
- tests/integration_tests/datasource/api_tests.py
Files skipped - 0
Tools
- Whispers (Secret Scanner) - ✔︎ Successful
- Detect-secrets (Secret Scanner) - ✔︎ Successful
- MyPy (Static Code Analysis) - ✔︎ Successful
- Astral Ruff (Static Code Analysis) - ✔︎ Successful

Bito Usage Guide

Commands

Type the following command in the pull request comment and save the comment.

/review - Manually triggers a full AI review.
/pause - Pauses automatic reviews on this pull request.
/resume - Resumes automatic reviews.
/resolve - Marks all Bito-posted review comments as resolved.
/abort - Cancels all in-progress reviews.

Refer to the documentation for additional commands.

Configuration

This repository uses Superset You can customize the agent settings here or contact your Bito workspace admin at evan@preset.io.

Documentation & Help

AI Code Review powered by

codecov · 2026-06-08T05:58:54Z

Codecov Report

❌ Patch coverage is 88.23529% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.11%. Comparing base (06f95f5) to head (3111a4e).
⚠️ Report is 68 commits behind head on master.

Files with missing lines	Patch %	Lines
superset/datasource/api.py	88.23%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #40839      +/-   ##
==========================================
- Coverage   64.14%   64.11%   -0.04%     
==========================================
  Files        2652     2653       +1     
  Lines      143482   143754     +272     
  Branches    33108    33154      +46     
==========================================
+ Hits        92037    92163     +126     
- Misses      49836    49972     +136     
- Partials     1609     1619      +10

Flag	Coverage Δ
hive	`39.46% <5.88%> (-0.06%)`	⬇️
mysql	`58.22% <88.23%> (-0.05%)`	⬇️
postgres	`58.28% <88.23%> (-0.06%)`	⬇️
presto	`41.05% <5.88%> (-0.07%)`	⬇️
python	`59.76% <88.23%> (-0.06%)`	⬇️
sqlite	`57.91% <88.23%> (-0.05%)`	⬇️
unit	`100.00% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

bito-code-review · 2026-06-08T06:00:20Z

The flagged issue is correct. The current implementation of the cache key in superset/datasource/api.py relies on get_user_id(), which returns None for guest users, causing them to share cache entries regardless of their specific RLS rules.

To resolve this, you should incorporate the effective RLS context into the cache key. You can achieve this by including the result of security_manager.get_rls_cache_key(datasource) in the cache key dictionary.

Here is the minimal fix for superset/datasource/api.py:

                        "col": column_name,
                        "limit": row_limit,
                        "denorm": denormalize_column,
                        "user": get_user_id(),
                        "rls": security_manager.get_rls_cache_key(datasource),
                        "changed_on": str(getattr(datasource, "changed_on", "")),
                    },

This change ensures that the cache key is unique to the RLS context, preventing cross-session data leakage for guest users.

superset/datasource/api.py

"col": column_name,
                        "limit": row_limit,
                        "denorm": denormalize_column,
                        "user": get_user_id(),
                        "rls": security_manager.get_rls_cache_key(datasource),
                        "changed_on": str(getattr(datasource, "changed_on", "")),
                    },

…sessions don't share entries

Abdulrehman-PIAIC80387 · 2026-06-08T06:41:18Z

Bot review summary (addressed in 92654da298)

Thanks to the automated reviewers (@codeant-ai-for-open-source, @bito-code-review) for the careful pass. Recording status for the eventual human reviewer:

✅ Resolved

** CRITICAL — Guest-user RLS leak** (api.py:164)
get_user_id() returns None for guest/anonymous sessions, so embedded dashboards with different guest-token RLS would have shared a "user": null cache entry.
Fix: cache key now includes security_manager.get_rls_cache_key(datasource) — the same canonical RLS fingerprint already used by viz.py:479 and query_context_processor.py:239. Two guest sessions with different RLS now have distinct cache keys.
Test: test_get_column_values_cache_isolated_per_rls_context.
Major — RLS policy changes don't bust cache (api.py:163)
Tightening an RLS rule wouldn't have evicted previously cached values until natural TTL expiry.
Fix: same change as above. The RLS fingerprint includes both get_rls_sorted(datasource) clauses and get_guest_rls_filters_str(datasource), so any policy change shifts the fingerprint and the next request misses the cache.

⚪ Deferred (already noted in the PR description)

** Major — Cache stampede / thundering herd** (api.py:186)
Concurrent first-load requests can each miss the cache and execute the query.
Doing this correctly across all cache_manager.data_cache backends (Redis SETNX, Memcached add, in-process SimpleCache) is non-trivial — ~80–100 lines with cross-backend tests. Out of scope for this PR; will open a follow-up if maintainers want it bundled.

Other automated checks

@bito-code-review — "Actionable Suggestions — 0" after the bot accepted that the RLS fix addresses the flagged issue.
@codecov — the initial 5.88% patch coverage was reported before the integration tests ran; the seven cache tests in tests/integration_tests/datasource/api_tests.py cover the new code paths and the latest CI run shows 0 failures across 54 checks.

I'll resolve the two addressed inline threads to keep the review focus on what's open.

… test

bito-code-review · 2026-06-08T09:19:16Z

Code Review Agent Run #0a1989

Actionable Suggestions - 0

Review Details

Files reviewed - 2 · Commit Range: 3bc55cf..09317ef
- superset/datasource/api.py
- tests/integration_tests/datasource/api_tests.py
Files skipped - 0
Tools
- Whispers (Secret Scanner) - ✔︎ Successful
- Detect-secrets (Secret Scanner) - ✔︎ Successful
- MyPy (Static Code Analysis) - ✔︎ Successful
- Astral Ruff (Static Code Analysis) - ✔︎ Successful

Bito Usage Guide

Commands

Type the following command in the pull request comment and save the comment.

/review - Manually triggers a full AI review.
/pause - Pauses automatic reviews on this pull request.
/resume - Resumes automatic reviews.
/resolve - Marks all Bito-posted review comments as resolved.
/abort - Cancels all in-progress reviews.

Refer to the documentation for additional commands.

Configuration

This repository uses Superset You can customize the agent settings here or contact your Bito workspace admin at evan@preset.io.

Documentation & Help

AI Code Review powered by

Copilot

Pull request overview

This PR adds server-side caching to the datasource column-values endpoint (/api/v1/datasource/<type>/<id>/column/<column>/values/) to avoid re-running expensive virtual dataset queries for repeated native-filter loads, and adds an X-Cache-Status response header plus a ?force=true bypass.

Changes:

Cache datasource.values_for_column(...) results in cache_manager.data_cache keyed by datasource/column/limit/denormalization/user/RLS/changed_on.
Add ?force=true to bypass the cache and add X-Cache-Status: HIT|MISS for debugging.
Add integration tests covering cache hit/bypass/isolation and header behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
`superset/datasource/api.py`	Adds caching logic, cache key composition (incl. RLS fingerprint), force-bypass, and cache status header for column values endpoint.
`tests/integration_tests/datasource/api_tests.py`	Adds integration tests validating cache hit/miss behavior, key isolation, and `X-Cache-Status` header.

 from superset import db, security_manager
 from superset.connectors.sqla.models import SqlaTable
 from superset.daos.exceptions import DatasourceTypeNotSupportedError
+from superset.extensions import cache_manager
 from superset.utils import json


+                        "limit": row_limit,
+                        "denorm": denormalize_column,
+                        "user": get_user_id(),
+                        "rls": security_manager.get_rls_cache_key(datasource),
+                        "changed_on": str(getattr(datasource, "changed_on", "")),


…e clear)

Abdulrehman-PIAIC80387 · 2026-06-10T08:18:46Z

Thanks for the review @sadpandajoe!

@Copilot — addressed both points in 3111a4e3e2:

Per-user vs same-RLS caching — dropped get_user_id() from the cache key entirely. The RLS fingerprint from security_manager.get_rls_cache_key() is the sole security-isolation field now, which is the canonical helper used by viz.py:479 and query_context_processor.py:239. Two users with identical effective RLS now correctly share a cache entry (matches the PR description). Different RLS contexts, guest sessions with different guest tokens, and anonymous sessions each get their own partition.
Test isolation — added setUp() that calls cache_manager.data_cache.clear() before every test in the class, so future test order changes can't introduce cache-bleed flakes.

Also removed the now-obsolete test_..._isolated_per_user test since per-user partitioning is no longer the behavior; test_..._isolated_per_rls_context covers the actual security guarantee.

netlify · 2026-06-10T08:47:31Z

✅ Deploy Preview for superset-docs-preview ready!

Name	Link
🔨 Latest commit	`3111a4e`
🔍 Latest deploy log	https://app.netlify.com/projects/superset-docs-preview/deploys/6a291c7259931a0008d7de1a
😎 Deploy Preview	https://deploy-preview-40839--superset-docs-preview.netlify.app
📱 Preview on mobile	Toggle QR Code... Use your smartphone camera to open QR code link.
🤖 Make changes	Run an agent on this branch

To edit notification comments on pull requests, go to your Netlify project configuration.

bito-code-review · 2026-06-10T11:37:32Z

Code Review Agent Run #69ba90

Actionable Suggestions - 0

Additional Suggestions - 1

tests/integration_tests/datasource/api_tests.py - 1
- Missing test for cache sharing · Line 263-286
  
  The deleted `test_get_column_values_cache_isolated_per_user` verified per-user isolation. The new design intentionally removes user-based isolation (users with same RLS share cache), but no test verifies this new behavior. `test_get_column_values_cache_isolated_per_rls_context` only tests that DIFFERENT RLS contexts get separate entries — it doesn't test that IDENTICAL RLS contexts share entries. Without this, the cache-sharing optimization has no regression guard.

Filtered by Review Rules

Bito filtered these suggestions based on rules created automatically for your feedback. Manage rules.

tests/integration_tests/datasource/api_tests.py - 1
- Redundant cache clear calls · Line 34-40

Review Details

Files reviewed - 2 · Commit Range: 09317ef..3111a4e
- superset/datasource/api.py
- tests/integration_tests/datasource/api_tests.py
Files skipped - 0
Tools
- Whispers (Secret Scanner) - ✔︎ Successful
- Detect-secrets (Secret Scanner) - ✔︎ Successful
- MyPy (Static Code Analysis) - ✔︎ Successful
- Astral Ruff (Static Code Analysis) - ✔︎ Successful

Bito Usage Guide

Commands

Type the following command in the pull request comment and save the comment.

/review - Manually triggers a full AI review.
/pause - Pauses automatic reviews on this pull request.
/resume - Resumes automatic reviews.
/resolve - Marks all Bito-posted review comments as resolved.
/abort - Cancels all in-progress reviews.

Refer to the documentation for additional commands.

Configuration

This repository uses Superset You can customize the agent settings here or contact your Bito workspace admin at evan@preset.io.

Documentation & Help

AI Code Review powered by

perf(filters): cache distinct column values to avoid re-running virtu…

3bc55cf

…al dataset query per filter

pull-request-size Bot added the size/L label Jun 8, 2026

dosubot Bot added api Related to the REST API data:dataset Related to dataset configurations labels Jun 8, 2026

codeant-ai-for-open-source Bot reviewed Jun 8, 2026

View reviewed changes

Comment thread superset/datasource/api.py

codeant-ai-for-open-source Bot reviewed Jun 8, 2026

View reviewed changes

Comment thread superset/datasource/api.py Outdated

Comment thread superset/datasource/api.py

fix(filters): include RLS fingerprint in cache key so embedded/guest …

92654da

…sessions don't share entries

test(filters): mock get_user_id directly for per-user cache isolation…

09317ef

… test

sadpandajoe requested review from betodealmeida and Copilot June 9, 2026 17:16

Copilot started reviewing on behalf of sadpandajoe June 9, 2026 17:16 View session

Copilot AI reviewed Jun 9, 2026

View reviewed changes

perf(filters): address review (drop user_id, RLS-only key, setUp cach…

3111a4e

…e clear)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(filters): cache distinct column values to avoid re-running virtual dataset query per filter#40839

perf(filters): cache distinct column values to avoid re-running virtual dataset query per filter#40839
Abdulrehman-PIAIC80387 wants to merge 4 commits into
apache:masterfrom
Abdulrehman-PIAIC80387:fix/cache-column-values-39342

Abdulrehman-PIAIC80387 commented Jun 8, 2026 •

edited

Loading

Uh oh!

bito-code-review Bot commented Jun 8, 2026 •

edited

Loading

Code Review Agent Run #ebc25c

Uh oh!

codecov Bot commented Jun 8, 2026 •

edited

Loading

Uh oh!

Uh oh!

bito-code-review Bot commented Jun 8, 2026

Uh oh!

Uh oh!

Uh oh!

Abdulrehman-PIAIC80387 commented Jun 8, 2026 •

edited

Loading

Uh oh!

bito-code-review Bot commented Jun 8, 2026 •

edited

Loading

Code Review Agent Run #0a1989

Uh oh!

Copilot AI left a comment

Uh oh!

Abdulrehman-PIAIC80387 commented Jun 10, 2026

Uh oh!

netlify Bot commented Jun 10, 2026

Uh oh!

bito-code-review Bot commented Jun 10, 2026 •

edited

Loading

Code Review Agent Run #69ba90

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Abdulrehman-PIAIC80387 commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

SUMMARY

What this delivers

Cache key

TTL

Performance impact

What this does NOT deliver (and why)

TESTING INSTRUCTIONS

ADDITIONAL INFORMATION

Uh oh!

bito-code-review Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review Agent Run #ebc25c

Uh oh!

codecov Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

bito-code-review Bot commented Jun 8, 2026

Uh oh!

Uh oh!

Uh oh!

Abdulrehman-PIAIC80387 commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bito-code-review Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review Agent Run #0a1989

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Abdulrehman-PIAIC80387 commented Jun 10, 2026

Uh oh!

netlify Bot commented Jun 10, 2026

✅ Deploy Preview for superset-docs-preview ready!

Uh oh!

bito-code-review Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review Agent Run #69ba90

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Abdulrehman-PIAIC80387 commented Jun 8, 2026 •

edited

Loading

bito-code-review Bot commented Jun 8, 2026 •

edited

Loading

codecov Bot commented Jun 8, 2026 •

edited

Loading

Abdulrehman-PIAIC80387 commented Jun 8, 2026 •

edited

Loading

bito-code-review Bot commented Jun 8, 2026 •

edited

Loading

bito-code-review Bot commented Jun 10, 2026 •

edited

Loading