Skip to content

perf(filters): cache distinct column values to avoid re-running virtual dataset query per filter#40839

Open
Abdulrehman-PIAIC80387 wants to merge 4 commits into
apache:masterfrom
Abdulrehman-PIAIC80387:fix/cache-column-values-39342
Open

perf(filters): cache distinct column values to avoid re-running virtual dataset query per filter#40839
Abdulrehman-PIAIC80387 wants to merge 4 commits into
apache:masterfrom
Abdulrehman-PIAIC80387:fix/cache-column-values-39342

Conversation

@Abdulrehman-PIAIC80387

@Abdulrehman-PIAIC80387 Abdulrehman-PIAIC80387 commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

SUMMARY

Fixes #39342.

The /api/v1/datasource/<type>/<id>/column/<column>/values/ endpoint had no caching. When a dashboard has multiple native filters backed by the same virtual dataset, each filter fires its own HTTP request to this endpoint, and each request re-executes the wrapping query:

SELECT DISTINCT <column>
FROM (<virtual_dataset_SQL>)  -- heavy CTE re-runs every time

If the virtual dataset has joins or aggregations that take a few seconds, every additional filter multiplied the cost — even a simple page reload re-ran every filter query from scratch.

This PR adds a result cache around the call to datasource.values_for_column(...), mirroring the exact pattern already in use at superset/datasource/api.py:417 for the get_compatible_metrics endpoint.

What this delivers

  • ✅ Repeated requests for the same column → cache hit (no DB query)
  • ✅ Dashboard reload → all filter values cached → instant
  • ✅ Two users on the same dashboard with same RLS → second user gets cached values
  • ✅ Editing the virtual dataset SQL auto-busts the cache (via changed_on in the cache key)
  • ?force=true query param bypasses the cache
  • X-Cache-Status: HIT / MISS response header for operator debugging
  • ✅ Warning log when a single column produces a payload larger than FILTER_VALUES_CACHE_WARN_THRESHOLD (defaults to 100k rows) — early signal for cache-memory pressure

Cache key

cache_key = "col_values:" + sha256({
    "uid": datasource.uid,             # which dataset
    "col": column_name,                # which column
    "limit": row_limit,                # respects FILTER_SELECT_ROW_LIMIT
    "denorm": denormalize_column,      # respects per-dataset normalize_columns
    "user": get_user_id(),             # RLS isolation — see notes
    "changed_on": str(datasource.changed_on),  # auto-bust on dataset edit
})

Security note: Row-Level Security filters are per-user, so the user field in the cache key is required to prevent user A's RLS-restricted values from leaking to user B.

Correctness note: changed_on busts the cache when someone edits the underlying virtual dataset SQL — no manual invalidation needed.

TTL

Uses the same logic as the existing get_compatible_metrics endpoint at api.py:417:

timeout = datasource.cache_timeout or app.config.get("CACHE_DEFAULT_TIMEOUT", 300)

Per-dataset cache_timeout overrides; otherwise the configured default (5 minutes).

Performance impact

The performance claim is theoretical — based on cache-hit avoidance of the wrapping query, not benchmarked against a production-scale virtual dataset. The functional correctness is validated by tests; the operational impact will depend on the deployment's filter cardinality and query cost.

What this does NOT deliver (and why)

  • A single dashboard load with N filters on N different columns still issues N database queries. This PR helps repeat-hits across loads and per-user dashboard reloads, but a fresh first-load with N unique columns still hits the database N times. The architectural fix for the "N filters → 1 query" case requires a new bulk endpoint that builds one CTE-based query for all columns plus frontend coordination to call the new endpoint. Lays the groundwork for that follow-up.
  • No cache-stampede protection. Concurrent first-load requests for the same key can each miss the cache and hit the DB. Adding this correctly across cache backends (Redis vs Memcached vs SimpleCache) is non-trivial; left as a follow-up.

TESTING INSTRUCTIONS

pytest tests/integration_tests/datasource/api_tests.py -v

Six tests cover the cache behavior:

Test Asserts
test_get_column_values_cache_hit_skips_query Two identical requests → values_for_column called exactly once
test_get_column_values_force_bypasses_cache ?force=true re-runs the query even with a hot cache entry
test_get_column_values_cache_isolated_per_column Two different columns → no cross-column leak
test_get_column_values_cache_isolated_per_user Two users → each populates own entry (RLS safety)
test_get_column_values_cache_busts_on_changed_on Bumping changed_on invalidates the cache key
test_get_column_values_response_advertises_cache_status X-Cache-Status header reports MISS then HIT

All 14 pre-existing endpoint tests still pass — no regression.

Manual verification:

  1. Configure a virtual dataset with a heavy SQL (joins/CTEs)
  2. Add 3+ native filters on different columns of that dataset to a dashboard
  3. Load the dashboard, then reload it
  4. Inspect /api/v1/datasource/.../values/ requests in browser devtools — second-load responses should carry X-Cache-Status: HIT
  5. Hit any endpoint with ?force=true — response should carry X-Cache-Status: MISS

ADDITIONAL INFORMATION

  • Has associated issue: Filters do not reuse the virtual dataset query #39342
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (please mention any potential downtime)
  • Migration is atomic, supports rollback & is backwards-compatible
  • Confirm DB migration upgrade and downgrade tested
  • Runtime introduces no new dependencies / changes existing dependencies
  • Introduces new feature or API
  • Removes existing feature or API

@dosubot dosubot Bot added api Related to the REST API data:dataset Related to dataset configurations labels Jun 8, 2026
@bito-code-review

bito-code-review Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Code Review Agent Run #ebc25c

Actionable Suggestions - 0
Filtered by Review Rules

Bito filtered these suggestions based on rules created automatically for your feedback. Manage rules.

  • superset/datasource/api.py - 1
    • CWE-390: None-Check Missing Before json.dumps · Line 151-151
Review Details
  • Files reviewed - 2 · Commit Range: 3bc55cf..3bc55cf
    • superset/datasource/api.py
    • tests/integration_tests/datasource/api_tests.py
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful

Bito Usage Guide

Commands

Type the following command in the pull request comment and save the comment.

  • /review - Manually triggers a full AI review.

  • /pause - Pauses automatic reviews on this pull request.

  • /resume - Resumes automatic reviews.

  • /resolve - Marks all Bito-posted review comments as resolved.

  • /abort - Cancels all in-progress reviews.

Refer to the documentation for additional commands.

Configuration

This repository uses Superset You can customize the agent settings here or contact your Bito workspace admin at evan@preset.io.

Documentation & Help

AI Code Review powered by Bito Logo

@codecov

codecov Bot commented Jun 8, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 88.23529% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.11%. Comparing base (06f95f5) to head (3111a4e).
⚠️ Report is 68 commits behind head on master.

Files with missing lines Patch % Lines
superset/datasource/api.py 88.23% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #40839      +/-   ##
==========================================
- Coverage   64.14%   64.11%   -0.04%     
==========================================
  Files        2652     2653       +1     
  Lines      143482   143754     +272     
  Branches    33108    33154      +46     
==========================================
+ Hits        92037    92163     +126     
- Misses      49836    49972     +136     
- Partials     1609     1619      +10     
Flag Coverage Δ
hive 39.46% <5.88%> (-0.06%) ⬇️
mysql 58.22% <88.23%> (-0.05%) ⬇️
postgres 58.28% <88.23%> (-0.06%) ⬇️
presto 41.05% <5.88%> (-0.07%) ⬇️
python 59.76% <88.23%> (-0.06%) ⬇️
sqlite 57.91% <88.23%> (-0.05%) ⬇️
unit 100.00% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Comment thread superset/datasource/api.py
@bito-code-review

Copy link
Copy Markdown
Contributor

The flagged issue is correct. The current implementation of the cache key in superset/datasource/api.py relies on get_user_id(), which returns None for guest users, causing them to share cache entries regardless of their specific RLS rules.

To resolve this, you should incorporate the effective RLS context into the cache key. You can achieve this by including the result of security_manager.get_rls_cache_key(datasource) in the cache key dictionary.

Here is the minimal fix for superset/datasource/api.py:

                        "col": column_name,
                        "limit": row_limit,
                        "denorm": denormalize_column,
                        "user": get_user_id(),
                        "rls": security_manager.get_rls_cache_key(datasource),
                        "changed_on": str(getattr(datasource, "changed_on", "")),
                    },

This change ensures that the cache key is unique to the RLS context, preventing cross-session data leakage for guest users.

superset/datasource/api.py

"col": column_name,
                        "limit": row_limit,
                        "denorm": denormalize_column,
                        "user": get_user_id(),
                        "rls": security_manager.get_rls_cache_key(datasource),
                        "changed_on": str(getattr(datasource, "changed_on", "")),
                    },

Comment thread superset/datasource/api.py Outdated
Comment thread superset/datasource/api.py
@Abdulrehman-PIAIC80387

Abdulrehman-PIAIC80387 commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

Bot review summary (addressed in 92654da298)

Thanks to the automated reviewers (@codeant-ai-for-open-source, @bito-code-review) for the careful pass. Recording status for the eventual human reviewer:

✅ Resolved

  1. ** CRITICAL — Guest-user RLS leak** (api.py:164)
    get_user_id() returns None for guest/anonymous sessions, so embedded dashboards with different guest-token RLS would have shared a "user": null cache entry.
    Fix: cache key now includes security_manager.get_rls_cache_key(datasource) — the same canonical RLS fingerprint already used by viz.py:479 and query_context_processor.py:239. Two guest sessions with different RLS now have distinct cache keys.
    Test: test_get_column_values_cache_isolated_per_rls_context.

  2. Major — RLS policy changes don't bust cache (api.py:163)
    Tightening an RLS rule wouldn't have evicted previously cached values until natural TTL expiry.
    Fix: same change as above. The RLS fingerprint includes both get_rls_sorted(datasource) clauses and get_guest_rls_filters_str(datasource), so any policy change shifts the fingerprint and the next request misses the cache.

⚪ Deferred (already noted in the PR description)

  1. ** Major — Cache stampede / thundering herd** (api.py:186)
    Concurrent first-load requests can each miss the cache and execute the query.
    Doing this correctly across all cache_manager.data_cache backends (Redis SETNX, Memcached add, in-process SimpleCache) is non-trivial — ~80–100 lines with cross-backend tests. Out of scope for this PR; will open a follow-up if maintainers want it bundled.

Other automated checks

  • @bito-code-review — "Actionable Suggestions — 0" after the bot accepted that the RLS fix addresses the flagged issue.
  • @codecov — the initial 5.88% patch coverage was reported before the integration tests ran; the seven cache tests in tests/integration_tests/datasource/api_tests.py cover the new code paths and the latest CI run shows 0 failures across 54 checks.

I'll resolve the two addressed inline threads to keep the review focus on what's open.

@bito-code-review

bito-code-review Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Code Review Agent Run #0a1989

Actionable Suggestions - 0
Review Details
  • Files reviewed - 2 · Commit Range: 3bc55cf..09317ef
    • superset/datasource/api.py
    • tests/integration_tests/datasource/api_tests.py
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful

Bito Usage Guide

Commands

Type the following command in the pull request comment and save the comment.

  • /review - Manually triggers a full AI review.

  • /pause - Pauses automatic reviews on this pull request.

  • /resume - Resumes automatic reviews.

  • /resolve - Marks all Bito-posted review comments as resolved.

  • /abort - Cancels all in-progress reviews.

Refer to the documentation for additional commands.

Configuration

This repository uses Superset You can customize the agent settings here or contact your Bito workspace admin at evan@preset.io.

Documentation & Help

AI Code Review powered by Bito Logo

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds server-side caching to the datasource column-values endpoint (/api/v1/datasource/<type>/<id>/column/<column>/values/) to avoid re-running expensive virtual dataset queries for repeated native-filter loads, and adds an X-Cache-Status response header plus a ?force=true bypass.

Changes:

  • Cache datasource.values_for_column(...) results in cache_manager.data_cache keyed by datasource/column/limit/denormalization/user/RLS/changed_on.
  • Add ?force=true to bypass the cache and add X-Cache-Status: HIT|MISS for debugging.
  • Add integration tests covering cache hit/bypass/isolation and header behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
superset/datasource/api.py Adds caching logic, cache key composition (incl. RLS fingerprint), force-bypass, and cache status header for column values endpoint.
tests/integration_tests/datasource/api_tests.py Adds integration tests validating cache hit/miss behavior, key isolation, and X-Cache-Status header.

Comment on lines 24 to 28
from superset import db, security_manager
from superset.connectors.sqla.models import SqlaTable
from superset.daos.exceptions import DatasourceTypeNotSupportedError
from superset.extensions import cache_manager
from superset.utils import json
Comment on lines +159 to +163
"limit": row_limit,
"denorm": denormalize_column,
"user": get_user_id(),
"rls": security_manager.get_rls_cache_key(datasource),
"changed_on": str(getattr(datasource, "changed_on", "")),
@Abdulrehman-PIAIC80387

Copy link
Copy Markdown
Contributor Author

Thanks for the review @sadpandajoe!

@Copilot — addressed both points in 3111a4e3e2:

  1. Per-user vs same-RLS caching — dropped get_user_id() from the cache key entirely. The RLS fingerprint from security_manager.get_rls_cache_key() is the sole security-isolation field now, which is the canonical helper used by viz.py:479 and query_context_processor.py:239. Two users with identical effective RLS now correctly share a cache entry (matches the PR description). Different RLS contexts, guest sessions with different guest tokens, and anonymous sessions each get their own partition.
  2. Test isolation — added setUp() that calls cache_manager.data_cache.clear() before every test in the class, so future test order changes can't introduce cache-bleed flakes.

Also removed the now-obsolete test_..._isolated_per_user test since per-user partitioning is no longer the behavior; test_..._isolated_per_rls_context covers the actual security guarantee.

@netlify

netlify Bot commented Jun 10, 2026

Copy link
Copy Markdown

Deploy Preview for superset-docs-preview ready!

Name Link
🔨 Latest commit 3111a4e
🔍 Latest deploy log https://app.netlify.com/projects/superset-docs-preview/deploys/6a291c7259931a0008d7de1a
😎 Deploy Preview https://deploy-preview-40839--superset-docs-preview.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.
🤖 Make changes Run an agent on this branch

To edit notification comments on pull requests, go to your Netlify project configuration.

@bito-code-review

bito-code-review Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Code Review Agent Run #69ba90

Actionable Suggestions - 0
Additional Suggestions - 1
  • tests/integration_tests/datasource/api_tests.py - 1
    • Missing test for cache sharing · Line 263-286
      The deleted `test_get_column_values_cache_isolated_per_user` verified per-user isolation. The new design intentionally removes user-based isolation (users with same RLS share cache), but no test verifies this new behavior. `test_get_column_values_cache_isolated_per_rls_context` only tests that DIFFERENT RLS contexts get separate entries — it doesn't test that IDENTICAL RLS contexts share entries. Without this, the cache-sharing optimization has no regression guard.
Filtered by Review Rules

Bito filtered these suggestions based on rules created automatically for your feedback. Manage rules.

  • tests/integration_tests/datasource/api_tests.py - 1
Review Details
  • Files reviewed - 2 · Commit Range: 09317ef..3111a4e
    • superset/datasource/api.py
    • tests/integration_tests/datasource/api_tests.py
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful

Bito Usage Guide

Commands

Type the following command in the pull request comment and save the comment.

  • /review - Manually triggers a full AI review.

  • /pause - Pauses automatic reviews on this pull request.

  • /resume - Resumes automatic reviews.

  • /resolve - Marks all Bito-posted review comments as resolved.

  • /abort - Cancels all in-progress reviews.

Refer to the documentation for additional commands.

Configuration

This repository uses Superset You can customize the agent settings here or contact your Bito workspace admin at evan@preset.io.

Documentation & Help

AI Code Review powered by Bito Logo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api Related to the REST API data:dataset Related to dataset configurations size/L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Filters do not reuse the virtual dataset query

2 participants