Make benchmark dataset scrubbing metadata-driven and unify HDF5/MFD load behavior #653

Open
tlwillke wants to merge 2 commits into main from disable-scrubbing

tlwillke (Collaborator) commented Apr 3, 2026

Summary

This PR restructures benchmark dataset loading so scrubbing behavior is explicit, metadata-controlled, and uniform across HDF5 and MFD loaders. The primary goal is to stop hard-coding legacy load-time scrubbing behavior into loader-specific paths and prepare a safe transition toward prescrubbed datasets whose offline ground truth matches the stored vectors exactly.

Key changes

Scrubbing behavior becomes explicit

  • Added DataSetProperties.LoadBehavior with:
    • LEGACY_SCRUB
    • NO_SCRUB
  • Added load_behavior support to dataset_metadata.yml.
  • Added DataSetUtils.processDataSet(...) as the new metadata-aware entry point for benchmark dataset processing.
  • Preserved the prior load-time scrubbing implementation behind legacyScrubDataSet(...).
  • Kept getScrubbedDataSet(...) temporarily as a deprecated compatibility shim.
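
The dispatch described in this list can be pictured roughly as follows. This is a minimal sketch, not the code in this PR: `SketchDataSet`, `ProcessSketch`, and the `legacyScrub` parameter are hypothetical stand-ins, and the actual signature of `DataSetUtils.processDataSet(...)` is not shown in this description.

```java
import java.util.function.UnaryOperator;

// Hypothetical stand-in for DataSetProperties.LoadBehavior.
enum LoadBehavior { LEGACY_SCRUB, NO_SCRUB }

// Hypothetical minimal data set; the real type carries base vectors, queries, and ground truth.
final class SketchDataSet {
    final String name;
    final boolean scrubbed;

    SketchDataSet(String name, boolean scrubbed) {
        this.name = name;
        this.scrubbed = scrubbed;
    }
}

final class ProcessSketch {
    // Single decision point: every loader hands over the raw data set
    // together with the behavior read from metadata.
    static SketchDataSet processDataSet(SketchDataSet raw,
                                        LoadBehavior behavior,
                                        UnaryOperator<SketchDataSet> legacyScrub) {
        switch (behavior) {
            case NO_SCRUB:
                return raw; // vectors and ground truth stay exactly as stored
            case LEGACY_SCRUB:
                return legacyScrub.apply(raw); // prior load-time scrubbing
            default:
                throw new IllegalArgumentException("Unknown load behavior: " + behavior);
        }
    }
}
```

Keeping the switch in one method means that adding a third behavior later would touch one place rather than each loader.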

Unified metadata-driven loader flow

  • Updated both DataSetLoaderHDF5 and DataSetLoaderMFD to carry full DataSetProperties through the load path instead of collapsing metadata down to only similarity_function.
  • Routed both loaders through DataSetUtils.processDataSet(...) so load behavior is applied in one place.
  • Removed the HDF5-specific filename inference path; curated HDF5 datasets now require explicit metadata.

Metadata coverage updates

  • Added load_behavior to existing curated dataset entries.
  • Added explicit metadata entries for curated ann-benchmarks HDF5 datasets that previously relied on filename inference.
  • Added metadata entries for remaining MFD datasets in the static registry so known datasets fail clearly if metadata is missing.
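
To illustrate what "fail clearly if metadata is missing" could look like, here is a small sketch under stated assumptions. `MetadataRegistrySketch` is hypothetical, the in-memory map stands in for the parsed `dataset_metadata.yml`, and the dataset name and the `COSINE` value are illustrative only.

```java
import java.util.HashMap;
import java.util.Map;

final class MetadataRegistrySketch {
    // Hypothetical in-memory view of dataset_metadata.yml: dataset name -> properties.
    private final Map<String, Map<String, String>> entries = new HashMap<>();

    void put(String dataset, Map<String, String> properties) {
        entries.put(dataset, properties);
    }

    // Returns the properties for a dataset, or throws with a message that
    // names the dataset and the missing piece instead of guessing a value.
    Map<String, String> require(String dataset) {
        Map<String, String> props = entries.get(dataset);
        if (props == null) {
            throw new IllegalStateException(
                "No dataset_metadata.yml entry for '" + dataset + "'");
        }
        if (!props.containsKey("similarity_function")) {
            throw new IllegalStateException(
                "Entry for '" + dataset + "' is missing similarity_function");
        }
        return props;
    }
}
```

Throwing at lookup time replaces the old fallback of inferring similarity from the filename, so a misconfigured dataset stops the run before any results are produced.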

Visibility / debugging

  • Added runtime printing of the dataset similarity function in the graph run path so the effective metadata-supplied similarity can be seen during execution.

Behavior changes

What changes now

  • Curated benchmark datasets no longer rely on HDF5 filename suffix inference for similarity.
  • Known datasets are now expected to have an explicit dataset_metadata.yml entry.
  • Benchmark dataset load behavior is controlled centrally through metadata rather than implicitly by format-specific code paths.
  • NO_SCRUB now loads vectors and ground truth exactly as stored.
  • LEGACY_SCRUB preserves the existing load-time behavior:
    • zero-vector removal
    • duplicate base-vector removal
    • query filtering against invalid / overlapping vectors
    • ground-truth remapping
    • conditional normalization
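
The LEGACY_SCRUB steps listed above can be sketched as follows. This is an illustration, not the body of `legacyScrubDataSet(...)`: the class and method names are hypothetical, conditional normalization is omitted, and dropping ground-truth ids that point at removed base vectors is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class LegacyScrubSketch {
    final List<float[]> base = new ArrayList<>();
    final List<float[]> queries = new ArrayList<>();
    final List<List<Integer>> groundTruth = new ArrayList<>();

    static boolean isZero(float[] v) {
        for (float x : v) {
            if (x != 0f) return false;
        }
        return true;
    }

    // Boxed copy so that vectors can be compared by value in a HashSet.
    static List<Float> key(float[] v) {
        List<Float> k = new ArrayList<>(v.length);
        for (float x : v) k.add(x);
        return k;
    }

    static LegacyScrubSketch scrub(List<float[]> rawBase,
                                   List<float[]> rawQueries,
                                   List<List<Integer>> rawGt) {
        LegacyScrubSketch out = new LegacyScrubSketch();
        Set<List<Float>> seen = new HashSet<>();
        Map<Integer, Integer> rawToScrubbed = new HashMap<>();

        // Zero-vector and duplicate removal; record where each kept vector moved.
        for (int i = 0; i < rawBase.size(); i++) {
            float[] v = rawBase.get(i);
            if (!isZero(v) && seen.add(key(v))) {
                rawToScrubbed.put(i, out.base.size());
                out.base.add(v);
            }
        }

        // Query filtering plus ground-truth remapping to the new base indices.
        for (int q = 0; q < rawQueries.size(); q++) {
            float[] v = rawQueries.get(q);
            if (isZero(v) || seen.contains(key(v))) continue; // invalid or overlaps base
            List<Integer> remapped = new ArrayList<>();
            for (int rawId : rawGt.get(q)) {
                Integer id = rawToScrubbed.get(rawId);
                if (id != null) remapped.add(id); // assumption: drop ids of removed vectors
            }
            out.queries.add(v);
            out.groundTruth.add(remapped);
        }
        return out;
    }
}
```

Because removing base vectors shifts every later index, ground truth computed offline against the stored vectors no longer lines up after scrubbing unless it is remapped, which is the correspondence NO_SCRUB is meant to preserve.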

What does not change yet

  • Existing deployed datasets can remain on LEGACY_SCRUB during the transition.
  • The previous scrubbing behavior is still available and intentionally preserved for compatibility while prescrubbed datasets are rolled out.

Why this change

  • Benchmark datasets with precomputed offline ground truth should not be silently mutated by loader-specific cleanup logic unless that behavior is explicitly intended.
  • The old behavior broke the correspondence between stored vectors and offline ground truth for unscrubbed datasets, even though runs still completed.
  • Scrubbing policy is transitional benchmark-loader behavior, not an intrinsic property of the raw dataset format.
  • Unifying HDF5 and MFD under the same metadata architecture makes the code easier to reason about, easier to maintain, and safer to evolve as prescrubbed datasets become available.

Notes / limitations

  • LEGACY_SCRUB remains the default during the transition to avoid breaking currently deployed datasets.
  • getScrubbedDataSet(...) is still present as a deprecated compatibility API and should be removed after downstream callers are fully migrated.
  • Locally added datasets in DataSetLoaderMFD must also be added to dataset_metadata.yml to participate in the new configuration model.
  • This PR does not move scrubbing into the core library yet; it only prepares the benchmark loader path for a clean transition.
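
The transitional default mentioned in the first note could be expressed along these lines. This is a sketch under assumptions: `LoadBehaviorParser` is hypothetical, and case-insensitive parsing of the `load_behavior` value is a choice made here, not something this description specifies.

```java
import java.util.Locale;

// Hypothetical stand-in for DataSetProperties.LoadBehavior.
enum LoadBehavior { LEGACY_SCRUB, NO_SCRUB }

final class LoadBehaviorParser {
    // A missing or blank load_behavior falls back to LEGACY_SCRUB so that
    // entries written before this PR keep their current behavior.
    static LoadBehavior parse(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return LoadBehavior.LEGACY_SCRUB;
        }
        // Enum.valueOf throws IllegalArgumentException for unrecognized names.
        return LoadBehavior.valueOf(raw.trim().toUpperCase(Locale.ROOT));
    }
}
```

Rejecting unrecognized values, rather than silently mapping them to the default, keeps a typo in the metadata file from quietly re-enabling scrubbing.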

…et load behavior to dataset_metadata.yml, routing HDF5 and MFD loaders through processDataSet, preserving legacy scrubbing behind a deprecated path, and requiring explicit similarity/load configuration for curated datasets.
tlwillke requested a review from ashkrisk on April 3, 2026, 04:30
tlwillke self-assigned this on Apr 3, 2026
github-actions bot (Contributor) commented Apr 3, 2026

Before you submit for review:

  • Does your PR follow guidelines from CONTRIBUTIONS.md?
  • Did you summarize what this PR does clearly and concisely?
  • Did you include performance data for changes which may be performance impacting?
  • Did you include useful docs for any user-facing changes or features?
  • Did you include useful javadocs for developer oriented changes, explaining new concepts or key changes?
  • Did you trigger and review regression testing results against the base branch via Run Bench Main?
  • Did you adhere to the code formatting guidelines (TBD)?
  • Did you group your changes for easy review, providing meaningful descriptions for each commit?
  • Did you ensure that all files contain the correct copyright header?

If you did not complete any of these, then please explain below.

tlwillke added the "bug" (Something isn't working) label on Apr 3, 2026