Make benchmark dataset scrubbing metadata-driven and unify HDF5/MFD load behavior by tlwillke · Pull Request #653 · datastax/jvector

tlwillke · 2026-04-03T04:30:23Z

Summary

This PR restructures benchmark dataset loading so scrubbing behavior is explicit, metadata-controlled, and uniform across HDF5 and MFD loaders. The primary goal is to stop hard-coding legacy load-time scrubbing behavior into loader-specific paths and prepare a safe transition toward prescrubbed datasets whose offline ground truth matches the stored vectors exactly.

Key changes

Scrubbing behavior becomes explicit

Added DataSetProperties.LoadBehavior with:
- LEGACY_SCRUB
- NO_SCRUB
Added load_behavior support to dataset_metadata.yml.
Added DataSetUtils.processDataSet(...) as the new metadata-aware entry point for benchmark dataset processing.
Preserved the prior load-time scrubbing implementation behind legacyScrubDataSet(...).
Kept getScrubbedDataSet(...) temporarily as a deprecated compatibility shim.
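
The dispatch described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the class name LoadBehaviorSketch, the generic signature, and the legacyScrub callback are assumptions made for the example. Only LoadBehavior, LEGACY_SCRUB, NO_SCRUB, and the processDataSet name come from the description, and the null fallback mirrors the stated LEGACY_SCRUB default.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for the metadata-aware entry point. The real
// DataSetUtils.processDataSet(...) signature is not shown in this PR text.
final class LoadBehaviorSketch {
    enum LoadBehavior { LEGACY_SCRUB, NO_SCRUB }

    /** Applies the configured behavior; a missing (null) value falls back to LEGACY_SCRUB. */
    static <T> T processDataSet(T raw, LoadBehavior behavior, UnaryOperator<T> legacyScrub) {
        LoadBehavior effective = (behavior == null) ? LoadBehavior.LEGACY_SCRUB : behavior;
        switch (effective) {
            case NO_SCRUB:
                return raw;                    // vectors and ground truth exactly as stored
            case LEGACY_SCRUB:
                return legacyScrub.apply(raw); // prior load-time scrubbing
            default:
                throw new IllegalStateException("Unhandled load behavior: " + effective);
        }
    }

    public static void main(String[] args) {
        List<String> raw = List.of("v0", "v1");
        UnaryOperator<List<String>> scrub = l -> l.subList(0, 1); // placeholder scrub
        System.out.println(processDataSet(raw, LoadBehavior.NO_SCRUB, scrub));
        System.out.println(processDataSet(raw, null, scrub));
    }
}
```

Keeping the branch in one method means neither loader needs to know which behavior is in effect.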

Unified metadata-driven loader flow

Updated both DataSetLoaderHDF5 and DataSetLoaderMFD to carry full DataSetProperties through the load path instead of collapsing metadata down to only similarity_function.
Routed both loaders through DataSetUtils.processDataSet(...) so load behavior is applied in one place.
Removed the HDF5-specific filename inference path; curated HDF5 datasets now require explicit metadata.

Metadata coverage updates

Added load_behavior to existing curated dataset entries.
Added explicit metadata entries for curated ann-benchmarks HDF5 datasets that previously relied on filename inference.
Added metadata entries for remaining MFD datasets in the static registry so known datasets fail clearly if metadata is missing.
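
One way to picture "fail clearly" is a lookup that refuses to guess when an entry is absent. The sketch below is an assumption-laden illustration: MetadataLookupSketch, Props, and require(...) are hypothetical names, the in-memory map stands in for the parsed dataset_metadata.yml, and the dataset names and values are placeholders.

```java
import java.util.Map;

// Hypothetical sketch; only the similarity_function and load_behavior
// fields are taken from the PR description.
final class MetadataLookupSketch {
    static final class Props {
        final String similarityFunction;
        final String loadBehavior;

        Props(String similarityFunction, String loadBehavior) {
            this.similarityFunction = similarityFunction;
            this.loadBehavior = loadBehavior;
        }
    }

    /** Returns the entry for a dataset, or throws with an actionable message. */
    static Props require(Map<String, Props> registry, String datasetName) {
        Props props = registry.get(datasetName);
        if (props == null) {
            throw new IllegalArgumentException(
                "No dataset_metadata.yml entry for '" + datasetName
                + "'; add one with similarity_function and load_behavior");
        }
        return props;
    }

    public static void main(String[] args) {
        // Placeholder registry contents, not real dataset entries.
        Map<String, Props> registry =
            Map.of("example-dataset", new Props("EUCLIDEAN", "NO_SCRUB"));
        System.out.println(require(registry, "example-dataset").loadBehavior);
        try {
            require(registry, "unlisted-dataset");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```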

Visibility / debugging

Added runtime printing of the dataset similarity function in the graph run path so the effective metadata-supplied similarity can be seen during execution.

Behavior changes

What changes now

Curated benchmark datasets no longer rely on HDF5 filename suffix inference for similarity.
Known datasets are now expected to have an explicit dataset_metadata.yml entry.
Benchmark dataset load behavior is controlled centrally through metadata rather than implicitly by format-specific code paths.
NO_SCRUB now loads vectors and ground truth exactly as stored.
LEGACY_SCRUB preserves the existing load-time behavior:
- zero-vector removal
- duplicate base-vector removal
- query filtering against invalid / overlapping vectors
- ground-truth remapping
- conditional normalization
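
To make the LEGACY_SCRUB steps concrete, here is a simplified sketch of the first four (conditional normalization is omitted). It is not the legacyScrubDataSet(...) implementation: the class and method names, the plain float[] and int[] types, and the exact-match duplicate test via Arrays.toString are all assumptions chosen to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified illustration of load-time scrubbing.
final class LegacyScrubSketch {
    static final class Result {
        final List<float[]> base;
        final List<float[]> queries;
        final List<int[]> groundTruth;

        Result(List<float[]> base, List<float[]> queries, List<int[]> groundTruth) {
            this.base = base;
            this.queries = queries;
            this.groundTruth = groundTruth;
        }
    }

    static boolean isZero(float[] v) {
        for (float x : v) {
            if (x != 0f) return false;
        }
        return true;
    }

    static Result scrub(List<float[]> base, List<float[]> queries, List<int[]> groundTruth) {
        // Zero-vector and duplicate removal: keep the first copy of each
        // nonzero base vector and record old ordinal -> new ordinal (-1 = removed).
        Set<String> seen = new HashSet<>();
        int[] remap = new int[base.size()];
        List<float[]> keptBase = new ArrayList<>();
        for (int i = 0; i < base.size(); i++) {
            float[] v = base.get(i);
            if (isZero(v) || !seen.add(Arrays.toString(v))) {
                remap[i] = -1;
                continue;
            }
            remap[i] = keptBase.size();
            keptBase.add(v);
        }
        // Query filtering: drop zero queries and queries that also appear in
        // the base set, then rewrite surviving ground truth to the new ordinals.
        List<float[]> keptQueries = new ArrayList<>();
        List<int[]> keptGt = new ArrayList<>();
        for (int q = 0; q < queries.size(); q++) {
            float[] v = queries.get(q);
            if (isZero(v) || seen.contains(Arrays.toString(v))) continue;
            keptQueries.add(v);
            keptGt.add(Arrays.stream(groundTruth.get(q))
                    .map(old -> remap[old])
                    .filter(mapped -> mapped >= 0)
                    .toArray());
        }
        return new Result(keptBase, keptQueries, keptGt);
    }
}
```

Because base ordinals shift whenever a vector is removed, ground truth computed offline against the stored vectors only stays valid if it is remapped like this, or if no scrubbing happens at all (NO_SCRUB).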

What does not change yet

Existing deployed datasets can remain on LEGACY_SCRUB during the transition.
The previous scrubbing behavior is still available and intentionally preserved for compatibility while prescrubbed datasets are rolled out.

Why this change

Benchmark datasets with precomputed offline ground truth should not be silently mutated by loader-specific cleanup logic unless that behavior is explicitly intended.
The old behavior broke the correspondence between stored vectors and offline ground truth for unscrubbed datasets, even though runs still completed.
Scrubbing policy is transitional benchmark-loader behavior, not an intrinsic property of the raw dataset format.
Unifying HDF5 and MFD under the same metadata architecture makes the code easier to reason about, easier to maintain, and safer to evolve as prescrubbed datasets become available.

Notes / limitations

LEGACY_SCRUB remains the default during the transition to avoid breaking currently deployed datasets.
getScrubbedDataSet(...) is still present as a deprecated compatibility API and should be removed after downstream callers are fully migrated.
Locally added datasets in DataSetLoaderMFD must also be added to dataset_metadata.yml to participate in the new configuration model.
This PR does not move scrubbing into the core library yet; it only prepares the benchmark loader path for a clean transition.

…et load behavior to dataset_metadata.yml, routing HDF5 and MFD loaders through processDataSet, preserving legacy scrubbing behind a deprecated path, and requiring explicit similarity/load configuration for curated datasets.

github-actions · 2026-04-03T04:30:33Z

Before you submit for review:

Does your PR follow guidelines from CONTRIBUTIONS.md?
Did you summarize what this PR does clearly and concisely?
Did you include performance data for changes which may be performance impacting?
Did you include useful docs for any user-facing changes or features?
Did you include useful javadocs for developer oriented changes, explaining new concepts or key changes?
Did you trigger and review regression testing results against the base branch via Run Bench Main?
Did you adhere to the code formatting guidelines (TBD)?
Did you group your changes for easy review, providing meaningful descriptions for each commit?
Did you ensure that all files contain the correct copyright header?

If you did not complete any of these, then please explain below.

tlwillke requested a review from ashkrisk April 3, 2026 04:30

tlwillke self-assigned this Apr 3, 2026

tlwillke requested review from MarkWolters and jshook as code owners April 3, 2026 04:30

tlwillke added the bug label ("Something isn't working") Apr 3, 2026

Disambiguate console print.

7cdaa79

tlwillke wants to merge 2 commits into main from disable-scrubbing