Skip to content

improve dataset loading, add vectordata source#643

Open
jshook wants to merge 2 commits intomainfrom
datasets_update
Open

improve dataset loading, add vectordata source#643
jshook wants to merge 2 commits intomainfrom
datasets_update

Conversation

@jshook
Copy link
Copy Markdown
Contributor

@jshook jshook commented Mar 10, 2026

This PR activates the vectodata loader as the third option after the HDF5 and MFD loaders.
If a dataset is not found in either of these, then it will be loaded from vectordata sources so long as the dataset is visible.
Users will want to add a ~/.config/vectordata/catalogs.yaml file to get access to their own or group shared datasets.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 10, 2026

Before you submit for review:

  • Does your PR follow guidelines from CONTRIBUTIONS.md?
  • Did you summarize what this PR does clearly and concisely?
  • Did you include performance data for changes which may be performance impacting?
  • Did you include useful docs for any user-facing changes or features?
  • Did you include useful javadocs for developer oriented changes, explaining new concepts or key changes?
  • Did you trigger and review regression testing results against the base branch via Run Bench Main?
  • Did you adhere to the code formatting guidelines (TBD)
  • Did you group your changes for easy review, providing meaningful descriptions for each commit?
  • Did you ensure that all files contain the correct copyright header?

If you did not complete any of these, then please explain below.

Copy link
Copy Markdown
Contributor

@MarkWolters MarkWolters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a couple nit-picky suggestions but looks good

similarityFunction = VectorSimilarityFunction.EUCLIDEAN;
}
else {
throw new IllegalArgumentException("Unknown similarity function -- expected angular or euclidean for " + filename);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use the terminology we use elsewhere: cosine, l2 (Euclidean is fine), and dot product. And dot product should map to dot product, not cosine.

Also, it's pretty precarious selecting the VSF based on the file name.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, cosine, l2, dot_product, got it. Not sure why we have the wrong matching there.

}

private VectorSimilarityFunction mapDistanceFunction(DistanceFunction df) {
if (df == null) return COSINE;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why wouldn't df == null be an error?

If there is a need for a default, it should be DOT_PRODUCT and is only safe to arbitrarily use in all cases if the vectors are normalized.

Also, shouldn't this return VectorSimilarityFunction.COSINE?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, undefined should be an error.

I think it might actually return VectorSimilarityFunction.COSINE on line 281, but it doesn't look like it from the way it is written. I agree that explicit is better here.

Copy link
Copy Markdown
Collaborator

@tlwillke tlwillke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see my comments for requested changes. It's looking good so far, but I cannot evaluate it further until the catalog.yaml comment is addressed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants