Skip to content

✨ Add schema-based SomaExperimentCurator #2769

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 42 commits into
base: main
Choose a base branch
from

Conversation

Zethson
Copy link
Member

@Zethson Zethson commented May 14, 2025

Fixes #2741
Fixes #2777

  • Fixes some tracking related test warnings
  • Fixes very verbose Pandera import warnings
  • Fixes duplicated API docs
  • Fixes inconsistencies with script naming. Now all scripts are using underscores. We could have also agreed on dashes but 75%+ were already using underscores
  • Refactors the is_x functions of the ScverseDataStructures into one is_scversedatastructure that can be parametrized with the type to check. This harmonizes the code and makes the behavior more predictable. I understand that this can harm readability and it might not be much better that one now has to pass the name of the expected data structure but we have this behavior across other places of the code and it might ultimately help us.
  • Adds lot of tests for checking paths and types of the ScverseDataStructures
  • Adds more fixtures and cleans up tiledbsoma related code
  • Adds support for save_artifact for tiledbsoma for new style Curators
  • Adds a new SomaExperimentCurator with associated tests and guide

Zethson added 14 commits May 14, 2025 15:27
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
@falexwolf falexwolf changed the title ✨ Add Schema based SomaExperimentCurator ✨ Add schema-based SomaExperimentCurator May 16, 2025
super().__init__(dataset=dataset, schema=schema)
if not data_is_soma_experiment(self._dataset):
raise InvalidArgument("dataset must be SOMAExperiment-like.")
if schema.otype != "tiledbsoma":
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether we should change the otype to a more specific "soma_experiment".

@@ -64,15 +65,13 @@ def identify_zarr_type(
storepath: UPathStr, *, check: bool = True
) -> Literal["anndata", "mudata", "spatialdata", "unknown"]:
"""Identify whether a zarr store is AnnData, SpatialData, or unknown type."""
# we can add these cheap suffix-based-checks later
# also need to check whether the .spatialdata.zarr suffix
# actually becomes a "standard"; currently we don't recognize it
Copy link
Member Author

@Zethson Zethson May 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what is meant here but I've seen both already in the wild and therefore I would recognize them. It allows for cheaper checks in those cases. It also helps us simplify the code and for better test coverage.

return (
identify_zarr_type(
data_path if class_name == "AnnData" else data,
check=True if class_name == "AnnData" else False,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a False for all 3 is also okay but I kept the behavior as it was for now.

Signed-off-by: Lukas Heumos <[email protected]>
Zethson added 2 commits May 20, 2025 09:21
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
@@ -851,6 +865,92 @@ def __init__(
self._columns_field = self._var_fields


class SomaExperimentCurator(SlotsCurator):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't have standardize?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should come from SlotsCurator.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but tiledbsoma requires certain special logic for that.

# global Experiment obs slot
_ms, modality_slot = None, slot
schema_dataset = (
self._dataset.obs.read()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this reads the whole thing into memory. I would really avoid this. In the old curator it is never read in full.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point of such curator is surely to avoid loading even obs in full.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe @falexwolf has different opinion though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand but then it'll require quite a bit more custom code. Usually obs and var don't get that big and it's not the biggest issue to load that into memory, is it?

But before I act, I'm waiting for Alex opinion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(1) Tell me if I'm wrong: The way people use SOMA typically is by streaming obs rather than fully downloading it. This is also what we show in our cellxgene guide for Census.

(2) Given Sergei already spent long hours to build the previous SOMA curator so that it works with the appropriate paradigm, I'm leaning towards not introducing regression re paradigm (1), but adopt existing code to the schema-based implementation. My assumption is that it "can't be so hard given it has already been done once".

(3) Quite generally I believe that many access patterns will be streaming-based in the future, and that rather than considering streaming-compat a patch for SOMA, I'd consider viewing the lack of streaming-compat a deficiency of the DataFrameCurator that should be remedied in the upcoming months (not urgent); hopefully using similar or the same abstraction used for SOMA.

(4) Downloading metadata for 100M rows would amount to 100e6 * 50 * 4 / 1e9 = 20 GB with a higher estimate for the number of cols and a lower estimate for the byte size per value. So, it's indeed large.

(5) It might be that for the task of curation we always want the full thing. 🤔 So this might make it special.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pondering on this (5) does not hold: we definitely don't want to load 100M string values in most cases; validation only establishes that the column is of type str, nothing more. While the latter can be verified from the pyarrow.lib.Schema object and takes microseconds once the schema is known, the full load of 100M string values will take seconds if not more.

So, I'd say one should follow the suggestion in (2). Further performance optimizations can be done at a later point. The scope of this PR was to "port the existing implementation to the new-style curators" not "to improve the existing implemenation" (unless that's simple, of course).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pondering this again, if things really do become ugly, one could say that "the curator is only compatible with smaller-scale soma stores". There won't be a frequent need to run it on a gigantic store (Census is one example though). So, it comes down to how hard it is to port what Sergei did without making the code complicated. It might be that new abstractions would need to be introduced, and maybe that's out-of-scope for this PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have a look and will outline how much work it would be so that we can make an informed decision.

ms, modality_slot = slot.split(":")
schema_dataset = (
self._dataset.ms[modality_slot.removesuffix(".T")]
.var.read()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same, reading the whole thing, would avoid if possible.

},
).save()

curator = ln.curators.SomaExperimentCurator(experiment, soma_schema)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In most cases, this would be called on the URI, not on the Experiment stream object that you're passing. It also seems like an anti-pattern to just open the stream and not close it within a context manager.

So, I suggest to pass the URI aka folder path (be it local or on S3). I also believe that that's consistent with ln.integrations.save_tiledbsoma_experiment(). Tell me if I'm wrong!

Independent, given the name SomaExperimentCurator we should adopt the .from_tiledbsoma() and save_tiledbsoma_experiment() names (backward compat).

Either is's all called Tiledbsoma and we allow for additional logic that specifies that we're only looking for the Experiment slot (that would need to be an argument). Or we call everything SomaExperiment.

If there won't ever be a need to curate non-Experiment SOMA stores, the later is preferrable. Otherwise the former might be preferrable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either way naming has to be consistent and the docs should cross-reference these three pieces of API logic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In most cases, this would be called on the URI, not on the Experiment stream object that you're passing. It also seems like an anti-pattern to just open the stream and not close it within a context manager. So, I suggest to pass the URI aka folder path (be it local or on S3). I also believe that that's consistent with ln.integrations.save_tiledbsoma_experiment(). Tell me if I'm wrong!

Generally agreed. Nevertheless, I think we should support both as users might already have an Experiment open. I'll adapt the example though so that this pattern is clearer.

I'll think about the naming...

Zethson added 2 commits May 22, 2025 13:27
Signed-off-by: Lukas Heumos <[email protected]>
Zethson added 2 commits May 28, 2025 16:02
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
Zethson added 2 commits May 30, 2025 10:53
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor SomaExperimentCurator to be schema based Pandera FutureWarnings
3 participants