✨ Add schema-based `SomaExperimentCurator` #2769

Zethson · 2025-05-14T13:29:22Z

Fixes #2741
Fixes #2777

Fixes some tracking related test warnings
Fixes very verbose Pandera import warnings
Fixes duplicated API docs
Fixes inconsistent script naming: all scripts now use underscores. We could also have agreed on dashes, but more than 75% already used underscores.
Refactors the is_x functions for the ScverseDataStructures into a single is_scversedatastructure that is parametrized with the type to check. This harmonizes the code and makes the behavior more predictable. I understand that this can harm readability, and having to pass the name of the expected data structure may not be much better, but we use this pattern in other places in the code and it might ultimately help us.
Adds a lot of tests for checking paths and types of the ScverseDataStructures
Adds more fixtures and cleans up tiledbsoma related code
Adds support for save_artifact for tiledbsoma for new style Curators
Adds a new SomaExperimentCurator with associated tests and guide
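
For illustration, a single parametrized check of the kind described in the refactor above could look roughly like this. This is a minimal sketch, not the lamindb implementation: the function name `is_scversedatastructure_sketch`, the suffix table, and the class-name comparison are all assumptions made for the example.

```python
from pathlib import Path

# Hypothetical suffix table; the real lamindb mapping may differ.
_SUFFIXES = {
    "AnnData": (".h5ad",),
    "MuData": (".h5mu",),
    "SpatialData": (".spatialdata.zarr",),
}


def is_scversedatastructure_sketch(data, structure_type=None):
    """Check a path or an in-memory object against one expected type, or any type if None."""
    names = [structure_type] if structure_type is not None else list(_SUFFIXES)
    if isinstance(data, (str, Path)):
        # Paths are judged by their suffix only.
        return any(str(data).endswith(_SUFFIXES[n]) for n in names if n in _SUFFIXES)
    # Objects are judged by class name, so no optional dependency has to be imported.
    return type(data).__name__ in names


print(is_scversedatastructure_sketch("cells.h5ad", "AnnData"))
print(is_scversedatastructure_sketch("cells.h5ad", "MuData"))
```

Passing the expected type as a string keeps one code path for all data structures, at the cost of the readability trade-off mentioned above.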

Zethson · 2025-05-19T12:00:57Z

lamindb/curators/core.py

+        super().__init__(dataset=dataset, schema=schema)
+        if not data_is_soma_experiment(self._dataset):
+            raise InvalidArgument("dataset must be SOMAExperiment-like.")
+        if schema.otype != "tiledbsoma":


I wonder whether we should change the otype to a more specific "soma_experiment".

Zethson · 2025-05-19T13:47:48Z

lamindb/core/storage/_zarr.py

@@ -64,15 +65,13 @@ def identify_zarr_type(
    storepath: UPathStr, *, check: bool = True
 ) -> Literal["anndata", "mudata", "spatialdata", "unknown"]:
    """Identify whether a zarr store is AnnData, SpatialData, or unknown type."""
-    # we can add these cheap suffix-based-checks later
-    # also need to check whether the .spatialdata.zarr suffix
-    # actually becomes a "standard"; currently we don't recognize it


I see what is meant here, but I've already seen both in the wild and would therefore recognize them. This allows for cheaper checks in those cases, simplifies the code, and improves test coverage.
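
A cheap suffix-based check of this kind might look as follows. It is only a sketch: the helper name `identify_zarr_type_by_suffix` is hypothetical, and the `.anndata.zarr` and `.mudata.zarr` suffixes are assumed by analogy with the `.spatialdata.zarr` suffix mentioned in the removed comment.

```python
from typing import Literal

ZarrType = Literal["anndata", "mudata", "spatialdata", "unknown"]


def identify_zarr_type_by_suffix(storepath) -> ZarrType:
    """Guess the zarr store type from the path suffix alone, without opening the store."""
    path = str(storepath).rstrip("/")
    for zarr_type in ("anndata", "mudata", "spatialdata"):
        if path.endswith(f".{zarr_type}.zarr"):
            return zarr_type
    # A plain ".zarr" suffix carries no type information.
    return "unknown"


print(identify_zarr_type_by_suffix("s3://bucket/sample.spatialdata.zarr/"))
print(identify_zarr_type_by_suffix("sample.zarr"))
```

When the result is "unknown", a caller could still fall back to inspecting the store contents, so the suffix check only short-circuits the cases where the name is informative.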

Zethson · 2025-05-19T14:15:59Z

lamindb/models/artifact.py

+                return (
+                    identify_zarr_type(
+                        data_path if class_name == "AnnData" else data,
+                        check=True if class_name == "AnnData" else False,


I think a False for all 3 is also okay but I kept the behavior as it was for now.

Koncopd · 2025-05-21T16:15:41Z

lamindb/curators/core.py

@@ -851,6 +865,92 @@ def __init__(
        self._columns_field = self._var_fields


+class SomaExperimentCurator(SlotsCurator):


This doesn't have standardize?

This should come from SlotsCurator.

Yes, but tiledbsoma requires certain special logic for that.

Koncopd · 2025-05-21T16:19:33Z

lamindb/curators/core.py

+                # global Experiment obs slot
+                _ms, modality_slot = None, slot
+                schema_dataset = (
+                    self._dataset.obs.read()


But this reads the whole thing into memory. I would really avoid this. In the old curator it is never read in full.

The point of such curator is surely to avoid loading even obs in full.

Maybe @falexwolf has a different opinion though.

I understand but then it'll require quite a bit more custom code. Usually obs and var don't get that big and it's not the biggest issue to load that into memory, is it?

But before I act, I'm waiting for Alex's opinion.

(1) Tell me if I'm wrong: The way people use SOMA typically is by streaming obs rather than fully downloading it. This is also what we show in our cellxgene guide for Census.

(2) Given that Sergei already spent long hours building the previous SOMA curator so that it works with the appropriate paradigm, I'm leaning towards not introducing a regression with respect to paradigm (1), but adapting the existing code to the schema-based implementation. My assumption is that it "can't be so hard given it has already been done once".

(3) Quite generally, I believe that many access patterns will be streaming-based in the future. Rather than considering streaming compatibility a patch for SOMA, I'd view the lack of streaming compatibility as a deficiency of the DataFrameCurator that should be remedied in the upcoming months (not urgent), hopefully using similar or the same abstractions as for SOMA.

(4) Downloading metadata for 100M rows would amount to 100e6 * 50 * 4 / 1e9 = 20 GB with a higher estimate for the number of cols and a lower estimate for the byte size per value. So, it's indeed large.

(5) It might be that for the task of curation we always want the full thing. 🤔 So this might make it special.
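
The back-of-the-envelope estimate in (4) can be reproduced with a few lines. The helper name `estimate_metadata_gb` is made up for this sketch; the inputs are the ones stated in (4): 100M rows, 50 columns, 4 bytes per value.

```python
def estimate_metadata_gb(n_rows: float, n_cols: int, bytes_per_value: int) -> float:
    """Rough size of a dense metadata table in gigabytes (1 GB = 1e9 bytes)."""
    return n_rows * n_cols * bytes_per_value / 1e9


print(estimate_metadata_gb(100e6, 50, 4))  # 20.0
```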

Pondering on this, (5) does not hold: we definitely don't want to load 100M string values in most cases; validation only establishes that the column is of type str, nothing more. While the latter can be verified from the pyarrow.lib.Schema object and takes microseconds once the schema is known, a full load of 100M string values will take seconds, if not more.

So, I'd say one should follow the suggestion in (2). Further performance optimizations can be done at a later point. The scope of this PR was to "port the existing implementation to the new-style curators", not "to improve the existing implementation" (unless that's simple, of course).

Pondering this again, if things really do become ugly, one could say that "the curator is only compatible with smaller-scale soma stores". There won't be a frequent need to run it on a gigantic store (Census is one example though). So, it comes down to how hard it is to port what Sergei did without making the code complicated. It might be that new abstractions would need to be introduced, and maybe that's out-of-scope for this PR.

I'll have a look and will outline how much work it would be so that we can make an informed decision.
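
To make the streaming paradigm discussed in this thread concrete, here is a minimal, library-free sketch of chunk-wise validation. It does not reproduce the previous SOMA curator; the function name `find_invalid_values` and the dict-of-lists chunk format are assumptions for illustration. In practice the chunks would come from a streamed obs or var read rather than from a generator of dicts.

```python
from typing import Dict, Iterable, List, Set


def find_invalid_values(
    chunks: Iterable[Dict[str, List[str]]],
    allowed: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Collect values per column that are not in the allowed set, one chunk at a time."""
    invalid: Dict[str, Set[str]] = {column: set() for column in allowed}
    for chunk in chunks:
        for column, permitted in allowed.items():
            # Only the distinct values of the current chunk are held in memory.
            invalid[column] |= set(chunk.get(column, [])) - permitted
    return invalid


def obs_chunks():
    # Stand-in for a streamed obs read; each dict is one batch of rows.
    yield {"cell_type": ["B cell", "T cell"], "tissue": ["lung", "lung"]}
    yield {"cell_type": ["T cell", "astrocyt"], "tissue": ["brain", "lung"]}


result = find_invalid_values(
    obs_chunks(),
    {"cell_type": {"B cell", "T cell", "astrocyte"}, "tissue": {"lung", "brain"}},
)
print(result)
```

Memory use is bounded by the chunk size plus the set of invalid values found so far, which is the property the full read of obs or var gives up.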

Koncopd · 2025-05-21T16:21:45Z

lamindb/curators/core.py

+                ms, modality_slot = slot.split(":")
+                schema_dataset = (
+                    self._dataset.ms[modality_slot.removesuffix(".T")]
+                    .var.read()


Same here: this reads the whole thing, which I would avoid if possible.

falexwolf · 2025-05-22T05:31:31Z

docs/scripts/curate_soma_experiment.py

+    },
+).save()
+
+curator = ln.curators.SomaExperimentCurator(experiment, soma_schema)


In most cases, this would be called on the URI, not on the Experiment stream object that you're passing. It also seems like an anti-pattern to just open the stream and not close it within a context manager.

So, I suggest to pass the URI aka folder path (be it local or on S3). I also believe that that's consistent with ln.integrations.save_tiledbsoma_experiment(). Tell me if I'm wrong!

Independently, given the name SomaExperimentCurator, we should adapt the .from_tiledbsoma() and save_tiledbsoma_experiment() names (keeping backward compatibility).

Either it's all called Tiledbsoma and we allow for additional logic that specifies that we're only looking for the Experiment slot (that would need to be an argument), or we call everything SomaExperiment.

If there won't ever be a need to curate non-Experiment SOMA stores, the latter is preferable. Otherwise, the former might be preferable.

Either way naming has to be consistent and the docs should cross-reference these three pieces of API logic.

> In most cases, this would be called on the URI, not on the Experiment stream object that you're passing. It also seems like an anti-pattern to just open the stream and not close it within a context manager. So, I suggest to pass the URI aka folder path (be it local or on S3). I also believe that that's consistent with ln.integrations.save_tiledbsoma_experiment(). Tell me if I'm wrong!

Generally agreed. Nevertheless, I think we should support both as users might already have an Experiment open. I'll adapt the example though so that this pattern is clearer.
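
Supporting both a URI and an already-open Experiment could be handled with a small context manager along these lines. This is a sketch under assumptions: `experiment_handle`, `open_fn`, and `FakeExperiment` are hypothetical names, and in real code `open_fn` would be whatever opens a SOMA experiment from a path.

```python
import os
from contextlib import contextmanager


@contextmanager
def experiment_handle(dataset, open_fn):
    """Yield an open experiment; close it afterwards only if it was opened here."""
    if isinstance(dataset, (str, os.PathLike)):
        handle = open_fn(dataset)
        try:
            yield handle
        finally:
            handle.close()
    else:
        # The caller owns an already-open experiment and stays responsible for closing it.
        yield dataset


class FakeExperiment:
    """Stand-in for an experiment object, used only to make the sketch runnable."""

    def __init__(self, uri):
        self.uri = uri
        self.closed = False

    def close(self):
        self.closed = True


with experiment_handle("path/to/store", FakeExperiment) as opened_here:
    assert not opened_here.closed
assert opened_here.closed

already_open = FakeExperiment("path/to/store")
with experiment_handle(already_open, FakeExperiment) as passed_in:
    assert passed_in is already_open
assert not already_open.closed
```

With this shape, the URI path never leaves a stream open, while users who already hold an open Experiment keep control over its lifetime.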

I'll think about the naming...

Zethson added 14 commits May 14, 2025 15:27

✨ Add lots of SomaExperiment tests

978f565

Signed-off-by: Lukas Heumos <[email protected]>

🎨 Fix import

d26a515

Signed-off-by: Lukas Heumos <[email protected]>

🎨 Fix import

aeb8ed0

Signed-off-by: Lukas Heumos <[email protected]>

🎨 Big is_scversedatastructure refactor

65afa20

Signed-off-by: Lukas Heumos <[email protected]>

Merge branch 'main' of https://github.com/laminlabs/lamindb into feat…

7398fde

…ure/tiledbsomaexperimentcurator

🎨 Fix conftest

8d6de03

Signed-off-by: Lukas Heumos <[email protected]>

🎨 Refactor

1d7f01b

Signed-off-by: Lukas Heumos <[email protected]>

🎨 Fix tests

afed0f3

Signed-off-by: Lukas Heumos <[email protected]>

🎨 Soma path

2bee906

Signed-off-by: Lukas Heumos <[email protected]>

🎨 Soma path refactor

d59cb9a

Signed-off-by: Lukas Heumos <[email protected]>

🎨 Scope

60cc4e7

Signed-off-by: Lukas Heumos <[email protected]>

🎨 Scope function

0eb349b

Signed-off-by: Lukas Heumos <[email protected]>

🎨 Scope function

932a838

Signed-off-by: Lukas Heumos <[email protected]>

🎨 Add test_data_is_scversedatastructure

1ccd8c2

Signed-off-by: Lukas Heumos <[email protected]>

falexwolf changed the title ~~✨ Add Schema based SomaExperimentCurator~~ ✨ Add schema-based SomaExperimentCurator May 16, 2025

Zethson added 13 commits May 16, 2025 14:23

🎨 Add SomaExperimentCurator implementation

1b54ca1

Signed-off-by: Lukas Heumos <[email protected]>

Merge branch 'main' into feature/tiledbsomaexperimentcurator

7b82e3c

🎨 Iterate tests

36ac624

Signed-off-by: Lukas Heumos <[email protected]>

🎨 Iterate tests

11be91c

Signed-off-by: Lukas Heumos <[email protected]>

🎨 Fix save artifact

27703c4

Signed-off-by: Lukas Heumos <[email protected]>

Merge branch 'main' of https://github.com/laminlabs/lamindb into feat…

a430f7c

…ure/tiledbsomaexperimentcurator

🎨 Enable MuData execution

6f5571a

Signed-off-by: Lukas Heumos <[email protected]>

🎨 Add SomaExperimentCurator to curate notebook

4a361d0

Signed-off-by: Lukas Heumos <[email protected]>

🎨 Add examples

ba5b899

Signed-off-by: Lukas Heumos <[email protected]>

🎨 Add is_soma_experiment test

f1d6656

Signed-off-by: Lukas Heumos <[email protected]>

🎨 Refactor

ae1f280

Signed-off-by: Lukas Heumos <[email protected]>

🎨 Allow None for is_scversedatastructure

f793abd

Signed-off-by: Lukas Heumos <[email protected]>

🎨 Refactor

cecc15f

Signed-off-by: Lukas Heumos <[email protected]>

Zethson commented May 19, 2025

🎨 Fix test

bc95e23

Signed-off-by: Lukas Heumos <[email protected]>

github-actions bot deployed to preview May 19, 2025 15:56 View deployment

🎨 Fix pandera & curator API docs

08f7cf5

Signed-off-by: Lukas Heumos <[email protected]>

github-actions bot deployed to preview May 20, 2025 07:21 View deployment

Zethson added 2 commits May 20, 2025 09:21

🎨 Rename scripts

562b122

Signed-off-by: Lukas Heumos <[email protected]>

🎨 Fix see also syntax

39c6b6d

Signed-off-by: Lukas Heumos <[email protected]>

github-actions bot deployed to preview May 20, 2025 07:33 View deployment

github-actions bot deployed to preview May 20, 2025 07:39 View deployment

Zethson requested review from sunnyosun and Koncopd May 20, 2025 07:42

Koncopd reviewed May 21, 2025

lamindb/models/artifact.py (outdated, resolved)

Koncopd reviewed May 21, 2025

lamindb/models/artifact.py (resolved)

Koncopd reviewed May 21, 2025

falexwolf reviewed May 22, 2025

docs/scripts/curate_soma_experiment.py (outdated, resolved)

falexwolf reviewed May 22, 2025

Zethson added 2 commits May 22, 2025 13:27

🎨 Polish

ace2630

Signed-off-by: Lukas Heumos <[email protected]>

Merge branch 'main' of https://github.com/laminlabs/lamindb into feat…

6c002b5

…ure/tiledbsomaexperimentcurator

github-actions bot deployed to preview May 22, 2025 11:44 View deployment

Zethson added 2 commits May 28, 2025 16:02

🎨 Fix merge conflicts

bf32124

Signed-off-by: Lukas Heumos <[email protected]>

🎨 Sub

dc2ef93

Signed-off-by: Lukas Heumos <[email protected]>

github-actions bot deployed to preview May 28, 2025 14:25 View deployment

🎨 Rename tiledbsoma datasets

923cd12

Signed-off-by: Lukas Heumos <[email protected]>

github-actions bot deployed to preview May 28, 2025 15:16 View deployment

Zethson mentioned this pull request May 28, 2025

🔈 Fix Pandera warning #2803

Merged

Zethson added 2 commits May 30, 2025 10:53

🎨 Merge conflict

bb1d734

Signed-off-by: Lukas Heumos <[email protected]>

🎨 Sub

6cca9b0

Signed-off-by: Lukas Heumos <[email protected]>

github-actions bot deployed to preview May 30, 2025 09:07 View deployment

Merge branch 'main' into feature/tiledbsomaexperimentcurator

f633ef6

github-actions bot deployed to preview May 30, 2025 15:39 View deployment
