-
Notifications
You must be signed in to change notification settings - Fork 15
✨ Add schema-based SomaExperimentCurator
#2769
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
…ure/tiledbsomaexperimentcurator
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
SomaExperimentCurator
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
…ure/tiledbsomaexperimentcurator
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
super().__init__(dataset=dataset, schema=schema) | ||
if not data_is_soma_experiment(self._dataset): | ||
raise InvalidArgument("dataset must be SOMAExperiment-like.") | ||
if schema.otype != "tiledbsoma": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder whether we should change the otype to a more specific "soma_experiment"
.
@@ -64,15 +65,13 @@ def identify_zarr_type( | |||
storepath: UPathStr, *, check: bool = True | |||
) -> Literal["anndata", "mudata", "spatialdata", "unknown"]: | |||
"""Identify whether a zarr store is AnnData, SpatialData, or unknown type.""" | |||
# we can add these cheap suffix-based-checks later | |||
# also need to check whether the .spatialdata.zarr suffix | |||
# actually becomes a "standard"; currently we don't recognize it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see what is meant here but I've seen both already in the wild and therefore I would recognize them. It allows for cheaper checks in those cases. It also helps us simplify the code and for better test coverage.
lamindb/models/artifact.py
Outdated
return ( | ||
identify_zarr_type( | ||
data_path if class_name == "AnnData" else data, | ||
check=True if class_name == "AnnData" else False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a False for all 3 is also okay but I kept the behavior as it was for now.
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
@@ -851,6 +865,92 @@ def __init__( | |||
self._columns_field = self._var_fields | |||
|
|||
|
|||
class SomaExperimentCurator(SlotsCurator): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't have standardize
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should come from SlotsCurator
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but tiledbsoma requires certain special logic for that.
# global Experiment obs slot | ||
_ms, modality_slot = None, slot | ||
schema_dataset = ( | ||
self._dataset.obs.read() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this reads the whole thing into memory. I would really avoid this. In the old curator it is never read in full.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The point of such curator is surely to avoid loading even obs
in full.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe @falexwolf has different opinion though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand but then it'll require quite a bit more custom code. Usually obs
and var
don't get that big and it's not the biggest issue to load that into memory, is it?
But before I act, I'm waiting for Alex opinion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(1) Tell me if I'm wrong: The way people use SOMA typically is by streaming obs rather than fully downloading it. This is also what we show in our cellxgene
guide for Census.
(2) Given Sergei already spent long hours to build the previous SOMA curator so that it works with the appropriate paradigm, I'm leaning towards not introducing regression re paradigm (1), but adopt existing code to the schema-based implementation. My assumption is that it "can't be so hard given it has already been done once".
(3) Quite generally I believe that many access patterns will be streaming-based in the future, and that rather than considering streaming-compat a patch for SOMA, I'd consider viewing the lack of streaming-compat a deficiency of the DataFrameCurator that should be remedied in the upcoming months (not urgent); hopefully using similar or the same abstraction used for SOMA.
(4) Downloading metadata for 100M rows would amount to 100e6 * 50 * 4 / 1e9 = 20 GB
with a higher estimate for the number of cols and a lower estimate for the byte size per value. So, it's indeed large.
(5) It might be that for the task of curation we always want the full thing. 🤔 So this might make it special.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pondering on this (5) does not hold: we definitely don't want to load 100M string values in most cases; validation only establishes that the column is of type str
, nothing more. While the latter can be verified from the pyarrow.lib.Schema
object and takes microseconds once the schema is known, the full load of 100M string values will take seconds if not more.
So, I'd say one should follow the suggestion in (2). Further performance optimizations can be done at a later point. The scope of this PR was to "port the existing implementation to the new-style curators" not "to improve the existing implemenation" (unless that's simple, of course).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pondering this again, if things really do become ugly, one could say that "the curator is only compatible with smaller-scale soma stores". There won't be a frequent need to run it on a gigantic store (Census is one example though). So, it comes down to how hard it is to port what Sergei did without making the code complicated. It might be that new abstractions would need to be introduced, and maybe that's out-of-scope for this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll have a look and will outline how much work it would be so that we can make an informed decision.
ms, modality_slot = slot.split(":") | ||
schema_dataset = ( | ||
self._dataset.ms[modality_slot.removesuffix(".T")] | ||
.var.read() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same, reading the whole thing, would avoid if possible.
}, | ||
).save() | ||
|
||
curator = ln.curators.SomaExperimentCurator(experiment, soma_schema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In most cases, this would be called on the URI, not on the Experiment
stream object that you're passing. It also seems like an anti-pattern to just open the stream and not close it within a context manager.
So, I suggest to pass the URI aka folder path (be it local or on S3). I also believe that that's consistent with ln.integrations.save_tiledbsoma_experiment()
. Tell me if I'm wrong!
Independent, given the name SomaExperimentCurator
we should adopt the .from_tiledbsoma()
and save_tiledbsoma_experiment()
names (backward compat).
Either is's all called Tiledbsoma
and we allow for additional logic that specifies that we're only looking for the Experiment
slot (that would need to be an argument). Or we call everything SomaExperiment
.
If there won't ever be a need to curate non-Experiment SOMA stores, the later is preferrable. Otherwise the former might be preferrable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either way naming has to be consistent and the docs should cross-reference these three pieces of API logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In most cases, this would be called on the URI, not on the Experiment stream object that you're passing. It also seems like an anti-pattern to just open the stream and not close it within a context manager. So, I suggest to pass the URI aka folder path (be it local or on S3). I also believe that that's consistent with
ln.integrations.save_tiledbsoma_experiment()
. Tell me if I'm wrong!
Generally agreed. Nevertheless, I think we should support both as users might already have an Experiment open. I'll adapt the example though so that this pattern is clearer.
I'll think about the naming...
Signed-off-by: Lukas Heumos <[email protected]>
…ure/tiledbsomaexperimentcurator
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
Fixes #2741
Fixes #2777
is_x
functions of the ScverseDataStructures into oneis_scversedatastructure
that can be parametrized with the type to check. This harmonizes the code and makes the behavior more predictable. I understand that this can harm readability and it might not be much better that one now has to pass the name of the expected data structure but we have this behavior across other places of the code and it might ultimately help us.save_artifact
for tiledbsoma for new style CuratorsSomaExperimentCurator
with associated tests and guide