-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
✨ Schema-based curators: AnnDataCurator
#2418
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2418 +/- ##
==========================================
- Coverage 91.71% 91.49% -0.23%
==========================================
Files 62 63 +1
Lines 9138 9697 +559
==========================================
+ Hits 8381 8872 +491
- Misses 757 825 +68 ☔ View full report in Codecov by Sentry. |
@@ -295,4 +339,3 @@ def _get_related_name(self: Schema) -> str: | |||
|
|||
Schema.members = members # type: ignore | |||
Schema._get_related_name = _get_related_name | |||
Schema.feature_sets = Schema._artifacts_m2m # backward compat |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should have read Schema.artifacts
instead of Schema.feature_sets
to enable schema.artifacts
pointing to the M2M.
But rather than this hacky solution, I want to properly redo the feature sets backward compatibility
|
||
""" | ||
|
||
class Meta(Record.Meta, TracksRun.Meta, TracksUpdates.Meta): | ||
abstract = False | ||
|
||
_name_field: str = "name" | ||
_aux_fields: dict[str, tuple[str, type]] = {"0": ("coerce_dtype", bool)} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Koncopd Here is an example for how to use it.
Is part of a sequence of PRs that refactors the curators:
Curator.validate()
throw an error #2417BaseCurator
asCurator
and introduceCatCurator
#2416organism
handling in curators #2415CatCurator
#2413CellxGene
schema #2412PertCurator
fromwetlab
here and addCellxGene
Curator
test #2408CellxGene
Curator
fromcellxgene-lamin
here #2403DataFrameCurator
#2388This PR introduces the counterpart to the schema-based
DataFrameCurator
introduced in:DataFrameCurator
#2388The
AnnData
schema andCurator
Docs preview.
Compare with `DataFrameCurator`
Docs preview.
Notes
Inferred/annotating vs. validating schema
Originally the plan was to populate
artifact.schema
with the inferred schema that links/annotates by all all features of a dataset in the same wayartifact._schemas_m2m
already does to make the artifact queryable by all features.Because we decided to make
Schema.composite
a self-referential foreign key rather than a ManyToMany, every component can only be used by a single composite schema. This prohibits to useartifact.schema
for an inferred composite schema of which we'd have a high number due to variations in the detailed high-dimensional feature sets. For each combination, we'd need to create a copy of a low-cardinality feature set like theobs
schema in the quickstart example.With this PR we now capture both the information about the curation constraints (the "validating schema") and about the inferred/annotating schemas (what we've always done under the name "feature sets"). There might be a more parsimonious solution if we dropped
_schemas_m2m
and replace it with a good way to create and populateartifact.schema
with an inferred schema. But we postpone this consideration to lamindb v2.This also means that
Schema.validated_by
is not going to be used and should be hidden from the docs:artifact.schema
indicates the validating schema (and not an inferred schema).Materials
Internal Notion page.