✨ Schema-based curators: AnnDataCurator #2418

Merged
merged 32 commits into from
Feb 3, 2025

Conversation

falexwolf
Member

@falexwolf falexwolf commented Feb 2, 2025

This PR is part of a sequence of PRs that refactor the curators:


This PR introduces the counterpart to the schema-based DataFrameCurator introduced in:

The AnnData schema and Curator

Docs preview.

image
Compare with `DataFrameCurator`

Docs preview.

image

Notes

Inferred/annotating vs. validating schema

Originally, the plan was to populate artifact.schema with the inferred schema, which links (annotates) the artifact by all features of a dataset in the same way artifact._schemas_m2m already does, to make the artifact queryable by all features.

Because we decided to make Schema.composite a self-referential foreign key rather than a ManyToMany, every component can be used by only a single composite schema. This rules out using artifact.schema for an inferred composite schema: we'd have a large number of those because the detailed high-dimensional feature sets vary, and for each combination we'd need to create a copy of a low-cardinality feature set such as the obs schema in the quickstart example.
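To illustrate the constraint, here is a minimal sketch in plain Python rather than Django models. `Component`, `Composite`, and `attach` are hypothetical names that only mimic a single foreign-key pointer; they are not lamindb API.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass
class Composite:
    name: str


@dataclass
class Component:
    name: str
    # A single pointer, analogous to a self-referential foreign key (assumption:
    # this stands in for Schema.composite and is not the real model).
    composite: Composite | None = None


def attach(component: Component, composite: Composite) -> Component:
    """Link a component to a composite; refuse if it already belongs elsewhere."""
    if component.composite is not None and component.composite is not composite:
        raise ValueError(
            f"{component.name!r} already belongs to {component.composite.name!r}"
        )
    component.composite = composite
    return component


obs = Component("obs")
attach(obs, Composite("inferred_composite_1"))

# Reusing the same obs component in a second composite is not possible.
try:
    attach(obs, Composite("inferred_composite_2"))
    reused = True
except ValueError:
    reused = False

# The workaround is one copy of the obs component per composite.
obs_copy = attach(replace(obs, composite=None), Composite("inferred_composite_2"))
```

With a ManyToMany, the single pointer would become a collection and the copy would be unnecessary.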

With this PR, we now capture both the curation constraints (the "validating schema") and the inferred/annotating schemas (what we've always done under the name "feature sets"). There might be a more parsimonious solution if we dropped _schemas_m2m and replaced it with a good way to create and populate artifact.schema with an inferred schema, but we postpone this consideration to lamindb v2.

This also means that Schema.validated_by is not going to be used and should be hidden from the docs: artifact.schema indicates the validating schema (and not an inferred schema).
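The distinction between the two kinds of schema can be sketched as follows. This is a simplified illustration under stated assumptions: `SchemaSketch`, `ArtifactSketch`, and `curate` are hypothetical stand-ins, and the feature names are placeholders, not the actual lamindb models or data.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class SchemaSketch:
    name: str
    features: frozenset[str]


@dataclass
class ArtifactSketch:
    key: str
    # The validating schema: the constraints the dataset was curated against.
    schema: SchemaSketch | None = None
    # Inferred/annotating schemas ("feature sets"), one per slot.
    feature_sets: dict[str, SchemaSketch] = field(default_factory=dict)


def curate(
    key: str, observed: dict[str, set[str]], validating: SchemaSketch
) -> ArtifactSketch:
    """Validate against the constraints, then annotate with what was observed."""
    present = set().union(*observed.values())
    missing = validating.features - present
    if missing:
        raise ValueError(f"missing required features: {sorted(missing)}")
    artifact = ArtifactSketch(key=key, schema=validating)
    for slot, names in observed.items():
        artifact.feature_sets[slot] = SchemaSketch(f"inferred-{slot}", frozenset(names))
    return artifact


validating = SchemaSketch("obs-constraints", frozenset({"cell_type"}))
observed = {"obs": {"cell_type", "donor"}, "var": {"gene_a", "gene_b", "gene_c"}}
artifact = curate("example.h5ad", observed, validating)
```

Here `artifact.schema` stays small and reusable across datasets, while `artifact.feature_sets` varies per dataset and is what makes the artifact queryable by all of its features.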

Materials

Internal Notion page.


github-actions bot commented Feb 2, 2025

@falexwolf falexwolf changed the title Anndatacurator ✨ Schema-based curators: AnnDataCurator Feb 3, 2025

codecov bot commented Feb 3, 2025

Codecov Report

Attention: Patch coverage is 91.17647% with 21 lines in your changes missing coverage. Please review.

Project coverage is 91.49%. Comparing base (d503387) to head (b3a2f1d).
Report is 37 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| lamindb/curators/__init__.py | 91.30% | 8 Missing ⚠️ |
| lamindb/models.py | 72.41% | 8 Missing ⚠️ |
| lamindb/_schema.py | 91.48% | 4 Missing ⚠️ |
| lamindb/core/_feature_manager.py | 88.88% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2418      +/-   ##
==========================================
- Coverage   91.71%   91.49%   -0.23%     
==========================================
  Files          62       63       +1     
  Lines        9138     9697     +559     
==========================================
+ Hits         8381     8872     +491     
- Misses        757      825      +68     


@github-actions github-actions bot temporarily deployed to pull request February 3, 2025 11:17 Inactive
@github-actions github-actions bot temporarily deployed to pull request February 3, 2025 15:48 Inactive
@falexwolf falexwolf merged commit d38ca15 into main Feb 3, 2025
16 of 18 checks passed
@falexwolf falexwolf deleted the anndatacurator branch February 3, 2025 16:46
@@ -295,4 +339,3 @@ def _get_related_name(self: Schema) -> str:

Schema.members = members # type: ignore
Schema._get_related_name = _get_related_name
Schema.feature_sets = Schema._artifacts_m2m # backward compat
Member Author

This should have read Schema.artifacts instead of Schema.feature_sets to enable schema.artifacts pointing to the M2M.

But rather than this hacky solution, I want to properly redo the feature sets backward compatibility
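
For reference, the aliasing described above amounts to exposing one class-level descriptor under a second name. A minimal sketch, assuming a plain property in place of the Django M2M descriptor (`SchemaSketch` and its `_linked` attribute are hypothetical):

```python
class SchemaSketch:
    def __init__(self, linked):
        self._linked = list(linked)

    @property
    def _artifacts_m2m(self):
        # Stand-in for the many-to-many accessor on the real model (assumption).
        return self._linked


# Expose the same descriptor under the public name the comment asks for.
SchemaSketch.artifacts = SchemaSketch._artifacts_m2m

schema = SchemaSketch(["artifact_1", "artifact_2"])
```

Both names resolve to the same property object, so `schema.artifacts` and `schema._artifacts_m2m` return the same list.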


"""

class Meta(Record.Meta, TracksRun.Meta, TracksUpdates.Meta):
    abstract = False

_name_field: str = "name"
_aux_fields: dict[str, tuple[str, type]] = {"0": ("coerce_dtype", bool)}
Member Author

@Koncopd Here is an example of how to use it.
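
One way to read the `_aux_fields` mapping is that each short key maps to a field name and its expected type. The sketch below shows how such a mapping could back typed access to a JSON-like store; the `FeatureSketch` class, the flat `_aux` dict, and the `get_aux`/`set_aux` helpers are assumptions for illustration, not the actual lamindb implementation.

```python
from __future__ import annotations


class FeatureSketch:
    # Same shape as in the diff: short key -> (field name, expected type).
    _aux_fields: dict[str, tuple[str, type]] = {"0": ("coerce_dtype", bool)}

    def __init__(self) -> None:
        # Hypothetical JSON-like storage keyed by the short keys.
        self._aux: dict[str, object] = {}

    def _lookup(self, name: str) -> tuple[str, type]:
        """Return the short key and expected type for a field name."""
        for key, (field_name, field_type) in self._aux_fields.items():
            if field_name == name:
                return key, field_type
        raise KeyError(name)

    def set_aux(self, name: str, value: object) -> None:
        key, field_type = self._lookup(name)
        if not isinstance(value, field_type):
            raise TypeError(f"{name} expects {field_type.__name__}")
        self._aux[key] = value

    def get_aux(self, name: str, default: object = None) -> object:
        key, _ = self._lookup(name)
        return self._aux.get(key, default)


feature = FeatureSketch()
unset_value = feature.get_aux("coerce_dtype", default=False)
feature.set_aux("coerce_dtype", True)
```

Storing values under short keys keeps the serialized payload compact, while the mapping preserves a readable name and a type check at the Python level.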
