Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🚚 Move PertCurator from wetlab here and add CellxGene Curator test #2408

Merged
merged 9 commits into from
Feb 1, 2025

Conversation

falexwolf
Copy link
Member

@falexwolf falexwolf commented Jan 30, 2025

Is part of a sequence of PRs that refactors the curators:

Made a "micro-farland2020" dataset to speed up the tests that were running in the wetlab repository: https://lamin.ai/laminlabs/lamindata/transform/JreAW9tkAHrc

Also added a test for the CxG curator based on the notebook in cellxgene-lamin.

@falexwolf falexwolf changed the title 🚚 Move PertCurator from wetlab here 🚚 Move PertCurator from wetlab here Jan 31, 2025
@falexwolf
Copy link
Member Author

@Zethson, do you have an idea for how to speed the cxg and pertcurator tests up more and turn them more into proper unit tests?

(I hope that with the last commit things are passing; they've been passing locally in isolation all along but these notebook-derived tests seem to create entanglement)

Copy link
Member

@Zethson Zethson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Zethson, do you have an idea for how to speed the cxg and pertcurator tests up more and turn them more into proper unit tests?

All of our XCurator tests are currently like pipelines which is easiest to implement and acts like an integration test for the different XCurator functions but it won't ever allow us to parallelize them.

These tests only need super simple datasets with only 1 var index and 2 categoricals each. You already have simple datasets here but maybe we can downsample and reduce them even more.

Honestly, I think we need to sit down and find ways to speed up the fundamental DataFrameCurator. Maybe polars, maybe somehow more bulk transactions, maybe we can somehow cache more? Spending time on that is probably better than optimizing these tests now, no?

I don't have further simple ideas.

{"CRISPR": "genetic", "drug": "compound"}
)

adata.obs["tissue_type"] = "cell culture"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lines 26 to 41 don't need to be in the test. Just set that as default for the micro dataset.

@falexwolf
Copy link
Member Author

Thanks and understood!

It's a chicken egg problem. I want fast tests to do the refactor / the refactor will start to speed things up.

I will see what can be done more.

@falexwolf
Copy link
Member Author

falexwolf commented Feb 1, 2025

Pasting our current run times; total is ~5 min:

============================= slowest 50 durations =============================
130.46s call     tests/curators/test_cxg_curator.py::test_cxg_curator
37.81s call     tests/curators/test_pert_curator.py::test_pert_curator
34.19s call     tests/curators/test_cat_curators.py::test_spatialdata_curator
6.59s call     tests/curators/test_cat_curators.py::test_df_curator
6.01s call     tests/curators/test_cat_curators.py::test_soma_curator
5.67s call     tests/curators/test_cat_curators.py::test_soma_curator_genes_columns
5.66s call     tests/curators/test_cat_curators.py::test_anndata_curator[donor]
5.59s call     tests/curators/test_cat_curators.py::test_anndata_curator[all]
3.44s call     tests/curators/test_cat_curators.py::test_unvalidated_data_object

Copy link

codecov bot commented Feb 1, 2025

Codecov Report

Attention: Patch coverage is 76.75676% with 43 lines in your changes missing coverage. Please review.

Project coverage is 91.54%. Comparing base (d503387) to head (50552fe).
Report is 30 commits behind head on main.

Files with missing lines Patch % Lines
lamindb/curators/__init__.py 76.75% 43 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2408      +/-   ##
==========================================
- Coverage   91.71%   91.54%   -0.18%     
==========================================
  Files          62       62              
  Lines        9138     9646     +508     
==========================================
+ Hits         8381     8830     +449     
- Misses        757      816      +59     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link

github-actions bot commented Feb 1, 2025

@github-actions github-actions bot temporarily deployed to pull request February 1, 2025 02:38 Inactive
@falexwolf falexwolf changed the title 🚚 Move PertCurator from wetlab here 🚚 Move PertCurator from wetlab with tests here and add add CellXGeneCurator test Feb 1, 2025
@falexwolf falexwolf changed the title 🚚 Move PertCurator from wetlab with tests here and add add CellXGeneCurator test 🚚 Move PertCurator from wetlab here and add add CellXGeneCurator test Feb 1, 2025
@falexwolf falexwolf changed the title 🚚 Move PertCurator from wetlab here and add add CellXGeneCurator test 🚚 Move PertCurator from wetlab here and add add CellXGeneCurator test Feb 1, 2025
@falexwolf falexwolf changed the title 🚚 Move PertCurator from wetlab here and add add CellXGeneCurator test 🚚 Move PertCurator from wetlab here and add CellXGeneCurator test Feb 1, 2025
@falexwolf falexwolf changed the title 🚚 Move PertCurator from wetlab here and add CellXGeneCurator test 🚚 Move PertCurator from wetlab here and add CellxGeneCurator test Feb 1, 2025
@falexwolf falexwolf changed the title 🚚 Move PertCurator from wetlab here and add CellxGeneCurator test 🚚 Move PertCurator from wetlab here and add CellxGene Curator test Feb 1, 2025
@falexwolf
Copy link
Member Author

falexwolf commented Feb 1, 2025

The commits before already sped up things to ~3 min total, bringing down the CxG test from 130 sec to 50 sec:

50.40s call     tests/curators/test_cxg_curator.py::test_cxg_curator
29.27s call     tests/curators/test_cat_curators.py::test_spatialdata_curator
28.43s call     tests/curators/test_pert_curator.py::test_pert_curator
5.35s call     tests/curators/test_cat_curators.py::test_soma_curator_genes_columns
4.90s call     tests/curators/test_cat_curators.py::test_soma_curator

@falexwolf
Copy link
Member Author

More to come in the next PR with the first refacor.

So far, everything has been merely copy & pasted.

@falexwolf falexwolf merged commit 47d5e07 into main Feb 1, 2025
15 of 17 checks passed
@falexwolf falexwolf deleted the pertcurator branch February 1, 2025 05:59
@github-actions github-actions bot temporarily deployed to pull request February 1, 2025 06:02 Inactive
@falexwolf
Copy link
Member Author

In the next PR, the CxG test runtime goes further down to 20sec by removing the coupling to laminlabs/cellxgene: #2412 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants