Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python] Initial work toward PyTorch data loaders #2823

Draft
wants to merge 75 commits into
base: main
Choose a base branch
from

Conversation

bkmartinjr
Copy link
Member

@bkmartinjr bkmartinjr commented Jul 31, 2024

Notes to reviewers:

  • APIs have changed from the CZI contribution. The demo notebook (tutorial_pytorch.ipynb) has been updated to run with all API modifications.
  • This PR is for an initial drop we plan to use for pre-release feedback. More work is required prior to release
  • This is set up as a separate Python package, with its own config/build/etc.

This PR contains a PyTorch iterable-style DataSet/DataPipe for use with SOMA Experiment. Initial code contributed by the Chan Zuckerberg single-cell team. Modifications to the original code include:

  • stand-alone Python package tiledbsoma_ml
  • significant refactoring to improve performance
  • CI and other config for stand-alone Python package
  • improved context/config handling - pass through all user-specified configuration allowing use with other storage solutions, etc.
  • fix lint and typing issues identified by the CI pipeline in this repo
  • update doctrings and copyright
  • some minor enhancements to unit tests
  • added torch.utils.data.IterableDataset alongside the existing torchdata.datapipes.iter.IterDataPipe, acknowledging that the torchdata.datapipes is deprecated and slated for removal
  • add Python 3.8 support
  • API refinement to better align with ExperimentAxisQuery
  • modifications to support DDP (mult-GPU) training.

API change summary:

  • remove sparse
  • constructor takes ExperimentAxisQuery, rather than than an Experiment
  • reworked I/O and shuffle chunk size param for better API UX

Benchmarking results vs the cellxgene-census loader will be forthcoming shortly.

Copy link

codecov bot commented Jul 31, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.79%. Comparing base (b3feada) to head (5dae82a).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2823      +/-   ##
==========================================
+ Coverage   89.64%   89.79%   +0.14%     
==========================================
  Files          39       39              
  Lines        4096     4096              
==========================================
+ Hits         3672     3678       +6     
+ Misses        424      418       -6     
Flag Coverage Δ
python 89.79% <ø> (+0.14%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
python_api 89.79% <ø> (+0.14%) ⬆️
libtiledbsoma ∅ <ø> (∅)

@johnkerl johnkerl changed the title initial contribution of ExperimentDataPipe/Loader from CZI [python] Initial contribution of ExperimentDataPipe/Loader from CZI Aug 6, 2024
@bkmartinjr bkmartinjr changed the title [python] Initial contribution of ExperimentDataPipe/Loader from CZI [python] Initial work toward PyTorch data loaders Aug 20, 2024
) -> torch.utils.data.DataLoader:
"""Factory method for :class:`torch.utils.data.DataLoader`. This method can be used to safely instantiate a
:class:`torch.utils.data.DataLoader` that works with :class:`tiledbsoma_ml.ExperimentAxisQueryIterableDataset`
or :class:`tiledbsoma_ml.ExperimentAxisQueryIterDataPipe`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
or :class:`tiledbsoma_ml.ExperimentAxisQueryIterDataPipe`.
or :class:`tiledbsoma_ml.ExperimentAxisQueryDataPipe`.

I don't think ExperimentAxisQueryIterDataPipe is a thing but I could be missing something...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used both names, all over the place. The original idea was to reflect the parent/base class, from which the API pattern is borrowed. Given this, I've changed it to IterDataPipe, which is the torchdata name for this style of pipe.

Copy link
Member

@aaronwolen aaronwolen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome. I'm excited to start testing it out. The docs are comprehensive and super clear. My suggested changes are largely minor typos and small improvements for clarity/consistency.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants