Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

speed up define_clonotypes #368

Closed
grst opened this issue Oct 9, 2022 · 0 comments · Fixed by #470
Closed

speed up define_clonotypes #368

grst opened this issue Oct 9, 2022 · 0 comments · Fixed by #470

Comments

@grst
Copy link
Collaborator

grst commented Oct 9, 2022

Description of feature

The define_clonotypes function scales badly. There are two problems with it

  • it could be faster (while it relies heavily on numpy, there are parts implemented in Python)
  • parallelization doesn't work properly with large data. Due to how multiprocessing is implemented in Python, parallelization involves a lot of copying. If parallelization worked properly, the speed would still be bearable if one throws enough cores at the problem.

Where's the bottleneck of the function?

INPUT:

  • 2 distance matrices, one for unique VJ sequences, one for unique VDJ sequences

OUTPUT:

  • a clonotype id for each cell

CURRENT IMPLEMENTATION:

  1. compute unique receptor configurations (i.e. combining cells with the same sequences into a single entry) (fast)
  2. build a lookup table from which the neighbors of each cell can be retrieved (fast enough)
  3. loop through all unique receptor configurations and find neighbors (SLOW)
  4. build a distance matrix (fast)
  5. graph partition using igraph (fast)

ALTERNATIVE IMPLEMENTATIONS I considered but discarded

  • reindexing sequence distance matrices such that they match the table of unique receptor configurations
  • Then perform matrix operations to combine primary/secondary and TRA/TRB matrices.
  • The problem with this approach is that large dense blocks in the sparse matrices can arise if many unique receptors have the same sequence (e.g. same TRA but different TRB).

Possible solutions

  • fix parallelization (shared memory)
  • reimplement using jax/numba (this may also solve the parallelization and provide GPU support)
  • Combine 2-4 into a single step (maybe possible with sequence embedding -- see Autoencoder-based sequence embedding #369 ). Note that this would be an alternative route and wouldn't replace ir_dist/define_clonotypes completely.
  • Special-casing: In the case of omniscope data (which only has TRB chains), the problem simplifies to reindexing a sparse matrix. If using only one pair of sequences per cell, the problem is likely also simpler.
@grst grst added this to scirpy-dev May 28, 2024
@grst grst moved this to On Hold in scirpy-dev May 28, 2024
@grst grst closed this as completed in #470 Sep 8, 2024
@github-project-automation github-project-automation bot moved this from In progress to Done in scirpy-dev Sep 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

1 participant