Description of feature

The define_clonotypes function scales badly. There are two problems with it:

1. It could be faster: while it relies heavily on numpy, parts of it are implemented in pure Python.
2. Parallelization doesn't work properly with large data. Due to how multiprocessing works in Python, parallelization involves a lot of copying (see the sketch after this list). If parallelization worked properly, the speed would still be bearable if one throws enough cores at the problem.
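The copying problem in the second point can be illustrated with a toy example (not scirpy code; names and sizes are made up, and the array stands in for the real per-cell lookup structures):

```python
# Minimal illustration of why process-based parallelization copies a lot.
import numpy as np
from multiprocessing import Pool

rng = np.random.default_rng(0)
seqs = rng.integers(0, 4, size=(100_000, 60), dtype=np.int8)  # stand-in data, ~6 MB

def neighbors_of(i):
    # Every worker needs `seqs`. With the "spawn" start method (default on
    # macOS/Windows) the module is re-imported and the data re-created in every
    # worker; with "fork" the parent's memory is inherited, but task arguments
    # and results are still pickled and copied between processes.
    dist = (seqs != seqs[i]).sum(axis=1)  # crude per-row mismatch count
    return np.flatnonzero(dist <= 2)

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        nbrs = pool.map(neighbors_of, range(32))  # tiny slice, illustration only
    print(sum(len(n) for n in nbrs))
```

With realistic inputs (millions of cells and much larger lookup structures), this copying dominates, which is why throwing more cores at the problem does not help as much as it should.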
Where's the bottleneck of the function?
INPUT:
- two distance matrices, one for unique VJ sequences and one for unique VDJ sequences

OUTPUT:
- a clonotype id for each cell
CURRENT IMPLEMENTATION:
1. compute unique receptor configurations, i.e. combine cells with identical sequences into a single entry (fast)
2. build a lookup table from which the neighbors of each cell can be retrieved (fast enough)
3. loop through all unique receptor configurations and find their neighbors (SLOW; see the sketch below)
4. build a distance matrix (fast)
5. graph partition using igraph (fast)
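A minimal sketch (NOT the actual scirpy code) of steps 2-5, with hypothetical inputs, to show where the time goes: vj_of[i]/vdj_of[i] give the unique VJ/VDJ sequence index of receptor configuration i, and vj_nbrs[s]/vdj_nbrs[s] are the sets of sequence indices within the distance cutoff of sequence s (precomputed from the two distance matrices, self included).

```python
import numpy as np
import scipy.sparse as sp
import igraph as ig

def clonotype_ids(vj_of, vdj_of, vj_nbrs, vdj_nbrs):
    n = len(vj_of)
    # step 2: lookup tables from sequence index -> receptor configurations (fast enough)
    by_vj, by_vdj = {}, {}
    for i in range(n):
        by_vj.setdefault(vj_of[i], set()).add(i)
        by_vdj.setdefault(vdj_of[i], set()).add(i)
    # step 3: pure-Python loop over all unique receptor configurations (SLOW)
    rows, cols = [], []
    for i in range(n):
        cand_vj = set().union(*(by_vj.get(s, set()) for s in vj_nbrs[vj_of[i]]))
        cand_vdj = set().union(*(by_vdj.get(s, set()) for s in vdj_nbrs[vdj_of[i]]))
        for j in cand_vj & cand_vdj:  # a neighbor must match on both chains
            rows.append(i)
            cols.append(j)
    # step 4: sparse adjacency matrix of receptor configurations (fast)
    adj = sp.coo_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    # step 5: graph partition with igraph -> one clonotype id per configuration (fast)
    g = ig.Graph(n=n, edges=list(zip(adj.row.tolist(), adj.col.tolist())))
    return np.asarray(g.connected_components().membership)
```

Step 1 (collapsing cells into unique receptor configurations) happens before this function; the labels returned here would then be broadcast back to the individual cells.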
ALTERNATIVE IMPLEMENTATIONS I considered but discarded:
- Reindex the sequence distance matrices such that they match the table of unique receptor configurations, then perform matrix operations to combine the primary/secondary and TRA/TRB matrices. The problem with this approach is that large dense blocks can arise in the sparse matrices if many unique receptors have the same sequence (e.g. the same TRA but different TRBs); the sketch below illustrates this blow-up.
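To make the dense-block concern concrete, here is a small sketch with made-up names (M_vj and M_vdj are indicator matrices mapping unique receptor configurations to their unique VJ/VDJ sequence):

```python
import numpy as np
import scipy.sparse as sp

# 5 unique receptor configurations; 4 of them share the same VJ (TRA) sequence
# but have different VDJ (TRB) sequences.
vj_of = np.array([0, 0, 0, 0, 1])
vdj_of = np.array([0, 1, 2, 3, 4])
n_rec, n_vj, n_vdj = 5, 2, 5

M_vj = sp.coo_matrix((np.ones(n_rec), (np.arange(n_rec), vj_of)),
                     shape=(n_rec, n_vj)).tocsr()
M_vdj = sp.coo_matrix((np.ones(n_rec), (np.arange(n_rec), vdj_of)),
                      shape=(n_rec, n_vdj)).tocsr()

# Sparse sequence-level "within cutoff" matrices (here just the diagonal).
D_vj = sp.identity(n_vj, format="csr")
D_vdj = sp.identity(n_vdj, format="csr")

# Reindex to the receptor-configuration axis.
D_vj_rx = M_vj @ D_vj @ M_vj.T      # 5 x 5
D_vdj_rx = M_vdj @ D_vdj @ M_vdj.T  # 5 x 5

print(D_vj_rx.nnz)   # 17 -- the 4 receptors sharing one VJ sequence form a dense 4x4 block
print(D_vdj_rx.nnz)  # 5  -- no sharing, stays diagonal
```

With tens of thousands of receptor configurations sharing a sequence, such blocks grow quadratically and can make the nominally sparse result effectively dense.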
Possible solutions
- Reimplement the slow loop using jax/numba (this may also solve the parallelization problem and provide GPU support); see the numba sketch after this list.
- Combine steps 2-4 into a single step (maybe possible with sequence embedding -- see Autoencoder-based sequence embedding #369). Note that this would be an alternative route and wouldn't replace ir_dist/define_clonotypes completely.
- Special-casing: in the case of omniscope data (which only has TRB chains), the problem simplifies to reindexing a sparse matrix (second sketch below). If only one pair of sequences per cell is used, the problem is likely also simpler.
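A hedged sketch of the numba route: move the slow per-receptor loop into nopython code that walks the CSR arrays (indptr/indices) of the two sequence-level distance matrices. All names here are made up, and the inner scan over all receptors is deliberately naive (a real reimplementation would keep the candidate lookup from step 2); the point is only to show the njit/prange pattern, which would also give parallelization without per-worker copying.

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True, cache=True)
def count_pair_neighbors(vj_of, vdj_of, vj_indptr, vj_indices, vdj_indptr, vdj_indices):
    """For each receptor configuration i, count configurations j that are within
    the cutoff on BOTH chains. A real implementation would collect edges, not counts."""
    n = vj_of.shape[0]
    out = np.zeros(n, dtype=np.int64)
    for i in prange(n):  # parallel loop over receptor configurations
        svj, svdj = vj_of[i], vdj_of[i]
        for j in range(n):
            # is vj_of[j] among the within-cutoff neighbors of svj in the sparse VJ matrix?
            ok_vj = False
            for k in range(vj_indptr[svj], vj_indptr[svj + 1]):
                if vj_indices[k] == vj_of[j]:
                    ok_vj = True
                    break
            if not ok_vj:
                continue
            # same check on the VDJ chain
            for k in range(vdj_indptr[svdj], vdj_indptr[svdj + 1]):
                if vdj_indices[k] == vdj_of[j]:
                    out[i] += 1
                    break
    return out

# Example call, passing the CSR pieces of the scipy sparse distance matrices:
# count_pair_neighbors(vj_of, vdj_of, D_vj.indptr, D_vj.indices, D_vdj.indptr, D_vdj.indices)
```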
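For the TRB-only special case: one way to read "reindexing a sparse matrix" is to cluster the unique TRB sequences directly on their sparse distance matrix and then propagate the labels to cells by indexing, which sidesteps the dense-block issue entirely. Again a sketch with hypothetical names, not a proposal for the final API:

```python
import numpy as np
from scipy.sparse.csgraph import connected_components

def trb_only_clonotypes(D_trb, trb_seq_of_cell):
    """D_trb: sparse (n_seq x n_seq) matrix, nonzero where two unique TRB
    sequences are within the distance cutoff (diagonal included).
    trb_seq_of_cell: (n_cells,) index of each cell's unique TRB sequence."""
    _, seq_labels = connected_components(D_trb, directed=False)
    return seq_labels[trb_seq_of_cell]  # one clonotype id per cell
```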