Description of feature

The define_clonotypes function scales badly. There are two problems with it:

1. It could be faster: while it relies heavily on numpy, parts of it are implemented in pure Python.
2. Parallelization doesn't work properly with large data. Due to how multiprocessing works in Python, parallelization involves a lot of copying (see the sketch after this list). If parallelization worked properly, the speed would still be bearable if one throws enough cores at the problem.
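The copying problem in the second point can be illustrated with a toy example (not scirpy code; names and sizes are made up, and the array stands in for the real per-cell lookup structures):

```python
# Minimal illustration of why process-based parallelization copies a lot.
import numpy as np
from multiprocessing import Pool

rng = np.random.default_rng(0)
seqs = rng.integers(0, 4, size=(100_000, 60), dtype=np.int8)  # stand-in data, ~6 MB

def neighbors_of(i):
    # Every worker needs `seqs`. With the "spawn" start method (default on
    # macOS/Windows) the module is re-imported and the data re-created in every
    # worker; with "fork" the parent's memory is inherited, but task arguments
    # and results are still pickled and copied between processes.
    dist = (seqs != seqs[i]).sum(axis=1)  # crude per-row mismatch count
    return np.flatnonzero(dist <= 2)

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        nbrs = pool.map(neighbors_of, range(32))  # tiny slice, illustration only
    print(sum(len(n) for n in nbrs))
```

With realistic inputs (millions of cells and much larger lookup structures), this copying dominates, which is why throwing more cores at the problem does not help as much as it should.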
Where's the bottleneck of the function?
INPUT:
- two distance matrices, one for unique VJ sequences and one for unique VDJ sequences

OUTPUT:
- a clonotype id for each cell
CURRENT IMPLEMENTATION:
1. compute unique receptor configurations, i.e. combine cells with identical sequences into a single entry (fast)
2. build a lookup table from which the neighbors of each cell can be retrieved (fast enough)
3. loop through all unique receptor configurations and find their neighbors (SLOW; see the sketch below)
4. build a distance matrix (fast)
5. graph partition using igraph (fast)
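A minimal sketch (NOT the actual scirpy code) of steps 2-5, with hypothetical inputs, to show where the time goes: vj_of[i]/vdj_of[i] give the unique VJ/VDJ sequence index of receptor configuration i, and vj_nbrs[s]/vdj_nbrs[s] are the sets of sequence indices within the distance cutoff of sequence s (precomputed from the two distance matrices, self included).

```python
import numpy as np
import scipy.sparse as sp
import igraph as ig

def clonotype_ids(vj_of, vdj_of, vj_nbrs, vdj_nbrs):
    n = len(vj_of)
    # step 2: lookup tables from sequence index -> receptor configurations (fast enough)
    by_vj, by_vdj = {}, {}
    for i in range(n):
        by_vj.setdefault(vj_of[i], set()).add(i)
        by_vdj.setdefault(vdj_of[i], set()).add(i)
    # step 3: pure-Python loop over all unique receptor configurations (SLOW)
    rows, cols = [], []
    for i in range(n):
        cand_vj = set().union(*(by_vj.get(s, set()) for s in vj_nbrs[vj_of[i]]))
        cand_vdj = set().union(*(by_vdj.get(s, set()) for s in vdj_nbrs[vdj_of[i]]))
        for j in cand_vj & cand_vdj:  # a neighbor must match on both chains
            rows.append(i)
            cols.append(j)
    # step 4: sparse adjacency matrix of receptor configurations (fast)
    adj = sp.coo_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    # step 5: graph partition with igraph -> one clonotype id per configuration (fast)
    g = ig.Graph(n=n, edges=list(zip(adj.row.tolist(), adj.col.tolist())))
    return np.asarray(g.connected_components().membership)
```

Step 1 (collapsing cells into unique receptor configurations) happens before this function; the labels returned here would then be broadcast back to the individual cells.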
ALTERNATIVE IMPLEMENTATIONS I considered but discarded:
- Reindex the sequence distance matrices such that they match the table of unique receptor configurations, then perform matrix operations to combine the primary/secondary and TRA/TRB matrices. The problem with this approach is that large dense blocks can arise in the sparse matrices if many unique receptors have the same sequence (e.g. the same TRA but different TRBs); the sketch below illustrates this blow-up.
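To make the dense-block concern concrete, here is a small sketch with made-up names (M_vj and M_vdj are indicator matrices mapping unique receptor configurations to their unique VJ/VDJ sequence):

```python
import numpy as np
import scipy.sparse as sp

# 5 unique receptor configurations; 4 of them share the same VJ (TRA) sequence
# but have different VDJ (TRB) sequences.
vj_of = np.array([0, 0, 0, 0, 1])
vdj_of = np.array([0, 1, 2, 3, 4])
n_rec, n_vj, n_vdj = 5, 2, 5

M_vj = sp.coo_matrix((np.ones(n_rec), (np.arange(n_rec), vj_of)),
                     shape=(n_rec, n_vj)).tocsr()
M_vdj = sp.coo_matrix((np.ones(n_rec), (np.arange(n_rec), vdj_of)),
                      shape=(n_rec, n_vdj)).tocsr()

# Sparse sequence-level "within cutoff" matrices (here just the diagonal).
D_vj = sp.identity(n_vj, format="csr")
D_vdj = sp.identity(n_vdj, format="csr")

# Reindex to the receptor-configuration axis.
D_vj_rx = M_vj @ D_vj @ M_vj.T      # 5 x 5
D_vdj_rx = M_vdj @ D_vdj @ M_vdj.T  # 5 x 5

print(D_vj_rx.nnz)   # 17 -- the 4 receptors sharing one VJ sequence form a dense 4x4 block
print(D_vdj_rx.nnz)  # 5  -- no sharing, stays diagonal
```

With tens of thousands of receptor configurations sharing a sequence, such blocks grow quadratically and can make the nominally sparse result effectively dense.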
Possible solutions
- Reimplement the slow loop using jax/numba (this may also solve the parallelization problem and provide GPU support); see the numba sketch after this list.
- Combine steps 2-4 into a single step (maybe possible with sequence embedding -- see Autoencoder-based sequence embedding #369). Note that this would be an alternative route and wouldn't replace ir_dist/define_clonotypes completely.
- Special-casing: in the case of omniscope data (which only has TRB chains), the problem simplifies to reindexing a sparse matrix (second sketch below). If only one pair of sequences per cell is used, the problem is likely also simpler.
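A hedged sketch of the numba route: move the slow per-receptor loop into nopython code that walks the CSR arrays (indptr/indices) of the two sequence-level distance matrices. All names here are made up, and the inner scan over all receptors is deliberately naive (a real reimplementation would keep the candidate lookup from step 2); the point is only to show the njit/prange pattern, which would also give parallelization without per-worker copying.

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True, cache=True)
def count_pair_neighbors(vj_of, vdj_of, vj_indptr, vj_indices, vdj_indptr, vdj_indices):
    """For each receptor configuration i, count configurations j that are within
    the cutoff on BOTH chains. A real implementation would collect edges, not counts."""
    n = vj_of.shape[0]
    out = np.zeros(n, dtype=np.int64)
    for i in prange(n):  # parallel loop over receptor configurations
        svj, svdj = vj_of[i], vdj_of[i]
        for j in range(n):
            # is vj_of[j] among the within-cutoff neighbors of svj in the sparse VJ matrix?
            ok_vj = False
            for k in range(vj_indptr[svj], vj_indptr[svj + 1]):
                if vj_indices[k] == vj_of[j]:
                    ok_vj = True
                    break
            if not ok_vj:
                continue
            # same check on the VDJ chain
            for k in range(vdj_indptr[svdj], vdj_indptr[svdj + 1]):
                if vdj_indices[k] == vdj_of[j]:
                    out[i] += 1
                    break
    return out

# Example call, passing the CSR pieces of the scipy sparse distance matrices:
# count_pair_neighbors(vj_of, vdj_of, D_vj.indptr, D_vj.indices, D_vdj.indptr, D_vdj.indices)
```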
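For the TRB-only special case: one way to read "reindexing a sparse matrix" is to cluster the unique TRB sequences directly on their sparse distance matrix and then propagate the labels to cells by indexing, which sidesteps the dense-block issue entirely. Again a sketch with hypothetical names, not a proposal for the final API:

```python
import numpy as np
from scipy.sparse.csgraph import connected_components

def trb_only_clonotypes(D_trb, trb_seq_of_cell):
    """D_trb: sparse (n_seq x n_seq) matrix, nonzero where two unique TRB
    sequences are within the distance cutoff (diagonal included).
    trb_seq_of_cell: (n_cells,) index of each cell's unique TRB sequence."""
    _, seq_labels = connected_components(D_trb, directed=False)
    return seq_labels[trb_seq_of_cell]  # one clonotype id per cell
```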