Allowing `extract_array` calls to use pre-indexed grid information? #91
Oh, this doesn't have anything to do with the HDF5 representations; the discussion in tatami-inc/beachmat#20 only involves in-memory matrices. You can see some timings there comparing subsetting via Matrix's `[` against the precomputed index.

(The precomputed index basically involves running through the set of non-zero values and their row indices, figuring out the start/stop positions for each row chunk in each column, and then using that information to extract the row chunk without having to do a search of some kind, which is what I assume Matrix's CHOLMOD code is doing.)

My implementation of this precomputed index gives me a 10-fold speedup over simply iterating across row-wise blocks of the in-memory `dgCMatrix`.
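To make the parenthetical concrete, here is a rough R sketch of such an index. `precompute_row_block_index` is a made-up name, and a real implementation would walk `x@i` in a single pass in C/C++ rather than calling `findInterval()` per column:

```r
library(Matrix)

# Sketch: for each column of a dgCMatrix, record the 0-based offset into
# x@i / x@x at which each row block starts. Row indices within a column are
# sorted, so this information is cheap to collect.
precompute_row_block_index <- function(x, block_starts) {
    p <- x@p        # 0-based offsets delimiting each column's run in x@i / x@x
    i <- x@i + 1L   # 1-based row indices of the non-zero values
    index <- matrix(0L, nrow = length(block_starts), ncol = ncol(x))
    for (j in seq_len(ncol(x))) {
        rows_j <- i[p[j] + seq_len(p[j + 1L] - p[j])]  # sorted rows in column j
        # count the entries of column j that fall before each block's first row:
        index[, j] <- p[j] + findInterval(block_starts - 1L, rows_j)
    }
    index  # index[b, j] = offset of the first entry of row block b in column j
}

x <- rsparsematrix(10000, 2000, density = 0.05)
idx <- precompute_row_block_index(x, block_starts = seq(1L, nrow(x), by = 1000L))
```

Row block `b` of column `j` then occupies `x@x[(idx[b, j] + 1):idx[b + 1, j]]` (with `x@p[j + 1]` as the endpoint for the last block), so each extraction is a direct slice with no search at all.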
I see, nice! What other sparse in-memory DelayedArray backends could benefit from something like that?

This kind of precomputed index seems pretty heavy (almost half the size of `x` itself). Furthermore, it seems that the size of

More seriously, if we had a matrix-like container that supports chunking where the chunks are small `dgCMatrix` objects, you could put

Note that one way to construct something analogous to

BTW I'm still not clear why we need to do block processing at all on in-memory objects. What's the use case?
Responding in reverse order:
The strongest use case is when the operation discards sparsity. For example, I might be computing ranks, in which case I want to process by blocks so that the peak memory usage is never too high (case in point: aertslab/AUCell#16); see the sketch below. Another use case is that I want to split the job across multiple workers. This requires me to split the matrix up somehow, and the block processing machinery is a natural and easy way to do that; just set `BPPARAM=`.
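To make the ranking example concrete, here is a minimal sketch using DelayedArray's standard `blockApply()` and `colAutoGrid()`; the column-wise ranks, the grid choice and `as.sparse=FALSE` are my assumptions, not something prescribed by the thread:

```r
library(DelayedArray)
library(Matrix)

y <- DelayedArray(rsparsematrix(20000, 1000, density = 0.01))

# Ranking discards sparsity, so materializing all ranks at once would cost a
# full dense matrix. Processing one column block at a time caps peak memory at
# a single dense block; add BPPARAM= to fan the blocks out to workers instead.
blocks <- blockApply(y, function(block) apply(block, 2L, rank),
                     grid = colAutoGrid(y), as.sparse = FALSE)
ranks <- do.call(cbind, blocks)
```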
Indeed, this is actually what I do in
This makes sense, but it is less appealing if it requires the user to do the transformation explicitly. In the context of the rest of the analysis, many other functions are optimized for `dgCMatrix` inputs.
Yes, unfortunately. I suspect this example is slightly pathological in that the 1% density of non-zero values is unusually low (because I couldn't be bothered waiting for `rsparsematrix()` to generate anything denser). That said, I suspect it is possible to halve the size of the index with some extra work: only the start positions need to be stored for each viewport, given that the end positions of one viewport can be derived from the start positions of the next.
In practice, it may be more convenient to store both the start and end positions for each viewport, rather than having to jump to the next viewport to retrieve the end positions. This is still okay because the end positions of one viewport are literally the same vector as the start positions of the next viewport, allowing us to exploit R's copy-on-write (COW) semantics to avoid an actual copy.
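As a toy illustration of the aliasing (not beachmat's actual data structure), `viewports[[k]]$end` and `viewports[[k + 1]]$start` below are the same R vector:

```r
# 'cuts' holds made-up per-column offsets at each boundary between consecutive
# viewports. Every viewport stores both a start and an end vector for
# convenience, but assigning the same vector to two slots does not copy it,
# thanks to R's copy-on-write semantics.
cuts <- list(c(0L, 3L, 7L), c(2L, 5L, 9L), c(4L, 8L, 12L))
viewports <- lapply(seq_len(length(cuts) - 1L), function(k) {
    list(start = cuts[[k]], end = cuts[[k + 1L]])  # shares, doesn't duplicate
})
```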
Potentially, everything that I work on that uses an in-memory sparse matrix as the seed. The most obvious ones are
Was fiddling with something else and happened across some timings that might be helpful to distill the problem:

```r
library(Matrix)
x <- rsparsematrix(50000, 20000, density=0.1)

system.time(colSums(x))
##    user  system elapsed
##   0.262   0.001   0.264

y <- DelayedArray(x)
system.time(beachmat::rowBlockApply(y, colSums, grid=TRUE)) # final reduce is negligible
##    user  system elapsed
##   2.468   0.371   2.849

system.time(colSums(y))
##    user  system elapsed
##  22.413   2.089  25.034

system.time(blockApply(y, colSums, grid=rowAutoGrid(x))) # for a strict like-for-like
##    user  system elapsed
##  35.076   3.796  39.089
```

The differences from
The recent conversation in theislab/zellkonverter#34 reminded me of some work I did in tatami-inc/beachmat#20. Briefly, the idea was to speed up row-based block processing of `dgCMatrix` objects by performing a single pass over the non-zero elements beforehand to identify the start and end of each row block in each column. This avoids the need for costly per-column binary searches when each row block is extracted in the usual way, and gives a ~10-fold speed-up in row-based processing of `dgCMatrix`es.

Now I'm wondering whether this approach can be generalized somehow so that other DelayedArray backends can benefit. Perhaps functions like `rowAutoGrid()` can decorate the grid object with extra information that allows `extract_array` to efficiently obtain the necessary bits and pieces, if a suitable object like a `dgCMatrix` is passed?

Happy to give this - or other ideas - a crack with a PR if there is some interest.
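For what it's worth, here is a hypothetical sketch of the shape this could take, reusing the `precompute_row_block_index()` sketch from an earlier comment. The `indexed_grid` bundle and `extract_block_values()` helper are made up for illustration; how `extract_array()` itself would get access to the grid's decoration is exactly the open design question:

```r
library(DelayedArray)
library(Matrix)

x <- rsparsematrix(10000, 2000, density = 0.05)
spacing <- 1000L

# Bundle a row-wise grid with its precomputed index; in a real implementation
# the decoration would live on the grid object itself.
indexed_grid <- list(
    grid = RegularArrayGrid(dim(x), spacings = c(spacing, ncol(x))),
    precomputed = precompute_row_block_index(
        x, block_starts = seq(1L, nrow(x), by = spacing))
)

# Pre-indexed extraction of one row block of one column: a direct slice of
# x@x, with no binary search. 'b' is the row-block number, 'j' the column.
extract_block_values <- function(x, ig, b, j) {
    idx <- ig$precomputed
    from <- idx[b, j] + 1L
    to <- if (b < nrow(idx)) idx[b + 1L, j] else x@p[j + 1L]
    if (from > to) numeric(0) else x@x[from:to]
}

extract_block_values(x, indexed_grid, b = 2L, j = 7L)  # rows 1001-2000, column 7
```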