
Reshaping Array #47

Open
muschellij2 opened this issue Jun 27, 2019 · 18 comments
@muschellij2

First, thanks for the package. It's exactly what we needed. We're working on neuroimaging data: https://github.com/muschellij2/niftiArray. We're trying to do statistics over multiple 3D arrays. I believe this doesn't fit well with DelayedMatrixStats as matrixStats requires 2D matrices.

Similar to #30 and the comment here #30 (comment), I was wondering about "reshaping" an array. Really, I want to be able to vectorize a DelayedArray, or to convert a 4D DelayedArray to a 2D matrix where the first 3 dimensions become the rows and the 4th dimension becomes the columns. My current solution is to realize the array, convert it to a matrix, and then create a DelayedArray or DelayedMatrix.

Ideally, I'd be able to do this without realizing the array into memory. I'm not sure if HDF5 (the backend we're using) would support this, but we would find it useful. Trying to set dim or change the dimensions fails due to the (correct) checks on DelayedMatrix. Just wondering if it's possible, if it makes sense with HDF5, and, if both are true, whether it could be implemented? I can send a reprex if necessary, but the setting of dimensions is well documented.
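To make the request concrete, here is the reshape in question on an ordinary R array (small, made-up dimensions); the goal is the same `dim<-` semantics on a DelayedArray, without realization:

```r
## Ordinary-array version of the desired reshape (hypothetical 2x3x4x5 array);
## the goal is the same semantics on a DelayedArray, without realization.
a <- array(seq_len(2 * 3 * 4 * 5), dim = c(2, 3, 4, 5))

m <- a
dim(m) <- c(2 * 3 * 4, 5)   # collapse the first 3 dims into rows, 4th dim = columns

## Column-major layout is preserved: column j of 'm' is a[ , , , j] vectorized.
stopifnot(identical(m[, 5], as.vector(a[, , , 5])))
```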

@muschellij2
Author

I posted an MWE at Bioconductor/HDF5Array#16, which presents a solution, but not a satisfactory one.

@hpages
Contributor

hpages commented Jul 2, 2019

Hi,

I've implemented the ReshapedHDF5Array class in HDF5Array 1.13.3 (HDF5Array 1.13.3 belongs to BioC 3.10, which is the current development version of Bioconductor).

The ReshapedHDF5Array class is a DelayedArray subclass for representing an HDF5 dataset with a user-supplied upfront virtual reshaping. See ?ReshapedHDF5Array for more information. Hopefully this will help with your use case.

FWIW I initially tried to implement the reshaping as a delayed operation. This would be a better solution because it would be backend agnostic and the reshaping could then be applied at any time, i.e. even after the DelayedArray object has gone thru some delayed operations like subsetting, cbind(), aperm(), etc. However I quickly realized that this was going to be a lot more complicated than I anticipated, so I opted for the ReshapedHDF5Array solution. It's simpler to implement but, unfortunately, not as satisfying as the delayed-op approach.

H.

@muschellij2
Author

Can't get it to install. https://github.com/Bioconductor/HDF5Array/blob/master/DESCRIPTION#L18 requires DelayedArray >= 0.11.4, but the master branch of DelayedArray is at 0.11.3: https://github.com/Bioconductor/DelayedArray/blob/master/DESCRIPTION#L11. Tried Bioc devel, but it still installs an older version.

BiocManager::install("HDF5Array", version = "devel", ask = FALSE)
#> Bioconductor version 3.10 (BiocManager 1.30.4), R 3.6.0 (2019-04-26)
#> Installing package(s) 'HDF5Array'
#> 
#> The downloaded binary packages are in
#>  /var/folders/1s/wrtqcpxn685_zk570bnx9_rr0000gr/T//Rtmp5mQ9Od/downloaded_packages
packageVersion("HDF5Array")
#> [1] '1.13.2'

Created on 2019-07-02 by the reprex package (v0.3.0)

Just getting used to Bioconductor workflow, so I'm likely missing the repo I could find the updated DelayedArray.

@hpages
Contributor

hpages commented Jul 2, 2019

Oops, my bad sorry. I meant to have this latest version of HDF5Array (1.13.3) depend on DelayedArray >= 0.11.3, not on DelayedArray >= 0.11.4. Sorry for that.

(FWIW I have a local version of DelayedArray on my laptop that is at version 0.11.4, with some uncommitted changes to it, hence my confusion.)

I just adjusted the requirements in HDF5Array: Bioconductor/HDF5Array@f2f28bc

Please try again.

BTW only a few minutes after pushing the new ReshapedHDF5Array class to HDF5Array, I realized that it wouldn't actually be so hard to implement the reshaping as a delayed operation. I'd like to do it at some point. Then the ReshapedHDF5Array stuff will no longer be needed so maybe I'll remove it.

@muschellij2
Author

muschellij2 commented Jul 2, 2019 via email

@muschellij2
Author

Also, another thing I found: to reshape an array to a matrix with 1 column, I first needed to append a 1 to the dimensions of the data, e.g.

dim(from) = c(dim(from), 1)

For example, if dim(from) = c(10, 5, 4) and I want dim(from) = c(200, 1) I would have to make dim(from) = c(10, 5, 4, 1) first.
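In base R the trailing unit dimension is a no-op on the data layout, which is why the two-step route gives the same elements in the same order:

```r
## The dimensions from the example above: 10 x 5 x 4, targeting 200 x 1.
from <- array(seq_len(200), dim = c(10, 5, 4))
v <- as.vector(from)

## Step 1: append a unit dimension (doesn't touch the data layout).
dim(from) <- c(dim(from), 1)        # now 10 x 5 x 4 x 1

## Step 2: collapse the first three dimensions.
dim(from) <- c(200, 1)

stopifnot(identical(as.vector(from), v))
```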

@hpages
Contributor

hpages commented Jul 3, 2019

This 2-step reshaping is a consequence of a restriction on the kind of reshaping supported by the ReshapedHDF5Array() constructor. The restriction is documented in ?ReshapedHDF5Array:

Also please note that arbitrary reshapings are not supported.
Only reshapings that reduce the number of dimensions by collapsing
a group of consecutive dimensions into a single dimension are supported.
For example, reshaping a 10 x 3 x 5 x 1000 array as a 10 x 15 x 1000
array or as a 150 x 1000 matrix is supported.

The important bit here is "Only reshapings that reduce the number of dimensions". So adding dimensions would have to be done as a separate operation (delayed) on the object returned by the ReshapedHDF5Array() constructor.

Note that the 2-step reshaping will go away once reshaping is implemented as a delayed operation.

BTW I'm curious about what benefit you're hoping for by fully unfolding a 3-dimensional dataset into a 1-dimensional DelayedArray object or into a Nx1 DelayedMatrix object. In particular, I can't think of any DelayedMatrixStats operation that would make sense on this Nx1 matrix.

@muschellij2
Author

muschellij2 commented Jul 3, 2019 via email

@hpages
Contributor

hpages commented Jul 3, 2019

I see. Interesting use case. Note that if the 1000 3D datasets were stored as a single 4D dataset (512x512x512x1000), then things would be much easier. It would just be a matter of creating a ReshapedHDF5Matrix object with something like ReshapedHDF5Array(filepath, name, dim=c(512*512*512, 1000)) and feeding that to DelayedMatrixStats::rowMeans2() or DelayedMatrixStats::rowMedians(). This would avoid the overhead of the 1000 reshapings + final cbind'ing of 1000 objects that you need to do with separate files. I suspect that this overhead would introduce a significant slowdown.
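Put together, the suggestion above might look like this (a sketch only: "stack.h5" and "counts" are hypothetical names, and it assumes HDF5Array >= 1.13.3 and DelayedMatrixStats are installed):

```r
## Sketch only: "stack.h5"/"counts" are hypothetical, and this assumes
## HDF5Array >= 1.13.3 (for ReshapedHDF5Array) plus DelayedMatrixStats.
library(HDF5Array)
library(DelayedMatrixStats)

## View the single 512x512x512x1000 dataset as an Nx1000 matrix...
M <- ReshapedHDF5Array("stack.h5", "counts", dim = c(512 * 512 * 512, 1000))

## ...then summarize across the 1000 images, one value per voxel.
voxel_means   <- rowMeans2(M)
voxel_medians <- rowMedians(M)
```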

H.

@muschellij2
Author

muschellij2 commented Jul 3, 2019 via email

@hpages
Contributor

hpages commented Jul 3, 2019

Something that maybe would help is the ability to "stack" the 1000 3D datasets into a 4D dataset. That would need to be a new array operation so we would need to come up with a new verb for it (I don't think there is any equivalent in base R for doing this on ordinary arrays). This "stacking" is something I actually had in mind for a while but we didn't really have a use case for it (until now) so I lacked the incentive for implementing it. I don't anticipate it would be too hard to implement this "stacking" as a delayed operation though. One easy way would be to add the extra dimension to all the objects to stack (with dim(x) <- c(dim(x), 1)) and then to abind() them together along the last dimension.
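On ordinary arrays, the proposed stacking reduces to appending a unit dimension and concatenating in column-major order (small hypothetical arrays stand in for the 512x512x512 datasets):

```r
## Base-R sketch of the proposed "stacking": give each array a trailing unit
## dimension, then bind along it. Small hypothetical arrays stand in for the
## large 3D datasets.
a1 <- array(1:24,  dim = c(2, 3, 4))
a2 <- array(25:48, dim = c(2, 3, 4))

dim(a1) <- c(dim(a1), 1)            # 2 x 3 x 4 x 1
dim(a2) <- c(dim(a2), 1)

## Binding along the last (slowest-varying) dimension is plain concatenation
## in column-major order, which is what a delayed abind() would do lazily.
stacked <- array(c(a1, a2), dim = c(2, 3, 4, 2))

stopifnot(identical(stacked[, , , 1], array(1:24, dim = c(2, 3, 4))))
```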

So for your use case you could stack the 1000 3D datasets into a single 4D dataset (512x512x512x1000), then reshape into a Nx1000 DelayedMatrix (once this reshaping is available as a delayed operation). I believe this could be a little bit more efficient than the ReshapedHDF5Array approach. Not sure, would be interesting to compare.

@hpages
Contributor

hpages commented Jul 3, 2019

I'm just seeing your last post now. Yes, nice trick to use a combination of aperm/arbind to achieve binding along an arbitrary dimension. That's what my proposed abind() above would do (except that it would use a native implementation so would avoid the overhead of permuting the dimensions back and forth).
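For reference, the aperm trick on ordinary arrays looks like this: permute the target dimension to the last position, concatenate, then undo the permutation. (A native abind() would skip the two permutations.)

```r
## Bind two arrays along an arbitrary dimension 'along' via aperm():
## move 'along' last, concatenate in column-major order, permute back.
bind_along <- function(x, y, along) {
  nd   <- length(dim(x))
  perm <- c(setdiff(seq_len(nd), along), along)   # move 'along' last
  xp   <- aperm(x, perm)
  yp   <- aperm(y, perm)
  zp   <- array(c(xp, yp),
                dim = c(dim(xp)[-nd], dim(xp)[nd] + dim(yp)[nd]))
  aperm(zp, order(perm))                          # undo the permutation
}

a <- array(1:24,    dim = c(2, 3, 4))
b <- array(101:124, dim = c(2, 3, 4))
z <- bind_along(a, b, along = 2)                  # 2 x 6 x 4
stopifnot(identical(z[, 1:3, ], a), identical(z[, 4:6, ], b))
```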

@hpages
Contributor

hpages commented Jul 3, 2019

Also, try to use intermediate realizations to circumvent some of the current limitations. For example after you've stacked the 1000 datasets together, you won't be able to reshape the result into an Nx1000 matrix (because the delayed reshaping is not supported yet) but you can realize it to a new HDF5 file. Then you can use ReshapedHDF5Array() on that new HDF5 file to reshape into a matrix. You will also pay much less overhead now when you call DelayedMatrixStats functions on this object because it doesn't carry any delayed op (except for the reshaping which can be seen as a seed built-in delayed op).
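The realize-then-reshape workflow might look like this (a sketch only: file and dataset names are hypothetical, it assumes HDF5Array >= 1.13.3, and `stacked` stands for the 512x512x512x1000 DelayedArray carrying the delayed ops):

```r
## Sketch only: "realized.h5"/"stack" are hypothetical names; assumes
## HDF5Array >= 1.13.3. 'stacked' is the 512x512x512x1000 DelayedArray
## that carries the delayed stacking.
library(HDF5Array)

## Realize the stacked result to a fresh HDF5 file...
writeHDF5Array(stacked, "realized.h5", "stack")

## ...then reshape that clean, delayed-op-free dataset into an Nx1000 matrix.
M <- ReshapedHDF5Array("realized.h5", "stack", dim = c(512 * 512 * 512, 1000))
```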

@muschellij2
Author

muschellij2 commented Jul 3, 2019 via email

@hpages
Contributor

hpages commented Jul 3, 2019

Sounds like an interesting project.

Thanks for the offer. I would love to get more involved but I'm overbooked with too many other projects going on at the moment. Also I'll be on vacation from July 10 to July 28 so won't be able to do any significant work on the "delayed reshaping" before August.

One more thing about using an intermediate realization of the big 512x512x512x1000 dataset: in addition to the benefits I mentioned above, this also gives you the opportunity to tweak a few settings like the chunk geometry, level of compression, etc... These physical parameters can have an impact later depending on how you will typically operate on the dataset. For example if the typical access pattern is going to be doing row summarizations on the Nx1000 ReshapedHDF5Matrix object, then choosing chunks that contain full rows might be sensible. This could be achieved with 512x1x1x1000 chunks or any chunk geometry of the form XxYxZx1000. Note that chunks should not be too big (<= 1M array elements) and not too small (>= 10k array elements). Finding the sweet spot might require experimenting a little bit (the exact sweet spot will also depend on the level of compression).

These settings can be controlled thru writeHDF5Array's arguments. See ?writeHDF5Array for the details.
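The chunk-geometry suggestion above could be expressed through writeHDF5Array's chunkdim and level arguments (a sketch; the file name is hypothetical, and 512x1x1x1000 chunks hold 512,000 elements, inside the 10k-1M window mentioned above):

```r
## Sketch of the chunk-geometry suggestion above ("realized.h5" is a
## hypothetical name; 'chunkdim' and 'level' per ?writeHDF5Array).
## 512 x 1 x 1 x 1000 chunks hold full rows of the future Nx1000 matrix.
library(HDF5Array)

writeHDF5Array(stacked, "realized.h5", "stack",
               chunkdim = c(512, 1, 1, 1000),
               level    = 6)   # compression level; the sweet spot needs testing
```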

@muschellij2
Author

muschellij2 commented Jul 3, 2019 via email

@hpages
Contributor

hpages commented Jul 3, 2019

No reason to wait. When reshaping becomes available as a delayed op, all you'll need to do is replace:

A <- ReshapedHDF5Array(filepath, name, dim=c(512*512*512, 1000))

with

A <- HDF5Array(filepath, name)
dim(A) <- c(512*512*512, 1000)

so it'll be a really simple change.

Plus, I'll wait until you get a chance to make these changes before I remove the ReshapedHDF5Array stuff. If I don't manage to implement the delayed reshaping before the next Bioconductor release (in fall), then the ReshapedHDF5Array stuff will have to stick around for at least 6 more months (it will have to go thru a 6-month deprecation cycle before I can remove it).

However the lack of delayed reshaping would be a blocking issue for you if for some reason you didn't want to realize the big intermediate 4D dataset.

@hpages hpages self-assigned this Jul 9, 2019
@hpages
Contributor

hpages commented May 18, 2021

Another request/use case for delayed reshaping here: https://support.bioconductor.org/p/9136602/#9136860 They're also asking if delayed reshaping could increase the number of dimensions e.g. go from 2 x 12 to 2 x 3 x 4.
