Reshaping Array #47
I created a MWE at Bioconductor/HDF5Array#16, which presents a solution, but not a satisfactory one. |
Hi, I've implemented the ReshapedHDF5Array class in HDF5Array 1.13.3 (HDF5Array 1.13.3 belongs to BioC 3.10, which is the current development version of Bioconductor). The ReshapedHDF5Array class is a DelayedArray subclass for representing an HDF5 dataset with a user-supplied upfront virtual reshaping. See ?ReshapedHDF5Array for the details. FWIW I initially tried to implement the reshaping as a delayed operation. This would be a better solution because it would be backend agnostic and the reshaping could then be applied at any time, i.e. even after the DelayedArray object has gone thru some delayed operations like subsetting. H. |
Can't get the install to work. I tried BiocManager::install("HDF5Array", version = "devel", ask = FALSE), but https://github.com/Bioconductor/HDF5Array/blob/master/DESCRIPTION#L18 has a DelayedArray >= 0.11.4 requirement:
#> Bioconductor version 3.10 (BiocManager 1.30.4), R 3.6.0 (2019-04-26)
#> Installing package(s) 'HDF5Array'
#>
#> The downloaded binary packages are in
#> /var/folders/1s/wrtqcpxn685_zk570bnx9_rr0000gr/T//Rtmp5mQ9Od/downloaded_packages
packageVersion("HDF5Array")
#> [1] '1.13.2'
Created on 2019-07-02 by the reprex package (v0.3.0)
Just getting used to the Bioconductor workflow, so I'm likely missing the repo where I could find the updated DelayedArray. |
Oops, my bad sorry. I meant to have this latest version of HDF5Array (1.13.3) depend on DelayedArray >= 0.11.3, not on DelayedArray >= 0.11.4. Sorry for that. (FWIW I have a local version of DelayedArray on my laptop that is at version 0.11.4, with some uncommitted changes to it, hence my confusion.) I just adjusted the requirements in HDF5Array: Bioconductor/HDF5Array@f2f28bc Please try again. BTW only a few minutes after pushing the new ReshapedHDF5Array class to HDF5Array, I realized that it wouldn't actually be so hard to implement the reshaping as a delayed operation. I'd like to do it at some point. Then the ReshapedHDF5Array stuff will no longer be needed so maybe I'll remove it. |
That's great. I will hold off on adding any capabilities on my end then.
John
|
Also, another thing I found: to go from an array to a matrix with 1 column, I needed to append a 1 to the dimensions of the data, e.g. dim(from) = c(dim(from), 1). |
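A minimal base-R illustration of that step (no DelayedArray involved, so just a sketch of the in-memory behavior): appending a 1 to dim() turns a length-N array into an N x 1 matrix without copying the data.

```r
from <- array(1:8, dim = c(8))  # a 1-dimensional array of length 8
dim(from) <- c(dim(from), 1)    # append a trailing 1: now 8 x 1

stopifnot(identical(dim(from), c(8L, 1L)), is.matrix(from))
```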
This 2-step reshaping is a consequence of a restriction on the kind of reshaping supported by the ReshapedHDF5Array() constructor. The restriction is documented in ?ReshapedHDF5Array:
Also please note that arbitrary reshapings are not supported. Only reshapings that reduce the number of dimensions by collapsing a group of consecutive dimensions into a single dimension are supported. For example, reshaping a 10 x 3 x 5 x 1000 array as a 10 x 15 x 1000 array or as a 150 x 1000 matrix is supported.
The important bit here is "Only reshapings that reduce the number of dimensions". So adding dimensions would have to be done as a separate operation (delayed) on the object returned by the ReshapedHDF5Array() constructor. Note that the 2-step reshaping will go away once reshaping is implemented as a delayed operation. BTW I'm curious about what benefit you're hoping for by fully unfolding a 3-dimensional dataset into a 1-dimensional DelayedArray object or into a Nx1 DelayedMatrix object. In particular, I can't think of any DelayedMatrixStats operation that would make sense on this Nx1 matrix. |
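The supported kind of reshaping can be illustrated with plain base R (in memory, so only a sketch of what ReshapedHDF5Array() does virtually on the file): collapsing consecutive dimensions is a pure dim() change because R stores arrays in column-major order, so no data move. A 20-slice 4th dimension stands in for the 1000 of the example.

```r
# Collapsing consecutive dims (10 x 3 x 5 x 20 -> 150 x 20) is just a
# dim() change in column-major order; the underlying data are untouched.
a <- array(seq_len(10 * 3 * 5 * 20), dim = c(10, 3, 5, 20))

m <- a
dim(m) <- c(10 * 3 * 5, 20)   # collapse the first three dims into one

# Element (i, j, k, l) lands at row i + 10*(j-1) + 30*(k-1), column l:
stopifnot(identical(m[2 + 10 * 1 + 30 * 2, 7], a[2, 2, 3, 7]))
stopifnot(identical(as.vector(m), as.vector(a)))
```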
Good point and also good question.
Setup: we have a population of 3D arrays of images of the brain (say 1000). We want to get an average brain/template. The plan is to take 1000 brains in imaging format, put them into NiftiArray (aka HDF5Array with additional info), reshape each of the 1000 NiftiArray objects to an Nx1 image, cbind them together (using acbind), then run DelayedMatrixStats::rowMeans2 or DelayedMatrixStats::rowMedians to get a vector, reshape that to get an array (this can be done in memory), create the population-level image, then write it out in imaging format.
The current memory issue is that one image can be 512x512x512, so N = 134217728, and Nx1000 can be memory prohibitive.
John
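A tiny in-memory base-R sketch of this pipeline (4x4x4 arrays standing in for the 512x512x512 images, 5 of them instead of 1000; the DelayedArray/NiftiArray machinery is elided, so this only shows the shape of the computation):

```r
# Unfold each 3D "brain" to an N x 1 column, cbind, take row means,
# then fold the result back into a 3D template.
set.seed(1)
brains <- replicate(5, array(rnorm(4 * 4 * 4), dim = c(4, 4, 4)),
                    simplify = FALSE)

N <- 4 * 4 * 4
mat <- do.call(cbind, lapply(brains, function(b) { dim(b) <- c(N, 1); b }))

avg <- rowMeans(mat)     # voxel-wise mean across the 5 "brains"
dim(avg) <- c(4, 4, 4)   # fold back into a 3D template

# Each template voxel is the mean of that voxel across brains:
stopifnot(isTRUE(all.equal(avg[1, 2, 3],
                           mean(sapply(brains, function(b) b[1, 2, 3])))))
```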
|
I see. Interesting use case. Note that if the 1000 3D datasets were stored as a single 4D dataset (512x512x512x1000) then things would be much easier. It would just be a matter of creating a ReshapedHDF5Matrix object with something like ReshapedHDF5Array(filepath, name, dim=c(512*512*512, 1000)) and feeding that to DelayedMatrixStats::rowMeans2() or DelayedMatrixStats::rowMedians(). This would avoid the overhead of the 1000 reshapings + final cbind'ing of 1000 objects that you need to do with separate files. I suspect that this overhead is going to introduce some significant slowdown. H. |
I can bind them together (where from is a list of arrays and ndims = 3):
from = lapply(from, function(x) {
  dim(x) = c(dim(x), 1)                          # add a trailing singleton dim
  DelayedArray::aperm(x, perm = (ndims + 1):1)   # move it to the front
})
res = do.call(DelayedArray::arbind, args = from) # bind along the (new) first dim
res = DelayedArray::aperm(res, (ndims + 1):1)    # move it back to the end
which will create a 512x512x512x1000 DelayedArray
John
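For what it's worth, a small base-R check of what the aperm/arbind/aperm trick computes: binding 3D arrays along a new 4th dimension, which in column-major order amounts to concatenating the underlying data (tiny 2x3x4 arrays stand in for the 512x512x512 images):

```r
a1 <- array(1:24, dim = c(2, 3, 4))
a2 <- array(25:48, dim = c(2, 3, 4))

# Stacking along a new last dimension is just concatenation in
# column-major order:
stacked <- array(c(a1, a2), dim = c(2, 3, 4, 2))

stopifnot(identical(stacked[, , , 1], a1), identical(stacked[, , , 2], a2))
```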
|
Something that maybe would help is the ability to "stack" the 1000 3D datasets into a 4D dataset. That would need to be a new array operation so we would need to come up with a new verb for it (I don't think there is any equivalent in base R for doing this on ordinary arrays). This "stacking" is something I actually had in mind for a while but we didn't really have a use case for it (until now) so I lacked the incentive for implementing it. I don't anticipate it would be too hard to implement this "stacking" as a delayed operation though. One easy way would be to add the extra dimension to all the objects to stack (with a delayed dim() modification) and then bind them along that dimension. So for your use case you could stack the 1000 3D datasets into a single 4D dataset (512x512x512x1000), then reshape into a Nx1000 DelayedMatrix (once this reshaping is available as a delayed operation). I believe this could be a little bit more efficient than the ReshapedHDF5Array approach. Not sure, would be interesting to compare. |
I'm just seeing your last post now. Yes, nice trick to use a combination of aperm() and arbind() to do the stacking. |
Also, try to use intermediate realizations to circumvent some of the current limitations. For example after you've stacked the 1000 datasets together, you won't be able to reshape the result into an Nx1000 matrix (because the delayed reshaping is not supported yet) but you can realize it to a new HDF5 file. Then you can use ReshapedHDF5Array() on that new HDF5 file to reshape. You will also pay much less overhead when you call DelayedMatrixStats functions on this object because it doesn't carry any delayed op (except for the reshaping, which can be seen as a seed built-in delayed op). |
I'll take a look at it after the weekend. We don't need it exactly **right
now** but that is the use case we're going to use to convince our imaging R
partners to get on board with the *Array family. Also, the random access
and memory footprint stuff allow us to push into the Shiny domain without
insane lagging for the most part. We're going to write something up in the
coming weeks, and would be more than happy to include you in the effort if
that is of interest.
|
Sounds like an interesting project. Thanks for the offer. I would love to get more involved but I'm overbooked with too many other projects going on at the moment. Also I'll be on vacation from July 10 to July 28 so won't be able to do any significant work on the "delayed reshaping" before August. One more thing about using an intermediate realization of the big 512x512x512x1000 dataset: in addition to the benefits I mentioned above, this also gives you the opportunity to tweak a few settings like the chunk geometry, level of compression, etc. These physical parameters can have an impact later depending on how you will typically operate on the dataset. For example if the typical access pattern is going to be doing row summarizations on the Nx1000 ReshapedHDF5Matrix object, then choosing chunks that contain full rows might be sensible. This could be achieved with 512x1x1x1000 chunks or any chunk geometry of the form XxYxZx1000. Note that chunks should not be too big (<= 1M array elements) and not too small (>= 10k array elements). Finding the sweet spot might require experimenting a little bit (the exact sweet spot will also depend on the level of compression). These settings can be controlled thru writeHDF5Array's arguments. See ?writeHDF5Array for the details. |
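As a quick arithmetic check on the chunk-geometry advice above (a sketch, not a benchmark): a 512x1x1x1000 chunk holds complete rows of the unfolded Nx1000 matrix, and its element count sits inside the suggested 10k to 1M window.

```r
chunk_elems <- 512 * 1 * 1 * 1000   # elements in one 512 x 1 x 1 x 1000 chunk
stopifnot(chunk_elems == 512000, chunk_elems >= 1e4, chunk_elems <= 1e6)

chunk_bytes <- chunk_elems * 8      # ~4 MB of raw doubles before compression
```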
Got it. So should I rely on ReshapedHDF5Matrix and ReshapedHDF5Array staying around for the foreseeable future, or should I wait until the reshaping is available as a delayed op?
John
|
No reason to wait. When reshaping becomes available as a delayed op, all you'll need to do is replace:
with
so it'll be a really simple change. Plus, I'll wait until you get a chance to make these changes before I remove the ReshapedHDF5Array stuff. If I don't manage to implement the delayed reshaping before the next Bioconductor release (in fall), then the ReshapedHDF5Array stuff will have to stick around for at least 6 more months (it will have to go thru a 6-month deprecation cycle before I can remove it). However, the lack of delayed reshaping would be a blocking issue for you if for some reason you didn't want to realize the big intermediate 4D dataset. |
Another request/use case for delayed reshaping here: https://support.bioconductor.org/p/9136602/#9136860 They're also asking if delayed reshaping could increase the number of dimensions e.g. go from 2 x 12 to 2 x 3 x 4. |
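In base-R terms (in memory, so just a sketch of the requested delayed behavior), going from 2 x 12 to 2 x 3 x 4 is again only a dim() change, but one that increases the number of dimensions, which is exactly what the current collapsing-only restriction rules out:

```r
m <- matrix(1:24, nrow = 2)   # 2 x 12
a <- m
dim(a) <- c(2, 3, 4)          # same data, now with an extra dimension

# Column j of m maps to (j mod 3, j %/% 3) slices of a; spot-check one cell:
stopifnot(identical(a[2, 3, 1], m[2, 3]))
```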
First, thanks for the package. It's exactly what we needed. We're working on neuroimaging data: https://github.com/muschellij2/niftiArray. We're trying to do statistics over multiple 3D arrays. I believe this doesn't fit well with DelayedMatrixStats as matrixStats requires 2D matrices.
Similar to #30 and the comment here #30 (comment), I was wondering about "reshaping" an array. Really, I want to be able to vectorize a DelayedArray or convert a 4D DelayedArray to a 2D matrix where the first 3 dimensions are the rows of the matrix and the 4th dimension is the columns. I can realize the array, convert to a matrix, and then create a DelayedArray or DelayedMatrix for the current solution.
Ideally, I'd be able to do this without realizing the array into memory. I'm not sure if HDF5 (the backend we're using) would support this, but we would find it useful. Trying to set dim or change the dimensions fails due to the (correct) checks on DelayedMatrix. Just wondering if it's possible, if it makes sense with HDF5, and if both are true, whether it would be possible to implement. I can send a reprex if necessary, but the setting of dimensions is well documented.