insert method #47

Open
asmeurer opened this issue Apr 23, 2020 · 4 comments

Comments

@asmeurer
Collaborator

It could be useful to have an insert method on the dataset objects, since we can do inserts more efficiently than the naive way, by only reading in the chunks that are actually going to change.
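To make the comparison concrete, here is a rough sketch of the naive approach an insert method could improve on (the naive_insert helper and its exact signature are hypothetical, not part of the library):

import numpy as np

def naive_insert(ds, index, values):
    """Insert by reading and rewriting the entire dataset (1-D case)."""
    data = ds[:]                          # reads every chunk
    new = np.insert(data, index, values)  # rebuilds the whole array in memory
    ds.resize(new.shape)
    ds[:] = new                           # rewrites every chunk

# A dedicated ds.insert(index, values) could instead read and rewrite only
# the chunks at or after `index` that actually change.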

@ArvidJB
Collaborator

ArvidJB commented Sep 23, 2020

I have been thinking about this a bit.
For inserts that align with the chunk boundaries we can reuse chunks because they are content-addressed. As an example consider chunk sizes of 1000 and a dataset with chunks of
[slice(0,1000), slice(1000,2000), ... slice(5000, 5678)]
If an insert inserts exactly 1000 contiguous elements, i.e.,

old_len = len(ds)
ds.resize((old_len + 1000,))
ds[2234:] = ds[1234:old_len]   # shift everything from 1234 onward up by 1000
ds[1234:2234] = ...

then all the chunks which are just shifted by 1000 can be reused.
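A toy simulation of that reuse (this is not the library's actual hashing code, just an illustration of the content-addressing argument):

import hashlib
import numpy as np

CHUNK = 1000
rng = np.random.default_rng(0)
data = rng.integers(0, 100, size=5678)

# Insert 1000 contiguous elements at index 1234.
new_data = np.insert(data, 1234, rng.integers(0, 100, size=1000))

def chunk_hashes(a):
    return [hashlib.sha256(a[i:i + CHUNK].tobytes()).hexdigest()
            for i in range(0, len(a), CHUNK)]

old_hashes = set(chunk_hashes(data))
# Only the two chunks that actually contain new elements hash to something
# new; every other chunk matches an existing raw chunk and can be reused.
print([h in old_hashes for h in chunk_hashes(new_data)])
# [True, False, False, True, True, True, True]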

But I think we could do the same thing with inserts that do not align to the chunk size if those chunks are stored contiguously in raw_data. How much code is there which relies on reading from raw_data only when chunk aligned?

@asmeurer
Collaborator Author

So if I understand correctly, the idea is to insert full chunks, which may not be fully used. For example, if we insert 234 elements at index 1234, the chunks in ds would be [0, 1000], [1000, 1234], [1234, 2234], [2234, 3234], [3234, 4234], [4234, 5234], [5234, 5912] (each represents a half-open slice). Here the chunks [1000, 1234] and [5234, 5912] are not full. Right now, the code makes the assumption that only the last chunk is not full.

I think we might be able to support this. Now that we have ndindex, it should be easier to do some of the slice calculations on offset chunks like this. But it still opens a question: after doing many inserts like this, there will be a lot of empty space in the raw data blocks. This is particularly exacerbated for multidimensional datasets. Currently, for a multidimensional dataset any "edge" chunk can be nonfull. So for instance, if your chunk size is (10, 10, 10) and the data shape is (35, 35, 35), there will be 4*4*4 = 64 chunks, and 37 of those will be smaller than the full 10*10*10 = 1000. The fraction goes down as the shape grows relative to the chunk size, or if the shape is a multiple of the chunk size in a dimension. However, if we start doing this insert trick, there will be entire slices of nonfull chunks. If the chunk size is (a, b, c) and you insert along the last axis, there will be a*b new nonfull chunks.
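A quick check of those counts (just the arithmetic from the example above, nothing library-specific):

import math

shape, chunks = (35, 35, 35), (10, 10, 10)

n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
# A chunk is full only if it does not touch any edge of the dataset.
n_full = math.prod(s // c for s, c in zip(shape, chunks))
print(n_chunks, n_chunks - n_full)   # 64 37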

But perhaps with compression enabled, this is not a real issue? I'm unsure how compression works in HDF5, so we would need to test this.

There's also the question of performance. The time to read or write a subset of a dataset will scale with the number of chunks that the corresponding subslice touches. Presently, this is easy to estimate because you can just look at the chunk size and the slice. For example, you know with chunk size (10, 10) that a slice like a[0:15, 0:15] will read from 4 chunks. With nonaligned chunks, this could be more than the slice alone would suggest.
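For the aligned case that estimate is just a per-axis calculation, something like this hypothetical helper (not code from this library):

def chunks_touched(slices, chunk_shape):
    """How many chunks a slice touches, assuming chunks on the regular aligned grid."""
    total = 1
    for s, c in zip(slices, chunk_shape):
        total *= (s.stop - 1) // c - s.start // c + 1
    return total

print(chunks_touched((slice(0, 15), slice(0, 15)), (10, 10)))   # 4
# With chunks offset from the aligned grid (as after an unaligned insert),
# the same slice can touch more chunks than this predicts.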

@ArvidJB
Collaborator

ArvidJB commented Sep 29, 2020

I think the virtual chunks could still be of full size (except for the last).
Here's an example of what I had in mind. Assume "v0" maps

  1. virtual to raw chunk:
    [slice(0,1000), slice(1000,2000), ... slice(5000, 5678)]
    and the raw dataset is of size 6000.
  2. Now we insert 123 elements at 1234. I think we can handle that by storing the updated chunk slice(1000,2000) (which now contains the 123 new elements) at raw slice(6000,7000). And we could now store a virtual dataset of
    [slice(0,1000), slice(6000,7000), slice(1877,2877), slice(2877,3877), ..., slice(4877,5678)]

This relies on the fact that in this case the raw chunks happen to be contiguous because of the original write.
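Here is a rough sketch of that one-dimensional remapping (the helper and its name are hypothetical; it assumes, as above, that old virtual element i lives at raw offset i):

CHUNK = 1000

def remap_after_insert(old_len, raw_len, insert_at, n_new):
    """Virtual-to-raw slices after inserting n_new elements at insert_at."""
    new_len = old_len + n_new
    mapping = []
    for start in range(0, new_len, CHUNK):
        stop = min(start + CHUNK, new_len)
        if stop <= insert_at:
            # Entirely before the insert: unchanged and still chunk aligned.
            mapping.append(slice(start, stop))
        elif start < insert_at + n_new:
            # Contains new elements: written out as a brand new raw chunk
            # appended at the end of raw_data.
            mapping.append(slice(raw_len, raw_len + (stop - start)))
            raw_len += CHUNK
        else:
            # Entirely after the insert: old data shifted by n_new, mapping to
            # a contiguous but no longer chunk-aligned raw slice.
            mapping.append(slice(start - n_new, stop - n_new))
    return mapping

print(remap_after_insert(5678, 6000, 1234, 123))
# [slice(0, 1000), slice(6000, 7000), slice(1877, 2877), slice(2877, 3877),
#  slice(3877, 4877), slice(4877, 5678)]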

PS: I was only thinking about one dimension; I don't think this generalizes to more dimensions?

@asmeurer
Collaborator Author

So with that, things would no longer be chunk aligned in the raw dataset. I think we could do this. But we originally made the data chunks match HDF5 chunks so that we would get the best performance. If we implement something like this, we would need to see how it would affect performance.

@ericdatakelly ericdatakelly added this to the May 2021 milestone Apr 22, 2021
@ericdatakelly ericdatakelly assigned asmeurer and unassigned asmeurer May 3, 2021
@ericdatakelly ericdatakelly modified the milestones: May 2021, June 2021 May 10, 2021
@ericdatakelly ericdatakelly modified the milestones: May 2021, June 2021 Jun 7, 2021
@ericdatakelly ericdatakelly modified the milestones: June 2021, July 2021 Jul 8, 2021
@ericdatakelly ericdatakelly modified the milestones: July 2021, August 2021 Aug 12, 2021
@ericdatakelly ericdatakelly modified the milestones: August 2021, September 2021 Sep 29, 2021