insert method #47

Open
asmeurer opened this issue Apr 23, 2020 · 4 comments

Comments

@asmeurer
Collaborator

It could be useful to have an insert method on the dataset objects, since we can do inserts more efficiently than the naive way, by only reading in the chunks that are actually going to change.
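To make the comparison concrete, here is a rough sketch of the naive approach an insert method could improve on (the naive_insert helper and its exact signature are hypothetical, not part of the library):

import numpy as np

def naive_insert(ds, index, values):
    """Insert by reading and rewriting the entire dataset (1-D case)."""
    data = ds[:]                          # reads every chunk
    new = np.insert(data, index, values)  # rebuilds the whole array in memory
    ds.resize(new.shape)
    ds[:] = new                           # rewrites every chunk

# A dedicated ds.insert(index, values) could instead read and rewrite only
# the chunks at or after `index` that actually change.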

@ArvidJB
Collaborator

ArvidJB commented Sep 23, 2020

I have been thinking about this a bit.
For inserts that align with the chunk boundaries we can reuse chunks because they are content-addressed. As an example consider chunk sizes of 1000 and a dataset with chunks of
[slice(0,1000), slice(1000,2000), ... slice(5000, 5678)]
If an insert inserts exactly 1000 contiguous elements, i.e.,

old_len = len(ds)
ds.resize((old_len + 1000,))
ds[2234:] = ds[1234:old_len]   # shift everything from 1234 onward up by 1000
ds[1234:2234] = ...

then all the chunks which are just shifted by 1000 can be reused.
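A toy simulation of that reuse (this is not the library's actual hashing code, just an illustration of the content-addressing argument):

import hashlib
import numpy as np

CHUNK = 1000
rng = np.random.default_rng(0)
data = rng.integers(0, 100, size=5678)

# Insert 1000 contiguous elements at index 1234.
new_data = np.insert(data, 1234, rng.integers(0, 100, size=1000))

def chunk_hashes(a):
    return [hashlib.sha256(a[i:i + CHUNK].tobytes()).hexdigest()
            for i in range(0, len(a), CHUNK)]

old_hashes = set(chunk_hashes(data))
# Only the two chunks that actually contain new elements hash to something
# new; every other chunk matches an existing raw chunk and can be reused.
print([h in old_hashes for h in chunk_hashes(new_data)])
# [True, False, False, True, True, True, True]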

But I think we could do the same thing with inserts that do not align to the chunk size if those chunks are stored contiguously in raw_data. How much code is there which relies on reading from raw_data only when chunk aligned?

@asmeurer
Collaborator Author

So if I understand correctly, the idea is to insert full chunks, which may not be fully used. For example, if we insert 234 elements at index 1234, the chunks in ds would be [0, 1000], [1000, 1234], [1234, 2234], [2234, 3234], [3234, 4234], [4234, 5234], [5234, 5912] (each represents a half-open slice). Here the chunks [1000, 1234] and [5234, 5912] are not full. Right now, the code makes the assumption that only the last chunk is not full.

I think we might be able to support this. Now that we have ndindex, it should be easier to do some of the slice calculations on offset chunks like this. But it still opens a question: after doing many inserts like this, there will be a lot of empty space in the raw data blocks. This is particularly exacerbated for multidimensional datasets. Currently, for a multidimensional dataset any "edge" chunk can be nonfull. So for instance, if your chunk size is (10, 10, 10) and the data shape is (35, 35, 35), there will be 4*4*4 = 64 chunks, and 37 of those will be smaller than the full 10*10*10 = 1000. The fraction goes down as the shape grows relative to the chunk size, or if the shape is a multiple of the chunk size in a dimension. However, if we start doing this insert trick, there will be entire slices of nonfull chunks. If the chunk size is (a, b, c) and you insert along the last axis, there will be a*b new nonfull chunks.
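A quick check of those counts (just the arithmetic from the example above, nothing library-specific):

import math

shape, chunks = (35, 35, 35), (10, 10, 10)

n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
# A chunk is full only if it does not touch any edge of the dataset.
n_full = math.prod(s // c for s, c in zip(shape, chunks))
print(n_chunks, n_chunks - n_full)   # 64 37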

But perhaps with compression enabled, this is not a real issue? I'm unsure how compression works in HDF5, so we would need to test this.

There's also the question of performance. The time to read or write a subset of a dataset will scale with the number of chunks that the corresponding subslice touches. Presently, this is easy to estimate because you can just look at the chunk size and the slice. For example, you know with chunk size (10, 10) that a slice like a[0:15, 0:15] will read from 4 chunks. With nonaligned chunks, this could be more than the slice alone would suggest.
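For the aligned case that estimate is just a per-axis calculation, something like this hypothetical helper (not code from this library):

def chunks_touched(slices, chunk_shape):
    """How many chunks a slice touches, assuming chunks on the regular aligned grid."""
    total = 1
    for s, c in zip(slices, chunk_shape):
        total *= (s.stop - 1) // c - s.start // c + 1
    return total

print(chunks_touched((slice(0, 15), slice(0, 15)), (10, 10)))   # 4
# With chunks offset from the aligned grid (as after an unaligned insert),
# the same slice can touch more chunks than this predicts.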

@ArvidJB
Collaborator

ArvidJB commented Sep 29, 2020

I think the virtual chunks could still be of full size (except for the last).
Here's an example of what I had in mind. Assume "v0" maps

  1. virtual to raw chunk:
    [slice(0,1000), slice(1000,2000), ... slice(5000, 5678)]
    and the raw dataset is of size 6000.
  2. Now we insert 123 elements at 1234. I think we can handle that by storing the updated chunk slice(1000,2000) (which now contains the 123 new elements) at raw slice(6000,7000). And we could now store a virtual dataset of
    [slice(0,1000), slice(6000,7000), slice(1877,2877), slice(2877,3877), ..., slice(4877,5678)]

This relies on the fact that in this case the raw chunks happen to be contiguous because of the original write.
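Here is a rough sketch of that one-dimensional remapping (the helper and its name are hypothetical; it assumes, as above, that old virtual element i lives at raw offset i):

CHUNK = 1000

def remap_after_insert(old_len, raw_len, insert_at, n_new):
    """Virtual-to-raw slices after inserting n_new elements at insert_at."""
    new_len = old_len + n_new
    mapping = []
    for start in range(0, new_len, CHUNK):
        stop = min(start + CHUNK, new_len)
        if stop <= insert_at:
            # Entirely before the insert: unchanged and still chunk aligned.
            mapping.append(slice(start, stop))
        elif start < insert_at + n_new:
            # Contains new elements: written out as a brand new raw chunk
            # appended at the end of raw_data.
            mapping.append(slice(raw_len, raw_len + (stop - start)))
            raw_len += CHUNK
        else:
            # Entirely after the insert: old data shifted by n_new, mapping to
            # a contiguous but no longer chunk-aligned raw slice.
            mapping.append(slice(start - n_new, stop - n_new))
    return mapping

print(remap_after_insert(5678, 6000, 1234, 123))
# [slice(0, 1000), slice(6000, 7000), slice(1877, 2877), slice(2877, 3877),
#  slice(3877, 4877), slice(4877, 5678)]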

PS: I was only thinking about one dimension; I don't think this generalizes to more dimensions?

@asmeurer
Collaborator Author

So with that, things would no longer be chunk aligned in the raw dataset. I think we could do this. But we originally made the data chunks match HDF5 chunks so that we would get the best performance. If we implement something like this, we would need to see how it would affect performance.

@ericdatakelly ericdatakelly added this to the May 2021 milestone Apr 22, 2021
@ericdatakelly ericdatakelly assigned asmeurer and unassigned asmeurer May 3, 2021
@ericdatakelly ericdatakelly modified the milestones: May 2021, June 2021 May 10, 2021
@ericdatakelly ericdatakelly modified the milestones: May 2021, June 2021 Jun 7, 2021
@ericdatakelly ericdatakelly modified the milestones: June 2021, July 2021 Jul 8, 2021
@ericdatakelly ericdatakelly modified the milestones: July 2021, August 2021 Aug 12, 2021
@ericdatakelly ericdatakelly modified the milestones: August 2021, September 2021 Sep 29, 2021