Skip to content

Conversation

@uzadude
Copy link
Collaborator

@uzadude uzadude commented Jan 4, 2023

Summary

In #71 we changed the block order. while checking the full iteration on Object Store (S3), we saw it is amazingly slow.
What happened is that it switched to Random IO mode:
.. 15:42:50,084 INFO fs.s3a.S3AInputStream - Switching to Random IO seek policy

Detailed Description

After debugging it looks like it happened because we try to seek back compared to the "natural" readahead buffer.

Luckily, now that we changed the block order we can just read the file sequentially without hopping at all. So just removing the seeking part works!

How was it tested?

regular unit tests

@uzadude uzadude changed the title No need for caching No need for caching in sorted-iterator Jan 4, 2023
@uzadude uzadude requested a review from shay1bz January 4, 2023 14:56
@shay1bz
Copy link
Collaborator

shay1bz commented Jan 4, 2023

What... How is that possible? When you reach a leaf block, you have to go back and seek elsewhere, no?

@uzadude
Copy link
Collaborator Author

uzadude commented Jan 4, 2023

I was also surprised that it just worked, but what happens is just that the parent blocks are saved in memory, and every time you need to read the next block from the file it is always the "correct" block as we saved it pre-order.

@shay1bz
Copy link
Collaborator

shay1bz commented Jan 5, 2023

Wow

@shay1bz
Copy link
Collaborator

shay1bz commented Jan 5, 2023

Are the parents explicitly cached by us? Do we make sure to uncache them? Just making sure there is no memory leak or something

@uzadude
Copy link
Collaborator Author

uzadude commented Jan 5, 2023

good point.
I actually thought about that and forgot to check. but it looks like we're good.
the sorted-iterator only has one pointer to Node next and the Node object only has one pointer to parent. So we will maximum log(n) nodes in memory.

BTW, I also need to verify this "Random IO" in the "search" iterator (get()).

@uzadude
Copy link
Collaborator Author

uzadude commented Jan 12, 2023

ok.. so after looking at get(key) , I saw that we can hop backwards there if we were calling get() multiple times.
So I added assertion to allow only bigger keys, and added functionality to reuse the current state.

@uzadude uzadude force-pushed the cache2 branch 2 times, most recently from 097d0a7 to 207040e Compare January 12, 2023 10:34
@eyala
Copy link
Collaborator

eyala commented May 1, 2023

Is this ready for release? Is it good for all three of our versions?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants