add option to consolidate a dataset #29

Open
banteg opened this issue Aug 16, 2023 · 1 comment

Comments

@banteg
Contributor

banteg commented Aug 16, 2023

a freshly collected contracts dataset out of cryo is 15.53 gb. if you consolidate 17,920 files into 17 files, it would become 7.88 gb, providing 2x savings on storage and 3x improvement in query performance.

i propose to add an option for cryo to incrementally consolidate the collected datasets, merging parts as soon as they form a larger chunk that won't require any further rewriting.

an example of how it could work with --align option:

  • 0-17,000,000 block range is consolidated into files of 1,000,000 blocks each
  • 17,000,000-17,900,000 range is consolidated into files of 100,000 blocks each
  • 17,900,000-17,920,000 range is consolidated into files of 10,000 blocks each
  • 17,920,000-17,924,000 range is kept as collected with 1,000 blocks in each file
  • if we run again after block 17,930,000, blocks 17,920,000-17,930,000 would be consolidated into a bigger file
  • the same would happen after block 18,000,000 with the chunks for blocks 17,900,000-18,000,000
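
the layout above can be sketched as a greedy plan: at each position, take the largest chunk that is aligned to its own size and already fully collected. this is only a rough illustration, not cryo's actual implementation. `plan_chunks`, `CHUNK_SIZES` and the half-open `(start, end)` ranges are hypothetical names and conventions chosen for the sketch, and the chunk sizes are simply the ones from the example.

```python
from collections import Counter

# chunk sizes taken from the example above, largest first (an assumption,
# not a cryo setting)
CHUNK_SIZES = (1_000_000, 100_000, 10_000, 1_000)


def plan_chunks(head, sizes=CHUNK_SIZES):
    """Return half-open (start, end) block ranges covering [0, head).

    `head` is the first block that has not been collected yet.
    """
    chunks = []
    pos = 0
    while pos < head:
        for size in sizes:
            # a chunk is final once it is aligned and fully collected
            if pos % size == 0 and pos + size <= head:
                chunks.append((pos, pos + size))
                pos += size
                break
        else:
            # the remainder is smaller than the smallest chunk, leave it alone
            break
    return chunks


plan = plan_chunks(17_924_000)
# number of files per chunk size
print(Counter(end - start for start, end in plan))
```

for a head of 17,924,000 this yields 17 + 9 + 2 + 4 = 32 files, matching the ranges listed above. calling it again with 17,930,000 replaces the 1,000-block files for 17,920,000-17,930,000 with a single 10,000-block chunk. an actual consolidation step would then merge the existing files that fall inside each planned range.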

adopting this approach would allow a set-and-forget cron job with cryo <dataset> --align --consolidate for a researcher to always come back to a fresh and performant dataset.

banteg changed the title from "an option to consolidate a dataset" to "add option to consolidate a dataset" Aug 16, 2023
@banteg
Contributor Author

banteg commented Aug 24, 2023

implemented the logic i described here
https://github.com/banteg/cryogen
