benchmarking index in different dbs #8
base: master
Conversation
Codecov Report
@@ Coverage Diff @@
## master #8 +/- ##
=======================================
Coverage 67.15% 67.15%
=======================================
Files 16 16
Lines 813 813
=======================================
Hits 546 546
Misses 267 267
@magnusuMET: I think that if we are to use a distributed database, or even sqlite, we have to fetch only the relevant datasets. And we might have to store metadata closer (currently metadata is in memory, but that requires reading files on startup). The benchmarks are somewhat flawed because several of them rely on memory-mapped and cached variables, so they represent best-case scenarios. The difference between the serialized and read-only versions of the benchmarks is not real for e.g. redis, and is a cache artifact: if one is turned off, the other goes up in time.
I would expect the index for a specific file to be created on demand. We could run a caching step on startup to ensure the most popular datasets are always available. Could this be a direction worth pursuing?
Yeah, I guess that could work as well. It would be a critical point to get right for latency and performance, depending on usage. I was hoping to keep the server logic as minimal as possible, but I don't think this will make much difference: it still needs to check the modified time against the file. One question is how to share the list of datasets and their locations between instances.
It takes about 29 ms to index a 1.5 GB file, but that depends on how fast the disk is, the disk cache, etc.
We still need to scan aggregated files, and also generate the DAS and DDS. I am not sure how long those take, but they rely on rust-hdf5, so they are not multi-threaded and can't be run regularly.
I think we need to separate this into two parts: one concerning efficient serving of the data, and the other concerning discovery.

When serving a dataset, one should try to fetch the index from a db, fall back to disk, check the mtime of the file against the cached entry, and possibly update the cache for that dataset (expensive, blocking). Here a distributed in-memory (async?) db might be useful to avoid blocking on chunk indexing, although the db does not need to be persistent or up to date.

Discovery needs to keep an up-to-date view of the filesystem to add or remove entries on the entry page. For this we could spawn a background thread which checks for changes every couple of minutes; maybe this step should also create the DAS and DDS for all datasets?
Yes, I think you are right. I have been thinking about splitting discovery out of the data server, into either a separate scraper or even a different tool that can be used to insert or update new datasets. At the moment there is no discovery; a previous version used inotify, but that doesn't work on NFS and is maybe not very reliable with large numbers of files. I think this will make it easier to insert datasets that are stored in an object store and have different ways of being discovered, or maybe even need to be registered. At the risk of making things so complicated we never reach a functional level:
Maybe we should have a quick chat about this at some point? It will require some restructuring of dars datasets.
using sled (my laptop, currently used in dars):
So the main cost is deserializing; reading is about 8 µs. That gives us some margin, since the majority of the time is currently bound by deserialization.
One potentially useful optimization is to deserialize only the necessary datasets, but that requires the datasets to be split into different db entities. Let's start with a binary blob and index the lookup key, though. Another interesting option is binary-layout, for lower deserialization costs.