benchmarking index in different dbs #8
base: master
Conversation
Codecov Report
@@ Coverage Diff @@
## master #8 +/- ##
=======================================
Coverage 67.15% 67.15%
=======================================
Files 16 16
Lines 813 813
=======================================
Hits 546 546
Misses 267 267
@magnusuMET: I think that if we are to use a distributed database, or even sqlite, we have to fetch only the relevant datasets. And we might have to store metadata closer (currently metadata is in memory, but that requires reading files on startup). The benchmarks are somewhat flawed because several of them rely on memory-mapped and cached variables, so they represent best-case scenarios. The difference between the serialized and read-only versions of the benchmarks is not real for e.g. redis, and is a cache artifact: if one is turned off, the other goes up in time.
I would expect the index for a specific file to be created on demand. We could run a caching step on startup to ensure the most popular datasets are always available. Could this be a direction worth pursuing?
Yeah, I guess that could work as well. It would be a critical point to get right for latency and performance, depending on usage. I was hoping to keep the server logic as minimal as possible, but I don't think this will make much difference: it still needs to check the modified time against the file. One question is how to share the list of datasets and their locations between instances.
It takes about 29 ms to index a 1.5 GB file, but that depends on how fast the disk is, the disk cache, etc.
We still need to scan aggregated files, and also generate the DAS and DDS. I am not sure how long those take, but they rely on rust-hdf5, so they are not multi-threaded and can't be run regularly.
I think we need to separate this into two parts: one concerning efficient serving of the data, and the other concerning discovery.

When serving a dataset, one should try to fetch the index from a db, fall back to disk, check the mtime of the file against the cached entry, and possibly update the cache for that dataset (expensive, blocking). Here a distributed in-memory (async?) db might be useful to avoid blocking on chunk indexing, although the db does not need to be persistent or up to date.

Discovery needs to keep an up-to-date view of the filesystem to add or remove entries on the entry page. For this we could spawn a background thread which checks for changes every couple of minutes; maybe this step should also create the DAS and DDS for all datasets?
Yes, I think you are right. I have been thinking about splitting discovery out of the data server, into either a separate scraper or even a different tool that can be used to insert or update new datasets. At the moment there is no discovery; a previous version used inotify, but that doesn't work on NFS and is maybe not very reliable with large numbers of files. I think this will make it easier to insert datasets that are stored in an object store and have different ways of being discovered, or maybe even need to be registered. At the risk of making things so complicated we never reach a functional level:
Maybe we should have a quick chat about this at some point? It will require some restructuring of dars datasets.
using sled (my laptop, currently used in dars):
So the main cost is deserializing; reading is about 8 µs. That gives us some margin, since the majority of the time is currently bound by deserialization.
One potentially useful optimization is to deserialize only the necessary datasets, but that requires the datasets to be split into different db entities. Let's start with a binary blob and index the lookup key, though. Another interesting option is binary-layout, for lower deserialization costs.