Improvements to the LRU cache #2500
base: main
Conversation
```rust
map_find_key_values: BTreeMap<Vec<u8>, Vec<(Vec<u8>, Vec<u8>)>>,
map_find_keys: BTreeMap<Vec<u8>, Vec<Vec<u8>>>,
```
I'm not sure it's a good idea to cache these, especially as single entries, since they can contain any number of keys or key-value pairs, so their size is unbounded.
We could not cache the `find_` results at all.
Or we could only store the fact that `find_keys` or `find_key_values` was called with a particular prefix at all, and then cache the actual entries in the other map(s). Whenever an entry belonging to a prefix falls out of the LRU cache or gets deleted, that prefix would have to be cleared here, too.
The size argument applies just as well to the `read_value` LRU cache. Sure, DynamoDb has some limit like 400K, but with the value splitting that limit can be exceeded.
I added a check so that if the cache entry being inserted is larger than the maximal size, then it is refused, since inserting it would lead to a total clearing of the cache.
@MathieuDutSik @afck Let's keep the discussion going and make sure that we go for the best solution here.
In my view the current approach is still not ideal, because:
- It potentially duplicates entries, if the same value occurs in both a "find" and a "value" result.
- It creates weird semantics for moving "find" results to the front of a queue based on a "value" cache hit.
I think it would be better to do something like this (a rough sketch follows after this list):
- Whenever we handle a "find" query we put all found entries into the cache as individual "value" entries, and in addition make a note that the cache now contains everything in that range.
- Whenever an entry falls out of the cache we remove (or even better: split or shrink) any such range entry, because now the cache doesn't contain everything under it anymore.
- If we get a "find" query for a range contained in an up-to-date one, we can bump all entries in the smaller range to the front of the queue, i.e. only those that we actually return.
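For illustration, a minimal sketch of that alternative; the type and method names (`ValueRangeCache`, `insert_find_result`, etc.) are hypothetical and not part of this PR:

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical sketch: individual values plus "complete prefix" markers.
struct ValueRangeCache {
    /// Every cached key/value pair, one entry per key.
    values: BTreeMap<Vec<u8>, Vec<u8>>,
    /// Prefixes for which `values` is known to contain all existing keys.
    complete_prefixes: BTreeSet<Vec<u8>>,
}

impl ValueRangeCache {
    /// After a "find" query hit the backend: cache each pair individually
    /// and note that the whole prefix is now covered.
    fn insert_find_result(&mut self, prefix: Vec<u8>, found: Vec<(Vec<u8>, Vec<u8>)>) {
        for (key, value) in found {
            self.values.insert(key, value);
        }
        self.complete_prefixes.insert(prefix);
    }

    /// When `key` falls out of the LRU queue (or is deleted), no prefix of it
    /// can claim to be complete anymore. Splitting or shrinking the marker
    /// instead of removing it would be a refinement of this.
    fn invalidate_prefixes_of(&mut self, key: &[u8]) {
        self.complete_prefixes.retain(|prefix| !key.starts_with(prefix));
    }

    /// A "find" query for `prefix` can be answered from the cache if some
    /// complete prefix is a prefix of the queried one.
    fn covers(&self, prefix: &[u8]) -> bool {
        self.complete_prefixes
            .iter()
            .any(|complete| prefix.starts_with(complete))
    }
}
```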
```rust
if let Some(key_prefix) = lower_bound {
    self.map_find_keys.remove(&key_prefix);
    let full_prefix = (key_prefix, CacheEntry::FindKeys);
    self.queue.remove(&full_prefix);
```
Why do we have to remove this entry? If we read a value in that range, it doesn't make the result invalid, does it?
This has been changed.
The issue was that this was an update to the LRU cache after the write to the cache.
Thanks for looking into this, but the test plan should include empirical evidence that this PR makes things faster, not slower. Also, how do we know this is low-hanging fruit?
Data will be provided about the PR. I think it is a low-hanging fruit because
Force-pushed from d52190b to 1ce99b6.
```rust
        self.remove_from_queue(&full_prefix);
    }
    btree_map::Entry::Vacant(entry) => {
        entry.insert(cache_entry);
    }
```
If I'm reading this correctly, we are not removing any `map_value` entries that are redundant now? So those would be duplicated in memory.
On the other hand, if we do delete them, their lifetime in the cache is now tied to the lifetime of the whole collection.
This is why I'm not convinced by the whole idea of having entire `find_key_values` results as a single entry in the LRU cache.
```rust
        return cache_entry.get_read_value(key_red);
    }
    None
}
```
Shouldn't this bump the entry back to the front of the queue?
Yes, and it has been corrected.
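For reference, bumping on a cache hit could look roughly like this; the queue representation below (a map of keys to "last used" counters) is a hypothetical sketch, not the PR's actual `queue` type:

```rust
use std::collections::BTreeMap;

/// Sketch of an LRU queue: each cached key maps to a monotonically increasing
/// "last used" stamp; the entry with the smallest stamp is the eviction candidate.
struct LruQueue {
    last_used: BTreeMap<Vec<u8>, u64>,
    clock: u64,
}

impl LruQueue {
    /// Move `key` to the most recently used position on a cache hit.
    fn bump(&mut self, key: &[u8]) {
        if let Some(stamp) = self.last_used.get_mut(key) {
            self.clock += 1;
            *stamp = self.clock;
        }
    }

    /// The least recently used key, i.e. the next eviction candidate.
    fn least_recently_used(&self) -> Option<&Vec<u8>> {
        self.last_used
            .iter()
            .min_by_key(|(_, stamp)| **stamp)
            .map(|(key, _)| key)
    }
}
```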
```rust
if cache_size > self.lru_cache_policy.max_cache_size {
    // Inserting that entry would lead to complete clearing of the cache
    // which is counter productive
    return;
}
```
Isn't it common practice, in general, not to cache objects that are too big? IMHO, no cached item should be bigger than roughly 10% (being generous) of the total size of the cache. I don't think only checking for objects that are bigger than the full size of the cache is enough.
If the objects happen to be too big for the size of the cache (even if they're smaller than the max cache size), in the worst case the cache will be useless, as every new insertion will basically evict the entire existing cache.
And do we know how big what we're caching here will generally be? It might be worth having a metric for it, so we can adjust the size of the cache accordingly if needed.
There are a number of heuristics for that. As I said, there is a huge literature on caching.
I could introduce the 10% rule, but the problem is that this becomes a parameter to add to the function call.
Adding the metrics is no problem, though.
Why would it become a parameter to add to the function call? Can't you just add something like a `max_entry_size_pct` (with a comment explaining what it is) to the cache policy?
Also, even if it meant adding a parameter, I think it's better to add a parameter than to have a cache that might not add any value in the worst case.
I added a control value `max_entry_size` in the code.
Force-pushed from 5ad5bc2 to 109e666.
Force-pushed from 8df2a72 to 09b5155.
Force-pushed from b6d9c7b to 02b0aff.
```rust
/// The maximal number of entries in the storage cache.
#[arg(long, default_value = "1000")]
cache_size: usize,
/// The storage cache policy
```
This will generate a CLI without any helpful tips about what values are acceptable here. We should:
- improve the doc,
- and/or add a newtype wrapper for the `storage_cache_policy` with useful defaults.
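A rough sketch of what such a newtype could look like, assuming clap's derive API and an inline JSON payload; the name `StorageCachePolicyArg` and its exact shape are illustrative, not taken from this PR:

```rust
use std::str::FromStr;

use clap::Parser;
use serde::Deserialize;

/// Illustrative newtype wrapping the cache policy, so `--help` can document it
/// and the defaults live in one place.
#[derive(Clone, Debug, Deserialize)]
struct StorageCachePolicyArg {
    max_cache_size: usize,
    max_entry_size: usize,
    max_cache_entries: usize,
}

impl Default for StorageCachePolicyArg {
    fn default() -> Self {
        Self {
            max_cache_size: 10_000_000,
            max_entry_size: 1_000_000,
            max_cache_entries: 1000,
        }
    }
}

impl FromStr for StorageCachePolicyArg {
    type Err = serde_json::Error;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Accept the policy as an inline JSON string on the command line.
        serde_json::from_str(s)
    }
}

#[derive(Parser)]
struct Options {
    /// The storage cache policy as JSON, e.g.
    /// '{"max_cache_size":10000000,"max_entry_size":1000000,"max_cache_entries":1000}'.
    #[arg(long)]
    storage_cache_policy: Option<StorageCachePolicyArg>,
}
```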
```rust
/// If the number of entries in the cache is too large then the underlying maps
/// become the limiting factor
pub const DEFAULT_STORAGE_CACHE_POLICY: StorageCachePolicy = StorageCachePolicy {
    max_cache_size: 10000000,
```
Why not a power of 2? I'd expect this to be something like 5MB for example.
```rust
pub const DEFAULT_STORAGE_CACHE_POLICY: StorageCachePolicy = StorageCachePolicy {
    max_cache_size: 10000000,
    max_entry_size: 1000000,
    max_cache_entries: 1000,
```
This is kind of redundant with `max_cache_size`, isn't it? I.e. both limit the size of the cache, but in different ways. I think it's possible to set them in an "incompatible" way. For example, with this `DEFAULT`, each entry gets (approximately) ~10Kb on average. Which is probably a lot 🤔
We need several ways to control the cache. They are not all independent, of course.
In the chosen settings, we have at most 1M for an individual entry, at most 10M for the full cache, and at most 1000 entries in the cache. Of course, if we have N entries and each entry has at most M bytes, then the total size is going to be at most N*M.
This is needed. If we have too many entries in the cache, then we run the risk of a cache containing 1 million entries of a few bytes each. That leads to a BTreeMap with 1 million entries, and the runtime can become large. We do not want to be in that situation.
We also do not want to be in a situation where the memory expense of the cache is too high. And limiting the maximum size of a cache entry allows us to avoid one big entry crowding out all others.
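Put together, the three limits roughly play the following roles; the `admits` helper below is a hypothetical sketch, only the field names come from the PR:

```rust
/// The three limits from the default policy above.
struct StorageCachePolicy {
    /// Upper bound on the total number of bytes held in the cache.
    max_cache_size: usize,
    /// Upper bound on the size of a single cached entry.
    max_entry_size: usize,
    /// Upper bound on the number of entries, to keep the underlying maps small.
    max_cache_entries: usize,
}

impl StorageCachePolicy {
    /// Hypothetical helper: decide whether an entry of `entry_size` bytes may enter
    /// the cache at all. Evicting older entries to get back under `max_cache_size`
    /// and `max_cache_entries` is a separate step, not shown here.
    fn admits(&self, entry_size: usize) -> bool {
        // An entry close to or above the full cache budget would evict most or
        // all existing entries, so it is refused outright.
        entry_size <= self.max_entry_size && entry_size <= self.max_cache_size
    }
}
```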
```rust
pub fn read_storage_cache_policy(storage_cache_policy: Option<String>) -> StorageCachePolicy {
    match storage_cache_policy {
        None => DEFAULT_STORAGE_CACHE_POLICY,
        Some(storage_cache_policy) => {
```
Why not allow configuring the caching policy by simply passing `--max-cache-size <bytes>`, `--max-cache-entries <number>`, etc. to the binary? Reading a file with these three values written down as JSON seems like an unnecessary complication 🤔
If you decide to promote them to binary arguments, you can even decide to make `max_cache_size` and `max_cache_entries` mutually exclusive.
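A sketch of the suggested flag-based configuration; these flags are hypothetical and not what the PR currently implements:

```rust
use clap::Parser;

/// Hypothetical alternative: expose each policy knob as its own flag.
#[derive(Parser)]
struct CacheOptions {
    /// Maximum total size of the storage cache, in bytes.
    #[arg(long, default_value_t = 10_000_000)]
    max_cache_size: usize,

    /// Maximum size of a single cached entry, in bytes.
    #[arg(long, default_value_t = 1_000_000)]
    max_entry_size: usize,

    /// Maximum number of entries in the storage cache.
    #[arg(long, default_value_t = 1000)]
    max_cache_entries: usize,
}
```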
We generally want to switch away from the use of too many command-line entries and toward the use of input files.
```rust
        ]),
    )
    .expect("Histogram can be created")
});
```
There are so many things here tagged with the `cfg(with_metrics)` attribute. Maybe extract them to a file and put the `cfg(with_metrics)` on the whole module?
Yes, that is annoying.
The solution I found was to use the same strategy as in linera-core/src/client/mod.rs of putting the contents in a "mod metrics" submodule.
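For illustration, the gated-module layout looks roughly like this; the metric name and help string are placeholders, not the PR's actual metrics:

```rust
/// One `cfg` gate on the whole submodule instead of one per item.
#[cfg(with_metrics)]
mod metrics {
    use std::sync::LazyLock;

    use prometheus::{register_histogram, Histogram};

    /// Placeholder metric: latency of reads served by the storage cache.
    pub static CACHE_READ_LATENCY: LazyLock<Histogram> = LazyLock::new(|| {
        register_histogram!(
            "storage_cache_read_latency",
            "Latency of reads served by the storage cache"
        )
        .expect("Histogram can be created")
    });
}
```

Call sites then only need a single `#[cfg(with_metrics)]` around each use of `metrics::...`.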
```diff
@@ -1,45 +1,305 @@
 // Copyright (c) Zefchain Labs, Inc.
 // SPDX-License-Identifier: Apache-2.0

-//! Add LRU (least recently used) caching to a given store.
+//! Add caching to a given store.
```
As for code organisation, I'd split this code into mod.rs plus separate files for the non-public elements. It would make reading (and understanding) much easier. For example, there's a lot of non-`pub` code here at the top of the file. So when reading it I go through a lot of "implementation details" first (like `ValueCacheEntry`, `CacheEntry`, etc.), without even getting to the public-facing API we're adding.
I haven't finished reviewing, but a general comment: if I were to write a caching layer around a DB, I'd implement the exact same API for the cache and forward calls to the underlying storage. For example:
```rust
trait Storage<K, V> {
    fn read(&mut self, k: &K) -> Result<V, Error>;
    fn write(&mut self, k: K, v: V) -> Result<(), Error>;
}

struct RocksDb { /* ... */ }
impl<K, V> Storage<K, V> for RocksDb { /* ... */ }

struct Cache<S> {
    cache: CacheInner,
    database: S,
}

impl<K: Clone, V: Clone, S: Storage<K, V>> Storage<K, V> for Cache<S> {
    fn read(&mut self, k: &K) -> Result<V, Error> {
        // Serve from the cache if possible, otherwise read through and populate it.
        if let Some(cached_v) = self.cache.read(k) {
            return Ok(cached_v);
        }
        let db_v = self.database.read(k)?;
        self.cache.write(k.clone(), db_v.clone());
        Ok(db_v)
    }
    // `write` would go to the database and update (or invalidate) the cache entry.
}
```
In this setup the cache has only minimally more logic than the actual storage.
But isn't that exactly how things are done? We have the trait. There is a little bit of extra logic, I agree. Normally, the other PR #2637 should do the job of removing the extra logic.
Force-pushed from c3b7f8b to 0b96d06.
Indeed.
Force-pushed from 0601333 to 8272124.
Force-pushed from 8272124 to f74f349.
Motivation
The LRU is currently caching only the `read_value` operations. This leaves some optimizations on the table.
Proposal
The following was done:
- `map` was renamed as `map_value`.
- Caching was added for `find_keys_by_prefix` and `find_key_values_by_prefix`.
- For `find_keys_by_prefix` we keep the map as suffix closed and we use the existing entries to get some value. So, if we already computed the result for prefix `[0]` then we can deduce the result for `[0,2]` (see the sketch below).
- The same deduction applies to `find_key_values_by_prefix`.
- `max_cache_size`, which before was a maximum on the number of entries, becomes a maximum on the number of bytes in the cache (and is expanded from 1000 to 10000000).
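To illustrate the prefix deduction in the list above, here is a sketch; it assumes (as `find_keys_by_prefix` suggests) that the cached keys are stored relative to their prefix, and it is not the PR's exact code:

```rust
use std::collections::BTreeMap;

/// Sketch: `map_find_keys` stores, per cached prefix, the keys found under it
/// (relative to that prefix). A query whose prefix extends a cached one can be
/// answered by filtering instead of hitting the backend.
fn find_keys_from_cache(
    map_find_keys: &BTreeMap<Vec<u8>, Vec<Vec<u8>>>,
    key_prefix: &[u8],
) -> Option<Vec<Vec<u8>>> {
    for (cached_prefix, keys) in map_find_keys {
        if key_prefix.starts_with(cached_prefix) {
            // The extra bytes of the queried prefix beyond the cached one.
            let rest = &key_prefix[cached_prefix.len()..];
            let result = keys
                .iter()
                .filter(|key| key.starts_with(rest))
                .map(|key| key[rest.len()..].to_vec())
                .collect();
            return Some(result);
        }
    }
    None
}
```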
Measurement setup:
- `test_wasm_end_to_end_amm` on `remote-net`.
- `ScyllaDb`, which has LRU caching.
- `--release`, which takes about 1H to compile.
Averaged results:
Test Plan
The CI.
Release Plan
Not relevant.
Links