
feat: Implement EventCacheStoreLock::lock() with poison error, and ::lock_unchecked #4285

Open · wants to merge 8 commits into base: main
Conversation

@Hywan Hywan commented Nov 18, 2024

A flask of poison.


Since #4192, the event cache has a cross-process lock via EventCacheStoreLock. It brings a lock method. Great! However, the caller doesn't know whether the lock has been acquired from the same holder or from another holder. The holder usually refers to the process name. If the holder of the lock changes, it means the in-memory data might now be invalid! That's absolutely terrifying. Imagine the following scenario:

  • the main app process adds events inside the event cache store
  • the main app process is paused
  • the push notification process is fired, and adds events inside the event cache store
  • the main app process is resumed
  • now what? How is this process supposed to know that its data is invalid and should be reloaded?

Poor main app process 😢. We must help this poor main app process, don't we?

Hence this PR.

It introduces a generation counter. The principle is really simple: every time the lock is acquired from another holder, the generation is incremented by one. The generation does not change if the holder stays the same between lock acquisitions.
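The rule above can be sketched in a few lines. This is purely illustrative: `LockInfo` and `acquire` are hypothetical names, not the SDK's actual types.

```rust
// Hypothetical sketch of the generation rule: the generation is bumped only
// when the holder changes between lock acquisitions.
#[derive(Default)]
struct LockInfo {
    holder: Option<String>,
    generation: u64,
}

impl LockInfo {
    /// Acquire the lock for `holder`, returning the resulting generation.
    fn acquire(&mut self, holder: &str) -> u64 {
        if self.holder.as_deref() != Some(holder) {
            // Another holder took the lock: the in-memory data of the
            // previous holder may be stale, so bump the generation.
            self.generation += 1;
            self.holder = Some(holder.to_owned());
        }
        self.generation
    }
}

fn main() {
    let mut lock = LockInfo::default();
    let g1 = lock.acquire("main_app");
    let g2 = lock.acquire("main_app"); // same holder: unchanged
    let g3 = lock.acquire("notification_process"); // new holder: bumped
    assert_eq!((g1, g2, g3), (1, 1, 2));
}
```

A holder that reads a generation greater than the one it last saw knows the lock has been poisoned and its in-memory data must be reloaded.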

So.

  • The first patch does exactly that: a generation value is added inside the stores (e.g. matrix-sdk-sqlite).
  • Next, EventCacheStoreLock::lock is renamed lock_unchecked. Why do we want an unchecked version of the lock? What does it even mean? A lock can be poisoned, but sometimes we just don't care! If we want to add a media to the cache, there is no in-memory invalidation involved. Using lock_unchecked is perfect in this case.
  • After that, we do a bit of Rust magic with BackingStore being automatically implemented for all Arc<T> where T: BackingStore.
  • And now we are ready to implement LockableEventCacheStore::is_poisoned. This “poisoning” API is restricted to the event cache; I didn't want to touch the CrossProcessStoreLock API. Maybe that would be useful for all cross-process store locks later, but for now, only event cache, nah!
  • All pieces are ready. We can implement EventCacheStoreLock::lock().
  • Finally, the test parts.

It's better to review this PR commit-by-commit for your own sanity.
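The overall shape of the two locking methods described above might look like the following. This is a self-contained sketch with placeholder types, not the actual matrix-sdk-base code; the real guard and error types carry more state.

```rust
// Placeholder types standing in for the real event cache store and guards.
#[allow(dead_code)]
struct Store {
    poisoned: bool,
}

#[allow(dead_code)]
struct Guard<'a> {
    store: &'a Store,
}

/// The error wraps the guard, so a poisoned caller can still use the lock
/// after reloading its in-memory data.
struct PoisonError<'a>(Guard<'a>);

impl Store {
    /// `lock_unchecked`: poisoning is ignored, e.g. for the media cache.
    fn lock_unchecked(&self) -> Guard<'_> {
        Guard { store: self }
    }

    /// `lock`: poisoning is surfaced so callers can reload in-memory data.
    fn lock(&self) -> Result<Guard<'_>, PoisonError<'_>> {
        let guard = Guard { store: self };
        if self.poisoned {
            Err(PoisonError(guard))
        } else {
            Ok(guard)
        }
    }
}

fn main() {
    let healthy = Store { poisoned: false };
    assert!(healthy.lock().is_ok());

    let poisoned = Store { poisoned: true };
    match poisoned.lock() {
        Err(PoisonError(_guard)) => { /* reload in-memory data here */ }
        Ok(_) => unreachable!("the lock is poisoned"),
    }
    let _media_guard = poisoned.lock_unchecked(); // poisoning doesn't matter here
}
```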


@Hywan Hywan force-pushed the feat-event-cache-lock-poisoned branch 5 times, most recently from cb04c54 to 05f0339 on November 18, 2024 16:33

codecov bot commented Nov 18, 2024

Codecov Report

Attention: Patch coverage is 98.46154% with 1 line in your changes missing coverage. Please review.

Project coverage is 85.04%. Comparing base (21bb85a) to head (cd4c12e).
Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
...rates/matrix-sdk-base/src/event_cache/store/mod.rs 96.29% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4285   +/-   ##
=======================================
  Coverage   85.04%   85.04%           
=======================================
  Files         274      274           
  Lines       30043    30080   +37     
=======================================
+ Hits        25550    25583   +33     
- Misses       4493     4497    +4     

This patch introduces the _generation_ value for the event cache lock: every time the lock is acquired by a new holder, the generation is incremented by 1. This is fundamental to know whether the lock has been poisoned or not.

We are about to introduce a lock which checks whether it's been poisoned. To not break the current API semantics, `lock` is renamed `lock_unchecked`, since there poisoning isn't checked and doesn't raise an error.

…kingStore`.

This patch implements `BackingStore` for all `Arc<T>` where `T: BackingStore`.

This patch implements `LockableEventCacheStore::is_poisoned` to know whether the lock has been poisoned or not.

This patch implements `EventCacheStoreLock::lock`, which is similar to `lock_unchecked` but returns a `Result<EventCacheStoreLockGuard, EventCacheStoreLockPoisonError<…>>` to handle the _poisoned_ case.

This patch adds a small `logged_in_base_client_with_store_config` function, similar to `logged_in_base_client`, but… with… a store config!

This patch adds tests for `EventCacheStoreLock::lock` and `lock_unchecked`.
@Hywan Hywan force-pushed the feat-event-cache-lock-poisoned branch from 05f0339 to 5073cd1 on November 18, 2024 19:01
@Hywan Hywan marked this pull request as ready for review November 18, 2024 19:07
@Hywan Hywan requested review from a team as code owners November 18, 2024 19:07
@Hywan Hywan requested review from bnjbvr and andybalaam and removed request for a team November 18, 2024 19:07
@Hywan Hywan force-pushed the feat-event-cache-lock-poisoned branch from 48c54e1 to cd4c12e on November 19, 2024 08:33
@bnjbvr bnjbvr left a comment

Thanks for making small commits, this made it easier to review.

main comment

I have my doubts about exposing the fact that the store has been poisoned to callers of the event cache store. All the data that needs to be reset upon poisoning should be owned by the store itself, so that it can reset it internally, and obtaining the lock stays transparent to the callers; otherwise, each call site now needs to decide what to do in that case, and that doesn't make for a very ergonomic API at all (and the Result<Result<>> reflects that weirdness).

Does it make sense to you too?

future directions

As discussed in the small chat we had today :) Those things can be implemented later, as follow-ups (even though I'd love to see the first item sooner rather than later :)).

crypto store

Since this is implementing from scratch the same mechanics that existed for the crypto-store, I strongly think we should merge implementations, instead of having two implementations that are close to each other but different. I think we should do it sooner rather than later, otherwise we'd be shipping new tech debt 😟

lock vs lock_unchecked

I like the idea of separating lock() from lock_unchecked(); I think we can go the next mile by splitting the store state into two parts:

  • bits that are OK to touch while holding a lock, independently of other processes: the media cache is a good example of that, likely. This would go into its own data structure, conceptually a SharedEventCacheStore, and it would contain references or own those bits.
  • bits that are not OK to touch while another process holds the lock, and that would need to be reset upon poisoning (aka no aliasing allowed), conceptually a UniqueEventCacheStore. This would contain the room events and such.

Then, to make it so that it's not possible to take the lock with lock_unchecked and touch the bits one is not allowed to, we'd slightly change the signatures of those two functions:

  • lock() -> Result<UniqueEventCacheStore, ...>
  • lock_unchecked() -> Result<SharedEventCacheStore, ...>

This way, we have static safety around those types, and it's not even possible to shoot oneself in the foot by misusing the lock.
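A minimal sketch of that split, with all names hypothetical as in the suggestion above:

```rust
// Hypothetical types from the suggestion: the shared half is reachable
// without a checked lock, the unique half only through one.
struct SharedEventCacheStore; // e.g. the media cache: fine under lock_unchecked
struct UniqueEventCacheStore; // e.g. room events: reset upon poisoning

struct EventCacheStoreLock {
    poisoned: bool,
}

struct PoisonError;

impl EventCacheStoreLock {
    /// Only the checked lock hands out the poison-sensitive half.
    fn lock(&self) -> Result<UniqueEventCacheStore, PoisonError> {
        if self.poisoned {
            Err(PoisonError)
        } else {
            Ok(UniqueEventCacheStore)
        }
    }

    /// The unchecked lock can only ever touch the shared half.
    fn lock_unchecked(&self) -> SharedEventCacheStore {
        SharedEventCacheStore
    }
}

fn main() {
    let lock = EventCacheStoreLock { poisoned: false };
    assert!(lock.lock().is_ok());
    let _shared = lock.lock_unchecked();

    let poisoned = EventCacheStoreLock { poisoned: true };
    assert!(poisoned.lock().is_err());
    let _still_ok = poisoned.lock_unchecked(); // media access stays possible
}
```

The type system then enforces that `lock_unchecked` callers can never reach the bits that need resetting on poison.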

@@ -54,6 +54,13 @@ use crate::{
SendOutsideWasm,
};

/// A lock generation is an integer incremented each time it is taken by another
/// holder. This is not used by all cross-process locks.

What's the meaning of the last sentence here? Consider removing it now, since we're likely to forget about this comment and let it rot even in a future where the crypto store lock would use it?

};

type HolderExpirationGeneration = (String, Instant, LockGeneration);

Care to either comment, or use a small struct with named fields, please?
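For instance, a named-field version of the tuple might look like this (field names are hypothetical, not from the PR):

```rust
use std::time::Instant;

type LockGeneration = u64;

/// Named-field replacement for the `(String, Instant, LockGeneration)`
/// tuple, so call sites don't need to remember what .0, .1 and .2 are.
struct HolderExpirationGeneration {
    /// Who currently holds the lock (usually a process name).
    holder: String,
    /// When the current hold on the lock expires.
    expires_at: Instant,
    /// Incremented each time the lock is taken by another holder.
    generation: LockGeneration,
}

fn main() {
    let entry = HolderExpirationGeneration {
        holder: "main_app".to_owned(),
        expires_at: Instant::now(),
        generation: 1,
    };
    assert_eq!(entry.holder, "main_app");
    assert_eq!(entry.generation, 1);
    let _ = entry.expires_at;
}
```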

@@ -506,36 +520,45 @@ pub mod memory_store_helper {
time::{Duration, Instant},
};

use super::{LockGeneration, FIRST_LOCK_GENERATION};

type HolderExpirationGeneration = (String, Instant, LockGeneration);

Can you deduplicate this?

WHERE holder = ?2
ON CONFLICT (key) DO
UPDATE SET
holder = excluded.holder,

Can you add a comment explaining where this excluded entity comes from? More generally, can you add a comment explaining what this query does, because it's not quite the regular SQL one sees often? :)

struct LockableEventCacheStore(Arc<DynEventCacheStore>);
struct LockableEventCacheStore {
store: Arc<DynEventCacheStore>,
generation_and_is_poisoned: Arc<Mutex<(LockGeneration, bool)>>,

Shouldn't the LockGeneration be optional before the first time the generation is loaded from the database?

Also: can you use a small struct for holding the LockGeneration and the bool? It will make documenting what the bool is better, and make the code using this lock easier to understand (no need to know what fields 0 and 1 are)
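One possible shape for that struct, with the optional generation suggested above (all names hypothetical):

```rust
type LockGeneration = u64;

/// Hypothetical replacement for the `(LockGeneration, bool)` tuple behind
/// the mutex, with named and documented fields.
#[derive(Default)]
struct LockState {
    /// `None` until a generation has been read from the database.
    generation: Option<LockGeneration>,
    /// Whether another holder took the lock since we last held it.
    is_poisoned: bool,
}

fn main() {
    let mut state = LockState::default();
    assert!(state.generation.is_none() && !state.is_poisoned);

    // First read from the database: record the generation.
    state.generation = Some(1);

    // A later read returns a higher generation: another holder took the
    // lock in between, so we are poisoned.
    let from_db = 2;
    state.is_poisoned = state.generation.map_or(false, |g| from_db > g);
    state.generation = Some(from_db);
    assert!(state.is_poisoned);
}
```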


Ok(EventCacheStoreLockGuard { cross_process_lock_guard, store: self.store.deref() })
Ok(if self.lockable_store.is_poisoned() {
Err(EventCacheStoreLockPoisonError(event_cache_store_lock_guard))

main comment here

Now that I see the full picture, I doubt that this is the best way to go. Why? Because the callers of lock() are now the ones who have to decide what to do based on getting a poisoned lock; and my guess is that they'll all do the same thing, that is, resetting the in-memory content of the event cache store. Since we are in the event cache store, could we instead reset the inner content ourselves, and then return the EventCacheStoreLockGuard? (This would be aligned with the end target of the crypto store cross-process lock, fwiw)

And considering the little code that was in lock_unchecked(), I'm sure we can keep the previous code of lock_unchecked() as it was, and not have lock_unchecked() call into lock(). That would also avoid the EventCacheStoreLockPoisonError type, which is super weird.

Member Author

The EventCacheStoreLock isn't responsible for invalidating/refreshing the in-memory data we have. However, EventCache and other users of this lock are responsible for that. Example: every time EventCache calls store.lock(), it is responsible for refreshing its own data. I don't see how EventCacheStoreLock can track all data used everywhere related to the event cache store. It's basically impossible.

Remember that the EventCache uses the EventCacheStoreLock to load data from the store into memory, or to save data from memory into the store. The EventCacheStoreLock has no idea where the data it has loaded or saved is used, and by whom.

Member

I don't see how EventCacheStoreLock can track all data used everywhere related to the event cache store. It's basically impossible.

If it owns it, as I suggested, it doesn't need to track it.

A thought experiment: if we keep the current API (modulo making the return type of the lock look better, to not have to deal with a result of a result), then the next thing I would implement is a thin wrapper on top of the lock, to invalidate the data manually. So it seems like this is just kicking the can down the road, and pushing the problem elsewhere in our architecture.

Moreover, using a generic parameter for the data stored along the lock and that needs a reset would nicely solve the layer separation. A thin trait InvalidatableData { fn invalidate(); } (lol @ naming) would be sufficient to represent data that needs invalidation, and mocking it in testing situations should be straightforward (use a bool, toggle it when invalidate() is called).

I am really not sure the splitting of responsibility gives us anything valuable, so I still need to be convinced this is the best approach.

Would you have WIP code making use of this, so I can get a better opinion about how it's going to be used?
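The `InvalidatableData` idea floated above, including the suggested test double, could be sketched like this (all names hypothetical):

```rust
/// Data stored alongside the lock that must be reset when the lock is
/// poisoned, i.e. when another holder took it in between.
trait InvalidatableData {
    fn invalidate(&mut self);
}

/// The suggested mock for tests: a bool toggled by `invalidate()`.
#[derive(Default)]
struct MockData {
    invalidated: bool,
}

impl InvalidatableData for MockData {
    fn invalidate(&mut self) {
        self.invalidated = true;
    }
}

fn main() {
    let mut data = MockData::default();
    assert!(!data.invalidated);

    // The lock would call this itself upon detecting poisoning.
    data.invalidate();
    assert!(data.invalidated);
}
```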


/// Create a [`BaseClient`] with the given user id, if provided, or a hardcoded
/// one otherwise, and with a store config.
pub(crate) async fn logged_in_base_client_with_store_config(

Uhh, this is super ugly, and these are the methods I wanted to see go away with the new MatrixMockServer and MockClientBuilder; any chance you can reuse MockClientBuilder instead?

@andybalaam

I will hold off on reviewing until you let me know the ideas have settled a bit or you want my input.

I also would like to share logic where it makes sense with the crypto store.

@Hywan
Copy link
Member Author

Hywan commented Nov 20, 2024

I totally agree we should share the logic with the crypto store. However, it implies non-trivial work to move this generation and lock poison mechanism into the CrossProcessStoreLock API. I prefer to go step-by-step, especially because I'm running out of time to finish the event cache persistent storage. This PR is for the event cache. I agree a refactoring of all cross-process store locks will be necessary once this patch lands and once the design has proven to be mature. Moreover, the work on the crypto store isn't finished and I'm not confident I can rewrite and complete it. I think we should have a team discussion about that, to come up with a solid design together. Having a bit of “duplication” isn't dramatically bad here: it's a specific solution, and I agree a generic solution would be an improvement, but I don't see that as a blocker.

@Hywan
Copy link
Member Author

Hywan commented Nov 20, 2024

After a team discussion, here is the new plan:

  • create new crate: matrix-sdk-cross-process-lock
  • move and rename CrossProcessStoreLock to CrossProcessLock inside matrix-sdk-cross-process-lock
  • improve the CrossProcessLock to include the generation mechanism introduced in #4285
  • add the lock_unchecked and lock+PoisonError logic inside this crate too, including the lock guards etc.
  • create a trait to represent a “poisonable” lock

Once we have that, the following items can be triggered:

  • matrix-sdk-crypto uses this new trait (+ friendly default to use CrossProcessStoreLock maybe?)
  • either:
    • update the state store to be cross-process aware, then event cache store and crypto store use the state store
    • use a new front type, like Stores, that is cross-process aware, and that gives access to all the stores
  • the event cache can implement EventCacheData that contains all in-memory data + owned access to the store (and reloads in-memory data in case of poison)
  • the crypto crate can implement CryptoData that contains all in-memory data + owned access to the store (and reloads in-memory data in case of poison) <- is it doable?
  • same dance for the state store?
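The “poisonable lock” trait from the plan could take roughly this shape; none of these names exist in the SDK yet, and the real trait would be async and carry richer guard types.

```rust
/// Hypothetical trait representing a cross-process lock whose acquisition
/// can detect poisoning (a holder change since we last held it).
trait PoisonableLock {
    type Guard;

    /// `Err` still carries the guard, so a poisoned caller can reload its
    /// in-memory data and then proceed.
    fn lock(&self) -> Result<Self::Guard, Self::Guard>;

    /// Acquire the lock without checking for poisoning.
    fn lock_unchecked(&self) -> Self::Guard;
}

/// Trivial implementation for illustration only.
struct DummyLock {
    poisoned: bool,
}

impl PoisonableLock for DummyLock {
    type Guard = ();

    fn lock(&self) -> Result<(), ()> {
        if self.poisoned {
            Err(())
        } else {
            Ok(())
        }
    }

    fn lock_unchecked(&self) {}
}

fn main() {
    assert!(DummyLock { poisoned: false }.lock().is_ok());
    assert!(DummyLock { poisoned: true }.lock().is_err());
    DummyLock { poisoned: true }.lock_unchecked(); // never fails
}
```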

@Hywan Hywan mentioned this pull request Nov 28, 2024