Expand and clarify consitency/durability docs in store.wit #56

dicej · 2024-12-16T19:42:03Z

The existing docs are somewhat vague about how the "read your writes" consistency model works in practice, so I've tried to make them more explicit. Also, they don't mention durability at all, so I've added a section dedicated to that.

Note that I've generally erred on the side of maximum portability across host implementations at the expense of strong guarantees for the guest. Based on previous conversations, my understanding is that we do want to support implementations backed by eventually consistent distributed systems, and that means portable guest code cannot assume a stronger consistency model than what such systems can deliver. Concretely, we must consider the scenario where a host has a pool of connections to multiple replicas in such a system such that a single component instance which opens the same bucket multiple times might get a different replica (each with its own view of the state) each time.

If we feel the guarantees described in these docs are too weak, we can certainly strengthen them at the expense of host implementation flexibility. Alternatively, we could add new APIs for querying and/or controlling the durability and consistency models provided by the implementation -- or even allow the guest to statically declare that it requires some specific consistency model by importing a specific interface corresponding to that model, analogous to what we did with the atomics interface.

Regardless of what set of (non-)guarantees and features we settle on, my main priority is to be as clear as possible about them so that application developers are not caught by surprise.

The existing docs are somewhat vague about how the "read your writes" consistency model works in practice, so I've tried to make them more explicit. Also, they don't mention durability at all, so I've added a section dedicated to that. Note that I've generally erred on the side of maximum portability across host implementations at the expense of strong guarantees for the guest. Based on previous conversations, my understanding is that we _do_ want to support implementations backed by eventually consistent distributed systems, and that means portable guest code cannot assume a stronger consistency model than what such systems can deliver. Concretely, we must consider the scenario where a host has a pool of connections to multiple replicas in such a system such that a single component instance which opens the same bucket multiple times might get a different replica (each with its own view of the state) each time. If we feel the guarantees described in these docs are too weak, we can certainly strengthen them at the expense of host implementation flexibility. Alternatively, we could add new APIs for querying and/or controlling the durability and consistency models provided by the implementation -- or even allow the guest to statically declare that it requires some specific consistency model by importing a specific interface corresponding to that model, analogous to what we did with the `atomics` interface. Regardless of what set of (non-)guarantees and features we settle on, my main priority is to be as clear as possible about them so that application developers are not caught by surprise. Signed-off-by: Joel Dice <[email protected]>

wit/store.wit

Mossaka · 2024-12-16T21:21:31Z

wit/store.wit

+///
+/// ## Durability
+///
+/// This interface does not currently make any hard guarantees about the durability of values


I think it's okay to leave the durability wide open. I am wondering in your case 3 - under async set calls scenario, we want to emphasize that the implementation should still guarantee "Read your write" data consistency.

Now, there is a question of "what happens if an async I/O error occurs right after the set call completes successfully": a weak point of the current specification and I was hoping that we could address this one.

In a strict interpretation of the spec, once set is Ok, the handle SHOULD behave as if the value is now present. A get on the same handle SHOULD return the new value.

If the store experiences a critical I/O failure that causes data corruption or data loss, there are currently no instructions on how the store should respond. Should it return Err(error::other(...)) on subsequent get calls?

I think there are two possible ways to extend the specification to address the above concerns:

Handle defunct after errors

We could define that once a bucket handle experiences a critical I/O error, all further operations on that handle must return an error. That is, if a store fails after set, it would no longer provide a consistent view for subsequent get operations. This does not violate the “read your write” guarantee since the handle is considered defunct.

a Best-effort guarantee tied to success conditions

The specification could define that “read your writes” holds as long as the store does not fail irrecoverably between operations. A get operation should return a Err(error::other("I/O failure")) to reflect the error condition from the store.

@Mossaka Based on the previous discussion above, I think there's performance reasons not to require "read your writes" (even when reads follow writes on the same bucket handle). In particular, if the implementation of write sends the written values out over the network to a primary/writer node, and the implementation of read sends a request over the network to a read replica (distinct from the primary writer node), then you won't have "read your writes" without maintaining extra cached copies or making extra network requests. Thus, I think even when there is not an irrecoverable error, we shouldn't say that "read your writes" holds.

Mossaka · 2024-12-16T21:24:03Z

wit/store.wit

+/// ## Durability
+///
+/// This interface does not currently make any hard guarantees about the durability of values
+/// stored.  A valid implementation might rely on an in-memory hash table, the contents of which are


For in-memory stores, we probably want to emphasize that the data might be lost due to store crashed, and the Best-effort guarantee described in my comment above should apply to our specification - stating that the "read your write" consistency contract should only apply to store operating under normal conditions.

rylev · 2024-12-17T09:39:10Z

wit/store.wit

@@ -7,22 +7,65 @@
 /// ensuring compatibility between different key-value stores. Note: the clients will be expecting
 /// serialization/deserialization overhead to be handled by the key-value store. The value could be
 /// a serialized object from JSON, HTML or vendor-specific data types like AWS S3 objects.
+///
+/// ## Consistency
 /// 
 /// Data consistency in a key value store refers to the guarantee that once a write operation
 /// completes, all subsequent read operations will return the value that was written.


all subsequent read operations will return the value that was written.

It would be nice to understand what "context" (borrowing the terminology below) this is meant for.

One reading of this (which I assume is not meant) is "all subsequent read operations globally (from any client) will return the value that was written". I assume what is actually meant is all reads from the client that performed the write. Perhaps we should move the definitions of client and context from below to the top of this section and then be explicit about how all operations unless otherwise stated are only from the perspective of the current client.

Makes sense; I just pushed an update which simply removes the first paragraph since the second one says the same thing more precisely.

rylev · 2024-12-17T09:42:33Z

wit/store.wit

+/// In other words, the `bucketC` resource may reflect either the most recent write to the `bucketA`
+/// resource, or the one to the `bucketB` resource, or neither, depending on how quickly either of
+/// those writes reached the replica from which the `bucketC` resource is reading.  However,
+/// assuming there are no unrecoverable errors -- such that the state of a replica is irretrievably


I'm confused why we mention "unrecoverable errors". Such errors aren't visible to the guest and thus aren't really of consequence to the guest. I believe the important bit is that the writes one one resource are not guaranteed to be reflected on subsequent reads of a different resource.

As things are written I'm unsure about the following situation. Imagine the guest code:

bucketA = open("foo") bucketB = open("foo") bucketA.set("bar", "a") sleep(1_000_000_years) assert bucketA.get("bar").equals(bucketB.get("bar"))

The client has left sufficient time (1,000,000 years) for replication to happen. However, the backing implementation uses caching such that once set is called, get on that resource will always reflect the call to set. Unfortunately, the underlying write failed and so the cache does not reflect the state of the backing store. This means bucketA and bucketB will never agree on the value of "bar".

Is that spec compliant?

The scenario I had in mind regarding "unrecoverable errors" was where bucketA is connected to replica X and bucketB is connected to replica Y, but replica X is lost (say the rack caught on fire) before it can send bucketA's write to replica Y. Very unlikely of course, and certainly outside the realm of normal operation, but it still prevents us from making any absolute guarantees. In any case, such an error is of consequence to the guest in that bucketA's write never had a chance to be the one the system eventually settles on. And if both replica X and replica Y were in that same unfortunate rack, then it's possible neither write made it to the rest of the system.

BTW, if the discussion of unusual errors is distracting and/or superfluous, I can omit it or move it to a footnote. I mainly just wanted to point out that failures in a distributed system are non-atomic and can affect the behavior of that system even when it's still (partially) available. That's in contrast to a centralized, ACID database where it either fails completely or not at all.

Regarding caching: I expect assert bucketA.get("bar").equals(bucketB.get("bar")) should eventually be true for a long running process; i.e. values shouldn't be cached indefinitely. Not sure exactly where we draw the line on cache invalidation timing, but certainly less than a million years :). And implementations based on systems which support proactive cache eviction (e.g. by pushing notifications to clients) would presumably make use of that.

I don't think this discussion is superfluous. I think it's extremely important. It's the difference between whether host implementors of this interface need to wait for guarantee of replication or not. When we settle on the semantics of writes are not guaranteed to replicate, then that means the guest can never trust a write except by opening a new resource handle and doing a new read, right?

When we settle on the semantics of writes are not guaranteed to replicate, then that means the guest can never trust a write except by opening a new resource handle and doing a new read, right?

Yes, that sounds correct to me. FWIW, I do think supporting two kinds of writes (one that uses write-behind caching to avoid blocking and another that blocks until it has received confirmation from at least one replica) and two kinds of reads (one that uses a cache and one that doesn't) could make sense. Even when using the blocking versions of those operations, though, we still wouldn't be able to make guarantees about if/when the write is visible using a different resource handle (since it might be connected to a different replica).

Some distributed databases use a single-master replication model, which make it easier to provide stronger guarantees -- e.g. as long as you get write confirmation from the master and then, when reading, request that the replica syncs with the master before returning a result, then you'll get very ACID-style semantics. That's what Turso does to implement transactional writes and BEGIN IMMEDIATE transactional reads. The only way to do that with a highly-available, asynchronous, peer-to-peer database is to request write confirmation from all replicas and then, when reading, request that the replica you're talking to sync with all the other replicas before returning a result.

It might help in this discussion to nail down the minimum feature set (related to consistency, durability, or otherwise) a backing key value store must provide to be compatible with wasi-keyvalue, and then determine which systems (e.g. Redis, Cassandra, Memcached, etc.) actually support them. If all the backing stores we want to use support consistency features with tighter guarantees than the ones I've described here, then we can tighten up this language as well.

It was somewhat redundant (and potentially misleading) given that the following paragraph says the same thing less ambiguously and defines exactly in which circumstances the "read your writes" guarantee applies. Signed-off-by: Joel Dice <[email protected]>

Mossaka

LGTM my comments could be seen as follow ups and should not block this PR because I think it brings a lot of value to improve the current documentation on the data consistency part.

rylev · 2025-01-17T14:07:34Z

wit/store.wit

+/// writes." In particular, this means that a `get` call for a given key on a given `bucket`
+/// resource should never return a value that is older than the the last value written to that key
+/// on the same resource, but it MAY get a newer value if one was written around the same
+/// time. These guarantees only apply to reads and writes on the same resource; they do not hold


These guarantees only apply to reads and writes on the same resource;

I think we might be burying the lead a bit. It might be useful to start the consistency section with a quick sentence that says that there are no consistency guarantees across resource handles.

Makes sense; please see my latest push and let me know if it still needs improvement.

rylev · 2025-01-17T14:10:14Z

wit/store.wit

+/// // ...whereas this is NOT guaranteed to succeed immediately (but should eventually):
+/// // assert bucketB.get("bar").equals("a")


It sounds like from what's written above this is not guranteed to ever be true. Since consistency is not guaranteed across resource handles, bucketB.get("bar") may never equal "a" even with unlimited time and no other writes.

Right, hence the "should". I think it's worth mentioning what a conforming implementation should make a best effort to do (i.e. in normal operation, barring exceptional circumstances) as well as what it must do.

If we're using the RFC 2119 meaning of "should" I think we should write it as "SHOULD" (in all caps). A non-RFC definition of "should" here might lead readers to interpret "should" as "will".

rylev · 2025-01-17T14:12:18Z

wit/store.wit

+/// Once a value is `set` for a given key on a given `bucket`, all subsequent `get` requests on that
+/// same bucket will reflect that write or any subsequent writes. `get` requests using a different
+/// bucket may or may not immediately see the new value due to e.g. cache effects and/or replication
+/// lag.


I'd prefer if we were consistent about when we used "resource" vs. "bucket". I think you mean "resource" here, because if there is a second resource handle to the same logical "bucket" then subsequent get requests are not guaranteed to read the write.

That's fair. I'm using bucket here to mean an instance of the bucket resource, but I can change that to "resource handle" if that's clearer.

I just pushed an update to consistently use the term "bucket resource" everywhere, plus a paragraph in the resource bucket docs to clarify that it represents a connection to a key-value store rather than the store itself.

rylev · 2025-01-17T14:15:07Z

wit/store.wit

+/// In other words, the `bucketC` resource may reflect either the most recent write to the `bucketA`
+/// resource, or the one to the `bucketB` resource, or neither, depending on how quickly either of
+/// those writes reached the replica from which the `bucketC` resource is reading.  However,
+/// assuming there are no unrecoverable errors -- such that the state of a replica is irretrievably


I don't think this discussion is superfluous. I think it's extremely important. It's the difference between whether host implementors of this interface need to wait for guarantee of replication or not. When we settle on the semantics of writes are not guaranteed to replicate, then that means the guest can never trust a write except by opening a new resource handle and doing a new read, right?

Signed-off-by: Joel Dice <[email protected]>

rylev · 2025-01-20T10:46:52Z

@dicej this is getting closer. There's a few places where I'd still like to clarify what is a MAY vs. SHOULD vs. MUST, but I think once those are taken care of, I'd be happy with the wording.

For the record, I'm not sure I'm fully on board with the semantics described here (vs. having it so that writes MUST be eventually reflected), but I do think the changes here at least make it much clearer what the semantics as written actually are.

Signed-off-by: Joel Dice <[email protected]>

dicej · 2025-01-21T17:15:54Z

@dicej this is getting closer. There's a few places where I'd still like to clarify what is a MAY vs. SHOULD vs. MUST, but I think once those are taken care of, I'd be happy with the wording.

Thanks; I just pushed an update; please let me know if I missed anything.

For the record, I'm not sure I'm fully on board with the semantics described here (vs. having it so that writes MUST be eventually reflected), but I do think the changes here at least make it much clearer what the semantics as written actually are.

I hear you. To me this boils down to whether we try to support BASE-style DBMSes such as Cassandra and CouchDB in this interface or not. Those systems are designed with a different set of tradeoffs in mind, favoring partition tolerance, low-latency and availability over consistency (i.e. in extreme circumstances they prioritize the former over the latter, and this can lead to lost writes in the case of unrecoverable replica failures). We could either:

Support them by way of the SHOULD terminology I've used here
Choose not to support them
Support them, but in a separate interface (e.g. async-store?)

This PR represents the first option as a conservative default, but we can always change the SHOULDs to MUSTs later if we decide to pursue the second or third options.

Signed-off-by: Joel Dice <[email protected]>

lukewagner · 2025-01-21T23:15:04Z

Great to see the work here fleshing out consistency and durability and adding examples!

One request: perhaps surprisingly, even though Read Your Writes seems like it should be so basic that is comes "for free" (and indeed, on many single-node implementations, it does), in a highly-distributed key-value store, Read Your Writes does add some overhead. This is the case for Fastly's edge key-value store today, but I think the same laws of physics would apply to other low-latency geo-distributed kv stores where writes may take a different physical path than (cached) reads. Thus, if we're already designing wasi:keyvalue/store to be rather-weak and implementable in terms of many diverse kv stores impls anyways, I suggest we don't add the Read Your Writes consistency guarantee.

Also, Read Your Writes is just one of a lattice of rather-weak consistency models, so it'd be a bit arbitrary (at least without more of a broad survey of use cases) to pick "Read Your Writes" and not, say, Causal. Maybe one day we add more wasi:keyvalue/* interfaces that cover all these consistency models (we already have wasi:keyvalue/atomics which hits Sequential and could be expanded with more operations). But starting with wasi:keyvalue/store being just "eventually consistent" seems to make sense as a starting point.

Btw, another consistency(ish) guarantee I think we could include beyond "eventual consistency" is "there are no out-of-thin-air values" (i.e., if a read returns a value v, it's because some other write was made with that value v). No-thin-air-values is famously hard to rule out in weak formal theoretical models, but easy enough to hand-wave at and maybe useful as a complement to "eventual consistency" to illustrate the baseline.

The semantic (non-)guarantees for wasi-keyvalue are still [under discussion](WebAssembly/wasi-keyvalue#56), but meanwhile the behavior of Spin's write-behind cache has caused [some headaches](spinframework#2952), so I'm removing it until we have more clarity on what's allowed and what's disallowed by the proposed standard. The original motivation behind `CachingStoreManager` was to reflect the anticipated behavior of an eventually-consistent, low-latency, cloud-based distributed store and, per [Hyrum's Law](https://www.hyrumslaw.com/) help app developers avoid depending on the behavior of a local, centralized store which would not match that of a distributed store. However, the write-behind caching approach interacts poorly with the lazy connection establishment which some `StoreManager` implementations use, leading writes to apparently succeed even when the connection fails. Subsequent discussion regarding the above issue arrived at a consensus that we should not consider a write to have succeeded until and unless we've successfully connected to and received a write confirmation from at least one replica in a distributed system. I.e. rather than the replication factor (RF) = 0 we've been effectively providing up to this point, we should provide RF=1. The latter still provides low-latency performance when the nearest replica is reasonably close, but improves upon RF=0 in that it shifts responsibility for the write from Spin to the backing store prior to returning "success" to the application. Note that RF=1 (and indeed anything less than RF=ALL) cannot guarantee that the write will be seen immediately (or, in the extreme case of an unrecoverable failure, at all) by readers connected to other replicas. Applications requiring a stronger consistency model should use an ACID-style backing store rather than an eventually consistent one. Signed-off-by: Joel Dice <[email protected]>

The semantic (non-)guarantees for wasi-keyvalue are still [under discussion](WebAssembly/wasi-keyvalue#56), but meanwhile the behavior of Spin's write-behind cache has caused [some headaches](#2952), so I'm removing it until we have more clarity on what's allowed and what's disallowed by the proposed standard. The original motivation behind `CachingStoreManager` was to reflect the anticipated behavior of an eventually-consistent, low-latency, cloud-based distributed store and, per [Hyrum's Law](https://www.hyrumslaw.com/) help app developers avoid depending on the behavior of a local, centralized store which would not match that of a distributed store. However, the write-behind caching approach interacts poorly with the lazy connection establishment which some `StoreManager` implementations use, leading writes to apparently succeed even when the connection fails. Subsequent discussion regarding the above issue arrived at a consensus that we should not consider a write to have succeeded until and unless we've successfully connected to and received a write confirmation from at least one replica in a distributed system. I.e. rather than the replication factor (RF) = 0 we've been effectively providing up to this point, we should provide RF=1. The latter still provides low-latency performance when the nearest replica is reasonably close, but improves upon RF=0 in that it shifts responsibility for the write from Spin to the backing store prior to returning "success" to the application. Note that RF=1 (and indeed anything less than RF=ALL) cannot guarantee that the write will be seen immediately (or, in the extreme case of an unrecoverable failure, at all) by readers connected to other replicas. Applications requiring a stronger consistency model should use an ACID-style backing store rather than an eventually consistent one. Signed-off-by: Joel Dice <[email protected]>

Mossaka · 2025-02-25T05:46:21Z

Great to see the work here fleshing out consistency and durability and adding examples!

One request: perhaps surprisingly, even though Read Your Writes seems like it should be so basic that is comes "for free" (and indeed, on many single-node implementations, it does), in a highly-distributed key-value store, Read Your Writes does add some overhead. This is the case for Fastly's edge key-value store today, but I think the same laws of physics would apply to other low-latency geo-distributed kv stores where writes may take a different physical path than (cached) reads. Thus, if we're already designing wasi:keyvalue/store to be rather-weak and implementable in terms of many diverse kv stores impls anyways, I suggest we don't add the Read Your Writes consistency guarantee.

Also, Read Your Writes is just one of a lattice of rather-weak consistency models, so it'd be a bit arbitrary (at least without more of a broad survey of use cases) to pick "Read Your Writes" and not, say, Causal. Maybe one day we add more wasi:keyvalue/* interfaces that cover all these consistency models (we already have wasi:keyvalue/atomics which hits Sequential and could be expanded with more operations). But starting with wasi:keyvalue/store being just "eventually consistent" seems to make sense as a starting point.

Btw, another consistency(ish) guarantee I think we could include beyond "eventual consistency" is "there are no out-of-thin-air values" (i.e., if a read returns a value v, it's because some other write was made with that value v). No-thin-air-values is famously hard to rule out in weak formal theoretical models, but easy enough to hand-wave at and maybe useful as a complement to "eventual consistency" to illustrate the baseline.

Thanks for sharing this detailed perspective on the consistency models! You made a great point about read-your-write consistency potentially adding overhead in highly distributed systems. I agre that if we are designing wasi:keyvalue/store to be implementable across diverse KV store, starting with true "eventual consistency" as our baseline makes sense.

Would you be open to filing this as a separate issue? I think the current PR has good shape overall, and I'd prefer not to block its progress while we reconsider the consistency model. We could then have a focused discussion about the appropriate baseline guarantees for wasi:keyvalue/store and potentially plan for additional interfaces that address the consistency models in the future.

Mossaka · 2025-02-25T05:53:10Z

wit/store.wit

@@ -67,7 +112,14 @@ interface store {
    /// 6. Memcached calls a collection of key-value pairs a slab
    /// 7. Azure Cosmos DB calls a collection of key-value pairs a container
    ///
-    /// In this interface, we use the term `bucket` to refer to a collection of key-value pairs


I found the wording "connection to a collection of key-value pairs" instead of "a collection of key-value pairs" to be a bit strange - it now implies a networked view instead of a logical container. What does this say to downstream implementation that does not involve networking, e.g. a filesystem implementation?

I used that wording to emphasize the fact that you can have to bucket resource handles pointing to the same key-value connection but connected to different replicas in an eventually consistent distributed system, in which case they'll see that collection from different points of view such that values may arrive in different orders, etc. In other words, I'm trying to emphasize that each handle represents a potentially unique view of the collection which is not necessarily consistent with another view, despite being opened with the same name.

It might help to use two different terms for these concepts, e.g. "bucket" could refer to the collection while "bucket-view" refers to a specific view of the collection, similar the distinction between a value and a pointer to a value in a programing language.

In the interest of minimizing further changes to this PR, though, would it help to change "connection to a collection of key-value pairs" to "view of a collection of key-value pairs" (and likewise replace "connection" with "view" anywhere else it appears)?

Thanks for clarifying. I am okay to merge this PR as is because we can always update the spec if other people find this confusing.

Mossaka · 2025-02-25T05:55:31Z

I am happy to merge this one as is - if there are no other active work you are doing @dicej

dicej · 2025-02-25T20:17:34Z

Would you be open to filing this as a separate issue? I think the current PR has good shape overall, and I'd prefer not to block its progress while we reconsider the consistency model.

I chatted with @lukewagner about this today, and he was concerned about pushing a PR to main that guarantees read-your-writes, only to yank that out soon after. I'll do another pass over this either on Friday or next week and aim to remove the read-your-writes parts, then we can do one final round of review and get this merged.

This removes the "read-your-writes" guarantee since several of the backing stores we wish to support either do not support it or do not support it by default. Signed-off-by: Joel Dice <[email protected]>

dicej · 2025-03-03T18:52:23Z

I just pushed an update that removes the "read-your-writes" guarantee and updates the pseudocode example to match.

Mossaka · 2025-03-07T01:05:13Z

wit/store.wit

+/// those writes reached the replica from which the `bucketC` resource is reading.  However,
+/// assuming there are no irrecoverable errors -- such that the state of a replica is irretrievably
+/// lost before it can be propagated -- one of the values ("a" or "b") MUST eventually be considered
+/// the "latest" and replicated across the system, at which point all three resources will return


one of the vlaues MUST eventually be considered the "latest".

Who decides what is to be considered the "latest"? In a distributed system, nodes might have clock skews that there is no global concept of "latest". My main concen here is related to conflict resolution mechansm and it's not clear to me whether the system or the application is responsible for resolving these conflicts and how to resolve them.

To elaborate more: if the system is using Last-Write-Wins to resolve conflict using a timestamp from the node's clock, due to clock skew, a write operation that actually occurred earlier in real-time might be assigned a later timestamp by a node whose clock is skewed. In this case, two nodes may not have agreed on the same value being the "latest".

I expect that different backing store implementations will potentially use different approaches to resolving conflicts and deciding which write wins (e.g. vector clocks, a centralized serializer, or even (pseudo)random choice), which is why I didn't specify which value ("a" or "b") would be preferred.

We haven't provided any API in this interface to resolve conflicts, so it can only be the responsibility of the system, not the application, AFAICT.

I agree that the term "latest" here may cause confusion. Would it help to change the wording to ...one of the values ("a" or "b") MUST eventually be considered the "winner"... and perhaps mention that the backing store is responsible for making that choice, which could be based on criteria not visible to the application?

...one of the values ("a" or "b") MUST eventually be considered the "winner"...

be considered the "winner" per node, but there won't be a global "winner" for that specific value.

e.g. Node A might determine "a" as the winner while node B might determine "b" as the winner.

I think it's fine to say "one of the values ("a" or "b") MUST eventually be considered the "winner" at each replica ..."

Or perhaps we can say "In the absence of irrecoverable errors and prolonged network partitions, the system will strive to converge towards a consistent state" This emphasizes that after updates cease and replication processes have completed, each replica will reach a stable and internally consistent state.

It is important to recognize that while the keyvalue system aims for replicas to have the same view of the data, realistically, a consistent state in eventual consistency does not guarantee that all replicas will converge to exactly the same value for every key.

It is important to recognize that while the keyvalue system aims for replicas to have the same view of the data, realistically, a consistent state in eventual consistency does not guarantee that all replicas will converge to exactly the same value for every key.

That surprises me; I thought part of the definition of eventual consistency was that, as long a no new updates are made, all replicas will eventually converge on the same value for every key. Can you give an example of an eventually consistent system that does not have that property?

It is a surpising fact that eventual consistency does not guarantee that all replicas will always converge on the same value for every key. It only guarantees that replicas will converge to some consistent state. The key is conflict resolution mechanism.

Since this proposal does not specify a conflict resolution mechanism and leaves it for the provider to decide, I can imagine the following scenario.

let's say we have two replicas A and B and initially there is a network partition
client X writes k=v1 to A, and client Y writes k=v2 to B.

After some time, the network partition recovers, and the system finds a conflict!

If the key-value store uses Last-Write-Win but relies on-node system clock (not a vector clock), conflicting timestamps might cause replicas to inconsistently order updates.

Let's say, replica A believes k=v1 arrives first due to clock skew, and replica B believes k=v2 arrives first. Then because of the Last-Write-Win conflict resolution mechanism, these two replicas will permanently have conflicting values for the same key.

That's why I raised the point about the importance of the conflict resolution mechanism (CRM) in scenarios where concurrent writes exist. If the system is using more advanced CRM, e.g. vector clocks, then it might ensure a single "winner" value.

That certainly is surprising. I wouldn't have guessed that each node is allowed to reach its own conclusion about what the "latest value" is and, in so doing, permanently diverge from its peers. TIL.

I'll update the PR to reflect that; probably next week some time.

the kind of eventual consistent system that guarantees that any two nodes have the same value for the same keys is apprantly called "strong eventual consistency". You will need to use CRDT to achieve that. That's a funny name

Sorry for the delay; I finally got around to updating this. Please let me know what you think.

My mental model of what eventual consistency does and does not guarantee was inaccurate, meaning much of what I wrote earlier was incorrect. I _believe_ this new version gets it right; we'll see. Yay for PR review! Signed-off-by: Joel Dice <[email protected]>

lukewagner · 2025-01-21T20:24:41Z

wit/store.wit

+/// asynchronously on a best-effort basis without blocking `set` calls, in which case an I/O error
+/// could occur after the component instance which originally made the call has exited.
+///
+/// Future versions of the `wasi-keyvalue` package may provide ways to query and control the


Suggested change

/// Future versions of the `wasi-keyvalue` package may provide ways to query and control the

/// Future versions of `wasi:keyvalue` may provide ways to query and control the

lukewagner · 2025-06-20T16:52:48Z

wit/store.wit

+/// earlier time.  Beyond that, there are no guarantees, and thus a portable component must neither
+/// expect nor rely on anything else.
+///
+/// In the future, additional interfaces may be added to `wasi:key-value` with stronger guarantees,


Suggested change

/// In the future, additional interfaces may be added to `wasi:key-value` with stronger guarantees,

/// In the future, additional interfaces may be added to `wasi:keyvalue` with stronger guarantees,

lukewagner · 2025-06-20T16:53:33Z

wit/store.wit

+/// `strict-serializable-store` or just the former, in which case the host would immediately reject
+/// a component which imports the unsupported interface.
+///
+/// Here are a few examples of behavior which an component developer might wish to rely on but which


Suggested change

/// Here are a few examples of behavior which an component developer might wish to rely on but which

/// Here are a few examples of behavior which a component developer might wish to rely on but which

lukewagner · 2025-06-20T16:56:14Z

wit/store.wit

+/// - Convergence: eventual consistency does _NOT_ guarantee that any two replicas will agree on the
+/// value for a given key -- even after all writes have had time to propagate to all replicas.


I might be missing your point here, but I thought that Eventual Consistency did mean that eventually all replicas will converge on the same value... you just don't know how long it'll take.

I thought so too, but see @Mossaka's comment above:

Let's say, replica A believes k=v1 arrives first due to clock skew, and replica B believes k=v2 arrives first. Then because of the Last-Write-Win conflict resolution mechanism, these two replicas will permanently have conflicting values for the same key.

Yeah, I can see how that could arise in a fully-weak consistency model; but in that case we should not say "Eventual Consistency" above. That being said, are we aware of any particular kv-store implementations we'd like to allow that aren't even Eventual Consistency? I had thought EC was sortof the "lower bound" for traditional KV Stores. If we start talking about "caches", then I can see this happening, but I guess that's a question: even if we're not making durability guarantees, do we want implementations that actively evict keys (as opposed to only losing them on crashes)?

@Mossaka seems to be saying that eventual consistency is that weak:

It is important to recognize that while the keyvalue system aims for replicas to have the same view of the data, realistically, a consistent state in eventual consistency does not guarantee that all replicas will converge to exactly the same value for every key.

If someone can point me to an authoritative definition of what "eventual consistency" means and what it does and does not include, I'm happy to use that as a reference and update this document to be consistent with it. So far, it seems that everyone has their own, incompatible idea of what it means.

Maybe there isn't a precise, widely-accepted definition? In that case, I can note that in the docs here, e.g. "Although 'eventual consistency' has no precise, widely-accepted definition, here we define it to mean..." Or just not use the term at all?

I'd be happy to read any alternative definitions, but the wikipedia article does clearly describe convergence over time.

Yeah, that's what I thought it meant and what I want it to mean. @Mossaka can you explain where your assertion that "a consistent state in eventual consistency does not guarantee that all replicas will converge to exactly the same value for every key" came from? It seems to contradict what the Wikipedia article is claiming.

lukewagner · 2025-06-20T20:27:09Z

wit/store.wit

+///
+/// ## Durability
+///
+/// This interface does not currently make any hard guarantees about the durability of values


@Mossaka Based on the previous discussion above, I think there's performance reasons not to require "read your writes" (even when reads follow writes on the same bucket handle). In particular, if the implementation of write sends the written values out over the network to a primary/writer node, and the implementation of read sends a request over the network to a read replica (distinct from the primary writer node), then you won't have "read your writes" without maintaining extra cached copies or making extra network requests. Thus, I think even when there is not an irrecoverable error, we shouldn't say that "read your writes" holds.

Signed-off-by: Joel Dice <[email protected]>

dicej mentioned this pull request Dec 16, 2024

Key-Value Swallows Write Errors When Backing Impl Fails spinframework/spin#2952

Closed

Mossaka reviewed Dec 16, 2024

View reviewed changes

rename consistency heading

aaeec54

rylev reviewed Dec 17, 2024

View reviewed changes

Mossaka approved these changes Jan 8, 2025

View reviewed changes

rylev reviewed Jan 17, 2025

View reviewed changes

expand and clarify consistency wording based on review feedback

b4f3fff

Signed-off-by: Joel Dice <[email protected]>

dicej requested a review from Mossaka January 17, 2025 17:58

update consistency docs to use RFC-2119-style terminology

5b3c65b

Signed-off-by: Joel Dice <[email protected]>

update markdown files

94183c2

Signed-off-by: Joel Dice <[email protected]>

Mossaka approved these changes Jan 21, 2025

View reviewed changes

dicej mentioned this pull request Jan 27, 2025

remove CachingStoreManager from factor-key-value dicej/spin#2

Closed

dicej mentioned this pull request Jan 27, 2025

remove CachingStoreManager from factor-key-value spinframework/spin#2995

Merged

Mossaka approved these changes Feb 25, 2025

View reviewed changes

remove read-your-writes guarantee

cf03f8a

This removes the "read-your-writes" guarantee since several of the backing stores we wish to support either do not support it or do not support it by default. Signed-off-by: Joel Dice <[email protected]>

dicej force-pushed the consistency branch from 2cc7ed0 to cf03f8a Compare March 3, 2025 18:51

Mossaka reviewed Mar 7, 2025

View reviewed changes

dicej force-pushed the consistency branch from 8fda979 to 418d48f Compare June 18, 2025 21:00

lukewagner reviewed Jun 20, 2025

View reviewed changes

address review feedback

a9f34a4

Signed-off-by: Joel Dice <[email protected]>

		/// // ...whereas this is NOT guaranteed to succeed immediately (but should eventually):
		/// // assert bucketB.get("bar").equals("a")

	/// Future versions of the `wasi-keyvalue` package may provide ways to query and control the
	/// Future versions of `wasi:keyvalue` may provide ways to query and control the

	/// In the future, additional interfaces may be added to `wasi:key-value` with stronger guarantees,
	/// In the future, additional interfaces may be added to `wasi:keyvalue` with stronger guarantees,

	/// Here are a few examples of behavior which an component developer might wish to rely on but which
	/// Here are a few examples of behavior which a component developer might wish to rely on but which

		/// - Convergence: eventual consistency does _NOT_ guarantee that any two replicas will agree on the
		/// value for a given key -- even after all writes have had time to propagate to all replicas.

Expand and clarify consitency/durability docs in store.wit #56

Are you sure you want to change the base?

Expand and clarify consitency/durability docs in store.wit #56

Uh oh!

Conversation

dicej commented Dec 16, 2024

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Handle defunct after errors

a Best-effort guarantee tied to success conditions

Uh oh!

lukewagner Jun 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Mossaka left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rylev commented Jan 20, 2025

Uh oh!

dicej commented Jan 21, 2025

Uh oh!

lukewagner commented Jan 21, 2025

Uh oh!

Mossaka commented Feb 25, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Mossaka commented Feb 25, 2025

Uh oh!

dicej commented Feb 25, 2025

Uh oh!

dicej commented Mar 3, 2025

Uh oh!

Mossaka Mar 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dicej Mar 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

lukewagner Jun 20, 2025 •

edited

Loading

Mossaka Mar 7, 2025 •

edited

Loading

dicej Mar 7, 2025 •

edited

Loading