-
Notifications
You must be signed in to change notification settings - Fork 3
Design and Implementation
The design is a hybrid of a relational database management system and a version control system such as Git. A database is represented as an immutable revision from which new revisions may be derived with data added, subtracted, or replaced. These revisions may be compared with each other and/or combined using a three-way merge algorithm.
The version control model is a natural fit for implementing subscriptions. Revori maintains a mirror of the subscriber's current state, listens for updates to the "head" revision of the database, and calculates the diff between the subscriber's state and the "head", sending the result to the subscriber and updating the mirror. If the subscriber's network connection is slow, the diffs are calculated less frequently, but the subscriber is always sent the latest data available.
Besides diffs, Revori can also calculate three-way merges of database revisions. This allows members of a cluster to temporarily diverge for maximum availability and later converge asynchronously as data propagates across the network. In Riak parlance, R=1, W=1, and DW=0, but we intend to make this tunable in the future so that applications may request stricter consistency guarantees without needing to implement them themselves.
One challenge in designing a multimaster asynchronous distributed system is to preserve causal ordering in a scalable way. For example, if cluster node A applies updates 1, 2, and 3 in that order, and there are multiple paths by which updates might propagate to cluster node B, it is possible that B will receive a revision containing only 1 and 2 after it has already received a newer revision containing all three updates via a different path. In this case, it is important that the older revision is discarded and not allowed to overwrite or conflict with the newer revision. In other words, the causal relationship between those revisions must be preserved as they flow through the cluster.
Currently, Revori guarantees causal ordering by requiring that each node keep track of the revision history of all other nodes in the cluster. This scheme encodes the same information as the vector clock scheme used by e.g. Riak, but is structured differently to make merging easier. Older revisions are discarded once they have been acknowledged by all nodes, so this history cannot grow without bounds. However, it does imply a limit to how large a cluster can grow and how much latency can be tolerated.
If cluster nodes were connected such that there was only one path between any two nodes, we wouldn't need to address this problem at all, and thus each node would need only to keep track of the revision history for nodes it is directly connected to, allowing the cluster to scale arbitrarily. However, this would hurt reliability by eliminating the potential for data to "route around damage" when a given path becomes congested or unreliable, and a disconnected node would be unable to reconnect to the cluster without discarding its state and rejoining anew.
Gérald Oster describes an alternative strategy in his paper "Data Consistency for P2P Collaborative Editing" which looks like a promising way to solve the problem without vector clocks or per-node revision history. Fortunately, new replication schemes can be added to Revori quite easily, and applications need not be modified to accomodate them.