Skip to content
Engel A. Sanchez edited this page Jun 11, 2014 · 10 revisions

Overview

Strongly consistent operations in Riak are implemented using the riak_ensemble library (see its Git Repo and Wiki for more details). These page describes the integration points with Riak KV. The main ones are the riak_kv_ensembles module that manages ensembles in the cluster, and the riak_kv_ensemble_backend module that connects the ensemble peers to KV vnodes and the AAE system.

Managing Ensembles

See riak_kv_ensembles

An ensemble is created for each primary preference list. Unlike eventually consistent Riak, fallback vnodes never participate in consistent operations. For example, if there are buckets with a replication factor (n_val) of 2 and some with 3, two ensembles will be created for each partition of the ring. For partition zero, for example, we will have {0, 2} and {0, 3}.

The riak_kv_ensembles process polls the ring for changes at regular intervals (currently every 10 seconds). This happens only on the claimant node.

Consistent operations

Consistent operations are handled by riak_client, just like regular operations. If the bucket as the {consistent, true} property, it will be routed through the riak_ensemble_client. Notice that legacy buckets (that is, in the default bucket type) are not allowed to have this property. This is done through the bucket property validators in riak_kv_bucket.

Consistent objects use riak_object too, but the vector clock information is fake. It actually stores the epoch and sequence number used by riak_ensemble to version operations encoded as a 64 bit integer as the count for a a fake eseq actor id. Earlier versions would use overwrite semantics for the ensemble put operation if the input object did not have a vector clock and update if it did. Joe decided that it was too common for clients to send an object without a vector clock, potentially causing unsafe overwrites when not intended. Now, most writes are treated as updates, and will fail if the object has been modified since the client read it. Passing the if_none_matched option will cause the write to fail if a value already exists for that key. See riak_client:consistent_put_type/2.

Anti-entropy on consistent data

The purpose of peer syncing is to detect and repair corrupted/missing ensemble backend data.

For riak_kv_ensemble_backend, syncing is implemented using AAE. Anytime a vnode is restarted it is considered untrusted until the vnode performs AAE exchange with a majority of sibling vnodes.

To support AAE syncing, riak_kv_vnode was extended with a new ensemble_repair command and the riak_kv_exchange_fsm was extended to use this command to repair consistent keys. This change is needed because consistent data does not yet support read repair. Likewise, even when consistent data supports read repair, it will only do so when a given ensemble is stable. However, syncing is required to operate even when an ensemble is not stable (eg. peers may be required to sync before they can proceed to elect a leader).

Clone this wiki locally