-
Notifications
You must be signed in to change notification settings - Fork 161
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: Clarification of storage methods and persistence of log #702
Comments
👋 Thanks for opening this issue! Get help or engage by:
|
Yes :D
There must not be a hole. If there is, it is a severe bug.
Yes when
Yes it blocks the
In short, membership should be part of the state machine and part of the snapshot, just like a normal log entry.
No. In the case where the state-machine is not persistent but the snapshot is persistent, On the other hand, if the application has a persistent state machine where the result of applying a log entry is stored on disk,
Right.
The returned
Ok. Let me re-define the install-snapshot API with a stream: And I'm gonna answer your other question in the next comment:) |
Thanks for the answers. Maybe a bit of background: Our state machine is a "traditional" database engine with a snapshot/checkpoint/savepoint/whatever you want to name it and a write-ahead log. That means, we can replay the log up to the last log entry persisted to recover all the data in the state machine. However, we can't really do it like this at recovery, since that would also recover and apply the log which is not yet committed. Therefore, we "trick" a bit here. During the recovery, we determine the log end (i.e., the last log ID which was safely written to the log, but not necessarily committed yet by the quorum) and the last membership information which was actually committed (typically part of the state machine checkpoint, but it can be also somewhere later in the log). I.e., we can recover the state machine to any state between the membership log index (which is definitely committed and typically equal to the checkpoint log index) and the log end (which is potentially not committed). During the election, if we get elected as a leader, we can ship previously not committed log to other replicas to bring them up-to-date and roll-forward the state machine to the common committed index across quorum of replicas (which is also determined during election and can't be lower than our checkpoint log index). Further log entries are applied as the committed index is moved forward by regular replication. If our replica is a follower, then the leader determines the common portion of the log, which must lie definitely behind the Note that we do not want to persist the committed log ID, since that can be re-determined during election (i.e., it is actually volatile, see Figure 2 in the Raft paper https://raft.github.io/raft.pdf). Some additional questions: Regarding
OK, perfect. BTW, I thought you removed the Regarding If we store the committed membership separately of the database snapshot/checkpoint (this implies that the log is persisted at least until the committed membership position and until the snapshot), then we can roll forward until (committed) But I still have a question regarding The Thanks. |
Yes it is possible. I think the optimal way is to split
In fact, the leader does not have to flush a log to its local store to commit it. A log that is replicated and flushed to a quorum(including the leader or not) can be committed by the leader. The quorum does not have to include the leader.
Yes this is mandatory.
Hmm... All right.
Correct. A log that is replicated to a quorum can not be lost under any circumstance.
I like the second approach too.
You are right the state machine can just be pure in-memory. Thank you for you thoughtful advice:D |
Interesting! My question is: If a server applies non-committed logs to the state machine during recovery, the state machine has to be able to revert to a previous state, once the non-committed but applied logs are truncated by the leader. Does your state machine support reverting to a previous state? And I do not know how your application determines a common committed index. For example, If node-1 has log (1,1),(3,1) node-2 has log (1,1), node-3 has log (2,1);
I did not remove the blank log. I just stopped using a blank log as a heartbeat message. :(
I agree.
You are right. |
Thanks for the answers :-).
Perfect. We just have to think about how to bring it in w/o causing unnecessary compatibility issues. Likely by creating a second trait method, something on the lines of: async fn append_to_log_interleaved(&mut self, entries: &[&Entry<C>], callback: &LogIoComplete) -> Result<(), StorageError<C::NodeId>> {
let res = self.append_to_log(entries);
callback.log_io_completed(entries.last().map(|e| e.log_id), Ok(()));
res
} where the callback can be cheaply cloned and/or it could be provided once by some other mechanism to prevent unnecessary clones. In the method it would just send an info back to RaftCore.
That's correct, but to track that, My idea was somewhat simpler. Let (In fact, we don't block on I/O in our implementation, except for a very few places. We use data flow graphs with flush and barrier disk I/O and/or network I/O dependencies instead to ensure correctness, so we can start an operation at any time and the end result would be as if we would wait for respective I/Os.) Hm, on the other hand, this still may be insufficient - if the commit index is advanced too far and communicated to followers before log is actually persisted, we'll have an issue in certain cases. So probably there is no way around to make it "right" by actually communicating back from the log writer to the Raft core when the persistence has been reached (as described above).
No. The state machine can roll-forward at most to the index which is definitely known to be committed at the time of restart. This is either the snapshot index (i.e., no roll-forward) or the index of the explicitly-stored membership change (or other known commit index persistently stored elsewhere, should we have it). This would be also returned as of now from The definitive answer up to where we can roll forward is only known after the election and re-replication of logs, as necessary. In fact, we simply rely on Potentially, the roll-forward before election can be simply skipped by returning always just the snapshot index and the membership stored in the snapshot. I'm just a bit worried what will happen if the membership changes completely in the meantime. Say,
How does node 1 rejoin the consensus and learn about the new topology? If this works as expected, then we can completely drop roll-forward in case of membership change after snapshot simply to rely on
Correct. This still holds for our implementation - every node knows its log entries and during election the match index is computed, which causes re-replicating some log, potentially, using standard Raft/implementation in
Are you planning to remove it? |
Ah, I was digging in
There is Since this is quite inefficient (the log will be scanned twice on restart - once for membership, which is unlikely to be found, as it is exceptional operation, and once to roll-forward the state machine), I'd suggest to provide additional default-implemented method for this on BTW, the function |
Right. In your case, it is also possible node-1 never sees the newer membership at log-10. In such a case, it has to wait for a new leader, which could be node-4 or node-5, to contact it.
Yes. I trade efficiency for simplicity.
Yes. in openraft a membership config takes effect immediately when it is seen, regardless of whether it is committed or not. This is unlike the raft implementation in etcd, which applies committed membership config. Both of these two solutions have its own issues to address: Openraft must handle membership config fallback on a follower, while etcd raft must handle non-committed membership config during election. |
Something like
Perfect!
Ok I see!
It has to wait for node-4, the leader, to contact it. Node-4 will then replicate logs to node-1, the last membership config log(10) then will be seen by node-1, then node-1 update its effective membership config. See:
Yes.
Hmm... Ok then.
No. A leader must append a blank log as the last step in an election, to assert that its leadership is valid. |
- Answer questions mentioned in databendlabs#702
- Answer questions mentioned in databendlabs#702
That's understandable :-)
What if there is only one committed/applied membership (the regular case)? Then the two are equal. Maybe
Well, something like it, yes. So you'd call it immediately after I'd pass the callback by reference, though, to prevent unnecessary refcounting on the channel. (Also, I'd document and guarantee that the callback passed is always the same one, so one could store an instance in the The callback would be then something like this: trait LogIoComplete: Clone + Send + Sync {
fn log_io_completed(&self, index: u64, result: Result<(), std::io::Error>);
} with an appropriate error type (likely,
OK, I'll drop that, then. Makes things simpler :-) (and with the above optimized membership query it's anyway equivalently fast).
Thanks for the explanation.
Not necessarily. I just have this tick of optimizing stuff, since our application (database) is extremely performance-sensitive. Having an enum with two instead of three different possible values makes Rust optimize out the discriminator, if possible, making it 8B smaller. At 1M requests/second, that's 8MB/second not written :-). That was also the reason for #705 - it saves even more for the regular case. |
Hi @drmingdrmer,
I'm now finally implementing binding of
openraft
to our persistency. The documentation ofStorage
trait is a bit suboptimal, therefore I'd like to clarify it (and it should be then updated).Our persistency is composed of log, which is asynchronously persisted (but a synchronous flush can be done) and a state machine, which is updated in-memory and occasionally snapshotted (depending on various rules), so we can shorten the log. Note that the snapshot is an incremental one (shadow paging), so it's fairly cheap, it's not a file.
Voting
I think
save_vote()
/read_vote()
are pretty much clear - simply save the vote with immediate persistency to the disk and read it from the persistency upon restart as-is. I.e., the persistency of the vote is independent of the state machine or log. Right?Log
get_log_state()
documents that it should return the last log ID, independent of the state machine. I.e., the log is parsed until the end and the last log ID is returned. Fine.delete_conflict_logs_since()
/purge_logs_upto()
are pretty much straightforward. Fine.append_to_log()
is a bit unclear. I take that entries will always be presented in order means that the log index is monotonically increasing and will not be sent into the log twice. Right?There might be potentially holes, but IIRC there are none.
What is unclear to me is when the call needs to return. Is it strictly necessary to persist these log entries durably until the call returns? That would cause blocking of other operations which can progress in the meantime (handling responses, applying to state machine, ...). We need to clarify this (and potentially improve it).
State machine
apply_to_state_machine()
only tells the state machine that a log entry is agreed upon, so the state machine can produce a client reply. This is clear, except for when a membership change is to be applied. I assume, the membership needs to be then applied synchronously to a separate persistent location.last_applied_state()
returns both the log ID of the last snapshot of the state machine and of the membership. Here again, I assume that the persistency of the membership (when applied to the state machine) is to be done separately from snapshot. I.e., if I have the sequence of operations:then I suppose the restart from this would return LogId(2) for the snapshot and LogId(4) + membership stored there for membership. Right?
If the membership change is NOT committed yet (i.e., it was not applied to the state machine), just written to the log, then it should return the membership from LogId(1), right?
I.e., the membership returned is the last one which was actually applied via
apply_to_state_machine()
. Right?Snapshots
So far it seems like snapshot handling is clear enough, though very heavily geared towards having the snapshot in a file. We'll have hard time sending it via an arbitrary seekable interface. I think the interface should be made more strict (though easily mappable to the current trait requirements), but I have no exact proposal yet.
More on log persistence
I was wondering, whether it is possible to at least optimize log persistence to overlap it with other operations/network transfer.
On the leader:
openraft
per se.On the follower:
openraft
.In the worst case we can lose part of the committed log on the crashed leader, so when it restarts, it will have shorter log than expected. However, this will not cause any correctness issues, since any committed entry will be already persisted on one of the followers and one of the followers with the "longest" log will be elected as the new leader, later re-replicating the missing log to the former leader (in the worst case it will need to send the snapshot, if the log is gone, but that's anyway the case for lagging replicas).
Or am I missing something?
A "cleaner" alternative would be to overlap log writing with other operations on the state machine. This could be done:
Storage
interface intoStateMachine
andLogWriter
(since only one mutable reference is possible, thus we need independent objects) and feeding theLogWriter
in a separate task.append_to_log()
only initiate the I/O operation and report back the persisted log index by a callback via the central command queue.The former would keep the async interface, but prevent pipelining of I/O requests. The latter allows easy pipelining, but the implementation is a bit more complex (OTOH, one can build it as a second trait method
start_append_to_log()
, which calls the originalappend_to_log()
and immediately schedules the follow-up operation on the passed back-channel object).I personally prefer the latter, since it is probably cleaner, doesn't require extra task and extra queues, and it will not require any changes to existing implementations on top of
openraft
, unless one wants to optimize the I/O.BTW, something similar could be done for applying to the state machine (i.e., also just start applying to the state machine, with back-channel for completions), but that would likely make state machine updates more complex. Considering that the state machine update is basically an in-memory operation typically not needing yielding the future (at least in our case), I wouldn't complicate it.
Thoughts?
Thanks & regards,
Ivan
The text was updated successfully, but these errors were encountered: