In considering alternatives for #190, the following scenario occurred to me which seems like the current GC scheme may also be vulnerable to, so I wanted to open as a separate Issue to improve its visibility.
The problem arises because, when synchronizing between hosts, we cannot assume any particular ordering of changes (which I think is generally true).
Assume hosts A and B with synchronized stores.
GC is initiated on A, but interrupted.
Both hosts are able to fully synchronize, so both hosts now reflect (the same) in-progress GC.
GC is resumed on A, and completes so A now reflects only a single generation (B remains in the state from the prior step). Let us assume in this particular case there was nothing to GC, so all chunks ended up migrated to the new generation on A.
Synchronization between A and B occurs, but does not fully complete. Specifically, let us assume all the name data has synchronized (moved to the newest generation on both A and B), but not all chunk data has synchronized (some chunks still live under older generations on B).
At this point, if synchronization is not completed on B, but a gc is issued on B (assume this occurs after the GC grace time period):
B will see two generations locally:
[0]: The original generation A and B knew initially. This generation has chunks, but no names.
[1]: The new generation from the GC initiated and completed on A, which is only partially synced to B. In our assumed case, it contains all the names, but only some of the chunks.
B will see this as a GC in-progress, examine its oldest local generation ([0]), see that it contains no names, and wipe it (again, assume we are outside the GC grace time).
This leaves B's local store with names that have missing chunks. The behavior on the next synchronization would depend on the synchronization mechanism, but even in the happy case where the missing B data were restored from A, B would have a damaged store for some time.
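The failure can be sketched with a toy store model. Everything here is illustrative (generations as dicts of name→chunk references plus chunk sets), not the project's actual data structures:

```python
def missing_chunks(store):
    """Chunks referenced by some name but present in no generation."""
    referenced = set()
    present = set()
    for gen in store.values():
        present |= gen["chunks"]
        for chunks in gen["names"].values():
            referenced |= chunks
    return referenced - present

def naive_wipe_oldest(store):
    """The wipe described above: drop the oldest generation if it holds
    no names (assume the GC grace period has already elapsed)."""
    oldest = min(store)
    if not store[oldest]["names"]:
        del store[oldest]

# B's state after the partial sync: the name moved to generation 1,
# but chunk c1 still lives only in generation 0.
store_b = {
    0: {"names": {}, "chunks": {"c1"}},
    1: {"names": {"n1": {"c1"}}, "chunks": set()},
}
naive_wipe_oldest(store_b)
print(missing_chunks(store_b))  # {'c1'}: n1 now references a missing chunk
```

Generation 0 looks garbage (no names), so it gets wiped, taking c1 with it even though n1 in generation 1 still needs it.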
This is a bit contrived and involves a specific sequence of interrupted actions and invoking the gc on multiple hosts at specific times. And perhaps the GC grace time is considered sufficient mitigation ("we will surely fully sync within this window"); but I did want to raise this case as possible, at least under my understanding.
If this scenario is plausible, I believe just prior to wiping a generation (while locked), you would need to double-check and visit all names in younger generations to promote any chunks needed.
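Using the same toy model (again, illustrative structures only, not the real implementation), the suggested double-check might look like:

```python
def missing_chunks(store):
    """Chunks referenced by some name but present in no generation."""
    referenced = set()
    present = set()
    for gen in store.values():
        present |= gen["chunks"]
        for chunks in gen["names"].values():
            referenced |= chunks
    return referenced - present

def safe_wipe_oldest(store):
    """Wipe the oldest generation only after promoting any of its chunks
    still referenced by names in younger generations."""
    oldest = min(store)
    gen = store[oldest]
    if gen["names"]:
        return  # still holds names; not a wipe candidate
    referenced = set()
    for g, data in store.items():
        if g != oldest:
            for chunks in data["names"].values():
                referenced |= chunks
    next_gen = min(g for g in store if g != oldest)
    # Promote still-needed chunks rather than losing them with the wipe.
    store[next_gen]["chunks"] |= gen["chunks"] & referenced
    del store[oldest]

# B's partially synced state from the scenario above.
store_b = {
    0: {"names": {}, "chunks": {"c1"}},
    1: {"names": {"n1": {"c1"}}, "chunks": set()},
}
safe_wipe_oldest(store_b)
print(missing_chunks(store_b))  # set(): c1 was promoted, no dangling names
```

This would need to happen under the lock, since a name could otherwise arrive between the check and the wipe.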
I think your analysis of this scenario is correct.
The assumption is that the window before anything is deleted is large enough to ensure a full sync. When that assumption is broken, data might be lost in many scenarios.
E.g.
A & B start synced
B deletes some names, does a GC
time passes; no sync happens within the window
B deletes the old chunks
A adds new names to the old generation (the only one it knows of), assumes the existing chunks are there, and writes only the new ones
sync happens; the deletes from B propagate to A; the last name written on A now has only its newly written chunks, while the chunks that used to exist are gone
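That sequence can be shown with a flat-store sketch (no generations needed to see the effect; the name and chunk ids are illustrative):

```python
# A and B start synced: one name referencing one chunk.
a = {"names": {"n1": {"c1"}}, "chunks": {"c1"}}
b = {"names": {"n1": {"c1"}}, "chunks": {"c1"}}

# B deletes n1, runs GC, and eventually deletes the now-unreferenced chunk.
del b["names"]["n1"]
b["chunks"].discard("c1")

# Meanwhile A adds a new name that reuses c1; since c1 already exists
# locally, A writes only the genuinely new chunk c2.
a["names"]["n2"] = {"c1", "c2"}
a["chunks"].add("c2")

# Sync: B's deletions of n1 and c1 propagate to A.
a["names"].pop("n1", None)
a["chunks"].discard("c1")

referenced = set().union(*a["names"].values())
print(referenced - a["chunks"])  # {'c1'}: n2 now references a missing chunk
```

The new name n2 survives the sync but has lost one of the chunks it depends on.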