Ops incorrectly deleted during MongoDB node-out #164

Description

@alecgibson

We recently had a MongoDB node fail. During the failure, it looks like a ShareDB op was deleted even though its snapshot had been committed.

I can't confirm for sure, but my suspicion is that this chain of events (or something very similar) happened:

  1. writeOp() succeeds
  2. sharedb-mongo attempts to writeSnapshot()
  3. MongoDB commits the snapshot to disk
  4. MongoDB falls over before sending the ack to the client
  5. The client disconnects because of the node outage and presumably assumes the write has failed(?)
  6. sharedb-mongo attempts to "tidy up" the op from the supposedly failed commit, even though the commit succeeded
  7. The result is a committed snapshot with a missing op
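
The chain above can be reproduced with a small in-memory model. This is only a sketch of the suspected sequence, not sharedb-mongo's actual code: `FakeDb`, `dropNextAck`, `opLink` and `commit` are hypothetical names, and the lost ack is simulated by throwing after the snapshot has already been stored.

```typescript
// Hypothetical in-memory model; none of these names come from sharedb-mongo.
type Op = { id: string; v: number };
type DocSnapshot = { v: number; opLink: string | null };

class FakeDb {
  ops: Op[] = [];
  snapshot: DocSnapshot = { v: 0, opLink: null };
  // When true, the next snapshot write is persisted but its ack is lost.
  dropNextAck = false;

  writeOp(op: Op): void {
    this.ops.push(op);
  }

  writeSnapshot(next: DocSnapshot): void {
    this.snapshot = next; // step 3: the write is persisted
    if (this.dropNextAck) {
      this.dropNextAck = false;
      throw new Error("connection closed"); // steps 4-5: no ack reaches the client
    }
  }

  deleteOp(id: string): void {
    this.ops = this.ops.filter((op) => op.id !== id);
  }

  hasOp(id: string): boolean {
    return this.ops.some((op) => op.id === id);
  }
}

// Commit flow that treats any snapshot error as "the write failed".
function commit(db: FakeDb, op: Op): boolean {
  db.writeOp(op); // step 1
  try {
    db.writeSnapshot({ v: op.v + 1, opLink: op.id }); // step 2
    return true;
  } catch {
    db.deleteOp(op.id); // step 6: tidies up an op that is actually canonical
    return false;
  }
}

const db = new FakeDb();
db.dropNextAck = true;
const committed = commit(db, { id: "op-0", v: 0 });
const link = db.snapshot.opLink;
const opMissing = link !== null && !db.hasOp(link); // step 7
console.log({ committed, snapshotVersion: db.snapshot.v, opMissing });
```

In this model the commit reports failure, yet the snapshot is at version 1 and the op it links to is gone, which matches step 7.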

I'm not entirely sure what to recommend. At first, I thought we should just delete the code that tidies up these ops, but I worry that doing so would bloat the op collection during periods of high concurrency on a document.

We could move to transactions, although I worry about the performance implications (and I can't see much online, apart from guidance to use them sparingly, which wouldn't be the case here...).
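
To make the transaction idea concrete, here is a toy in-memory sketch. It says nothing about MongoDB's real transaction performance or API: `TxStore`, `withTransaction` and `commitAtomically` are hypothetical names, and the "transaction" is just a draft copy that is installed only if the callback returns normally. The point is that the op and the snapshot become visible together or not at all, so there is no separate tidy-up delete to get wrong.

```typescript
// Hypothetical in-memory model of an all-or-nothing commit.
type TxOp = { id: string; v: number };
type TxState = { ops: TxOp[]; snapshot: { v: number; opLink: string | null } };

class TxStore {
  state: TxState = { ops: [], snapshot: { v: 0, opLink: null } };

  // Runs fn against a draft copy; the draft replaces the state only on success.
  withTransaction(fn: (draft: TxState) => void): boolean {
    const draft: TxState = {
      ops: this.state.ops.slice(),
      snapshot: { v: this.state.snapshot.v, opLink: this.state.snapshot.opLink },
    };
    try {
      fn(draft);
    } catch {
      return false; // abort: nothing written to the draft becomes visible
    }
    this.state = draft;
    return true;
  }
}

function commitAtomically(store: TxStore, op: TxOp): boolean {
  return store.withTransaction((draft) => {
    // Reject ops that target a stale snapshot version.
    if (draft.snapshot.v !== op.v) throw new Error("version conflict");
    draft.ops.push(op);
    draft.snapshot = { v: op.v + 1, opLink: op.id };
  });
}

const store = new TxStore();
const firstCommit = commitAtomically(store, { id: "tx-op-0", v: 0 });
// A second writer submits against the same (now stale) version and is rejected.
const staleCommit = commitAtomically(store, { id: "tx-op-0b", v: 0 });
console.log({ firstCommit, staleCommit, ops: store.state.ops.length });
```

The rejected commit leaves no stray op behind, so the op collection does not bloat under contention either. Whether the real cost of transactions is acceptable is still the open question.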

We could add an extra DB call before the deletion, which double-checks that the op is non-canonical before deleting it. We would have to check both o_collection and collection to see whether another op in the chain references this op, or whether the current snapshot references it. This requires two extra fetches, which isn't super nice, but I guess it would only happen in the tidy-up case, and it avoids the general use of transactions.
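
A minimal sketch of that double-check, again against in-memory stand-ins rather than real collections. The field names `opLink` and `prevOpId` and the helpers `isCanonical` and `tidyUp` are assumptions made for illustration. In a real implementation each check would be a fetch against collection and o_collection respectively.

```typescript
// Hypothetical shapes for a stored op and the current snapshot.
type StoredOp = { id: string; v: number; prevOpId: string | null };
type StoredSnapshot = { v: number; opLink: string | null };

// An op counts as canonical if the current snapshot links to it, or if a
// later op in the chain names it as its predecessor.
function isCanonical(opId: string, snapshot: StoredSnapshot, ops: StoredOp[]): boolean {
  if (snapshot.opLink === opId) return true; // extra fetch 1: the snapshot
  return ops.some((op) => op.prevOpId === opId); // extra fetch 2: the op chain
}

// Returns the remaining ops; the op is removed only when nothing references it.
function tidyUp(opId: string, snapshot: StoredSnapshot, ops: StoredOp[]): StoredOp[] {
  if (isCanonical(opId, snapshot, ops)) return ops;
  return ops.filter((op) => op.id !== opId);
}

const chain: StoredOp[] = [
  { id: "op-0", v: 0, prevOpId: null },
  { id: "op-1", v: 1, prevOpId: "op-0" },
];
const current: StoredSnapshot = { v: 2, opLink: "op-1" };

// Lost ack: the snapshot already links to op-1, so the op must be kept.
const afterLostAck = tidyUp("op-1", current, chain);

// Genuine conflict: op-1b lost the race at v1 and nothing references it.
const contested = chain.concat([{ id: "op-1b", v: 1, prevOpId: "op-0" }]);
const afterConflict = tidyUp("op-1b", current, contested);

console.log(afterLostAck.map((op) => op.id), afterConflict.map((op) => op.id));
```

Both calls leave op-0 and op-1 in place: the canonical op survives the lost-ack case, and only the unreferenced op-1b is removed in the conflict case.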

Other...?
