Test I/O failures #349

icristescu · 2021-07-19T12:01:00Z

For instance, using a fault injection library such as https://github.com/CraigFe/ocaml-fiu.

tomjridge · 2021-08-06T12:42:02Z

Out of interest, what are we intending the code to do here?

I guess there are two main faults: wrong data returned; and some unknown read/write failure.

For "wrong data returned" I guess all bets are off (so, use checksums or similar).

For the "unknown failure", I guess you should probably stop, or possibly ignore and continue (probably not so good)?

craigfe · 2021-08-06T14:19:59Z

For "wrong data returned"-type failures, we probably already do about as much as we can: values are stored with their hashes in the pack file, so any invalid data returned by Index (e.g. incorrect file offsets) will generally translate to "unexpected hash" failures in Irmin almost immediately. The more serious class of unsound behaviour is false-positives / false-negatives for Index.mem (and likewise for Index.find), since these can't be double-checked and cause the storage layer to stagger on for some time in a way that obfuscates the real issue (e.g. mirage/irmin#1476).
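To illustrate why hash-prefixed values catch this class of fault quickly, here is a minimal sketch. The entry layout (a 16-byte MD5 digest followed by the payload) and the names `encode`, `decode` and `read_error` are assumptions made for the example; Irmin's actual pack format and hash function are different.

```ocaml
(* Hypothetical entry layout: 16-byte MD5 digest, then the payload.
   This is NOT Irmin's real pack format. *)
let digest_size = 16

type read_error =
  | Truncated
  | Unexpected_hash of { expected : string; actual : string }

(* Store the value together with its own hash. *)
let encode (value : string) : string = Digest.string value ^ value

(* Re-hash the payload on read; a wrong offset or corrupted bytes
   shows up as a hash mismatch rather than as silently wrong data. *)
let decode (entry : string) : (string, read_error) result =
  if String.length entry < digest_size then Error Truncated
  else
    let stored = String.sub entry 0 digest_size in
    let value =
      String.sub entry digest_size (String.length entry - digest_size)
    in
    let actual = Digest.string value in
    if String.equal stored actual then Ok value
    else
      Error
        (Unexpected_hash
           { expected = Digest.to_hex stored; actual = Digest.to_hex actual })
```

Note that a check like this only covers the value that was returned. It cannot detect a lookup that wrongly reports "absent", which is why the mem / find case above is the harder one.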

I think in the case of "unknown failures" we'd ideally want the code to stop, but we can test two properties of the stop:

1. any failures reported to the user give some explanation of what actually went wrong (i.e. not just an assertion failure or Unix_error (_, _, _));
2. the store data remains consistent in the presence of spontaneous crashes.

As it stands, (1) is certainly not the case. IIRC, starting a Tezos node with the wrong --data-dir param (e.g. an empty or non-existent directory) quite often leads to a low-level "Insufficiently many bytes"-type exception (e.g. from here) once the index fails to read proper headers from its data files. We've also had a few cases of things like Unix_error (ENOENT, _, _) being raised without actually showing the full context.

Relatedly, we should probably take a more critical look at the various asserts in the codebase and, if they're actually refutable via corruption / user-misuse, change them to say something more constructive or add result to handle it on the Tezos side for more consistent UX. (I'm thinking particularly of Index.v, which feels like it should be surfacing more error cases to the API consumer e.g. when the data files don't exist.)
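As a rough sketch of what "surfacing more error cases" could look like, the following opens a store and returns a result with enough context to explain the failure. Everything here is hypothetical: `open_error`, `read_header`, the file name "data" and the 8-byte header size are invented for the example and are not part of the Index API.

```ocaml
(* Hypothetical error type; Index does not currently expose this. *)
type open_error =
  | Missing_directory of string
  | Missing_file of string
  | Bad_header of { path : string; expected_bytes : int; found_bytes : int }

let header_size = 8 (* assumed size, for illustration only *)

let open_error_to_string = function
  | Missing_directory d -> Printf.sprintf "data directory %S does not exist" d
  | Missing_file p -> Printf.sprintf "data file %S does not exist" p
  | Bad_header { path; expected_bytes; found_bytes } ->
      Printf.sprintf "%S: header needs %d bytes, but the file has only %d"
        path expected_bytes found_bytes

(* Check each precondition explicitly instead of letting a low-level
   exception escape without context. *)
let read_header ~root : (string, open_error) result =
  let path = Filename.concat root "data" in
  if not (Sys.file_exists root && Sys.is_directory root) then
    Error (Missing_directory root)
  else if not (Sys.file_exists path) then Error (Missing_file path)
  else
    let ic = open_in_bin path in
    Fun.protect
      ~finally:(fun () -> close_in_noerr ic)
      (fun () ->
        let found_bytes = in_channel_length ic in
        if found_bytes < header_size then
          Error (Bad_header { path; expected_bytes = header_size; found_bytes })
        else Ok (really_input_string ic header_size))
```

A caller on the Tezos side could then match on the error and print `open_error_to_string e`, so that a wrong --data-dir produces a message naming the directory rather than a bare exception.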

With regard to (2), we've had bugs due to unexpected exceptions in an asynchronous worker thread not being properly propagated to the parent (including a particularly nasty one due to Out_of_memory). It might be nice to have tests of the form "An unexpected IO failure during is recoverable on restart, and {no data, only recent entries} are lost."
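A test of that shape might look roughly like the sketch below. It uses a toy in-memory append-only log with a hand-rolled fault hook rather than ocaml-fiu or the real Index IO layer; `disk`, `append`, `recover` and `Injected_io_failure` are all made up for the example.

```ocaml
exception Injected_io_failure

(* Toy "disk": an append-only buffer. [fail_after = Some n] makes the
   write after the next [n] successful byte writes raise. *)
type disk = { data : Buffer.t; mutable fail_after : int option }

let write_byte disk c =
  (match disk.fail_after with
   | Some 0 -> raise Injected_io_failure
   | Some n -> disk.fail_after <- Some (n - 1)
   | None -> ());
  Buffer.add_char disk.data c

(* Entry layout: one length byte, then the payload (so payloads must be
   shorter than 256 bytes in this sketch). *)
let append disk payload =
  write_byte disk (Char.chr (String.length payload));
  String.iter (write_byte disk) payload

(* Restart: keep every complete entry, truncate a trailing partial one,
   and clear the injected fault. *)
let recover disk =
  let s = Buffer.contents disk.data in
  let rec scan pos acc =
    if pos >= String.length s then (pos, List.rev acc)
    else
      let len = Char.code s.[pos] in
      if pos + 1 + len > String.length s then (pos, List.rev acc)
      else scan (pos + 1 + len) (String.sub s (pos + 1) len :: acc)
  in
  let valid_end, entries = scan 0 [] in
  Buffer.truncate disk.data valid_end;
  disk.fail_after <- None;
  entries

(* "An unexpected IO failure during append is recoverable on restart,
   and only the most recent entry is lost." *)
let () =
  let disk = { data = Buffer.create 64; fail_after = None } in
  append disk "first";
  disk.fail_after <- Some 3; (* fail part-way through the next entry *)
  (try append disk "second" with Injected_io_failure -> ());
  assert (recover disk = [ "first" ]);
  append disk "third";
  assert (recover disk = [ "first"; "third" ])
```

The same pattern could be repeated for every possible value of `fail_after` to check that no injection point leaves the log in a state that `recover` cannot repair.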

icristescu added the maintenance label Jul 20, 2021

tomjridge mentioned this issue Nov 5, 2021

Index v2 (ocaml-kc / kv-hash) #371

Open
