Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test I/O failures #349

Open
icristescu opened this issue Jul 19, 2021 · 2 comments
Open

Test I/O failures #349

icristescu opened this issue Jul 19, 2021 · 2 comments

Comments

@icristescu
Copy link
Contributor

Using for instance a fault injection library https://github.com/CraigFe/ocaml-fiu.

@tomjridge
Copy link
Contributor

Out of interest, what are we intending the code to do here?

I guess there are two main faults: wrong data returned; and some unknown read/write failure.

For "wrong data returned" I guess all bets are off (so, use checksums or similar).

For the "unknown failure", I guess you should probably stop, or possibly ignore and continue (probably not so good)?

@craigfe
Copy link
Member

craigfe commented Aug 6, 2021

For "wrong data returned"-type failures, we probably already do about as much as we can: values are stored with their hashes in the pack file, so any invalid data returned by Index (e.g. incorrect file offsets) will generally translate to "unexpected hash" failures in Irmin almost immediately. The more serious class of unsound behaviour is false-positives / false-negatives for Index.mem (and likewise for Index.find), since these can't be double-checked and cause the storage layer to stagger on for some time in a way that obfuscates the real issue (e.g. mirage/irmin#1476).

I think in the case of "unknown failures" we'd ideally want the code to stop, but we can test two properties of the stop:

  1. any failures reported to the user give some explanation of what actually went wrong (i.e. not just an assertion failure or Unix_error (_, _, _));
  2. the store data remains consistent in the presence of spontaneous crashes.

As it stands, (1) is certainly not the case. IIRC, starting a Tezos node with the wrong --data-dir param (e.g. an empty or non-existent directory) quite often leads to a low-level "Insufficiently many bytes"-type exception (e.g. from here) once the index fails to read proper headers from its data files. We've also had a few cases of things like Unix_error (ENOENT, _, _) being raised without actually showing the full context.

Relatedly, we should probably take a more critical look at the various asserts in the codebase and, if they're actually refutable via corruption / user-misuse, change them to say something more constructive or add result to handle it on the Tezos side for more consistent UX. (I'm thinking particularly of Index.v, which feels like it should be surfacing more error cases to the API consumer e.g. when the data files don't exist.)

With regard to (2), we've had bugs due to unexpected exceptions in an asynchronous worker thread not being properly propagated to the parent (including a particularly nasty one due to Out_of_memory). It might be nice to have tests of the form "An unexpected IO failure during is recoverable on restart, and {no data, only recent entries} are lost."

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants