validator: handle transient errors #5803
Merged
A 'transient error' is one that is expected to go away after a while, e.g. an fopen failure caused by a broken NFS mount. In general, the BOINC back-end code (validation, assimilation) handles transient errors sensibly:
if there's a bad NFS mount, it retries validation for a few hours rather than marking thousands of jobs as failed.
Extend this to script-based validation.
If a script (either init_result or compare_results) exits with 3, treat that as a transient error.
Treat other nonzero exits (or lack of an exit code) as a permanent error.
More generally (for all validators), add a return value VAL_RESULT_TRANSIENT_ERROR for init_result() and compare_results(). It signals a transient error of any kind.
Previously we checked only for ERR_OPENDIR.
And for compare_results() we treated all nonzero returns as permanent.
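As a rough sketch of the new behavior, a caller could map return values to actions like this. The numeric value given to VAL_RESULT_TRANSIENT_ERROR here is a placeholder for illustration (the real constant is defined in the BOINC source), and `ValidatorAction` and `action_for_retval` are hypothetical names, not part of this PR.

```cpp
// Placeholder value for illustration only; see the BOINC source for the real one.
constexpr int VAL_RESULT_TRANSIENT_ERROR = 3;

// Hypothetical actions for this sketch.
enum class ValidatorAction { Proceed, RetryLater, TreatAsPermanent };

// Map a return value from init_result() or compare_results() to an action.
ValidatorAction action_for_retval(int retval) {
    if (retval == 0) {
        return ValidatorAction::Proceed;
    }
    if (retval == VAL_RESULT_TRANSIENT_ERROR) {
        // Leave the job alone so validation is retried later.
        return ValidatorAction::RetryLater;
    }
    // Every other nonzero return is a permanent error.
    return ValidatorAction::TreatAsPermanent;
}
```

The point of a dedicated return value is that a validator can report any transient condition it recognizes, rather than relying on the caller to special-case individual error codes.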
Fixes #5799