
validator: handle transient errors #5803

Merged
merged 1 commit into from
Sep 10, 2024


davidpanderson
Contributor

A 'transient error' is one that will go away in a while, e.g. an fopen failure because of a broken NFS mount. In general, the BOINC back-end code (validation, assimilation) handles transient errors sensibly:
if there's a bad NFS mount, it retries validation for a few hours rather than marking thousands of jobs as failed.

Extend this to script-based validation.
If a script (either init_result or compare_results) exits with 3, treat that as a transient error.
Treat other nonzero exits (or lack of an exit code) as a permanent error.
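
As a rough illustration of the exit-code rule above, here is a minimal sketch assuming a POSIX system. The names `ScriptOutcome`, `classify_exit`, and `run_script` are hypothetical and are not taken from the BOINC sources; only the meaning of exit code 3 comes from this PR.

```cpp
#include <cstdlib>
#include <string>
#include <sys/wait.h>   // WIFEXITED, WEXITSTATUS (POSIX)

// Hypothetical outcome type for this sketch; not a BOINC name.
enum class ScriptOutcome { Ok, TransientError, PermanentError };

// Exit code a validator script uses to signal a transient error (per this PR).
constexpr int kTransientExitCode = 3;

// Map "did the script exit normally, and with what code" to an outcome.
ScriptOutcome classify_exit(bool exited_normally, int exit_code) {
    if (!exited_normally) return ScriptOutcome::PermanentError;  // no exit code, e.g. killed by a signal
    if (exit_code == 0) return ScriptOutcome::Ok;
    if (exit_code == kTransientExitCode) return ScriptOutcome::TransientError;
    return ScriptOutcome::PermanentError;                        // any other nonzero exit
}

// Run a command through the shell and classify how it ended.
ScriptOutcome run_script(const std::string& cmd) {
    int status = std::system(cmd.c_str());
    // Assumption of this sketch: failing to launch the shell at all is permanent.
    if (status == -1) return ScriptOutcome::PermanentError;
    bool exited = WIFEXITED(status);
    return classify_exit(exited, exited ? WEXITSTATUS(status) : 0);
}
```

A script that hits a broken mount would then `exit 3` so the validator can retry later, while an `exit 1` on malformed output would mark the result as failed.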

More generally (for all validators) add a return value VAL_RESULT_TRANSIENT_ERROR for init_result() and compare_results(). This means any transient error.
Previously we checked only for ERR_OPENDIR.
And for compare_results() we treated all nonzero returns as permanent.
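
To make the intended handling of return values concrete, here is a small sketch of how a caller might act on them. The constant `kTransientError` is a placeholder standing in for VAL_RESULT_TRANSIENT_ERROR (its real numeric value is not stated here), and `Action`, `action_for_retval`, and `should_defer_batch` are hypothetical names. Deferring a whole batch when any result reports a transient error is an assumption of the sketch, not something this PR specifies.

```cpp
#include <vector>

// Placeholder value; stands in for VAL_RESULT_TRANSIENT_ERROR in this sketch only.
constexpr int kTransientError = 3;

// Hypothetical actions a validator loop could take for one return value.
enum class Action { Proceed, RetryLater, MarkInvalid };

// Classify a return value from init_result() or compare_results().
Action action_for_retval(int retval) {
    if (retval == 0) return Action::Proceed;                  // success
    if (retval == kTransientError) return Action::RetryLater; // try again later
    return Action::MarkInvalid;                               // permanent error
}

// Assumed policy: if any result in the batch hit a transient error,
// defer the batch instead of marking results as failed.
bool should_defer_batch(const std::vector<int>& retvals) {
    for (int r : retvals) {
        if (action_for_retval(r) == Action::RetryLater) return true;
    }
    return false;
}
```

With this shape, a single transient return value is enough to postpone the work, so a bad mount does not turn into a wave of failed jobs.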

Fixes #5799


codecov bot commented Sep 10, 2024

Codecov Report

Attention: Patch coverage is 0% with 68 lines in your changes missing coverage. Please review.

Project coverage is 10.49%. Comparing base (b11cd70) to head (d94d625).
Report is 4 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| sched/validate_util2.cpp | 0.00% | 49 Missing ⚠️ |
| sched/script_validator.cpp | 0.00% | 19 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##             master    #5803      +/-   ##
============================================
- Coverage     10.50%   10.49%   -0.02%     
  Complexity     1068     1068              
============================================
  Files           280      280              
  Lines         35972    36019      +47     
  Branches       8448     8444       -4     
============================================
  Hits           3780     3780              
- Misses        31798    31845      +47     
  Partials        394      394              

| Files with missing lines | Coverage Δ |
|---|---|
| sched/db_purge.cpp | 0.00% <ø> (ø) |
| sched/script_validator.cpp | 0.00% <0.00%> (ø) |
| sched/validate_util2.cpp | 0.00% <0.00%> (ø) |

@AenBleidd AenBleidd merged commit a2b61ad into master Sep 10, 2024
146 of 147 checks passed
@AenBleidd AenBleidd deleted the dpa_script_val2 branch September 10, 2024 02:08
Successfully merging this pull request may close these issues.

script_validator does not correctly handle return values