-
Notifications
You must be signed in to change notification settings - Fork 110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Have forward_model_ok
not run if storage in invalid (missing responses.json)
#9857
base: main
Are you sure you want to change the base?
Have forward_model_ok
not run if storage in invalid (missing responses.json)
#9857
Conversation
CodSpeed Performance ReportMerging #9857 will not alter performanceComparing Summary
|
@@ -96,36 +96,47 @@ async def forward_model_ok( | |||
realization: int, | |||
iter: int, | |||
ensemble: Ensemble, | |||
forward_model_ok_permanent_error_future: asyncio.Future[str] | None = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how the Future is usually used is that when you do somewhere:
await forward_model_ok_permanent_error_future
and somewhere else you get exception Ex you can propagate it to the future by
forward_model_ok_permanent_error_future.set_exception(Ex)
which then gets triggered by the line above (with await).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not want to await it, as it might never be set. I only use it so that multiple tasks can set the value, and the others can check if that value has already been set, meaning they should halt. It worked the same if I used myAsyncioFuture.set_result(reason_why_it_failed) or myAsyncioFuture.set_exception(exception_why_it_failed).
I will change it to use the latter one, as I already have the raw exception available.
3ce223e
to
fbb9127
Compare
This commit makes it so that if a realization fails due to `missing responses.json` (can happen if storage is deleted), we will not run `forward_model_ok` for the next realizations, as it is not something we can recover from. We have this logic to avoid spamming `logger.exception(...)` for something we cannot really handle.
fbb9127
to
8479240
Compare
I think it should be fine but it would be great if somebody else can check it too; e.g. @sondreso |
I would like to move the flag up a level and have the jobs not run |
@@ -253,6 +257,10 @@ def parameter_info(self) -> dict[str, Any]: | |||
@property | |||
def response_info(self) -> dict[str, Any]: | |||
info: dict[str, Any] | |||
if not (self.mount_point / self._responses_file).exists(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was just changed to give a FileNotFoundError
because the exception did not show full path. Yes the responses.json
does not exist, but the entire directory might be wrong:
76a2ba9
ensemble, | ||
) | ||
except Exception as err: | ||
if isinstance(err, ErtStorageException): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ErtStorageException does not just signify that the storage mount point does not exist, but many other errors for which forward_model_ok might succeed.
|
||
response_configs = self.experiment.response_configuration | ||
try: | ||
response_configs = self.experiment.response_configuration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not immediately seem fine to ignore a storage exception here.
I think this a very complicated and risky solution for a minor inconvenience. I think we should not do this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to be a bit careful here, as there are some potential undesired consequences of this solution.
For example, if there is an intermittent disk error (flaky network drive) that causes one of the realizations to fail with an ErtStorageException
, then all future realizations will not load data even if the disk becomes available again. It essentially turns what was an intermittent error into a permanent one.
I agree that the logging is a bit excessive in this scenario, but it does also alert us to something that is an actual error. We could try to collect all errors and log only once at the end, but this could also have unintended consequences. In general, it's difficult to separate this unrecoverable-error from something that we can recover from.
I think the better approach here is to reach out to the users that delete their storage mid run, and try to understand why they are doing it. If we understand that we can make informed improvements and don't solve the wrong problem.
Issue
Resolves #9856
Approach
This commit makes it so that if a realization fails due to
missing responses.json
(can happen if storage is deleted), we will not runforward_model_ok
for the next realizations, as it is not something we can recover from. We have this logic to avoid spamminglogger.exception(...)
for something we cannot really handle.(Screenshot of new behavior in GUI if applicable)
git rebase -i main --exec 'pytest tests/ert/unit_tests tests/everest -n auto --hypothesis-profile=fast -m "not integration_test"'
)When applicable