Have `forward_model_ok` not run if storage is invalid (missing responses.json) #9857

jonathan-eq · 2025-01-23T14:14:06Z

Issue
Resolves #9856

Approach
This commit makes it so that if a realization fails due to missing responses.json (can happen if storage is deleted), we will not run forward_model_ok for the next realizations, as it is not something we can recover from. We have this logic to avoid spamming logger.exception(...) for something we cannot really handle.

(Screenshot of new behavior in GUI if applicable)

PR title captures the intent of the changes, and is fitting for release notes.
Added appropriate release note label
Commit history is consistent and clean, in line with the contribution guidelines.
Make sure unit tests pass locally after every commit (git rebase -i main --exec 'pytest tests/ert/unit_tests tests/everest -n auto --hypothesis-profile=fast -m "not integration_test"')

When applicable

When there are user facing changes: Updated documentation
New behavior or changes to existing untested code: Ensured that unit tests are added (See Ground Rules).
Large PR: Prepare changes in small commits for more convenient review
Bug fix: Add regression test for the bug
Bug fix: Create Backport PR to latest release

codspeed-hq · 2025-01-23T14:25:45Z

CodSpeed Performance Report

Merging #9857 will not alter performance

Comparing jonathan-eq:fix-fm-failing-if-storage-is-delted (8479240) with main (02df6ee)

Summary

✅ 24 untouched benchmarks

xjules · 2025-01-23T14:58:08Z

src/ert/callbacks.py

@@ -96,36 +96,47 @@ async def forward_model_ok(
    realization: int,
    iter: int,
    ensemble: Ensemble,
+    forward_model_ok_permanent_error_future: asyncio.Future[str] | None = None,


The usual way to use a Future is that somewhere you do:

await forward_model_ok_permanent_error_future

and when an exception Ex occurs somewhere else, you propagate it to the future with

forward_model_ok_permanent_error_future.set_exception(Ex)

which is then raised at the await line above.
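A minimal sketch of the pattern described above, using only the standard `asyncio` API. The names `waiter` and `main` and the error message are made up for illustration and are not taken from the PR.

```python
import asyncio


async def waiter(future: asyncio.Future[str]) -> str:
    # Awaiting suspends this task until a result or an exception is set.
    try:
        return await future
    except RuntimeError as err:
        # The exception passed to set_exception() is raised at the await.
        return f"caught: {err}"


async def main() -> str:
    future: asyncio.Future[str] = asyncio.get_running_loop().create_future()
    task = asyncio.create_task(waiter(future))
    await asyncio.sleep(0)  # let the waiter start and block on the future
    future.set_exception(RuntimeError("storage is gone"))
    return await task


result = asyncio.run(main())
print(result)
```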

jonathan-eq · 2025-01-24

I do not want to await it, as it might never be set. I only use it so that any task can set the value and the others can check whether it has already been set, meaning they should halt. It works the same whether I use myAsyncioFuture.set_result(reason_why_it_failed) or myAsyncioFuture.set_exception(exception_why_it_failed).
I will change it to use the latter, as I already have the raw exception available.
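The flag-style usage described in this reply could look roughly like the sketch below. It is not the PR's code: `load_realization`, `run`, and the `storage_ok` parameter are hypothetical, and the future is only polled with `done()`, never awaited.

```python
import asyncio


async def load_realization(
    realization: int,
    permanent_error: asyncio.Future[None],
    storage_ok: bool,
) -> str:
    # Skip the work entirely if an earlier task already flagged a permanent error.
    if permanent_error.done():
        return "skipped"
    try:
        if not storage_ok:
            raise FileNotFoundError("responses.json")
        return "loaded"
    except FileNotFoundError as err:
        # set_exception() on an already completed future raises
        # InvalidStateError, so guard with done().
        if not permanent_error.done():
            permanent_error.set_exception(err)
        return "failed"


async def run() -> list[str]:
    permanent_error: asyncio.Future[None] = asyncio.get_running_loop().create_future()
    statuses = []
    for realization in range(3):
        statuses.append(
            await load_realization(realization, permanent_error, storage_ok=False)
        )
    # Retrieve the exception so asyncio does not warn that it was never retrieved.
    permanent_error.exception()
    return statuses


statuses = asyncio.run(run())
print(statuses)
```

One thing to keep in mind with `set_exception` on a future that is never awaited: asyncio logs "Future exception was never retrieved" when such a future is garbage collected, unless `exception()` or `result()` has been called on it.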


xjules · 2025-01-24T11:30:34Z

I think it should be fine, but it would be great if somebody else could check it too, e.g. @sondreso.

jonathan-eq · 2025-01-24T11:35:50Z

I would like to move the flag up a level and have the jobs not run forward_model_ok at all, but that would also require moving the ensemble.set_failure(realization, RealizationStorageState.LOAD_FAILURE, final_result.message) call. This would, however, require changes in libres_facade too 🤔

eivindjahren · 2025-01-24T12:18:43Z

src/ert/storage/local_experiment.py

@@ -253,6 +257,10 @@ def parameter_info(self) -> dict[str, Any]:
    @property
    def response_info(self) -> dict[str, Any]:
        info: dict[str, Any]
+        if not (self.mount_point / self._responses_file).exists():


This was just changed to raise a FileNotFoundError because the exception did not show the full path. Yes, responses.json does not exist, but the entire directory might be wrong:
76a2ba9
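To illustrate why the full path matters, here is a small sketch that is not the code from the referenced commit. The function `read_response_info` and its parameters are hypothetical. The error message carries the absolute path, so a wrong mount point is visible, not just a missing file name.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def read_response_info(
    mount_point: Path, responses_file: str = "responses.json"
) -> dict[str, Any]:
    path = mount_point / responses_file
    if not path.exists():
        # Report the absolute path so a wrong directory is easy to spot.
        raise FileNotFoundError(f"{path.absolute()} does not exist")
    return json.loads(path.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    try:
        read_response_info(Path(tmp))
        message = ""
    except FileNotFoundError as err:
        message = str(err)
    tmp_dir = str(Path(tmp).absolute())

print(message)
```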

eivindjahren · 2025-01-24T12:19:51Z

src/ert/callbacks.py

+                    ensemble,
+                )
+        except Exception as err:
+            if isinstance(err, ErtStorageException):


ErtStorageException does not just signify that the storage mount point does not exist, but many other errors for which forward_model_ok might succeed.
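The concern can be shown with a toy exception hierarchy. Everything below is hypothetical: `StorageError` stands in for a broad storage exception type and `StorageMissingError` for a narrower one, and neither is claimed to exist in ert. Checking against the broad type would also flag errors that later realizations might not hit.

```python
class StorageError(Exception):
    """Hypothetical broad storage exception covering many failure modes."""


class StorageMissingError(StorageError):
    """Hypothetical narrow exception: the storage files are gone."""


def is_permanent_broad(err: Exception) -> bool:
    # Treats every storage error as unrecoverable.
    return isinstance(err, StorageError)


def is_permanent_narrow(err: Exception) -> bool:
    # Only treats the missing-storage case as unrecoverable.
    return isinstance(err, StorageMissingError)


errors = [
    StorageMissingError("responses.json not found"),
    StorageError("some other storage problem"),
    ValueError("bad response file contents"),
]
broad_flags = [is_permanent_broad(e) for e in errors]
narrow_flags = [is_permanent_narrow(e) for e in errors]
print(broad_flags, narrow_flags)
```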

eivindjahren · 2025-01-24T12:20:59Z

src/ert/storage/local_ensemble.py

-
-        response_configs = self.experiment.response_configuration
+        try:
+            response_configs = self.experiment.response_configuration


It does not immediately seem fine to ignore a storage exception here.

eivindjahren · 2025-01-24T12:21:43Z

I think this is a very complicated and risky solution for a minor inconvenience. I think we should not do this.

sondreso · 2025-01-24

I think we need to be a bit careful here, as there are some potential undesired consequences of this solution.

For example, if there is an intermittent disk error (flaky network drive) that causes one of the realizations to fail with an ErtStorageException, then all future realizations will not load data even if the disk becomes available again. It essentially turns what was an intermittent error into a permanent one.
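The scenario can be simulated with a toy loop. `load_all` and the `disk_available` list are hypothetical and only model the halting behaviour, not real storage access.

```python
def load_all(disk_available: list[bool], halt_on_error: bool) -> list[str]:
    """Simulate loading one realization per entry in disk_available."""
    halted = False
    statuses = []
    for ok in disk_available:
        if halted:
            statuses.append("skipped")
        elif ok:
            statuses.append("loaded")
        else:
            statuses.append("failed")
            # With halting enabled, one failure stops all later loads.
            halted = halt_on_error
    return statuses


# The disk is unavailable only for the second realization.
flaky_disk = [True, False, True, True]
with_halt = load_all(flaky_disk, halt_on_error=True)
without_halt = load_all(flaky_disk, halt_on_error=False)
print(with_halt)
print(without_halt)
```

With halting, the two realizations after the glitch are skipped even though the disk is back. Without it, only the realization that hit the glitch fails.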

I agree that the logging is a bit excessive in this scenario, but it does also alert us to something that is an actual error. We could try to collect all errors and log only once at the end, but this could also have unintended consequences. In general, it's difficult to separate this unrecoverable error from something that we can recover from.
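The "collect and log once" idea could be sketched as follows. This is not something the PR implements. `summarize_failures` and the logger name are hypothetical, and grouping by exception type and message is just one possible choice.

```python
import logging
from collections import defaultdict

logger = logging.getLogger("load_summary_sketch")


def summarize_failures(failures: dict[int, Exception]) -> str:
    """Group failed realizations by error so each error is reported once."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for realization, err in sorted(failures.items()):
        grouped[f"{type(err).__name__}: {err}"].append(realization)
    return "\n".join(
        f"{message} (realizations {', '.join(map(str, reals))})"
        for message, reals in grouped.items()
    )


failures = {
    0: FileNotFoundError("responses.json"),
    2: FileNotFoundError("responses.json"),
    1: OSError("disk"),
}
summary = summarize_failures(failures)
# A single log call at the end instead of one per realization.
logger.error("Failed to load responses:\n%s", summary)
```

The trade-off mentioned above still applies: the individual tracebacks are lost unless they are kept separately, and nothing is logged until the run finishes.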

I think the better approach here is to reach out to the users who delete their storage mid-run and try to understand why they are doing it. If we understand that, we can make informed improvements and avoid solving the wrong problem.

jonathan-eq added the release-notes:bug-fix label Jan 23, 2025

jonathan-eq self-assigned this Jan 23, 2025

xjules reviewed Jan 23, 2025


jonathan-eq force-pushed the fix-fm-failing-if-storage-is-delted branch from 3ce223e to fbb9127 Compare January 24, 2025 07:40

jonathan-eq force-pushed the fix-fm-failing-if-storage-is-delted branch from fbb9127 to 8479240 Compare January 24, 2025 08:27

eivindjahren reviewed Jan 24, 2025


sondreso reviewed Jan 24, 2025

