Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Have forward_model_ok not run if storage in invalid (missing responses.json) #9857

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

jonathan-eq
Copy link
Contributor

@jonathan-eq jonathan-eq commented Jan 23, 2025

Issue
Resolves #9856

Approach
This commit makes it so that if a realization fails due to missing responses.json (can happen if storage is deleted), we will not run forward_model_ok for the next realizations, as it is not something we can recover from. We have this logic to avoid spamming logger.exception(...) for something we cannot really handle.

(Screenshot of new behavior in GUI if applicable)

  • PR title captures the intent of the changes, and is fitting for release notes.
  • Added appropriate release note label
  • Commit history is consistent and clean, in line with the contribution guidelines.
  • Make sure unit tests pass locally after every commit (git rebase -i main --exec 'pytest tests/ert/unit_tests tests/everest -n auto --hypothesis-profile=fast -m "not integration_test"')

When applicable

  • When there are user facing changes: Updated documentation
  • New behavior or changes to existing untested code: Ensured that unit tests are added (See Ground Rules).
  • Large PR: Prepare changes in small commits for more convenient review
  • Bug fix: Add regression test for the bug
  • Bug fix: Create Backport PR to latest release

@jonathan-eq jonathan-eq added the release-notes:bug-fix Automatically categorise as bug fix in release notes label Jan 23, 2025
@jonathan-eq jonathan-eq self-assigned this Jan 23, 2025
Copy link

codspeed-hq bot commented Jan 23, 2025

CodSpeed Performance Report

Merging #9857 will not alter performance

Comparing jonathan-eq:fix-fm-failing-if-storage-is-delted (8479240) with main (02df6ee)

Summary

✅ 24 untouched benchmarks

@@ -96,36 +96,47 @@ async def forward_model_ok(
realization: int,
iter: int,
ensemble: Ensemble,
forward_model_ok_permanent_error_future: asyncio.Future[str] | None = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how the Future is usually used is that when you do somewhere:

await forward_model_ok_permanent_error_future

and somewhere else you get exception Ex you can propagate it to the future by

forward_model_ok_permanent_error_future.set_exception(Ex)

which then gets triggered by the line above (with await).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not want to await it, as it might never be set. I only use it so that multiple tasks can set the value, and the others can check if that value has already been set, meaning they should halt. It worked the same if I used myAsyncioFuture.set_result(reason_why_it_failed) or myAsyncioFuture.set_exception(exception_why_it_failed).
I will change it to use the latter one, as I already have the raw exception available.

@jonathan-eq jonathan-eq force-pushed the fix-fm-failing-if-storage-is-delted branch from 3ce223e to fbb9127 Compare January 24, 2025 07:40
This commit makes it so that if a realization fails due to `missing responses.json` (can happen if storage is deleted), we will not run `forward_model_ok` for the next realizations, as it is not something we can recover from.
We have this logic to avoid spamming `logger.exception(...)` for something we cannot really handle.
@jonathan-eq jonathan-eq force-pushed the fix-fm-failing-if-storage-is-delted branch from fbb9127 to 8479240 Compare January 24, 2025 08:27
@xjules
Copy link
Contributor

xjules commented Jan 24, 2025

I think it should be fine but it would be great if somebody else can check it too; e.g. @sondreso

@jonathan-eq
Copy link
Contributor Author

I would like to move the flag up a level and have the jobs not run forward_model_ok at all, but that would also require moving the ensemble.set_failure( realization, RealizationStorageState.LOAD_FAILURE, final_result.message ). This would however require changes in libres_fascade too 🤔

@@ -253,6 +257,10 @@ def parameter_info(self) -> dict[str, Any]:
@property
def response_info(self) -> dict[str, Any]:
info: dict[str, Any]
if not (self.mount_point / self._responses_file).exists():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was just changed to give a FileNotFoundError because the exception did not show full path. Yes the responses.json does not exist, but the entire directory might be wrong:
76a2ba9

ensemble,
)
except Exception as err:
if isinstance(err, ErtStorageException):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ErtStorageException does not just signify that the storage mount point does not exist, but many other errors for which forward_model_ok might succeed.


response_configs = self.experiment.response_configuration
try:
response_configs = self.experiment.response_configuration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not immediately seem fine to ignore a storage exception here.

@eivindjahren
Copy link
Contributor

I think this a very complicated and risky solution for a minor inconvenience. I think we should not do this.

Copy link
Collaborator

@sondreso sondreso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to be a bit careful here, as there are some potential undesired consequences of this solution.

For example, if there is an intermittent disk error (flaky network drive) that causes one of the realizations to fail with an ErtStorageException, then all future realizations will not load data even if the disk becomes available again. It essentially turns what was an intermittent error into a permanent one.

I agree that the logging is a bit excessive in this scenario, but it does also alert us to something that is an actual error. We could try to collect all errors and log only once at the end, but this could also have unintended consequences. In general, it's difficult to separate this unrecoverable-error from something that we can recover from.

I think the better approach here is to reach out to the users that delete their storage mid run, and try to understand why they are doing it. If we understand that we can make informed improvements and don't solve the wrong problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-notes:bug-fix Automatically categorise as bug fix in release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ERT spams exceptions if storage is deleted mid run (missing responses.json)
4 participants