Added Task restart patterns to user guide
ianmkenney committed Oct 21, 2024
1 parent 977c896 commit 2d2d8f6
Showing 1 changed file with 42 additions and 0 deletions.
42 changes: 42 additions & 0 deletions docs/user_guide.rst
@@ -510,6 +510,48 @@ If you’re feeling confident, you could set all errored :py:class:`~alchemiscale
<ScopedKey('Task-157232d7ff794a0985ebce5055e0f336-my_org-my_campaign-my_project')>,
...]

***************************************************
Re-running Errored Tasks with Task Restart Patterns
***************************************************

Re-running errored :py:class:`~alchemiscale.storage.models.Task`\s manually for known failure modes (such as those described in the previous section) quickly becomes tedious, especially for large networks.
Alternatively, you can add `regular expression <https://en.wikipedia.org/wiki/Regular_expression>`_ strings as Task restart patterns to an :external+gufe:py:class:`~gufe.network.AlchemicalNetwork`.
These patterns *enforce* the `AlchemicalNetwork`, and there is no limit to the number of patterns that can enforce an `AlchemicalNetwork`.
As a result, `Task`\s actioned on that `AlchemicalNetwork` are restarted automatically if they fail during any part of their execution, provided that an enforcing pattern matches a traceback from any of the `ProtocolUnitFailure`\s returned by the `Task`.
The number of restarts is controlled by the ``num_allowed_restarts`` parameter of the `AlchemiscaleClient.add_task_restart_patterns` method.
If a `Task` is restarted more than ``num_allowed_restarts`` times, the `Task` is canceled and left with an ``error`` status.
As an example, if you wanted to rerun any `Task` that failed with a ``RuntimeError`` *or* a ``MemoryError`` and allow it up to 5 restarts, you could add the following patterns::

    >>> asc.add_task_restart_patterns(network_scoped_key, [r"RuntimeError: .+", r"MemoryError: Unable to allocate \d+ GiB"], 5)

If you provide too general a pattern, such as ``RuntimeError: .+`` in the example above, you may spend compute resources retrying failures that are unavoidable.
On the other hand, an overly strict pattern (such as one specifying explicit Gufe keys) will rarely match a traceback and will likely do nothing.
Therefore, it is best to find a balance in your patterns that matches your use case.
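
One way to look for that balance is to try candidate patterns against traceback text you have already collected before adding them to a network. The sketch below uses only Python's standard ``re`` module. The sample tracebacks are hypothetical, and the use of ``re.search`` (a match anywhere in the traceback) is an assumption about how patterns are applied, not a statement of how alchemiscale matches them.

```python
import re

# Hypothetical traceback text, made up for illustration only.
sample_tracebacks = {
    "runtime": "Traceback (most recent call last):\n  ...\nRuntimeError: Particle coordinate is NaN",
    "memory": "Traceback (most recent call last):\n  ...\nMemoryError: Unable to allocate 12 GiB",
    "value": "Traceback (most recent call last):\n  ...\nValueError: bad input",
}

# The two patterns used in the example above.
patterns = [r"RuntimeError: .+", r"MemoryError: Unable to allocate \d+ GiB"]


def matching_patterns(traceback_text, candidate_patterns):
    """Return the candidate patterns found anywhere in the traceback text.

    Assumption: a pattern "matches" if re.search finds it anywhere in the text.
    """
    return [p for p in candidate_patterns if re.search(p, traceback_text)]


# Map each sample name to the list of patterns that match it.
matches = {
    name: matching_patterns(text, patterns)
    for name, text in sample_tracebacks.items()
}

for name, matched in matches.items():
    print(name, matched)
```

A pattern that matches nearly every sample is probably too general, and one that matches none of the failures you want to retry is probably too strict.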

Restart patterns *enforcing* an `AlchemicalNetwork` can be retrieved with::

    >>> asc.get_task_restart_patterns(network_scoped_key)
    {"RuntimeError: .+": 5, "MemoryError: Unable to allocate \d+ GiB": 5}

The number of allowed restarts can be modified::

    >>> asc.set_task_restart_patterns_allowed_restarts(network_scoped_key, [r"RuntimeError: .+"], 3)
    >>> asc.set_task_restart_patterns_allowed_restarts(network_scoped_key, [r"MemoryError: Unable to allocate \d+ GiB"], 2)
    >>> asc.get_task_restart_patterns(network_scoped_key)
    {"RuntimeError: .+": 3, "MemoryError: Unable to allocate \d+ GiB": 2}

Patterns can be removed by specifying the patterns in a list::

    >>> asc.remove_task_restart_patterns(network_scoped_key, [r"MemoryError: Unable to allocate \d+ GiB"])
    >>> asc.get_task_restart_patterns(network_scoped_key)
    {"RuntimeError: .+": 3}

Or by clearing all enforcing patterns::

    >>> asc.clear_task_restart_patterns(network_scoped_key)
    >>> asc.get_task_restart_patterns(network_scoped_key)
    {}
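
To make the restart limit described in this section concrete, here is a toy model of the decision for a single pattern. It is a sketch under stated assumptions, not alchemiscale's implementation: the function ``should_restart`` and its ``restarts_so_far`` argument are hypothetical names, and matching with ``re.search`` is assumed.

```python
import re


def should_restart(traceback_text, pattern, num_allowed_restarts, restarts_so_far):
    """Toy decision rule for one restart pattern (hypothetical helper).

    A failed Task is restarted only if the pattern is found in the traceback
    and restarting again would not exceed num_allowed_restarts.
    """
    if re.search(pattern, traceback_text) is None:
        # No enforcing pattern matches: the Task stays errored.
        return False
    # Restarting more than num_allowed_restarts times is not permitted.
    return restarts_so_far < num_allowed_restarts


# Hypothetical traceback text for illustration.
tb = "Traceback (most recent call last):\n  ...\nRuntimeError: Particle coordinate is NaN"

first_failure = should_restart(tb, r"RuntimeError: .+", 3, restarts_so_far=0)
limit_reached = should_restart(tb, r"RuntimeError: .+", 3, restarts_so_far=3)
no_match = should_restart(tb, r"MemoryError: Unable to allocate \d+ GiB", 3, restarts_so_far=0)
print(first_failure, limit_reached, no_match)
```

Under these assumptions, lowering the allowed restarts for a pattern only changes the point at which a matching `Task` is left in the ``error`` status.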


***********************************
Marking Tasks as deleted or invalid