Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(robot-server): HTTP API for "Ignore error and skip to next step" #16564

Merged
merged 9 commits into from
Oct 24, 2024
9 changes: 8 additions & 1 deletion api-client/src/runs/types.ts
Original file line number Diff line number Diff line change
Expand Up @@ -125,12 +125,15 @@ export const RUN_ACTION_TYPE_PAUSE: 'pause' = 'pause'
export const RUN_ACTION_TYPE_STOP: 'stop' = 'stop'
export const RUN_ACTION_TYPE_RESUME_FROM_RECOVERY: 'resume-from-recovery' =
'resume-from-recovery'
export const RUN_ACTION_TYPE_RESUME_FROM_RECOVERY_ASSUMING_FALSE_POSITIVE: 'resume-from-recovery-assuming-false-positive' =
'resume-from-recovery-assuming-false-positive'

export type RunActionType =
| typeof RUN_ACTION_TYPE_PLAY
| typeof RUN_ACTION_TYPE_PAUSE
| typeof RUN_ACTION_TYPE_STOP
| typeof RUN_ACTION_TYPE_RESUME_FROM_RECOVERY
| typeof RUN_ACTION_TYPE_RESUME_FROM_RECOVERY_ASSUMING_FALSE_POSITIVE

export interface RunAction {
id: string
Expand Down Expand Up @@ -172,7 +175,11 @@ export type RunError = RunCommandError
* Error Policy
*/

export type IfMatchType = 'ignoreAndContinue' | 'failRun' | 'waitForRecovery'
export type IfMatchType =
| 'assumeFalsePositiveAndContinue'
| 'ignoreAndContinue'
| 'failRun'
| 'waitForRecovery'

export interface ErrorRecoveryPolicy {
policyRules: Array<{
Expand Down
54 changes: 42 additions & 12 deletions robot-server/robot_server/runs/action_models.py
Original file line number Diff line number Diff line change
Expand Up @@ -7,20 +7,53 @@


class RunActionType(str, Enum):
"""The type of the run control action.

* `"play"`: Start or resume a run.
* `"pause"`: Pause a run.
* `"stop"`: Stop (cancel) a run.
* `"resume-from-recovery"`: Resume normal protocol execution after a command failed,
the run was placed in `awaiting-recovery` mode, and manual recovery steps
were taken.
"""The type of the run control action, which determines behavior.

* `"play"`: Start the run, or resume it after it's been paused.

* `"pause"`: Pause the run.

* `"stop"`: Stop (cancel) the run.

* `"resume-from-recovery"`: Resume normal protocol execution after the run was in
error recovery mode. Continue from however the last command left the robot.

* `"resume-from-recovery-assuming-false-positive"`: Resume normal protocol execution
after the run was in error recovery mode. Act as if the underlying error was a
false positive.

To see the difference between `"resume-from-recovery"` and
`"resume-from-recovery-assuming-false-positive"`, suppose we've just entered error
recovery mode after a `commandType: "pickUpTip"` command failed with an
`errorType: "tipPhysicallyMissing"` error. That normally leaves the robot thinking
it has no tip attached. If you use `"resume-from-recovery"`, the robot will run
the next protocol command from that state, acting as if there's no tip attached.
(This may cause another error, if the next command needs a tip.)
Whereas if you use `"resume-from-recovery-assuming-false-positive"`,
the robot will try to nullify the error, thereby acting as if it *does* have a tip
attached.

Generally:

* If you've tried to recover from the error by sending your own `intent: "fixit"`
commands to `POST /runs/{id}/commands`, use `"resume-from-recovery"`. It's your
responsibility to ensure your `POST`ed commands leave the robot in a good-enough
state to continue with the protocol.

* Otherwise, use `"resume-from-recovery-assuming-false-positive"`.

Do not combine `intent: "fixit"` commands with
`"resume-from-recovery-assuming-false-positive"`—the robot's built-in
false-positive recovery may compete with your own.
Comment on lines +37 to +47
Copy link
Contributor Author

@SyntaxColoring SyntaxColoring Oct 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This warning against combining intent: "fixit" commands with "resume-from-recovery-assuming-false-positive" is because I've thought of an unfortunate danger. This is a contrived example, but bear with me and suppose:

  1. A moveLabware command fails to move something into slot A1.
  2. We enter error recovery mode.
  3. A client runs a fixit command that places some other labware into slot A1, for some reason.
  4. Then the client uses "resume-from-recovery-assuming-false-positive". Given how feat(api): Allow treating errors as false-positives (ignore them and continue with the run) #16556 works, this applies the "missing part" of the state update from step 1, to move the original labware into slot A1. But this conflicts with the fixit command from step 3.

This is worse than the normal brand of weird error recovery interactions, because it bypasses our usual validation layer and possibly messes up internal state. (Because the conflict happens deep, at the level of the Protocol Engine StateStores.) A spectrum of possible effects, from luckiest to unluckiest:

  • It has no adverse effect.
  • It raises some obscure internal AssertionError.
  • It somehow messes up internal state and does who-knows-what to subsequent commands.
  • It makes the run hang.

This gets a lot easier if we can forbid clients from combining fixit commands with resume-from-recovery-assuming-false-positive run actions. As in, forbid with an error, not just forbid with documentation like I've done here. But I'm not sure we can do that—don't we home unconditionally at the beginning of error recovery?

Copy link
Contributor

@mjhuff mjhuff Oct 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't we home unconditionally at the beginning of error recovery?

Yes, we do. I can see a future world in which we do other clean-up fixit commands on the client side before resume-from-recovery-assuming-false-positive, too.

In my personal opinion, it is likely better not to be overly prescriptive with what combinations of recovery commands/actions a client can take. We currently operate under that philosophy for the rest of Error Recovery, and while documenting this behavior is useful, this feels like a spot in which we can just extend that same philosophy.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree with @mjhuff here. I think this is addressable with a "Please do not do this" type warning, pretty strong language about not executing fixit commands that actually do things and warnings that the buyer must beware if they're going to try - and I think that for the audience of this interface, i.e. us and people who are trying to do weird things with the API, that's a reasonable ask.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And if people try it and they run into really weird errors, then our goal should be to make the errors less weird.

Copy link
Contributor Author

@SyntaxColoring SyntaxColoring Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, okay.

In my personal opinion, it is likely better not to be overly prescriptive with what combinations of recovery commands/actions a client can take. We currently operate under that philosophy for the rest of Error Recovery...

Right, but we've been able to operate under that philosophy for the rest of Error Recovery because we've constructed boundaries that can keep the robot decently-behaved even when weird stuff happens. e.g. the run controls will always work, and robot state won't tear, and user-visible error messages will have been written to be read by users. And what I'm talking about here is that my implementation of EXEC-676, so far, subverts those very boundaries.

I think that for the audience of this interface, i.e. us and people who are trying to do weird things with the API, that's a reasonable ask.

It's reasonable for us now, but having seen stuff like RQA-2934 and EXEC-760, doesn't this seem like a nuance that will get lost as the frontend code continues to evolve, especially under the hands of different people?


To be clear, I’m totally down to ship this in v8.2. I just want to make sure we're on the same page about accepting the mess I’ve made of this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't think it's that much of a mess. Error recovery, broadly, is a thing that is very hard to reason through. It makes sense to me that are a whole ton of very important caveats about the way that you're allowed to handle problems like this. The choice that we're making is

  • the backend allows a very general interface that may allow you to do dangerous or incorrect things
  • structure is handled and provided by the frontend

I think that this is a good and acceptable tradeoff to allow new and surprising uses of the mechanism. If there's a thing to be improved, in my opinion it is how we signal errors that come from corruption of internal state and trying to get the actions that the robot does in that case to better logically flow from actions. We're okay with having surprising things happen sometimes to people writing error recovery flows; the goal is that after they think about it for a second, they say "okay I guess that makes sense".

"""

PLAY = "play"
PAUSE = "pause"
STOP = "stop"
RESUME_FROM_RECOVERY = "resume-from-recovery"
RESUME_FROM_RECOVERY_ASSUMING_FALSE_POSITIVE = (
"resume-from-recovery-assuming-false-positive"
)
Comment on lines +54 to +56
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming: Anything more concise than resume-from-recovery-assuming-false-positive?

I do like that the current name specifically claims that this is for false-positives, instead of saying something vaguer like "resume and fix up state." I think it's a helpfully clarifying constraint for how backend engineers should implement commands and their errors going forward.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry! im confused with the name and not really sure about its intent :-(

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the big RunActionType docstring help? Otherwise let's talk about it in-person.



class RunActionCreate(BaseModel):
Expand All @@ -41,7 +74,4 @@ class RunAction(ResourceModel):

id: str = Field(..., description="A unique identifier to reference the command.")
createdAt: datetime = Field(..., description="When the command was created.")
actionType: RunActionType = Field(
...,
description="Specific type of action, which determines behavior.",
)
actionType: RunActionType
4 changes: 4 additions & 0 deletions robot-server/robot_server/runs/error_recovery_mapping.py
Original file line number Diff line number Diff line change
Expand Up @@ -102,6 +102,10 @@ def _map_error_recovery_type(reaction_if_match: ReactionIfMatch) -> ErrorRecover
match reaction_if_match:
case ReactionIfMatch.IGNORE_AND_CONTINUE:
return ErrorRecoveryType.IGNORE_AND_CONTINUE
case ReactionIfMatch.ASSUME_FALSE_POSITIVE_AND_CONTINUE:
# todo(mm, 2024-10-23): Connect to work in
# https://github.com/Opentrons/opentrons/pull/16556.
return ErrorRecoveryType.IGNORE_AND_CONTINUE
case ReactionIfMatch.FAIL_RUN:
return ErrorRecoveryType.FAIL_RUN
case ReactionIfMatch.WAIT_FOR_RECOVERY:
Expand Down
36 changes: 28 additions & 8 deletions robot-server/robot_server/runs/error_recovery_models.py
Original file line number Diff line number Diff line change
Expand Up @@ -24,17 +24,40 @@


class ReactionIfMatch(Enum):
"""How to handle a given error.
"""How to handle a matching error.

* `"ignoreAndContinue"`: Ignore this error and continue with the next command.
* `"failRun"`: Fail the run.
* `"waitForRecovery"`: Enter interactive error recovery mode.

* `"waitForRecovery"`: Enter interactive error recovery mode. You can then
perform error recovery with `POST /runs/{id}/commands` and exit error
recovery mode with `POST /runs/{id}/actions`.

* `"assumeFalsePositiveAndContinue"`: Continue the run without interruption, acting
as if the error was a false positive.

This is equivalent to doing `"waitForRecovery"`
and then sending `actionType: "resume-from-recovery-assuming-false-positive"`
to `POST /runs/{id}/actions`, except this requires no ongoing intervention from
the client.

* `"ignoreAndContinue"`: Continue the run without interruption, accepting whatever
state the error left the robot in.

This is equivalent to doing `"waitForRecovery"`
and then sending `actionType: "resume-from-recovery"` to `POST /runs/{id}/actions`,
except this requires no ongoing intervention from the client.

This is probably not useful very often because it's likely to cause downstream
errors—imagine trying an `aspirate` command after a failed `pickUpTip` command.
This is provided for symmetry.
"""

IGNORE_AND_CONTINUE = "ignoreAndContinue"
FAIL_RUN = "failRun"
WAIT_FOR_RECOVERY = "waitForRecovery"
ASSUME_FALSE_POSITIVE_AND_CONTINUE = "assumeFalsePositiveAndContinue"
# todo(mm, 2024-10-22): "ignoreAndContinue" may be a misnomer now: is
# "assumeFalsePositiveAndContinue" not also a way to "ignore"? Consider renaming.
IGNORE_AND_CONTINUE = "ignoreAndContinue"


class ErrorMatcher(BaseModel):
Expand Down Expand Up @@ -69,10 +92,7 @@ class ErrorRecoveryRule(BaseModel):
...,
description="The criteria that must be met for this rule to be applied.",
)
ifMatch: ReactionIfMatch = Field(
...,
description="How to handle errors matched by this rule.",
)
ifMatch: ReactionIfMatch
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did you remove the Field constructor?

Copy link
Contributor Author

@SyntaxColoring SyntaxColoring Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Redoc, the description here was covering up the more detailed description from the ReactionIfMatch docstring. I'm not sure if that's a Redoc problem or FastAPI problem or Pydantic problem, but this was a quick fix.



class ErrorRecoveryPolicy(BaseModel):
Expand Down
1 change: 0 additions & 1 deletion robot-server/robot_server/runs/router/base_router.py
Original file line number Diff line number Diff line change
Expand Up @@ -442,7 +442,6 @@ async def update_run(
See `PATCH /errorRecovery/settings`.
"""
),
status_code=status.HTTP_201_CREATED,
responses={
status.HTTP_200_OK: {"model": SimpleEmptyBody},
status.HTTP_409_CONFLICT: {"model": ErrorBody[RunStopped]},
Expand Down
12 changes: 12 additions & 0 deletions robot-server/robot_server/runs/run_controller.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@
import logging
from datetime import datetime
from typing import Optional
from typing_extensions import assert_never
from opentrons.protocol_engine import ProtocolEngineError
from opentrons_shared_data.errors.exceptions import RoboticsInteractionError

Expand Down Expand Up @@ -96,6 +97,17 @@ def create_action(
elif action_type == RunActionType.RESUME_FROM_RECOVERY:
self._run_orchestrator_store.resume_from_recovery()

elif (
action_type
== RunActionType.RESUME_FROM_RECOVERY_ASSUMING_FALSE_POSITIVE
):
# todo(mm, 2024-10-23): Connect to work in
# https://github.com/Opentrons/opentrons/pull/16556.
self._run_orchestrator_store.resume_from_recovery()

else:
assert_never(action_type)

except ProtocolEngineError as e:
raise RunActionNotAllowedError(message=e.message, wrapping=[e]) from e

Expand Down
Loading