-
Notifications
You must be signed in to change notification settings - Fork 179
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(robot-server): HTTP API for "Ignore error and skip to next step" #16564
Changes from all commits
36e98e2
7541b70
611d1d2
89ee3dd
3b8827b
fad998f
5aa91eb
964d2c0
c36ceef
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -7,20 +7,53 @@ | |
|
||
|
||
class RunActionType(str, Enum): | ||
"""The type of the run control action. | ||
|
||
* `"play"`: Start or resume a run. | ||
* `"pause"`: Pause a run. | ||
* `"stop"`: Stop (cancel) a run. | ||
* `"resume-from-recovery"`: Resume normal protocol execution after a command failed, | ||
the run was placed in `awaiting-recovery` mode, and manual recovery steps | ||
were taken. | ||
"""The type of the run control action, which determines behavior. | ||
|
||
* `"play"`: Start the run, or resume it after it's been paused. | ||
|
||
* `"pause"`: Pause the run. | ||
|
||
* `"stop"`: Stop (cancel) the run. | ||
|
||
* `"resume-from-recovery"`: Resume normal protocol execution after the run was in | ||
error recovery mode. Continue from however the last command left the robot. | ||
|
||
* `"resume-from-recovery-assuming-false-positive"`: Resume normal protocol execution | ||
after the run was in error recovery mode. Act as if the underlying error was a | ||
false positive. | ||
|
||
To see the difference between `"resume-from-recovery"` and | ||
`"resume-from-recovery-assuming-false-positive"`, suppose we've just entered error | ||
recovery mode after a `commandType: "pickUpTip"` command failed with an | ||
`errorType: "tipPhysicallyMissing"` error. That normally leaves the robot thinking | ||
it has no tip attached. If you use `"resume-from-recovery"`, the robot will run | ||
the next protocol command from that state, acting as if there's no tip attached. | ||
(This may cause another error, if the next command needs a tip.) | ||
Whereas if you use `"resume-from-recovery-assuming-false-positive"`, | ||
the robot will try to nullify the error, thereby acting as if it *does* have a tip | ||
attached. | ||
|
||
Generally: | ||
|
||
* If you've tried to recover from the error by sending your own `intent: "fixit"` | ||
commands to `POST /runs/{id}/commands`, use `"resume-from-recovery"`. It's your | ||
responsibility to ensure your `POST`ed commands leave the robot in a good-enough | ||
state to continue with the protocol. | ||
|
||
* Otherwise, use `"resume-from-recovery-assuming-false-positive"`. | ||
|
||
Do not combine `intent: "fixit"` commands with | ||
`"resume-from-recovery-assuming-false-positive"`—the robot's built-in | ||
false-positive recovery may compete with your own. | ||
""" | ||
|
||
PLAY = "play" | ||
PAUSE = "pause" | ||
STOP = "stop" | ||
RESUME_FROM_RECOVERY = "resume-from-recovery" | ||
RESUME_FROM_RECOVERY_ASSUMING_FALSE_POSITIVE = ( | ||
"resume-from-recovery-assuming-false-positive" | ||
) | ||
Comment on lines
+54
to
+56
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Naming: Anything more concise than I do like that the current name specifically claims that this is for false-positives, instead of saying something vaguer like "resume and fix up state." I think it's a helpfully clarifying constraint for how backend engineers should implement commands and their errors going forward. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. sorry! im confused with the name and not really sure about its intent :-( There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Does the big |
||
|
||
|
||
class RunActionCreate(BaseModel): | ||
|
@@ -41,7 +74,4 @@ class RunAction(ResourceModel): | |
|
||
id: str = Field(..., description="A unique identifier to reference the command.") | ||
createdAt: datetime = Field(..., description="When the command was created.") | ||
actionType: RunActionType = Field( | ||
..., | ||
description="Specific type of action, which determines behavior.", | ||
) | ||
actionType: RunActionType |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -24,17 +24,40 @@ | |
|
||
|
||
class ReactionIfMatch(Enum): | ||
"""How to handle a given error. | ||
"""How to handle a matching error. | ||
|
||
* `"ignoreAndContinue"`: Ignore this error and continue with the next command. | ||
* `"failRun"`: Fail the run. | ||
* `"waitForRecovery"`: Enter interactive error recovery mode. | ||
|
||
* `"waitForRecovery"`: Enter interactive error recovery mode. You can then | ||
perform error recovery with `POST /runs/{id}/commands` and exit error | ||
recovery mode with `POST /runs/{id}/actions`. | ||
|
||
* `"assumeFalsePositiveAndContinue"`: Continue the run without interruption, acting | ||
as if the error was a false positive. | ||
|
||
This is equivalent to doing `"waitForRecovery"` | ||
and then sending `actionType: "resume-from-recovery-assuming-false-positive"` | ||
to `POST /runs/{id}/actions`, except this requires no ongoing intervention from | ||
the client. | ||
|
||
* `"ignoreAndContinue"`: Continue the run without interruption, accepting whatever | ||
state the error left the robot in. | ||
|
||
This is equivalent to doing `"waitForRecovery"` | ||
and then sending `actionType: "resume-from-recovery"` to `POST /runs/{id}/actions`, | ||
except this requires no ongoing intervention from the client. | ||
|
||
This is probably not useful very often because it's likely to cause downstream | ||
errors—imagine trying an `aspirate` command after a failed `pickUpTip` command. | ||
This is provided for symmetry. | ||
""" | ||
|
||
IGNORE_AND_CONTINUE = "ignoreAndContinue" | ||
FAIL_RUN = "failRun" | ||
WAIT_FOR_RECOVERY = "waitForRecovery" | ||
ASSUME_FALSE_POSITIVE_AND_CONTINUE = "assumeFalsePositiveAndContinue" | ||
# todo(mm, 2024-10-22): "ignoreAndContinue" may be a misnomer now: is | ||
# "assumeFalsePositiveAndContinue" not also a way to "ignore"? Consider renaming. | ||
IGNORE_AND_CONTINUE = "ignoreAndContinue" | ||
|
||
|
||
class ErrorMatcher(BaseModel): | ||
|
@@ -69,10 +92,7 @@ class ErrorRecoveryRule(BaseModel): | |
..., | ||
description="The criteria that must be met for this rule to be applied.", | ||
) | ||
ifMatch: ReactionIfMatch = Field( | ||
..., | ||
description="How to handle errors matched by this rule.", | ||
) | ||
ifMatch: ReactionIfMatch | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why did you remove the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In Redoc, the |
||
|
||
|
||
class ErrorRecoveryPolicy(BaseModel): | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This warning against combining
intent: "fixit"
commands with"resume-from-recovery-assuming-false-positive"
is because I've thought of an unfortunate danger. This is a contrived example, but bear with me and suppose:moveLabware
command fails to move something into slot A1.fixit
command that places some other labware into slot A1, for some reason."resume-from-recovery-assuming-false-positive"
. Given how feat(api): Allow treating errors as false-positives (ignore them and continue with the run) #16556 works, this applies the "missing part" of the state update from step 1, to move the original labware into slot A1. But this conflicts with thefixit
command from step 3.This is worse than the normal brand of weird error recovery interactions, because it bypasses our usual validation layer and possibly messes up internal state. (Because the conflict happens deep, at the level of the Protocol Engine
StateStore
s.) A spectrum of possible effects, from luckiest to unluckiest:AssertionError
.This gets a lot easier if we can forbid clients from combining
fixit
commands withresume-from-recovery-assuming-false-positive
run actions. As in, forbid with an error, not just forbid with documentation like I've done here. But I'm not sure we can do that—don't we home unconditionally at the beginning of error recovery?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we do. I can see a future world in which we do other clean-up fixit commands on the client side before
resume-from-recovery-assuming-false-positive
, too.In my personal opinion, it is likely better not to be overly prescriptive with what combinations of recovery commands/actions a client can take. We currently operate under that philosophy for the rest of Error Recovery, and while documenting this behavior is useful, this feels like a spot in which we can just extend that same philosophy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree with @mjhuff here. I think this is addressable with a "Please do not do this" type warning, pretty strong language about not executing fixit commands that actually do things and warnings that the buyer must beware if they're going to try - and I think that for the audience of this interface, i.e. us and people who are trying to do weird things with the API, that's a reasonable ask.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And if people try it and they run into really weird errors, then our goal should be to make the errors less weird.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, okay.
Right, but we've been able to operate under that philosophy for the rest of Error Recovery because we've constructed boundaries that can keep the robot decently-behaved even when weird stuff happens. e.g. the run controls will always work, and robot state won't tear, and user-visible error messages will have been written to be read by users. And what I'm talking about here is that my implementation of EXEC-676, so far, subverts those very boundaries.
It's reasonable for us now, but having seen stuff like RQA-2934 and EXEC-760, doesn't this seem like a nuance that will get lost as the frontend code continues to evolve, especially under the hands of different people?
To be clear, I’m totally down to ship this in v8.2. I just want to make sure we're on the same page about accepting the mess I’ve made of this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really don't think it's that much of a mess. Error recovery, broadly, is a thing that is very hard to reason through. It makes sense to me that are a whole ton of very important caveats about the way that you're allowed to handle problems like this. The choice that we're making is
I think that this is a good and acceptable tradeoff to allow new and surprising uses of the mechanism. If there's a thing to be improved, in my opinion it is how we signal errors that come from corruption of internal state and trying to get the actions that the robot does in that case to better logically flow from actions. We're okay with having surprising things happen sometimes to people writing error recovery flows; the goal is that after they think about it for a second, they say "okay I guess that makes sense".