Trampolining Part 2: Avoid infinite loops for Pinned workflows#9374
Trampolining Part 2: Avoid infinite loops for Pinned workflows#9374
Conversation
| var targetDeploymentVersionChanged bool | ||
| if m.ms.config.EnableSendTargetVersionChanged(m.ms.namespaceEntry.Name().String()) && | ||
| m.ms.GetEffectiveVersioningBehavior() != enumspb.VERSIONING_BEHAVIOR_UNSPECIFIED && | ||
| versioningInfo.GetVersioningOverride() == nil && |
There was a problem hiding this comment.
adding a guard here so that workflows that have the pinned override do not bounce on to the new version!
There was a problem hiding this comment.
I like this restriction. It's really hard to reason about the signal in case of override. also, user has no way to do with it because the override (presumably) will override the AU initial behavior. Moreover, as you mentioned it doesn't make sense to send signal when "the operator explicitly controls the workflow's versioning".
However, this intensifies the need for one-time move to let users who do not specify override at workflow start but just happened to move the workflow to a patched build can still benefit from the signal.
So basically there is two worlds (we never talked about this clearly but I just realized seeing these two world separately helps clarify things immensely):
-
World 1: RoutingConfig-based. User relies on routing config and does not deal with override. I case of patches they use one-time move.
-
World 2: Manual pinning. User wants full control on per-execution version routing. They do not use WD routing config, but pass pinned versioning override for each new workflow execution. The cost is that they lose the following benefits of the server-managed routing:
- Separation of concerns: Starter/client is now involved in versioning
- AutoUpgrade behavior
- Built-in Ramp
- Target version changed signal (built-in trampolining signal)
So basically pinned override will be only used in world 2, except for very rare edge cases when user has set the wrong behavior in SDK and wants to override it temporarily in world 1.
This also explains why Pinned behavior override has the special inheritance logic: in world 2 it makes sense for CaN and Child to inherit the override.
I think we should clarify our docs about these two ways of using Worker Versioning so users are not confused. And emphasize on routing config way as the recommended way. (I think without us mentioning it many users would think of manual pinning as the first thing, their mind don't separate versioning concerns from Starter automatically at the first glance.)
For both of these worlds to work fully we need to fill the following gaps:
- One-time move without override (world 1)
- passing (and removing) versioning override for CaN and Child (world 2)
| modernc.org/memory v1.11.0 // indirect | ||
| ) | ||
|
|
||
| replace go.temporal.io/api => ../api-go |
There was a problem hiding this comment.
reminder: change this later on.
| @@ -52,6 +53,7 @@ type WorkflowTaskInfo struct { | |||
| HistorySizeBytes int64 | |||
|
|
|||
| TargetWorkerDeploymentVersionChanged bool | |||
| TargetWorkerDeploymentVersionOnStart *deploymentpb.WorkerDeploymentVersion | |||
There was a problem hiding this comment.
we need this to be present on this structure due to the following reason:
WorkflowTaskInfo is the intermediary struct that bundles all WFT-related data. It's populated by getWorkflowTaskInfo() and then consumed by callers that retroactively write WFT started events to history.
I initially did not have this addition to be present, but then quickly realized that for those speculative tasks that are being converted to normal workflow tasks (so that the corresponding events may be persisted), we need to have this information handy so that the right information is present in those events.
| // Compute whether the routing target has changed since this run started, used to set the | ||
| // target_worker_deployment_version_changed flag that tells pinned workflows to CAN onto the | ||
| // latest current version. | ||
| // - Skip when targetDeploymentVersion is nil: matching only sends it when the poller's |
There was a problem hiding this comment.
this is soon to be altered/fixed imo
|
In theory, this PR could have been quite small if I had not worried/thought about the replication side of things. I think it's important to note that if we don't persist this information in history (on the event side of things), then it becomes difficult for us to rebuild the mutable state when looking at just the events. This could also cause a workflow, on a MS that has been rebuilt, to undergo an additional CAN. one could say that this is an overkill for something which is anyways the user's fault, but my knowledge on replication and how consistent things have to be kept is quite minimal, so I decided to take a stab at this. |
| } | ||
| initialTarget := versioningInfo.GetTargetWorkerDeploymentVersionOnStart() | ||
| targetDeploymentVersionChanged = | ||
| initialTarget.GetBuildId() != targetDeploymentVersion.GetBuildId() || |
There was a problem hiding this comment.
should still make sure that the target is actually different from currentDeploymentVersion.
| versioningInfo = &workflowpb.WorkflowExecutionVersioningInfo{} | ||
| m.ms.GetExecutionInfo().VersioningInfo = versioningInfo | ||
| } | ||
| versioningInfo.TargetWorkerDeploymentVersionOnStart = &deploymentpb.WorkerDeploymentVersion{ |
There was a problem hiding this comment.
It's hard to reason about race conditions involving target version change between the time previous run asked for CaN and next run starting its first wft.
Shouldn't we pass this all the way from the previous run and just set it in the WorkflowExecutionStartedEvent?
(Also, it's weird to have this field in every wft started event while the value only change once).
What changed?
Why?
How did you test it?
Potential risks
"build-v2" → spurious targetDeploymentVersionChanged=true. This is a one-time false-positive regression for those Pinned workflows.