Skip to content

Comments

Trampolining Part 2: Avoid infinite loops for Pinned workflows#9374

Draft
Shivs11 wants to merge 3 commits intomainfrom
trampolining-pt2
Draft

Trampolining Part 2: Avoid infinite loops for Pinned workflows#9374
Shivs11 wants to merge 3 commits intomainfrom
trampolining-pt2

Conversation

@Shivs11
Copy link
Member

@Shivs11 Shivs11 commented Feb 20, 2026

What changed?

  • Consists of the change to prevent any Pinned workflows, that may have forgotten to have the initial CAN Behaviour as AU, from CAN'ing infinitely.

Why?

  • Worker-Versioning correctness.

How did you test it?

  • built
  • run locally and tested manually
  • covered by existing tests
  • added new unit test(s)
  • added new functional test(s)

Potential risks

  • Pre-existing workflows (started before this fix) have nil TargetVersionOnStart. On the first WFT after deployment, "" !=
    "build-v2" → spurious targetDeploymentVersionChanged=true. This is a one-time false-positive regression for those Pinned workflows.

var targetDeploymentVersionChanged bool
if m.ms.config.EnableSendTargetVersionChanged(m.ms.namespaceEntry.Name().String()) &&
m.ms.GetEffectiveVersioningBehavior() != enumspb.VERSIONING_BEHAVIOR_UNSPECIFIED &&
versioningInfo.GetVersioningOverride() == nil &&
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adding a guard here so that workflows that have the pinned override do not bounce on to the new version!

Copy link
Contributor

@ShahabT ShahabT Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this restriction. It's really hard to reason about the signal in case of override. also, user has no way to do with it because the override (presumably) will override the AU initial behavior. Moreover, as you mentioned it doesn't make sense to send signal when "the operator explicitly controls the workflow's versioning".

However, this intensifies the need for one-time move to let users who do not specify override at workflow start but just happened to move the workflow to a patched build can still benefit from the signal.

So basically there is two worlds (we never talked about this clearly but I just realized seeing these two world separately helps clarify things immensely):

  • World 1: RoutingConfig-based. User relies on routing config and does not deal with override. I case of patches they use one-time move.

  • World 2: Manual pinning. User wants full control on per-execution version routing. They do not use WD routing config, but pass pinned versioning override for each new workflow execution. The cost is that they lose the following benefits of the server-managed routing:

    • Separation of concerns: Starter/client is now involved in versioning
    • AutoUpgrade behavior
    • Built-in Ramp
    • Target version changed signal (built-in trampolining signal)

So basically pinned override will be only used in world 2, except for very rare edge cases when user has set the wrong behavior in SDK and wants to override it temporarily in world 1.

This also explains why Pinned behavior override has the special inheritance logic: in world 2 it makes sense for CaN and Child to inherit the override.

I think we should clarify our docs about these two ways of using Worker Versioning so users are not confused. And emphasize on routing config way as the recommended way. (I think without us mentioning it many users would think of manual pinning as the first thing, their mind don't separate versioning concerns from Starter automatically at the first glance.)

For both of these worlds to work fully we need to fill the following gaps:

  1. One-time move without override (world 1)
  2. passing (and removing) versioning override for CaN and Child (world 2)

modernc.org/memory v1.11.0 // indirect
)

replace go.temporal.io/api => ../api-go
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reminder: change this later on.

@@ -52,6 +53,7 @@ type WorkflowTaskInfo struct {
HistorySizeBytes int64

TargetWorkerDeploymentVersionChanged bool
TargetWorkerDeploymentVersionOnStart *deploymentpb.WorkerDeploymentVersion
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need this to be present on this structure due to the following reason:

WorkflowTaskInfo is the intermediary struct that bundles all WFT-related data. It's populated by getWorkflowTaskInfo() and then consumed by callers that retroactively write WFT started events to history.

I initially did not have this addition to be present, but then quickly realized that for those speculative tasks that are being converted to normal workflow tasks (so that the corresponding events may be persisted), we need to have this information handy so that the right information is present in those events.

// Compute whether the routing target has changed since this run started, used to set the
// target_worker_deployment_version_changed flag that tells pinned workflows to CAN onto the
// latest current version.
// - Skip when targetDeploymentVersion is nil: matching only sends it when the poller's
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is soon to be altered/fixed imo

@Shivs11
Copy link
Member Author

Shivs11 commented Feb 20, 2026

In theory, this PR could have been quite small if I had not worried/thought about the replication side of things. I think it's important to note that if we don't persist this information in history (on the event side of things), then it becomes difficult for us to rebuild the mutable state when looking at just the events. This could also cause a workflow, on a MS that has been rebuilt, to undergo an additional CAN.

one could say that this is an overkill for something which is anyways the user's fault, but my knowledge on replication and how consistent things have to be kept is quite minimal, so I decided to take a stab at this.

}
initialTarget := versioningInfo.GetTargetWorkerDeploymentVersionOnStart()
targetDeploymentVersionChanged =
initialTarget.GetBuildId() != targetDeploymentVersion.GetBuildId() ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should still make sure that the target is actually different from currentDeploymentVersion.

versioningInfo = &workflowpb.WorkflowExecutionVersioningInfo{}
m.ms.GetExecutionInfo().VersioningInfo = versioningInfo
}
versioningInfo.TargetWorkerDeploymentVersionOnStart = &deploymentpb.WorkerDeploymentVersion{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's hard to reason about race conditions involving target version change between the time previous run asked for CaN and next run starting its first wft.

Shouldn't we pass this all the way from the previous run and just set it in the WorkflowExecutionStartedEvent?

(Also, it's weird to have this field in every wft started event while the value only change once).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants