Commit 6f842cc

SEP74 Automatic Workload Migration
Signed-off-by: Daniel Grimm <[email protected]>
1 parent eb6389f commit 6f842cc

1 file changed: 225 additions & 0 deletions

@@ -0,0 +1,225 @@
| Status | Authors | Created |
|---|---|---|
| WIP | @dgn | 2025-05-22 |

# Automatic Workload Migration

Tracked in [#74](https://github.com/istio-ecosystem/sail-operator/issues/74).

## Overview

When using Istio's revision-based update strategy, workloads must be migrated from old control plane revisions to new ones after an upgrade. Currently, this migration process is manual, requiring users to update namespace and pod labels and restart deployments to pick up new sidecar proxy versions. This creates operational overhead and potential for errors, especially in environments with many namespaces and workloads.

This enhancement introduces automatic workload migration functionality to the Sail Operator, enabling seamless migration of workloads from old IstioRevisions to the current active revision when `workloadMigration.enabled` is set in the Istio resource's update strategy.

## Goals

* Automatically migrate workloads from old IstioRevisions to the active revision when enabled
* Provide configurable migration behavior including batch sizes, delays, and timeouts
* Migrate with zero downtime by checking workload health between batches
* Handle both namespace-level and deployment-level revision targeting
* Support both default revision (`istio-injection=enabled`) and named revisions (`istio.io/rev=<revision>`)

## Non-goals

* Support for migration of workloads in external clusters managed by remote control planes
* Migration of StatefulSets, DaemonSets, or other workload types (focus on Deployments initially)
* Complex scheduling or dependency-aware migration ordering
* Ambient Mesh support (ztunnel does not yet support revisions)

## Design

### User Stories

1. **As a platform engineer**, I want workloads to automatically migrate to new Istio versions during control plane upgrades, so that I don't have to manually update namespace labels and restart deployments.

2. **As a cluster operator**, I want to configure the migration behavior (batch sizes, delays) to control the impact on my workloads during upgrades.

3. **As an application owner**, I want my applications to maintain availability during Istio upgrades without manual intervention.

4. **As a platform team**, I want to use stable revision tags while still having workloads automatically migrate to new control plane versions.

### API Changes

#### IstioUpdateStrategy Enhancement

The existing `IstioUpdateStrategy` type is extended with a new `WorkloadMigration` field of type `WorkloadMigrationConfig`:

```go
type WorkloadMigrationConfig struct {
	// Defines whether the workloads should be moved from one control plane instance to another automatically
	// +kubebuilder:default=false
	Enabled *bool `json:"enabled,omitempty"`

	// Maximum number of deployments to restart concurrently during migration.
	// Defaults to 1.
	// +kubebuilder:default=1
	// +kubebuilder:validation:Minimum=1
	BatchSize *int32 `json:"batchSize,omitempty"`

	// Time to wait between deployment restart batches.
	// Defaults to 30s.
	// +kubebuilder:default="30s"
	DelayBetweenBatches *metav1.Duration `json:"delayBetweenBatches,omitempty"`

	// Maximum time to wait for a deployment to become ready after restart.
	// Defaults to 5m.
	// +kubebuilder:default="5m"
	ReadinessTimeout *metav1.Duration `json:"readinessTimeout,omitempty"`
}
```

#### RBAC Permissions

The operator already has the required permission to update `Deployment` resources.

### Architecture

#### Migration Flow

1. **Trigger**: Migration is triggered when:
   - `workloadMigration.enabled` is set to `true`
   - An Istio resource's version is updated

2. **Discovery**: The operator discovers workloads using old revisions by:
   - Listing all namespaces and checking their `istio.io/rev` or `istio-injection` labels
   - Listing all deployments and checking their pod template labels
   - Listing all pods and checking their pod annotations to detect injected revision
   - Comparing current annotations against the active revision name

3. **Migration**: Workloads are migrated in two phases:
   - **Phase 1**: Namespace label updates (no restarts required yet)
   - **Phase 2**: Deployment restarts in configurable batches

4. **Validation**: Each batch waits for readiness before proceeding to the next batch

#### WorkloadManager & InUse Detection

This SEP proposes a `WorkloadManager` that handles workload-specific tasks such as label updates and InUse detection. Currently, InUse detection code is spread across the `Istio`, `IstioRevision`, and `IstioRevisionTag` controllers. Implementation of this SEP should include a refactoring that moves this code into a common package.

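To make the idea concrete, here is a rough sketch of the kind of label logic such a common package might contain. All function names are hypothetical, and the handling of `istio-injection=enabled` as a reference to a revision named `default` is an assumption made for the example. The precedence of `istio-injection` over `istio.io/rev` follows Istio's documented behavior for namespaces that carry both labels.

```go
package main

import "fmt"

const (
	injectionLabel  = "istio-injection"
	revisionLabel   = "istio.io/rev"
	defaultRevision = "default" // assumption: name used for the default revision
)

// referencedRevision returns the revision or tag name that a set of namespace
// labels selects. The second return value is false when no injection is requested.
func referencedRevision(labels map[string]string) (string, bool) {
	// istio-injection takes precedence over istio.io/rev when both are present.
	switch labels[injectionLabel] {
	case "enabled":
		return defaultRevision, true
	case "disabled":
		return "", false
	}
	if rev, ok := labels[revisionLabel]; ok && rev != "" {
		return rev, true
	}
	return "", false
}

// needsLabelUpdate reports whether the labels point at a revision other than the
// active one. References to known revision tags are left alone, since the tag
// itself is what gets retargeted.
func needsLabelUpdate(labels map[string]string, activeRevision string, tagNames map[string]bool) bool {
	ref, ok := referencedRevision(labels)
	if !ok {
		return false
	}
	if tagNames[ref] {
		return false
	}
	return ref != activeRevision
}

func main() {
	tags := map[string]bool{"stable": true}
	old := map[string]string{revisionLabel: "default-v1-25-0"}
	tagged := map[string]string{revisionLabel: "stable"}
	fmt.Println(needsLabelUpdate(old, "default-v1-26-0", tags))
	fmt.Println(needsLabelUpdate(tagged, "default-v1-26-0", tags))
}
```

The revision names used in `main` are made up for illustration. The same helper could serve both InUse detection ("does anything still reference this revision?") and the discovery step of the migration.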
#### Deployment Restart Mechanism

Deployments are restarted using Kubernetes' standard rolling update mechanism:

1. **Label Update**: Pod template labels are updated to reference the new revision where required. No label update is needed when the workload references an IstioRevisionTag.
2. **Restart Annotation**: A `kubectl.kubernetes.io/restartedAt` annotation is added to trigger pod replacement
3. **Health Check**: The operator waits for `deployment.Status.ReadyReplicas == deployment.Spec.Replicas`

### Performance Impact

* **Discovery Overhead**: The operator lists all namespaces and deployments once per migration
* **Batch Processing**: Migration impact is controlled through configurable batch sizes and delays
* **Memory Usage**: Minimal additional memory for tracking migration state
* **Network Traffic**: Standard Kubernetes API calls for object updates

### Backward Compatibility

* **Opt-in Feature**: Migration only occurs when `workloadMigration.enabled: true` is explicitly set
* **Default Behavior**: Existing behavior is unchanged for users who don't enable the feature
* **API Compatibility**: All new fields are optional with sensible defaults
* **GitOps Support**: GitOps-managed Deployments are supported when they use an `IstioRevisionTag`, or when the GitOps tool is configured to ignore revision label updates

### Kubernetes vs OpenShift vs Other Distributions

No distribution-specific dependencies.

## Alternatives Considered

TBD

## Implementation Plan

### Phase 1: Core Implementation
- [ ] Extend IstioUpdateStrategy API with WorkloadMigrationConfig
- [ ] Implement core migration logic in Istio controller
- [ ] Refactor InUse detection into WorkloadManager
- [ ] Implement namespace label migration
- [ ] Implement deployment restart with batching

### Phase 2: Testing
- [ ] Unit tests for all migration functions
- [ ] Integration tests for migration scenarios
- [ ] E2E tests for end-to-end migration workflows

### Phase 3: Documentation and Validation
- [ ] User documentation updates
- [ ] Example configurations

### Phase 4: Future Enhancements (Optional)
- [ ] Support for StatefulSets and DaemonSets
- [ ] Migration rollback capabilities
- [ ] Advanced scheduling options (maintenance windows)
- [ ] Integration with external monitoring systems

## Test Plan

### Unit Tests

### Integration Tests

### E2E Tests

## Example Configuration

### Basic Automatic Migration
```yaml
apiVersion: sailoperator.io/v1
kind: Istio
metadata:
  name: default
spec:
  version: v1.26.0
  updateStrategy:
    type: RevisionBased
    workloadMigration:
      enabled: true
```

### Advanced Migration Configuration
```yaml
apiVersion: sailoperator.io/v1
kind: Istio
metadata:
  name: default
spec:
  version: v1.26.0
  updateStrategy:
    type: RevisionBased
    workloadMigration:
      enabled: true
      batchSize: 5
      delayBetweenBatches: 60s
      readinessTimeout: 10m
```

### Usage with Revision Tags
```yaml
# First, create a revision tag
apiVersion: sailoperator.io/v1
kind: IstioRevisionTag
metadata:
  name: stable
spec:
  targetRef:
    kind: Istio
    name: default

---
# Workloads can use stable revision tag
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    istio.io/rev: stable
```

When the Istio version is updated, workloads using the `stable` tag will automatically be migrated to the new revision.

## Security Considerations

There is a risk that migration breaks security features configured by the user because they no longer work properly in the new version, for example deprecated features or custom changes made using `EnvoyFilter`. This risk cannot really be mitigated, so workload migration will always be disabled by default and should be used with caution.

## Change History

* 2025-05-22: Initial SEP created