
Commit e561842

jeremyeder, Ambient Code Bot, and claude authored
fix(operator): increase resource limits to prevent OOMKill at scale (#1299)
## Summary

The agentic-operator is OOMKilled on vteam-uat (128 restarts, CrashLoopBackOff for 2+ days on node `ip-10-0-15-94.ec2.internal`). The cluster has grown beyond what the original resource limits can handle:

| Resource | Count |
|----------|-------|
| AgenticSessions | 4,319 |
| ProjectSettings | 674 |
| Namespaces | 763 |
| Pods (total) | 477 |

The controller-runtime in-memory cache stores all watched resources. With the old 512Mi limit, the operator OOMs within ~60 seconds of startup every time.

**Changes:**

- Memory: 128Mi/512Mi → 512Mi/4Gi (request/limit)
- CPU: 50m/200m → 100m/2 cores (request/limit)
- Add `GOMEMLIMIT=3500MiB` so Go's GC gets aggressive before hitting the container ceiling

A follow-up issue has been filed for code-level scalability improvements (cache scoping, unbounded maps, connection pooling, etc.).

## Test plan

- [ ] Deploy to vteam-uat, confirm operator pod starts and stays Running
- [ ] `oc top pod -n ambient-code -l app=agentic-operator`: check steady-state memory usage
- [ ] Wait 10+ minutes, confirm zero restarts
- [ ] Verify sessions can be created/reconciled normally

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Chores**
  * Optimized operator resource allocation, including increased CPU and memory configurations, and added memory management settings for improved operational performance and stability.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: Ambient Code Bot <bot@ambient-code.local>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 8a2310a commit e561842

1 file changed

Lines changed: 8 additions & 4 deletions


components/manifests/base/core/operator-deployment.yaml

@@ -142,6 +142,10 @@ spec:
             value: "production"
           - name: VERSION
             value: "latest" # Override with actual version in production
+          # Go memory limit — tells GC to be aggressive before hitting container limit.
+          # Set to ~87.5% of memory limit (4Gi) per Go best practices.
+          - name: GOMEMLIMIT
+            value: "3500MiB"
         volumeMounts:
           # Model manifest (mounted ConfigMap — kubelet auto-syncs changes)
           - name: model-manifest
@@ -153,11 +157,11 @@ spec:
             readOnly: true
         resources:
           requests:
-            cpu: 50m
-            memory: 128Mi
-          limits:
-            cpu: 200m
+            cpu: 100m
             memory: 512Mi
+          limits:
+            cpu: "2"
+            memory: 4Gi
         livenessProbe:
           httpGet:
             path: /healthz
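
A note on the arithmetic in the `GOMEMLIMIT` comment in the diff: 3500MiB is roughly 85.4% of the 4Gi (4096MiB) limit, while exactly 87.5% would be 3584MiB. Either value leaves headroom below the container ceiling, so the setting still does its job; only the stated percentage is slightly off. The sketch below reproduces the calculation. The helper names `memLimitMiB` and `percentOfLimit` are hypothetical and not part of the operator.

```go
package main

import "fmt"

const mib = 1 << 20

// memLimitMiB returns a GOMEMLIMIT value in whole MiB for a given container
// memory limit (in bytes) and the fraction of it to hand to the Go runtime.
func memLimitMiB(containerLimitBytes int64, fraction float64) int64 {
	return int64(float64(containerLimitBytes)*fraction) / mib
}

// percentOfLimit reports what share of the container limit a GOMEMLIMIT value uses.
func percentOfLimit(goMemLimitBytes, containerLimitBytes int64) float64 {
	return 100 * float64(goMemLimitBytes) / float64(containerLimitBytes)
}

func main() {
	const containerLimit = 4 << 30 // 4Gi, the new memory limit in the manifest

	fmt.Printf("%.1f%%\n", percentOfLimit(3500*mib, containerLimit)) // prints 85.4%
	fmt.Printf("%dMiB\n", memLimitMiB(containerLimit, 0.875))        // prints 3584MiB
}
```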
