Context
While investigating the operator OOMKill on vteam-uat (see linked PR), I audited the operator codebase for memory and scalability issues. The immediate fix is bumping resource limits, but there are several code-level improvements that would reduce the operator's memory footprint and improve its behavior at scale.
Current cluster scale (vteam-uat): 4,319 AgenticSessions, 674 ProjectSettings, 763 namespaces, 477 pods (80 runner pods). Raw JSON for all sessions: ~25MB, estimated ~75MB+ in Go heap as unstructured maps.
Items are sorted by expected impact, with real metrics from the vteam-uat investigation.
1. Scope controller-runtime cache to managed namespaces only
Impact: HIGH — largest single memory reduction
Current behavior: The controller-runtime cache watches ALL AgenticSessions and ALL Pods across all 763 namespaces. The pod watch caches every pod in the cluster (477 pods), even though only 80 are runner pods. The cache predicates only filter which events trigger reconciliation — they do NOT filter what's stored in the cache.
Measured impact: The full AgenticSession cache is ~25MB raw JSON (~75MB+ in Go heap). All 477 pods are cached even though 397 are irrelevant (non-runner system pods).
Fix: Use cache.Options.ByObject with label selectors or namespace selectors to restrict what the controller-runtime cache stores:
- Add a label selector on Pods so only -runner pods are cached
- Consider namespace-scoped caching for managed namespaces only (requires cache.Options.DefaultNamespaces or per-object overrides)
Expected improvement: ~50% reduction in cache memory by excluding non-runner pods and unmanaged namespace resources.
Files: components/operator/internal/controller/agenticsession_controller.go, components/operator/main.go (manager Options)
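A minimal sketch of the manager wiring, assuming controller-runtime v0.15+ (where cache.Options.ByObject exists); the "app: runner" label key/value is a placeholder for whatever label the runner pods actually carry:

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Cache: cache.Options{
			ByObject: map[client.Object]cache.ByObject{
				// Only pods matching the selector are stored in the cache;
				// all other pods become invisible to cached reads.
				&corev1.Pod{}: {
					Label: labels.SelectorFromSet(labels.Set{"app": "runner"}),
				},
				// Per-object Namespaces (or cache.Options.DefaultNamespaces)
				// could additionally restrict caching to managed namespaces.
			},
		},
	})
	_ = mgr
	_ = err
}
```

Note that unlike event predicates, this filters at the informer level, so excluded objects never reach memory at all.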
2. Unbounded conditions array growth on AgenticSession status
Impact: HIGH — linear memory growth per session over time
Current behavior: setCondition() in helpers.go (lines 220-258) appends new conditions to the status without any size limit or deduplication. Each reconciliation can add multiple conditions (pod scheduled, pod created, runner started, etc.). Over the lifetime of a session, the conditions array grows unbounded.
Measured impact: With 4,319 sessions being reconciled, each carrying potentially hundreds of condition entries, this bloats both the CR storage in etcd and the in-memory cache. A single session with many reconciliation cycles could have a conditions array in the tens of KB.
Fix: Implement condition pruning — keep only the most recent N conditions per type (e.g., last 20), or prune conditions older than a threshold. Standard Kubernetes condition semantics keep one entry per type and update lastTransitionTime.
Expected improvement: Reduces both etcd storage and in-memory cache size per session. At scale, this could reclaim 10-30% of session object memory.
Files: components/operator/internal/handlers/helpers.go (lines 220-258)
3. HTTP client created per API call in reconciliation loops
Impact: MEDIUM — connection pool waste and TIME_WAIT accumulation
Current behavior: In sessions.go (lines ~1618, 1650, 1679, 1750), a new http.Client is created for each API call inside loops (e.g., iterating over repos to add/remove). Each client creates its own connection pool and TCP connections.
Measured impact: During reconciliation of sessions with multiple repos, this creates many short-lived HTTP clients. TCP connections enter TIME_WAIT state and accumulate, consuming file descriptors and memory. With 10 concurrent reconciliations processing repo operations simultaneously, this creates bursts of unnecessary connections.
Fix: Create a single package-level http.Client with appropriate timeouts and connection pooling. Reuse across all API calls.
Expected improvement: Reduces transient memory spikes during reconciliation, eliminates TIME_WAIT socket accumulation. Modest steady-state improvement but significant during burst operations.
Files: components/operator/internal/handlers/sessions.go
4. Unbounded project timeout cache
Impact: MEDIUM — slow linear growth with namespace count
Current behavior: psTimeoutCache in inactivity.go (lines 57-102) stores timeout entries per namespace in a map with no eviction policy. Every unique namespace accessed adds a permanent entry. The cache has no maximum size.
Measured impact: With 763 namespaces, each entry is small (namespace string + timeout value + timestamp), so total size is modest (~100KB). But there's no cleanup — entries for deleted namespaces are never removed.
Fix: Add TTL-based eviction (remove entries unused for >1 hour) or switch to an LRU cache with a size cap (e.g., 200 entries). The existing 60-second TTL on individual entries helps, but entries themselves are never removed from the map.
Expected improvement: Prevents unbounded map growth. Small absolute impact at current scale but prevents accumulation over months of operation.
Files: components/operator/internal/handlers/inactivity.go (lines 57-102)
5. Watch loops without cancellation context
Impact: MEDIUM — resource leak on shutdown/restart
Current behavior: WatchNamespaces() and WatchProjectSettings() in main.go (lines 160-161) are started as goroutines with context.TODO(). They use infinite for loops and cannot be gracefully shut down. When the operator is terminated, these goroutines and their associated Kubernetes API watch connections are not cleanly closed.
Measured impact: On each of the 128 restarts, old watch connections may not be properly closed before the process is killed. The 2-second sleep between reconnects means the operator could have stale connections during the brief window before OOMKill.
Fix: Accept a cancellable context from main() (derived from ctrl.SetupSignalHandler()), pass it to watch functions, and handle context cancellation for clean shutdown. Better yet, migrate these to controller-runtime controllers (the code comment on line 158-159 already notes this as future work).
Expected improvement: Clean shutdown, no resource leaks across restarts. Eliminates a class of subtle bugs under frequent restart scenarios.
Files: components/operator/main.go (lines 160-161), components/operator/internal/handlers/namespaces.go, components/operator/internal/handlers/projectsettings.go
6. Unbounded response body reads
Impact: LOW — only triggers on error paths
Current behavior: In sessions.go (line 1786), io.ReadAll(resp.Body) reads the entire HTTP response body into memory without any size limit. If a runner returns a large error response, the entire body is loaded.
Fix: Use io.LimitedReader with a cap (e.g., 64KB) to prevent unbounded reads.
Files: components/operator/internal/handlers/sessions.go
Related
- PR for immediate resource limit fix: (linked)
🤖 Generated with Claude Code