Context
While investigating the operator OOMKill on vteam-uat (see linked PR), I audited the operator codebase for memory and scalability issues. The immediate fix is bumping resource limits, but there are several code-level improvements that would reduce the operator's memory footprint and improve its behavior at scale.
Current cluster scale (vteam-uat): 4,319 AgenticSessions, 674 ProjectSettings, 763 namespaces, 477 pods (80 runner pods). Raw JSON for all sessions: ~25MB, estimated ~75MB+ in Go heap as unstructured maps.
Items are sorted by expected impact, with real metrics from the vteam-uat investigation.
1. Scope controller-runtime cache to managed namespaces only
Impact: HIGH — largest single memory reduction
Current behavior: The controller-runtime cache watches ALL AgenticSessions and ALL Pods across all 763 namespaces. The pod watch caches every pod in the cluster (477 pods), even though only 80 are runner pods. The cache predicates only filter which events trigger reconciliation — they do NOT filter what's stored in the cache.
Measured impact: The full AgenticSession cache is ~25MB raw JSON (~75MB+ in Go heap). All 477 pods are cached even though 397 are irrelevant (non-runner system pods).
Fix: Use cache.Options.ByObject with label selectors or namespace selectors to restrict what the controller-runtime cache stores:
- Add a label selector on Pods so only -runner pods are cached
- Consider namespace-scoped caching for managed namespaces only (requires cache.Options.DefaultNamespaces or per-object overrides)
Expected improvement: ~50% reduction in cache memory by excluding non-runner pods and unmanaged namespace resources.
Files: components/operator/internal/controller/agenticsession_controller.go, components/operator/main.go (manager Options)
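A minimal sketch of the manager wiring, assuming controller-runtime v0.15+ (where cache.Options.ByObject exists); the "app: runner" label key/value is a placeholder for whatever label the runner pods actually carry:

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Cache: cache.Options{
			ByObject: map[client.Object]cache.ByObject{
				// Only pods matching the selector are stored in the cache;
				// all other pods become invisible to cached reads.
				&corev1.Pod{}: {
					Label: labels.SelectorFromSet(labels.Set{"app": "runner"}),
				},
				// Per-object Namespaces (or cache.Options.DefaultNamespaces)
				// could additionally restrict caching to managed namespaces.
			},
		},
	})
	_ = mgr
	_ = err
}
```

Note that unlike event predicates, this filters at the informer level, so excluded objects never reach memory at all.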
2. Unbounded conditions array growth on AgenticSession status
Impact: HIGH — linear memory growth per session over time
Current behavior: setCondition() in helpers.go (lines 220-258) appends new conditions to the status without any size limit or deduplication. Each reconciliation can add multiple conditions (pod scheduled, pod created, runner started, etc.). Over the lifetime of a session, the conditions array grows unbounded.
Measured impact: With 4,319 sessions being reconciled, each carrying potentially hundreds of condition entries, this bloats both the CR storage in etcd and the in-memory cache. A single session with many reconciliation cycles could have a conditions array in the tens of KB.
Fix: Implement condition pruning — keep only the most recent N conditions per type (e.g., last 20), or prune conditions older than a threshold. Standard Kubernetes condition semantics keep one entry per type and update lastTransitionTime.
Expected improvement: Reduces both etcd storage and in-memory cache size per session. At scale, this could reclaim 10-30% of session object memory.
Files: components/operator/internal/handlers/helpers.go (lines 220-258)
3. HTTP client created per API call in reconciliation loops
Impact: MEDIUM — connection pool waste and TIME_WAIT accumulation
Current behavior: In sessions.go (lines ~1618, 1650, 1679, 1750), a new http.Client is created for each API call inside loops (e.g., iterating over repos to add/remove). Each client creates its own connection pool and TCP connections.
Measured impact: During reconciliation of sessions with multiple repos, this creates many short-lived HTTP clients. TCP connections enter TIME_WAIT state and accumulate, consuming file descriptors and memory. With 10 concurrent reconciliations processing repo operations simultaneously, this creates bursts of unnecessary connections.
Fix: Create a single package-level http.Client with appropriate timeouts and connection pooling. Reuse across all API calls.
Expected improvement: Reduces transient memory spikes during reconciliation, eliminates TIME_WAIT socket accumulation. Modest steady-state improvement but significant during burst operations.
Files: components/operator/internal/handlers/sessions.go
4. Unbounded project timeout cache
Impact: MEDIUM — slow linear growth with namespace count
Current behavior: psTimeoutCache in inactivity.go (lines 57-102) stores timeout entries per namespace in a map with no eviction policy. Every unique namespace accessed adds a permanent entry. The cache has no maximum size.
Measured impact: With 763 namespaces, each entry is small (namespace string + timeout value + timestamp), so total size is modest (~100KB). But there's no cleanup — entries for deleted namespaces are never removed.
Fix: Add TTL-based eviction (remove entries unused for >1 hour) or switch to an LRU cache with a size cap (e.g., 200 entries). The existing 60-second TTL on individual entries helps, but entries themselves are never removed from the map.
Expected improvement: Prevents unbounded map growth. Small absolute impact at current scale but prevents accumulation over months of operation.
Files: components/operator/internal/handlers/inactivity.go (lines 57-102)
5. Watch loops without cancellation context
Impact: MEDIUM — resource leak on shutdown/restart
Current behavior: WatchNamespaces() and WatchProjectSettings() in main.go (lines 160-161) are started as goroutines with context.TODO(). They use infinite for loops and cannot be gracefully shut down. When the operator is terminated, these goroutines and their associated Kubernetes API watch connections are not cleanly closed.
Measured impact: On each of the 128 restarts, old watch connections may not be properly closed before the process is killed. The 2-second sleep between reconnects means the operator could have stale connections during the brief window before OOMKill.
Fix: Accept a cancellable context from main() (derived from ctrl.SetupSignalHandler()), pass it to watch functions, and handle context cancellation for clean shutdown. Better yet, migrate these to controller-runtime controllers (the code comment on line 158-159 already notes this as future work).
Expected improvement: Clean shutdown, no resource leaks across restarts. Eliminates a class of subtle bugs under frequent restart scenarios.
Files: components/operator/main.go (lines 160-161), components/operator/internal/handlers/namespaces.go, components/operator/internal/handlers/projectsettings.go
6. Unbounded response body reads
Impact: LOW — only triggers on error paths
Current behavior: In sessions.go (line 1786), io.ReadAll(resp.Body) reads the entire HTTP response body into memory without any size limit. If a runner returns a large error response, the entire body is loaded.
Fix: Use io.LimitedReader with a cap (e.g., 64KB) to prevent unbounded reads.
Files: components/operator/internal/handlers/sessions.go
Related
- PR for immediate resource limit fix: (linked)
🤖 Generated with Claude Code