
feat: operator OOM reproducer for scalability triage (#1300) #1308

Closed

jeremyeder wants to merge 7 commits into ambient-code:main from jeremyeder:feat/operator-oom-reproducer

Conversation

@jeremyeder
Contributor

@jeremyeder jeremyeder commented Apr 14, 2026

Summary

Reproducer script and pprof support for issue #1300 (operator scalability improvements). Deliberately undersizes the operator's memory limit on a local kind cluster and creates bulk resources until OOMKill, proving the scalability issues at smaller scale.

Changes:

  • components/operator/main.go — add optional pprof server on :6060 (enabled via ENABLE_PPROF=true). Zero production impact — opt-in only.
  • scripts/scalability/reproduce-oom.sh — reproducer script that:
    • Patches operator to 128Mi (configurable) memory limit
    • Creates namespaces with ambient-code.io/managed=true (triggers ProjectSettings + RBAC cascade)
    • Creates AgenticSession CRs to fill the controller-runtime cache
    • Monitors memory usage and detects OOMKill
    • Optionally captures pprof heap profiles between batches

Usage:

# After make kind-up LOCAL_IMAGES=true CONTAINER_ENGINE=docker
./scripts/scalability/reproduce-oom.sh \
    --mem-limit 128Mi \
    --namespaces 300 \
    --sessions-per-ns 5 \
    --batch-size 50 \
    --pprof \
    --cleanup

Heap profile analysis:

go tool pprof pprof-dumps/*/heap-batch-*.pb.gz

Test plan

  • make kind-up LOCAL_IMAGES=true CONTAINER_ENGINE=docker
  • Run reproducer with default settings — verify OOMKill detected
  • Run with --pprof — verify heap profiles captured
  • Run with --cleanup — verify namespaces deleted and operator restored
  • Verify pprof profiles show controller-runtime cache as top allocator

Closes #1300 (reproducer only — code fixes are separate PRs)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Optional runtime profiling HTTP endpoint, enabled via environment toggle.
  • Chores

    • Added a scalability test script to reproduce OOM conditions with configurable limits, batching, optional heap profiling, and cleanup.
    • Ignore generated scalability artifacts and profiling dumps in version control.
  • Performance & Reliability

    • Controller caching narrowed to relevant pods and prunes large object payloads.
    • Reused HTTP client and limited error-body reads to reduce overhead.
  • Bug Fixes

    • Bounded inactivity-timeout cache with eviction to prevent unbounded growth.

Ambient Code Bot and others added 2 commits April 14, 2026 09:33
Enable pprof endpoints on :6060 via ENABLE_PPROF=true env var,
for capturing heap profiles during OOM investigation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bash script that pushes the agentic-operator to OOMKill on a local kind
cluster by deliberately undersizing its memory limit and creating bulk
namespaces with managed labels (triggering ProjectSettings + RBAC
cascade) and AgenticSession CRs to fill the controller-runtime cache.

Supports pprof heap capture between batches and automatic cleanup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai
Contributor

coderabbitai bot commented Apr 14, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


No actionable comments were generated in the recent review. 🎉


📥 Commits

Reviewing files that changed from the base of the PR and between 2e4f632 and 196226f.

📒 Files selected for processing (2)
  • components/operator/main.go
  • scripts/scalability/reproduce-oom.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • scripts/scalability/reproduce-oom.sh

📝 Walkthrough

Scopes controller-runtime cache to runner Pods (app=ambient-runner) and adds an AgenticSession cache Transform that prunes terminal sessions’ heavy fields (replaces spec with {} and reduces status to {"phase": ...}). Adds an optional pprof HTTP server on :6060. Introduces a shared HTTP client and bounded error-body reads, caps/evicts inactivity-timeout cache entries, adds a scalability repro script, and updates .gitignore. No public APIs changed.
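The terminal-session pruning described above can be sketched as follows. This is a hypothetical reconstruction that operates on plain unstructured maps; the function name `pruneTerminalSession` and the sample payload are invented for illustration and are not the operator's actual cache Transform code.

```go
package main

import "fmt"

// pruneTerminalSession mimics the cache Transform described in the PR:
// for sessions whose status.phase is terminal, replace spec with an empty
// map and shrink status to only the phase, so cached terminal objects
// stay small. Illustrative sketch only.
func pruneTerminalSession(obj map[string]interface{}) map[string]interface{} {
	status, ok := obj["status"].(map[string]interface{})
	if !ok {
		return obj
	}
	phase, _ := status["phase"].(string)
	switch phase {
	case "Completed", "Failed", "Stopped":
		obj["spec"] = map[string]interface{}{}                 // drop the heavy spec
		obj["status"] = map[string]interface{}{"phase": phase} // keep only the phase
	}
	return obj
}

func main() {
	session := map[string]interface{}{
		"spec":   map[string]interface{}{"prompt": "very large payload"},
		"status": map[string]interface{}{"phase": "Completed", "messages": []string{"a", "b"}},
	}
	out := pruneTerminalSession(session)
	fmt.Println(len(out["spec"].(map[string]interface{})))       // 0
	fmt.Println(out["status"].(map[string]interface{})["phase"]) // Completed
}
```

Non-terminal sessions pass through untouched, which is why metadata-based restart detection (the desired-phase annotation) keeps working.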

Changes

  • Operator startup (components/operator/main.go): Configure ctrl.Options.Cache.ByObject to restrict cached corev1.Pod objects to the label app=ambient-runner; add an AgenticSession cache Transform that detects terminal status.phase values (Completed, Failed, Stopped) and prunes cached objects by replacing spec with an empty map and shrinking status to only {"phase": <phase>}; conditionally start an HTTP pprof server on :6060 when ENABLE_PPROF=="true", logging startup and errors.
  • Session handlers (components/operator/internal/handlers/sessions.go): Introduce a package-level runnerHTTPClient (pooled http.Client with timeout and transport tuning); add maxErrorBodyBytes and use io.LimitReader for non-200 responses; update repo/workflow call sites to reuse the shared client.
  • Inactivity timeout cache (components/operator/internal/handlers/inactivity.go): Add maxTimeoutCacheEntries and update cache insertion to evict expired entries first (by TTL), then drop roughly the oldest half of the remaining entries when capacity is exceeded, before inserting the new namespace entry.
  • Scalability testing script (scripts/scalability/reproduce-oom.sh): New CLI script that patches operator resources/env (memory limit, ENABLE_PPROF, GOMEMLIMIT), creates batches of namespaced AgenticSession CRs, optionally port-forwards to collect heap profiles, monitors operator memory and restarts, and optionally restores/cleans up.
  • VCS ignore (.gitignore): Add entries to ignore scalability artifacts: pprof-dumps/ and scalability-runs/.

Sequence Diagram(s)

sequenceDiagram
    participant Script as reproduce-oom.sh
    participant KubeAPI as Kubernetes API / kubectl
    participant Operator as agentic-operator Pod
    participant PortFwd as kubectl port-forward
    Script->>KubeAPI: verify cluster & deployment; patch operator (mem limit, ENABLE_PPROF)
    KubeAPI-->>Script: rollout status
    Script->>KubeAPI: create test namespaces & apply AgenticSession CRs (batched)
    KubeAPI-->>Operator: API events (new CRs)
    Operator-->>Operator: reconcile sessions (cache & processing; memory grows)
    alt pprof enabled
        Script->>PortFwd: start port-forward to Operator /debug/pprof
        PortFwd-->>Script: local port available
        Script->>PortFwd: fetch /debug/pprof/heap snapshots
        PortFwd-->>Script: heap profiles saved to pprof-dumps/
    end
    Script->>KubeAPI: poll Operator pod status & memory usage
    KubeAPI-->>Script: reports restart-count, OOMKilled or metrics
    Script->>Script: decide to continue next batch or stop/cleanup


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)

  • Security And Secret Handling (❌ Error): The pprof server exposed on :6060 without authentication allows cluster-wide access to sensitive profiling endpoints (/debug/pprof/*) when ENABLE_PPROF=true. Resolution: bind pprof to 127.0.0.1:6060 instead of :6060, add authentication/authorization middleware, and document that ENABLE_PPROF should only be used in isolated test/development environments.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 46.15%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (6 passed)
  • Description Check: Skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: Title follows the Conventional Commits format (feat: scope) and clearly describes the main addition: a new OOM reproducer script for scalability testing.
  • Linked Issues Check: PR implements 4 of 6 objectives from #1300: controller-runtime cache scoping with label/namespace selectors and terminal-phase pruning [1], shared HTTP client reuse [3], cache eviction cap [4], and response-body limiting [6].
  • Out of Scope Changes Check: All changes directly support issue #1300: pprof instrumentation enables heap profiling, reproduce-oom.sh exercises the memory-reduction features, and inactivity.go/sessions.go/main.go implement the targeted optimizations.
  • Performance And Algorithmic Complexity Check: PR implements performance improvements from issue #1300: HTTP client reuse eliminates connection-pool waste, LimitReader caps error-body reads at 64KB, the Pod cache is scoped to runner labels, and cache eviction uses sequential O(n) passes on a capped map (maxTimeoutCacheEntries=500). No algorithmic regressions detected.
  • Kubernetes Resource Safety Check: PR contains code optimizations and test changes without modifying Kubernetes manifests, RBAC, or resource definitions. The existing deployment has proper resource limits and RBAC is appropriately scoped.

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/operator/main.go`:
- Around line 160-167: The pprof server is currently bound to all interfaces in
the anonymous goroutine (when ENABLE_PPROF is true) via
http.ListenAndServe(":6060", nil); change the listen address to the loopback
interface so it only accepts local connections (use "127.0.0.1:6060" instead of
":6060") in that goroutine and keep the same logger.Error handling for
http.ListenAndServe failing.

In `@scripts/scalability/reproduce-oom.sh`:
- Around line 129-153: The cleanup currently only restores when ORIG_* limit
vars exist and always deletes ENABLE_PPROF/GOMEMLIMIT; modify the script to
snapshot presence and values of operator resource limits and env vars (keep
ORIG_MEM_LIMIT, ORIG_CPU_LIMIT, ORIG_MEM_REQUEST, ORIG_CPU_REQUEST and add
ORIG_ENABLE_PPROF and ORIG_GOMEMLIMIT at start), and change the restore logic so
it always runs: if an ORIG_* limit value is non-empty, reapply it with kubectl
set resources for deployment "$OPERATOR_DEPLOY" in "$OPERATOR_NS", and if the
original value was empty remove that specific limit/request explicitly; likewise
for ENABLE_PPROF and GOMEMLIMIT, if ORIG_ENABLE_PPROF/ORIG_GOMEMLIMIT are set
restore them to their original values, otherwise remove the env var (don't
unconditionally delete). Ensure the cleanup code references the same ORIG_* vars
you snapshot so the deployment returns exactly to its pre-run state.
- Around line 245-247: The script only updates limits which can be rejected if
requests.memory > MEM_LIMIT; modify the kubectl call that updates the deployment
(the line invoking kubectl set resources deployment/"$OPERATOR_DEPLOY" -n
"$OPERATOR_NS" --limits="memory=${MEM_LIMIT}") to update requests as well so
requests.memory <= MEM_LIMIT in the same patch (e.g., include a
--requests="memory=${MEM_LIMIT}" or compute a safe request value and pass it
alongside --limits) to avoid invalid updates.


📥 Commits

Reviewing files that changed from the base of the PR and between 041558d and d49c5aa.

📒 Files selected for processing (2)
  • components/operator/main.go
  • scripts/scalability/reproduce-oom.sh

Comment on lines +160 to +167
// Optional pprof server for memory profiling (enable via ENABLE_PPROF=true)
if os.Getenv("ENABLE_PPROF") == "true" {
	go func() {
		logger.Info("pprof server listening on :6060")
		if err := http.ListenAndServe(":6060", nil); err != nil {
			logger.Error(err, "pprof server failed")
		}
	}()

⚠️ Potential issue | 🟠 Major

Bind pprof to loopback instead of all pod interfaces.

With ENABLE_PPROF=true, this listens on :6060 without auth, so any in-cluster caller that can reach the pod can pull heap/profile data. kubectl port-forward still works if you bind to 127.0.0.1:6060.

Minimal fix
-			logger.Info("pprof server listening on :6060")
-			if err := http.ListenAndServe(":6060", nil); err != nil {
+			logger.Info("pprof server listening on 127.0.0.1:6060")
+			if err := http.ListenAndServe("127.0.0.1:6060", nil); err != nil {
 				logger.Error(err, "pprof server failed")
 			}

Comment on lines +129 to +153
echo "Restoring original operator resource limits..."
if [[ -n "${ORIG_MEM_LIMIT:-}" || -n "${ORIG_CPU_LIMIT:-}" ]]; then
  local limits_arg=""
  if [[ -n "${ORIG_MEM_LIMIT:-}" ]]; then
    limits_arg="memory=${ORIG_MEM_LIMIT}"
  fi
  if [[ -n "${ORIG_CPU_LIMIT:-}" ]]; then
    [[ -n "$limits_arg" ]] && limits_arg="${limits_arg},"
    limits_arg="${limits_arg}cpu=${ORIG_CPU_LIMIT}"
  fi
  local requests_arg=""
  if [[ -n "${ORIG_MEM_REQUEST:-}" ]]; then
    requests_arg="memory=${ORIG_MEM_REQUEST}"
  fi
  if [[ -n "${ORIG_CPU_REQUEST:-}" ]]; then
    [[ -n "$requests_arg" ]] && requests_arg="${requests_arg},"
    requests_arg="${requests_arg}cpu=${ORIG_CPU_REQUEST}"
  fi
  kubectl set resources deployment/"$OPERATOR_DEPLOY" -n "$OPERATOR_NS" \
    --limits="$limits_arg" --requests="$requests_arg" >/dev/null 2>&1 || true
fi

echo "Removing ENABLE_PPROF and GOMEMLIMIT env vars..."
kubectl set env deployment/"$OPERATOR_DEPLOY" -n "$OPERATOR_NS" \
  ENABLE_PPROF- GOMEMLIMIT- >/dev/null 2>&1 || true

⚠️ Potential issue | 🟠 Major

--cleanup does not actually restore the original deployment state.

Two cases break restoration here:

  1. if the operator originally had no limits, the temporary memory limit added by the script is never cleared because restore only runs when an original limit existed;
  2. ENABLE_PPROF and GOMEMLIMIT are never snapshotted, so cleanup always deletes them even when the deployment had real pre-existing values.

That leaves the operator mutated after the repro.

Also applies to: 226-239


Comment on lines +245 to +247
echo "  Setting memory limit to $MEM_LIMIT..."
kubectl set resources deployment/"$OPERATOR_DEPLOY" -n "$OPERATOR_NS" \
  --limits="memory=${MEM_LIMIT}" >/dev/null 2>&1

⚠️ Potential issue | 🟠 Major

Lowering only the limit can make the deployment update invalid.

If the operator already has a memory request above $MEM_LIMIT, this patch is rejected (requests.memory cannot exceed limits.memory) and the script exits before the repro starts. Set a memory request <= $MEM_LIMIT in the same update.

One safe way to patch it
 echo "  Setting memory limit to $MEM_LIMIT..."
-kubectl set resources deployment/"$OPERATOR_DEPLOY" -n "$OPERATOR_NS" \
-  --limits="memory=${MEM_LIMIT}" >/dev/null 2>&1
+PATCH_REQUESTS="memory=${MEM_LIMIT}"
+if [[ -n "${ORIG_CPU_REQUEST:-}" ]]; then
+  PATCH_REQUESTS="${PATCH_REQUESTS},cpu=${ORIG_CPU_REQUEST}"
+fi
+kubectl set resources deployment/"$OPERATOR_DEPLOY" -n "$OPERATOR_NS" \
+  --limits="memory=${MEM_LIMIT}" \
+  --requests="$PATCH_REQUESTS" >/dev/null 2>&1

Ambient Code Bot and others added 5 commits April 14, 2026 10:20
- Use pprof heap stats for memory monitoring (kubectl top needs
  metrics-server which kind doesn't have by default)
- Keep a persistent port-forward for the pprof endpoint instead of
  start/stop per capture
- Bump defaults: 1000 namespaces, 10 sessions/ns (10,000 total)
  since 2,500 sessions only used ~36MB at 128Mi limit
- Tee all output to scalability-runs/ log file
- Add pprof-dumps/ and scalability-runs/ to .gitignore

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
awk field for HeapInuse value is after '= ' delimiter, not field $3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…1300)

Four targeted fixes based on pprof heap analysis from the OOM reproducer:

1. Scope controller-runtime Pod cache to app=ambient-runner label.
   Previously cached ALL pods cluster-wide. On vteam-uat, 397 of 477
   pods were non-runner system pods wasting cache memory.

2. Reuse a single shared HTTP client for runner API calls instead of
   creating a new http.Client per request inside loops. Eliminates
   connection pool waste and TIME_WAIT socket accumulation.

3. Bound io.ReadAll to 64KB for error response bodies. Previously
   unbounded — accounted for 20% of heap in pprof profiles.

4. Cap the project timeout cache (psTimeoutCache) at 500 entries with
   TTL-based eviction. Previously grew unbounded with namespace count.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a cache TransformFunc that strips spec and heavy status fields
from AgenticSession objects in terminal phases (Completed, Failed,
Stopped). These sessions dominate the cache at scale — on vteam-uat,
most of 4,319 sessions are terminal. Stripping reduces per-object
memory from ~3.5KB to ~500 bytes while preserving metadata for
restart detection (desired-phase annotation).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Set session status to Completed after creation so the cache
TransformFunc actually exercises the terminal session stripping.
Without this, all test sessions stayed in Pending forever and
the transform never kicked in.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jeremyeder
Copy link
Copy Markdown
Contributor Author

The reproducer confirmed that these fixes don't help enough to matter, so I'm going to close this for now. The real fix will be establishing quotas at some level within the cluster. The operator currently scales with the number of AgenticSession CRs, so its memory grows accordingly. There are many possible solutions; this patch could be one of them. Deferring until we hit an actual need. 512MB is a joke: we should never have hit this.

@jeremyeder jeremyeder closed this Apr 14, 2026


Development

Successfully merging this pull request may close these issues.

Operator scalability: code-level improvements to reduce memory footprint

1 participant