feat: operator OOM reproducer for scalability triage (#1300) #1308
jeremyeder wants to merge 7 commits into ambient-code:main
Conversation
Enable pprof endpoints on :6060 via ENABLE_PPROF=true env var, for capturing heap profiles during OOM investigation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bash script that pushes the agentic-operator to OOMKill on a local kind cluster by deliberately undersizing its memory limit, then creating bulk namespaces with managed labels (triggering the ProjectSettings + RBAC cascade) and AgenticSession CRs to fill the controller-runtime cache. Supports pprof heap capture between batches and automatic cleanup. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Note: Reviews paused. This branch is under active development; to avoid overwhelming you with review comments from an influx of new commits, CodeRabbit has automatically paused this review.
No actionable comments were generated in the recent review. 🎉
🚧 Files skipped from review as they are similar to previous changes (1)
📝 Walkthrough
Scopes controller-runtime cache to runner Pods (
Changes
Important: Pre-merge checks failed. Please resolve all errors before merging; addressing warnings is optional.
❌ Failed checks (1 error, 1 warning)
✅ Passed checks (6 passed)
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/operator/main.go`:
- Around line 160-167: The pprof server is currently bound to all interfaces in
the anonymous goroutine (when ENABLE_PPROF is true) via
http.ListenAndServe(":6060", nil); change the listen address to the loopback
interface so it only accepts local connections (use "127.0.0.1:6060" instead of
":6060") in that goroutine and keep the same logger.Error handling for
http.ListenAndServe failing.
In `@scripts/scalability/reproduce-oom.sh`:
- Around line 129-153: The cleanup currently only restores when ORIG_* limit
vars exist and always deletes ENABLE_PPROF/GOMEMLIMIT; modify the script to
snapshot presence and values of operator resource limits and env vars (keep
ORIG_MEM_LIMIT, ORIG_CPU_LIMIT, ORIG_MEM_REQUEST, ORIG_CPU_REQUEST and add
ORIG_ENABLE_PPROF and ORIG_GOMEMLIMIT at start), and change the restore logic so
it always runs: if an ORIG_* limit value is non-empty, reapply it with kubectl
set resources for deployment "$OPERATOR_DEPLOY" in "$OPERATOR_NS", and if the
original value was empty remove that specific limit/request explicitly; likewise
for ENABLE_PPROF and GOMEMLIMIT, if ORIG_ENABLE_PPROF/ORIG_GOMEMLIMIT are set
restore them to their original values, otherwise remove the env var (don't
unconditionally delete). Ensure the cleanup code references the same ORIG_* vars
you snapshot so the deployment returns exactly to its pre-run state.
- Around line 245-247: The script only updates limits which can be rejected if
requests.memory > MEM_LIMIT; modify the kubectl call that updates the deployment
(the line invoking kubectl set resources deployment/"$OPERATOR_DEPLOY" -n
"$OPERATOR_NS" --limits="memory=${MEM_LIMIT}") to update requests as well so
requests.memory <= MEM_LIMIT in the same patch (e.g., include a
--requests="memory=${MEM_LIMIT}" or compute a safe request value and pass it
alongside --limits) to avoid invalid updates.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: eb8402e7-cc17-4193-9f3d-adb4447b49e9
📒 Files selected for processing (2)
- components/operator/main.go
- scripts/scalability/reproduce-oom.sh
```go
// Optional pprof server for memory profiling (enable via ENABLE_PPROF=true)
if os.Getenv("ENABLE_PPROF") == "true" {
	go func() {
		logger.Info("pprof server listening on :6060")
		if err := http.ListenAndServe(":6060", nil); err != nil {
			logger.Error(err, "pprof server failed")
		}
	}()
```
Bind pprof to loopback instead of all pod interfaces.
With ENABLE_PPROF=true, this listens on :6060 without auth, so any in-cluster caller that can reach the pod can pull heap/profile data. kubectl port-forward still works if you bind to 127.0.0.1:6060.
Minimal fix:

```diff
-		logger.Info("pprof server listening on :6060")
-		if err := http.ListenAndServe(":6060", nil); err != nil {
+		logger.Info("pprof server listening on 127.0.0.1:6060")
+		if err := http.ListenAndServe("127.0.0.1:6060", nil); err != nil {
 			logger.Error(err, "pprof server failed")
 		}
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```go
// Optional pprof server for memory profiling (enable via ENABLE_PPROF=true)
if os.Getenv("ENABLE_PPROF") == "true" {
	go func() {
		logger.Info("pprof server listening on 127.0.0.1:6060")
		if err := http.ListenAndServe("127.0.0.1:6060", nil); err != nil {
			logger.Error(err, "pprof server failed")
		}
	}()
```
```bash
echo "Restoring original operator resource limits..."
if [[ -n "${ORIG_MEM_LIMIT:-}" || -n "${ORIG_CPU_LIMIT:-}" ]]; then
  local limits_arg=""
  if [[ -n "${ORIG_MEM_LIMIT:-}" ]]; then
    limits_arg="memory=${ORIG_MEM_LIMIT}"
  fi
  if [[ -n "${ORIG_CPU_LIMIT:-}" ]]; then
    [[ -n "$limits_arg" ]] && limits_arg="${limits_arg},"
    limits_arg="${limits_arg}cpu=${ORIG_CPU_LIMIT}"
  fi
  local requests_arg=""
  if [[ -n "${ORIG_MEM_REQUEST:-}" ]]; then
    requests_arg="memory=${ORIG_MEM_REQUEST}"
  fi
  if [[ -n "${ORIG_CPU_REQUEST:-}" ]]; then
    [[ -n "$requests_arg" ]] && requests_arg="${requests_arg},"
    requests_arg="${requests_arg}cpu=${ORIG_CPU_REQUEST}"
  fi
  kubectl set resources deployment/"$OPERATOR_DEPLOY" -n "$OPERATOR_NS" \
    --limits="$limits_arg" --requests="$requests_arg" >/dev/null 2>&1 || true
fi

echo "Removing ENABLE_PPROF and GOMEMLIMIT env vars..."
kubectl set env deployment/"$OPERATOR_DEPLOY" -n "$OPERATOR_NS" \
  ENABLE_PPROF- GOMEMLIMIT- >/dev/null 2>&1 || true
```
--cleanup does not actually restore the original deployment state.
Two cases break restoration here:
- If the operator originally had no limits, the temporary memory limit added by the script is never cleared, because restore only runs when an original limit existed.
- ENABLE_PPROF and GOMEMLIMIT are never snapshotted, so cleanup always deletes them even when the deployment had real pre-existing values.

That leaves the operator mutated after the repro.
Also applies to: 226-239
```bash
echo "  Setting memory limit to $MEM_LIMIT..."
kubectl set resources deployment/"$OPERATOR_DEPLOY" -n "$OPERATOR_NS" \
  --limits="memory=${MEM_LIMIT}" >/dev/null 2>&1
```
Lowering only the limit can make the deployment update invalid.
If the operator already has a memory request above $MEM_LIMIT, this patch is rejected (requests.memory cannot exceed limits.memory) and the script exits before the repro starts. Set a memory request <= $MEM_LIMIT in the same update.
One safe way to patch it:

```diff
 echo "  Setting memory limit to $MEM_LIMIT..."
-kubectl set resources deployment/"$OPERATOR_DEPLOY" -n "$OPERATOR_NS" \
-  --limits="memory=${MEM_LIMIT}" >/dev/null 2>&1
+PATCH_REQUESTS="memory=${MEM_LIMIT}"
+if [[ -n "${ORIG_CPU_REQUEST:-}" ]]; then
+  PATCH_REQUESTS="${PATCH_REQUESTS},cpu=${ORIG_CPU_REQUEST}"
+fi
+kubectl set resources deployment/"$OPERATOR_DEPLOY" -n "$OPERATOR_NS" \
+  --limits="memory=${MEM_LIMIT}" \
+  --requests="$PATCH_REQUESTS" >/dev/null 2>&1
```

📝 Committable suggestion
```bash
echo "  Setting memory limit to $MEM_LIMIT..."
PATCH_REQUESTS="memory=${MEM_LIMIT}"
if [[ -n "${ORIG_CPU_REQUEST:-}" ]]; then
  PATCH_REQUESTS="${PATCH_REQUESTS},cpu=${ORIG_CPU_REQUEST}"
fi
kubectl set resources deployment/"$OPERATOR_DEPLOY" -n "$OPERATOR_NS" \
  --limits="memory=${MEM_LIMIT}" \
  --requests="$PATCH_REQUESTS" >/dev/null 2>&1
```
- Use pprof heap stats for memory monitoring (kubectl top needs metrics-server, which kind doesn't have by default)
- Keep a persistent port-forward for the pprof endpoint instead of start/stop per capture
- Bump defaults: 1000 namespaces, 10 sessions/ns (10,000 total), since 2,500 sessions only used ~36MB at a 128Mi limit
- Tee all output to a scalability-runs/ log file
- Add pprof-dumps/ and scalability-runs/ to .gitignore

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The awk field for the HeapInuse value comes after the '= ' delimiter, not at field $3. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…1300) Four targeted fixes based on pprof heap analysis from the OOM reproducer:
1. Scope the controller-runtime Pod cache to the app=ambient-runner label. Previously it cached ALL pods cluster-wide; on vteam-uat, 397 of 477 pods were non-runner system pods wasting cache memory.
2. Reuse a single shared HTTP client for runner API calls instead of creating a new http.Client per request inside loops. Eliminates connection-pool waste and TIME_WAIT socket accumulation.
3. Bound io.ReadAll to 64KB for error response bodies. Previously unbounded — accounted for 20% of heap in pprof profiles.
4. Cap the project timeout cache (psTimeoutCache) at 500 entries with TTL-based eviction. Previously it grew unbounded with namespace count.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a cache TransformFunc that strips spec and heavy status fields from AgenticSession objects in terminal phases (Completed, Failed, Stopped). These sessions dominate the cache at scale — on vteam-uat, most of 4,319 sessions are terminal. Stripping reduces per-object memory from ~3.5KB to ~500 bytes while preserving metadata for restart detection (desired-phase annotation). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Set session status to Completed after creation so the cache TransformFunc actually exercises the terminal session stripping. Without this, all test sessions stayed in Pending forever and the transform never kicked in. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The reproducer confirmed that these fixes don't help enough to matter, so I'm going to close this for now. The real fix will be establishing quotas at some level within the cluster. The operator currently scales with the number of AgenticSession CRs, so its memory grows in that way. There are many solutions - this patch could be one of them. Deferring for now until we hit an actual need. 512MB is a joke - we should never have hit this.
Summary
Reproducer script and pprof support for issue #1300 (operator scalability improvements). Deliberately undersizes the operator's memory limit on a local kind cluster and creates bulk resources until OOMKill, proving the scalability issues at smaller scale.
Changes:
- components/operator/main.go — add optional pprof server on :6060 (enabled via ENABLE_PPROF=true). Zero production impact — opt-in only.
- scripts/scalability/reproduce-oom.sh — reproducer script that:
  - creates bulk namespaces labeled ambient-code.io/managed=true (triggers ProjectSettings + RBAC cascade)

Usage:
```shell
# After make kind-up LOCAL_IMAGES=true CONTAINER_ENGINE=docker
./scripts/scalability/reproduce-oom.sh \
  --mem-limit 128Mi \
  --namespaces 300 \
  --sessions-per-ns 5 \
  --batch-size 50 \
  --pprof \
  --cleanup
```

Heap profile analysis:
Test plan
- make kind-up LOCAL_IMAGES=true CONTAINER_ENGINE=docker
- --pprof — verify heap profiles captured
- --cleanup — verify namespaces deleted and operator restored

Closes #1300 (reproducer only — code fixes are separate PRs)
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Chores
Performance & Reliability
Bug Fixes