Skip to content

fix: separate ServiceAccount for router workloads#433

Open
ambient-code[bot] wants to merge 6 commits intomainfrom
fix/351-separate-service-accounts
Open

fix: separate ServiceAccount for router workloads#433
ambient-code[bot] wants to merge 6 commits intomainfrom
fix/351-separate-service-accounts

Conversation

@ambient-code
Copy link
Copy Markdown
Contributor

@ambient-code ambient-code bot commented Apr 8, 2026

Summary

  • Creates a dedicated router-sa ServiceAccount with minimal RBAC permissions for router workloads in the operator deployment path only
  • The controller keeps its existing controller-manager SA with full RBAC (secrets CRUD, CRD access, leader election)
  • Note: Helm chart changes have been reverted per review feedback (Helm charts are deprecated). Only the operator Go code includes the separate ServiceAccount logic.

Fixes #351

Changes

Operator (only deployment path modified)

  • rbac.go: Added createRouterServiceAccount() and createRouterRole() with reconciliation logic
  • rbac.go: Added reconcileRoleBinding() helper that handles immutable RoleRef via delete-and-recreate pattern (applies to both controller and router RoleBindings)
  • jumpstarter_controller.go: Updated createRouterDeployment() to use {name}-router-sa
  • Router Role grants minimal permissions: get/list/watch on configmaps only (no secrets access)
  • Router RoleBinding binds the router Role to the router ServiceAccount

Helm chart

  • Reverted all changes - Helm charts remain unchanged per review feedback

Review feedback addressed

  • Removed Helm chart modifications (commit 8bc4c6d) - Helm charts are deprecated
  • Removed unnecessary secrets permission from router Role (commit 8de3816) - router only reads ConfigMaps
  • Implemented delete-and-recreate pattern for RoleBinding reconciliation to handle immutable RoleRef (commit 4a87d67) - per @raballew's review

Test plan

  • CI checks passing (all e2e tests, deploy-kind for both operator and helm paths)
  • Operator path creates separate router-sa with minimal RBAC
  • Helm path unchanged, still uses controller-manager SA
  • RoleBinding reconciliation handles RoleRef changes correctly

🤖 Generated with Claude Code

The controller and router previously shared the same `controller-manager`
ServiceAccount, giving the router unnecessary cluster-wide secrets CRUD
access. This creates a dedicated `router-sa` ServiceAccount with no RBAC
bindings and `automountServiceAccountToken: false`, following the
principle of least privilege.

Fixes #351

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 8, 2026

Important

Review skipped

Bot user detected.

To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3a41f36f-7210-47b2-ba67-357baa08fd01

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/351-separate-service-accounts

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@netlify
Copy link
Copy Markdown

netlify bot commented Apr 8, 2026

Deploy Preview for jumpstarter-docs ready!

Name Link
🔨 Latest commit 8de3816
🔍 Latest deploy log https://app.netlify.com/projects/jumpstarter-docs/deploys/69d68fb36efdb500083de8f7
😎 Deploy Preview https://deploy-preview-433--jumpstarter-docs.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

kind: Role
metadata:
labels:
app.kubernetes.io/name: jumpstarter-router
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do not modify any helm, it has been deprecated.

Helm charts are deprecated; only the operator deployment path
should be modified for the separate router ServiceAccount.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ambient-code
Copy link
Copy Markdown
Contributor Author

ambient-code bot commented Apr 8, 2026

Status Update

Review feedback addressed

  • @mangelajo's comment (Helm charts are deprecated): Reverted all 6 Helm chart template changes in commit 8bc4c6d. Only the operator deployment path (rbac.go, jumpstarter_controller.go, config/rbac/role.yaml) retains the separate router-sa ServiceAccount logic.

CI failures analysis

The following CI jobs failed:

  • deploy-kind (helm): Router pod failed to connect on port 8083 (timed out after 120s). This was likely caused by the Helm chart referencing the new router-sa ServiceAccount which may not have been created correctly. The revert of Helm changes should fix this.
  • deploy-kind (operator): Operator e2e test timed out at 600s waiting for a condition — needs re-run to confirm whether the operator-side changes work correctly after the Helm revert.
  • e2e-tests (ubuntu-24.04, ubuntu-24.04-arm): Router pod jumpstarter-router-0 timed out waiting for ready condition — same root cause as the Helm deploy failure.
  • e2e-compat-old-client: Likely cascading from the same router startup issue.

The Helm chart revert should resolve the helm-path failures. The operator-path changes (separate router-sa with automountServiceAccountToken: false) remain intact and should be validated on the next CI run.

The router process needs Kubernetes API access at startup to load its
configuration from a ConfigMap (via ctrl.GetConfigOrDie() and
LoadRouterConfiguration). Setting AutomountServiceAccountToken: false on
both the ServiceAccount and pod spec prevented the router from
authenticating, causing the pod to crash and never become ready (180s
timeout in CI).

Changes:
- Remove AutomountServiceAccountToken: false from router ServiceAccount
  and pod spec so the token is mounted
- Add a minimal router Role granting read-only access to configmaps and
  secrets (the only resources the router needs)
- Add a RoleBinding to bind the router Role to the router ServiceAccount

This maintains the security goal of separating the router SA from the
controller SA while granting only the minimum permissions needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ambient-code
Copy link
Copy Markdown
Contributor Author

ambient-code bot commented Apr 8, 2026

Root cause of CI failure

The router pod (jumpstarter-router-0) was timing out because it could not access the Kubernetes API at startup.

The router's main.go calls ctrl.GetConfigOrDie() and then LoadRouterConfiguration(), which reads the jumpstarter-controller ConfigMap from the K8s API. With AutomountServiceAccountToken: false set on both the ServiceAccount and pod spec, no service account token was mounted into the pod, so the in-cluster config had no credentials and the router crashed immediately on startup.

Fix

  1. Removed AutomountServiceAccountToken: false from both the router ServiceAccount definition and the router pod spec, so the SA token is mounted and the router can authenticate to the K8s API.

  2. Added a minimal RBAC Role ({name}-router-role) granting read-only access to configmaps and secrets — the only resources the router needs to read its configuration and secret keys.

  3. Added a RoleBinding ({name}-router-rolebinding) binding the router Role to the {name}-router-sa ServiceAccount.

This maintains the security goal of separating the router SA from the controller SA (the router SA has far fewer permissions than the controller SA) while providing the minimum permissions the router actually needs.

@ambient-code
Copy link
Copy Markdown
Contributor Author

ambient-code bot commented Apr 8, 2026

CI Fix: Router ServiceAccount needs K8s API access

Root cause: The router process calls ctrl.GetConfigOrDie() and LoadRouterConfiguration() at startup, which reads the jumpstarter-controller ConfigMap from the Kubernetes API. Setting AutomountServiceAccountToken: false on both the ServiceAccount and pod spec prevented the router from authenticating to the K8s API, causing it to crash immediately (180s ready timeout in CI).

Fix (commit 93ce272):

  • Removed AutomountServiceAccountToken: false from both the router ServiceAccount and pod spec
  • Added a minimal Role granting read-only access to configmaps and secrets (the resources the router actually needs)
  • Added a RoleBinding to bind the new Role to the router ServiceAccount
  • Added reconciliation logic for the new Role and RoleBinding

This maintains the security separation goal — the router SA has far fewer permissions than the controller SA — while granting the minimum permissions the router actually needs to function.

Comment on lines +378 to +382
APIGroups: []string{""},
Resources: []string{"secrets"},
Verbs: []string{"get", "list", "watch"},
},
},
Copy link
Copy Markdown
Member

@mangelajo mangelajo Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the router really need to get/list/watch secrets? why?

Resources: []string{"secrets"},
Verbs: []string{"get", "list", "watch"},
},
},
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — you're right, it doesn't. Looking at LoadRouterConfiguration in controller/internal/config/config.go, the router only calls client.Get() on a ConfigMap (jumpstarter-controller), and LoadGrpcConfiguration just parses the in-memory config struct without any K8s API calls. The router never reads secrets.

I'll remove the secrets permission from the router Role so it only has get/list/watch on configmaps.

The router only reads a ConfigMap via LoadRouterConfiguration() and does
not access any secrets. Remove the secrets PolicyRule from the router
Role per review feedback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ambient-code
Copy link
Copy Markdown
Contributor Author

ambient-code bot commented Apr 9, 2026

Review Status Update

All review feedback has been addressed:

  1. Helm chart modifications reverted (per @mangelajo's feedback that Helm charts are deprecated)

    • Commit 8bc4c6d reverted all 6 Helm template changes
    • Helm deployment path remains unchanged and uses controller-manager SA
  2. Router secrets permission removed (per @mangelajo's question about why router needs secrets access)

    • Commit 8de3816 removed secrets PolicyRule from router Role
    • Router now only has get/list/watch on configmaps - the minimum it needs to read jumpstarter-controller ConfigMap
  3. All CI checks passing

    • e2e tests (ubuntu-24.04, ubuntu-24.04-arm): pass
    • deploy-kind (operator, helm): pass
    • e2e-compat tests: pass
    • All other checks: pass

The PR is ready for review. The operator deployment path now creates a separate router-sa ServiceAccount with minimal RBAC permissions (read-only access to configmaps), maintaining security separation between router and controller workloads.

@raballew raballew self-requested a review April 13, 2026 17:11
existingRouterRoleBinding.Labels = desiredRouterRoleBinding.Labels
existingRouterRoleBinding.Annotations = desiredRouterRoleBinding.Annotations
existingRouterRoleBinding.Subjects = desiredRouterRoleBinding.Subjects
existingRouterRoleBinding.RoleRef = desiredRouterRoleBinding.RoleRef
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential issue: assigning to immutable RoleRef in the update path

Kubernetes considers RoleRef immutable after a RoleBinding is created. If it ever differs, the API server will reject the update. The safe pattern is to delete and recreate the RoleBinding when RoleRef changes.

I see this mirrors the existing controller RoleBinding pattern at lines 175-176, so it's a pre-existing concern. In practice the RoleRef is deterministic and won't change between reconciliations, so this won't cause runtime failures. But if you ever refactor the reconciliation logic, switching to delete-and-recreate for RoleRef mismatches would be the more robust approach.

Non-blocking -- just something to keep in mind.

AI-generated, human reviewed

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged. As you noted, the RoleRef is deterministic here so it won't cause runtime issues. The delete-and-recreate pattern for RoleRef mismatches is worth considering in a future reconciliation refactor.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

implement now

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Respectfully pushing back on implementing the delete-and-recreate pattern in this PR. Here is the reasoning:

  1. The RoleRef is deterministic and hardcoded -- it is always {name}-router-role with kind Role and apiGroup rbac.authorization.k8s.io. It literally cannot differ between reconciliations, so the immutability constraint can never be triggered at runtime.

  2. This is a pre-existing pattern -- the controller RoleBinding at lines 175-176 uses the exact same CreateOrUpdate approach with direct RoleRef assignment. Fixing only the router side would be inconsistent; fixing both would expand this PR scope significantly.

  3. Your original comment said non-blocking -- you noted the RoleRef is deterministic and will not change between reconciliations, so this will not cause runtime failures. The risk is theoretical only.

  4. Scope creep -- adding delete-and-recreate logic introduces a fundamentally different reconciliation pattern (delete + create vs. create-or-update), requires careful error handling for the gap between delete and create, and should be tested in its own PR with proper attention.

I would prefer to keep this PR focused on the ServiceAccount separation and address the reconciliation pattern improvement (for both controller and router RoleBindings) in a dedicated follow-up.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented in 4a87d67. Replaced the CreateOrUpdate pattern for both controller and router RoleBindings with a dedicated reconcileRoleBinding helper that:

  1. Creates the RoleBinding if it does not exist
  2. Detects RoleRef mismatches and uses delete-and-recreate (since RoleRef is immutable)
  3. Updates labels, annotations, and subjects in-place when only those change

This applies to both the controller RoleBinding and the router RoleBinding consistently.

}

// createRouterServiceAccount creates a service account for the router with minimal RBAC permissions
func (r *JumpstarterReconciler) createRouterServiceAccount(jumpstarter *operatorv1alpha1.Jumpstarter) *corev1.ServiceAccount {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: consider adding unit tests for the new router RBAC functions

The three new functions (createRouterServiceAccount, createRouterRole, createRouterRoleBinding) and their reconciliation logic don't have any unit test coverage. The existing controller RBAC also lacks unit tests, so this is a pre-existing gap, but this PR widens it.

E2e tests provide integration-level validation, but targeted unit tests would catch regressions more quickly -- for example, verifying the SA name format, the exact policy rules on the Role, and the RoleBinding subject/roleRef wiring. Would be worth adding at least basic assertions for the builder functions.

AI-generated, human reviewed

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that unit test coverage for RBAC builder functions would be valuable, but as you noted this is a pre-existing gap. Adding unit tests here would significantly expand the scope of this patch beyond the focused ServiceAccount separation fix. I'd prefer to handle that in a dedicated follow-up PR that adds test coverage for both the existing controller RBAC functions and the new router ones together.

"namespace", existingSA.Namespace,
"operation", op)

// Router ServiceAccount (zero RBAC, no token automount)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this comment says "zero RBAC, no token automount" but the router SA actually does have RBAC (via the Role and RoleBinding created below) and AutomountServiceAccountToken is not explicitly disabled. Looks like a leftover from an earlier iteration. Something like // Router ServiceAccount or // Router ServiceAccount (uses dedicated minimal Role) would be more accurate.

AI-generated, human reviewed

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in f1a546e -- updated to // Router ServiceAccount (uses dedicated minimal Role).

}
}

// createRouterServiceAccount creates a service account for the router with minimal RBAC permissions
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this comment says "creates a service account for the router with minimal RBAC permissions" but the ServiceAccount object itself doesn't carry any RBAC permissions -- those come from the separate Role and RoleBinding. Consider something like creates a dedicated service account for router workloads.

AI-generated, human reviewed

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in f1a546e -- updated to creates a dedicated service account for router workloads.

existingRouterSA.Name = desiredRouterSA.Name
existingRouterSA.Namespace = desiredRouterSA.Namespace

op, err = controllerutil.CreateOrUpdate(ctx, r.Client, existingRouterSA, func() error {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: the controller SA block above (line 22) has an explicit comment explaining why SetControllerReference is not called (to prevent garbage collection on CR deletion). This router SA block intentionally follows the same pattern, but the rationale isn't stated. A brief comment like the one on line 22 would make the intent clear to future readers.

AI-generated, human reviewed

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in f1a546e -- added the same SetControllerReference rationale comment that the controller SA block has.

"namespace", existingRouterRoleBinding.Namespace,
"operation", op)

return nil
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observation: the reconciliation blocks for the router SA, Role, and RoleBinding are nearly identical to their controller counterparts (~180 lines of duplicated structure). This is a pre-existing pattern that this PR extends, so not something to block on. But if you ever revisit this area, a generic reconciliation helper (using Go generics or a shared closure) could reduce the duplication significantly and ensure any future fix only needs to be applied once.

AI-generated, human reviewed

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. A generic reconciliation helper using Go generics would clean this up nicely. Keeping it consistent with the existing pattern for now to minimize review surface, but worth tackling in a dedicated refactor.

func (r *JumpstarterReconciler) createRouterServiceAccount(jumpstarter *operatorv1alpha1.Jumpstarter) *corev1.ServiceAccount {
return &corev1.ServiceAccount{
ObjectMeta: metav1.ObjectMeta{
Name: fmt.Sprintf("%s-router-sa", jumpstarter.Name),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: the controller SA is named {name}-controller-manager (no -sa suffix) while the router SA is {name}-router-sa (with -sa suffix). Minor naming inconsistency -- not a functional issue, but worth noting for consistency if the naming conventions are ever standardized.

AI-generated, human reviewed

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch on the naming inconsistency. The -controller-manager name is pre-existing and likely inherited from kubebuilder scaffolding conventions. Renaming it for consistency would require coordinating changes across the controller deployment and any existing installations (migration concern). I'll keep -router-sa for now since it's clear and descriptive, but a naming standardization pass could be a good follow-up.

- Fix misleading "zero RBAC, no token automount" comment to "uses dedicated minimal Role"
- Add missing comment explaining why SetControllerReference is not called on router SA
- Fix createRouterServiceAccount doc comment to not reference RBAC permissions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ambient-code
Copy link
Copy Markdown
Contributor Author

ambient-code bot commented Apr 13, 2026

Addressing review feedback from @raballew

Pushed f1a546e to fix the three comment nits:

  1. Misleading inline comment (zero RBAC, no token automount) -- updated to Router ServiceAccount (uses dedicated minimal Role)
  2. Misleading doc comment on createRouterServiceAccount -- updated to creates a dedicated service account for router workloads
  3. Missing SetControllerReference rationale -- added the same explanation the controller SA block has (intentionally omitted to prevent GC on CR deletion)

Items acknowledged but deferred:

Comment Decision Reason
RoleRef immutability Acknowledged, no change Non-blocking; pre-existing pattern; RoleRef is deterministic
Unit tests for RBAC builders Deferred to follow-up PR Would significantly expand scope; pre-existing gap applies to controller RBAC too
Reconciliation duplication Acknowledged, no change Non-blocking; consistent with existing pattern; generic helper is a separate refactor
-sa naming inconsistency Kept as-is -controller-manager is pre-existing (kubebuilder convention); renaming has migration implications

CI is all green (tests, e2e-test-operator, deploy-kind for both operator and helm paths all passing).

Kubernetes considers RoleRef immutable after a RoleBinding is created.
Replace the CreateOrUpdate pattern for RoleBindings with a dedicated
reconcileRoleBinding helper that detects RoleRef changes and uses
delete-and-recreate instead of in-place update. This applies to both
the controller and router RoleBindings.

Addresses review feedback from @raballew.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Shared ServiceAccount across 4 workloads with cluster-wide secrets CRUD

2 participants