Skip to content

Conversation

kpouget
Copy link
Contributor

@kpouget kpouget commented Sep 29, 2025

Improves the reliability of the system bring-up scripts

Builds on top of #1167

Summary by CodeRabbit

  • New Features

    • Added startup guards that wait for API services, apiservice availability and node readiness with progress, timing and faster retries.
    • Conditional TAP handling and safe disable flow plus a VSock connectivity test.
    • New flows to fetch/apply pull secrets, SSH key and user passwords (including AWS-backed retrieval).
  • Refactor

    • Unified wait/retry behavior, clearer per‑attempt messaging, stricter shell safety, safer apply/patch flows, and consistent kubeconfig handling.
  • Chores

    • Removed the filesystem-grow service and updated unit ordering and logging.

Copy link

coderabbitai bot commented Sep 29, 2025

Walkthrough

Set KUBECONFIG inline in many systemd units, add node- and APIServices-readiness units and scripts, replace EnvironmentFile usage, introduce TAP detection/conditional-disable flow and AWS secret fetcher, rewrite common wait helper and many scripts for stricter shell behavior and argument-driven inputs, and remove growfs unit/script.

Changes

Cohort / File(s) Summary
Unit env & startup edits
systemd/crc-cluster-status.service, systemd/crc-dnsmasq.service, systemd/crc-pullsecret.service, systemd/crc-routes-controller.service, systemd/crc-wait-apiserver-up.service, systemd/ocp-cluster-ca.service, systemd/ocp-clusterid.service, systemd/ocp-custom-domain.service, systemd/ocp-mco-sshkey.service, systemd/ocp-userpasswords.service
Removed EnvironmentFile=-/etc/sysconfig/crc-env and inlined Environment=KUBECONFIG=/opt/kubeconfig; adjusted unit ordering, ExecStart/ExecStartPre, StartLimit/RestartSec and WantedBy where applicable.
New readiness units & scripts
systemd/crc-wait-node-ready.service, systemd/crc-wait-node-ready.sh, systemd/ocp-wait-apiservices-available.service, systemd/ocp-wait-apiservices-available.sh
Added oneshot services and scripts to poll for node readiness and APIServices availability, with retry/timing and dependency wiring.
TAP detection & disable flow
systemd/crc-needs-tap.sh, systemd/crc-conditionally-disable-tap.sh, systemd/crc-disable-tap.sh, systemd/crc-check-tap.service, (systemd/crc-no-tap.sh removed), createdisk.sh
Introduced TAP-need detection, conditional-disable wrapper and new check service; removed old no-tap script; added ExecCondition wiring to [email protected].
Common wait helper rewrite
systemd/crc-systemd-common.sh
Renamed wait_for_resourcewait_for_resource_or_die, added input validation, timing, xtrace handling, per-try/status messaging and explicit exit on timeout; switched to a for-loop retry model.
CA, kubeconfig & vsock test
systemd/ocp-cluster-ca.sh, systemd/crc-test-vsock.py, systemd/crc-self-sufficient-env.sh
Reworked CA/kubeconfig flows (cleanup trap, external-IP-file support), switched to new wait helper, added vsock connectivity test used for self-sufficient detection.
Pull-secret & AWS fetcher
systemd/crc-pullsecret.sh, systemd/crc-aws-fetch-secrets.sh, systemd/crc-pullsecret.service, tools.sh
Pullsecret script now accepts a secrets-file arg and patches via stdin; added an AWS SSM fetcher (Podman) to populate secrets; tightened htpasswd helper.
Routes controller & dnsmasq template
systemd/crc-routes-controller.sh, systemd/dnsmasq.sh.template, systemd/crc-routes-controller.service
Hardened scripts (strict shell), removed prior gating (always deploy routes controller), switched to wait_for_resource_or_die; dnsmasq template adds APPS_DOMAIN pre-check.
Cluster status & apiserver wait scripts
systemd/crc-cluster-status.sh, systemd/crc-wait-apiserver-up.sh
Added configurable login retries/delay, SECONDS timing and readiness marker, progress messaging; switched to new wait helper and removed exporting KUBECONFIG.
User creds, SSH key, cluster ID & custom domain
systemd/ocp-userpasswords.sh, systemd/ocp-userpasswords.service, systemd/ocp-mco-sshkey.sh, systemd/ocp-mco-sshkey.service, systemd/ocp-clusterid.sh, systemd/ocp-clusterid.service, systemd/ocp-custom-domain.sh, systemd/ocp-custom-domain.service
Converted scripts to argument-driven inputs and validations, use jq+stdin for patches, added wait_for_resource_or_die waits, temp-file cleanup and explicit success messages; services updated to pass file args and ordering.
Removed growfs & docs update
systemd/ocp-growfs.service, systemd/ocp-growfs.sh, docs/self-sufficient-bundle.md
Deleted the ocp-growfs unit and script and removed the corresponding docs table row.
Targets, drop-ins & logging
systemd/crc-custom.target, systemd/kubelet.service.d/wants-crc-custom.conf, systemd/ovs-configuration.service.d/mute-console.conf
Updated crc-custom.target Description/Requires/After to depend on new wait services; added kubelet drop-in Wants/Before and OVS drop-in to redirect stdout/stderr to a log.
Disk/unit packaging changes
createdisk-library.sh, createdisk.sh
copy_systemd_units now copies drop-in directories and Python scripts recursively; createdisk adds ExecCondition to [email protected].
AWS & secret tooling
systemd/crc-aws-fetch-secrets.sh
New script to fetch three AWS SSM parameters inside a Podman AWS-CLI container with retries, validation and atomic writes to secret paths.
Misc scripts hardened
systemd/*.sh (many)
Broad hardening: strict shell options, argument validation, trap-based cleanup, jq-driven patches from stdin, and switching to wait_for_resource_or_die.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor systemd
  participant APIServer as crc-wait-apiserver-up.service
  participant NodeReady as crc-wait-node-ready.service
  participant APISvcs as ocp-wait-apiservices-available.service
  participant CRCUnits as crc-custom.target children

  systemd->>APIServer: Start (Env KUBECONFIG=/opt/kubeconfig)
  APIServer-->>systemd: apiserver ready
  systemd->>NodeReady: Start (After=APIServer)
  NodeReady-->>systemd: node ready
  systemd->>APISvcs: Start (Wants/After=NodeReady)
  APISvcs-->>systemd: apiservices available
  systemd->>CRCUnits: Start downstream units
Loading
sequenceDiagram
  autonumber
  actor systemd
  participant CheckTap as crc-check-tap.service
  participant Cond as crc-conditionally-disable-tap.sh
  participant Needs as crc-needs-tap.sh
  participant Disable as crc-disable-tap.sh
  participant GVNet as [email protected]

  systemd->>CheckTap: ExecStart
  CheckTap->>Cond: run wrapper
  Cond->>Needs: check if TAP needed
  alt TAP required
    Needs-->>Cond: success → do nothing
  else TAP not required
    Needs-->>Cond: non-zero → exec disable script
    Cond->>Disable: exec /usr/local/bin/crc-disable-tap.sh
    Disable-->>systemd: removed NM conn & stopped GVNet
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Possibly related PRs

Suggested labels

needs-ok-to-test

Suggested reviewers

  • cfergeau
  • gbraad
  • praveenkumar
  • anjannath

Poem

I hop through logs and timed delays,
I check if TAP should wake or laze.
I fetch secrets, patch with jq, and wait for nodes to say "hooray"—
Timers tick, units start, all done;
A little rabbit cheers: clean code, well spun. 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 8.33% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title concisely and accurately summarizes the intent of the changeset to enhance the reliability of system bring-up scripts. It directly reflects the modifications across service units and associated helper scripts, such as introducing retries, strict error handling, and startup ordering. The phrasing is clear, specific, and avoids unnecessary detail, making it easy for teammates to understand the PR’s core purpose at a glance.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

openshift-ci bot commented Sep 29, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign praveenkumar for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kpouget kpouget mentioned this pull request Sep 29, 2025
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 18

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
systemd/ocp-cluster-ca.sh (1)

79-82: World-writable kubeconfig is a security bug.

chmod 0666 exposes cluster credentials to all users. Lock this down and restore SELinux context if applicable.

-rm -rf /opt/kubeconfig
-cp /opt/crc/kubeconfig /opt/kubeconfig
-chmod 0666 /opt/kubeconfig
+rm -f /opt/kubeconfig
+cp /opt/crc/kubeconfig /opt/kubeconfig
+chown root:root /opt/kubeconfig
+chmod 0600 /opt/kubeconfig
+command -v restorecon >/dev/null 2>&1 && restorecon -v /opt/kubeconfig
🧹 Nitpick comments (16)
systemd/crc-self-sufficient-env.sh (1)

10-18: Optional: broaden accepted truthy values and unify wording

If these flags may be set to "true"/"yes", consider case-insensitive match; also keep “self‑sufficient” hyphenation consistent.

-if [ "${CRC_SELF_SUFFICIENT:-}" = "1" ]; then
-    echo "Running with a self sufficient bundle"
+shopt -s nocasematch
+if [[ "${CRC_SELF_SUFFICIENT:-}" =~ ^(1|true|yes)$ ]]; then
+    echo "Running with a self-sufficient bundle"
     exit 0
 fi

-if [ "${CRC_CLOUD:-}" = "1" ]; then
+if [[ "${CRC_CLOUD:-}" =~ ^(1|true|yes)$ ]]; then
     echo "Running in a cloud environment"
     exit 0
 fi
+shopt -u nocasematch

Also applies to: 20-22

systemd/ocp-userpasswords.sh (2)

18-19: Avoid TTY allocation in non-interactive systemd context

-t is unnecessary and may cause issues. Consider also pinning the image or using a local tool to avoid network pulls.

-    podman run --rm -ti docker.io/xmartlabs/htpasswd "$1" "$2"
+    podman run --rm -i docker.io/xmartlabs/htpasswd "$1" "$2"

Optional: prefer htpasswd -Bb (bcrypt) explicitly if supported by the image.


12-19: Optional: avoid Docker Hub dependency for reliability

To reduce external pull failures/rate limits, consider using a local htpasswd (httpd-tools) or a pinned/quay-hosted image. Also ensure bcrypt is used (-Bb).

Also applies to: 33-45

systemd/ocp-growfs.sh (2)

3-6: Harden against multiple/absent XFS devices under strict mode

With errexit/nounset, ambiguous root_partition can break the script. Select the root device explicitly and make grow ops idempotent.

-root_partition=$(/usr/sbin/blkid -t TYPE=xfs -o device)
-/usr/bin/growpart "${root_partition%?}" "${root_partition#/dev/???}"
+root_partition=$(/usr/sbin/findmnt -n -o SOURCE /sysroot || true)
+if [ -z "${root_partition:-}" ]; then
+  echo "Failed to determine root device for /sysroot" >&2
+  exit 1
+fi
+/usr/bin/growpart "${root_partition%?}" "${root_partition#/dev/???}" || true

Optionally gate xfs_growfs with xfs_info /sysroot checks if needed.

Also applies to: 9-14


18-21: Optional: add a final sync

A quick sync after filesystem growth can help on some environments before subsequent units run.

 echo "All done"
+sync
 exit 0
systemd/ocp-clusterid.sh (1)

13-19: LGTM; consider skipping if clusterID already set

Current flow looks good. Optionally no-op when clusterID exists to avoid unnecessary patch.

-wait_for_resource_or_die clusterversion
-
-oc patch clusterversion version -p '{"spec":{"clusterID":"'${uuid}'"}}' --type merge
+wait_for_resource_or_die "clusterversion/version"
+if ! oc get clusterversion version -o jsonpath='{.spec.clusterID}' | grep -q '.'; then
+  oc patch clusterversion version -p '{"spec":{"clusterID":"'${uuid}'"}}' --type merge
+fi
systemd/ocp-mco-sshkey.sh (1)

19-20: Wait on the specific MachineConfig name for determinism.

Use the helper to wait for the concrete resource rather than the type:

-wait_for_resource_or_die machineconfig
+wait_for_resource_or_die machineconfig/99-master-ssh 20 5
systemd/crc-wait-apiserver-up.sh (1)

11-14: LGTM; consistent with the new wait helper.

Optional: consider adding a small --request-timeout in wait_for_resource_or_die (systemd/crc-systemd-common.sh) to avoid long hangs if the API is half‑up.

systemd/ocp-custom-domain.sh (1)

54-60: Harden JSON patching for apiserver and registry route (quote-safe).

Avoid inline JSON concatenation with unquoted $EIP. Use patch files or jq for robust quoting.

-oc patch apiserver cluster --type=merge \
-   -p '{"spec":{"servingCerts": {"namedCertificates":[{"names":["api.'$EIP'.nip.io"],"servingCertificate": {"name": "nip-secret"}}]}}}'
+cat > /tmp/apiserver-patch.json <<EOF
+{"spec":{"servingCerts":{"namedCertificates":[
+  {"names":["api.${EIP}.nip.io"],"servingCertificate":{"name":"nip-secret"}}
+]}}}
+EOF
+oc patch apiserver cluster --type=merge --patch-file=/tmp/apiserver-patch.json
-oc patch route default-route -n openshift-image-registry --type=merge \
-   -p '{"spec": {"host": "default-route-openshift-image-registry.'$EIP'.nip.io"}}'
+cat > /tmp/registry-route-patch.json <<EOF
+{"spec":{"host":"default-route-openshift-image-registry.${EIP}.nip.io"}}
+EOF
+oc patch route default-route -n openshift-image-registry --type=merge --patch-file=/tmp/registry-route-patch.json
systemd/crc-pullsecret.sh (3)

14-17: Remove duplicate xtrace disable and restore tracing afterwards.

Keep one set +x before secret handling and restore to aid debugging.

-set +x # disable the logging to avoid leaking the pull secrets
-
-set +x # disable the logging to avoid leaking the pull secrets
+set +x # disable logging to avoid leaking the pull secrets
+# ... secret handling ...
+# restore tracing if desired
+# set -x

19-25: Tolerate missing/invalid existing pull-secret without aborting the script.

With set -o pipefail, a base64 decode error aborts the script. Swallow decode errors and rely on the subsequent jq check.

-existingPs=$(oc get secret pull-secret -n openshift-config \
-                   -o jsonpath="{['data']['\.dockerconfigjson']}" \
-                    | base64 -d)
+existingPs=$(oc get secret pull-secret -n openshift-config \
+                   -o jsonpath="{['data']['\.dockerconfigjson']}" \
+                   | base64 -d || true)

33-36: Build the patch JSON with jq to avoid quoting pitfalls.

Inline JSON with embedded $pullSecretB64 is brittle. Use jq to construct the payload.

-pullSecretB64=$(cat /opt/crc/pull-secret | base64 -w0)
-oc patch secret pull-secret -n openshift-config --type merge \
-   -p '{"data":{".dockerconfigjson": "'${pullSecretB64}'"}}'
+pullSecretB64=$(base64 -w0 < /opt/crc/pull-secret)
+PATCH=$(jq -n --arg v "$pullSecretB64" '{data:{".dockerconfigjson":$v}}')
+oc patch secret pull-secret -n openshift-config --type=merge -p "$PATCH"
systemd/ocp-cluster-ca.sh (4)

13-14: nounset + sourcing env can be brittle; guard it.

Avoid failures if the env file references unset vars by temporarily disabling nounset.

-source /etc/sysconfig/crc-env || echo "WARNING: crc-env not found"
+set +u
+source /etc/sysconfig/crc-env || echo "WARNING: crc-env not found"
+set -u

18-18: Fragile readiness probe: ‘configmap’ is namespaced.

This lists the default namespace and can spuriously fail. Probe a cluster-scoped resource instead (e.g., namespaces) to reduce flakiness.

-wait_for_resource_or_die configmap
+wait_for_resource_or_die namespaces

61-70: Clarify KUBECONFIG handoff and avoid rm -rf on files.

You export KUBECONFIG twice; use a distinct var for the build artifact and avoid recursive rm on a regular file.

-export KUBECONFIG=/opt/crc/kubeconfig
-rm -rf "${KUBECONFIG}"
+NEW_KUBECONFIG=/opt/crc/kubeconfig
+export KUBECONFIG="${NEW_KUBECONFIG}"
+rm -f "${KUBECONFIG}"
 
 oc config set-credentials system:admin --client-certificate=${CLIENT_CA_FILE_PATH} --client-key=${CLIENT_CA_KEY_FILE_PATH} \
    --embed-certs

71-71: Stronger readiness check for new client CA acceptance.

Listing clusteroperators confirms API availability, not that the new client cert is accepted. Consider additionally waiting until auth works with the new cert.

Example:

wait_for_resource_or_die clusteroperators 90 2
# Ensure the client cert is actually honored
for i in {1..60}; do
  if oc whoami 2>/dev/null | grep -q '^system:admin$'; then break; fi
  sleep 2
done
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 57169c8 and 379d0b4.

📒 Files selected for processing (30)
  • snc.sh (1 hunks)
  • systemd/crc-cluster-status.service (1 hunks)
  • systemd/crc-cluster-status.sh (1 hunks)
  • systemd/crc-dnsmasq.service (1 hunks)
  • systemd/crc-needs-tap.sh (1 hunks)
  • systemd/crc-no-tap.service (1 hunks)
  • systemd/crc-no-tap.sh (1 hunks)
  • systemd/crc-pullsecret.service (1 hunks)
  • systemd/crc-pullsecret.sh (1 hunks)
  • systemd/crc-routes-controller.service (1 hunks)
  • systemd/crc-routes-controller.sh (1 hunks)
  • systemd/crc-self-sufficient-env.sh (1 hunks)
  • systemd/crc-systemd-common.sh (1 hunks)
  • systemd/crc-user-mode-networking.sh (1 hunks)
  • systemd/crc-wait-apiserver-up.service (0 hunks)
  • systemd/crc-wait-apiserver-up.sh (1 hunks)
  • systemd/dnsmasq.sh.template (1 hunks)
  • systemd/ocp-cluster-ca.service (1 hunks)
  • systemd/ocp-cluster-ca.sh (3 hunks)
  • systemd/ocp-clusterid.service (1 hunks)
  • systemd/ocp-clusterid.sh (1 hunks)
  • systemd/ocp-custom-domain.service (1 hunks)
  • systemd/ocp-custom-domain.sh (3 hunks)
  • systemd/ocp-growfs.service (0 hunks)
  • systemd/ocp-growfs.sh (2 hunks)
  • systemd/ocp-mco-sshkey.service (1 hunks)
  • systemd/ocp-mco-sshkey.sh (1 hunks)
  • systemd/ocp-userpasswords.service (1 hunks)
  • systemd/ocp-userpasswords.sh (1 hunks)
  • tools.sh (1 hunks)
💤 Files with no reviewable changes (2)
  • systemd/crc-wait-apiserver-up.service
  • systemd/ocp-growfs.service
🧰 Additional context used
🧬 Code graph analysis (9)
systemd/crc-routes-controller.sh (1)
systemd/crc-systemd-common.sh (1)
  • wait_for_resource_or_die (3-15)
systemd/ocp-mco-sshkey.sh (1)
systemd/crc-systemd-common.sh (1)
  • wait_for_resource_or_die (3-15)
systemd/ocp-cluster-ca.sh (1)
systemd/crc-systemd-common.sh (1)
  • wait_for_resource_or_die (3-15)
systemd/crc-wait-apiserver-up.sh (1)
systemd/crc-systemd-common.sh (1)
  • wait_for_resource_or_die (3-15)
systemd/crc-pullsecret.sh (1)
systemd/crc-systemd-common.sh (1)
  • wait_for_resource_or_die (3-15)
systemd/ocp-userpasswords.sh (1)
systemd/crc-systemd-common.sh (1)
  • wait_for_resource_or_die (3-15)
systemd/ocp-clusterid.sh (1)
systemd/crc-systemd-common.sh (1)
  • wait_for_resource_or_die (3-15)
snc.sh (1)
tools.sh (1)
  • retry (110-128)
systemd/ocp-custom-domain.sh (1)
systemd/crc-systemd-common.sh (1)
  • wait_for_resource_or_die (3-15)
🪛 Shellcheck (0.11.0)
tools.sh

[warning] 225-225: Declare and assign separately to avoid masking return values.

(SC2155)

systemd/crc-needs-tap.sh

[error] 26-26: Can only exit with status 0-255. Other data should be written to stdout/stderr.

(SC2242)

systemd/ocp-userpasswords.sh

[error] 14-14: Couldn't parse this redirection. Fix to allow more checks.

(SC1073)


[error] 14-14: Fix any mentioned problems and try again.

(SC1072)

🔇 Additional comments (9)
systemd/crc-dnsmasq.service (1)

12-12: Hardening the ovs prerequisite looks good.

Adding the ExecStartPre explicit start makes this unit fail fast if ovs-configuration.service can’t come up, tightening the dependency chain without waiting for ExecStart to stumble over missing state. Nicely targeted reliability win.

snc.sh (1)

284-290: Restore failure handling for kubeadmin login.

Inside the subshell we now disable set -e, so if retry ${OC} login … exhausts its attempts it returns non-zero but the code keeps going, deletes the password file, redacts logs, and the subshell exits 0 because sed succeeds. We silently proceed without a login and make later steps fail in confusing ways. Please bail out immediately when the login ultimately fails.

-    retry ${OC} login -u kubeadmin -p "$kubeadmin_pass" --insecure-skip-tls-verify=true api.${SNC_PRODUCT_NAME}.${BASE_DOMAIN}:6443
-    rm -f ${INSTALL_DIR}/auth/kubeadmin-password
+    if ! retry ${OC} login -u kubeadmin -p "$kubeadmin_pass" --insecure-skip-tls-verify=true api.${SNC_PRODUCT_NAME}.${BASE_DOMAIN}:6443; then
+        echo "Failed to authenticate as kubeadmin after retries" >&2
+        exit 1
+    fi
+    rm -f ${INSTALL_DIR}/auth/kubeadmin-password

Likely an incorrect or invalid review comment.

systemd/ocp-custom-domain.service (1)

11-15: LGTM on restart backoff; check env dependency

Reducing RestartSec to 10 for a oneshot is reasonable given StartLimit*. With EnvironmentFile removed, please confirm ocp-custom-domain.sh doesn’t rely on vars formerly provided there.

systemd/crc-pullsecret.service (1)

11-11: Quicker retry window looks good

Dropping RestartSec to 10 s should help the oneshot retry faster on transient failures without affecting the existing start-limit settings.

systemd/crc-systemd-common.sh (1)

3-15: Helper rename aligns with behavior

Renaming to wait_for_resource_or_die matches the existing fatal-on-timeout semantics, so the intent is clearer and the implementation stays unchanged.

systemd/crc-routes-controller.sh (1)

3-20: Hardened controller bring-up

Strict shell flags plus the shared wait helper make the flow more predictable, and the explicit ROUTE_CONTROLLER path keeps the apply logic tidy.

systemd/crc-no-tap.service (1)

2-11: Service gating reads well

The new description and ExecCondition=!/usr/local/bin/crc-needs-tap.sh make the unit self-contained, and ordering before [email protected] ensures we clean up ahead of that instance.

systemd/crc-pullsecret.sh (1)

12-13: LGTM on adopting wait_for_resource_or_die.

systemd/crc-needs-tap.sh (1)

11-17: No swap needed: ExecCondition already inverts exit codes
The ! prefix on ExecCondition correctly inverts the script’s exit status, so NEED_TAP=0/DONT_NEED_TAP=1 operate as intended. No changes required in crc-needs-tap.sh.

Likely an incorrect or invalid review comment.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
systemd/ocp-cluster-ca.sh (1)

79-82: Do not world‑writable kubeconfig containing embedded private key.

0666 exposes creds; set 0600 (or 0640 with controlled group).

 rm -rf /opt/kubeconfig
 cp /opt/crc/kubeconfig /opt/kubeconfig
-chmod 0666 /opt/kubeconfig
+chmod 0600 /opt/kubeconfig
🧹 Nitpick comments (15)
systemd/dnsmasq.sh.template (2)

12-12: Guard empty hostIp and quote nmcli argument.

If hostname --all-ip-addresses returns nothing, $hostIp is empty; the config and nmcli call become invalid. Add a check and quote the nmcli value.

 hostIp=$(hostname --all-ip-addresses | awk '{print $1}')
+[[ -n "$hostIp" ]] || { echo "ERROR: could not determine host IP"; exit 1; }
-/bin/nmcli conn modify --temporary ovs-if-br-ex ipv4.dns $hostIp,1.1.1.1
+/bin/nmcli conn modify --temporary ovs-if-br-ex ipv4.dns "$hostIp,1.1.1.1"

3-6: errtrace without an ERR trap gives little value—consider adding a simple trap.

This improves diagnosability under -e/-u/-o pipefail.

trap 'rc=$?; echo "ERROR: ${BASH_SOURCE[0]}:${BASH_LINENO[0]} command \"${BASH_COMMAND}\" failed with exit $rc" >&2' ERR
systemd/crc-wait-apiserver-up.sh (4)

3-6: Good shell strictness; add an ERR trap to get actionable logs on failure.

errtrace only helps if an ERR trap exists. Add a tiny trap so systemd journals capture the failing command/line.

 set -o pipefail
 set -o errexit
 set -o nounset
 set -o errtrace
+trap 'echo "[crc-wait-apiserver-up] error at line ${LINENO}: ${BASH_COMMAND}" >&2' ERR

13-13: Make retry cadence configurable via env for different environments.

Hard‑coding 4,60 limits reuse; allow overrides while keeping current defaults.

-wait_for_resource_or_die node 4 60
+wait_for_resource_or_die node "${CRC_WAIT_MAX_RETRY:-4}" "${CRC_WAIT_WAIT_SEC:-60}"

15-17: Status messages LGTM. Consider “API server” casing if you care about wording consistency.


19-19: exit 0 is redundant.

Bash exits 0 after the last command succeeds; safe to drop for brevity.

-exit 0
systemd/crc-cluster-status.sh (1)

27-44: Subshell guards secrets; nit: message typo

The (set +x; set +e; oc login …) pattern correctly prevents password trace. Consider s/successed/succeeded/ for logs.

systemd/ocp-userpasswords.sh (2)

31-32: Be explicit about the resource being waited on
wait_for_resource_or_die secret is vague; pass a concrete, namespaced resource.

-wait_for_resource_or_die secret
+# Wait for the target namespace before creating the secret
+wait_for_resource_or_die "ns/openshift-config"

Please confirm the helper supports this form; alternatively extend it to accept a namespace flag.


36-37: Minor: avoid useless cat and quote
Keep consistent quoting; avoid subshell for simple file read.

-dev_pass=$(gen_htpasswd developer "$(cat /tmp/crc/pass_developer)")
-adm_pass=$(gen_htpasswd kubeadmin "$(cat /tmp/crc/pass_kubeadmin)")
+dev_pass="$(gen_htpasswd developer "$(< /tmp/crc/pass_developer)")"
+adm_pass="$(gen_htpasswd kubeadmin "$(< /tmp/crc/pass_kubeadmin)")"
systemd/ocp-growfs.sh (1)

18-20: LGTM; exit 0 is fine (optional to drop).

errexit already ensures non‑zero exits on failure; explicit exit 0 is harmless. You can keep the “All done” message for clarity.

If desired, also confirm leaving /sysroot rw is intentional (Line 16 shows the ro remount commented out).

systemd/crc-routes-controller.sh (2)

15-16: Make waits robust (slow clusters) by extending retries.

Defaults cap at ~100s; CRC bring‑up can exceed this. Pass larger retry/intervals.

-wait_for_resource_or_die pods
-wait_for_resource_or_die deployments
+wait_for_resource_or_die pods 180 5
+wait_for_resource_or_die deployments 180 5

9-9: Assert manifest presence before apply.

Fail early with a clearer error if the YAML is missing.

-ROUTE_CONTROLLER=/opt/crc/routes-controller.yaml
+ROUTE_CONTROLLER=/opt/crc/routes-controller.yaml
+[ -s "$ROUTE_CONTROLLER" ] || { echo "routes-controller manifest not found: $ROUTE_CONTROLLER"; exit 1; }

Also applies to: 18-18

systemd/ocp-cluster-ca.sh (1)

61-70: Reduce kubeconfig side effects; consider subshell/local var.

Flipping KUBECONFIG mid-script is easy to misuse. Limit scope.

-export KUBECONFIG=/opt/crc/kubeconfig
-rm -rf "${KUBECONFIG}"
+(
+  export KUBECONFIG=/opt/crc/kubeconfig
+  rm -f "$KUBECONFIG"
   oc config set-credentials system:admin --client-certificate=${CLIENT_CA_FILE_PATH} --client-key=${CLIENT_CA_KEY_FILE_PATH} \
    --embed-certs
   oc config set-context system:admin --cluster="${cluster_name}" --namespace=default --user=system:admin
   oc config set-cluster "${cluster_name}" --server="${apiserver_url}" --insecure-skip-tls-verify=true
   oc config use-context system:admin
- 
-wait_for_resource_or_die clusteroperators 90 2
+)
+
+# Optionally wait for readiness, not mere existence:
+# oc wait --for=condition=Available --all clusteroperator --timeout=15m
systemd/crc-needs-tap.sh (2)

3-7: Avoid unconditional xtrace in systemd units; gate behind DEBUG.

set -x spams journals and can leak data. Make it opt‑in.

 set -o pipefail
 set -o errexit
 set -o nounset
 set -o errtrace
-set -x
+: "${DEBUG:=0}"
+if [[ "$DEBUG" == "1" ]]; then
+  set -x
+fi

11-12: Mark exit codes as readonly to prevent accidental mutation.

Small hardening.

-NEED_TAP=0
-DONT_NEED_TAP=1
+readonly NEED_TAP=0
+readonly DONT_NEED_TAP=1
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 379d0b4 and 90db003.

📒 Files selected for processing (29)
  • systemd/crc-cluster-status.service (1 hunks)
  • systemd/crc-cluster-status.sh (1 hunks)
  • systemd/crc-dnsmasq.service (1 hunks)
  • systemd/crc-needs-tap.sh (1 hunks)
  • systemd/crc-no-tap.service (1 hunks)
  • systemd/crc-no-tap.sh (1 hunks)
  • systemd/crc-pullsecret.service (1 hunks)
  • systemd/crc-pullsecret.sh (1 hunks)
  • systemd/crc-routes-controller.service (1 hunks)
  • systemd/crc-routes-controller.sh (1 hunks)
  • systemd/crc-self-sufficient-env.sh (1 hunks)
  • systemd/crc-systemd-common.sh (1 hunks)
  • systemd/crc-user-mode-networking.sh (1 hunks)
  • systemd/crc-wait-apiserver-up.service (0 hunks)
  • systemd/crc-wait-apiserver-up.sh (1 hunks)
  • systemd/dnsmasq.sh.template (1 hunks)
  • systemd/ocp-cluster-ca.service (1 hunks)
  • systemd/ocp-cluster-ca.sh (3 hunks)
  • systemd/ocp-clusterid.service (1 hunks)
  • systemd/ocp-clusterid.sh (1 hunks)
  • systemd/ocp-custom-domain.service (1 hunks)
  • systemd/ocp-custom-domain.sh (3 hunks)
  • systemd/ocp-growfs.service (0 hunks)
  • systemd/ocp-growfs.sh (2 hunks)
  • systemd/ocp-mco-sshkey.service (1 hunks)
  • systemd/ocp-mco-sshkey.sh (1 hunks)
  • systemd/ocp-userpasswords.service (1 hunks)
  • systemd/ocp-userpasswords.sh (1 hunks)
  • tools.sh (1 hunks)
💤 Files with no reviewable changes (2)
  • systemd/crc-wait-apiserver-up.service
  • systemd/ocp-growfs.service
🚧 Files skipped from review as they are similar to previous changes (11)
  • systemd/crc-self-sufficient-env.sh
  • systemd/ocp-custom-domain.sh
  • systemd/crc-pullsecret.service
  • systemd/crc-user-mode-networking.sh
  • systemd/crc-dnsmasq.service
  • systemd/crc-routes-controller.service
  • systemd/crc-pullsecret.sh
  • systemd/ocp-userpasswords.service
  • systemd/ocp-clusterid.sh
  • systemd/crc-no-tap.service
  • systemd/crc-systemd-common.sh
🧰 Additional context used
🧬 Code graph analysis (5)
systemd/crc-wait-apiserver-up.sh (1)
systemd/crc-systemd-common.sh (1)
  • wait_for_resource_or_die (3-15)
systemd/crc-routes-controller.sh (1)
systemd/crc-systemd-common.sh (1)
  • wait_for_resource_or_die (3-15)
systemd/ocp-userpasswords.sh (1)
systemd/crc-systemd-common.sh (1)
  • wait_for_resource_or_die (3-15)
systemd/ocp-mco-sshkey.sh (1)
systemd/crc-systemd-common.sh (1)
  • wait_for_resource_or_die (3-15)
systemd/ocp-cluster-ca.sh (1)
systemd/crc-systemd-common.sh (1)
  • wait_for_resource_or_die (3-15)
🪛 Shellcheck (0.11.0)
systemd/ocp-userpasswords.sh

[error] 14-14: Couldn't parse this redirection. Fix to allow more checks.

(SC1073)


[error] 14-14: Fix any mentioned problems and try again.

(SC1072)

systemd/crc-needs-tap.sh

[error] 26-26: Can only exit with status 0-255. Other data should be written to stdout/stderr.

(SC2242)

tools.sh

[warning] 225-225: Declare and assign separately to avoid masking return values.

(SC2155)

🔇 Additional comments (18)
systemd/dnsmasq.sh.template (1)

3-6: Unbound APPS_DOMAIN under strict mode makes script fail when crc-env is absent.

With set -o nounset, using $APPS_DOMAIN later (line 20) exits the script if /etc/sysconfig/crc-env is missing or does not define it. Make the env file required or validate/prime APPS_DOMAIN before use.

Apply this diff to guard sourcing and assert the variable is set:

-source /etc/sysconfig/crc-env || echo "WARNING: crc-env not found"
+APPS_DOMAIN="${APPS_DOMAIN-}"
+if [[ -r /etc/sysconfig/crc-env ]]; then
+  # shellcheck disable=SC1091
+  source /etc/sysconfig/crc-env
+else
+  echo "WARNING: /etc/sysconfig/crc-env not found; relying on environment"
+fi
+: "${APPS_DOMAIN:?APPS_DOMAIN must be set (export it or provide it via /etc/sysconfig/crc-env)}"

And fail-fast at the point of use in the config:

-address=/${APPS_DOMAIN}/$hostIp
+address=/${APPS_DOMAIN:?APPS_DOMAIN must be set}/$hostIp

Also applies to: 9-9, 20-20

systemd/crc-wait-apiserver-up.sh (1)

11-11: Log message LGTM.

systemd/crc-cluster-status.sh (1)

3-6: Good: strict mode enabled early

pipefail/errexit/nounset/errtrace before logic is a solid hardening step.

systemd/ocp-userpasswords.sh (4)

43-45: Make sed portable and update idempotently
Use POSIX space class; prefer oc apply over replace for create/update.

-   --from-file=htpasswd=<((echo "$dev_pass"; echo "$adm_pass") | sed '/^\s*$/d') \
+   --from-file=htpasswd=<((echo "$dev_pass"; echo "$adm_pass") | sed '/^[[:space:]]*$/d') \
    -n openshift-config \
-   --dry-run=client -oyaml | oc replace -f-
+   --dry-run=client -oyaml | oc apply -f-

13-16: Fix argument validation and stderr redirection
Condition is inverted and redirection is malformed; this can fail at runtime.

-    if [ -z "${1:-}" ] && [ ! -z "${2:-}" ]; then
-        echo "gen_htpasswd needs to arguments: username password" 1>&
+    if [ -z "${1:-}" ] || [ -z "${2:-}" ]; then
+        echo "gen_htpasswd needs two arguments: username password" >&2
         return 1
     fi

Based on static analysis hints.


26-29: Correct the error message (kubeadmin, not developer)
Message mismatches the file being checked.

-    echo "developer password does not exist"
+    echo "kubeadmin password does not exist"

33-35: Do not enable tracing while handling secrets
This turns tracing on instead of off and risks leaking hashes/content later in the pipeline.

-echo "generating the kubeadmin and developer passwords ..."
-set -x # disable the logging to avoid leaking the passwords
+echo "generating the kubeadmin and developer passwords ..."
+set +x # disable tracing to avoid leaking the passwords

Optionally re‑enable with set -x after secret is created.

systemd/crc-cluster-status.service (1)

13-13: Revert RestartSec to preserve retry window (or retune StartLimit).

With StartLimitIntervalSec=450 and StartLimitBurst=10, RestartSec=10 will exhaust retries in ~100s and hit start‑limit lockout; previously at 40s it allowed ~400s. Restore 40s or increase StartLimit* accordingly to keep resiliency.

-RestartSec=10
+RestartSec=40
systemd/ocp-mco-sshkey.service (1)

10-10: Keep restart pacing aligned with StartLimit.

10s cadence trips the burst in ~100s. Revert to 40s (or proportionally raise StartLimitIntervalSec/Burst).

-RestartSec=10
+RestartSec=40
systemd/ocp-clusterid.service (1)

10-10: Restart pacing regression—restore 40s (or widen StartLimit).

10s restarts consume the burst in ~100s vs. ~400s before. Revert or retune StartLimit*.

-RestartSec=10
+RestartSec=40
systemd/ocp-cluster-ca.sh (3)

22-25: Idempotency: drop brittle early-exit; use apply for create-or-update.

If CM exists but apiserver isn’t patched yet, the early exit skips patch forever. Use apply and guard on actual apiserver state.

-if oc get configmap client-ca-custom -n openshift-config; then
-    echo "API Server Client CA already rotated..."
-    exit 0
-fi
+current_client_ca="$(oc get apiserver cluster -o jsonpath='{.spec.clientCA.name}' || true)"
+if [[ "${current_client_ca}" == "client-ca-custom" ]]; then
+    echo "API Server Client CA already rotated..."
+    exit 0
+fi
@@
-oc create configmap client-ca-custom \
-   -n openshift-config \
-   --from-file=ca-bundle.crt=${CA_FILE_PATH}
+oc create configmap client-ca-custom \
+   -n openshift-config \
+   --from-file=ca-bundle.crt=${CA_FILE_PATH} \
+   --dry-run=client -o yaml | oc apply -f -
@@
-oc patch apiserver cluster --type=merge \
+oc patch apiserver cluster --type=merge \
    -p '{"spec": {"clientCA": {"name": "client-ca-custom"}}}'

Also applies to: 47-49, 51-52


7-11: Add restrictive umask before creating keys/kubeconfig.

Prevents group/world-readable secrets.

 set -o nounset
 set -o errtrace
+umask 077
 set -x

71-76: First-run failure: replace errors if the CM is absent; use apply.

Keep pipeline idempotent.

 oc create configmap admin-kubeconfig-client-ca \
    -n openshift-config \
    --from-file=ca-bundle.crt=${CA_FILE_PATH} \
-   --dry-run=client -o yaml | oc replace -f-
+   --dry-run=client -o yaml | oc apply -f-
systemd/crc-no-tap.sh (1)

11-12: Guard disable when the instance isn’t installed (avoid errexit abort).

With errexit, this fails when [email protected] is absent. Gate the call or swallow the “not found” exit.

 echo "Disabling the tap0 network configuration ..."
 
 rm -f /etc/NetworkManager/system-connections/tap0.nmconnection
-systemctl disable --now [email protected]
+if systemctl list-unit-files '[email protected]' >/dev/null 2>&1; then
+    systemctl disable --now [email protected] || true
+fi
systemd/ocp-mco-sshkey.sh (1)

22-26: Build JSON safely with jq; fix oc patch type flag.

Current string concat breaks with spaces or special chars and is fragile. Also use --type=merge.

-echo "Updating the public key resource for machine config operator"
-pub_key=$(cat "${PUB_KEY_PATH}" | tr -d '\n\r')
-
-MC_PATCH='{"spec": {"config": {"passwd": {"users": [{"name": "core", "sshAuthorizedKeys": ["'${pub_key}'"]}]}}}}'
-
-if ! oc patch machineconfig 99-master-ssh -p "$MC_PATCH" --type merge;
+echo "Updating the public key resource for machine config operator"
+pub_key=$(tr -d '\n\r' < "${PUB_KEY_PATH}")
+MC_PATCH="$(jq -n --arg key "$pub_key" \
+  '{spec:{config:{passwd:{users:[{name:"core",sshAuthorizedKeys:[$key]}]}}}}')"
+
+if ! oc patch machineconfig 99-master-ssh --type=merge -p "$MC_PATCH";
 then
systemd/ocp-cluster-ca.service (1)

11-11: Avoid exhausting StartLimit too quickly; space retries out or raise limits.

At 10s RestartSec with StartLimitBurst=10, the unit locks out after ~100s. Restore 40s (or adjust StartLimit*).

 Restart=on-failure
-RestartSec=10
+RestartSec=40

Alternative: increase StartLimitBurst and/or StartLimitIntervalSec if you prefer faster retries.

systemd/crc-needs-tap.sh (2)

14-17: Apple virtualization handling is inverted — should not keep tap0.

Per prior feedback, Apple virtualization must not use tap0. Flip the exit code here to signal “don’t keep tap”. The message should match.

-if systemd-detect-virt | grep -q '^apple$' ; then
-    echo "Running with Apple virtualization. Need tap0."
-    exit $NEED_TAP
+if systemd-detect-virt | grep -q '^apple$' ; then
+    echo "Running with Apple virtualization. Don't keep tap0"
+    exit "$DONT_NEED_TAP"
 fi

Based on past review comments.


26-26: Bug: exiting with literal name instead of numeric code.

Expand the variable; otherwise it’s treated as a word and returns 255.

-exit DONT_NEED_TAP
+exit "$DONT_NEED_TAP"

@kpouget kpouget marked this pull request as draft September 29, 2025 15:18
@kpouget kpouget force-pushed the improve branch 10 times, most recently from c92b3c2 to 9bafac1 Compare October 1, 2025 08:42
@kpouget
Copy link
Contributor Author

kpouget commented Oct 1, 2025

this version works 🎉 9bafac1

@kpouget kpouget force-pushed the improve branch 2 times, most recently from bc4f6d3 to b2b148e Compare October 1, 2025 16:39
@kpouget
Copy link
Contributor Author

kpouget commented Oct 1, 2025

Marking this PR as Ready for review, it's discuss how to merge it.
I still need to test it on libvirt and vfkit, as well as AWS with a CI-built AMI image

@kpouget kpouget marked this pull request as ready for review October 1, 2025 16:40
@openshift-ci openshift-ci bot requested a review from cfergeau October 1, 2025 16:40
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
systemd/ocp-cluster-ca.sh (1)

90-94: World-readable kubeconfig (0666) is a credential leak; lock it down.

Kubeconfig embeds client cert/key; 0666 exposes cluster admin creds. Use 0600.

-# copy the new kubeconfig to /opt/kubeconfig
-rm -rf /opt/kubeconfig
-cp /opt/crc/kubeconfig /opt/kubeconfig
-chmod 0666 /opt/kubeconfig
+# copy the new kubeconfig to /opt/kubeconfig with strict perms
+rm -f /opt/kubeconfig
+cp /opt/crc/kubeconfig /opt/kubeconfig
+chmod 0600 /opt/kubeconfig
♻️ Duplicate comments (5)
tools.sh (1)

222-228: Harden secret handling: add umask 077, fix SC2155, and pass -- to htpasswd

Restrict file perms, avoid declare+assign subshell masking (SC2155), and guard option parsing.

-   (
-       set +x # use a subshell to avoid leaking the password
-
-       local random_password=$(cat "$auth_file_dir/auth/kubeadmin-password")
-       "${HTPASSWD}" -c -B -i "${pass_file}" developer <<< "developer" # use -c to create the file
-       "${HTPASSWD}" -B -i "${pass_file}" kubeadmin <<< "${random_password}" # append to the existing password file
-   )
+   (
+       set +x  # avoid leaking the password
+       umask 077
+       local random_password
+       random_password="$(<"${auth_file_dir}/auth/kubeadmin-password")"
+       "${HTPASSWD}" -c -B -i -- "${pass_file}" developer <<<"developer"
+       "${HTPASSWD}" -B -i -- "${pass_file}" kubeadmin <<<"${random_password}"
+   )
systemd/crc-needs-tap.sh (1)

38-41: Critical: Apple virtualization should exit with DONT_NEED_TAP, not NEED_TAP.

According to the past review discussion, Apple virtualization (vfkit) does not require tap0. The current regex matches apple and exits with NEED_TAP, which is incorrect.

Based on the discussion and past review recommendations, the logic should be:

  • libvirt (kvm/qemu): NEED_TAP
  • Hyper-V (microsoft/hyperv): NEED_TAP
  • Apple virtualization: DONT_NEED_TAP

Apply this diff to fix the virtualization detection:

-if systemd-detect-virt | grep -E -q '^(kvm|apple|microsoft)$' ; then
-    echo "Running with '$(systemd-detect-virt)' virtualization. Need tap0."
+# Detect virtualization type safely under set -e
+virt="$(systemd-detect-virt || true)"
+case "${virt}" in
+  kvm|qemu|microsoft|hyperv)
+    echo "Running with '${virt}' virtualization. Need tap0."
     exit $NEED_TAP
-fi
+    ;;
+  apple)
+    echo "Running with Apple virtualization. Don't need tap0."
+    exit $DONT_NEED_TAP
+    ;;
+esac
systemd/ocp-cluster-ca.sh (2)

7-11: Set restrictive umask before generating keys/kubeconfigs.

Prevents group/world read on newly created files.

 set -o errtrace
 set -x
+umask 077

22-27: Drop early exit or guard by apiserver state; make flow idempotent.

Exiting when CM exists can skip the apiserver patch forever after a partial run. Use apply (already used) and only exit if apiserver.spec.clientCA.name is set.

-if oc get configmap client-ca-custom -n openshift-config 2>/dev/null; then
-    echo "API Server Client CA already rotated..."
-    exit 0
-fi
-echo "API Server Client CA not rotated. Doing it now ..."
+current_client_ca="$(oc get apiserver cluster -o jsonpath='{.spec.clientCA.name}' 2>/dev/null || true)"
+if [[ "${current_client_ca}" == "client-ca-custom" ]]; then
+  echo "API Server Client CA already rotated..."
+  exit 0
+fi
+echo "API Server Client CA not rotated. Doing it now ..."
@@
-oc create configmap client-ca-custom \
+oc create configmap client-ca-custom \
    -n openshift-config \
    --from-file=ca-bundle.crt=${CA_FILE_PATH} \
    --dry-run=client -o yaml \
     | oc apply -f -
@@
-jq -n '
+jq -n '
 {
   "spec": {
     "clientCA": {
       "name": "client-ca-custom"
     }
   }
 }' | oc patch apiserver cluster --type=merge --patch-file=/dev/stdin

Also applies to: 48-53, 55-63

systemd/ocp-userpasswords.sh (1)

12-19: Fix error message typo.

“needs to arguments” → “needs two arguments”.

-        echo "gen_htpasswd needs to arguments: username password" >&2
+        echo "gen_htpasswd needs two arguments: username password" >&2
🧹 Nitpick comments (11)
systemd/crc-disable-tap.sh (2)

7-7: Consider removing set -x for production.

set -x traces every command to stderr, which can clutter logs in production environments. Consider removing it or gating it behind a debug flag (e.g., ${DEBUG:-false} && set -x).


14-14: Remove redundant exit 0.

With set -o errexit, the script will exit 0 automatically if it reaches the end without error. The explicit exit 0 is redundant.

-exit 0
systemd/crc-self-sufficient-env.sh (1)

10-17: Document CRC_CLOUD removal if obsolete.

The guard logic now only checks CRC_SELF_SUFFICIENT and no longer checks CRC_CLOUD. If CRC_CLOUD is obsolete and being replaced by CRC_SELF_SUFFICIENT, consider adding a comment documenting this transition for future maintainers.

Based on past review comments.

systemd/ocp-clusterid.sh (1)

20-20: Remove redundant explicit exit.

With set -o errexit enabled, the script will automatically exit with status 0 upon successful completion. The explicit exit 0 is redundant.

 echo "All done"
-
-exit 0
systemd/crc-needs-tap.sh (1)

11-28: Consider adding error handling for sudo failure.

The sudo cat /sys/class/dmi/id/product_name call on line 21 could fail if sudo isn't available or permissions are denied. Consider wrapping it in a conditional or using || true to prevent script termination under set -e.

-    elif [[ $(sudo cat /sys/class/dmi/id/product_name) == *"Amazon EC2"* ]]; then
+    elif [[ $(sudo cat /sys/class/dmi/id/product_name 2>/dev/null || echo "") == *"Amazon EC2"* ]]; then
systemd/crc-systemd-common.sh (1)

33-33: Redundant return statement.

Line 33's return 0 is unreachable because line 19 already returns on success, and line 25 exits on failure. Consider removing it for clarity.

     done
 
-    return 0
 }
systemd/crc-user-mode-networking.sh (1)

13-30: Harden mode parsing; avoid arithmetic errors and clarify ExecCondition contract.

Current integer tests can emit “integer expression expected” if CRC_NETWORK_MODE_USER is non‑numeric. A case switch is simpler and safer; also send the error to stderr.

-USER_MODE=0
-SYSTEM_MODE=1
+USER_MODE=0
+SYSTEM_MODE=1
@@
-if [[ -z "${CRC_NETWORK_MODE_USER:-}" ]]; then
-    echo "CRC_NETWORK_MODE_USER not set. Assuming user networking."
-    exit $USER_MODE
-fi
-
-if [[ "$CRC_NETWORK_MODE_USER" -eq 0 ]]; then
-    echo "network-mode 'system' detected"
-    exit $SYSTEM_MODE
-fi
-
-if [[ "$CRC_NETWORK_MODE_USER" -eq 1 ]]; then
-    echo "network-mode 'user' detected"
-    exit $USER_MODE
-fi
-
-echo "ERROR: unknown network mode: CRC_NETWORK_MODE_USER=$CRC_NETWORK_MODE_USER (expected 0 or 1). Assuming user networking"
-
-exit $USER_MODE
+case "${CRC_NETWORK_MODE_USER:-1}" in
+  0)
+    echo "network-mode 'system' detected"
+    exit ${SYSTEM_MODE}
+    ;;
+  1)
+    echo "network-mode 'user' detected"
+    exit ${USER_MODE}
+    ;;
+  *)
+    echo "ERROR: unknown network mode: CRC_NETWORK_MODE_USER=${CRC_NETWORK_MODE_USER} (expected 0 or 1). Assuming user networking" >&2
+    exit ${USER_MODE}
+    ;;
+esac

Please confirm how units consume this (e.g., ExecCondition should succeed only in user‑mode).

systemd/crc-pullsecret.sh (1)

17-26: Disable xtrace while touching decoded/secret material (defense‑in‑depth).

Trace won’t print data, but safest to mute it around secret inspection and patch.

-# check if the .auths field is there
-if oc get secret pull-secret \
+set +x
+# check if the .auths field is there
+if oc get secret pull-secret \
@@
-# Create the JSON patch in memory and pipe it to the oc command
+# Create the JSON patch in memory and pipe it to the oc command
 base64 -w0 < /opt/crc/pull-secret | \
   jq -R '{"data": {".dockerconfigjson": .}}' | \
   oc patch secret pull-secret \
      -n openshift-config \
      --type merge \
      --patch-file=/dev/stdin
+set -x

Also applies to: 35-41

systemd/ocp-custom-domain.sh (1)

3-7: Protect key material and clean up temp files.

Add restrictive umask and trap to remove /tmp/nip.* keys after use.

 set -o errtrace
 set -x
+umask 077
+trap 'rm -f /tmp/nip.key /tmp/nip.crt' EXIT
@@
 openssl req -newkey rsa:2048 -new \
         -nodes -x509 -days 3650 \
         -keyout /tmp/nip.key -out /tmp/nip.crt \

Also applies to: 22-26, 28-32

systemd/ocp-cluster-ca.sh (1)

38-46: Clean up sensitive temp files.

Add a trap to remove generated CA/key/CSR files.

+# cleanup sensitive files on exit
+trap 'rm -f "${CA_FILE_PATH}" "${CA_KEY_FILE_PATH}" "${CLIENT_CA_FILE_PATH}" \
+            "${CLIENT_CA_KEY_FILE_PATH}" "${CLIENT_CSR_FILE_PATH}"' EXIT
@@
 export KUBECONFIG=/opt/crc/kubeconfig
 rm -rf "${KUBECONFIG}"

Also applies to: 71-78

systemd/ocp-growfs.service (1)

4-4: Ensure producer–consumer ordering. After=crc-env-file-exists.path only waits for the file to appear, not to be fully populated. If another unit or script creates /etc/sysconfig/crc-env, you must also order on that producer (e.g. After=<writer-service>.service) or have it write atomically (e.g. via a rename) to prevent this service from reading a partial file.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 90db003 and b2b148e.

📒 Files selected for processing (33)
  • createdisk.sh (2 hunks)
  • systemd/crc-check-tap.service (1 hunks)
  • systemd/crc-cluster-status.service (1 hunks)
  • systemd/crc-cluster-status.sh (1 hunks)
  • systemd/crc-disable-tap.sh (1 hunks)
  • systemd/crc-dnsmasq.service (1 hunks)
  • systemd/crc-env-file-exists.path (1 hunks)
  • systemd/crc-needs-tap.sh (1 hunks)
  • systemd/crc-no-tap.service (0 hunks)
  • systemd/crc-no-tap.sh (0 hunks)
  • systemd/crc-pullsecret.service (1 hunks)
  • systemd/crc-pullsecret.sh (1 hunks)
  • systemd/crc-routes-controller.service (2 hunks)
  • systemd/crc-routes-controller.sh (1 hunks)
  • systemd/crc-self-sufficient-env.sh (1 hunks)
  • systemd/crc-systemd-common.sh (1 hunks)
  • systemd/crc-user-mode-networking.sh (1 hunks)
  • systemd/crc-wait-apiserver-up.service (1 hunks)
  • systemd/crc-wait-apiserver-up.sh (1 hunks)
  • systemd/dnsmasq.sh.template (1 hunks)
  • systemd/ocp-cluster-ca.service (0 hunks)
  • systemd/ocp-cluster-ca.sh (3 hunks)
  • systemd/ocp-clusterid.service (1 hunks)
  • systemd/ocp-clusterid.sh (1 hunks)
  • systemd/ocp-custom-domain.service (1 hunks)
  • systemd/ocp-custom-domain.sh (2 hunks)
  • systemd/ocp-growfs.service (1 hunks)
  • systemd/ocp-growfs.sh (2 hunks)
  • systemd/ocp-mco-sshkey.service (1 hunks)
  • systemd/ocp-mco-sshkey.sh (1 hunks)
  • systemd/ocp-userpasswords.service (1 hunks)
  • systemd/ocp-userpasswords.sh (1 hunks)
  • tools.sh (1 hunks)
💤 Files with no reviewable changes (3)
  • systemd/crc-no-tap.service
  • systemd/ocp-cluster-ca.service
  • systemd/crc-no-tap.sh
🚧 Files skipped from review as they are similar to previous changes (9)
  • systemd/crc-wait-apiserver-up.service
  • systemd/crc-cluster-status.service
  • systemd/crc-cluster-status.sh
  • systemd/dnsmasq.sh.template
  • systemd/ocp-growfs.sh
  • systemd/ocp-mco-sshkey.service
  • systemd/crc-wait-apiserver-up.sh
  • systemd/crc-routes-controller.sh
  • systemd/crc-dnsmasq.service
🧰 Additional context used
🧬 Code graph analysis (6)
systemd/ocp-mco-sshkey.sh (1)
systemd/crc-systemd-common.sh (1)
  • wait_for_resource_or_die (3-34)
systemd/ocp-cluster-ca.sh (1)
systemd/crc-systemd-common.sh (1)
  • wait_for_resource_or_die (3-34)
systemd/ocp-clusterid.sh (1)
systemd/crc-systemd-common.sh (1)
  • wait_for_resource_or_die (3-34)
systemd/crc-pullsecret.sh (1)
systemd/crc-systemd-common.sh (1)
  • wait_for_resource_or_die (3-34)
systemd/ocp-userpasswords.sh (1)
systemd/crc-systemd-common.sh (1)
  • wait_for_resource_or_die (3-34)
systemd/ocp-custom-domain.sh (1)
systemd/crc-systemd-common.sh (1)
  • wait_for_resource_or_die (3-34)
🪛 Shellcheck (0.11.0)
tools.sh

[warning] 225-225: Declare and assign separately to avoid masking return values.

(SC2155)

🔇 Additional comments (34)
systemd/crc-disable-tap.sh (4)

3-7: LGTM: Strict shell options improve reliability.

The combination of pipefail, errexit, nounset, and errtrace aligns with the PR's goal of improving script reliability.


12-12: Error suppression is acceptable but less precise than unit existence check.

The || true approach addresses the past review concern about errexit aborting when the unit is missing. However, it suppresses all errors, not just "unit does not exist." For more precise error handling, consider checking unit existence first as suggested in the past review.

Based on past review comments.


3-7: LGTM! Strict bash safety options are appropriate.

The comprehensive error handling (pipefail, errexit, nounset, errtrace) combined with execution tracing (set -x) ensures script failures are caught early and debuggable.


11-12: LGTM! Cleanup is properly idempotent.

The rm -f silently handles missing files, and || true on the systemctl command ensures the script succeeds even when the unit doesn't exist—addressing the concern from the past review without added complexity.

systemd/crc-env-file-exists.path (2)

1-8: LGTM: Clean path unit for environment file gating.

This path unit provides a clean dependency trigger for services that require /etc/sysconfig/crc-env to exist before starting. The implementation is straightforward and aligns with the PR's goal of improving startup reliability by removing direct EnvironmentFile references from multiple services.


1-9: LGTM! Clean path unit for startup coordination.

The path unit correctly gates dependent services on the existence of /etc/sysconfig/crc-env, providing a cleaner dependency mechanism than scattered EnvironmentFile directives.

systemd/crc-self-sufficient-env.sh (7)

3-6: LGTM: Strict shell options improve reliability.

The combination of pipefail, errexit, nounset, and errtrace aligns with the PR's goal of improving script reliability.


8-8: Good: Graceful fallback when environment file is missing.

Using || echo "WARNING: crc-env not found" allows the script to continue even when the environment file is absent, which is appropriate given the subsequent guard logic.


10-13: LGTM: Safe parameter expansion prevents unset variable errors.

Using ${CRC_SELF_SUFFICIENT:-} with the :- operator safely handles the case where the variable is unset, preventing nounset from aborting the script.


15-15: Typo fixed.

The typo "environement" → "environment" has been corrected.

Based on past review comments.


3-6: LGTM! Consistent strict bash safety.

The error handling options match the pattern established across other scripts in this PR.


15-15: Typo already fixed.

The past review comment noted "environement" → "environment"; this has been corrected.


8-8: Ignore source fallback warning; fallback is effective under set -o errexit. Bash does not exit on a failing command that’s part of an || chain, so source /etc/sysconfig/crc-env || echo… will catch any non-zero exit (missing file or errors inside) and run the warning. No change needed.

Likely an incorrect or invalid review comment.

systemd/ocp-growfs.service (1)

4-4: LGTM: Consistent dependency pattern.

Adding After=crc-env-file-exists.path aligns with the PR's broader pattern of gating service startup on environment file presence. Since crc-self-sufficient-env.sh (called in ExecCondition) sources /etc/sysconfig/crc-env directly, removing the external EnvironmentFile reference is safe.

systemd/ocp-custom-domain.service (4)

3-3: LGTM: Consistent dependency pattern.

Adding After=crc-env-file-exists.path aligns with the PR's broader pattern of gating service startup on environment file presence. Since crc-self-sufficient-env.sh (called in ExecCondition) sources /etc/sysconfig/crc-env directly, removing the external EnvironmentFile reference is safe.


12-12: RestartSec value is appropriate.

The RestartSec=40 value provides reasonable spacing between restart attempts and avoids the premature start-limit lockout concern raised in past reviews about a 10s value. Combined with StartLimitIntervalSec=450 and StartLimitBurst=10, the unit has adequate protection against rapid retry exhaustion.

Based on past review comments.


3-3: LGTM! Consistent dependency pattern.

Adding After=crc-env-file-exists.path aligns this service with the new environment file gating strategy used across other units in this PR.


12-12: RestartSec already at recommended value.

The past review comment raised concerns about a 10s restart interval; the current code shows 40s, which is appropriate given the start limits (10 retries over 450s).

systemd/ocp-clusterid.sh (3)

3-6: LGTM! Strict mode improves safety.

Enabling pipefail, errexit, nounset, and errtrace ensures the script fails fast on errors, unset variables, and pipeline failures, which aligns with the PR's goal to improve reliability.


11-11: LGTM! Fail-fast behavior improves reliability.

Replacing wait_for_resource with wait_for_resource_or_die ensures the script terminates immediately if the clusterversion resource is not available within the retry limit, rather than silently continuing.


15-16: LGTM! Patch generation via jq is more robust.

Generating the patch JSON with jq and piping to --patch-file=/dev/stdin is cleaner and safer than inline JSON strings, reducing the risk of quoting or escaping errors.

systemd/crc-routes-controller.service (2)

3-3: Consistent use of crc-env-file-exists.path.

This aligns with the pattern in systemd/ocp-clusterid.service. Note the inconsistency with systemd/crc-pullsecret.service which uses cloud-init-crc-env.path (see comment on systemd/ocp-clusterid.service line 3).


12-12: No issues found in networking mode gating logic
The script systemd/crc-user-mode-networking.sh exists and correctly returns exit code 0 for user-mode networking (default/empty, CRC_NETWORK_MODE_USER=1, or unknown) and exit code 1 only for system-mode (CRC_NETWORK_MODE_USER=0), matching the intended service gating.

systemd/ocp-userpasswords.service (1)

4-4: LGTM! Consistent path unit gating.

Adding After=crc-env-file-exists.path is consistent with the pattern in other services. Note the inconsistency with systemd/crc-pullsecret.service which uses cloud-init-crc-env.path (see comment on systemd/crc-pullsecret.service line 3).

systemd/ocp-mco-sshkey.sh (4)

3-7: LGTM! Strict shell options improve reliability.

The combination of pipefail, errexit, nounset, and errtrace ensures early detection of errors, which is essential for system bring-up scripts.


14-17: LGTM! Appropriate pre-condition check.

Early exit when the public key file is missing prevents downstream failures.


19-38: LGTM! Safe JSON construction with jq resolves previous fragility.

The wait_for_resource_or_die pre-check and jq-based JSON builder address the previous review concern about brittle string concatenation. Using --arg ensures the public key is properly escaped, and piping to --patch-file=/dev/stdin is clean.


40-42: LGTM! Explicit success feedback.

Clear success messaging and exit code improve observability.

createdisk.sh (1)

112-119: LGTM! Conditional service startup improves reliability.

Adding After=crc-check-tap.service ensures proper ordering, and using ExecCondition to gate startup based on /usr/local/bin/crc-needs-tap.sh is cleaner than polling in ExecStartPre. This aligns with the PR's goal of improving bring-up script reliability.

systemd/crc-needs-tap.sh (2)

3-7: LGTM! Strict shell options improve reliability.

The combination of strict options ensures early error detection, consistent with other scripts in this PR.


43-50: LGTM! Self-sufficient bundle check and safe default.

The self-sufficient bundle check provides appropriate override behavior, and defaulting to DONT_NEED_TAP is the safe choice when environment is unknown.

systemd/crc-check-tap.service (1)

1-16: LGTM! Service structure is correct.

The unit ordering, ExecCondition negation logic, and dependency on crc-needs-tap.sh are appropriate. However, ensure the critical issue in systemd/crc-needs-tap.sh (Apple virtualization returning wrong exit code) is fixed, as this unit depends on that script's correctness.

systemd/crc-systemd-common.sh (2)

3-12: LGTM! Explicit naming and input validation improve reliability.

Renaming to wait_for_resource_or_die clearly signals the fail-fast behavior, and the input validation with usage message prevents misuse.


14-31: LGTM! Improved retry loop with progress feedback.

The for loop with explicit retry counter provides better observability than the previous until loop. Progress messages and error output to stderr are appropriate.

@kpouget kpouget changed the title WIP: Improve the reliability of the system bring-up scripts Improve the reliability of the system bring-up scripts Oct 2, 2025
(will be removed in a follow up commit)
* Make more resilient by failing on any error
* Improve the logging and argument validation
* Use `jq` to to guarantee that the patch file is valid JSON (will be
updated in a follow up commit to avoid passing the pub key in the CLI)
* Better use of script constants
* Better validation of the arguments
* Better logging

File will be further updated to prevent leaking passwords in the
journal logs.
No need to sleep `5s` here, the SystemD dependencies should enforce
the correct ordering.
Define the `KUBECONFIG` in the Systemd unit file, so that the CRC
scripts don't have to care about it.

```
Environment=KUBECONFIG=/opt/kubeconfig
```

Gives a better separation of concerns.
Introduce the `ocp-apiservices-available.sh` script, which waits for
the `apiservices` to be all available.

The APIServices are made up of two groups:
- the K8s APIs, which are always available (pods, secrets, configmaps,...)
- the OCP APIs, which need OCP Operators and Pods to be ready (routes, projects, ...)

This script waits for the second group to finish its initialization.
The `crc-wait-apiserver-up` wait for the K8s APIServer to be up and
running.

This commit makes the patch try 60 times with 5s delay, instead of 5
times with 60s delay. The script becomes more reactive to the
APIServer activation.
This script add a synchronization point on the `ready` status of the
CRC node.

Before the node is ready, services can interact with the K8s
APIServer, but user (and OCP) services won't start their deployment
before the CRC node is ready.

This synchronization point avoid that other services (like the
`ocp-wait-apiservices`) wait in vain while their target didn't start
their own deployment.
This script has been broken for a while, will be handled by cloud-init.
…om AWS-CLI

This script offloads `mapt` and other AWS deployers from the task of
fetching the secrets from AWS IMDS service.

This script should be include in the `cloud-init` user-data
configuration file, with this kind of invocation:
```
/usr/local/bin/crc-aws-fetch-secrets.sh \
  "{{ .SSMPullSecretName }}" \
  "{{ .SSMKubeAdminPasswordName }}" \
  "{{ .SSMDeveloperPasswordName }}"
```

where the parameters specify the location of the three secrets in the
IMDS store.
This commit moves the definition of the secret file locations from the
scripts to the SystemD unit.

This way, SystemD can enforce that the files exist before launching
the relevant services.
A review of the systemd journal logs of the different services
highlighted that the SystemD journal captures information about the
Podman containers via a Podman-internal logging mechanism.

This commit disables the logging mechanism to the containers handling
secrets.
To ease the quick glance review of the CRC boot timing, this scripts
adds a simple timing measurement, based on Bash's `SECONDS` special
variable (automatically tracking time past after its `SECONDS=0`
reset).

For a stronger time tracking, refer to the journal timestamps of the
services.
…onfig-ovewrite directories

SystemD allows overwriting the definition of services by writing new
properties in the `unit-name.service.d/override.conf` files.

This commit allows the CRC image creation script to properly upload
these files and directories to the VM image.
This override prevents the `ovs-configuration.service` from logging
its xtrace execution into the journal and the console. This service is
very verbose, and makes the console impossible to follow in real-time.

Instead, its output is logged in a `/var/log` file.
This commit reduces the Restart duration of the service.
The SystemD dependencies should already avoid any failure of the
script.
Add a dependency on the `cloud-final.service`, to be sure that the
pull-secrets have been pulled when the service starts.
Add a dependency on the `cloud-final.service` to be sure that the CRC
passwords have been fetched before starting.
Reformulate the dependencies of the `crc-custom.target` to avoid
startup deadlocks.

Load `crc-custom.target` as a dependency of the `kubelet.service`.
Ensure that the pub key has been fetched before starting the service.
Better use of `jq` to ensure that the public key isn't exposed in the
journal logs.

Exposing a public key isn't a security leak, but better avoid
disclosing it as a good practice.
`AssertPathExists` is checked before the condition is tested. Use a
`ExecStartPre` directive instead.
```
ExecCondition=/usr/local/bin/crc-self-sufficient-env.sh
```
…test

Not working before the network is established
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (5)
systemd/ocp-custom-domain.sh (1)

64-117: Wait for the concrete objects before patching.

wait_for_resource_or_die only checks that oc get <resource> exits 0. Commands like oc get ingresses.config.openshift.io or oc get route.route.openshift.io succeed even while the singleton (cluster) or default-route object is still absent, so we can race straight into the subsequent oc patch … and abort under set -e. Gate on the actual targets before patching.

-wait_for_resource_or_die ingresses.config.openshift.io
+wait_for_resource_or_die ingresses.config.openshift.io/cluster-wait_for_resource_or_die apiserver.config.openshift.io
+wait_for_resource_or_die apiserver.config.openshift.io/cluster-wait_for_resource_or_die route.route.openshift.io
+wait_for_resource_or_die "route/default-route -n openshift-image-registry"
systemd/ocp-cluster-ca.sh (1)

21-24: Don't exit just because the ConfigMap exists.

If a previous run created client-ca-custom but failed before the oc patch call, this guard will exit every retry, leaving the API server stuck on the old client CA forever. Query the apiserver (.spec.clientCA.name) before deciding to skip the rotation, or drop the exit and let the flow proceed so the patch is reapplied when needed.

-if oc get configmap client-ca-custom -n openshift-config 2>/dev/null; then
-    echo "API Server Client CA already rotated..."
-    exit 0
-fi
+current_client_ca="$(oc get apiserver cluster -o jsonpath='{.spec.clientCA.name}' 2>/dev/null || true)"
+if [[ "${current_client_ca}" == "client-ca-custom" ]]; then
+    echo "API Server Client CA already rotated..."
+    exit 0
+fi
systemd/crc-test-vsock.py (1)

30-34: Fix ioctl argument type and add portable fallback for the constant.

Line 32 passes a string ' ' to fcntl.ioctl, which will raise TypeError in Python 3 (requires bytes). Additionally, socket.IOCTL_VM_SOCKETS_GET_LOCAL_CID may not be available on all Python builds, causing AttributeError.

Apply this diff to fix both issues:

 print(f"Looking up the CID in {VSOCK_DEV}...")
+IOCTL_VM_SOCKETS_GET_LOCAL_CID = getattr(socket, "IOCTL_VM_SOCKETS_GET_LOCAL_CID", 0x7B9)
-with open(VSOCK_DEV, 'rb') as f:
-    r = fcntl.ioctl(f, socket.IOCTL_VM_SOCKETS_GET_LOCAL_CID, '    ')
+with open(VSOCK_DEV, "rb", buffering=0) as f:
+    r = fcntl.ioctl(f, IOCTL_VM_SOCKETS_GET_LOCAL_CID, bytes(4))
     cid = struct.unpack('I', r)[0]
-print(f'Our vsock CID is {cid}.')
+print(f"Our vsock CID is {cid}.")
tools.sh (1)

224-228: Tighten htpasswd creation: enforce restrictive umask

The bcrypt hashes we drop into pass_file can still leak if the file inherits a permissive default umask (0644 on most hosts). That leak was flagged earlier, but the current revision still creates the file without tightening permissions, so other users on the host can read and reuse the credential store. Please restore the hardening by setting a restrictive umask before the two htpasswd calls, and (bonus) move the local out of the subshell and pass -- to guard against paths that start with -.

 function generate_htpasswd_file {
    local auth_file_dir=$1
    local pass_file=$2
+   local random_password
    (
        set +x # use a subshell to avoid leaking the password
-       local random_password
-       random_password=$(cat "$auth_file_dir/auth/kubeadmin-password")
-       "${HTPASSWD}" -c -B -i "${pass_file}" developer <<< "developer" # use -c to create the file
-       "${HTPASSWD}" -B -i "${pass_file}" kubeadmin <<< "${random_password}" # append to the existing password file
+       umask 077
+       random_password="$(<"${auth_file_dir}/auth/kubeadmin-password")"
+       "${HTPASSWD}" -c -B -i -- "${pass_file}" developer <<<"developer" # use -c to create the file
+       "${HTPASSWD}" -B -i -- "${pass_file}" kubeadmin <<<"${random_password}" # append to the existing password file
    )
 }
systemd/ocp-userpasswords.sh (1)

27-53: Stop leaking passwords through process arguments.

htpasswd -nb "$1" "$2" exposes the cleartext password in the container’s argv, so anyone able to run ps while the unit executes can read both credentials. Pipe the password via stdin and use htpasswd -ni (or equivalent) so it never shows up in the process list, and keep tracing disabled only around the secret-handling block.

Here's a concrete fix:

-    podman run --log-driver=none --rm "$CRC_HTPASSWD_IMAGE" htpasswd -nb "$1" "$2"
+    printf '%s' "$2" | podman run --log-driver=none --rm \
+        "$CRC_HTPASSWD_IMAGE" htpasswd -ni "$1"

Then re-enable set -x after both gen_htpasswd calls, once secrets are out of scope.
[suggest_s essential? no -> major only? Already given fix; no extra tag]

🧹 Nitpick comments (3)
systemd/ocp-wait-apiservices-available.sh (1)

43-54: Display all counted unavailable services for complete diagnostics.

The counting logic (lines 22-27) correctly treats services without Available == "True" as unavailable (including Unknown and missing conditions), but the diagnostic output (lines 43-51) only displays services with Available == "False". This inconsistency means operators see a count of N unavailable services but fewer than N services listed, which can be confusing during troubleshooting.

Update the jq query to display all services counted as unavailable:

     if ! oc get apiservices -o json | jq -r '
       .items[] |
       . as $item |
-      (
-        $item.status.conditions[]? |
-        select(.type == "Available" and .status == "False")
-      ) |
-      "  - \($item.metadata.name): \(.reason) - \(.message)"
+      select(
+        (($item.status.conditions // [])
+          | any(.type=="Available" and .status=="True")) | not
+      ) |
+      (($item.status.conditions // [])
+        | map(select(.type == "Available"))[0]
+        | "  - \($item.metadata.name): \(.status // "Unknown") - \(.reason // "N/A") - \(.message // "No message")"
+      )
     '
systemd/crc-needs-tap.sh (1)

13-13: Minor: Unused variable definition.

EXIT_ERROR is defined but never used in any exit path. All exits use either EXIT_NEED_TAP or EXIT_DONT_NEED_TAP.

Consider removing this unused variable:

-EXIT_ERROR=1
systemd/crc-pullsecret.sh (1)

50-63: Optional: Consider re-enabling xtrace after secret generation.

Xtrace is disabled at line 50 to protect passwords but never re-enabled. This is safe but means the final oc create secret operations (lines 57-61) won't have verbose logging.

If verbose logging is desired for the secret creation phase, consider:

 echo "Generating the kubeadmin and developer passwords ..."
 set +x # disable the logging to avoid leaking the passwords
 
 dev_pass=$(gen_htpasswd developer "$(cat "$CRC_PASS_DEVELOPER_PATH")")
 adm_pass=$(gen_htpasswd kubeadmin "$(cat "$CRC_PASS_KUBEADMIN_PATH")")
+set -x # re-enable tracing after passwords are generated
 
 echo "creating the password secret ..."
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f671d47 and 664f723.

📒 Files selected for processing (44)
  • createdisk-library.sh (2 hunks)
  • createdisk.sh (2 hunks)
  • docs/self-sufficient-bundle.md (0 hunks)
  • systemd/crc-aws-fetch-secrets.sh (1 hunks)
  • systemd/crc-check-tap.service (1 hunks)
  • systemd/crc-cluster-status.service (1 hunks)
  • systemd/crc-cluster-status.sh (2 hunks)
  • systemd/crc-conditionally-disable-tap.sh (1 hunks)
  • systemd/crc-custom.target (1 hunks)
  • systemd/crc-disable-tap.sh (1 hunks)
  • systemd/crc-dnsmasq.service (0 hunks)
  • systemd/crc-needs-tap.sh (1 hunks)
  • systemd/crc-no-tap.service (0 hunks)
  • systemd/crc-no-tap.sh (0 hunks)
  • systemd/crc-pullsecret.service (1 hunks)
  • systemd/crc-pullsecret.sh (1 hunks)
  • systemd/crc-routes-controller.service (1 hunks)
  • systemd/crc-routes-controller.sh (1 hunks)
  • systemd/crc-self-sufficient-env.sh (1 hunks)
  • systemd/crc-systemd-common.sh (1 hunks)
  • systemd/crc-test-vsock.py (1 hunks)
  • systemd/crc-user-mode-networking.sh (1 hunks)
  • systemd/crc-wait-apiserver-up.service (1 hunks)
  • systemd/crc-wait-apiserver-up.sh (1 hunks)
  • systemd/crc-wait-node-ready.service (1 hunks)
  • systemd/crc-wait-node-ready.sh (1 hunks)
  • systemd/dnsmasq.sh.template (1 hunks)
  • systemd/kubelet.service.d/wants-crc-custom.conf (1 hunks)
  • systemd/ocp-cluster-ca.service (2 hunks)
  • systemd/ocp-cluster-ca.sh (2 hunks)
  • systemd/ocp-clusterid.service (1 hunks)
  • systemd/ocp-clusterid.sh (1 hunks)
  • systemd/ocp-custom-domain.service (2 hunks)
  • systemd/ocp-custom-domain.sh (1 hunks)
  • systemd/ocp-growfs.service (0 hunks)
  • systemd/ocp-growfs.sh (0 hunks)
  • systemd/ocp-mco-sshkey.service (1 hunks)
  • systemd/ocp-mco-sshkey.sh (1 hunks)
  • systemd/ocp-userpasswords.service (2 hunks)
  • systemd/ocp-userpasswords.sh (1 hunks)
  • systemd/ocp-wait-apiservices-available.service (1 hunks)
  • systemd/ocp-wait-apiservices-available.sh (1 hunks)
  • systemd/ovs-configuration.service.d/mute-console.conf (1 hunks)
  • tools.sh (1 hunks)
💤 Files with no reviewable changes (6)
  • systemd/crc-no-tap.service
  • systemd/crc-dnsmasq.service
  • systemd/ocp-growfs.sh
  • docs/self-sufficient-bundle.md
  • systemd/crc-no-tap.sh
  • systemd/ocp-growfs.service
✅ Files skipped from review due to trivial changes (1)
  • systemd/kubelet.service.d/wants-crc-custom.conf
🚧 Files skipped from review as they are similar to previous changes (14)
  • systemd/crc-aws-fetch-secrets.sh
  • systemd/ocp-cluster-ca.service
  • systemd/crc-wait-node-ready.service
  • systemd/dnsmasq.sh.template
  • systemd/crc-cluster-status.service
  • systemd/ocp-userpasswords.service
  • createdisk-library.sh
  • systemd/ocp-custom-domain.service
  • systemd/ocp-wait-apiservices-available.service
  • systemd/crc-routes-controller.service
  • systemd/crc-routes-controller.sh
  • createdisk.sh
  • systemd/crc-custom.target
  • systemd/ocp-clusterid.sh
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-10-02T12:58:51.079Z
Learnt from: kpouget
PR: crc-org/snc#1168
File: systemd/crc-pullsecret.sh:11-11
Timestamp: 2025-10-02T12:58:51.079Z
Learning: In systemd/crc-pullsecret.sh, the wait_for_resource_or_die secret call is intentionally checking that the Kubernetes secret API is operational (part of the OCP/K8s boot process), not waiting for a specific secret resource to exist. The script relies on subsequent oc commands and systemd retry logic to handle cases where specific resources aren't available yet.

Applied to files:

  • systemd/crc-pullsecret.service
  • systemd/crc-wait-apiserver-up.sh
  • systemd/ocp-userpasswords.sh
  • systemd/ocp-custom-domain.sh
  • systemd/crc-systemd-common.sh
  • systemd/crc-pullsecret.sh
  • systemd/ocp-cluster-ca.sh
📚 Learning: 2025-10-07T13:05:57.801Z
Learnt from: kpouget
PR: crc-org/snc#1168
File: systemd/crc-systemd-common.sh:5-7
Timestamp: 2025-10-07T13:05:57.801Z
Learning: In systemd/crc-systemd-common.sh and similar system scripts, extensive input validation (like checking if parameters are numeric or within specific ranges) is not required for internal functions. The codebase prefers simpler, less defensive coding for these utilities.

Applied to files:

  • systemd/crc-systemd-common.sh
📚 Learning: 2025-10-02T13:13:28.229Z
Learnt from: kpouget
PR: crc-org/snc#1168
File: systemd/dnsmasq.sh.template:3-12
Timestamp: 2025-10-02T13:13:28.229Z
Learning: The file `/etc/sysconfig/crc-env` in the crc-org/snc repository contains configuration variables and file paths (e.g., `CRC_PASS_KUBEADMIN_PATH`, `CRC_PUB_KEY_PATH`, `CRC_NETWORK_MODE_USER`, `APPS_DOMAIN`), not the actual secrets. The secrets themselves are stored in separate files referenced by these paths. Sourcing `crc-env` with xtrace enabled does not leak sensitive information.

Applied to files:

  • systemd/ocp-mco-sshkey.service
  • systemd/crc-needs-tap.sh
  • systemd/crc-self-sufficient-env.sh
📚 Learning: 2025-10-07T07:03:06.884Z
Learnt from: kpouget
PR: crc-org/snc#1168
File: systemd/crc-self-sufficient-env.sh:10-17
Timestamp: 2025-10-07T07:03:06.884Z
Learning: In the crc-org/snc repository, the CRC_CLOUD environment variable has been removed from the codebase. The self-sufficient environment check in systemd/crc-self-sufficient-env.sh now only validates CRC_SELF_SUFFICIENT=1, not CRC_CLOUD.

Applied to files:

  • systemd/crc-self-sufficient-env.sh
🧬 Code graph analysis (6)
systemd/crc-wait-apiserver-up.sh (1)
systemd/crc-systemd-common.sh (1)
  • wait_for_resource_or_die (3-52)
systemd/ocp-userpasswords.sh (1)
systemd/crc-systemd-common.sh (1)
  • wait_for_resource_or_die (3-52)
systemd/ocp-custom-domain.sh (2)
systemd/crc-systemd-common.sh (1)
  • wait_for_resource_or_die (3-52)
systemd/ocp-cluster-ca.sh (1)
  • cleanup (38-42)
systemd/crc-pullsecret.sh (1)
systemd/crc-systemd-common.sh (1)
  • wait_for_resource_or_die (3-52)
systemd/ocp-cluster-ca.sh (2)
systemd/crc-systemd-common.sh (1)
  • wait_for_resource_or_die (3-52)
systemd/ocp-custom-domain.sh (1)
  • cleanup (42-45)
systemd/ocp-mco-sshkey.sh (1)
systemd/crc-systemd-common.sh (1)
  • wait_for_resource_or_die (3-52)
🪛 Ruff (0.14.0)
systemd/crc-test-vsock.py

1-1: Shebang is present but file is not executable

(EXE001)

🪛 Shellcheck (0.11.0)
systemd/crc-needs-tap.sh

[warning] 13-13: EXIT_ERROR appears unused. Verify use (or export if used externally).

(SC2034)

🔇 Additional comments (19)
systemd/ovs-configuration.service.d/mute-console.conf (1)

1-3: LGTM! Previous logging issue has been addressed.

The use of append: correctly preserves logs across service restarts, addressing the truncation concern from the previous review.

systemd/crc-pullsecret.service (2)

6-6: Increased retry attempts improve reliability.

The combination of StartLimitBurst=40 and RestartSec=10 maintains the same total retry window (~400 seconds within the 450-second interval) but provides 4× more retry attempts compared to the previous configuration. This improves the service's ability to recover from transient failures during cluster bring-up.

Also applies to: 12-12


13-16: No issues found. The script correctly accepts the path as a positional argument (PULL_SECRETS_FILE="${1:-}"), does not depend on /etc/sysconfig/crc-env, and has no CRC_* environment variable dependencies. The systemd service change is compatible with the current script implementation.

systemd/crc-disable-tap.sh (1)

11-12: Properly handles missing units and files.

The use of rm -f safely removes the NM connection file even if absent, and || true on the systemctl command ensures the script succeeds even when the unit doesn't exist. This correctly addresses the past review concern about handling environments where [email protected] isn't present.

systemd/crc-self-sufficient-env.sh (1)

11-30: Inverted exit logic is correct for ExecCondition semantics.

The script intentionally exits 0 when vsock doesn't work (exit 1 from Python) and exits 1 when vsock works (exit 0 from Python). This inverted logic is correct for use as ExecCondition in systemd units: the condition succeeds (allowing the unit to run) when the system is self-sufficient (vsock unavailable), and fails (skipping the unit) when vsock is available (not self-sufficient).

The return code handling is implemented correctly:

  • Timeout (124): propagates error
  • vsock not working (1): exits 0 to proceed with self-sufficient setup
  • vsock working (0): exits 1 to skip self-sufficient setup
  • Other errors: propagated
systemd/crc-conditionally-disable-tap.sh (1)

10-17: Wrapper logic is clean and efficient.

The conditional logic correctly delegates to either an early exit or the disable script based on TAP necessity. Using exec to replace the current process with crc-disable-tap.sh is efficient and ensures proper signal handling and exit status propagation.

systemd/ocp-wait-apiservices-available.sh (1)

14-65: Robust error handling and retry logic.

The script correctly handles transient oc failures, uses a sentinel value (999) to keep retrying when the API is unreachable, and employs proper defaults for the unavailable count. The timeout and retry parameters provide reasonable balance between responsiveness and reliability.

systemd/crc-wait-apiserver-up.service (1)

10-10: crc-wait-apiserver-up.service: EnvironmentFile removal is safe; crc-wait-apiserver-up.sh only depends on KUBECONFIG (and built-in SECONDS).

systemd/ocp-mco-sshkey.service (1)

12-15: Scripts have been properly refactored and no longer depend on environment variables.

Verification confirms:

  • ocp-mco-sshkey.sh accepts the public key path as $1 and does not source crc-env.
  • crc-self-sufficient-env.sh contains no references to CRC_SELF_SUFFICIENT, CRC_CLOUD, or crc-env.

The removal of EnvironmentFile=/etc/sysconfig/crc-env is correct. The scripts have been properly updated to not depend on environment variables from this file.

systemd/ocp-clusterid.service (1)

11-11: KUBECONFIG is sufficient for the scripts; no additional environment variables are required.

Verification shows that:

  • ocp-clusterid.sh sources crc-systemd-common.sh and uses only oc and uuidgen commands
  • crc-systemd-common.sh uses only oc get to check cluster resources
  • crc-self-sufficient-env.sh runs only a Python test for vsock connectivity
  • None of these scripts reference CRC_* variables or other environment variables

Setting Environment=KUBECONFIG=/opt/kubeconfig is sufficient for the oc commands to function correctly. The removal of EnvironmentFile is appropriate.

systemd/crc-wait-apiserver-up.sh (1)

3-20: LGTM! Strict shell options and improved observability.

The addition of strict shell options (pipefail, errexit, nounset, errtrace) improves reliability, and the timing metrics with progress messages provide good visibility into the boot process.

systemd/crc-wait-node-ready.sh (1)

1-40: LGTM! Node readiness polling with proper error handling.

The script correctly handles set -o errexit by using || true on line 16, ensuring the loop continues when oc get fails during node initialization. The fallback value on line 17 provides clear status reporting.

systemd/crc-systemd-common.sh (1)

3-52: LGTM! Well-structured wait helper with appropriate validation.

The function renaming to wait_for_resource_or_die clearly communicates the exit-on-failure behavior. The xtrace handling, timing metrics, and input validation are all appropriate. Based on learnings, the validation level is correct for internal functions in this codebase.

systemd/crc-needs-tap.sh (2)

7-9: Confirm: set -x with crc-env sourcing is safe.

The script enables xtrace while sourcing /etc/sysconfig/crc-env. Based on learnings, this file contains configuration variables and file paths (not actual secrets), so xtrace does not leak sensitive information.


15-45: LGTM! Clear TAP detection logic.

The detection flow is well-structured:

  1. Self-sufficient bundles don't need TAP (exits early)
  2. System-mode networking doesn't need TAP (exits early)
  3. Virtualization detection determines TAP necessity (apple/none: no, others: yes)
systemd/crc-check-tap.service (1)

1-14: LGTM! Proper systemd unit ordering and activation.

The unit correctly orders TAP checking before NetworkManager and gv-user-network services, and the WantedBy directives ensure proper activation.

systemd/crc-pullsecret.sh (2)

11-54: LGTM! Secure argument-driven secret handling.

The script correctly:

  • Accepts the pull-secrets file path as an argument
  • Uses wait_for_resource_or_die to verify API readiness (per learnings)
  • Validates input file existence and content
  • Streams secrets securely via stdin to avoid command-line exposure

11-65: LGTM! Secure argument-driven password handling.

The script correctly:

  • Accepts password file paths as arguments
  • Uses --log-driver=none to prevent password leakage in podman logs
  • Disables xtrace (line 50) before generating passwords to avoid shell trace exposure
  • Validates inputs before processing
systemd/crc-user-mode-networking.sh (1)

1-65: LGTM! Robust user-mode networking detection.

The script properly:

  • Handles target argument to invert exit codes based on search criteria
  • Validates CRC_NETWORK_MODE_USER is set and numeric (0 or 1)
  • Handles self-sufficient bundles correctly
  • Provides clear error messages

The unreachable error handler (lines 60-65) is defensive but harmless.

@praveenkumar
Copy link
Member

/retest

@kpouget
Copy link
Contributor Author

kpouget commented Oct 17, 2025

install process got interrupted :/

+ sudo virt-install --name crc --vcpus 6 --memory 14336 --arch=x86_64 --disk path=/var/lib/libvirt/crc/crc.qcow2,size=31 --network network=crc,mac=52:54:00:ee:42:e1 --os-variant rhel9-unknown --nographics --cdrom /var/lib/libvirt/crc/rhcos-live.iso --events on_reboot=restart --autoconsole none --boot uefi --wait

Starting install...

Allocating 'crc.qcow2'                                      |  31 GB  00:00     

Creating domain...                                          |         00:00     
{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:173","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.ExecuteProcess","level":"error","msg":"Entrypoint received interrupt: terminated","severity":"error","time":"2025-10-16T22:39:32Z"}
Connection to 34.45.13.221 closed by remote host.

Recommendation: To check for possible causes of SSH connectivity issues and get
recommendations, rerun the ssh command with the --troubleshoot option.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants