Commit 1d9d3e6

feat(gpu): Add robust proxy support for driver installation (#1361)
This PR introduces comprehensive HTTP/S proxy support for the GPU driver installation script, enabling its use in environments with restricted internet egress, such as those using Secure Web Proxy.

The `set_proxy` function, controlled by the `http-proxy` and new `http-proxy-pem-uri` metadata attributes, now configures APT, GPG, Java, pip, and Conda to route traffic through the specified proxy. If a PEM certificate URI is provided, the certificate is installed into the OS, Conda, and Java trust stores. The script now correctly handles the proxy scheme (HTTP vs HTTPS) based on the presence of the `http-proxy-pem-uri` metadata.

This change was validated in a development environment where all internet access was routed through an explicit proxy.

Additional changes:

- `README.md` updated to document the new `http-proxy-pem-uri` metadata option and clarify `http-proxy` usage.
- GCS caching for the NVIDIA driver is checked earlier to avoid unnecessary HEAD requests to the NVIDIA CDN.
- `configure_dkms_certs` is now more idempotent.
- Spark RAPIDS versions and repository URL aligned with `spark-rapids/spark-rapids.sh` as part of a move towards a unified GPU/RAPIDS installation script.
- Switched to using `/sys/bus/pci/devices/*/uevent` for GPU detection to remove the dependency on pciutils.
- Moved `set_proxy` call earlier in `prepare_to_install`.
- Refactored `no_proxy` and `nvcc_gencode` list generation.

fix(ci): Add retry logic to kubectl logs in presubmit

- Wrapped `kubectl logs` command in `run-presubmit-on-k8s.sh` with a retry loop to handle transient "No agent available" errors from GKE.
1 parent 17b1f6e commit 1d9d3e6
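
The commit message says the proxy scheme is chosen based on whether `http-proxy-pem-uri` is set. The changed GPU script itself is not shown on this page, so the following is only a minimal sketch of that idea, not the actual `set_proxy` code. The function names `proxy_url` and `export_proxy_env` are hypothetical, and the mapping "PEM URI present means `https`" is an assumption drawn from the description above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; NOT the real set_proxy implementation.

# Build the proxy URL. Assumption: a non-empty PEM URI means the proxy
# speaks TLS, so the scheme becomes https; otherwise plain http is used.
proxy_url() {
  local proxy_addr="$1"   # value of the http-proxy metadata (HOST:PORT)
  local pem_uri="${2:-}"  # value of http-proxy-pem-uri, may be empty
  local scheme="http"
  if [[ -n "${pem_uri}" ]]; then
    scheme="https"
  fi
  printf '%s://%s\n' "${scheme}" "${proxy_addr}"
}

# Export the environment variables that many command-line tools honor.
export_proxy_env() {
  local url
  url="$(proxy_url "$@")"
  export http_proxy="${url}" https_proxy="${url}"
  export HTTP_PROXY="${url}" HTTPS_PROXY="${url}"
}
```

Tools such as APT, Java, and Conda generally need their own configuration in addition to these environment variables, which is presumably why the commit configures each of them separately.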

4 files changed: +288 -43 lines changed

cloudbuild/run-presubmit-on-k8s.sh

Lines changed: 32 additions & 5 deletions
@@ -42,19 +42,46 @@ EOF
 kubectl apply -f $POD_CONFIG
 
 # Delete POD on exit and describe it before deletion if exit was unsuccessful
-trap '[[ $? != 0 ]] && kubectl describe "pod/${POD_NAME}"; kubectl delete pods "${POD_NAME}"' EXIT
+trap 'exit_code=$?
+if [[ ${exit_code} != 0 ]]; then
+  echo "Presubmit failed for ${POD_NAME}. Describing pod..."
+  kubectl describe "pod/${POD_NAME}" || echo "Failed to describe pod."
+
+  PROJECT_ID=$(gcloud config get-value project 2>/dev/null || echo "unknown-project")
+  BUCKET="dataproc-init-actions-test-${PROJECT_ID}"
+  LOG_GCS_PATH="gs://${BUCKET}/${BUILD_ID}/logs/${POD_NAME}.log"
+
+  echo "Attempting to upload logs to ${LOG_GCS_PATH}"
+  if kubectl logs "${POD_NAME}" | gsutil cp - "${LOG_GCS_PATH}"; then
+    echo "Logs for failed pod ${POD_NAME} uploaded to: ${LOG_GCS_PATH}"
+  else
+    echo "Log upload to ${LOG_GCS_PATH} failed."
+  fi
+fi
+echo "Deleting pod ${POD_NAME}..."
+kubectl delete pods "${POD_NAME}" --ignore-not-found=true
+exit ${exit_code}' EXIT
 
 kubectl wait --for=condition=Ready "pod/${POD_NAME}" --timeout=15m
 
+# To mitigate problems with early test failure, retry kubectl logs
+sleep 10s
 while ! kubectl describe "pod/${POD_NAME}" | grep -q Terminated; do
-  kubectl logs -f "${POD_NAME}" --since-time="${LOGS_SINCE_TIME}" --timestamps=true
+  # Try to stream logs, but primary log capture is now in the trap
+  kubectl logs -f "${POD_NAME}" --since-time="${LOGS_SINCE_TIME}" --timestamps=true || true
   LOGS_SINCE_TIME=$(date --iso-8601=seconds)
+  sleep 2 # Short sleep to avoid busy waiting if logs -f exits
 done
 
-EXIT_CODE=$(kubectl get pod "${POD_NAME}" \
-  -o go-template="{{range .status.containerStatuses}}{{.state.terminated.exitCode}}{{end}}")
+# Final check on the pod exit code
+EXIT_CODE=$(kubectl get pod "${POD_NAME}" -o go-template="{{range .status.containerStatuses}}{{.state.terminated.exitCode}}{{end}}" || echo "1")
 
 if [[ ${EXIT_CODE} != 0 ]]; then
-  echo "Presubmit failed!"
+  echo "Presubmit final state for ${POD_NAME} indicates failure (Exit Code: ${EXIT_CODE})."
+  # The trap will handle the log upload and cleanup
   exit 1
 fi
+
+echo "Presubmit for ${POD_NAME} successful."
+# Explicitly exit 0 to clear the trap's exit code
+exit 0
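
In this diff, tolerance of transient `kubectl logs` failures comes from appending `|| true` inside the existing `while` loop. For comparison, a bounded retry with a delay could be factored into a small helper. The `retry` function below is hypothetical and not part of `run-presubmit-on-k8s.sh`; it is a sketch of the general pattern only.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not taken from run-presubmit-on-k8s.sh.
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry() {
  local max_attempts="$1" delay_seconds="$2"
  shift 2
  local attempt
  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    if "$@"; then
      return 0
    fi
    echo "Attempt ${attempt}/${max_attempts} failed: $*" >&2
    # Sleep only when another attempt will follow.
    if (( attempt < max_attempts )); then
      sleep "${delay_seconds}"
    fi
  done
  return 1
}

# Example call (commented out because it needs a live cluster):
# retry 5 2 kubectl logs "${POD_NAME}" --timestamps=true
```

Unlike `|| true`, a bounded helper like this still reports failure once the attempts are exhausted, so the caller can decide whether missing logs should fail the run.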

gpu/README.md

Lines changed: 12 additions & 0 deletions
@@ -225,6 +225,18 @@ sometimes found in the "building from source" sections.
   modulus md5sum of the files referenced by both the private and
   public secret names.
 
+- `http-proxy: <HOST>:<PORT>` - Optional. The address of an HTTP
+  proxy to use for internet egress. The script will configure `apt`,
+  `curl`, `gsutil`, `pip`, `java`, and `gpg` to use this proxy.
+
+- `http-proxy-pem-uri: <GS_PATH>` - Optional. A `gs://` path to the
+  PEM-encoded certificate file used by the proxy specified in
+  `http-proxy`. This is needed if the proxy uses TLS and its
+  certificate is not already trusted by the cluster's default trust
+  store (e.g., if it's a self-signed certificate or signed by an
+  internal CA). The script will install this certificate into the
+  system and Java trust stores.
+
 #### Loading built kernel module
 
 For platforms which do not have pre-built binary kernel drivers, the
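
The two metadata keys added in this hunk would be supplied when the cluster is created. The dry-run sketch below only prints a `gcloud dataproc clusters create` command rather than running it. Every value in it (cluster name, region, proxy address, bucket paths, and the location of the initialization action) is a placeholder assumption, and the accelerator flags a real GPU cluster needs are omitted.

```shell
#!/usr/bin/env bash
# Placeholder values: all of these are assumptions for illustration only.
CLUSTER_NAME="gpu-proxy-demo"
REGION="us-central1"
PROXY_ADDR="10.0.0.5:3128"
PROXY_PEM_URI="gs://example-bucket/certs/proxy-ca.pem"
INIT_ACTION="gs://example-bucket/gpu/install_gpu_driver.sh"

# Print (do not execute) the cluster creation command.
build_create_cmd() {
  local metadata="http-proxy=${PROXY_ADDR}"
  # http-proxy-pem-uri is optional; add it only when a URI is set.
  if [[ -n "${PROXY_PEM_URI}" ]]; then
    metadata+=",http-proxy-pem-uri=${PROXY_PEM_URI}"
  fi
  printf '%s ' gcloud dataproc clusters create "${CLUSTER_NAME}" \
    --region "${REGION}" \
    --initialization-actions "${INIT_ACTION}" \
    --metadata "${metadata}"
  printf '\n'
}

build_create_cmd
```

Leaving `PROXY_PEM_URI` empty yields a command with only `http-proxy` set, which matches the case where the proxy's certificate is already trusted or the proxy does not use TLS.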
