pravarpathania commented Oct 4, 2025

Problem

sky show-gpus --infra k8s fails with:
AssertionError: Keys of counts ([]), capacity ([]), and available (['L4']) must be the same.

This occurs during Kubernetes cluster scaling (e.g., when a GKE GPU node pool is resizing).
Accelerators are discovered via node labels before the node's status.allocatable field reports any GPU capacity or availability, so the counts, capacity, and available dictionaries end up with different keys.
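
To make the failure mode concrete, here is a minimal sketch of how a label-only node can produce mismatched keys. It is not the SkyPilot implementation: the node dictionaries, the "accelerator" label key, and the summarize_accelerators function are hypothetical simplifications. Only the nvidia.com/gpu resource name and the L4 accelerator come from the real setup.

```python
from collections import defaultdict


def summarize_accelerators(nodes):
    """Hypothetical summary that mimics the bug: `available` gets a key from
    the node label alone, while `counts` and `capacity` only get a key once
    the node reports allocatable GPUs."""
    counts = defaultdict(set)
    capacity = defaultdict(int)
    available = defaultdict(int)
    for node in nodes:
        # "accelerator" is an assumed, simplified label key.
        acc = node["labels"].get("accelerator")
        if acc is None:
            continue
        gpus = node["allocatable"].get("nvidia.com/gpu", 0)
        available[acc] += gpus  # key is created even when gpus == 0
        if gpus > 0:
            counts[acc].add(gpus)
            capacity[acc] += gpus
    return dict(counts), dict(capacity), dict(available)


# A node that is still scaling up: labeled, but nothing allocatable yet.
scaling_node = {"labels": {"accelerator": "L4"}, "allocatable": {}}
counts, capacity, available = summarize_accelerators([scaling_node])
print(sorted(counts), sorted(capacity), sorted(available))
```

With the single scaling node above, counts and capacity stay empty while available contains L4, which is the same key pattern reported in the assertion message.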


Change

  • sky/catalog/kubernetes_catalog.py: Added key alignment logic to ensure counts, capacity, and available dictionaries always share the same keys.
  • sky/core.py: Improved assertion message to include contextual information for easier debugging.
  • tests/unit_tests/kubernetes/test_kubernetes_utils.py: Added comprehensive test cases covering:
    • Nodes scaling with 0-capacity accelerators
    • Mixed ready + scaling nodes
    • Edge scenarios with transient GPU states
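
The key alignment and the more descriptive assertion can be sketched as follows. This is an illustration under assumptions, not the code in this PR: align_accelerator_keys, check_same_keys, and the context parameter are hypothetical names, and the default values (an empty list for counts, 0 for capacity and available) are one reasonable choice for an accelerator that has been discovered but not yet reported.

```python
def align_accelerator_keys(counts, capacity, available):
    """Return copies of the three dicts that share the union of all keys.

    Missing entries are filled with neutral defaults (assumed here):
    no known per-node counts, zero capacity, zero availability.
    """
    all_keys = sorted(set(counts) | set(capacity) | set(available))
    aligned_counts = {k: list(counts.get(k, [])) for k in all_keys}
    aligned_capacity = {k: capacity.get(k, 0) for k in all_keys}
    aligned_available = {k: available.get(k, 0) for k in all_keys}
    return aligned_counts, aligned_capacity, aligned_available


def check_same_keys(counts, capacity, available, context=""):
    """Assert that the key sets match, with extra context for debugging."""
    c, cap, av = sorted(counts), sorted(capacity), sorted(available)
    assert c == cap == av, (
        f"Keys of counts ({c}), capacity ({cap}), and available ({av}) "
        f"must be the same. Context: {context}")


# The mismatched state from the error message in the Problem section.
counts, capacity, available = align_accelerator_keys({}, {}, {"L4": 0})
check_same_keys(counts, capacity, available, context="k8s, node pool scaling")
print(counts, capacity, available)
```

Filling in defaults rather than dropping the unmatched key keeps the accelerator visible in the output with zero capacity, which matches the "0-capacity accelerators" case listed in the tests.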

Why

  • Prevents sky show-gpus --infra k8s from crashing during cluster scaling operations.
  • Handles transient Kubernetes states gracefully.
  • Improves robustness and error transparency for GPU discovery.
  • Platform-agnostic: although first reported on GKE, the fix applies to any Kubernetes environment (EKS, AKS, on-prem, etc.).

Tested

  • Unit tests: pytest tests/unit_tests/kubernetes/test_kubernetes_utils.py
  • All catalog-related tests: pytest tests/unit_tests -k catalog
  • Pre-commit formatting and linting (pre-commit run --files <changed-files> or ./format.sh <changed-files>)
  • Edge cases simulated (scaling-only, mixed-node clusters, no accelerators)

Files Modified

  • sky/catalog/kubernetes_catalog.py
  • sky/core.py
  • tests/unit_tests/kubernetes/test_kubernetes_utils.py

Fixes:


Notes

  • This branch is rebased on the latest upstream/master to ensure a clean, up-to-date diff.

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh

  • Any manual or new tests for this PR (please specify below)

  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)

  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)

  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

Development

Successfully merging this pull request may close these issues.

[UX] sky show-gpus --infra k8s fails during the kubernetes cluster GPU node pool is scaling up
