FIX (kubernetes): align accelerator dict keys during scaling (#7100) #7482
+159
−4
Problem
sky show-gpus --infra k8s fails with:
AssertionError: Keys of counts ([]), capacity ([]), and available (['L4']) must be the same.
This occurs during Kubernetes cluster scaling (e.g., when a GKE GPU node pool is resizing).
Accelerators are discovered via node labels before the nodes' status.allocatable field reports capacity or availability, leading to mismatched accelerator dictionary keys.

Change
The counts, capacity, and available dictionaries always share the same keys.

Why
Keeps sky show-gpus --infra k8s from crashing during cluster scaling operations.

Tested
pytest tests/unit_tests/kubernetes/test_kubernetes_utils.py
pytest tests/unit_tests -k catalog
pre-commit run --files <changed-files> (or ./format.sh <changed-files>)

Files Modified
sky/catalog/kubernetes_catalog.py
sky/core.py
tests/unit_tests/kubernetes/test_kubernetes_utils.py

Fixes:
sky show-gpus --infra k8s fails during the kubernetes cluster GPU node pool is scaling up #7100

Notes
This branch rebases on the latest upstream/master to ensure a clean, up-to-date diff.
Code formatting: install pre-commit (auto-check on commit) or bash format.sh
Any manual or new tests for this PR (please specify below)
All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)
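
For reviewers, the key alignment described under Change can be illustrated with a minimal sketch. This is not the code in the diff: the helper name align_accelerator_keys is hypothetical, the value types (a list of per-node counts, integer capacity and availability) are assumed, and filling missing entries with an empty list or 0 is one possible policy rather than a statement of what the PR does.

```python
from typing import Dict, List, Tuple


def align_accelerator_keys(
    counts: Dict[str, List[int]],
    capacity: Dict[str, int],
    available: Dict[str, int],
) -> Tuple[Dict[str, List[int]], Dict[str, int], Dict[str, int]]:
    """Return copies of the three dicts that all share one key set.

    Hypothetical helper for illustration only. Missing entries are
    filled with neutral defaults (assumed policy): [] for counts,
    0 for capacity and available.
    """
    # Union of every accelerator name seen in any of the three dicts.
    all_keys = sorted(set(counts) | set(capacity) | set(available))
    aligned_counts = {k: list(counts.get(k, [])) for k in all_keys}
    aligned_capacity = {k: capacity.get(k, 0) for k in all_keys}
    aligned_available = {k: available.get(k, 0) for k in all_keys}
    return aligned_counts, aligned_capacity, aligned_available


# The failing state from the Problem section: only `available` knows 'L4'.
counts, capacity, available = align_accelerator_keys({}, {}, {'L4': 0})
assert set(counts) == set(capacity) == set(available) == {'L4'}
```

With the inputs from the reported assertion message, all three dictionaries end up keyed by L4, so a same-keys check like the one that raised the AssertionError would pass.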