{AKS} az aks create/update: add --enable/--disable-control-plane-metrics #9931
bragi92 wants to merge 11 commits
Surface the first-class API property azureMonitorProfile.metrics.controlPlane.enabled (API version 2026-02-02-preview, already in the vendored SDK) so users can opt clusters in/out of Azure Monitor managed Prometheus control plane metrics (kube-apiserver, etcd, etc.) without the AFEC-gated preview.

- Add --enable-control-plane-metrics on `az aks create` and `az aks update`, plus --disable-control-plane-metrics on `az aks update`.
- Validate that --enable-control-plane-metrics requires Azure Monitor metrics to be enabled (either already on the cluster or via --enable-azure-monitor-metrics in the same command), and that enable and disable cannot be combined.
- Wire the flags into the create (set_up_azure_monitor_profile) and update (update_azure_monitor_profile) decorator paths.
- Decorator unit tests + live-only command tests (positive & negative).
- Add HISTORY.rst Pending entry.

Mirrors the upstream proposal at Azure#9855.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
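The validation rules in this commit message, together with the `--disable-azure-monitor-metrics` conflict listed in the PR description, can be sketched roughly as follows. This is a minimal illustration, not the extension's code: `validate_control_plane_metrics` and its parameters are hypothetical names, and the exception classes are local stand-ins for the ones in `azure.cli.core.azclierror`.

```python
# Local stand-ins for the azure-cli error types named in this PR. The real
# classes live in azure.cli.core.azclierror and are not imported here.
class UserFault(Exception):
    pass


class RequiredArgumentMissingError(UserFault):
    pass


class MutuallyExclusiveArgumentError(UserFault):
    pass


def validate_control_plane_metrics(
    enable_cp: bool = False,
    disable_cp: bool = False,
    enable_metrics: bool = False,
    disable_metrics: bool = False,
    metrics_on_cluster: bool = False,
) -> None:
    """Hypothetical helper: reject the flag combinations the PR calls invalid."""
    # Enable and disable of control plane metrics cannot be combined.
    if enable_cp and disable_cp:
        raise MutuallyExclusiveArgumentError(
            "--enable-control-plane-metrics and "
            "--disable-control-plane-metrics cannot be used together."
        )
    # Enabling control plane metrics while turning Azure Monitor metrics off
    # is contradictory.
    if enable_cp and disable_metrics:
        raise MutuallyExclusiveArgumentError(
            "--enable-control-plane-metrics and "
            "--disable-azure-monitor-metrics cannot be used together."
        )
    # Enable needs Azure Monitor metrics on the cluster already, or enabled
    # in the same command.
    if enable_cp and not (enable_metrics or metrics_on_cluster):
        raise RequiredArgumentMissingError(
            "--enable-control-plane-metrics requires Azure Monitor metrics; "
            "add --enable-azure-monitor-metrics."
        )
```

Checking the two conflict rules before the prerequisite rule means a contradictory command reports the conflict rather than a missing argument; the ordering in the actual validator may differ.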
…n_put on create

On greenfield `az aks create --enable-azure-monitor-metrics --enable-control-plane-metrics`, setting `azureMonitorProfile.metrics.controlPlane.enabled=true` on the initial cluster PUT causes the AKS RP to schedule the CCP collector pod before the Data Collection Rule Association (DCRA) has been created. The pod then CrashLoopBackOffs until the postprocessing step finishes creating the AMW/DCE/DCR/DCRA and the RP reconciles.

Fix: on the create flow, leave `metrics.controlPlane` unset on the initial PUT. After postprocessing creates the DCRA, the existing fire-and-forget addon_put PUT now also flips `metrics.controlPlane.enabled=true`, so the CCP pod is only scheduled once its DCRA exists.

Changes:
* `_setup_azure_monitor_metrics` no longer mutates `metrics.control_plane` on the create path; it still calls `get_enable_control_plane_metrics()` so the mutually-exclusive flag validation fires early.
* New `_addon_put_with_control_plane` helper in `azuremonitormetrics/azuremonitorprofile.py` that mirrors core `addon_put` and additionally sets `metrics.controlPlane.enabled=true`.
* `link_azure_monitor_profile_artifacts` dispatches to the new helper when `create_flow=True` and `enable_control_plane_metrics=True`.
* Update path is unchanged (single-PUT update on an existing cluster that already has its DCRA does not race).
* Added unit tests covering create-path deferral and the `--enable-control-plane-metrics` without `--enable-azure-monitor-metrics` validation error.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
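The deferral described above can be pictured with plain dictionaries standing in for the request bodies. This is only a sketch under stated assumptions: `build_initial_put` and `build_post_dcra_put` are hypothetical names, the dict layout loosely follows the `azureMonitorProfile.metrics.controlPlane.enabled` property path, and no Azure SDK calls are made.

```python
from copy import deepcopy


def build_initial_put(enable_metrics: bool, enable_cp: bool, create_flow: bool) -> dict:
    """Hypothetical: body of the first cluster PUT."""
    metrics = {"enabled": enable_metrics}
    # On update the DCRA already exists, so the flag can ride the single PUT.
    # On create it is left unset so the CCP pod is not scheduled too early.
    if enable_cp and not create_flow:
        metrics["controlPlane"] = {"enabled": True}
    return {"azureMonitorProfile": {"metrics": metrics}}


def build_post_dcra_put(cluster: dict, enable_cp: bool) -> dict:
    """Hypothetical: body of the postprocessing PUT sent after the DCRA exists."""
    body = deepcopy(cluster)  # do not mutate the caller's view of the cluster
    if enable_cp:
        body["azureMonitorProfile"]["metrics"]["controlPlane"] = {"enabled": True}
    return body


# Greenfield create: two PUTs, the control plane flag only appears on the second.
first = build_initial_put(enable_metrics=True, enable_cp=True, create_flow=True)
second = build_post_dcra_put(first, enable_cp=True)
```

Keeping the flag out of the first body, rather than sending `enabled=false`, matches the commit's wording that `metrics.controlPlane` is left unset on the initial PUT.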
| rule | cmd_name | rule_message | suggest_message |
|---|---|---|---|
| | aks create | cmd aks create added parameter enable_control_plane_metrics | |
| | aks update | cmd aks update added parameter disable_control_plane_metrics | |
| | aks update | cmd aks update added parameter enable_control_plane_metrics | |
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds first-class support for enabling/disabling Azure Monitor managed Prometheus control plane metrics for AKS via azureMonitorProfile.metrics.controlPlane.enabled, including validation and post-create behavior to avoid CCP/DCRA race conditions.
Changes:
- Introduces `--enable-control-plane-metrics` (create/update) and `--disable-control-plane-metrics` (update) flags with validation rules and help text.
- Defers setting `metrics.controlPlane.enabled=true` during create to a postprocessing PUT after DCRA creation.
- Adds unit and live-only scenario tests covering positive and negative flows for create/update.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/aks-preview/azext_aks_preview/managed_cluster_decorator.py | Adds flag getters/validation and update-time payload changes for control plane metrics; defers create-time enablement. |
| src/aks-preview/azext_aks_preview/azuremonitormetrics/azuremonitorprofile.py | Adds a postprocessing PUT variant that flips controlPlane.enabled after DCRA creation. |
| src/aks-preview/azext_aks_preview/_params.py | Wires new CLI flags for create/update with help text and aliases. |
| src/aks-preview/azext_aks_preview/custom.py | Plumbs new parameters into command entrypoints. |
| src/aks-preview/azext_aks_preview/_help.py | Documents new flags in command help. |
| src/aks-preview/azext_aks_preview/tests/latest/test_managed_cluster_decorator.py | Adds unit tests for deferral/validation and update toggling behavior. |
| src/aks-preview/azext_aks_preview/tests/latest/test_aks_commands.py | Adds live-only tests for create/update/negative control plane metrics scenarios. |
| src/aks-preview/HISTORY.rst | Notes new flags and the shift from AFEC-gated preview to first-class API. |
- Wait on the LRO in `_addon_put_with_control_plane` via `poller.result()`. This is the only place `controlPlane.enabled` is set during the greenfield create flow, so the CP flip must be durably persisted before the create command returns. Without the wait, callers and tests that read the cluster immediately could observe the pre-flip state. (The sibling `addon_put` intentionally remains fire-and-forget because `metrics.enabled` was already persisted on the initial cluster PUT.)
- Replace `raise UnknownError(e)` with `raise UnknownError(str(e)) from e` so the message is readable and the original traceback is preserved.
- Coerce `_get_enable_control_plane_metrics` / `_get_disable_control_plane_metrics` return values to `bool()` to match the declared `-> bool` return type when the parameter dict omits the key.
- Make the live `test_aks_create_with_control_plane_metrics` assertion robust: the `controlPlane.enabled` check is moved out of the immediate create response into an explicit `aks show` after `aks wait`, since the flip is intentionally deferred to post-DCRA postprocessing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
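Three of the changes above (waiting on the poller, chaining the exception, and coercing the getter result) can be illustrated in isolation. The sketch below is an assumption-laden stand-in: `FakePoller`, `put_and_wait`, and `get_enable_control_plane_metrics` are hypothetical, and `UnknownError` is a local class rather than the one from `azure.cli.core.azclierror`.

```python
class UnknownError(Exception):
    """Local stand-in for the azure-cli UnknownError type."""


class FakePoller:
    """Hypothetical long-running-operation poller used only for this sketch."""

    def __init__(self, error=None):
        self._error = error

    def result(self):
        # Blocks until the operation finishes (immediately, in this fake).
        if self._error is not None:
            raise self._error
        return {"provisioningState": "Succeeded"}


def put_and_wait(poller):
    """Wait for the PUT to complete and surface failures with a readable message."""
    try:
        return poller.result()
    except Exception as e:
        # str(e) keeps the message readable; "from e" keeps the original
        # exception attached as __cause__, so its traceback is preserved.
        raise UnknownError(str(e)) from e


def get_enable_control_plane_metrics(raw_parameters: dict) -> bool:
    # dict.get returns None when the key is missing; bool() turns that into
    # False so the declared "-> bool" return type holds.
    return bool(raw_parameters.get("enable_control_plane_metrics"))
```

With this shape, a caller that reads the cluster right after `put_and_wait` returns sees the completed state, which is the property the live test relies on.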
The wheel under test is the aks-preview extension PR (Azure#9931). The GA in-box CLI PR (#33537) is a parallel change in az aks, not a mirror. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

/azp run

Azure Pipelines successfully started running 2 pipeline(s).

see comment Azure/azure-cli#33537 (review)

Per FumingZhang review feedback on Azure/azure-cli#33537: calling get_enable_control_plane_metrics() purely to trigger validation and discarding the return value is a confusing pattern. Extract the validation block into a new private _validate_control_plane_metrics_params method, expose a public validate_control_plane_metrics_params, and have the getters delegate to it when enable_validation=True (preserves existing API). The two _setup_azure_monitor_profile call sites now call the validator directly instead of discarding a getter result. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Mirrors the GA azure-cli cleanup. Other validators in the file are a single public def validate_xxx(self) -> None — no private companion. Collapse the extra _validate_control_plane_metrics_params indirection so the new validator matches the file's convention. Tests + behavior unchanged. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Mirrors the GA azure-cli cleanup. The aka.ms/aks/controlplane-metrics shortlink does not resolve. Drop the trailing reference from the help strings (create + update enable, update disable, plus _help.py YAML for both). Vendored SDK docstrings are auto-generated upstream and untouched. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Mirrors the GA azure-cli cleanup. Replace 'kube-apiserver, etcd, etc' with the actual default Prometheus scrape job names: controlplane-apiserver and controlplane-etcd. These are the targets users see in AMW and what the AKS docs reference. The 'etc' was also misleading since scheduler / controller-manager / NAP targets are opt-in via MinimalIngestionProfile and are not flipped on by this flag. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

/azp run

Commenter does not have sufficient privileges for PR 9931 in repo Azure/azure-cli-extensions

/azp run

Azure Pipelines successfully started running 2 pipeline(s).

Please fix failed test cases @bragi92

…sertion The validator raises RequiredArgumentMissingError, not ArgumentUsageError. These are sibling classes under UserFault (not parent/child), so assertRaises(ArgumentUsageError) did not catch the error and the test failed in CI.
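The failure mode in this commit message is a general property of `unittest`: `assertRaises` only matches the named class and its subclasses, so a sibling exception passes straight through. A small sketch, using local stand-in classes with the same inheritance shape the message describes (the real ones live in `azure.cli.core.azclierror`), and a hypothetical `validator` function:

```python
import unittest


# Stand-ins with the hierarchy described above: two siblings under UserFault.
class UserFault(Exception):
    pass


class ArgumentUsageError(UserFault):
    pass


class RequiredArgumentMissingError(UserFault):
    pass


def validator():
    """Hypothetical validator that raises the 'missing argument' error."""
    raise RequiredArgumentMissingError("--enable-azure-monitor-metrics is required")


def outcome_when_asserting(expected_type) -> str:
    """Report what happens when assertRaises expects the given type."""
    case = unittest.TestCase()
    try:
        with case.assertRaises(expected_type):
            validator()
    except RequiredArgumentMissingError:
        # assertRaises did not match, so the original error escaped the
        # context manager, which is what fails the test in CI.
        return "propagated"
    return "caught"
```

Asserting on the exact class, or on the shared parent `UserFault`, both match; asserting on the sibling does not.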

Addressed the failure. @FumingZhang, can you help with re-running the pipeline with /azp run? Also, it looks like -> this might be a flaky test and unrelated to my change. I've fixed the other test.

/azp run

Azure Pipelines successfully started running 2 pipeline(s).

Thank you Fuming, I'll wait for your PR to merge and then take your change into my fork.

The test_aks_check_network integration test was failing across all Python jobs with a CannotOverwriteExistingCassetteException because the recorded VMSS request no longer matched. Pull the corrected recording from upstream/main (Azure#9940) so the build passes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Bump aks-preview VERSION to 21.0.0b6 and move the --enable/--disable-control-plane-metrics changelog entry out of the Pending section into its own 21.0.0b6 release so it ships with this PR. Reconcile HISTORY.rst with main, which already released 21.0.0b5 (prepared-image-specification + the --node-image-only fix), so the changelog is consistent and the version is a clean increment over main. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Thanks @FumingZhang. I've merged #9940 into my fork. Can you help with re-running the pipeline and the merge?

Resolve release conflicts in aks-preview HISTORY.rst and setup.py: keep version 21.0.0b6 for the control-plane-metrics change, on top of upstream's released 21.0.0b5 (prepared-image-specification + --node-image-only fix). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

/azp run

Azure Pipelines successfully started running 2 pipeline(s).

Surface `azureMonitorProfile.metrics.controlPlane.enabled` so users can opt clusters in/out of Azure Monitor managed Prometheus control-plane metrics (controlplane-apiserver, controlplane-etcd) via the first-class API property. This replaces the AFEC-gated preview. This is the `aks-preview` mirror of the in-box CLI change in Azure/azure-cli#33537.

New flags:
- `az aks create`: `--enable-control-plane-metrics` (`--enable-cp-metrics`)
- `az aks update`: `--enable-control-plane-metrics` (`--enable-cp-metrics`)
- `az aks update`: `--disable-control-plane-metrics` (`--disable-cp-metrics`)

Enable requires Azure Monitor metrics to already be on or to be enabled in the same command via `--enable-azure-monitor-metrics`. Enable + disable in the same command, or enable-CP + `--disable-azure-monitor-metrics`, are rejected client-side with `MutuallyExclusiveArgumentError`.

Greenfield race fix:
On `aks create`, `metrics.controlPlane.enabled=true` is intentionally NOT set on the initial cluster PUT. Otherwise the RP would schedule the control-plane-metrics collection (CCP) pod before the DCRA is created in postprocessing (`link_azure_monitor_profile_artifacts`), causing the CCP pod to crash-loop with "DCRA not found" until reconciliation. The flip is deferred to the existing post-DCRA `addon_put` PUT, so the CCP pod is scheduled only after its DCRA exists. The update path is unchanged: brownfield updates target a cluster whose DCRA already exists, so there is no race. This is the divergence from #9855.

What's added
- `--enable-control-plane-metrics` / `--enable-cp-metrics` on `az aks create` and `az aks update`.
- `--disable-control-plane-metrics` / `--disable-cp-metrics` on `az aks update`.
- `azext_aks_preview/_params.py` and help text in `azext_aks_preview/_help.py`.
- `azext_aks_preview/_validators.py`:
  - `azureMonitorProfile.metrics.enabled=true` on the cluster (or `--enable-azure-monitor-metrics` in the same command).
  - `--enable-control-plane-metrics` + `--disable-control-plane-metrics` → `MutuallyExclusiveArgumentError`.
  - `--enable-control-plane-metrics` + `--disable-azure-monitor-metrics` → `MutuallyExclusiveArgumentError`.
- `AKSPreviewManagedClusterContext` getters in `azext_aks_preview/managed_cluster_decorator.py` and routing in `azext_aks_preview/custom.py` so create/update both populate the correct subfield of `ManagedClusterAzureMonitorProfileMetrics`.
- `azext_aks_preview/azuremonitormetrics/azuremonitorprofile.py`: the CP flip is deferred to the post-DCRA `addon_put` call instead of riding the initial cluster PUT. Brownfield `aks update` is unchanged.
- `azext_aks_preview/__init__.py` + `setup.py`, and history entry in `HISTORY.rst`.

Scenario coverage
- `aks create --enable-azure-monitor-metrics --enable-control-plane-metrics`
- `aks create --enable-control-plane-metrics` (no AMW flag)
- `aks update --enable-control-plane-metrics`
- `aks update --enable-control-plane-metrics`
- `aks update --enable-azure-monitor-metrics --enable-control-plane-metrics …`
- `aks update --disable-control-plane-metrics`
- `aks update --enable-control-plane-metrics --disable-control-plane-metrics`
- `aks update --enable-control-plane-metrics --disable-azure-monitor-metrics`

Files changed
- `src/aks-preview/azext_aks_preview/_help.py`
- `src/aks-preview/azext_aks_preview/_params.py`
- `src/aks-preview/azext_aks_preview/_validators.py`
- `src/aks-preview/azext_aks_preview/custom.py`
- `src/aks-preview/azext_aks_preview/managed_cluster_decorator.py`
- `src/aks-preview/azext_aks_preview/azuremonitormetrics/azuremonitorprofile.py`
- `src/aks-preview/azext_aks_preview/__init__.py`, `src/aks-preview/setup.py`, `src/aks-preview/HISTORY.rst` (version bump + history)
- `src/aks-preview/azext_aks_preview/tests/latest/...` (unit-test updates + recorded cassettes)

Relationship to existing PRs
- az aks create/update: add --enable/--disable-control-plane-metrics azure-cli#33537 (in-box CLI mirror of the same flags).

Testing Guide
Unit tests:
Live validation against a real AKS cluster + Azure Monitor workspace:
Validation in Azure Monitor workspace after each enable: default CCP metric families flow within ~5–10 min (`apiserver_request_total`, `apiserver_request_duration_seconds_*`, `etcd_server_has_leader`, `etcd_mvcc_db_total_size_in_bytes`, `process_start_time_seconds`). After disable, allow ~15 min for the previous deployment's metrics to age out before re-asserting.

This checklist is used to make sure that common guidelines for a pull request are followed.

Related command
- `az aks create`
- `az aks update`

General Guidelines
- Have you run `azdev style <YOUR_EXT>` locally? (`pip install azdev` to install)
- Have you run `python scripts/ci/test_index.py -q` locally? (`pip install wheel==0.30.0` if you do not have wheel installed)

About Extension Publish
There is a pipeline to automatically build, upload and publish extension wheels.
Once your PR is merged into main branch, a new PR will be created to update `src/index.json` automatically.
You only need to manually edit the version in `src/{EXT_NAME}/setup.py` and `src/{EXT_NAME}/HISTORY.rst`.