{AKS} az aks create/update: add --enable/--disable-control-plane-metrics #9931

Open
bragi92 wants to merge 11 commits into Azure:main from bragi92:kadubey/aks-control-plane-metrics-upstream

bragi92 commented Jun 11, 2026

Member

Surface azureMonitorProfile.metrics.controlPlane.enabled so users can opt clusters in/out of Azure Monitor managed Prometheus control-plane metrics (controlplane-apiserver, controlplane-etcd) via the first-class API property — replaces the AFEC-gated preview. This is the aks-preview mirror of the in-box CLI change in Azure/azure-cli#33537.

New flags:

  • az aks create: --enable-control-plane-metrics (--enable-cp-metrics)
  • az aks update: --enable-control-plane-metrics (--enable-cp-metrics)
  • az aks update: --disable-control-plane-metrics (--disable-cp-metrics)

Enable requires Azure Monitor metrics to already be on or to be enabled in the same command via --enable-azure-monitor-metrics. Enable + disable in the same command, or enable-CP + --disable-azure-monitor-metrics, are rejected client-side with MutuallyExclusiveArgumentError.

Greenfield race fix:
On aks create, metrics.controlPlane.enabled=true is intentionally NOT set on the initial cluster PUT. Otherwise the RP would schedule the control-plane-metrics collection (CCP) pod before the DCRA is created in postprocessing (link_azure_monitor_profile_artifacts), causing the CCP pod to crash-loop with "DCRA not found" until reconciliation. The flip is deferred to the existing post-DCRA addon_put PUT, so the CCP pod is scheduled only after its DCRA exists. The update path is unchanged — brownfield updates target a cluster whose DCRA already exists, so there is no race. This is the divergence from #9855.
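
The ordering argument above can be illustrated with a small simulation. This is only a sketch: `FakeResourceProvider`, `create_eager`, and `create_deferred` are hypothetical names invented for illustration, and the fake models only the one behavior the description relies on (the collector pod is scheduled as soon as `controlPlane.enabled` is true).

```python
from dataclasses import dataclass, field


@dataclass
class FakeResourceProvider:
    """Hypothetical stand-in for the AKS RP; it only records what it is asked to do."""

    dcra_exists: bool = False
    events: list = field(default_factory=list)

    def put_cluster(self, control_plane_enabled: bool) -> None:
        # Assumption: the CCP collector is scheduled as soon as the flag is true.
        if control_plane_enabled:
            state = "running" if self.dcra_exists else "crash-loop (DCRA not found)"
            self.events.append(f"ccp pod: {state}")
        else:
            self.events.append("cluster PUT without controlPlane")

    def create_dcra(self) -> None:
        self.dcra_exists = True
        self.events.append("DCRA created")


def create_eager(rp: FakeResourceProvider) -> list:
    """Racy order: the flag rides the initial cluster PUT."""
    rp.put_cluster(control_plane_enabled=True)
    rp.create_dcra()
    return rp.events


def create_deferred(rp: FakeResourceProvider) -> list:
    """Order described in this PR: initial PUT, then DCRA, then the flip."""
    rp.put_cluster(control_plane_enabled=False)
    rp.create_dcra()
    rp.put_cluster(control_plane_enabled=True)
    return rp.events
```

In the eager order the first recorded event is the crash-looping pod; in the deferred order the pod only appears after the "DCRA created" event.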

What's added

  • --enable-control-plane-metrics / --enable-cp-metrics on az aks create and az aks update.
  • --disable-control-plane-metrics / --disable-cp-metrics on az aks update.
  • Argument registration in azext_aks_preview/_params.py and help text in azext_aks_preview/_help.py.
  • Client-side validation in azext_aks_preview/_validators.py:
    • Enable requires azureMonitorProfile.metrics.enabled=true on the cluster (or --enable-azure-monitor-metrics in the same command).
    • --enable-control-plane-metrics + --disable-control-plane-metrics → MutuallyExclusiveArgumentError.
    • --enable-control-plane-metrics + --disable-azure-monitor-metrics → MutuallyExclusiveArgumentError.
  • AKSPreviewManagedClusterContext getters in azext_aks_preview/managed_cluster_decorator.py and routing in azext_aks_preview/custom.py so create/update both populate the correct subfield of ManagedClusterAzureMonitorProfileMetrics.
  • Greenfield race fix (see above) applied in azext_aks_preview/azuremonitormetrics/azuremonitorprofile.py: the CP flip is deferred to the post-DCRA addon_put call instead of riding the initial cluster PUT. Brownfield aks update is unchanged.
  • Extension version bump in azext_aks_preview/__init__.py + setup.py, and history entry in HISTORY.rst.

Scenario coverage

| # | Command | Cluster state | Result |
|---|---------|---------------|--------|
| 1 | aks create --enable-azure-monitor-metrics --enable-control-plane-metrics | n/a | Cluster created, AMW linked, DCRA created, then CP-metrics flipped on |
| 2 | aks create --enable-control-plane-metrics (no AMW flag) | n/a | Rejected: AMW required |
| 3 | aks update --enable-control-plane-metrics | AMW already on | CP-metrics enabled |
| 4 | aks update --enable-control-plane-metrics | AMW off | Rejected: AMW required |
| 5 | aks update --enable-azure-monitor-metrics --enable-control-plane-metrics … | AMW off | AMW enabled + CP-metrics enabled in one call |
| 6 | aks update --disable-control-plane-metrics | AMW on, CP on | CP-metrics disabled, AMW left intact |
| 7 | aks update --enable-control-plane-metrics --disable-control-plane-metrics | any | Rejected: mutually exclusive |
| 8 | aks update --enable-control-plane-metrics --disable-azure-monitor-metrics | any | Rejected: mutually exclusive |
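
The accept/reject outcomes in the scenarios above could be expressed as a single decision function. The following is a rough sketch, not the code in `_validators.py`: the two exception classes are local stand-ins for the azure-cli errors of the same name, and `validate_control_plane_flags` and `outcome` are hypothetical helpers. It assumes the mutual-exclusion checks run before the "AMW required" check, which matches the "any" cluster state in scenarios 7 and 8.

```python
class MutuallyExclusiveArgumentError(Exception):
    """Stand-in for the azure-cli error of the same name."""


class RequiredArgumentMissingError(Exception):
    """Stand-in for the azure-cli error of the same name."""


def validate_control_plane_flags(
    enable_cp=False,
    disable_cp=False,
    enable_amw=False,
    disable_amw=False,
    amw_already_on=False,
):
    """Hypothetical helper: raise on a rejected combination, else return None."""
    if enable_cp and disable_cp:
        raise MutuallyExclusiveArgumentError(
            "--enable-control-plane-metrics and --disable-control-plane-metrics"
        )
    if enable_cp and disable_amw:
        raise MutuallyExclusiveArgumentError(
            "--enable-control-plane-metrics and --disable-azure-monitor-metrics"
        )
    # Enabling CP metrics needs AMW on the cluster or enabled in the same command.
    if enable_cp and not (amw_already_on or enable_amw):
        raise RequiredArgumentMissingError(
            "--enable-control-plane-metrics requires Azure Monitor metrics"
        )


def outcome(**flags):
    """Map a flag combination to 'ok' or the name of the raised error."""
    try:
        validate_control_plane_flags(**flags)
    except (MutuallyExclusiveArgumentError, RequiredArgumentMissingError) as exc:
        return type(exc).__name__
    return "ok"
```

For example, `outcome(enable_cp=True)` corresponds to scenarios 2 and 4, and `outcome(enable_cp=True, amw_already_on=True)` to scenario 3.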

Files changed

  • src/aks-preview/azext_aks_preview/_help.py
  • src/aks-preview/azext_aks_preview/_params.py
  • src/aks-preview/azext_aks_preview/_validators.py
  • src/aks-preview/azext_aks_preview/custom.py
  • src/aks-preview/azext_aks_preview/managed_cluster_decorator.py
  • src/aks-preview/azext_aks_preview/azuremonitormetrics/azuremonitorprofile.py
  • src/aks-preview/azext_aks_preview/__init__.py, src/aks-preview/setup.py, src/aks-preview/HISTORY.rst (version bump + history)
  • src/aks-preview/azext_aks_preview/tests/latest/... (unit-test updates + recorded cassettes)

Relationship to existing PRs

Testing Guide

Unit tests:

azdev test aks-preview --discover
azdev test aks-preview --series --pytest-args "-k control_plane_metrics"

Live validation against a real AKS cluster + Azure Monitor workspace:

RG=ccp-test-rg
LOC=eastus
AMW=/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Monitor/accounts/<amw>
GRAFANA=/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Dashboard/grafana/<g>

# Greenfield: enable CP-metrics at create time
az aks create -g $RG -n green-cp --location $LOC \
  --enable-azure-monitor-metrics --azure-monitor-workspace-resource-id $AMW \
  --grafana-resource-id $GRAFANA \
  --enable-control-plane-metrics
az aks show -g $RG -n green-cp --query "azureMonitorProfile.metrics" -o jsonc

# Brownfield: enable, then disable, then re-enable
az aks create -g $RG -n brown-cp --location $LOC \
  --enable-azure-monitor-metrics --azure-monitor-workspace-resource-id $AMW \
  --grafana-resource-id $GRAFANA
az aks update -g $RG -n brown-cp --enable-control-plane-metrics
az aks show -g $RG -n brown-cp --query "azureMonitorProfile.metrics" -o jsonc
az aks update -g $RG -n brown-cp --disable-control-plane-metrics
az aks show -g $RG -n brown-cp --query "azureMonitorProfile.metrics" -o jsonc
az aks update -g $RG -n brown-cp --enable-control-plane-metrics
az aks show -g $RG -n brown-cp --query "azureMonitorProfile.metrics" -o jsonc

# Negative cases
az aks update -g $RG -n brown-cp --enable-control-plane-metrics --disable-control-plane-metrics
az aks update -g $RG -n brown-cp --enable-control-plane-metrics --disable-azure-monitor-metrics
az aks create -g $RG -n bad-cp --enable-control-plane-metrics   # no AMW => rejected

Validation in Azure Monitor workspace after each enable: default CCP metric families flow within ~5–10 min (apiserver_request_total, apiserver_request_duration_seconds_*, etcd_server_has_leader, etcd_mvcc_db_total_size_in_bytes, process_start_time_seconds). After disable, allow ~15 min for the previous deployment's metrics to age out before re-asserting.
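
The "wait until the metric families show up" check could be automated with a small polling loop. This is a sketch under assumptions: `wait_for_metrics` is a hypothetical helper, and `fetch_names` is a caller-supplied callable that returns the metric names currently visible in the workspace (how it queries the workspace is deliberately left out). The wildcard family `apiserver_request_duration_seconds_*` is omitted because this sketch only does exact-name matching.

```python
import time

# Exact-name families taken from the validation note above.
EXPECTED_FAMILIES = {
    "apiserver_request_total",
    "etcd_server_has_leader",
    "etcd_mvcc_db_total_size_in_bytes",
    "process_start_time_seconds",
}


def wait_for_metrics(fetch_names, expected=EXPECTED_FAMILIES,
                     timeout_s=600, interval_s=30,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_names() until every expected family is present or time runs out.

    Returns the set of names that are still missing (empty on success).
    """
    deadline = clock() + timeout_s
    while True:
        missing = set(expected) - set(fetch_names())
        if not missing or clock() >= deadline:
            return missing
        sleep(interval_s)
```

The `sleep` and `clock` parameters are injected so that the loop can be exercised in a unit test without real waiting.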


This checklist is used to make sure that common guidelines for a pull request are followed.

Related command

  • az aks create
  • az aks update

General Guidelines

  • Have you run azdev style <YOUR_EXT> locally? (pip install azdev to install)
  • Have you run python scripts/ci/test_index.py -q locally? (pip install wheel==0.30.0 if you do not have wheel installed)

About Extension Publish

There is a pipeline to automatically build, upload and publish extension wheels.
Once your PR is merged into main branch, a new PR will be created to update src/index.json automatically.
You only need to manually edit the version in src/{EXT_NAME}/setup.py and src/{EXT_NAME}/HISTORY.rst.

bragi92 and others added 2 commits June 11, 2026 14:51
Surface the first-class API property
azureMonitorProfile.metrics.controlPlane.enabled (API version
2026-02-02-preview, already in the vendored SDK) so users can opt
clusters in/out of Azure Monitor managed Prometheus control plane
metrics (kube-apiserver, etcd, etc.) without the AFEC-gated preview.

- Add --enable-control-plane-metrics on `az aks create` and
  `az aks update`, plus --disable-control-plane-metrics on
  `az aks update`.
- Validate that --enable-control-plane-metrics requires Azure Monitor
  metrics to be enabled (either already on the cluster or via
  --enable-azure-monitor-metrics in the same command), and that enable
  and disable cannot be combined.
- Wire the flags into the create (set_up_azure_monitor_profile) and
  update (update_azure_monitor_profile) decorator paths.
- Decorator unit tests + live-only command tests (positive & negative).
- Add HISTORY.rst Pending entry.

Mirrors the upstream proposal at Azure#9855.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…n_put on create

On greenfield `az aks create --enable-azure-monitor-metrics
--enable-control-plane-metrics`, setting `azureMonitorProfile.metrics.controlPlane.enabled=true`
on the initial cluster PUT causes the AKS RP to schedule the CCP collector pod
before the Data Collection Rule Association (DCRA) has been created. The pod
then CrashLoopBackOffs until the postprocessing step finishes creating the
AMW/DCE/DCR/DCRA and the RP reconciles.

Fix: on the create flow, leave `metrics.controlPlane` unset on the initial PUT.
After postprocessing creates the DCRA, the existing fire-and-forget addon_put
PUT now also flips `metrics.controlPlane.enabled=true`, so the CCP pod is only
scheduled once its DCRA exists.

Changes:
* `_setup_azure_monitor_metrics` no longer mutates `metrics.control_plane`
  on the create path; it still calls `get_enable_control_plane_metrics()` so
  the mutually-exclusive flag validation fires early.
* New `_addon_put_with_control_plane` helper in
  `azuremonitormetrics/azuremonitorprofile.py` that mirrors core `addon_put`
  and additionally sets `metrics.controlPlane.enabled=true`.
* `link_azure_monitor_profile_artifacts` dispatches to the new helper when
  `create_flow=True` and `enable_control_plane_metrics=True`.
* Update path is unchanged (single-PUT update on an existing cluster that
  already has its DCRA does not race).
* Added unit tests covering create-path deferral and the
  `--enable-control-plane-metrics` without `--enable-azure-monitor-metrics`
  validation error.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@azure-client-tools-bot-prd

azure-client-tools-bot-prd Bot commented Jun 11, 2026

⚠️Azure CLI Extensions Breaking Change Test
⚠️aks-preview
| rule | cmd_name | rule_message | suggest_message |
|------|----------|--------------|-----------------|
| ⚠️ 1006 - ParaAdd | aks create | cmd aks create added parameter enable_control_plane_metrics | |
| ⚠️ 1006 - ParaAdd | aks update | cmd aks update added parameter disable_control_plane_metrics | |
| ⚠️ 1006 - ParaAdd | aks update | cmd aks update added parameter enable_control_plane_metrics | |

@azure-client-tools-bot-prd


Hi @bragi92,
Please write the description of changes which can be perceived by customers into HISTORY.rst.
If you want to release a new extension version, please update the version in setup.py as well.

@yonzhan

yonzhan commented Jun 11, 2026

Collaborator

AKS

@bragi92 bragi92 marked this pull request as ready for review June 11, 2026 23:41
Copilot AI review requested due to automatic review settings June 11, 2026 23:41

Copilot AI left a comment

Contributor

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds first-class support for enabling/disabling Azure Monitor managed Prometheus control plane metrics for AKS via azureMonitorProfile.metrics.controlPlane.enabled, including validation and post-create behavior to avoid CCP/DCRA race conditions.

Changes:

  • Introduces --enable-control-plane-metrics (create/update) and --disable-control-plane-metrics (update) flags with validation rules and help text.
  • Defers setting metrics.controlPlane.enabled=true during create to a postprocessing PUT after DCRA creation.
  • Adds unit and live-only scenario tests covering positive and negative flows for create/update.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Show a summary per file
| File | Description |
|------|-------------|
| src/aks-preview/azext_aks_preview/managed_cluster_decorator.py | Adds flag getters/validation and update-time payload changes for control plane metrics; defers create-time enablement. |
| src/aks-preview/azext_aks_preview/azuremonitormetrics/azuremonitorprofile.py | Adds a postprocessing PUT variant that flips controlPlane.enabled after DCRA creation. |
| src/aks-preview/azext_aks_preview/_params.py | Wires new CLI flags for create/update with help text and aliases. |
| src/aks-preview/azext_aks_preview/custom.py | Plumbs new parameters into command entrypoints. |
| src/aks-preview/azext_aks_preview/_help.py | Documents new flags in command help. |
| src/aks-preview/azext_aks_preview/tests/latest/test_managed_cluster_decorator.py | Adds unit tests for deferral/validation and update toggling behavior. |
| src/aks-preview/azext_aks_preview/tests/latest/test_aks_commands.py | Adds live-only tests for create/update/negative control plane metrics scenarios. |
| src/aks-preview/HISTORY.rst | Notes new flags and the shift from AFEC-gated preview to first-class API. |

Comment thread src/aks-preview/azext_aks_preview/azuremonitormetrics/azuremonitorprofile.py Outdated
Comment thread src/aks-preview/azext_aks_preview/azuremonitormetrics/azuremonitorprofile.py Outdated
Comment thread src/aks-preview/azext_aks_preview/tests/latest/test_aks_commands.py
- Wait on the LRO in _addon_put_with_control_plane via poller.result(). This is the
  only place controlPlane.enabled is set during the greenfield create flow, so the
  CP flip must be durably persisted before the create command returns. Without the
  wait, callers and tests that read the cluster immediately could observe the
  pre-flip state. (The sibling addon_put intentionally remains fire-and-forget
  because metrics.enabled was already persisted on the initial cluster PUT.)
- Replace raise UnknownError(e) with raise UnknownError(str(e)) from e so the
  message is readable and the original traceback is preserved.
- Coerce _get_enable_control_plane_metrics / _get_disable_control_plane_metrics
  return values to bool() to match the declared -> bool return type when the
  parameter dict omits the key.
- Make the live test_aks_create_with_control_plane_metrics assertion robust:
  the controlPlane.enabled check is moved out of the immediate create response
  into an explicit aks show after aks wait, since the flip is intentionally
  deferred to post-DCRA postprocessing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
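
Three of the points in the commit message above (blocking on the poller, `raise ... from e`, and `bool()` coercion) can be shown in isolation. This is a minimal sketch, not the code in `azuremonitorprofile.py`: `UnknownError` is a local stand-in for the azure-cli class, `flip_control_plane` and its `send_put` argument are hypothetical, and the poller is assumed to expose a `result()` method as the commit message describes.

```python
class UnknownError(Exception):
    """Stand-in for the azure-cli UnknownError class."""


def flip_control_plane(send_put):
    """Run the PUT and block on its result, re-raising failures with context."""
    try:
        poller = send_put()
        # Waiting here means the flip is persisted before the caller continues.
        return poller.result()
    except Exception as e:  # broad on purpose in this sketch
        # str(e) keeps the message readable; "from e" keeps the original traceback.
        raise UnknownError(str(e)) from e


def get_enable_control_plane_metrics(raw_parameters):
    # dict.get returns None when the key is absent; bool() turns that into False.
    return bool(raw_parameters.get("enable_control_plane_metrics"))
```

With `from e`, the original exception stays reachable as `__cause__` on the raised `UnknownError`, which is what keeps the underlying traceback visible in logs.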
bragi92 added a commit to bragi92/azure-cli-extensions that referenced this pull request Jun 12, 2026
The wheel under test is the aks-preview extension PR (Azure#9931). The GA in-box CLI PR (#33537) is a parallel change in az aks, not a mirror.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@FumingZhang

Member

/azp run

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@FumingZhang

Member

see comment Azure/azure-cli#33537 (review)

bragi92 and others added 4 commits June 12, 2026 08:28
Per FumingZhang review feedback on Azure/azure-cli#33537: calling get_enable_control_plane_metrics() purely to trigger validation and discarding the return value is a confusing pattern. Extract the validation block into a new private _validate_control_plane_metrics_params method, expose a public validate_control_plane_metrics_params, and have the getters delegate to it when enable_validation=True (preserves existing API). The two _setup_azure_monitor_profile call sites now call the validator directly instead of discarding a getter result.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Mirrors the GA azure-cli cleanup. Other validators in the file are a single public def validate_xxx(self) -> None — no private companion. Collapse the extra _validate_control_plane_metrics_params indirection so the new validator matches the file's convention. Tests + behavior unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
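
The shape that the two refactoring commits above converge on (one public validator, getters that delegate to it only when `enable_validation=True`, and call sites that validate explicitly) might look roughly like this. It is a sketch only: `ControlPlaneMetricsContext` and `set_up_profile` are hypothetical, trimmed-down stand-ins, the exception class is a local stand-in for the azure-cli error, and only one validation rule is shown.

```python
class MutuallyExclusiveArgumentError(Exception):
    """Stand-in for the azure-cli error of the same name."""


class ControlPlaneMetricsContext:
    """Hypothetical, trimmed-down stand-in for AKSPreviewManagedClusterContext."""

    def __init__(self, raw_parameters):
        self.raw_parameters = raw_parameters

    def validate_control_plane_metrics_params(self) -> None:
        # Only one rule is shown; the validator in the PR checks more combinations.
        enable = bool(self.raw_parameters.get("enable_control_plane_metrics"))
        disable = bool(self.raw_parameters.get("disable_control_plane_metrics"))
        if enable and disable:
            raise MutuallyExclusiveArgumentError(
                "Cannot specify --enable-control-plane-metrics and "
                "--disable-control-plane-metrics at the same time."
            )

    def get_enable_control_plane_metrics(self, enable_validation: bool = True) -> bool:
        # The getter delegates to the validator instead of owning the checks.
        if enable_validation:
            self.validate_control_plane_metrics_params()
        return bool(self.raw_parameters.get("enable_control_plane_metrics"))


def set_up_profile(ctx: ControlPlaneMetricsContext) -> bool:
    """Call site: validate explicitly, then read the flag without re-validating."""
    ctx.validate_control_plane_metrics_params()
    return ctx.get_enable_control_plane_metrics(enable_validation=False)
```

The call site no longer calls a getter just for its side effect, which was the pattern the review feedback flagged as confusing.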
Mirrors the GA azure-cli cleanup. The aka.ms/aks/controlplane-metrics shortlink does not resolve. Drop the trailing reference from the help strings (create + update enable, update disable, plus _help.py YAML for both). Vendored SDK docstrings are auto-generated upstream and untouched.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Mirrors the GA azure-cli cleanup. Replace 'kube-apiserver, etcd, etc' with the actual default Prometheus scrape job names: controlplane-apiserver and controlplane-etcd. These are the targets users see in AMW and what the AKS docs reference. The 'etc' was also misleading since scheduler / controller-manager / NAP targets are opt-in via MinimalIngestionProfile and are not flipped on by this flag.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bragi92

bragi92 commented Jun 12, 2026

Member Author

/azp run

@azure-pipelines

Commenter does not have sufficient privileges for PR 9931 in repo Azure/azure-cli-extensions

@FumingZhang

Member

/azp run

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@FumingZhang

Member

Please fix failed test cases @bragi92

FAILED src/aks-preview/azext_aks_preview/tests/latest/test_aks_commands.py::AzureKubernetesServiceScenarioTest::test_aks_check_network
FAILED src/aks-preview/azext_aks_preview/tests/latest/test_managed_cluster_decorator.py::AKSPreviewManagedClusterCreateDecoratorTestCase::test_set_up_azure_monitor_profile_create_cp_without_amp_raises

https://dev.azure.com/azclitools/public/_build/results?buildId=322389&view=logs&j=1cf1d69d-a933-5235-4979-b9c5545d49ac&t=100ba2b9-8d66-5bd1-5596-9506ad38ec65&l=5486

…sertion

The validator raises RequiredArgumentMissingError, not ArgumentUsageError. These are sibling classes under UserFault (not parent/child), so assertRaises(ArgumentUsageError) did not catch the error and the test failed in CI.
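
The failure mode described in this commit message can be reproduced with stand-in classes. The hierarchy below mirrors the relationship stated above (both errors derive from `UserFault` and neither derives from the other); the classes and the `validator` function are local stand-ins, not imports from azure-cli.

```python
import unittest


class UserFault(Exception):
    """Stand-in for the common parent class."""


class ArgumentUsageError(UserFault):
    """Stand-in sibling class."""


class RequiredArgumentMissingError(UserFault):
    """Stand-in sibling class."""


def validator():
    """Hypothetical validator that rejects the flag combination."""
    raise RequiredArgumentMissingError("Azure Monitor metrics must be enabled")


class SiblingErrorTest(unittest.TestCase):
    def test_exact_class_matches(self):
        with self.assertRaises(RequiredArgumentMissingError):
            validator()

    def test_common_parent_matches(self):
        with self.assertRaises(UserFault):
            validator()

    def test_sibling_does_not_match(self):
        # The inner assertRaises does not swallow a sibling class, so the
        # error propagates to the outer context manager.
        with self.assertRaises(RequiredArgumentMissingError):
            with self.assertRaises(ArgumentUsageError):
                validator()
```

Asserting on the exact class, as the fix does, is stricter than asserting on `UserFault`, so a later change in the raised error type would still be caught by the test.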
@bragi92

bragi92 commented Jun 15, 2026

Member Author

Addressed the failure. @FumingZhang, can you help re-run the pipeline with /azp run?

Also, the following failure looks like a flaky test that is unrelated to my change:
FAILED src/aks-preview/azext_aks_preview/tests/latest/test_aks_commands.py::AzureKubernetesServiceScenarioTest::test_aks_check_network

I've fixed the other test.

@FumingZhang

Member

/azp run

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@FumingZhang

Member

Fixing the recording file of test case test_aks_check_network in #9940, cc @bragi92

@bragi92

bragi92 commented Jun 16, 2026

Member Author

Thank you Fuming, I'll wait for your PR to merge and then take in your change into my fork.

@FumingZhang

Member

Thank you Fuming, I'll wait for your PR to merge and then take in your change into my fork.

Hey @bragi92, #9940 has been merged

bragi92 and others added 2 commits June 17, 2026 09:27
The test_aks_check_network integration test was failing across all
Python jobs with a CannotOverwriteExistingCassetteException because the
recorded VMSS request no longer matched. Pull the corrected recording
from upstream/main (Azure#9940) so the build passes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Bump aks-preview VERSION to 21.0.0b6 and move the
--enable/--disable-control-plane-metrics changelog entry out of the
Pending section into its own 21.0.0b6 release so it ships with this PR.

Reconcile HISTORY.rst with main, which already released 21.0.0b5
(prepared-image-specification + the --node-image-only fix), so the
changelog is consistent and the version is a clean increment over main.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bragi92

bragi92 commented Jun 17, 2026

Member Author

Thanks @FumingZhang. I've merged #9940 into my fork. Can you help with re-running the pipeline and the merge?

Resolve release conflicts in aks-preview HISTORY.rst and setup.py: keep version 21.0.0b6 for the control-plane-metrics change, on top of upstream's released 21.0.0b5 (prepared-image-specification + --node-image-only fix).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@FumingZhang

Member

/azp run

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

Labels

AKS Auto-Assign Auto assign by bot

6 participants