Skip to content

admin/broker: restore build_info in ListBrokers response#29785

Open
pgellert wants to merge 3 commits intoredpanda-data:devfrom
pgellert:fix/conn-list-version-check
Open

admin/broker: restore build_info in ListBrokers response#29785
pgellert wants to merge 3 commits intoredpanda-data:devfrom
pgellert:fix/conn-list-version-check

Conversation

@pgellert
Copy link
Contributor

@pgellert pgellert commented Mar 10, 2026

bf9fe02 earlier removed build_info from
ListBrokers responses to avoid requiring all brokers to be reachable.
However, this broke rpk cluster connections list, which checks
build_info to validate version compatibility.

Populate build_info for all brokers in ListBrokers using the responding
broker's version. This fixes rpk cluster connections list which checks
build_info to validate version compatibility.

Fixes https://redpandadata.atlassian.net/browse/CORE-15134

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x

Release Notes

  • none

@pgellert pgellert self-assigned this Mar 10, 2026
@pgellert pgellert requested review from a team and rockwotj as code owners March 10, 2026 16:43
@pgellert pgellert requested review from Copilot and nguyen-andrew and removed request for a team March 10, 2026 16:43
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Restores build_info population in AdminV2 ListBrokers responses (to unblock rpk cluster connections list version compatibility checks) while keeping the endpoint resilient by returning partial broker entries when some nodes are unreachable.

Changes:

  • Update broker admin service to fetch broker info from other nodes in parallel via best-effort GetBroker RPCs.
  • Add rptest coverage for rpk cluster connections list and for ListBrokers behavior when a broker is down.
  • Update proto / type-stub comments to document that build_info / admin_server may be absent in ListBrokers if unreachable.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/v/redpanda/admin/services/broker.cc Parallelizes best-effort broker info fetching to repopulate build_info in ListBrokers.
proto/redpanda/core/admin/v2/broker.proto Documents new ListBrokers field presence expectations for unreachable brokers.
tests/rptest/clients/admin/proto/redpanda/core/admin/v2/broker_pb2.pyi Mirrors proto comment changes in Python typing stubs.
tests/rptest/tests/list_kafka_connections_test.py Adds an rptest that exercises the rpk cluster connections list compatibility path.
tests/rptest/tests/admin_api_auth_test.py Adds resilience test ensuring ListBrokers returns all brokers even if one is down.

Comment on lines +73 to +77
// Only populated for `GetBroker` RPCs
// May be absent in ListBrokers responses if the broker was unreachable.
optional BuildInfo build_info = 2;
// The admin server information.
//
// Only populated for `GetBroker` RPCs
// May be absent in ListBrokers responses if the broker was unreachable.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't like this being non-deterministic 😄

Can we leave documented it's not populated for GetBroker RPCs, and then have rpk call GetBroker for future versions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. The reason why I didn't go with that is that then this will still fail with 25.3 rpk versions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No I'm saying let's populate it for backwards compat (or maybe just use the called broker's version and assume there is no partial rollout?), but just document it as removed so we can maybe remove it down the road.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I've adjusted the rpk implementation now so that we can make the broker-side removal later on. And I think assuming no partial rollout seems reasonable since the only user should be rpk's version check.

Populate build_info for all brokers in ListBrokers using the responding
broker's version. This fixes rpk cluster connections list which checks
build_info to validate version compatibility.
Verifies that rpk cluster connections list works correctly, exercising
the code path where rpk calls ListBrokers to check version compatibility.
Use GetBroker instead of ListBrokers for the version compatibility check.
This avoids relying on build_info being populated in ListBrokers responses,
allowing that field to be removed in the future.
@pgellert pgellert force-pushed the fix/conn-list-version-check branch from 0dbc49a to d4c98c4 Compare March 10, 2026 17:48
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Comment on lines +67 to +68
// compatibility check.
broker.set_build_info(std::move(self_broker().get_build_info()));
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ListBrokers is populating build_info for other brokers by copying the full self_broker().build_info (including build_sha). That makes the response report an incorrect build_sha (and any other future fields) for non-responding brokers. If the intent is only to preserve version compatibility checks, consider setting only the version field (and leaving build_sha unset/empty) for these synthetic broker entries, or omit build_sha explicitly.

Suggested change
// compatibility check.
broker.set_build_info(std::move(self_broker().get_build_info()));
// compatibility check. Do not set build_sha here, because we do not
// know the exact build of non-responding brokers.
proto::admin::build_info build_info;
build_info.set_version(ss::sstring(redpanda_git_version()));
broker.set_build_info(std::move(build_info));

Copilot uses AI. Check for mistakes.
Comment on lines 61 to 69
for (auto& [node_id, _] : clients) {
auto& broker = list_resp.get_brokers().emplace_back();
broker.set_node_id(node_id());

// Populate build_info for all brokers with this broker's version for
// backwards compatibility with rpk's cluster connections list version
// compatibility check.
broker.set_build_info(std::move(self_broker().get_build_info()));
}
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self_broker() builds and populates admin_server routes; calling it inside the loop means you rebuild that route list once per broker just to fetch build_info. This is avoidable overhead on every ListBrokers call; compute the responding broker's build info once outside the loop (or construct it directly) and reuse it for each synthetic broker entry.

Copilot uses AI. Check for mistakes.
"""
Tests that rpk cluster connections list works correctly.

This specifically exercises the code path where rpk calls ListBrokers
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test docstring mentions that rpk calls ListBrokers for version compatibility, but in this PR rpk cluster connections list now checks compatibility via GetBroker instead. Please update the comment to match the actual code path being exercised so the test remains accurate over time.

Suggested change
This specifically exercises the code path where rpk calls ListBrokers
This specifically exercises the code path where rpk calls GetBroker

Copilot uses AI. Check for mistakes.
@pgellert
Copy link
Contributor Author

force-push: use hardcoded version instead and adjust rpk implementation for 26.1 based on feedback

@pgellert pgellert requested a review from rockwotj March 10, 2026 17:51
Comment on lines -68 to -75
for _, broker := range brokers.Msg.Brokers {
if broker.BuildInfo == nil {
return false
}
version, err := redpanda.VersionFromString(broker.BuildInfo.Version)
if err != nil || !version.IsAtLeast(minVersion) {
return false
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seemed to check that all brokers had a version greater than minVersion to support the feature, but the new code only seems to check that a single broker version meets the min version. Is that okay? And which broker is checked (is it determined by the client parameter)?

Or is there somewhere else in the code that calls isFeatureSupported with a client for each broker to ensure that all brokers support it? Does list connections requires all brokers in the cluster to support it or can it work on some brokers in a cluster but not others (i.e. partial upgrades)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seemed to check that all brokers had a version greater than minVersion to support the feature, but the new code only seems to check that a single broker version meets the min version. Is that okay?

Yes, this should be fine since this is purely for better UX (showing a friendly "feature not supported" message instead of an API error).

And which broker is checked (is it determined by the client parameter)?

Whichever broker the admin client connects to, which is, I believe, is randomly chosen from the set of configured brokers.

Or is there somewhere else in the code that calls isFeatureSupported with a client for each broker to ensure that all brokers support it?

No.

Does list connections requires all brokers in the cluster to support it or can it work on some brokers in a cluster but not others (i.e. partial upgrades)?

Yes, the ListKafkaConnections API must be available on all brokers. If rpk connects to a supported broker but then queries an unsupported one, the user would see a different error (an API error rather than our friendly "feature not supported" message). Given the narrow scenario where this would matter (mid-upgrade, pre-25.3 brokers, post-26.1 rpk), and the minor impact (just a less friendly error message), this seems like a reasonable trade-off to simplify the admin API.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks for the in-depth context!

@vbotbuildovich
Copy link
Collaborator

CI test results

test results on build#81555
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
DatalakeClusterRestoreTest test_restore_partition_spec {"catalog_type": "rest_jdbc", "cloud_storage_type": 1} integration https://buildkite.com/redpanda/redpanda/builds/81555#019cd8ed-bad9-4857-8072-64b5df6141c3 FLAKY 10/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0012, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DatalakeClusterRestoreTest&test_method=test_restore_partition_spec
RedpandaNodeOperationsSmokeTest test_node_ops_smoke_test {"cloud_storage_type": 1, "mixed_versions": false} integration https://buildkite.com/redpanda/redpanda/builds/81555#019cd8ed-22c9-40aa-88c9-7515a4b7ce71 FLAKY 36/41 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0477, p0=0.1218, reject_threshold=0.0100. adj_baseline=0.1363, p1=0.3480, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test

@pgellert pgellert requested a review from nguyen-andrew March 11, 2026 10:59
@pgellert
Copy link
Contributor Author

@graham-rp @r-vasquez When you have a moment, could you take a look at the rpk changes here? It's a small adjustment to use GetBroker instead of ListBrokers for the version check.

// Determine whether the broker version supports client connection monitoring
brokers, err := client.BrokerService().ListBrokers(ctx, &connect.Request[adminv2.ListBrokersRequest]{})
if err != nil || len(brokers.Msg.Brokers) == 0 {
resp, err := client.BrokerService().GetBroker(ctx, &connect.Request[adminv2.GetBrokerRequest]{})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah for 25.3 this needs to be an explicit -1 node ID otherwise it will try to find node 0 which may or may not be available.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants