admin/broker: restore build_info in ListBrokers response#29785
admin/broker: restore build_info in ListBrokers response#29785pgellert wants to merge 3 commits intoredpanda-data:devfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Restores build_info population in AdminV2 ListBrokers responses (to unblock rpk cluster connections list version compatibility checks) while keeping the endpoint resilient by returning partial broker entries when some nodes are unreachable.
Changes:
- Update broker admin service to fetch broker info from other nodes in parallel via best-effort
GetBrokerRPCs. - Add rptest coverage for
rpk cluster connections listand forListBrokersbehavior when a broker is down. - Update proto / type-stub comments to document that
build_info/admin_servermay be absent inListBrokersif unreachable.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
src/v/redpanda/admin/services/broker.cc |
Parallelizes best-effort broker info fetching to repopulate build_info in ListBrokers. |
proto/redpanda/core/admin/v2/broker.proto |
Documents new ListBrokers field presence expectations for unreachable brokers. |
tests/rptest/clients/admin/proto/redpanda/core/admin/v2/broker_pb2.pyi |
Mirrors proto comment changes in Python typing stubs. |
tests/rptest/tests/list_kafka_connections_test.py |
Adds an rptest that exercises the rpk cluster connections list compatibility path. |
tests/rptest/tests/admin_api_auth_test.py |
Adds resilience test ensuring ListBrokers returns all brokers even if one is down. |
| // Only populated for `GetBroker` RPCs | ||
| // May be absent in ListBrokers responses if the broker was unreachable. | ||
| optional BuildInfo build_info = 2; | ||
| // The admin server information. | ||
| // | ||
| // Only populated for `GetBroker` RPCs | ||
| // May be absent in ListBrokers responses if the broker was unreachable. |
There was a problem hiding this comment.
I really don't like this being non-deterministic 😄
Can we leave documented it's not populated for GetBroker RPCs, and then have rpk call GetBroker for future versions?
There was a problem hiding this comment.
Fair enough. The reason why I didn't go with that is that then this will still fail with 25.3 rpk versions.
There was a problem hiding this comment.
No I'm saying let's populate it for backwards compat (or maybe just use the called broker's version and assume there is no partial rollout?), but just document it as removed so we can maybe remove it down the road.
There was a problem hiding this comment.
Makes sense. I've adjusted the rpk implementation now so that we can make the broker-side removal later on. And I think assuming no partial rollout seems reasonable since the only user should be rpk's version check.
Populate build_info for all brokers in ListBrokers using the responding broker's version. This fixes rpk cluster connections list which checks build_info to validate version compatibility.
edf11c4 to
0dbc49a
Compare
Verifies that rpk cluster connections list works correctly, exercising the code path where rpk calls ListBrokers to check version compatibility.
Use GetBroker instead of ListBrokers for the version compatibility check. This avoids relying on build_info being populated in ListBrokers responses, allowing that field to be removed in the future.
0dbc49a to
d4c98c4
Compare
| // compatibility check. | ||
| broker.set_build_info(std::move(self_broker().get_build_info())); |
There was a problem hiding this comment.
ListBrokers is populating build_info for other brokers by copying the full self_broker().build_info (including build_sha). That makes the response report an incorrect build_sha (and any other future fields) for non-responding brokers. If the intent is only to preserve version compatibility checks, consider setting only the version field (and leaving build_sha unset/empty) for these synthetic broker entries, or omit build_sha explicitly.
| // compatibility check. | |
| broker.set_build_info(std::move(self_broker().get_build_info())); | |
| // compatibility check. Do not set build_sha here, because we do not | |
| // know the exact build of non-responding brokers. | |
| proto::admin::build_info build_info; | |
| build_info.set_version(ss::sstring(redpanda_git_version())); | |
| broker.set_build_info(std::move(build_info)); |
| for (auto& [node_id, _] : clients) { | ||
| auto& broker = list_resp.get_brokers().emplace_back(); | ||
| broker.set_node_id(node_id()); | ||
|
|
||
| // Populate build_info for all brokers with this broker's version for | ||
| // backwards compatibility with rpk's cluster connections list version | ||
| // compatibility check. | ||
| broker.set_build_info(std::move(self_broker().get_build_info())); | ||
| } |
There was a problem hiding this comment.
self_broker() builds and populates admin_server routes; calling it inside the loop means you rebuild that route list once per broker just to fetch build_info. This is avoidable overhead on every ListBrokers call; compute the responding broker's build info once outside the loop (or construct it directly) and reuse it for each synthetic broker entry.
| """ | ||
| Tests that rpk cluster connections list works correctly. | ||
|
|
||
| This specifically exercises the code path where rpk calls ListBrokers |
There was a problem hiding this comment.
The test docstring mentions that rpk calls ListBrokers for version compatibility, but in this PR rpk cluster connections list now checks compatibility via GetBroker instead. Please update the comment to match the actual code path being exercised so the test remains accurate over time.
| This specifically exercises the code path where rpk calls ListBrokers | |
| This specifically exercises the code path where rpk calls GetBroker |
|
force-push: use hardcoded version instead and adjust rpk implementation for 26.1 based on feedback |
| for _, broker := range brokers.Msg.Brokers { | ||
| if broker.BuildInfo == nil { | ||
| return false | ||
| } | ||
| version, err := redpanda.VersionFromString(broker.BuildInfo.Version) | ||
| if err != nil || !version.IsAtLeast(minVersion) { | ||
| return false | ||
| } |
There was a problem hiding this comment.
This seemed to check that all brokers had a version greater than minVersion to support the feature, but the new code only seems to check that a single broker version meets the min version. Is that okay? And which broker is checked (is it determined by the client parameter)?
Or is there somewhere else in the code that calls isFeatureSupported with a client for each broker to ensure that all brokers support it? Does list connections requires all brokers in the cluster to support it or can it work on some brokers in a cluster but not others (i.e. partial upgrades)?
There was a problem hiding this comment.
This seemed to check that all brokers had a version greater than minVersion to support the feature, but the new code only seems to check that a single broker version meets the min version. Is that okay?
Yes, this should be fine since this is purely for better UX (showing a friendly "feature not supported" message instead of an API error).
And which broker is checked (is it determined by the client parameter)?
Whichever broker the admin client connects to, which is, I believe, is randomly chosen from the set of configured brokers.
Or is there somewhere else in the code that calls isFeatureSupported with a client for each broker to ensure that all brokers support it?
No.
Does list connections requires all brokers in the cluster to support it or can it work on some brokers in a cluster but not others (i.e. partial upgrades)?
Yes, the ListKafkaConnections API must be available on all brokers. If rpk connects to a supported broker but then queries an unsupported one, the user would see a different error (an API error rather than our friendly "feature not supported" message). Given the narrow scenario where this would matter (mid-upgrade, pre-25.3 brokers, post-26.1 rpk), and the minor impact (just a less friendly error message), this seems like a reasonable trade-off to simplify the admin API.
There was a problem hiding this comment.
Got it, thanks for the in-depth context!
CI test resultstest results on build#81555
|
|
@graham-rp @r-vasquez When you have a moment, could you take a look at the rpk changes here? It's a small adjustment to use GetBroker instead of ListBrokers for the version check. |
| // Determine whether the broker version supports client connection monitoring | ||
| brokers, err := client.BrokerService().ListBrokers(ctx, &connect.Request[adminv2.ListBrokersRequest]{}) | ||
| if err != nil || len(brokers.Msg.Brokers) == 0 { | ||
| resp, err := client.BrokerService().GetBroker(ctx, &connect.Request[adminv2.GetBrokerRequest]{}) |
There was a problem hiding this comment.
Ah for 25.3 this needs to be an explicit -1 node ID otherwise it will try to find node 0 which may or may not be available.
bf9fe02 earlier removed build_info from
ListBrokers responses to avoid requiring all brokers to be reachable.
However, this broke rpk cluster connections list, which checks
build_info to validate version compatibility.
Populate build_info for all brokers in ListBrokers using the responding
broker's version. This fixes rpk cluster connections list which checks
build_info to validate version compatibility.
Fixes https://redpandadata.atlassian.net/browse/CORE-15134
Backports Required
Release Notes