Commit 5afa0de

[gateway] ingest sensor measurements from SPs into oximeter (#6354)
This branch adds code to the Management Gateway Service (MGS) for periodically polling sensor measurements from SPs and emitting them to Oximeter. In particular, this consists of:

- a task for managing the metrics endpoint, which waits until MGS knows its underlay network address before binding the endpoint and registering it with the control plane,
- tasks for polling sensor measurements from each individual SP that MGS knows about,
- a task that waits until SP discovery has completed and the rack ID is known, and then spawns a poller task for every discovered SP slot.

The SP poller tasks send samples to the Oximeter producer endpoint using a `tokio::sync::broadcast` channel, which I've chosen primarily because it can be used as a bounded ring buffer that actually overwrites the *oldest* value when the buffer is full. This way, we use a bounded amount of memory for samples, but prioritize the most recent samples if we have to throw anything away because Oximeter hasn't come along to collect them recently.

The poller tasks cache the component inventory and identifying information from the SP, so that we don't have to re-read all this data from the SP on every poll. While MGS, running on a host, would probably be fine with doing this, it seems better to avoid making the SP do unnecessary work at a 1 Hz poll frequency, especially when *both* switch zones are polling them. Instead, every time we poll sensor data from an SP, we first ask it for its current state, and only invalidate our cached understanding of the SP when the state changes. This way, if an SP starts reporting new metrics due to a firmware update, or gets replaced with a different chassis with a new serial number, revision, etc., we won't continue to report metrics for stale targets, but we don't have to reload all of that once per second.

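The caching rule described above can be sketched in a few lines of standard-library Rust. This is only an illustration under stated assumptions: `SpState`, `Inventory`, `Sp`, `Poller`, and `FakeSp` are hypothetical names invented for this sketch, not the types used in MGS, and the real pollers are asynchronous tasks rather than synchronous calls.

```rust
/// Hypothetical stand-in for the state an SP reports (illustrative fields only).
#[derive(Clone, Debug, PartialEq, Eq)]
struct SpState {
    serial: String,
    revision: u32,
    hubris_version: String,
}

/// Hypothetical stand-in for the data that is expensive to re-read from the SP.
#[derive(Clone, Debug, PartialEq, Eq)]
struct Inventory {
    components: Vec<String>,
}

/// Minimal interface to an SP, assumed for this sketch.
trait Sp {
    fn state(&mut self) -> SpState;
    fn inventory(&mut self) -> Inventory;
}

/// The cached inventory together with the state it was read under.
struct Cached {
    state: SpState,
    inventory: Inventory,
}

struct Poller {
    cache: Option<Cached>,
}

impl Poller {
    /// Ask the SP for its (cheap) current state on every poll, and re-read
    /// the (expensive) inventory only when that state differs from the one
    /// the cache was built from.
    fn refresh<S: Sp>(&mut self, sp: &mut S) -> &Inventory {
        let current = sp.state();
        let stale = match &self.cache {
            Some(cached) => cached.state != current,
            None => true,
        };
        if stale {
            let inventory = sp.inventory();
            self.cache = Some(Cached { state: current, inventory });
        }
        &self.cache.as_ref().expect("cache was just populated").inventory
    }
}

/// Fake SP that counts how often its inventory is read.
struct FakeSp {
    state: SpState,
    inventory_reads: usize,
}

impl Sp for FakeSp {
    fn state(&mut self) -> SpState {
        self.state.clone()
    }

    fn inventory(&mut self) -> Inventory {
        self.inventory_reads += 1;
        Inventory { components: vec!["dev-7".to_string()] }
    }
}
```

With a fake like this, repeated calls to `refresh` read the inventory once, and a change to any field of the reported state (for example a new serial number) triggers exactly one more read.
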
To detect scenarios where the SP's state and/or identity has changed in the midst of polling its sensors (which may result in mislabeled metrics), we check whether the SP's state at the end of the poll matches its state at the beginning, and if it does not, we poll again immediately with its new identity.

At present, the timestamps for these metric samples are generated by MGS: each one is the time when MGS received the sensor data from the SP, as MGS understands it. Because we don't currently collect data that was recorded prior to the switch zone coming up, we don't need to worry about figuring out timestamps for data recorded by the SP prior to the existence of a wall clock. Figuring out the SP/MGS timebase synchronization is probably a lot of additional work, although it would be nice to do in the future. At present, [metrics emitted by sled-agent prior to NTP sync will also be from 1987][1], so I think it's fine to do something similar here, especially because the potential solutions to that [also have their fair share of tradeoffs][2].

The new metrics use a schema in `oximeter/oximeter/schema/hardware-component.toml`. The target of these metrics is a `hardware_component` that includes:

- the rack ID and the identity of the MGS instance that collected the metric,
- information identifying the chassis[^1] and the SP that recorded them (its serial number, model number, revision, and whether it's a switch, a sled, or a power shelf),
- the SP's Hubris archive version (since the reported sensor data may change in future firmware releases),
- the SP's ID for the hardware component (e.g. "dev-7"), the kind of device (e.g. "tmp117", "max5970"), and the human-readable description (e.g. "Southeast temperature sensor", "U.2 Sharkfin A hot swap controller", etc.) reported by the SP.

Each kind of sensor reading has an individual metric (`hardware_component:temperature`, `hardware_component:current`, `hardware_component:voltage`, and so on).
These metrics are labeled with the SP-reported name of the individual sensor measurement channel. For instance, a MAX5970 hotswap controller on sharkfin will have a voltage and current metric named "V12_U2A_A0" for the 12V rail, and a voltage and current metric named "V3P3_U2A_A0" for the 3.3V rail.

Finally, a `hardware_component:sensor_errors` metric records sensor errors reported by the SP, labeled with the sensor name, what kind of sensor it is, and a string representation of the error.

[1]: #6354 (comment)
[2]: #6354 (comment)

[^1]: I'm using "chassis" as a generic term to refer to "switch, sled, or power shelf".
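
The overwrite-oldest behavior that the commit message relies on for the sample channel can be pictured with a small standard-library sketch. This is not how `tokio::sync::broadcast` is implemented, and `SampleRing` is a hypothetical type made up for illustration; it only shows the policy of bounding memory while keeping the newest samples.

```rust
use std::collections::VecDeque;

/// A bounded buffer that discards the *oldest* sample when full, so the
/// most recent samples always survive until the next collection.
struct SampleRing<T> {
    capacity: usize,
    samples: VecDeque<T>,
    /// How many samples were discarded because nobody collected in time.
    dropped: u64,
}

impl<T> SampleRing<T> {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self {
            capacity,
            samples: VecDeque::with_capacity(capacity),
            dropped: 0,
        }
    }

    /// Called by a poller: never blocks and never grows past `capacity`.
    fn push(&mut self, sample: T) {
        if self.samples.len() == self.capacity {
            self.samples.pop_front();
            self.dropped += 1;
        }
        self.samples.push_back(sample);
    }

    /// Called when the collector comes along: hands over everything that is
    /// still buffered, oldest first, and leaves the ring empty.
    fn drain(&mut self) -> Vec<T> {
        self.samples.drain(..).collect()
    }
}
```

Pushing five samples into a ring of capacity three and then draining it yields the last three samples, with two counted as dropped. A plain bounded queue would instead reject the new samples or block the producer, which is the opposite of the priority described in the commit message.
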
1 parent 31ea57e commit 5afa0de

28 files changed (+1990 -33 lines)


Cargo.lock

Lines changed: 4 additions & 0 deletions

clients/nexus-client/src/lib.rs

Lines changed: 4 additions & 0 deletions
@@ -213,6 +213,7 @@ impl From<omicron_common::api::internal::nexus::ProducerKind>
     fn from(kind: omicron_common::api::internal::nexus::ProducerKind) -> Self {
         use omicron_common::api::internal::nexus::ProducerKind;
         match kind {
+            ProducerKind::ManagementGateway => Self::ManagementGateway,
             ProducerKind::SledAgent => Self::SledAgent,
             ProducerKind::Service => Self::Service,
             ProducerKind::Instance => Self::Instance,
@@ -390,6 +391,9 @@ impl From<types::ProducerKind>
     fn from(kind: types::ProducerKind) -> Self {
         use omicron_common::api::internal::nexus::ProducerKind;
         match kind {
+            types::ProducerKind::ManagementGateway => {
+                ProducerKind::ManagementGateway
+            }
             types::ProducerKind::SledAgent => ProducerKind::SledAgent,
             types::ProducerKind::Instance => ProducerKind::Instance,
             types::ProducerKind::Service => ProducerKind::Service,

clients/oximeter-client/src/lib.rs

Lines changed: 1 addition & 0 deletions
@@ -26,6 +26,7 @@ impl From<omicron_common::api::internal::nexus::ProducerKind>
     fn from(kind: omicron_common::api::internal::nexus::ProducerKind) -> Self {
         use omicron_common::api::internal::nexus;
         match kind {
+            nexus::ProducerKind::ManagementGateway => Self::ManagementGateway,
             nexus::ProducerKind::Service => Self::Service,
             nexus::ProducerKind::SledAgent => Self::SledAgent,
             nexus::ProducerKind::Instance => Self::Instance,

common/src/api/internal/nexus.rs

Lines changed: 2 additions & 0 deletions
@@ -223,6 +223,8 @@ pub enum ProducerKind {
     Service,
     /// The producer is a Propolis VMM managing a guest instance.
     Instance,
+    /// The producer is a management gateway service.
+    ManagementGateway,
 }
 
 /// Information announced by a metric server, used so that clients can contact it and collect
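
The three `ProducerKind` diffs above follow one pattern: the enum is mirrored in each generated client, and the `From` impls use exhaustive matches, so a new variant has to be threaded through every conversion. A minimal sketch of that pattern follows; `InternalProducerKind` and `ClientProducerKind` are hypothetical names standing in for the real types in `omicron_common` and the generated clients.

```rust
/// Stand-in for the enum defined in the common crate.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum InternalProducerKind {
    SledAgent,
    Service,
    Instance,
    ManagementGateway,
}

/// Stand-in for the mirrored enum in a generated API client.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ClientProducerKind {
    SledAgent,
    Service,
    Instance,
    ManagementGateway,
}

impl From<InternalProducerKind> for ClientProducerKind {
    fn from(kind: InternalProducerKind) -> Self {
        // No wildcard arm: forgetting a variant here is a compile error.
        match kind {
            InternalProducerKind::ManagementGateway => Self::ManagementGateway,
            InternalProducerKind::SledAgent => Self::SledAgent,
            InternalProducerKind::Service => Self::Service,
            InternalProducerKind::Instance => Self::Instance,
        }
    }
}

impl From<ClientProducerKind> for InternalProducerKind {
    fn from(kind: ClientProducerKind) -> Self {
        match kind {
            ClientProducerKind::ManagementGateway => Self::ManagementGateway,
            ClientProducerKind::SledAgent => Self::SledAgent,
            ClientProducerKind::Service => Self::Service,
            ClientProducerKind::Instance => Self::Instance,
        }
    }
}
```

Because neither match has a catch-all arm, the compiler points at every conversion that needs a new arm when a variant is added, which is presumably why this commit touches all three files together.
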

dev-tools/mgs-dev/Cargo.toml

Lines changed: 1 addition & 0 deletions
@@ -14,6 +14,7 @@ futures.workspace = true
 gateway-messages.workspace = true
 gateway-test-utils.workspace = true
 libc.workspace = true
+omicron-gateway.workspace = true
 omicron-workspace-hack.workspace = true
 signal-hook-tokio.workspace = true
 tokio.workspace = true

dev-tools/mgs-dev/src/main.rs

Lines changed: 22 additions & 2 deletions
@@ -8,6 +8,7 @@ use clap::{Args, Parser, Subcommand};
 use futures::StreamExt;
 use libc::SIGINT;
 use signal_hook_tokio::Signals;
+use std::net::SocketAddr;
 
 #[tokio::main]
 async fn main() -> anyhow::Result<()> {
@@ -36,7 +37,12 @@ enum MgsDevCmd {
 }
 
 #[derive(Clone, Debug, Args)]
-struct MgsRunArgs {}
+struct MgsRunArgs {
+    /// Override the address of the Nexus instance to use when registering the
+    /// Oximeter producer.
+    #[clap(long)]
+    nexus_address: Option<SocketAddr>,
+}
 
 impl MgsRunArgs {
     async fn exec(&self) -> Result<(), anyhow::Error> {
@@ -46,9 +52,23 @@ impl MgsRunArgs {
         let mut signal_stream = signals.fuse();
 
         println!("mgs-dev: setting up MGS ... ");
-        let gwtestctx = gateway_test_utils::setup::test_setup(
+        let (mut mgs_config, sp_sim_config) =
+            gateway_test_utils::setup::load_test_config();
+        if let Some(addr) = self.nexus_address {
+            mgs_config.metrics =
+                Some(gateway_test_utils::setup::MetricsConfig {
+                    disabled: false,
+                    dev_nexus_address: Some(addr),
+                    dev_bind_loopback: true,
+                });
+        }
+
+        let gwtestctx = gateway_test_utils::setup::test_setup_with_config(
             "mgs-dev",
             gateway_messages::SpPort::One,
+            mgs_config,
+            &sp_sim_config,
+            None,
         )
         .await;
         println!("mgs-dev: MGS is running.");

dev-tools/omdb/tests/successes.out

Lines changed: 20 additions & 5 deletions
@@ -141,9 +141,16 @@ SP DETAILS: type "Sled" slot 0
 
 COMPONENTS
 
-NAME          DESCRIPTION              DEVICE           PRESENCE  SERIAL
-sp3-host-cpu  FAKE host cpu            sp3-host-cpu     Present   None
-dev-0         FAKE temperature sensor  fake-tmp-sensor  Failed    None
+NAME          DESCRIPTION                               DEVICE           PRESENCE  SERIAL
+sp3-host-cpu  FAKE host cpu                             sp3-host-cpu     Present   None
+dev-0         FAKE temperature sensor                   fake-tmp-sensor  Failed    None
+dev-1         FAKE temperature sensor                   tmp117           Present   None
+dev-2         FAKE Southeast temperature sensor         tmp117           Present   None
+dev-6         FAKE U.2 Sharkfin A VPD                   at24csw080       Present   None
+dev-7         FAKE U.2 Sharkfin A hot swap controller   max5970          Present   None
+dev-8         FAKE U.2 A NVMe Basic Management Command  nvme_bmc         Present   None
+dev-39        FAKE T6 temperature sensor                tmp451           Present   None
+dev-53        FAKE Fan controller                       max31790         Present   None
 
 CABOOSES: none found
 
@@ -167,8 +174,16 @@ SP DETAILS: type "Sled" slot 1
 
 COMPONENTS
 
-NAME          DESCRIPTION    DEVICE        PRESENCE  SERIAL
-sp3-host-cpu  FAKE host cpu  sp3-host-cpu  Present   None
+NAME          DESCRIPTION                               DEVICE        PRESENCE  SERIAL
+sp3-host-cpu  FAKE host cpu                             sp3-host-cpu  Present   None
+dev-0         FAKE temperature sensor                   tmp117        Present   None
+dev-1         FAKE temperature sensor                   tmp117        Present   None
+dev-2         FAKE Southeast temperature sensor         tmp117        Present   None
+dev-6         FAKE U.2 Sharkfin A VPD                   at24csw080    Present   None
+dev-7         FAKE U.2 Sharkfin A hot swap controller   max5970       Present   None
+dev-8         FAKE U.2 A NVMe Basic Management Command  nvme_bmc      Present   None
+dev-39        FAKE T6 temperature sensor                tmp451        Present   None
+dev-53        FAKE Fan controller                       max31790      Present   None
 
 CABOOSES: none found
 

gateway-test-utils/configs/config.test.toml

Lines changed: 9 additions & 0 deletions
@@ -88,6 +88,15 @@ addr = "[::1]:0"
 ignition-target = 3
 location = { switch0 = ["sled", 1], switch1 = ["sled", 1] }
 
+#
+# Configuration for SP sensor metrics polling
+#
+[metrics]
+# Allow the Oximeter metrics endpoint to bind on the loopback IP. This is
+# useful in local testing and development, when the gateway service is not
+# given a "real" underlay network IP.
+dev_bind_loopback = true
+
 #
 # NOTE: for the test suite, if mode = "file", the file path MUST be the sentinel
 # string "UNUSED". The actual path will be generated by the test suite for each

gateway-test-utils/configs/sp_sim_config.test.toml

Lines changed: 166 additions & 0 deletions
@@ -20,13 +20,19 @@ device = "fake-tmp-sensor"
 description = "FAKE temperature sensor 1"
 capabilities = 0x2
 presence = "Present"
+sensors = [
+    {name = "Southwest", kind = "Temperature", last_data.value = 41.7890625, last_data.timestamp = 1234 },
+]
 
 [[simulated_sps.sidecar.components]]
 id = "dev-1"
 device = "fake-tmp-sensor"
 description = "FAKE temperature sensor 2"
 capabilities = 0x2
 presence = "Failed"
+sensors = [
+    { name = "South", kind = "Temperature", last_error.value = "DeviceError", last_error.timestamp = 1234 },
+]
 
 [[simulated_sps.sidecar]]
 multicast_addr = "::1"
@@ -56,6 +62,82 @@ device = "fake-tmp-sensor"
 description = "FAKE temperature sensor"
 capabilities = 0x2
 presence = "Failed"
+sensors = [
+    { name = "Southwest", kind = "Temperature", last_error.value = "DeviceError", last_error.timestamp = 1234 },
+]
+[[simulated_sps.gimlet.components]]
+id = "dev-1"
+device = "tmp117"
+description = "FAKE temperature sensor"
+capabilities = 0x2
+presence = "Present"
+sensors = [
+    { name = "South", kind = "Temperature", last_data.value = 42.5625, last_data.timestamp = 1234 },
+]
+
+[[simulated_sps.gimlet.components]]
+id = "dev-2"
+device = "tmp117"
+description = "FAKE Southeast temperature sensor"
+capabilities = 0x2
+presence = "Present"
+sensors = [
+    { name = "Southeast", kind = "Temperature", last_data.value = 41.570313, last_data.timestamp = 1234 },
+]
+
+[[simulated_sps.gimlet.components]]
+id = "dev-6"
+device = "at24csw080"
+description = "FAKE U.2 Sharkfin A VPD"
+capabilities = 0x0
+presence = "Present"
+
+[[simulated_sps.gimlet.components]]
+id = "dev-7"
+device = "max5970"
+description = "FAKE U.2 Sharkfin A hot swap controller"
+capabilities = 0x2
+presence = "Present"
+sensors = [
+    { name = "V12_U2A_A0", kind = "Current", last_data.value = 0.45898438, last_data.timestamp = 1234 },
+    { name = "V3P3_U2A_A0", kind = "Current", last_data.value = 0.024414063, last_data.timestamp = 1234 },
+    { name = "V12_U2A_A0", kind = "Voltage", last_data.value = 12.03125, last_data.timestamp = 1234 },
+    { name = "V3P3_U2A_A0", kind = "Voltage", last_data.value = 3.328125, last_data.timestamp = 1234 },
+]
+
+[[simulated_sps.gimlet.components]]
+id = "dev-8"
+device = "nvme_bmc"
+description = "FAKE U.2 A NVMe Basic Management Command"
+capabilities = 0x2
+presence = "Present"
+sensors = [
+    { name = "U2_N0", kind = "Temperature", last_data.value = 56.0, last_data.timestamp = 1234 },
+]
+[[simulated_sps.gimlet.components]]
+id = "dev-39"
+device = "tmp451"
+description = "FAKE T6 temperature sensor"
+capabilities = 0x2
+presence = "Present"
+sensors = [
+    { name = "t6", kind = "Temperature", last_data.value = 70.625, last_data.timestamp = 1234 },
+]
+[[simulated_sps.gimlet.components]]
+id = "dev-53"
+device = "max31790"
+description = "FAKE Fan controller"
+capabilities = 0x2
+presence = "Present"
+sensors = [
+    { name = "Southeast", kind = "Speed", last_data.value = 2607.0, last_data.timestamp = 1234 },
+    { name = "Northeast", kind = "Speed", last_data.value = 2476.0, last_data.timestamp = 1234 },
+    { name = "South", kind = "Speed", last_data.value = 2553.0, last_data.timestamp = 1234 },
+    { name = "North", kind = "Speed", last_data.value = 2265.0, last_data.timestamp = 1234 },
+    { name = "Southwest", kind = "Speed", last_data.value = 2649.0, last_data.timestamp = 1234 },
+    { name = "Northwest", kind = "Speed", last_data.value = 2275.0, last_data.timestamp = 1234 },
+]
+
 
 [[simulated_sps.gimlet]]
 multicast_addr = "::1"
@@ -72,6 +154,90 @@ capabilities = 0
 presence = "Present"
 serial_console = "[::1]:0"
 
+
+[[simulated_sps.gimlet.components]]
+id = "dev-0"
+device = "tmp117"
+description = "FAKE temperature sensor"
+capabilities = 0x2
+presence = "Present"
+sensors = [
+    { name = "Southwest", kind = "Temperature", last_data.value = 41.3629, last_data.timestamp = 1234 },
+]
+[[simulated_sps.gimlet.components]]
+id = "dev-1"
+device = "tmp117"
+description = "FAKE temperature sensor"
+capabilities = 0x2
+presence = "Present"
+sensors = [
+    { name = "South", kind = "Temperature", last_data.value = 42.5625, last_data.timestamp = 1234 },
+]
+
+[[simulated_sps.gimlet.components]]
+id = "dev-2"
+device = "tmp117"
+description = "FAKE Southeast temperature sensor"
+capabilities = 0x2
+presence = "Present"
+sensors = [
+    { name = "Southeast", kind = "Temperature", last_data.value = 41.570313, last_data.timestamp = 1234 },
+]
+
+[[simulated_sps.gimlet.components]]
+id = "dev-6"
+device = "at24csw080"
+description = "FAKE U.2 Sharkfin A VPD"
+capabilities = 0x0
+presence = "Present"
+
+[[simulated_sps.gimlet.components]]
+id = "dev-7"
+device = "max5970"
+description = "FAKE U.2 Sharkfin A hot swap controller"
+capabilities = 0x2
+presence = "Present"
+sensors = [
+    { name = "V12_U2A_A0", kind = "Current", last_data.value = 0.41893438, last_data.timestamp = 1234 },
+    { name = "V3P3_U2A_A0", kind = "Current", last_data.value = 0.025614603, last_data.timestamp = 1234 },
+    { name = "V12_U2A_A0", kind = "Voltage", last_data.value = 12.02914, last_data.timestamp = 1234 },
+    { name = "V3P3_U2A_A0", kind = "Voltage", last_data.value = 3.2618, last_data.timestamp = 1234 },
+]
+
+[[simulated_sps.gimlet.components]]
+id = "dev-8"
+device = "nvme_bmc"
+description = "FAKE U.2 A NVMe Basic Management Command"
+capabilities = 0x2
+presence = "Present"
+sensors = [
+    { name = "U2_N0", kind = "Temperature", last_data.value = 56.0, last_data.timestamp = 1234 },
+]
+[[simulated_sps.gimlet.components]]
+id = "dev-39"
+device = "tmp451"
+description = "FAKE T6 temperature sensor"
+capabilities = 0x2
+presence = "Present"
+sensors = [
+    { name = "t6", kind = "Temperature", last_data.value = 70.625, last_data.timestamp = 1234 },
+]
+[[simulated_sps.gimlet.components]]
+id = "dev-53"
+device = "max31790"
+description = "FAKE Fan controller"
+capabilities = 0x2
+presence = "Present"
+sensors = [
+    { name = "Southeast", kind = "Speed", last_data.value = 2510.0, last_data.timestamp = 1234 },
+    { name = "Northeast", kind = "Speed", last_data.value = 2390.0, last_data.timestamp = 1234 },
+    { name = "South", kind = "Speed", last_data.value = 2467.0, last_data.timestamp = 1234 },
+    { name = "North", kind = "Speed", last_data.value = 2195.0, last_data.timestamp = 1234 },
+    { name = "Southwest", kind = "Speed", last_data.value = 2680.0, last_data.timestamp = 1234 },
+    { name = "Northwest", kind = "Speed", last_data.value = 2212.0, last_data.timestamp = 1234 },
+]
+
+
 #
 # NOTE: for the test suite, the [log] section is ignored; sp-sim logs are rolled
 # into the gateway logfile.
