[support bundle] Refactor into tasks #9253
base: main
Conversation
```rust
type CollectionStepFn = Box<
    dyn for<'b> FnOnce(
```
This signature looks a little nasty, but it's basically just saying:
- Every "step" acts on a BundleCollection,
- ... has an output directory
- ... and can emit a CollectionStepOutput object (which either updates the report, or makes more steps)
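In code, that description might translate to something like the following hedged sketch; the `BoxFuture` return type, the `Send` bound, and both stand-in types are reconstructions from this thread, not the PR's verbatim definitions:

```rust
use std::path::Path;

use futures::future::BoxFuture;

// Stand-in so the sketch is self-contained; the real type lives in the
// support bundle collector.
struct BundleCollection;

// "Either updates the report, or makes more steps":
enum CollectionStepOutput {
    // A finished piece of the collection report (payload elided).
    Report,
    // Follow-up steps to feed back into the shared task set.
    Spawn(Vec<(&'static str, CollectionStepFn)>),
}

// Every step borrows the collection and an output directory, and returns
// a boxed future that yields a CollectionStepOutput.
type CollectionStepFn = Box<
    dyn for<'b> FnOnce(
            &'b BundleCollection,
            &'b Path,
        ) -> BoxFuture<'b, anyhow::Result<CollectionStepOutput>>
        + Send,
>;
```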
perhaps this comment belongs in the code rather than just on the PR!
```rust
let mut tasks =
    ParallelTaskSet::new_with_parallelism(MAX_CONCURRENT_STEPS);
```
Previously, we had some ParallelTaskSets embedded within the collection of sub-pieces of the bundle.
This new "step-based" infrastructure shares that more broadly: everything should be using this one ParallelTaskSet. That's good because it prevents a task from spawning a bunch of task sets that themselves spawn more task sets; everything is just a unit of work that can get added to this one set.
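As a minimal sketch of the pattern (assuming the ParallelTaskSet semantics visible in this diff, where `spawn` waits for capacity and may hand back a previously completed task's output):

```rust
// Hypothetical miniature: one shared set bounds all concurrent work.
const MAX_CONCURRENT_STEPS: usize = 8;

async fn run_units(units: Vec<u64>) -> Vec<u64> {
    let mut set = ParallelTaskSet::new_with_parallelism(MAX_CONCURRENT_STEPS);
    let mut results = Vec::new();
    for unit in units {
        // Once the set is at capacity, `spawn` yields the output of a
        // previously finished task while queuing the new one.
        if let Some(prev) = set.spawn(async move { unit * 2 }).await {
            results.push(prev);
        }
    }
    // Drain whatever is still in flight.
    while let Some(prev) = set.join_next().await {
        results.push(prev);
    }
    results
}
```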
```rust
let ereport_collection = if let Some(ref ereport_filters) =
    self.request.ereport_query

async fn collect_host_ereports(
```
The prior mechanism of collecting ereports from {host, sp} did a bit of manual task management, and stored atomics within the bundle itself.
I've just made each part of the ereport collection a distinct step: no more atomics, no more manual tokio tasks; each one is just "one step" with output.
Yup, that makes sense -- I had just wanted to make sure the (theoretically, totally separate) operations for host OS and SP ereports could execute in parallel. Since that's now a first-class aspect of the overall design, all that nonsense can be removed!
```rust
async fn get_or_initialize_mgs_client<'a>(
    &self,
    mgs_client: &'a OnceCell<Arc<Option<MgsClient>>>,
) -> &'a Arc<Option<MgsClient>> {
    mgs_client
        .get_or_init(|| async {
            Arc::new(self.create_mgs_client().await.ok())
        })
        .await
}
```
```rust
const MAX_CONCURRENT_SLED_REQUESTS: usize = 16;
const FAILURE_MESSAGE: &str =
    "Failed to fully collect support bundle info from sled";
let mut set = ParallelTaskSet::new_with_parallelism(
    MAX_CONCURRENT_SLED_REQUESTS,

async fn get_or_initialize_all_sleds<'a>(
    &self,
    all_sleds: &'a OnceCell<Arc<Option<Vec<Sled>>>>,
) -> &'a Arc<Option<Vec<Sled>>> {
    all_sleds
        .get_or_init(|| async {
            Arc::new(
                self.datastore
                    .sled_list_all_batched(
                        &self.opctx,
                        SledFilter::InService,
                    )
                    .await
                    .ok(),
            )
        })
        .await
}
```
I'm using lazily-initialized variables for "data/clients that might get created as a part of bundle collection, but might not".
This becomes more relevant with #9254, when we may or may not need these values to get initialized at all.
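A self-contained sketch of the lazy-init pattern (tokio's OnceCell; the setup function is illustrative): the first caller pays the initialization cost, later callers get the cached Arc, and a failed init is cached as None rather than retried:

```rust
use std::sync::Arc;
use tokio::sync::OnceCell;

// Illustrative stand-in for an expensive, fallible setup, like building
// an MGS client or listing in-service sleds.
async fn expensive_setup() -> Result<String, ()> {
    Ok("client".to_string())
}

async fn get_or_initialize<'a>(
    cell: &'a OnceCell<Arc<Option<String>>>,
) -> &'a Arc<Option<String>> {
    cell.get_or_init(|| async {
        // As in the PR, a failure is cached as None, not retried.
        Arc::new(expensive_setup().await.ok())
    })
    .await
}

#[tokio::main]
async fn main() {
    let cell = OnceCell::new();
    // Initialization runs at most once; both calls see the same Arc.
    let a = get_or_initialize(&cell).await;
    assert!(Arc::ptr_eq(a, get_or_initialize(&cell).await));
}
```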
```rust
let mut extra_steps: Vec<(&'static str, CollectionStepFn)> = vec![];
for sp in get_available_sps(&mgs_client).await? {
    extra_steps.push((
```
Rather than collecting all SP reports via new tokio tasks, this creates new "steps" that get kicked back to the top-level ParallelTaskSet I mentioned earlier.
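Sketching the shape of one such pushed step, reusing the hypothetical types from the earlier sketch (a plain fn is shown because it coerces cleanly to the boxed higher-ranked closure type; the real code presumably captures each SP's identity in a closure):

```rust
// One follow-up step: dump a single service processor's state.
fn save_one_sp_dump<'b>(
    _collection: &'b BundleCollection,
    _dir: &'b Path,
) -> BoxFuture<'b, anyhow::Result<CollectionStepOutput>> {
    Box::pin(async move {
        // ... save_sp_dumps(...) for a specific SP would go here ...
        Ok(CollectionStepOutput::Report)
    })
}

fn plan_sp_dump_steps() -> CollectionStepOutput {
    let mut extra_steps: Vec<(&'static str, CollectionStepFn)> = vec![];
    // The real loop iterates get_available_sps(...); one representative
    // step is pushed here.
    extra_steps.push(("save SP dump", Box::new(save_one_sp_dump)));
    // Returning Spawn(...) hands the steps back to the shared task set.
    CollectionStepOutput::Spawn(extra_steps)
}
```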
Overall, I like this design. Most of my notes were pretty small, but I did wonder if we might want to use a channel for the queue of new tasks to be spawned.
```rust
loop {
    // Process all the currently-planned steps
    while let Some((step_name, step)) = steps.pop() {
        let previous_result = tasks.spawn({
            let collection = self.clone();
            let dir = output.path().to_path_buf();
            async move {
                debug!(collection.log, "Running step"; "name" => &step_name);
                step(&collection, dir.as_path()).await.inspect_err(|err| {
                    warn!(
                        collection.log,
                        "Step failed";
                        "name" => &step_name,
                        InlineErrorChain::new(err.as_ref()),
                    );
                })
            }
        }).await;

        if let Some(Ok(output)) = previous_result {
            output.process(&mut report, &mut steps);
        };
    }

    // If we've run out of tasks to spawn, join all the existing steps.
    while let Some(previous_result) = tasks.join_next().await {
        if let Ok(output) = previous_result {
            output.process(&mut report, &mut steps);
        };
    }

    // Executing steps may create additional steps, as follow-up work.
    //
    // Only finish if we've exhausted all possible steps and joined all
    // spawned work.
    if steps.is_empty() {
        return report;
    }
}
```
I kind of wonder if we might want to consider making steps an MPSC channel that each step has the Sender side cloned into, instead of having the output of the step future return a list of additional steps. That way, new tasks can be spawned as soon as there's space on the task set, instead of having to wait for the second join_next() loop to drain all the currently running tasks before spawning any new steps. With the current design, there's some additional latency in that second loop that could be avoided this way.
Also, we end up with a potentially big vec sitting around in memory, which could be avoided if we spawned the tasks as soon as there was room for them.
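A self-contained sketch of that idea, using tokio's JoinSet and an unbounded channel in place of omicron's ParallelTaskSet (all names and the bookkeeping are illustrative): follow-up work is sent into the channel the moment it is known, so the dispatcher can spawn it as soon as there is room:

```rust
use tokio::sync::mpsc;
use tokio::task::JoinSet;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::unbounded_channel::<u32>();
    let mut tasks: JoinSet<usize> = JoinSet::new();

    // Seed one "step"; `queued` counts items known to be in the channel.
    let mut queued = 1usize;
    tx.send(3).unwrap();

    while queued > 0 || !tasks.is_empty() {
        tokio::select! {
            // Spawn queued work immediately, without waiting for every
            // running task to finish first.
            Some(depth) = rx.recv(), if queued > 0 => {
                queued -= 1;
                let tx = tx.clone();
                // Each task reports how many follow-up steps it enqueued.
                tasks.spawn(async move {
                    if depth > 0 {
                        tx.send(depth - 1).unwrap();
                        1
                    } else {
                        0
                    }
                });
            }
            Some(res) = tasks.join_next() => {
                queued += res.unwrap();
            }
        }
    }
    println!("all steps finished");
}
```

A real version would cap parallelism (as ParallelTaskSet does) and send (&'static str, CollectionStepFn) items rather than integers; the bookkeeping above is one way to know when both the queue and the running set are drained.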
```rust
//
// NOTE: The background task infrastructure will periodically check to see
// if the bundle has been cancelled by a user while it is being collected.
// If that happens, this function will be CANCELLED at an await point.
//
// As a result, it is important that this function be implemented as
// cancel-safe.
```
We now get this property from the fact that anything spawned on the ParallelTaskSet is automatically aborted when run_collect_bundle_steps is cancelled, FWIW. It might be worth saying that explicitly someplace: any code in a step will be forcefully aborted when that future is cancelled, provided that it only spawns other tasks on the ParallelTaskSet rather than by using "normal spawn". We should maybe explicitly note that so that people don't put random tokio::spawns in a step...
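If it helps, the note could live as a comment next to the step machinery; something along these lines (wording is only a suggestion):

```rust
// CANCEL SAFETY: this future may be cancelled at any await point if the
// user deletes the bundle mid-collection. Work spawned on the shared
// `ParallelTaskSet` is aborted automatically when this future is dropped.
// Work spawned with a bare `tokio::spawn` is NOT, and would outlive
// cancellation. Steps must therefore schedule follow-up work as new
// steps, never via `tokio::spawn`.
```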
```rust
) -> anyhow::Result<CollectionStepOutput> {
    save_sp_dumps(mgs_client, sp, dir)
        .await
        .with_context(|| format!("SP {} {}", sp.type_, sp.slot))?;
```
does this error get formatted in a way that makes it clear it was dump collection related?
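For reference, anyhow's alternate Display joins context layers with ": ", so the rendered message would carry the "SP {type} {slot}" prefix but nothing that says "dump". A self-contained illustration (not the PR's actual log output):

```rust
use anyhow::Context;

fn main() {
    // Mimics the context format above: SP identity, no mention of dumps.
    let err = std::fs::read("/nonexistent/sp-dump")
        .context("SP sled 7")
        .unwrap_err();
    // Prints roughly: "SP sled 7: No such file or directory (os error 2)"
    println!("{:#}", err);
}
```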
No concerns beyond what Eliza has already noted. A great cleanup, thanks!
```rust
HostEreports(SupportBundleEreportStatus),
SpEreports(SupportBundleEreportStatus),
SavingSpDumps { listed_sps: bool },
// NOTE: The ditinction between this and "Spawn" is pretty artificial -
```
Nit: typo
```diff
- // NOTE: The ditinction between this and "Spawn" is pretty artificial -
+ // NOTE: The distinction between this and "Spawn" is pretty artificial -
```
Re-structures support bundle collection into tasks.
This centralizes step dispatching and makes it slightly clearer that tasks are independent (where they can be).