
Add a4x2-package/a4x2-deploy xtasks #7424

Open · wants to merge 1 commit into main

Conversation

@faithanalog (Contributor) commented Jan 29, 2025

overview

This PR adds two new xtasks:

  • a4x2-package
  • a4x2-deploy

a4x2-package

```
> cargo xtask a4x2-package -h
Generate a tarball with omicron packaged for deployment onto a4x2

Usage: cargo xtask a4x2-package [OPTIONS]

Options:
      --live-tests        Bundle omicron live tests into the output tarball
      --end-to-end-tests  Bundle omicron end-to-end-tests package into the output tarball
  -h, --help              Print help
```

This command will:

  • download & build the latest a4x2 from oxidecomputer/testbed
  • clone the local working tree to a temporary path & build + package it with a4x2
  • if requested, build live-tests
  • if requested, build end-to-end tests (but there is no way to run these yet)
  • package all of these into a tar bundle at `out/a4x2-package-out.tgz`

a4x2-deploy

```
> cargo xtask a4x2-deploy -h
Run a4x2 and deploy omicron onto it, and optionally run live-tests and end to end tests

Usage: cargo xtask a4x2-deploy <COMMAND>

Commands:
  start      Start a4x2, deploy a control plane to it, and then leave it running
  stop       Stop a4x2 that was previously launched with the `start` subcommand
  run-tests  Start a4x2, run tests, and then stop a4x2 unless you request otherwise
  help       Print this message or the help of the given subcommand(s)

Options:
  -h, --help  Print help
```

This command will:

  • tear down any existing a4x2 created by a previous invocation of a4x2-deploy
  • unpack a bundle generated by a4x2-package
  • download latest OVMF code & propolis server
  • launch a4x2
  • curl the control plane API until it becomes alive (or a long timeout is reached, in which case it considers the launch failed)
  • if requested, transfer live tests to the system and execute them
  • print information on how to access the server

By default, a4x2-deploy will use a bundle from out/a4x2-package.tgz, but an alternative can be specified with --package <path>, so you can build a package and send it to someone else for them to run on their machine.

Known shortcomings

  • a4x2 is still a bit flaky. Sometimes I will reboot my machine for good luck.
  • a4x2-deploy could and should automatically collect logs from the system when launches fail, but does not yet.
  • no support for incremental builds of omicron. a4x2 makes some changes to TOML files in the current working directory, which is why we clone the source directory to another path before running the a4x2 package step.

Additionally, because the source is cloned to a temporary working path, any changes you have not `git add`ed will not be seen by a4x2-package. I consider this either mildly useful or mildly confusing depending on what you're doing in the moment. The jj crew can of course just run `jj` before using a4x2-package.

We could do a `cp -a` of the working directory instead of a git clone, but that could be problematic because it would copy potentially massive `target/` directories. We could use `rsync` with excludes, but I'm not sure we can rely on it always being present. The `git clone` seemed like the most reliable option to me, and it has precedent in the `releng` xtask. But I do find this somewhat annoying when iterating on changes and am open to alternatives. Maybe we should just try to make a4x2 work in the current workdir without causing trouble.

Aside on xshell

This PR adds a dependency on xshell, which I use heavily for running system commands. I figured I would take the opportunity to talk about my experience with it, and provide some context that may be helpful when reviewing this PR. I've really enjoyed using it and think it makes it quite pleasant to use Rust as a bash replacement.

A few things to keep in mind when reading the code. You can learn these from the xshell docs, but I will also highlight them here:

All commands are echoed by default

Similar to set -x in bash, commands are echoed as they are run. This is great for logs. Additionally, non-zero exit codes become Errs, and command stdout/stderr are sent to the terminal by default, unless you make a point of capturing them.

Command invocations are not echoed when you use .read() to capture stdout

cmd!() is not sending your command into a shell

cmd!() builds a normal Rust Command! If you see something like

```
cmd!(sh, "some command here {with_args}")
```

You may ask, "will that escape with_args?" The answer is that there is no shell invocation, so there is nothing to escape! It is roughly as if you had written

```
Command::new("some").arg("command").arg("here").arg(&with_args)
```

xshell is mostly convenience around Command, with a few other file access shorthands thrown in.
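Here is a tiny self-contained sketch of those behaviors; nothing in it comes from the PR, and the commands are arbitrary:

```
use xshell::{cmd, Shell};

fn main() -> anyhow::Result<()> {
    let sh = Shell::new()?;
    let branch = "main";

    // Echoed like `set -x`; stdout/stderr go to the terminal, and a
    // non-zero exit status becomes an Err.
    cmd!(sh, "git status --short").run()?;

    // `{branch}` is passed as a single argument; no shell is involved,
    // so there is nothing to quote or escape. `.read()` captures stdout
    // and does not echo the command.
    let head = cmd!(sh, "git rev-parse {branch}").read()?;
    println!("HEAD of {branch} is {head}");
    Ok(())
}
```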

sh.push_dir()

You will see this pattern a lot:

```
let _popdir = sh.push_dir(some_dir);
```

sh stores its own internal working directory (separate from the process workdir, by the way). All commands and operations run through sh are relative to that directory. push_dir changes it and returns a guard; when the guard is dropped, the directory is popped.
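For illustration only (the directory name is made up), the guard's scope looks like this:

```
use xshell::{cmd, Shell};

fn build_in_subdir(sh: &Shell) -> anyhow::Result<()> {
    {
        // Everything run through `sh` inside this block happens in ./testbed.
        let _popdir = sh.push_dir("testbed");
        cmd!(sh, "cargo build --release").run()?;
    } // guard dropped here; sh's working directory is restored

    // Back in the original directory.
    cmd!(sh, "ls").run()?;
    Ok(())
}
```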

You may be wary if you have been bitten by code like this before:

```
// guard is immediately dropped, because that's what `let _ = ` does! Oops!
let _ = thing_that_returns_guard();
```

In the next version of xshell, it looks like push_dir is going away entirely, and will be replaced with approximately this:

```
sh.with_dir(some_dir, || {
    // ...
})
```

I'll deal with that when it happens. I'm sure the lambdas will make some things more annoying, but I think it'll be better that way.

@faithanalog (Contributor, Author):

@davepacheco tagged for review since we've been talking about this for a while. would also tag ry, but he is out. not sure who else might be interested in reviewing.

@davepacheco (Collaborator):

Sorry for the delay in looking at this! I'm just starting to look now.

Additionally, because of cloning the source to a temporary working path, any changes you have not git added will not be seen by a4x2-package. I consider this either mildly useful or mildly confusing depending on what you're doing in the moment. The jj crew can of course just run jj before using a4x2-package.
...
Maybe we should just try to make a4x2 work in the current workdir without causing trouble.

I think you've made the right call here in using Git. I also think making a4x2 work in the current work tree is a great goal, but shouldn't block this work. In the meantime, I'd suggest that the tool fail by default with a clear message if the working tree is dirty, but allow that to be overridden.

@davepacheco (Collaborator) left a comment:

This is looking great! The description in the PR was very helpful in getting oriented -- thanks for that.

I haven't yet looked at a4x2-deploy.

@@ -31,8 +31,13 @@ clap.workspace = true
fs-err.workspace = true
macaddr.workspace = true
serde.workspace = true
serde_json.workspace = true
Collaborator:

There was interest in keeping xtask as tiny as possible and using the "external" pattern (which is also pretty easy) to have it invoke other commands instead.

I don't have an opinion about whether this should be "external" or not.

Contributor:

I think this is ok, most of these look like deps that already exist. walkdir and xshell are the new ones, I think, and both of them look quite minimal.

Comment on lines +20 to +26
/// Bundle omicron live tests into the output tarball
#[clap(long)]
live_tests: bool,

/// Bundle omicron end-to-end-tests package into the output tarball
#[clap(long)]
end_to_end_tests: bool,
Collaborator:

Do you have a sense of how much more time these add? I'd be inclined to make them on by default.


#[derive(Deserialize, Debug)]
struct NextestConfigVersion {
// required: String,
Collaborator:

Remove?

Comment on lines +41 to +49
struct Environment {
cargo: String,
home_dir: Utf8PathBuf,
work_dir: Utf8PathBuf,
src_dir: Utf8PathBuf,
out_dir: Utf8PathBuf,
omicron_dir: Utf8PathBuf,
in_ci: bool,
}
Collaborator:

Some doc comments here would be a big help.

Comment on lines +87 to +89
// Buildomat sets the CI environment variable. We'll work a bit
// differently in that case (no cleanup, use /work)
let in_ci = env::var("CI").is_ok();
Collaborator:

Flags like this make me worried about diverging behavior between CI and local operation. How about allowing the work directory to be specified on the command line and having CI specify it? I'm not that worried about skipping cleanup in CI, but if we need that we could have a --skip-cleanup arg, too.

Contributor:

Yeah, I would prefer that. (Concerns like this are why I like having configuration profiles the way Cargo or nextest do, but that is probably overkill here.)

Collaborator:

I agree, the program we run in CI that invokes this should just use flags to get the behaviours. I think folks ought, as much as possible, to be able to just type stuff they see in the CI job programs into their shell and have it work the same way.

Contributor:

Having read through the rest of this PR, I'm coming around to believing that maybe configuration profiles wouldn't be the worst after all.

// Get a ref we can checkout from the local working tree
let treeish = {
let _popdir = sh.push_dir(&env.src_dir);
let mut treeish = cmd!(sh, "git stash create").read()?;
Collaborator:

This is where I'd be inclined to:

  • check if the working tree is dirty
  • if so: bail unless some --allow-dirty is specified
  • regardless, use HEAD
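A minimal sketch of that check, assuming an xshell `Shell` and a hypothetical `--allow-dirty` flag (neither the helper nor the flag exists in this PR):

```
use anyhow::bail;
use xshell::{cmd, Shell};

fn ensure_clean_tree(sh: &Shell, allow_dirty: bool) -> anyhow::Result<()> {
    // `git status --porcelain` prints one line per modified or untracked file.
    let status = cmd!(sh, "git status --porcelain").read()?;
    if !status.trim().is_empty() && !allow_dirty {
        bail!(
            "working tree is dirty; commit your changes or pass --allow-dirty \
             to package HEAD without them"
        );
    }
    Ok(())
}
```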

Contributor:

Also:

  • Use $GIT before git.
  • Altering stash-related state is a visible operation and I don't think we should do it -- instead, there are ways to create a "background" commit from the current state.
  • What happens if this is run in a non-git repo, e.g. a non-colocated Jujutsu repo?

Another option that has somewhat less source control interaction is:

  • get the list of all tracked files in the repo (here's instructions on how)
  • create hardlinks for all of them in a mirrored structure. (Hardlinks are fine because basically all editors write files atomically, which breaks them.)
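A rough sketch of that hardlink-mirror approach, assuming the tracked-file list comes from `git ls-files -z` (the helper name and signature are made up):

```
use std::fs;

use camino::{Utf8Path, Utf8PathBuf};
use xshell::{cmd, Shell};

/// Mirror every tracked file from `src` into `dest` using hardlinks.
fn mirror_tracked_files(sh: &Shell, src: &Utf8Path, dest: &Utf8Path) -> anyhow::Result<()> {
    let _popdir = sh.push_dir(src);
    // NUL-separated so filenames containing spaces survive intact.
    let listing = cmd!(sh, "git ls-files -z").read()?;
    for rel in listing.split('\0').filter(|s| !s.is_empty()) {
        let to: Utf8PathBuf = dest.join(rel);
        if let Some(parent) = to.parent() {
            fs::create_dir_all(parent)?;
        }
        fs::hard_link(src.join(rel), &to)?;
    }
    Ok(())
}
```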

cmd!(sh, "git fetch origin {treeish}").run()?;
cmd!(sh, "git checkout {treeish}").run()?;

// XXX is this needed in CI?
Collaborator:

Suggested change
// XXX is this needed in CI?

Comment on lines +158 to +160
// I'm not confident this does what it should, because I don't know much
// about the end to end tests. This is just going from what the shell script
// did.
Collaborator:

@iliana maybe could answer this?

Comment on lines +183 to +186
// XXX right now we only download a new nextest in CI, and we use
// the user's locally installed nextest otherwise. I'm inclined to
// say we should always download a fresh one for consistency.
// thoughts?
Collaborator:

Suggested change
// XXX right now we only download a new nextest in CI, and we use
// the user's locally installed nextest otherwise. I'm inclined to
// say we should always download a fresh one for consistency.
// thoughts?

I'd be inclined to have this tool assume that the user has already set up nextest however they want. That's what the rest of Omicron currently does, I believe. If CI needs to download it, that's fine, but it should do it as a separate step before invoking this tool (which I think is what it does for the existing CI jobs).

Contributor:

Agreed -- CI already downloads nextest. I'd just package up whatever nextest is available in the PATH.

Comment on lines +269 to +270
// XXX use main branch
// cmd!(sh, "git -C {testbed_dir} checkout artemis/init-in-smf").run()?;
Collaborator:

Could make this a command-line argument. It could accept a branch name or alternatively a local path to a clone if you want to do anything other than "main".

pub fn run_cmd(args: A4x2DeployArgs) -> Result<()> {
let sh = Shell::new()?;

// TODO: do we want to have a configurable falcon dataset?
Collaborator:

Yes, definitely. I can't run a4x2 on my system without overriding the default here.

It'd be nice to override a few other things, like THREE_NODE_RSS. Not a blocker though for v1.

Collaborator:

Agreed. This should be usable on shared systems like gimlozar, where more than one engineer needs to be able to do this in parallel without clobbering one another.

// Delete any results from previous runs. We don't mind if
// there's errors. This needs to run as root in case there are
// artifacts owned by root left around from the deploy.
cmd!(sh, "pfexec rm -rf").arg(&env.work_dir).run()?;
Collaborator:

This is (obviously) incredibly dangerous. I wonder how we can make it less so.

Contributor:

Yeah, agreed -- if you could set work_dir to something like /etc (e.g. via the CLI option discussed above) you could nuke the whole system.

How many files is this typically, and how many are owned by root? If it's not too many, I'm wondering if we could do a first pass that iterates over the file system to find files owned by root, and then lists out those files. Then the user can be prompted to ask if it's okay to use pfexec (and there can be a --force option to skip prompting.)
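A sketch of that first pass, using the walkdir dependency this PR already adds (the function name is hypothetical):

```
use std::os::unix::fs::MetadataExt;

use camino::{Utf8Path, Utf8PathBuf};
use walkdir::WalkDir;

/// Collect paths under `work_dir` that are owned by root, so they can be
/// listed for the user (or prompted about) before escalating to `pfexec rm -rf`.
fn root_owned_paths(work_dir: &Utf8Path) -> anyhow::Result<Vec<Utf8PathBuf>> {
    let mut paths = Vec::new();
    for entry in WalkDir::new(work_dir) {
        let entry = entry?;
        if entry.metadata()?.uid() == 0 {
            paths.push(Utf8PathBuf::try_from(entry.path().to_path_buf())?);
        }
    }
    Ok(paths)
}
```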

Collaborator:

What goes in here? Why do you need pfexec to remove it?

Comment on lines +226 to +227
// Generate an ssh key we will use to log into the sleds.
cmd!(sh, "ssh-keygen -t ed25519 -N '' -f a4x2-ssh-key").run()?;
Collaborator:

Ooh, (I think) I like using a one-off key here. But it also needs to be easy for folks to ssh in after the fact. Should we give folks a command to load it into their agent?

Contributor:

Or provide an ssh command to copy-paste.

Collaborator:

I think it makes sense to inject a key that you can reliably use locally from the automation here, but I would want to continue using my yubikey-backed SSH keys to actually log in -- we should be willing to import keys from some other place, such as ~/.ssh/authorized_keys, or I guess potentially from the output of ssh-add -L, etc.

.find(|v| v["dynamic"] == Value::Bool(true))
.ok_or(anyhow!("failed to find customer edge addr"))?["local"]
.as_str()
.unwrap();
Collaborator:

May as well replace these unwrap()s with error messages?
Would it make sense to define a Rust struct that derives Deserialize, so that if any of the assumptions in ce_addr_json[0]["addr_info"].as_array().unwrap() aren't true, we fail gracefully? (The value might not be an array; it might not have an addr_info property; addr_info might not be an array; etc.)
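For example, something along these lines, guessing at the JSON shape from the `addr_info`/`dynamic`/`local` lookups in the code (the field names here are an assumption, not taken from the PR):

```
use serde::Deserialize;

#[derive(Deserialize, Debug)]
struct Interface {
    addr_info: Vec<AddrInfo>,
}

#[derive(Deserialize, Debug)]
struct AddrInfo {
    #[serde(default)]
    dynamic: bool,
    local: Option<String>,
}

fn customer_edge_addr(json: &str) -> anyhow::Result<String> {
    let interfaces: Vec<Interface> = serde_json::from_str(json)?;
    interfaces
        .iter()
        .flat_map(|i| &i.addr_info)
        .find(|a| a.dynamic)
        .and_then(|a| a.local.clone())
        .ok_or_else(|| anyhow::anyhow!("failed to find customer edge addr"))
}
```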

Comment on lines +271 to +274
// XXX Im told that pinging the gateway from inside the sleds is no longer
// necessary so I'm leaving it out for now. We'll see if that's true.
// So far it seems to be.

Collaborator:

Suggested change
// XXX Im told that pinging the gateway from inside the sleds is no longer
// necessary so I'm leaving it out for now. We'll see if that's true.
// So far it seems to be.

// necessary so I'm leaving it out for now. We'll see if that's true.
// So far it seems to be.

cmd!(sh, "pfexec route add 198.51.100.0/24 {customer_edge_addr}").run()?;
Collaborator:

I'm surprised you have to do this because I don't remember doing it. Where does this happen during a manual a4x2 launch?

Contributor:

(and more generally, could there be a reference to documentation for this function so we can look at things step-by-step?)

Collaborator:

Presumably you only need to do this if you're actually trying to directly communicate with that subnet from outside; i.e., from processes you run on the host alongside this.

Comment on lines +288 to +310
// The important thing here is to do an HTTP request with timeout to the
// control plane API endpoint. If the server is up, we'll get the page you'd
// expect for an unauthenticated user. If not, then the request will either
// - fail
// - stall
//
// Stalling is why we do a timeout, and we need to handle timeout both at
// the TCP level and the HTTP level. `curl -m 5` will hard-out after 5
// seconds no matter what's going on.
//
// That 5 seconds is arbitrary, but it's been working well over in
// rackletteadm, from which this logic is copied.
//
// We could replace curl with rust code, but make sure we do the same thing!
while retries > 0 && cmd!(sh, "curl -s -m 5 {api_url}").run().is_err() {
retries -= 1;
thread::sleep(time::Duration::from_secs(25));

if retries % 5 == 0 {
// I see no reason to error while printing the date
let _ = cmd!(sh, "date").run();
}
}
Collaborator:

I wonder if you can use the same code that the end-to-end tests use. I seem to recall them doing the same thing in Rust.

Contributor:

If we do that, then it's worth factoring that out so we don't have to build e.g. our Diesel code to build this binary. (And if we do that, we should definitely put this into an external binary.)

@sunshowers (Contributor) left a comment:

Thank you for doing this! Really excited about automating all of this.

(I'm still working on the review -- here's what I have for now.)

swrite.workspace = true
tabled.workspace = true
textwrap.workspace = true
toml.workspace = true
usdt.workspace = true
walkdir.workspace = true
xshell.workspace = true
sha2.workspace = true
Contributor:

Could you put this in sorted order? Thanks!



Comment on lines +123 to +125
let result = teardown_a4x2(&sh, &env);
eprintln!("teardown result: {:?}", result);
eprintln!("continuing regardless of whether there were errors");
Contributor:

Could this print out the ok and error outputs nicely? Then the error can be printed conditionally.


Comment on lines +380 to +381
cmd!(sh, "pfexec route delete 198.51.100.0/24 {gateway}")
.run()?;
Contributor:

Worth printing out the fact that we're doing this.

In general I really like the approach where we first gather all operations into a structure, and then perform them in a separate pass. That enables prompting for confirmation, and things like --dry-run. Not sure if you want to put the effort into doing that, but if you do I'd really appreciate it :)
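Not something the PR does today, but a sketch of the gather-then-execute shape (the names and variants here are invented for illustration):

```
use xshell::{cmd, Shell};

/// One entry per destructive operation we plan to perform during teardown.
enum TeardownStep {
    DeleteRoute { subnet: String, gateway: String },
}

fn execute(sh: &Shell, steps: &[TeardownStep], dry_run: bool) -> anyhow::Result<()> {
    for step in steps {
        match step {
            TeardownStep::DeleteRoute { subnet, gateway } => {
                // Always announce the operation; only perform it when not a dry run.
                println!("teardown: route delete {subnet} {gateway}");
                if !dry_run {
                    cmd!(sh, "pfexec route -n delete {subnet} {gateway}").run()?;
                }
            }
        }
    }
    Ok(())
}
```

Building the `Vec<TeardownStep>` first also makes a `--dry-run` flag and a confirmation prompt nearly free.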

Comment on lines +387 to +389
if !had_gateway {
break;
}
Contributor:

Is had_gateway not being present an exceptional condition? Worth adding a comment in any case.

Comment on lines +409 to +410
// pour one out for enjoyers of clean indentation
println!(
Contributor:

https://docs.rs/indoc if you'd like to use it!

"teardown_a4x2: could not get gateway for a4x2 route from line {ln}"
))?;

cmd!(sh, "pfexec route delete 198.51.100.0/24 {gateway}")
Contributor:

It looks like some of these IPs and subnets are fixed and well-known -- could you make them constants at the top of the file?

Comment on lines +456 to +457
let ipv4 = ln.split_whitespace().nth(3).ok_or(anyhow!("get_host_ip: could not extract IP for node {node} from line {ln}"))?;
let ipv4 = ipv4.strip_suffix("/24").ok_or(anyhow!("get_host_ip: could not extract IP for node {node} from line {ln}"))?;
Contributor:

same comments as in the gateway section above.

@jclulow (Collaborator) left a comment:

This seems like a great idea! I have a hodge podge of thoughts I've left as comments, but I wanted to say I agree this makes things much more accessible to casual users in particular. I also like the concrete separation of build machine and target machine, with an artefact you can pass between them!



Comment on lines +103 to +105
} else {
home_dir.join(".cache/a4x2-deploy")
};
Collaborator:

How big does this cache/work directory get in practice?

// all of them. But, I don't like unbounded loops, so we will do at max 10,
// which is a number that seems unlikely enough to reach, to me.
for _ in 0..10 {
let mut route_cmd = cmd!(sh, "pfexec route get 198.51.100.0/24");
Collaborator:

Suggested change
let mut route_cmd = cmd!(sh, "pfexec route get 198.51.100.0/24");
let mut route_cmd = cmd!(sh, "route -n get 198.51.100.0/24");

You don't need elevated privileges to route get, and we should use the -n flag to avoid name service lookups.

}

pub fn run_cmd(args: A4x2DeployArgs) -> Result<()> {
let sh = Shell::new()?;
Collaborator:

I think we should probably have a common wrapper around Shell::new(), which should at minimum scrub the environment that these commands are going to run in. In particular, I think we should make sure no LC_* or LANG variables are inherited, and instead specify LANG='C.UTF-8' for all invocations so that we get consistent output from commands and so on.
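One possible shape for that, scrubbing at the `Cmd` level rather than at `Shell::new()` (the helper name is made up, and whether to apply it per command or via a shared constructor is an open question):

```
use std::env;

use xshell::{cmd, Cmd, Shell};

/// Drop inherited locale variables and force a consistent locale, so that
/// command output (e.g. from `route`) parses the same way on every machine.
fn with_clean_locale(c: Cmd<'_>) -> Cmd<'_> {
    let mut c = c;
    for (key, _) in env::vars() {
        if key == "LANG" || key == "LANGUAGE" || key.starts_with("LC_") {
            c = c.env_remove(key);
        }
    }
    c.env("LANG", "C.UTF-8")
}

fn example(sh: &Shell) -> anyhow::Result<String> {
    Ok(with_clean_locale(cmd!(sh, "route -n get 198.51.100.0/24")).read()?)
}
```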

Comment on lines +333 to +346
// If you want any change in functionality for the test runner, update
// run-live-tests over in a4x2_package.rs. Don't add it here!
let switch_zone_script = r#"
set -euxo pipefail
tar xvzf live-tests-bundle.tgz
cd live-tests-bundle
./run-live-tests
"#;

let remote_script =
format!("zlogin oxz_switch bash -c '{switch_zone_script}'");

// Will error if the live tests fail. This is desired.
cmd!(sh, "ssh {ssh_args...} {ssh_host} {remote_script}").run()?;
Collaborator:

I feel like you're getting extremely lucky with the several layers of belligerent escaping that occur when descending through all of xshell, SSH, zlogin, and then bash -c, haha.

I don't have any specific guidance other than it feels like this will fall down if there end up being any quotes or load bearing whitespace required.

"teardown_a4x2: could not get gateway for a4x2 route from line {ln}"
))?;

cmd!(sh, "pfexec route delete 198.51.100.0/24 {gateway}")
Collaborator:

Suggested change
cmd!(sh, "pfexec route delete 198.51.100.0/24 {gateway}")
cmd!(sh, "pfexec route -n delete 198.51.100.0/24 {gateway}")

Comment on lines +367 to +368
// We get an error code when there is no route (and thus, nothing for
// us to delete!)
Collaborator:

I think before ignoring the result here I would make sure the output ends in "not in table"; otherwise it's presumably an error we're not expecting, and we'd like to display it to the user and abort.
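Something like the following, assuming we capture the output rather than streaming it (the exact stderr text from route is an assumption):

```
use anyhow::bail;
use xshell::{cmd, Shell};

fn delete_route(sh: &Shell, gateway: &str) -> anyhow::Result<()> {
    let out = cmd!(sh, "pfexec route -n delete 198.51.100.0/24 {gateway}")
        .ignore_status()
        .output()?;
    if !out.status.success() {
        let stderr = String::from_utf8_lossy(&out.stderr);
        // "not in table" just means there was no route to delete; anything
        // else is unexpected and should be surfaced.
        if !stderr.contains("not in table") {
            bail!("route delete failed: {stderr}");
        }
    }
    Ok(())
}
```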

Comment on lines +171 to +174
// We ignore any error here, because the result we
// actually want to produce is whether tests passed
let _ = teardown_a4x2(&sh, &env);
}
Collaborator:

I don't think we should ignore this error. Either teardown_a4x2() should work, or we need to stop and figure out why it did not.

Comment on lines +120 to +123
// Teardown previous deploy if it exists, before wiping the data
// for it. If this errors, we assume the deploy doesn't exist
// and carry on.
let result = teardown_a4x2(&sh, &env);
Collaborator:

I don't think we should ignore this error. Either teardown_a4x2() should work, or we need to stop and figure out why it did not.

@sunshowers (Contributor) left a comment:

Thanks again -- went through and left a few more comments.

let src_dir = Utf8PathBuf::try_from(env::current_dir()?)?;

// Delete any results from previous runs. We don't mind if there's errors.
fs::remove_dir_all(&work_dir).ok();
Contributor:

Also worth reporting errors here as a warning.

use serde::Deserialize;
use sha2::Digest;
use std::env;
use std::fs;
Contributor:

Consider https://docs.rs/fs-err for better error messages.

Comment on lines +16 to +17
mod a4x2_deploy;
mod a4x2_package;
Contributor:

I suspect we'll want to share some code between the two -- what do you think about putting them in an a4x2 module and having a4x2/deploy.rs and package.rs?


if treeish.is_empty() {
// nothing to stash
eprintln!("Nothing to stash, using most recent commit for clone");
treeish = cmd!(sh, "git rev-parse HEAD").read()?;
Contributor:

Again these should use $GIT.

Comment on lines +330 to +339
if can_dedupe {
// yay, extract out to the common dir.
let dest = common_dir.join(base_path);
sh.create_dir(&dest.parent().unwrap())?;
sh.copy_file(&g0_path, &dest)?;
sh.remove_path(&g0_path)?;
for prefix in &prefixes {
sh.remove_path(&prefix.join(base_path))?;
}
}
Contributor:

Hmm interesting -- what code ensures that sled-common is checked?

Comment on lines +202 to +204
// XXX do we need to test nextest exists & is the right version? need to
// see if the incidental errors from stuff below already take care of
// telling someone outside of CI what they need to be doing.
Contributor:

You could run cargo nextest show-config version -- exit code 92 means that the required version isn't met, and exit code 10 means that the recommended version isn't met.
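For reference, a sketch of acting on those exit codes (the messages are mine, not nextest's):

```
use anyhow::bail;
use xshell::{cmd, Shell};

fn check_nextest(sh: &Shell) -> anyhow::Result<()> {
    let out = cmd!(sh, "cargo nextest show-config version")
        .ignore_status()
        .output()?;
    match out.status.code() {
        Some(0) => Ok(()),
        Some(92) => bail!("cargo-nextest is older than the required version"),
        Some(10) => {
            eprintln!("warning: cargo-nextest is older than the recommended version");
            Ok(())
        }
        other => bail!("could not determine nextest version (exit status {other:?})"),
    }
}
```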

// intentionally relative path so the tarball gets relative paths
let _popdir = sh.push_dir(&env.work_dir);
let bundle_dir_name = live_test_bundle_dir.file_name().unwrap();
cmd!(sh, "tar -czf {out_dir}/live-tests-bundle.tgz {bundle_dir_name}/")
Contributor:

live-tests-bundle.tgz maybe can be a constant.

let pkg_dir_name = env.out_dir.file_name().unwrap();

// Place output in `out/` when running locally, or leave in work dir for CI
let artifact = Utf8PathBuf::from("a4x2-package-out.tgz");
Contributor:

This filename can be a constant too, I think.

let a4x2_dir = testbed_dir.join("a4x2");

// TODO move this clone & build to prepare, so that if the omicron build
// takes awhile we wont have lost our git token when we get here. Relevant
Contributor:

Suggested change
// takes awhile we wont have lost our git token when we get here. Relevant
// takes awhile, we won't have lost our git token when we get here. Relevant
