
Add a4x2-package/a4x2-deploy xtasks #7424

Open · wants to merge 1 commit into main

Conversation

@faithanalog (Contributor) commented Jan 29, 2025

overview

This PR adds two new xtasks:

  • a4x2-package
  • a4x2-deploy

a4x2-package

```
> cargo xtask a4x2-package -h
Generate a tarball with omicron packaged for deployment onto a4x2

Usage: cargo xtask a4x2-package [OPTIONS]

Options:
      --live-tests        Bundle omicron live tests into the output tarball
      --end-to-end-tests  Bundle omicron end-to-end-tests package into the output tarball
  -h, --help              Print help
```

This command will:

  • download & build the latest a4x2 from oxidecomputer/testbed
  • clone the local working tree to a temporary path & build + package it with a4x2
  • if requested, build live-tests
  • if requested, build end-to-end tests (but there is no way to run these yet)
  • package all of these into a tar bundle at `out/a4x2-package-out.tgz`

a4x2-deploy

```
> cargo xtask a4x2-deploy -h
Run a4x2 and deploy omicron onto it, and optionally run live-tests and end to end tests

Usage: cargo xtask a4x2-deploy <COMMAND>

Commands:
  start      Start a4x2, deploy a control plane to it, and then leave it running
  stop       Stop a4x2 that was previously launched with the `start` subcommand
  run-tests  Start a4x2, run tests, and then stop a4x2 unless you request otherwise
  help       Print this message or the help of the given subcommand(s)

Options:
  -h, --help  Print help
```

This command will:

  • tear down any existing a4x2 created by a previous invocation of a4x2-deploy
  • unpack a bundle generated by a4x2-package
  • download latest OVMF code & propolis server
  • launch a4x2
  • curl the control plane API until it becomes alive (or a long timeout is reached, in which case it considers the launch failed)
  • if requested, transfer live tests to the system and execute them
  • print information on how to access the server

By default, a4x2-deploy will use a bundle from out/a4x2-package.tgz, but an alternative can be specified with --package <path>, so you can build a package and send it to someone else for them to run on their machine.

Known shortcomings

  • a4x2 is still a bit flaky. Sometimes I will reboot my machine for good luck.
  • a4x2-deploy could and should automatically collect logs from the system when launches fail, but does not yet.
  • no support for incremental builds of omicron. a4x2 makes some changes to TOML files in the current working directory, which is why we clone the source directory to another path before running the a4x2 package step.

Additionally, because the source is cloned to a temporary working path, any changes you have not `git add`ed will not be seen by a4x2-package. I consider this either mildly useful or mildly confusing depending on what you're doing in the moment. The jj crew can of course just run `jj` before using a4x2-package.

We could do a `cp -a` of the working directory instead of a git clone, but that could be problematic because it would copy potentially massive `target/` directories. We could use `rsync` with excludes, but I'm not sure we can rely on it always being present. The `git clone` seemed like the most reliable option to me, and it has precedent in the `releng` xtask. But I do find this somewhat annoying when iterating on changes and am open to alternatives. Maybe we should just try to make a4x2 work in the current workdir without causing trouble.

Aside on xshell

This PR adds a dependency on xshell, which I use heavily for running system commands. I figured I would take the opportunity to talk about my experience with it, and provide some context that may be helpful when reviewing this PR. I've really enjoyed using it and think it makes it quite pleasant to use Rust as a bash replacement.

A few things to keep in mind when reading the code. You can learn these from the xshell docs, but I will also highlight them here:

All commands are echoed by default

Similar to set -x in bash, commands are echoed as they are run. This is great for logs. Additionally, non-zero exit codes become Errs, and command stdout/stderr are sent to the terminal by default, unless you make a point of capturing them.

Command invocations are not echoed when you use .read() to capture stdout

cmd!() is not sending your command into a shell

cmd!() builds a normal Rust Command! If you see something like

```
cmd!(sh, "some command here {with_args}")
```

You may ask, "will that escape with_args?" The answer is that there is no shell invocation, so there is nothing to escape! It is roughly as if you had written

```
Command::new("some").arg("command").arg("here").arg(&with_args)
```

xshell is mostly convenience around Command, with a few other file access shorthands thrown in.
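Here is a tiny self-contained sketch of those behaviors; nothing in it comes from the PR, and the commands are arbitrary:

```
use xshell::{cmd, Shell};

fn main() -> anyhow::Result<()> {
    let sh = Shell::new()?;
    let branch = "main";

    // Echoed like `set -x`; stdout/stderr go to the terminal, and a
    // non-zero exit status becomes an Err.
    cmd!(sh, "git status --short").run()?;

    // `{branch}` is passed as a single argument; no shell is involved,
    // so there is nothing to quote or escape. `.read()` captures stdout
    // and does not echo the command.
    let head = cmd!(sh, "git rev-parse {branch}").read()?;
    println!("HEAD of {branch} is {head}");
    Ok(())
}
```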

sh.push_dir()

You will see this pattern a lot:

```
let _popdir = sh.push_dir(some_dir);
```

sh stores its own internal working directory (separate from the process workdir, by the way). All commands and operations run through sh are relative to that directory. push_dir changes it and returns a guard; when the guard is dropped, the directory is popped.
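For illustration only (the directory name is made up), the guard's scope looks like this:

```
use xshell::{cmd, Shell};

fn build_in_subdir(sh: &Shell) -> anyhow::Result<()> {
    {
        // Everything run through `sh` inside this block happens in ./testbed.
        let _popdir = sh.push_dir("testbed");
        cmd!(sh, "cargo build --release").run()?;
    } // guard dropped here; sh's working directory is restored

    // Back in the original directory.
    cmd!(sh, "ls").run()?;
    Ok(())
}
```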

You may be wary if you have been bitten by code like this before:

```
// guard is immediately dropped, because that's what `let _ = ` does! Oops!
let _ = thing_that_returns_guard();
```

In the next version of xshell, it looks like push_dir is going away entirely, and will be replaced with approximately this:

```
sh.with_dir(some_dir, || {
    // ...
})
```

I'll deal with that when it happens. I'm sure the lambdas will make some things more annoying, but I think it'll be better that way.

@faithanalog (Contributor, Author):

@davepacheco tagged for review since we've been talking about this for a while. would also tag ry, but he is out. not sure who else might be interested in reviewing.

@davepacheco (Collaborator):

Sorry for the delay in looking at this! I'm just starting to look now.

Additionally, because of cloning the source to a temporary working path, any changes you have not git added will not be seen by a4x2-package. I consider this either mildly useful or mildly confusing depending on what you're doing in the moment. The jj crew can of course just run jj before using a4x2-package.
...
Maybe we should just try to make a4x2 work in the current workdir without causing trouble.

I think you've made the right call here in using Git. I also think making a4x2 work in the current work tree is a great goal, but shouldn't block this work. In the meantime, I'd suggest that the tool fail by default with a clear message if the working tree is dirty, but allow that to be overridden.

@davepacheco (Collaborator) left a comment:

This is looking great! The description in the PR was very helpful in getting oriented -- thanks for that.

I haven't yet looked at a4x2-deploy.

@@ -31,8 +31,13 @@ clap.workspace = true
fs-err.workspace = true
macaddr.workspace = true
serde.workspace = true
serde_json.workspace = true
Collaborator:

There was interest in keeping xtask as tiny as possible and using the "external" pattern (which is also pretty easy) to have it invoke other commands instead.

I don't have an opinion about whether this should be "external" or not.

Contributor:

I think this is ok, most of these look like deps that already exist. walkdir and xshell are the new ones, I think, and both of them look quite minimal.

Comment on lines +20 to +26
/// Bundle omicron live tests into the output tarball
#[clap(long)]
live_tests: bool,

/// Bundle omicron end-to-end-tests package into the output tarball
#[clap(long)]
end_to_end_tests: bool,
Collaborator:

Do you have a sense of how much more time these add? I'd be inclined to make them on by default.


#[derive(Deserialize, Debug)]
struct NextestConfigVersion {
// required: String,
Collaborator:

Remove?

Comment on lines +41 to +49
struct Environment {
cargo: String,
home_dir: Utf8PathBuf,
work_dir: Utf8PathBuf,
src_dir: Utf8PathBuf,
out_dir: Utf8PathBuf,
omicron_dir: Utf8PathBuf,
in_ci: bool,
}
Collaborator:

Some doc comments here would be a big help.

Comment on lines +87 to +89
// Buildomat sets the CI environment variable. We'll work a bit
// differently in that case (no cleanup, use /work)
let in_ci = env::var("CI").is_ok();
Collaborator:

Flags like this make me worried about diverging behavior between CI and local operation. How about allowing the work directory to be specified on the command line and having CI specify it? I'm not that worried about skipping cleanup in CI, but if we need that we could have a --skip-cleanup arg, too.

Contributor:

Yeah, I would prefer that. (Concerns like this are why I like having configuration profiles the way Cargo or nextest do, but that is probably overkill here.)

Collaborator:

I agree, the program we run in CI that invokes this should just use flags to get the behaviours. I think folks ought, as much as possible, to be able to just type stuff they see in the CI job programs into their shell and have it work the same way.

Contributor:

Having read through the rest of this PR, I'm coming around to believing that maybe configuration profiles wouldn't be the worst after all.

// Get a ref we can checkout from the local working tree
let treeish = {
let _popdir = sh.push_dir(&env.src_dir);
let mut treeish = cmd!(sh, "git stash create").read()?;
Collaborator:

This is where I'd be inclined to:

  • check if the working tree is dirty
  • if so: bail unless some --allow-dirty is specified
  • regardless, use HEAD
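A minimal sketch of that check, assuming an xshell `Shell` and a hypothetical `--allow-dirty` flag (neither the helper nor the flag exists in this PR):

```
use anyhow::bail;
use xshell::{cmd, Shell};

fn ensure_clean_tree(sh: &Shell, allow_dirty: bool) -> anyhow::Result<()> {
    // `git status --porcelain` prints one line per modified or untracked file.
    let status = cmd!(sh, "git status --porcelain").read()?;
    if !status.trim().is_empty() && !allow_dirty {
        bail!(
            "working tree is dirty; commit your changes or pass --allow-dirty \
             to package HEAD without them"
        );
    }
    Ok(())
}
```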

Contributor:

Also:

  • Use $GIT before git.
  • Altering stash-related state is a visible operation and I don't think we should do it -- instead, there are ways to create a "background" commit from the current state.
  • What happens if this is run in a non-git repo, e.g. a non-colocated Jujutsu repo?

Another option that has somewhat less source control interaction is:

  • get the list of all tracked files in the repo (here's instructions on how)
  • create hardlinks for all of them in a mirrored structure. (Hardlinks are fine because basically all editors write files atomically, which breaks them.)
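A rough sketch of that hardlink-mirror approach, assuming the tracked-file list comes from `git ls-files -z` (the helper name and signature are made up):

```
use std::fs;

use camino::{Utf8Path, Utf8PathBuf};
use xshell::{cmd, Shell};

/// Mirror every tracked file from `src` into `dest` using hardlinks.
fn mirror_tracked_files(sh: &Shell, src: &Utf8Path, dest: &Utf8Path) -> anyhow::Result<()> {
    let _popdir = sh.push_dir(src);
    // NUL-separated so filenames containing spaces survive intact.
    let listing = cmd!(sh, "git ls-files -z").read()?;
    for rel in listing.split('\0').filter(|s| !s.is_empty()) {
        let to: Utf8PathBuf = dest.join(rel);
        if let Some(parent) = to.parent() {
            fs::create_dir_all(parent)?;
        }
        fs::hard_link(src.join(rel), &to)?;
    }
    Ok(())
}
```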

cmd!(sh, "git fetch origin {treeish}").run()?;
cmd!(sh, "git checkout {treeish}").run()?;

// XXX is this needed in CI?
Collaborator:

Suggested change
// XXX is this needed in CI?

Comment on lines +158 to +160
// I'm not confident this does what it should, because I don't know much
// about the end to end tests. This is just going from what the shell script
// did.
Collaborator:

@iliana maybe could answer this?

Comment on lines +183 to +186
// XXX right now we only download a new nextest in CI, and we use
// the user's locally installed nextest otherwise. I'm inclined to
// say we should always download a fresh one for consistency.
// thoughts?
Collaborator:

Suggested change
// XXX right now we only download a new nextest in CI, and we use
// the user's locally installed nextest otherwise. I'm inclined to
// say we should always download a fresh one for consistency.
// thoughts?

I'd be inclined to have this tool assume that the user has already set up nextest however they want. That's what the rest of Omicron currently does, I believe. If CI needs to download it, that's fine, but it should do it as a separate step before invoking this tool (which I think is what it does for the existing CI jobs).

Contributor:

Agreed -- CI already downloads nextest. I'd just package up whatever nextest is available in the PATH.

Comment on lines +269 to +270
// XXX use main branch
// cmd!(sh, "git -C {testbed_dir} checkout artemis/init-in-smf").run()?;
Collaborator:

Could make this a command-line argument. It could accept a branch name or alternatively a local path to a clone if you want to do anything other than "main".

pub fn run_cmd(args: A4x2DeployArgs) -> Result<()> {
let sh = Shell::new()?;

// TODO: do we want to have a configurable falcon dataset?
Collaborator:

Yes, definitely. I can't run a4x2 on my system without overriding the default here.

It'd be nice to override a few other things, like THREE_NODE_RSS. Not a blocker though for v1.

Collaborator:

Agreed. This should be usable on shared systems like gimlozar, where more than one engineer needs to be able to do this in parallel without clobbering one another.

// Delete any results from previous runs. We don't mind if
// there's errors. This needs to run as root in case there are
// artifacts owned by root left around from the deploy.
cmd!(sh, "pfexec rm -rf").arg(&env.work_dir).run()?;
Collaborator:

This is (obviously) incredibly dangerous. I wonder how we can make it less so.

Contributor:

Yeah, agreed -- if you could set work_dir to something like /etc (e.g. via the CLI option discussed above) you could nuke the whole system.

How many files is this typically, and how many are owned by root? If it's not too many, I'm wondering if we could do a first pass that iterates over the file system to find files owned by root, and then lists out those files. Then the user can be prompted to ask if it's okay to use pfexec (and there can be a --force option to skip prompting.)
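A sketch of that first pass, using the walkdir dependency this PR already adds (the function name is hypothetical):

```
use std::os::unix::fs::MetadataExt;

use camino::{Utf8Path, Utf8PathBuf};
use walkdir::WalkDir;

/// Collect paths under `work_dir` that are owned by root, so they can be
/// listed for the user (or prompted about) before escalating to `pfexec rm -rf`.
fn root_owned_paths(work_dir: &Utf8Path) -> anyhow::Result<Vec<Utf8PathBuf>> {
    let mut paths = Vec::new();
    for entry in WalkDir::new(work_dir) {
        let entry = entry?;
        if entry.metadata()?.uid() == 0 {
            paths.push(Utf8PathBuf::try_from(entry.path().to_path_buf())?);
        }
    }
    Ok(paths)
}
```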

Collaborator:

What goes in here? Why do you need pfexec to remove it?

Comment on lines +226 to +227
// Generate an ssh key we will use to log into the sleds.
cmd!(sh, "ssh-keygen -t ed25519 -N '' -f a4x2-ssh-key").run()?;
Collaborator:

Ooh, (I think) I like using a one-off key here. But it also needs to be easy for folks to ssh in after the fact. Should we give folks a command to load it into their agent?

Contributor:

Or provide an ssh command to copy-paste.

Collaborator:

I think it makes sense to inject a key that you can reliably use locally from the automation here, but I would want to continue using my yubikey-backed SSH keys to actually log in -- we should be willing to import keys from some other place, such as ~/.ssh/authorized_keys, or I guess potentially from the output of ssh-add -L, etc.

.find(|v| v["dynamic"] == Value::Bool(true))
.ok_or(anyhow!("failed to find customer edge addr"))?["local"]
.as_str()
.unwrap();
Collaborator:

May as well replace these unwrap()s with error messages?
Would it make sense to define a Rust struct that derives Deserialize, so that if any of the assumptions in ce_addr_json[0]["addr_info"].as_array().unwrap() aren't true, we fail gracefully? (The value might not be an array; it might not have an addr_info property; addr_info might not be an array; etc.)
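For example, something along these lines, guessing at the JSON shape from the `addr_info`/`dynamic`/`local` lookups in the code (the field names here are an assumption, not taken from the PR):

```
use serde::Deserialize;

#[derive(Deserialize, Debug)]
struct Interface {
    addr_info: Vec<AddrInfo>,
}

#[derive(Deserialize, Debug)]
struct AddrInfo {
    #[serde(default)]
    dynamic: bool,
    local: Option<String>,
}

fn customer_edge_addr(json: &str) -> anyhow::Result<String> {
    let interfaces: Vec<Interface> = serde_json::from_str(json)?;
    interfaces
        .iter()
        .flat_map(|i| &i.addr_info)
        .find(|a| a.dynamic)
        .and_then(|a| a.local.clone())
        .ok_or_else(|| anyhow::anyhow!("failed to find customer edge addr"))
}
```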

Comment on lines +271 to +274
// XXX Im told that pinging the gateway from inside the sleds is no longer
// necessary so I'm leaving it out for now. We'll see if that's true.
// So far it seems to be.

Collaborator:

Suggested change
// XXX Im told that pinging the gateway from inside the sleds is no longer
// necessary so I'm leaving it out for now. We'll see if that's true.
// So far it seems to be.

// necessary so I'm leaving it out for now. We'll see if that's true.
// So far it seems to be.

cmd!(sh, "pfexec route add 198.51.100.0/24 {customer_edge_addr}").run()?;
Collaborator:

I'm surprised you have to do this because I don't remember doing it. Where does this happen during a manual a4x2 launch?

Contributor:

(and more generally, could there be a reference to documentation for this function so we can look at things step-by-step?)

Collaborator:

Presumably you only need to do this if you're actually trying to directly communicate with that subnet from outside; i.e., from processes you run on the host alongside this.

Comment on lines +288 to +310
// The important thing here is to do an HTTP request with timeout to the
// control plane API endpoint. If the server is up, we'll get the page you'd
// expect for an unauthenticated user. If not, then the request will either
// - fail
// - stall
//
// Stalling is why we do a timeout, and we need to handle timeout both at
// the TCP level and the HTTP level. `curl -m 5` will hard-out after 5
// seconds no matter what's going on.
//
// That 5 seconds is arbitrary, but it's been working well over in
// rackletteadm, from which this logic is copied.
//
// We could replace curl with rust code, but make sure we do the same thing!
while retries > 0 && cmd!(sh, "curl -s -m 5 {api_url}").run().is_err() {
retries -= 1;
thread::sleep(time::Duration::from_secs(25));

if retries % 5 == 0 {
// I see no reason to error while printing the date
let _ = cmd!(sh, "date").run();
}
}
Collaborator:

I wonder if you can use the same code that the end-to-end tests use. I seem to recall them doing the same thing in Rust.

Contributor:

If we do that, then it's worth factoring that out so we don't have to build e.g. our Diesel code to build this binary. (And if we do that, we should definitely put this into an external binary.)

@sunshowers (Contributor) left a comment:

Thank you for doing this! Really excited about automating all of this.

(I'm still working on the review -- here's what I have for now.)

swrite.workspace = true
tabled.workspace = true
textwrap.workspace = true
toml.workspace = true
usdt.workspace = true
walkdir.workspace = true
xshell.workspace = true
sha2.workspace = true
Contributor:

Could you put this in sorted order? Thanks!



Comment on lines +123 to +125
let result = teardown_a4x2(&sh, &env);
eprintln!("teardown result: {:?}", result);
eprintln!("continuing regardless of whether there were errors");
Contributor:

Could this print out the ok and error outputs nicely? Then the error can be printed conditionally.


Comment on lines +380 to +381
cmd!(sh, "pfexec route delete 198.51.100.0/24 {gateway}")
.run()?;
Contributor:

Worth printing out the fact that we're doing this.

In general I really like the approach where we first gather all operations into a structure, and then perform them in a separate pass. That enables prompting for confirmation, and things like --dry-run. Not sure if you want to put the effort into doing that, but if you do I'd really appreciate it :)
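Not something the PR does today, but a sketch of the gather-then-execute shape (the names and variants here are invented for illustration):

```
use xshell::{cmd, Shell};

/// One entry per destructive operation we plan to perform during teardown.
enum TeardownStep {
    DeleteRoute { subnet: String, gateway: String },
}

fn execute(sh: &Shell, steps: &[TeardownStep], dry_run: bool) -> anyhow::Result<()> {
    for step in steps {
        match step {
            TeardownStep::DeleteRoute { subnet, gateway } => {
                // Always announce the operation; only perform it when not a dry run.
                println!("teardown: route delete {subnet} {gateway}");
                if !dry_run {
                    cmd!(sh, "pfexec route -n delete {subnet} {gateway}").run()?;
                }
            }
        }
    }
    Ok(())
}
```

Building the `Vec<TeardownStep>` first also makes a `--dry-run` flag and a confirmation prompt nearly free.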

Comment on lines +387 to +389
if !had_gateway {
break;
}
Contributor:

Is had_gateway not being present an exceptional condition? Worth adding a comment in any case.

Comment on lines +409 to +410
// pour one out for enjoyers of clean indentation
println!(
Contributor:

https://docs.rs/indoc if you'd like to use it!

"teardown_a4x2: could not get gateway for a4x2 route from line {ln}"
))?;

cmd!(sh, "pfexec route delete 198.51.100.0/24 {gateway}")
Contributor:

It looks like some of these IPs and subnets are fixed and well-known -- could you make them constants at the top of the file?

Comment on lines +456 to +457
let ipv4 = ln.split_whitespace().nth(3).ok_or(anyhow!("get_host_ip: could not extract IP for node {node} from line {ln}"))?;
let ipv4 = ipv4.strip_suffix("/24").ok_or(anyhow!("get_host_ip: could not extract IP for node {node} from line {ln}"))?;
Contributor:

same comments as in the gateway section above.

@jclulow (Collaborator) left a comment:

This seems like a great idea! I have a hodge podge of thoughts I've left as comments, but I wanted to say I agree this makes things much more accessible to casual users in particular. I also like the concrete separation of build machine and target machine, with an artefact you can pass between them!



Comment on lines +103 to +105
} else {
home_dir.join(".cache/a4x2-deploy")
};
Collaborator:

How big does this cache/work directory get in practice?

// all of them. But, I don't like unbounded loops, so we will do at max 10,
// which is a number that seems unlikely enough to reach, to me.
for _ in 0..10 {
let mut route_cmd = cmd!(sh, "pfexec route get 198.51.100.0/24");
Collaborator:

Suggested change
let mut route_cmd = cmd!(sh, "pfexec route get 198.51.100.0/24");
let mut route_cmd = cmd!(sh, "route -n get 198.51.100.0/24");

You don't need elevated privileges to route get, and we should use the -n flag to avoid name service lookups.

}

pub fn run_cmd(args: A4x2DeployArgs) -> Result<()> {
let sh = Shell::new()?;
Collaborator:

I think we should probably have a common wrapper around Shell::new(), which should at minimum scrub the environment that these commands are going to run in. In particular, I think we should make sure no LC_* or LANG variables are inherited, and instead specify LANG='C.UTF-8' for all invocations so that we get consistent output from commands and so on.
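One possible shape for that, scrubbing at the `Cmd` level rather than at `Shell::new()` (the helper name is made up, and whether to apply it per command or via a shared constructor is an open question):

```
use std::env;

use xshell::{cmd, Cmd, Shell};

/// Drop inherited locale variables and force a consistent locale, so that
/// command output (e.g. from `route`) parses the same way on every machine.
fn with_clean_locale(c: Cmd<'_>) -> Cmd<'_> {
    let mut c = c;
    for (key, _) in env::vars() {
        if key == "LANG" || key == "LANGUAGE" || key.starts_with("LC_") {
            c = c.env_remove(key);
        }
    }
    c.env("LANG", "C.UTF-8")
}

fn example(sh: &Shell) -> anyhow::Result<String> {
    Ok(with_clean_locale(cmd!(sh, "route -n get 198.51.100.0/24")).read()?)
}
```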

Comment on lines +333 to +346
// If you want any change in functionality for the test runner, update
// run-live-tests over in a4x2_package.rs. Don't add it here!
let switch_zone_script = r#"
set -euxo pipefail
tar xvzf live-tests-bundle.tgz
cd live-tests-bundle
./run-live-tests
"#;

let remote_script =
format!("zlogin oxz_switch bash -c '{switch_zone_script}'");

// Will error if the live tests fail. This is desired.
cmd!(sh, "ssh {ssh_args...} {ssh_host} {remote_script}").run()?;
Collaborator:

I feel like you're getting extremely lucky with the several layers of belligerent escaping that occur when descending through all of xshell, SSH, zlogin, and then bash -c, haha.

I don't have any specific guidance other than it feels like this will fall down if there end up being any quotes or load bearing whitespace required.

"teardown_a4x2: could not get gateway for a4x2 route from line {ln}"
))?;

cmd!(sh, "pfexec route delete 198.51.100.0/24 {gateway}")
Collaborator:

Suggested change
cmd!(sh, "pfexec route delete 198.51.100.0/24 {gateway}")
cmd!(sh, "pfexec route -n delete 198.51.100.0/24 {gateway}")

Comment on lines +367 to +368
// We get an error code when there is no route (and thus, nothing for
// us to delete!)
Collaborator:

I think before ignoring the result here I would make sure the output ends in "not in table"; otherwise it's presumably an error we're not expecting, and we'd like to display it to the user and abort.
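Something like the following, assuming we capture the output rather than streaming it (the exact stderr text from route is an assumption):

```
use anyhow::bail;
use xshell::{cmd, Shell};

fn delete_route(sh: &Shell, gateway: &str) -> anyhow::Result<()> {
    let out = cmd!(sh, "pfexec route -n delete 198.51.100.0/24 {gateway}")
        .ignore_status()
        .output()?;
    if !out.status.success() {
        let stderr = String::from_utf8_lossy(&out.stderr);
        // "not in table" just means there was no route to delete; anything
        // else is unexpected and should be surfaced.
        if !stderr.contains("not in table") {
            bail!("route delete failed: {stderr}");
        }
    }
    Ok(())
}
```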

Comment on lines +171 to +174
// We ignore any error here, because the result we
// actually want to produce is whether tests passed
let _ = teardown_a4x2(&sh, &env);
}
Collaborator:

I don't think we should ignore this error. Either teardown_a4x2() should work, or we need to stop and figure out why it did not.

Comment on lines +120 to +123
// Teardown previous deploy if it exists, before wiping the data
// for it. If this errors, we assume the deploy doesn't exist
// and carry on.
let result = teardown_a4x2(&sh, &env);
Collaborator:

I don't think we should ignore this error. Either teardown_a4x2() should work, or we need to stop and figure out why it did not.

@sunshowers (Contributor) left a comment:

Thanks again -- went through and left a few more comments.

let src_dir = Utf8PathBuf::try_from(env::current_dir()?)?;

// Delete any results from previous runs. We don't mind if there's errors.
fs::remove_dir_all(&work_dir).ok();
Contributor:

Also worth reporting errors here as a warning.

use serde::Deserialize;
use sha2::Digest;
use std::env;
use std::fs;
Contributor:

Consider https://docs.rs/fs-err for better error messages.

Comment on lines +16 to +17
mod a4x2_deploy;
mod a4x2_package;
Contributor:

I suspect we'll want to share some code between the two -- what do you think about putting them in an a4x2 module and having a4x2/deploy.rs and package.rs?


if treeish.is_empty() {
// nothing to stash
eprintln!("Nothing to stash, using most recent commit for clone");
treeish = cmd!(sh, "git rev-parse HEAD").read()?;
Contributor:

Again these should use $GIT.

Comment on lines +330 to +339
if can_dedupe {
// yay, extract out to the common dir.
let dest = common_dir.join(base_path);
sh.create_dir(&dest.parent().unwrap())?;
sh.copy_file(&g0_path, &dest)?;
sh.remove_path(&g0_path)?;
for prefix in &prefixes {
sh.remove_path(&prefix.join(base_path))?;
}
}
Contributor:

Hmm interesting -- what code ensures that sled-common is checked?

Comment on lines +202 to +204
// XXX do we need to test nextest exists & is the right version? need to
// see if the incidental errors from stuff below already take care of
// telling someone outside of CI what they need to be doing.
Contributor:

You could run cargo nextest show-config version -- exit code 92 means that the required version isn't met, and exit code 10 means that the recommended version isn't met.
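For reference, a sketch of acting on those exit codes (the messages are mine, not nextest's):

```
use anyhow::bail;
use xshell::{cmd, Shell};

fn check_nextest(sh: &Shell) -> anyhow::Result<()> {
    let out = cmd!(sh, "cargo nextest show-config version")
        .ignore_status()
        .output()?;
    match out.status.code() {
        Some(0) => Ok(()),
        Some(92) => bail!("cargo-nextest is older than the required version"),
        Some(10) => {
            eprintln!("warning: cargo-nextest is older than the recommended version");
            Ok(())
        }
        other => bail!("could not determine nextest version (exit status {other:?})"),
    }
}
```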

// intentionally relative path so the tarball gets relative paths
let _popdir = sh.push_dir(&env.work_dir);
let bundle_dir_name = live_test_bundle_dir.file_name().unwrap();
cmd!(sh, "tar -czf {out_dir}/live-tests-bundle.tgz {bundle_dir_name}/")
Contributor:

live-tests-bundle.tgz maybe can be a constant.

let pkg_dir_name = env.out_dir.file_name().unwrap();

// Place output in `out/` when running locally, or leave in work dir for CI
let artifact = Utf8PathBuf::from("a4x2-package-out.tgz");
Contributor:

This filename can be a constant too, I think.

let a4x2_dir = testbed_dir.join("a4x2");

// TODO move this clone & build to prepare, so that if the omicron build
// takes awhile we wont have lost our git token when we get here. Relevant
Contributor:

Suggested change
// takes awhile we wont have lost our git token when we get here. Relevant
// takes awhile, we won't have lost our git token when we get here. Relevant
