-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a4x2-package/a4x2-deploy xtasks #7424
base: main
Are you sure you want to change the base?
Conversation
This PR adds two new xtasks: - a4x2-package - a4x2-deploy ``` > cargo xtask a4x2-package -h Generate a tarball with omicron packaged for deployment onto a4x2 Usage: cargo xtask a4x2-package [OPTIONS] Options: --live-tests Bundle omicron live tests into the output tarball --end-to-end-tests Bundle omicron end-to-end-tests package into the output tarball -h, --help Print help ``` This command will: - download & build the latest a4x2 from oxidecomputer/testbed - clone the local working tree to a temporary path & build + package it with a4x2 - if requested, build live-tests - if requested, build end-to-end tests (but there is no way to run these yet) - package these all into a tar bundle, at `out/a4x2-package-out.tgz` ``` > cargo xtask a4x2-deploy -h Run a4x2 and deploy omicron onto it, and optionally run live-tests and end to end tests Usage: cargo xtask a4x2-deploy <COMMAND> Commands: start Start a4x2, deploy a control plane to it, and then leave it running stop Stop a4x2 that was previously launched with the `start` subcommand run-tests Start a4x2, run tests, and then stop a4x2 unless you request otherwise help Print this message or the help of the given subcommand(s) Options: -h, --help Print help ``` This command will: - tear down any existing a4x2 created by a previous invocation of a4x2-deploy - unpack a bundle generated by a4x2-package - download latest OVMF code & propolis server - launch a4x2 - ping the control plane until it becomes alive (or a timeout is reached, in which case it considers the launch failed) - if requested, transfer live tests to the system and execute them - print information on how to access the server By default, a4x2-deploy will use a bundle from `out/a4x2-package.tgz`, but an alternative can be specified with `--package <path>`, so you can build a package and send it to someone else for them to run on their machine. - a4x2 is still a bit flaky. Sometimes I will reboot my machine for good luck. - a4x2-deploy could and should automatically collect logs from the system when launches fail, but does not yet. - no support for incremental builds of omicron. a4x2 makes some changes to tomls in the current working directory, which is why we clone the source directory to another path prior to running the a4x2 package step. Additionally, because of cloning the source to a temporary working path, any changes you have not `git add`ed will not be seen by `a4x2-package`. I consider this either mildly useful or mildly confusing depending on what you're doing in the moment. The jj crew can of course just run `jj` before using `a4x2-package`. We could opt instead of doing a `cp -a` of the working directory instead of a git clone, but this could be problematic due to copying potentially massive `target/` directories. We could use `rsync` wtith excludes, but I'm not sure we can rely on it always being present. The `git clone` seemed like the most reliable to me, and it has precedent in the `releng` xtask. But I do find this to be somewhat annoying when iterating on changes.
@davepacheco tagged for review since we've been talking about this for awhile. would also tag ry, but he is out. not sure who else might be interested in reviewing. |
Sorry for the delay in looking at this! I'm just starting to look now.
I think you've made the right call here in using Git. I also think making a4x2 work in the current work tree is a great goal, but shouldn't block this work. In the meantime, I'd suggest that the tool fail by default with a clear message if the working tree is dirty, but allow that to be overridden. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking great! The description in the PR was very helpful in getting oriented -- thanks for that.
I haven't yet looked at a4x2-deploy
.
@@ -31,8 +31,13 @@ clap.workspace = true | |||
fs-err.workspace = true | |||
macaddr.workspace = true | |||
serde.workspace = true | |||
serde_json.workspace = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was interest in keeping xtask
as tiny as possible and using the "external" pattern (which is also pretty easy) to have it invoke other commands instead.
I don't have an opinion about whether this should be "external" or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is ok, most of these look like deps that already exist. walkdir and xshell are the new ones, I think, and both of them look quite minimal.
/// Bundle omicron live tests into the output tarball | ||
#[clap(long)] | ||
live_tests: bool, | ||
|
||
/// Bundle omicron end-to-end-tests package into the output tarball | ||
#[clap(long)] | ||
end_to_end_tests: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have a sense of how much more time these add? I'd be inclined to make them on by default.
|
||
#[derive(Deserialize, Debug)] | ||
struct NextestConfigVersion { | ||
// required: String, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove?
struct Environment { | ||
cargo: String, | ||
home_dir: Utf8PathBuf, | ||
work_dir: Utf8PathBuf, | ||
src_dir: Utf8PathBuf, | ||
out_dir: Utf8PathBuf, | ||
omicron_dir: Utf8PathBuf, | ||
in_ci: bool, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some doc comments here would be a big help.
// Buildomat sets the CI environment variable. We'll work a bit | ||
// differently in that case (no cleanup, use /work) | ||
let in_ci = env::var("CI").is_ok(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Flags like this make me worried about diverging behavior between CI and local operation. How about allowing the work directory to be specified on the command line and having CI specify it? I'm not that worried about skipping cleanup in CI but if we that we could have a --skip-cleanup
arg, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I would prefer that. (Concerns like this are why I like having configuration profiles the way Cargo or nextest do, but that is probably overkill here.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, the program we run in CI that invokes this should just use flags to get the behaviours. I think folks ought, as much as possible, to be able to just type stuff they see in the CI job programs into their shell and have it work the same way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having read through the rest of this PR, I'm coming around to believing that maybe configuration profiles wouldn't be the worst after all.
// Get a ref we can checkout from the local working tree | ||
let treeish = { | ||
let _popdir = sh.push_dir(&env.src_dir); | ||
let mut treeish = cmd!(sh, "git stash create").read()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is where I'd be inclined to:
- check if the working tree is dirty
- if so: bail unless some
--allow-dirty
is specified - regardless, use
HEAD
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also:
- Use
$GIT
beforegit
. - Altering stash-related state is a visible operation and I don't think we should do it -- instead, there are ways to create a "background" commit from the current state.
- What happens if this is run in a non-git repo, e.g. a non-colocated Jujutsu repo?
Another option that has somewhat less source control interaction is:
- get the list of all tracked files in the repo (here's instructions on how)
- create hardlinks for all of them in a mirrored structure. (Hardlinks are fine because basically all editors write files atomically, which breaks them.)
cmd!(sh, "git fetch origin {treeish}").run()?; | ||
cmd!(sh, "git checkout {treeish}").run()?; | ||
|
||
// XXX is this needed in CI? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// XXX is this needed in CI? |
// I'm not confident this does what it should, because I don't know much | ||
// about the end to end tests. This is just going from what the shell script | ||
// did. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@iliana maybe could answer this?
// XXX right now we only download a new nextest in CI, and we use | ||
// the user's locally installed nextest otherwise. I'm inclined to | ||
// say we should always download a fresh one for consistency. | ||
// thoughts? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// XXX right now we only download a new nextest in CI, and we use | |
// the user's locally installed nextest otherwise. I'm inclined to | |
// say we should always download a fresh one for consistency. | |
// thoughts? | |
// XXX right now we only download a new nextest in CI, and we use | |
// the user's locally installed nextest otherwise. I'm inclined to | |
// say we should always download a fresh one for consistency. | |
// thoughts? |
I'd be inclined to have this tool assume that the user has already set up nextest however they want. That's what the rest of Omicron currently does, I believe. If CI needs to download it, that's fine, but it should do it as a separate step before invoking this tool (which I think is what it does for the existing CI jobs).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed -- CI already downloads nextest. I'd just package up whatever nextest is available in the PATH
.
// XXX use main branch | ||
// cmd!(sh, "git -C {testbed_dir} checkout artemis/init-in-smf").run()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could make this a command-line argument. It could accept a branch name or alternatively a local path to a clone if you want to do anything other than "main".
pub fn run_cmd(args: A4x2DeployArgs) -> Result<()> { | ||
let sh = Shell::new()?; | ||
|
||
// TODO: do we want to have a configurable falcon dataset? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, definitely. I can't run a4x2 on my system without overriding the default here.
It'd be nice to override a few other things, like THREE_NODE_RSS. Not a blocker though for v1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. This should be usable on shared systems like gimlozar, where more than one engineer need to be able to do this in parallel without clobbering one another.
// Delete any results from previous runs. We don't mind if | ||
// there's errors. This needs to run as root in case there are | ||
// artifacts owned by root left around from the deploy. | ||
cmd!(sh, "pfexec rm -rf").arg(&env.work_dir).run()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is (obviously) incredibly dangerous. I wonder how we can make it less so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, agreed -- if you could set work_dir
to something like /etc
(e.g. via the CLI option discussed above) you could nuke the whole system.
How many files is this typically, and how many are owned by root? If it's not too many, I'm wondering if we could do a first pass that iterates over the file system to find files owned by root, and then lists out those files. Then the user can be prompted to ask if it's okay to use pfexec
(and there can be a --force
option to skip prompting.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What goes in here? Why do you need pfexec to remove it?
// Generate an ssh key we will use to log into the sleds. | ||
cmd!(sh, "ssh-keygen -t ed25519 -N '' -f a4x2-ssh-key").run()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ooh, (I think) I like using a one-off key here. But it also needs to be easy for folks to ssh in after the fact. Should we give folks a command to load it into their agent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or provide an ssh command to copy-paste.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense to inject a key that you can reliably use locally from the automation here, but I would want to continue using my yubikey-backed SSH keys to actually log in -- we should be willing to import keys from some other place, such as ~/.ssh/authorized_keys
, or I guess potentially from the output of ssh-add -L
, etc.
.find(|v| v["dynamic"] == Value::Bool(true)) | ||
.ok_or(anyhow!("failed to find customer edge addr"))?["local"] | ||
.as_str() | ||
.unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May as well replace these unwrap()
s with error messages?
Would it make sense to define a Rust struct that derives Deserialize
so that if any of the assumptions in ce_addr_json[0]["addr_info"].as_array().unwrap()
aren't true, we fail gracefully? (value might not be an array; might not have addr_info
property, it might not be an array, etc.)
// XXX Im told that pinging the gateway from inside the sleds is no longer | ||
// necessary so I'm leaving it out for now. We'll see if that's true. | ||
// So far it seems to be. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// XXX Im told that pinging the gateway from inside the sleds is no longer | |
// necessary so I'm leaving it out for now. We'll see if that's true. | |
// So far it seems to be. |
// necessary so I'm leaving it out for now. We'll see if that's true. | ||
// So far it seems to be. | ||
|
||
cmd!(sh, "pfexec route add 198.51.100.0/24 {customer_edge_addr}").run()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm surprised you have to do this because I don't remember doing it. Where does this happen during a manual a4x2 launch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(and more generally, could there be a reference to documentation for this function so we can look at things step-by-step?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Presumably you only need to do this if you're actually trying to directly communicate with that subnet from outside; i.e., from processes you run on the host alongside this.
// The important thing here is to do an HTTP request with timeout to the | ||
// control plane API endpoint. If the server is up, we'll get the page you'd | ||
// expect for an unauthenticated user. If not, then the request will either | ||
// - fail | ||
// - stall | ||
// | ||
// Stalling is why we do a timeout, and we need to handle timeout both at | ||
// the TCP level and the HTTP level. `curl -m 5` will hard-out after 5 | ||
// seconds no matter what's going on. | ||
// | ||
// That 5 seconds is arbitrary, but it's been working well over in | ||
// rackletteadm, from which this logic is copied. | ||
// | ||
// We could replace curl with rust code, but make sure we do the same thing! | ||
while retries > 0 && cmd!(sh, "curl -s -m 5 {api_url}").run().is_err() { | ||
retries -= 1; | ||
thread::sleep(time::Duration::from_secs(25)); | ||
|
||
if retries % 5 == 0 { | ||
// I see no reason to error while printing the date | ||
let _ = cmd!(sh, "date").run(); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if you can use the same code that the end-to-end tests use. I seem to recall them doing the same thing in Rust.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do that, then it's worth factoring that out so we don't have to build e.g. our Diesel code to build this binary. (And if we do that, we should definitely put this into an external binary.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for doing this! Really excited about automating all of this.
(I'm still working on the review -- here's what I have for now.)
swrite.workspace = true | ||
tabled.workspace = true | ||
textwrap.workspace = true | ||
toml.workspace = true | ||
usdt.workspace = true | ||
walkdir.workspace = true | ||
xshell.workspace = true | ||
sha2.workspace = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you put this in sorted order? Thanks!
@@ -31,8 +31,13 @@ clap.workspace = true | |||
fs-err.workspace = true | |||
macaddr.workspace = true | |||
serde.workspace = true | |||
serde_json.workspace = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is ok, most of these look like deps that already exist. walkdir and xshell are the new ones, I think, and both of them look quite minimal.
// Buildomat sets the CI environment variable. We'll work a bit | ||
// differently in that case (no cleanup, use /work) | ||
let in_ci = env::var("CI").is_ok(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I would prefer that. (Concerns like this are why I like having configuration profiles the way Cargo or nextest do, but that is probably overkill here.)
let result = teardown_a4x2(&sh, &env); | ||
eprintln!("teardown result: {:?}", result); | ||
eprintln!("continuing regardless of whether there were errors"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this print out the ok and error outputs nicely? Then the error can be printed conditionally.
// Delete any results from previous runs. We don't mind if | ||
// there's errors. This needs to run as root in case there are | ||
// artifacts owned by root left around from the deploy. | ||
cmd!(sh, "pfexec rm -rf").arg(&env.work_dir).run()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, agreed -- if you could set work_dir
to something like /etc
(e.g. via the CLI option discussed above) you could nuke the whole system.
How many files is this typically, and how many are owned by root? If it's not too many, I'm wondering if we could do a first pass that iterates over the file system to find files owned by root, and then lists out those files. Then the user can be prompted to ask if it's okay to use pfexec
(and there can be a --force
option to skip prompting.)
cmd!(sh, "pfexec route delete 198.51.100.0/24 {gateway}") | ||
.run()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth printing out the fact that we're doing this.
In general I really like the approach where we first gather all operations into a structure, and then perform them in a separate pass. That enables prompting for confirmation, and things like --dry-run
. Not sure if you want to put the effort into doing that, but if you do I'd really appreciate it :)
if !had_gateway { | ||
break; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is had_gateway
not being present an exceptional condition? Worth adding a comment in any case.
// pour one out for enjoyers of clean indentation | ||
println!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://docs.rs/indoc if you'd like to use it!
"teardown_a4x2: could not get gateway for a4x2 route from line {ln}" | ||
))?; | ||
|
||
cmd!(sh, "pfexec route delete 198.51.100.0/24 {gateway}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like some of these IPs and subnets are fixed and well-known -- could you make them constants at the top of the file?
let ipv4 = ln.split_whitespace().nth(3).ok_or(anyhow!("get_host_ip: could not extract IP for node {node} from line {ln}"))?; | ||
let ipv4 = ipv4.strip_suffix("/24").ok_or(anyhow!("get_host_ip: could not extract IP for node {node} from line {ln}"))?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comments as in the gateway section above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a great idea! I have a hodge podge of thoughts I've left as comments, but I wanted to say I agree this makes things much more accessible to casual users in particular. I also like the concrete separation of build machine and target machine, with an artefact you can pass between them!
pub fn run_cmd(args: A4x2DeployArgs) -> Result<()> { | ||
let sh = Shell::new()?; | ||
|
||
// TODO: do we want to have a configurable falcon dataset? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. This should be usable on shared systems like gimlozar, where more than one engineer need to be able to do this in parallel without clobbering one another.
// Buildomat sets the CI environment variable. We'll work a bit | ||
// differently in that case (no cleanup, use /work) | ||
let in_ci = env::var("CI").is_ok(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, the program we run in CI that invokes this should just use flags to get the behaviours. I think folks ought, as much as possible, to be able to just type stuff they see in the CI job programs into their shell and have it work the same way.
} else { | ||
home_dir.join(".cache/a4x2-deploy") | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How big does this cache/work directory get in practice?
// all of them. But, I don't like unbounded loops, so we will do at max 10, | ||
// which is a number that seems unlikely enough to reach, to me. | ||
for _ in 0..10 { | ||
let mut route_cmd = cmd!(sh, "pfexec route get 198.51.100.0/24"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let mut route_cmd = cmd!(sh, "pfexec route get 198.51.100.0/24"); | |
let mut route_cmd = cmd!(sh, "route -n get 198.51.100.0/24"); |
You don't need elevated privileges to route get, and we should use the -n flag to avoid name service lookups.
} | ||
|
||
pub fn run_cmd(args: A4x2DeployArgs) -> Result<()> { | ||
let sh = Shell::new()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should probably have a common wrapper around Shell::new(), which should at minimum scrub the environment that these commands are going to run in. In particular, I think we should make sure no LC_* or LANG variables are inherited, and instead specify LANG='C.UTF-8' for all invocations so that we get consistent output from commands and so on.
// If you want any change in functionality for the test runner, update | ||
// run-live-tests over in a4x2_package.rs. Don't add it here! | ||
let switch_zone_script = r#" | ||
set -euxo pipefail | ||
tar xvzf live-tests-bundle.tgz | ||
cd live-tests-bundle | ||
./run-live-tests | ||
"#; | ||
|
||
let remote_script = | ||
format!("zlogin oxz_switch bash -c '{switch_zone_script}'"); | ||
|
||
// Will error if the live tests fail. This is desired. | ||
cmd!(sh, "ssh {ssh_args...} {ssh_host} {remote_script}").run()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like you're getting extremely lucky with the several layers of belligerent escaping that occur when descending through all of xshell, SSH, zlogin, and then bash -c
, haha.
I don't have any specific guidance other than it feels like this will fall down if there end up being any quotes or load bearing whitespace required.
"teardown_a4x2: could not get gateway for a4x2 route from line {ln}" | ||
))?; | ||
|
||
cmd!(sh, "pfexec route delete 198.51.100.0/24 {gateway}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cmd!(sh, "pfexec route delete 198.51.100.0/24 {gateway}") | |
cmd!(sh, "pfexec route -n delete 198.51.100.0/24 {gateway}") |
// We get an error code when there is no route (and thus, nothing for | ||
// us to delete!) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think before ignoring the result here I would make sure the output ends in not in table, otherwise it's presumably an error we're not expecting and we'd like to display it to the user and abort.
// We ignore any error here, because the result we | ||
// actually want to produce is whether tests passed | ||
let _ = teardown_a4x2(&sh, &env); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should ignore this error. Either teardown_a4x2() should work, or we need to stop and figure out why it did not.
// Teardown previous deploy if it exists, before wiping the data | ||
// for it. If this errors, we assume the deploy doesn't exist | ||
// and carry on. | ||
let result = teardown_a4x2(&sh, &env); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should ignore this error. Either teardown_a4x2() should work, or we need to stop and figure out why it did not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again -- went through and left a few more comments.
let src_dir = Utf8PathBuf::try_from(env::current_dir()?)?; | ||
|
||
// Delete any results from previous runs. We don't mind if there's errors. | ||
fs::remove_dir_all(&work_dir).ok(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also worth reporting errors here as a warning.
use serde::Deserialize; | ||
use sha2::Digest; | ||
use std::env; | ||
use std::fs; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider https://docs.rs/fs-err for better error messages.
mod a4x2_deploy; | ||
mod a4x2_package; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect we'll want to share some code between the two -- what do you think about putting them in an a4x2
module and having a4x2/deploy.rs
and package.rs
?
// Get a ref we can checkout from the local working tree | ||
let treeish = { | ||
let _popdir = sh.push_dir(&env.src_dir); | ||
let mut treeish = cmd!(sh, "git stash create").read()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also:
- Use
$GIT
beforegit
. - Altering stash-related state is a visible operation and I don't think we should do it -- instead, there are ways to create a "background" commit from the current state.
- What happens if this is run in a non-git repo, e.g. a non-colocated Jujutsu repo?
Another option that has somewhat less source control interaction is:
- get the list of all tracked files in the repo (here's instructions on how)
- create hardlinks for all of them in a mirrored structure. (Hardlinks are fine because basically all editors write files atomically, which breaks them.)
if treeish.is_empty() { | ||
// nothing to stash | ||
eprintln!("Nothing to stash, using most recent commit for clone"); | ||
treeish = cmd!(sh, "git rev-parse HEAD").read()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again these should use $GIT
.
if can_dedupe { | ||
// yay, extract out to the common dir. | ||
let dest = common_dir.join(base_path); | ||
sh.create_dir(&dest.parent().unwrap())?; | ||
sh.copy_file(&g0_path, &dest)?; | ||
sh.remove_path(&g0_path)?; | ||
for prefix in &prefixes { | ||
sh.remove_path(&prefix.join(base_path))?; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm interesting -- what code ensures that sled-common is checked?
// XXX do we need to test nextest exists & is the right version? need to | ||
// see if the incidental errors from stuff below already take care of | ||
// telling someone outside of CI what they need to be doing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could run cargo nextest show-config version
-- exit code 92 means that the required version isn't met, and exit code 10 means that the recommended version isn't met.
// intentionally relative path so the tarball gets relative paths | ||
let _popdir = sh.push_dir(&env.work_dir); | ||
let bundle_dir_name = live_test_bundle_dir.file_name().unwrap(); | ||
cmd!(sh, "tar -czf {out_dir}/live-tests-bundle.tgz {bundle_dir_name}/") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
live-tests-bundle.tgz
maybe can be a constant.
let pkg_dir_name = env.out_dir.file_name().unwrap(); | ||
|
||
// Place output in `out/` when running locally, or leave in work dir for CI | ||
let artifact = Utf8PathBuf::from("a4x2-package-out.tgz"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This filename can be a constant too, I think.
let a4x2_dir = testbed_dir.join("a4x2"); | ||
|
||
// TODO move this clone & build to prepare, so that if the omicron build | ||
// takes awhile we wont have lost our git token when we get here. Relevant |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// takes awhile we wont have lost our git token when we get here. Relevant | |
// takes awhile, we won't have lost our git token when we get here. Relevant |
overview
This PR adds two new xtasks:
a4x2-package
This command will:
out/a4x2-package-out.tgz
a4x2-deploy
This command will:
By default, a4x2-deploy will use a bundle from
out/a4x2-package.tgz
, but an alternative can be specified with--package <path>
, so you can build a package and send it to someone else for them to run on their machine.Known shortcomings
Additionally, because of cloning the source to a temporary working path, any changes you have not
git add
ed will not be seen bya4x2-package
. I consider this either mildly useful or mildly confusing depending on what you're doing in the moment. The jj crew can of course just runjj
before usinga4x2-package
.We could opt instead of doing a
cp -a
of the working directory instead of a git clone, but this could be problematic due to copying potentially massivetarget/
directories. We could usersync
with excludes, but I'm not sure we can rely on it always being present. Thegit clone
seemed like the most reliable to me, and it has precedent in thereleng
xtask. But I do find this to be somewhat annoying when iterating on changes and am open to alternatives. Maybe we should just try to make a4x2 work in the current workdir without causing trouble.Aside on xshell
This PR adds a dependency on xshell, which I use heavily for running system commands. I figured I would take the opportunity to talk about my experience with it, and provide some context that may be helpful when reviewing this PR. I've really enjoyed using it and think it makes it quite pleasant to use rust as a bash replacement.
A few things to keep in mind when reading it, which you can learn from the docs, but I will also highlight here:
All commands are echoed by default
Similar to
set -x
in bash, commands are echoed as they are run. This is great for logs. Additionally, non-zero exit codes becomeErr
s, and command stdout/stderr are sent to the terminal by default, unless you make a point of capturing them.Command invocations are not echoed when you use
.read()
to capture stdoutcmd!()
is not sending your command into a shellcmd!()
is building a normal rustCommand
! If you see something likeYou may ask "will that escape
with_args
"? The answer is, there is no shell invocation, so there is nothing to escape! It is similar to as if you had donexshell is mostly convenience around
Command
, with a few other file access shorthands thrown in.sh.pushdir()
You will see a lot of a pattern
sh
has its own internal working directory it stores (separate from the process workdir, btw). All commands / operations run throughsh
are relative to that working directory.pushdir
changes that directory, and returns a guard. When that guard is dropped, the directory is popped.You may be wary if you have bitten by code like this before:
In the next version of xshell, it looks like
pushdir
is going away entirely, and will be replaced with approximately this:I'll deal with that when it happens. I'm sure the lambdas will make some things more annoying, but I think it'll be better that way.