
test/e2e: libvirt: Try and reduce the resource usage of the kcli cluster #2117

stevenhorsman opened this issue Oct 14, 2024 · 13 comments

@stevenhorsman (Member)

At the moment in the libvirt testing we are using the default node size. This leads to the situation where each of the worker and control-plane nodes defaults to 4 vCPUs and 6GB RAM:

```
# kcli info vm peer-pods-worker-0
name: peer-pods-worker-0
id: a5ce3795-a67c-4daf-854d-62df6736081b
creationdate: 09-10-2024 10:35
status: up
autostart: False
image: ubuntu2204
user: ubuntu
plan: peer-pods
profile: kvirt
numcpus: 4
memory: 6144
```

In an ideal world we'd like to reduce our test footprint to fit inside the GitHub-hosted runner, which is a 4 vCPU / 16GB machine.

Our peer pod VM is currently using 2 vCPUs and 8GB of its own, which we are working on reducing, but the 8 vCPUs and 12GB RAM that the kcli cluster uses is way too big. Actually reducing this shouldn't be too tricky, as I think it's just a matter of editing the default parameters we pass in kcli_cluster.sh; the tricky bit is working out the minimum resources we can get away with without impacting the tests, so looking at the resource usage on an existing cluster might help there.
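As a sketch of the parameter change described above (the `-P` key names, the cluster name, and the candidate values are all assumptions to verify against the kcli docs and what kcli_cluster.sh actually passes today):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of shrinking the node sizes passed to kcli.
# numcpus/memory/workers follow kcli's -P parameter convention; the exact
# keys should be checked before adopting this.
CLUSTER_NAME="peer-pods"
NODE_CPUS=2        # candidate value, down from the 4 vCPU default
NODE_MEMORY=4096   # MB, candidate value, down from the 6144 default

CMD="kcli create kube generic \
  -P numcpus=${NODE_CPUS} \
  -P memory=${NODE_MEMORY} \
  -P workers=1 \
  ${CLUSTER_NAME}"

# Dry-run: print the command rather than executing it, since the right
# minimums still need to be found empirically.
echo "${CMD}"
```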

@stevenhorsman (Member, Author)

@wainersm, @mkulke do you know the size of the az-ubuntu-2204 runners (which I think are created by garm) at the moment, as a reference point?

@mkulke (Collaborator) commented Oct 14, 2024

> @wainersm, @mkulke do you know the size of the az-ubuntu-2204 (that I think are created by garm) at the moment as a reference point?

https://cloudprice.net/vm/Standard_D4s_v4

Currently we run the tests on a 4 vCPU / 16GB RAM machine.

@stevenhorsman (Member, Author)

> https://cloudprice.net/vm/Standard_D4s_v4

That's very interesting: if it's a 4x16 machine then it's the same size as the GitHub-hosted runners (and might explain some of the libvirt CI flakiness, as we try to squeeze 10 vCPUs and 20GB RAM out of a 4x16 box!). Maybe we can try out libvirt e2e on a GitHub-hosted runner now... (I'll be back with results)

@stevenhorsman (Member, Author)

It failed (https://github.com/stevenhorsman/cloud-api-adaptor/actions/runs/11331308718/job/31511037253) with:

```
time="2024-10-14T16:25:22Z" level=info msg="Installing peerpod-ctrl"
F1014 16:25:22.856817   22292 env.go:369] Setup failure: exit status 2
Error: No space left on device : '/home/runner/runners/2.320.0/_diag/pages/ca098555-ffe8-4649-a147-31a3909d57d3_9f5eba70-d33b-5377-96f2-d94c82946629_1.log'
```

The GH runners have 14GB of storage, so maybe that isn't enough; that might be another path to investigate.
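To confirm where the space actually goes, a purely diagnostic snippet like this (paths are illustrative) could be dropped into the workflow before and after the heavy steps:

```shell
# Report free space on the root filesystem, then the biggest directories in
# the current workspace, to see what eats the ~14GB on the runner.
df -h /
du -sh ./* 2>/dev/null | sort -rh | head -n 5
```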

@mkulke (Collaborator) commented Oct 14, 2024

> It failed (https://github.com/stevenhorsman/cloud-api-adaptor/actions/runs/11331308718/job/31511037253) with:
>
> ```
> time="2024-10-14T16:25:22Z" level=info msg="Installing peerpod-ctrl"
> F1014 16:25:22.856817   22292 env.go:369] Setup failure: exit status 2
> Error: No space left on device : '/home/runner/runners/2.320.0/_diag/pages/ca098555-ffe8-4649-a147-31a3909d57d3_9f5eba70-d33b-5377-96f2-d94c82946629_1.log'
> ```
>
> The GH runners have 14GB of storage, so maybe that isn't enough, so it might be another path to investigate

Currently we build the kbs-client with Rust, which can produce a surprisingly large target folder. We could either clean that up or download the kbs-client via oras?
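If we stick with building locally, one cheap mitigation is to keep only the built binary and drop the rest of the target folder. This is a sketch only: the checkout path and binary location are assumptions based on a standard cargo layout.

```shell
# Copy the kbs-client binary out of the cargo build tree, then delete the
# multi-gigabyte target directory to reclaim runner disk space.
prune_kbs_build() {
  local kbs_dir="$1"
  local bin="${kbs_dir}/target/release/kbs-client"
  if [ -f "${bin}" ]; then
    cp "${bin}" ./kbs-client
    rm -rf "${kbs_dir}/target"
  fi
}

# Example (the ./kbs directory name is hypothetical):
# prune_kbs_build ./kbs
```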

@stevenhorsman (Member, Author) commented Oct 14, 2024

> download the kbs-client via oras

Yeah, I think that would be great. I'll try out the e2e tests without the KBS section and see if that helps and also re-run it once the caching PR is merged 😃

@stevenhorsman (Member, Author)

Ooh - cutting out the KBS deployment and test meant the gh-runner tests worked: https://github.com/stevenhorsman/cloud-api-adaptor/actions/runs/11331632234/job/31512084686 😃

@mkulke (Collaborator) commented Oct 15, 2024

that's great. if we change this line:

to

```shell
oras pull "ghcr.io/confidential-containers/staged-images/kbs-client:sample_only-x86_64-linux-gnu-${KBS_SHA}"
chmod +x ./kbs-client
```

we can also drop the rust toolchain installation

@stevenhorsman (Member, Author)

Cool - I'll give that a try in my fork

@stevenhorsman (Member, Author)

`TestLibvirtKbsKeyRelease/KbsKeyReleasePod_test` failed trying that approach: https://github.com/stevenhorsman/cloud-api-adaptor/actions/runs/11342464633/job/31542928046

When I get a chance I'll try and re-create and debug locally

@mkulke (Collaborator) commented Oct 15, 2024

> `TestLibvirtKbsKeyRelease/KbsKeyReleasePod_test` failed trying that approach: https://github.com/stevenhorsman/cloud-api-adaptor/actions/runs/11342464633/job/31542928046
>
> When I get a chance I'll try and re-create and debug locally

Hmm, I've never tested it either; the whole kbs-client business is a bit of a black box to me, so the available binary might not work. An alternative would be to build the kbs-client in another job and pass it around as an artifact.

- Upside: faster builds, because it can be built in parallel before the test, and it doesn't consume space on the test instance.
- Downside: `checkout-kbs.sh` needs untangling, since it performs both cloning of the kbs repo and building of the kbs-client.
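The untangling could look something like this. It is a sketch only: the repo URL, the cargo package name, and the real contents of checkout-kbs.sh are assumptions.

```shell
# Hypothetical split of checkout-kbs.sh into separate checkout and build
# steps, so CI can run the build in a parallel job and ship ./kbs-client
# as an artifact to the test job.
kbs_checkout() {
  git clone https://github.com/confidential-containers/kbs "$1"
}

kbs_build() {
  # The package name "kbs-client" is an assumption about the kbs workspace.
  (cd "$1" && cargo build --release -p kbs-client) &&
    cp "$1/target/release/kbs-client" ./kbs-client
}

kbs_main() {
  case "${1:-}" in
    checkout) kbs_checkout "${2:-kbs}" ;;
    build)    kbs_build    "${2:-kbs}" ;;
    *)        echo "usage: checkout-kbs.sh {checkout|build} [dir]"; return 1 ;;
  esac
}
```

The checkout job would run `checkout` and `build`, upload `./kbs-client` as an artifact, and the test job would only download it.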

@stevenhorsman (Member, Author)

Ah - the kbs-client is extracted to the wrong directory, as it's built to target/release. I'll try and fix that and see if that helps. I also want to understand why we aren't hitting errors in the e2e tests when first trying to use a non-existent file.

@stevenhorsman (Member, Author)

> I also want to understand why we aren't hitting errors in the e2e tests first trying to use a non-existing file?

So we just ignore any errors thrown in the kbs client code. I thought I remembered fixing that, but https://github.com/confidential-containers/cloud-api-adaptor/pull/2055/files hasn't merged yet.

> (T)he kbs client is extracted to the wrong directory as it's build to targets/release.

I think we can just move the expectation for the client to be in the kbs directory directly, but I'm not sure if anyone would still want to build their own version of it.
