Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move to vanilla firecracker snapshots #816

Conversation

CuriousGeorgiy
Copy link
Member

@CuriousGeorgiy CuriousGeorgiy commented Sep 11, 2023

Summary

Closes #794

Implementation Notes ⚒️

  • Replaced the tap manager with the new network manager and dropped the tap manager;
  • Reworked snapshot manager to maintain one list of snapshots with a new ready feature;
  • Created a device snapshot for the base container image during patch creation;
  • Dropped VM offloading APIs and replaced them with snapshot creation or VM shutdown;
  • Added a new network pool size option to the orchestrator;
  • Updated firecracker-related binaries;
  • Updated firecracker integration code;
  • Added basic remote snapshot create and load tests.

Snapshot revisions

Snapshots are now identified by a function instance revision, which is naturally provided by the K_REVISION environment variable of knative1.

External Dependencies 🍀

Breaking API Changes ⚠️

  • N/A.

Footnotes

  1. https://knative.dev/docs/concepts/serving-resources/revisions/

@CuriousGeorgiy CuriousGeorgiy changed the title Support vanilla firecracker snapshots Move to vanilla firecracker snapshots Sep 11, 2023
@CuriousGeorgiy CuriousGeorgiy self-assigned this Sep 11, 2023
@CuriousGeorgiy CuriousGeorgiy force-pushed the vanilla-firecracker-snapshots-impl branch 19 times, most recently from b0d35ac to 0fb1459 Compare September 12, 2023 14:30
@CuriousGeorgiy CuriousGeorgiy changed the base branch from main to firecracker-v1.4.1-vhive-integration September 12, 2023 14:30
@CuriousGeorgiy CuriousGeorgiy force-pushed the vanilla-firecracker-snapshots-impl branch 7 times, most recently from 2001bcd to 9013471 Compare September 12, 2023 15:14
@CuriousGeorgiy CuriousGeorgiy changed the base branch from firecracker-v1.4.1-vhive-integration to main September 26, 2023 12:32
ustiugov
ustiugov previously approved these changes Sep 26, 2023
@CuriousGeorgiy CuriousGeorgiy force-pushed the vanilla-firecracker-snapshots-impl branch from a05913f to 93fbcd2 Compare September 26, 2023 17:07
@leokondrashov
Copy link
Contributor

@CuriousGeorgiy, there are several errors with lint and compiler (unused import and formatting). Please fix and push.

@CuriousGeorgiy
Copy link
Member Author

@leokondrashov done.

@CuriousGeorgiy CuriousGeorgiy force-pushed the vanilla-firecracker-snapshots-impl branch from a4e60a3 to 05855e1 Compare September 27, 2023 08:47
@ustiugov ustiugov force-pushed the vanilla-firecracker-snapshots-impl branch from 05855e1 to 7e45704 Compare October 3, 2023 03:28
@lrq619
Copy link
Collaborator

lrq619 commented Oct 10, 2023

Hello George, codes on this branch cannot pass firecracker-cri tests.
Codes on the main branch can pass tests so it's not the problem of runners. You can see here: https://github.com/vhive-serverless/vHive/actions/runs/6460688210
image

Please check what is broken in the branch and make sure it passes the firecracker-cri tests, thank you!

@CuriousGeorgiy
Copy link
Member Author

CuriousGeorgiy commented Oct 10, 2023

@lrq619 Hi! What you are saying is true, but at the same time:

  1. If I set up the CRI environment on a fresh CloudLab node, and run the CRI tests, everything works fine.
  2. I cannot see any error messages or signs of something going wrong in the logs.

So I don't know how to reproduce this issue, and I am not sure it is inherently related to my changes, it may as well be something like #781 or something related to #836.

I have already addressed this to @leokondrashov and kindly asked him to take a look.

@ustiugov
Copy link
Member

let's postpone the gvisor upgrade and merge this PR. @leokondrashov, can you please proceed to a merge without the gvisor upgrade. in case of runner failure, please Ruiqi for help

UPF feature for vanilla firecracker requires a separate integration effort
(vhive-serverless#807), so temporarily disable this feature and all tests for it.

Signed-off-by: Georgiy Lebedev <[email protected]>
@ustiugov
Copy link
Member

ustiugov commented Oct 13, 2023

@CuriousGeorgiy can you please rebase your branch so we can check the latest CI status? If the problem indeed persists, you have to debug it. if the problem is with the runners, we will take over (but as tests pass in main, it doesn't seem to be the case)

I cannot see any error messages or signs of something going wrong in the logs.

the logs are uploaded as artifacts, please check there.

in future, please have your branches inside the vHive upstream repo, otherwise we cannot take over tasks from you.

@leokondrashov leokondrashov force-pushed the vanilla-firecracker-snapshots-impl branch from 7e45704 to cf5f5e6 Compare October 13, 2023 09:23
@leokondrashov
Copy link
Contributor

The issue with PRs from forks was not connected to the accesses. It was pre-push hook that was trying to access fork's LFS which we don't need.

@CuriousGeorgiy
Copy link
Member Author

CuriousGeorgiy commented Oct 13, 2023

@ustiugov

the logs are uploaded as artifacts, please check there.

As I said, I have checked the artifacts and haven't found any errors compared to the main branch.

I have also tried setting up the CRI environment on a fresh CloudLab node (it succeeds) and running CRI tests there (it also succeeds), so I am not sure how I can debug the issue without access to the runners.

@CuriousGeorgiy CuriousGeorgiy force-pushed the vanilla-firecracker-snapshots-impl branch from cf5f5e6 to 14e7759 Compare October 13, 2023 09:57
@CuriousGeorgiy
Copy link
Member Author

CuriousGeorgiy commented Oct 13, 2023

@leokondrashov 1 meaningful difference I see in the containerd logs between this branch and the main branch is that on this branch containerd doesn't get any more requests after the istio-proxy pod is (successfully) created.

@leokondrashov
Copy link
Contributor

I managed to make the test work. This was a strange behaviour of runner.

Later the result might change because Ruiqi will be testing runners once again. But we can merge, finally.

@leokondrashov leokondrashov merged commit 8e30e36 into vhive-serverless:main Oct 16, 2023
21 of 22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Move to vanilla firecracker snapshots
4 participants