Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: ListNodeImageVersions + shared image gallery support #526

Open
wants to merge 22 commits into
base: main
Choose a base branch
from

Conversation

Bryce-Soghigian
Copy link
Collaborator

@Bryce-Soghigian Bryce-Soghigian commented Oct 16, 2024

Fixes #

Description
This pr introduces

  1. USE_SIG option that is used to control if we should or should not use shared image galleries
  2. Shared Image Gallery ID Resolution
  3. New ListNodeImageVersions logic for getting the latest approved agentbaker node image version rather than using galleries as a source of truth

Open Qs:
Are we ok with all Nodes with Community Image gallery node images being drifted over to use SIG in managed karpenter with this change?

How was this change tested?

  • manual testing on MSFT Tenant SIG Galleries for Ubuntu and Azure Linux
  • make az-all
  • make test

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

Release Note


@coveralls
Copy link

coveralls commented Oct 16, 2024

Pull Request Test Coverage Report for Build 11371322239

Details

  • 470 of 547 (85.92%) changed or added relevant lines in 13 files are covered.
  • 19 unchanged lines in 4 files lost coverage.
  • Overall coverage decreased (-0.2%) to 97.675%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/providers/imagefamily/types.go 8 12 66.67%
pkg/providers/instance/azure_client.go 1 7 14.29%
pkg/providers/imagefamily/image.go 50 79 63.29%
pkg/providers/imagefamily/nodeimageversionsclient.go 0 38 0.0%
Files with Coverage Reduction New Missed Lines %
pkg/providers/instance/azure_client.go 1 15.0%
pkg/providers/imagefamily/image.go 1 69.81%
pkg/cache/unavailableofferings.go 2 95.45%
pkg/fake/communityimageversionsapi.go 15 46.67%
Totals Coverage Status
Change from base Build 11299903000: -0.2%
Covered Lines: 36716
Relevant Lines: 37590

💛 - Coveralls

func (n NodeImageVersionsAPI) List(_ context.Context, _, _ string) (imagefamily.NodeImageVersionsResponse, error) {
return imagefamily.NodeImageVersionsResponse{
Values: []imagefamily.NodeImageVersion{
{
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: Codegen this data

@Bryce-Soghigian Bryce-Soghigian marked this pull request as ready for review October 16, 2024 06:35

expectedImageID, err := c.imageProvider.GetImageID(ctx, communityImageName, publicGalleryURL, nodeClass.Spec.GetImageVersion())
expectedImageID, err := c.imageProvider.GetLatestImageID(ctx, imageStub, nodeClass.Spec.GetImageVersion())
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note this will trigger drift when we switch it on going from community to sig

@@ -90,6 +94,8 @@ func (o *Options) AddFlags(fs *coreoptions.FlagSet) {
fs.StringVar(&o.SubnetID, "vnet-subnet-id", env.WithDefaultString("VNET_SUBNET_ID", ""), "The default subnet ID to use for new nodes. This must be a valid ARM resource ID for subnet that does not overlap with the service CIDR or the pod CIDR")
fs.Var(newNodeIdentitiesValue(env.WithDefaultString("NODE_IDENTITIES", ""), &o.NodeIdentities), "node-identities", "User assigned identities for nodes.")
fs.StringVar(&o.NodeResourceGroup, "node-resource-group", env.WithDefaultString("AZURE_NODE_RESOURCE_GROUP", ""), "[REQUIRED] the resource group created and managed by AKS where the nodes live")
fs.BoolVar(&o.ManagedKarpenter, "managed-karpenter", env.WithDefaultBool("MANAGED_KARPENTER", false), "Whether Karpenter is managed by AKS or not.")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fs.BoolVar(&o.ManagedKarpenter, "managed-karpenter", env.WithDefaultBool("MANAGED_KARPENTER", false), "Whether Karpenter is managed by AKS or not.")
fs.BoolVar(&o.ManagedKarpenter, "managed-karpenter", env.WithDefaultBool("MANAGED_KARPENTER", false), "[MANAGED-FLAG] Whether Karpenter is managed by AKS or not.")

@@ -90,6 +94,8 @@ func (o *Options) AddFlags(fs *coreoptions.FlagSet) {
fs.StringVar(&o.SubnetID, "vnet-subnet-id", env.WithDefaultString("VNET_SUBNET_ID", ""), "The default subnet ID to use for new nodes. This must be a valid ARM resource ID for subnet that does not overlap with the service CIDR or the pod CIDR")
fs.Var(newNodeIdentitiesValue(env.WithDefaultString("NODE_IDENTITIES", ""), &o.NodeIdentities), "node-identities", "User assigned identities for nodes.")
fs.StringVar(&o.NodeResourceGroup, "node-resource-group", env.WithDefaultString("AZURE_NODE_RESOURCE_GROUP", ""), "[REQUIRED] the resource group created and managed by AKS where the nodes live")
fs.BoolVar(&o.ManagedKarpenter, "managed-karpenter", env.WithDefaultBool("MANAGED_KARPENTER", false), "Whether Karpenter is managed by AKS or not.")
fs.StringVar(&o.SIGSubscriptionID, "sig-subscription-id", env.WithDefaultString("SIG_SUBSCRIPTION_ID", ""), "The subscription ID of the shared image gallery.")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fs.StringVar(&o.SIGSubscriptionID, "sig-subscription-id", env.WithDefaultString("SIG_SUBSCRIPTION_ID", ""), "The subscription ID of the shared image gallery.")
fs.StringVar(&o.SIGSubscriptionID, "sig-subscription-id", env.WithDefaultString("SIG_SUBSCRIPTION_ID", ""), "[MANAGED-FLAG] The subscription ID of the shared image gallery we want to point to")

@@ -74,6 +74,10 @@ type Options struct {
SubnetID string // => VnetSubnetID to use (for nodes in Azure CNI Overlay and Azure CNI + pod subnet; for for nodes and pods in Azure CNI), unless overridden via AKSNodeClass
setFlags map[string]bool

ManagedKarpenter bool // => ManagedKarpenter is true if Karpenter is managed by AKS, false if it is a self-hosted karpenter installation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remind me: is there any way to prevent self-hosted users from setting this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, presumably they good set it if they wanted to, but after we put Agentbaker bootstrapping behind this flag bootstrapping won't work for them.

I guess we could also determine if we are using managed based on the kubeclients or build path(main_ccp.go vs main.go) but i thought we wanted to remove those and consolidate them eventually.

Copy link
Collaborator

@charliedmcb charliedmcb Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should be an enum?

Mode:

  • Managed
  • SelfHosted

With default SelfHosted?

@matthchr, feel like I remember best practice of using enums instead of bools?

}

const (
kubernetesVersionCacheKey = "kubernetesVersion"

// if imageVersion is equal to an "", then we will get the latest version
autoUpgradeMode = ""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are looking to remove imageVersion I think leaving this out for now would be best.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, missed this is a const. Makes sense for clarity. However, should still be getting removed.

Copy link
Collaborator

@tallaxes tallaxes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pease resolve merge conflicts (which should include removal of AKSNodeClass imageVersion)

// /subscriptions/{subscriptionID}/resourceGroups/{resourceGroup}/providers/Microsoft.Compute/galleries/{galleryName}/images/{imageDefinition}/versions/{imageVersion}.
// If imageVersion is an empty string, it fetches the latest version of the image and caches it for imageExpirationInterval (3 days).
func (p *Provider) getImageIDSIG(ctx context.Context, imgStub DefaultImageOutput, imageVersion string) (string, error) {
key := fmt.Sprintf(sharedImageGalleryImageIDFormat, options.FromContext(ctx).SIGSubscriptionID, imgStub.GalleryResourceGroup, imgStub.GalleryName, imgStub.ImageDefinition, imageVersion)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does subscription ID need to be part of the key?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could reconstruct the full id without the cache including subscription id. Especially since sub is a constant. Good catch.

if imageID, ok := p.imageCache.Get(key); ok {
return imageID.(string), nil
}
if imageVersion == autoUpgradeMode {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking forward to simplified code w/o imageVersion

imageID, imageExpirationInterval)
}
// return the latest version of the image from the cache after we have caached all of the imageDefinitions
if imageID, ok := p.imageCache.Get(key); ok {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What guarantees this key will be available?

Copy link
Collaborator Author

@Bryce-Soghigian Bryce-Soghigian Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We iterate through all the image definitions + image versions from ListNodeImageVersions call before this section and set them in the cache.

		for _, version := range versions.Values {
			imageID := fmt.Sprintf(sharedImageGalleryImageIDFormat, options.FromContext(ctx).SIGSubscriptionID, imgStub.GalleryResourceGroup, imgStub.GalleryName, version.SKU, version.Version)
			cacheKey := fmt.Sprintf(sharedImageGalleryImageIDFormat, options.FromContext(ctx).SIGSubscriptionID, imgStub.GalleryResourceGroup, imgStub.GalleryName, version.SKU, autoUpgradeMode)
			p.imageCache.Set(
				cacheKey,
				imageID, imageExpirationInterval)
		}

pkg/providers/imagefamily/image.go Outdated Show resolved Hide resolved
@@ -137,11 +140,13 @@ func NewAZClient(ctx context.Context, cfg *auth.Config, env *azure.Environment)
}
klog.V(5).Infof("Created azure resource graph client %v, using a token credential", azureResourceGraphClient)

imageVersionsClient, err := armcomputev5.NewCommunityGalleryImageVersionsClient(cfg.SubscriptionID, cred, opts)
communityImageVersionsClient, err := armcomputev5.NewCommunityGalleryImageVersionsClient(cfg.SubscriptionID, cred, opts)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This client may not be needed, based on SIG flag, right? Would be good to only instantiate clients that are needed.

@@ -90,6 +94,8 @@ func (o *Options) AddFlags(fs *coreoptions.FlagSet) {
fs.StringVar(&o.SubnetID, "vnet-subnet-id", env.WithDefaultString("VNET_SUBNET_ID", ""), "The default subnet ID to use for new nodes. This must be a valid ARM resource ID for subnet that does not overlap with the service CIDR or the pod CIDR")
fs.Var(newNodeIdentitiesValue(env.WithDefaultString("NODE_IDENTITIES", ""), &o.NodeIdentities), "node-identities", "User assigned identities for nodes.")
fs.StringVar(&o.NodeResourceGroup, "node-resource-group", env.WithDefaultString("AZURE_NODE_RESOURCE_GROUP", ""), "[REQUIRED] the resource group created and managed by AKS where the nodes live")
fs.BoolVar(&o.ManagedKarpenter, "use-sig", env.WithDefaultBool("USE_SIG", false), "If set to true karpenter will use the AKS managed shared image galleries and the node image versions api. If set to false karpenter will use community image galleries. Only a subset of image features will be available in the community image galleries and this flag is only for the managed node provisioning addon.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename the internal variable?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

// GetLatestImageID returns the latest image ID for the given image version and image specification
func (p *Provider) GetLatestImageID(ctx context.Context, defaultImage DefaultImageOutput, imageVersion string) (string, error) {
// Managed Karpenter will use the AKS Managed Shared Image Galleries
if options.FromContext(ctx).ManagedKarpenter {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can one cache here, instead of later?

@Bryce-Soghigian Bryce-Soghigian force-pushed the bsoghigian/list-node-image-versions-poc branch from 321f082 to cc07eb7 Compare November 15, 2024 09:41
@Bryce-Soghigian
Copy link
Collaborator Author

<TODO: Move from 04-02 preview to 09-02 preview>

Copy link
Collaborator

@tallaxes tallaxes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failing CI

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants