Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: rework options flow to have a more consistent pattern #390

Draft
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

comtalyst
Copy link
Collaborator

Fixes #

Description

How was this change tested?

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

Release Note


)
imageResolver := imagefamily.New(
operator.GetClient(),
imageProvider,
)
launchTemplateProvider := launchtemplate.NewProvider(
ctx,
opts,
Copy link
Collaborator Author

@comtalyst comtalyst Jun 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be the highlight worth discussing.

Prior to this, in addition to passing the options (and azConfig) directly that you can see in the changes preview, some of the providers did call options.FromContext(ctx) again. I find this inconsistent and would prefer to stick to one side—either pass all required parameters here, or just pass in ctx/opts.

The current code chooses passing in opts. This will make the providers depends on the knowledge of options (no change), while sparing operator from having to know which provider needs what. This is particularly helpful in launchTemplateProvider, which is have a lot of demands for fields in options.

With that said, each provider have different demands in options fields, but ideally I want to keep the dependencies/pattern consistent for all providers.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, we could stay explicit on the interface and pass all variables that we currently getting from options.FromContext(ctx) through here instead. The interface will be be big for high-demanding providers, but we will be able to cut their dependence on options.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another alternative is to add some layering in options to categorize them by uses, appearing at "sub-structs".
But this will force options to contain knowledge about the exact uses of their fields, which is going to add additional complexity, given some fields are shared with multiple uses.

An extension to that is to have additional layer between options (which will mostly take envs in raw) and this layer to map those to its provider uses. But that would be no difference that just creating structs here, which will only be better than explicit passing by extensibility and small reusability, at the cost of additional complexity.

Copy link
Collaborator Author

@comtalyst comtalyst Jun 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But all these options are not really making use of options.FromContext(ctx) pattern established by karpenter-core and AWS provider. If we want to keep that, another options is to just pass in ctx to all places and let options.FromContext(ctx) retrieve the options whenever it is needed. This will assume that all instances here are singletons or at least have no differences between each, which is true(?), for now.

If you ask me, I just don't think it is a better option than others. Opinions in offline discussions I had seems to be against this pattern.
Although it is probably the best in terms of consistency.

ClusterEndpoint string // => APIServerName in bootstrap, except needs to be w/o https/port
VMMemoryOverheadPercent float64
ClusterID string
// Target cluster information; might be use for both bootstrapping and ARM authentications
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea behind this is I try to categorize this by user-facing behavior (e.g., target cluster for the operations to occur, what will be on provisioned nodes) rather than implementation details.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some should be renamed. But let's put that in a different PR

func (o *Options) Default(ctx context.Context) error {
var err error

var authManager *auth.AuthManager
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a flaw in the implementation, as authManager is created in both here and operator. Given core, there is nothing much we can do(?) if we want to follow this pattern.

ClusterName string
ClusterEndpoint string // => APIServerName in bootstrap, except needs to be w/o https/port
ClusterID string
APIServerName string
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and VnetGUID (that I just added) are the options that are not available to be configurable directly by the customer, but directly depends on other options through defaulting. Open to suggestions if this should be elsewhere.

@@ -44,13 +44,17 @@ import (
type Controller struct {
kubeClient client.Client
instanceProvider *instance.Provider

// Will be used to calculate the goal state
opts *options.Options
Copy link
Collaborator Author

@comtalyst comtalyst Jun 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another instance of "not sure if options is overkill", see the comments in operator.go for similar discussions. But if you ask me I think this is appropriate given the nature of the controllers.

Copy link
Collaborator

@charliedmcb charliedmcb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just did a quick scan. I like the direction of streamlining options. However, would recommend that we keep the naming generic since the flow doesn't necessarily have to just be used for workload identities.

value: ""
- name: AZURE_NODE_RESOURCE_GROUP
value: ${AZURE_RESOURCE_GROUP_MC}
- name: ARM_AUTH_METHOD
value: "workload-identity"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the reason its called "from environment" is because its actually using a more generic/general auth option that finds the creds in the environment. This is the flow that works for workload-identity. However, I think the naming should probably keep the more generic term since that is what its actually doing. Unless you have more tied to this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, by theory, there could be "from environment" auth option that is not workload identity, right?
If that's the case then I think keeping "from-environment" would be a better choice, given we doesn't seem to have ties to workload identity (right?).

pkg/auth/authmanager.go Outdated Show resolved Hide resolved
Comment on lines +59 to +76
// NewCredential provides a token credential
func (am AuthManager) NewCredential() (azcore.TokenCredential, error) {
if am.authMethod == AuthMethodWorkloadIdentity {
klog.V(2).Infoln("cred: using workload identity for new credential")
return azidentity.NewDefaultAzureCredential(nil)
}

if am.authMethod == AuthMethodSysMSI {
klog.V(2).Infoln("cred: using system assigned MSI for new credential")
msiCred, err := azidentity.NewManagedIdentityCredential(nil)
if err != nil {
return nil, err
}
return msiCred, nil
}

return nil, fmt.Errorf("cred: unsupported auth method: %s", am.authMethod)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this belong in the shared SDK? For my token fix i imagine other ppl may want the same fix. Uing azure sdk for go extensions may be a good idea for our auth so we can share that logic too. Should we move this there?

controllers := []controller.Controller{
nodeclaimgarbagecollection.NewController(kubeClient, cloudProvider),
inplaceupdate.NewController(kubeClient, instanceProvider),
inplaceupdate.NewController(kubeClient, instanceProvider, opts),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just options.FromContext(ctx)? or for that matter why not just pass in ctx so that future updates to the controller can access everything from the context object? Are we refactoring options out of ctx?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(discussed offline)
Although both share the same principle, options.FromContext(ctx) seems to be the pattern that is generally discouraged by all opinions we have discussed with, except karpenter-core.
For my own reasoning, this would allow us to separate options from ctx, which might wanted to be used for other purposes. We could still argue that options still contain things that might not be used, but it still allow extensibility and hide the knowledge of "what this component requires" from the creation interface. So I would say it is something in between ctx and explicitly passing parameters.

Anyway, it is not a strong opinion.

)

// Default sets the default values for the options, but only for those too complicated to be in env default (e.g., depends on other envs)
func (o *Options) Default(ctx context.Context) error {
Copy link
Collaborator

@Bryce-Soghigian Bryce-Soghigian Jun 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are going to go this far, can't we just resolve a lot of the configuration from having clustername and rg?

Why make users pass in things like "NetworkDataplane" or "NetworkPluginMode"? Why not just use the AKS SDK to dynamically resolve that configuration for them?

For Self hosted at least that might make things easier, if the values aren't updated consistently between the two models on restart for managed that might be a problem.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a bad idea. Now that we establish this framework, we could consider simplifying the integration interface and get obvious fields by ourselves. A trade-off for this is Karpenter will have more responsibility of "getting required info of a cluster" which could at least lead to having more traffic at the initialization.

I think the changes for this, if we decided to, should be in a separate PR at least. But we can start discussing here if we have strong ideas.

}

if o.ClusterID, err = getAKSClusterID(o.APIServerName); err != nil {
return fmt.Errorf("failed to get ClusterID: %w", err)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we make these errors typed? Makes this part easier to test, rachel left this feedback in another of my prs but you touch this part a lot more so lmk what you think about changing all of these to typed errors

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be a nice to have, but not a necessity at the moment.

Comment on lines +65 to +75
// getAKSClusterID returns cluster ID based on the DNS prefix of the cluster.
// The logic comes from AgentBaker and other places, originally from aks-engine
// with the additional assumption of DNS prefix being the first 33 chars of FQDN
func getAKSClusterID(apiServerFQDN string) (string, error) {
dnsPrefix := apiServerFQDN[:33]
h := fnv.New64a()
h.Write([]byte(dnsPrefix))
r := rand.New(rand.NewSource(int64(h.Sum64()))) //nolint:gosec
return fmt.Sprintf("%08d", r.Uint32())[:8], nil
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General question, does this belong in agentbaker? We stole it from agentbaker will self contained have this in the pkg/agent/common or wahtever we can just import rather than redefining it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I will dive deeper into this along with the contract interface rework. Hate

Comment on lines +101 to +107
func contains(slice []string, target string) bool {
for _, element := range slice {
if target == element {
return true
}
}
return false
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kubernetes helpers for slices should have this already

}

if o.NodeResourceGroup == "" {
return fmt.Errorf("missing field, node-resource-group")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment about shared error type. we really only care that

  1. its missing
  2. what is missing

@@ -40,7 +40,7 @@ type AKS struct {
Arch string
TenantID string
SubscriptionID string
UserAssignedIdentityID string
KubeletIdentityClientID string
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great name change!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we worry about these values being different from those in azure json? So people may grep for UserAssignedIdentityID since its called that everywhere else?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(discussed offline)

It is better this way given that the UserAssignedIdentityID that is known by azure.json is in fact KubeletIdentityClientID that is known by the cluster. This confusion is the prime motivation for this entire work.

Currently, I choose to keep this variable name until we feed it to the bootstrap module. There could be even more opportunities around that area to make it even more clear...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we have this knowledge, we should be aware when coming across their uses in those "everywhere else." Maybe we can catch some big bugs.

return &Provider{
kubernetesVersionCache: kubernetesVersionCache,
imageCache: cache.New(imageExpirationInterval, imageCacheCleaningInterval),
location: location,
location: opts.Location,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just put options in the Provider struct itself and we can access location from that? I also will need some values from options in some of these providers in this pr https://github.com/Azure/karpenter-provider-azure/pull/365/files so that pattern would make access a bit easier.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(discussed offline)

I am still hesitate to make these providers depends on options entirely.
Right now it does module-wise, but not the struct, as the only reference to options is on creation function (NewProvider). Maybe the better name to this is NewProviderFromOptions.

But again, not a strong opinion. Wondering what do others think? Should we be more extreme on one side (e.g., using opts/ctx everywhere)?

memory := resources.Quantity(fmt.Sprintf("%dGi", int64(memoryGiB(sku))))
// Account for VM overhead in calculation
memory.Sub(resource.MustParse(fmt.Sprintf("%dMi", int64(math.Ceil(
float64(memory.Value())*options.FromContext(ctx).VMMemoryOverheadPercent/1024/1024)))))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this come from core?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But, yes, this will introduce a difference between the two.
In general, I think our changes it will be for the better (at least, intention-wise). But the trade-off is consistency.

In this case, I am not completely sure, but leaning towards keeping current change.
Current changes mean consistency with this pattern within our provider (e.g., providers).
Restoring to original means consistency with the AWS provider. But regardless whether we keep other changes or restore all, inconsistency remains within our provider, unless we become more extreme with the use of ctx.

// TODO(bsoghigian): this should be refactored to lo.Ternary(nodeClass.Spec.VnetSubnetID != nil, lo.FromPtr(nodeClass.Spec.VnetSubnetID), os.Getenv("AZURE_SUBNET_ID")) when we add VnetSubnetID to the nodeclass
vnetSubnetComponents, err := utils.GetVnetSubnetIDComponents(options.FromContext(ctx).SubnetID)
func (p *Provider) getVnetInfoLabels(_ *v1alpha2.AKSNodeClass) (map[string]string, error) {
// TODO(bsoghigian): this should be refactored to lo.Ternary(nodeClass.Spec.SubnetID != nil, lo.FromPtr(nodeClass.Spec.SubnetID), p.opts.SubnetID) when we add VnetSubnetID to the nodeclass
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you sure we should differ from the AKS Agentpool api in what we call SubnetID? We discussed this somewhere on the other PR and i thought we agreed on calling it VnetSubnetID

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I named it SubnetID from the start but got feedback we should change it to be VNETSubnetID

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this change is a mistake from mass text replace. My bad.

Copy link
Collaborator Author

@comtalyst comtalyst Jun 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But there will be similar changes (not in this PR) that I intend to rename all SubnetID to VnetSubnetID. We can discuss more later along with the rest of the renames.

@@ -64,7 +65,9 @@ func NewAPI() client.PricingAPI {
return client.New()
}

func NewProvider(ctx context.Context, pricing client.PricingAPI, region string, startAsync <-chan struct{}) *Provider {
func NewProvider(ctx context.Context, opts *options.Options, pricing client.PricingAPI, startAsync <-chan struct{}) *Provider {
region := opts.Location
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just use opts.Location in the map lookup?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is being used in other places (at least line 76/79) as well. But I don't think any choice matters.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants