Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Race conditions in Targeted Deletion of machines by CA #341

Open
wants to merge 17 commits into
base: machine-controller-manager-provider
Choose a base branch
from

Conversation

rishabh-11
Copy link

@rishabh-11 rishabh-11 commented Dec 20, 2024

What this PR does / why we need it:
This PR fixes the issues noticed in live issues 6120 and 6101. We introduce a mutex in the NodeGroupImpl struct (node group implementation) which must be acquired before performing a scale-down or scale-up operation.

We also introduce an annotation on the MachineDeployment TriggerDeletionByMCM = "node.machine.sapcloud.io/trigger-deletion-by-mcm" whose value denotes the machines CA wants to remove and then atomically update replica count and annotation as one single step. This will help recognize machines for which CA has already reduced the replicas of the MachineDeployment and prevent it from being duplicated - race conditions are avoided when CA scaledowns are occurring in parallel. The MCM is also updated to check this TriggerDeletionByMCM annotation on the MachineDeployment.

The CA no longer directly sets the MachinePriority=1 annotation on the Machine objects. This is now done by the MCM controller when reconciling the updated MachineDeployment.

Which issue(s) this PR fixes:
Fixes #342

Special notes for your reviewer:

Release note:

Fix for data-races in concurrent CA scaledowns and CA restarts

@gardener-robot gardener-robot added needs/review Needs review size/m Size of pull request is medium (see gardener-robot robot/bots/size.py) labels Dec 20, 2024
@gardener-robot-ci-3 gardener-robot-ci-3 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Dec 20, 2024
@rishabh-11 rishabh-11 marked this pull request as ready for review December 24, 2024 08:31
@rishabh-11 rishabh-11 requested review from unmarshall and a team as code owners December 24, 2024 08:31
klog.Infof("machine deployment %s is under rolling update, skipping", machineDeployment.Name)
return nil
}
markedMachines := sets.New(strings.Split(mcd.Annotations[machinesMarkedByCAForDeletion], ",")...)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move this to a small function mcm.getMachinesMarkedByCAForDeletion(mcd) (machineNames sets.Set[string]) which is unit testable and can be consumed by both the mcm_cloud_provider.go and mcm_manager.go

klog.Errorf("[Refresh] failed to get machines for machine deployment %s, hence skipping it. Err: %v", machineDeployment.Name, err.Error())
return err
}
var incorrectlyMarkedMachines []*Ref

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we can omit the Ref struct and just used types.NamespacedName which is already being used anyways in this file

@gardener-robot gardener-robot added size/l Size of pull request is large (see gardener-robot robot/bots/size.py) needs/second-opinion Needs second review by someone else labels Dec 24, 2024
@rishabh-11 rishabh-11 requested a review from elankath December 24, 2024 12:23
@gardener-robot gardener-robot removed the size/m Size of pull request is medium (see gardener-robot robot/bots/size.py) label Dec 24, 2024
@gardener-robot-ci-3 gardener-robot-ci-3 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Dec 24, 2024
@gardener-robot-ci-2 gardener-robot-ci-2 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Dec 24, 2024
@@ -539,12 +567,17 @@ func buildMachineDeploymentFromSpec(value string, mcmManager *McmManager) (*Mach
return machinedeployment, nil
}

func getMachinesMarkedByCAForDeletion(mcd *v1alpha1.MachineDeployment) sets.Set[string] {
Copy link

@elankath elankath Dec 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps using the "comma, ok" idiom here for mcd.Annotations[machinesMarkedByCAForDeletion] would be good for clarity to avoid unnecessary split of empty string.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it allowed to read from a nil map? Only writes will cause panic, right?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getMachinesMarkedByCAForDeletion->getMachineNamesMarkedByCAForDeletion

// ignore the machine deployment if it is in rolling update
if !isRollingUpdateFinished(mcd) {
klog.Infof("machine deployment %s is under rolling update, skipping", machineDeployment.Name)
return nil

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we return an error here ? We are doing it for other cases of !isRollingUpdateFinished.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should remove this check. Even if a machineDeployment is under rolling update, we should allow the annotation update if needed. wdyt?

// addNodeGroup adds node group defined in string spec. Format:
// minNodes:maxNodes:namespace.machineDeploymentName
func (m *McmManager) addNodeGroup(spec string) error {
machineDeployment, err := buildMachineDeploymentFromSpec(spec, m)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buildMachineDeploymentFromSpec should also be moved to mcm_manager.go.

func (machineDeployment *MachineDeployment) Refresh() error {
machineDeployment.scalingMutex.Lock()
defer machineDeployment.scalingMutex.Unlock()
mcd, err := machineDeployment.mcmManager.machineDeploymentLister.MachineDeployments(machineDeployment.Namespace).Get(machineDeployment.Name)
Copy link

@elankath elankath Dec 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is repeated nearly a dozen times everywhere including common error handling: machineDeployment.mcmManager.machineDeploymentLister.MachineDeployments(machineDeployment.Namespace).Get(machineDeployment.Name) . Move to a method in mcmManager called GetMachineDeploymentResource which returns a formatted error that can simply be returned, so that error message is fully consistent with all uses. we are already having methods like mcmManager.getMachinesForMachineDeployment so this matches the existing convention.

if machine.Status.CurrentStatus.Phase == v1alpha1.MachineTerminating || machine.Status.CurrentStatus.Phase == v1alpha1.MachineFailed {
continue
}
if annotValue, ok := machine.Annotations[machinePriorityAnnotation]; ok && annotValue == priorityValueForCandidateMachines && !markedMachines.Has(machine.Name) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, Why is this called priorityValueForCandidateMachines ? Shouldn't it be defined as const PriorityDeletionValueForCandidateMachine = 1 ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can rename it to PriorityValueForDeletionCandidateMachine. wdyt?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should also add a comment here to explain what we are doing so next guy making a patch doesn't scratch his head.

}
}
clone := mcd.DeepCopy()
clone.Annotations[machinesMarkedByCAForDeletion] = strings.Join(updatedMarkedMachines, ",")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This strings.Join to construct the annotation repeated elsewhere. Make a utility method called getMarkedForDeletionAnnotationValue(machineNames []string) string

klog.Errorf("[Refresh] failed to get machines for machine deployment %s, hence skipping it. Err: %v", machineDeployment.Name, err.Error())
return err
}
var incorrectlyMarkedMachines []*types.NamespacedName

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor (need not correct): types.NamespacedName is a simple struct - a value object. Modern programming practices discourages use of pointers to value objects for keeping data inline for processor cache performance. Pointers should be used only for structs holding active resource (like locks/files/etc) or class/service objects. Using a simple []types.NamespacedName and using types.NamespacedName instead of *types.NamespacedName should be followed for newer code.

@@ -539,12 +567,17 @@ func buildMachineDeploymentFromSpec(value string, mcmManager *McmManager) (*Mach
return machinedeployment, nil
}

func getMachinesMarkedByCAForDeletion(mcd *v1alpha1.MachineDeployment) sets.Set[string] {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Name function to getMachineNamesMarkedByCAForDeletion

@@ -508,27 +474,32 @@ func (m *McmManager) DeleteMachines(targetMachineRefs []*Ref) error {
if !isRollingUpdateFinished(md) {
return fmt.Errorf("MachineDeployment %s is under rolling update , cannot reduce replica count", commonMachineDeployment.Name)
}
markedMachines := getMachinesMarkedByCAForDeletion(md)
Copy link

@elankath elankath Dec 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

markedMachines -> machineNamesMarkedByCA

@gardener-robot-ci-1 gardener-robot-ci-1 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jan 16, 2025
@gardener-robot-ci-3 gardener-robot-ci-3 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jan 16, 2025
@gardener-robot-ci-2 gardener-robot-ci-2 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jan 16, 2025
@elankath
Copy link

elankath commented Jan 17, 2025

UPDATE: Testing has shown that the current fix in this PR has a tendency to slow down the CA scaling due to the all the NodeGroup mutexes acquired in CloudProvider.Refresh. We are looking whether this is acceptable or whether we can improve this.

@elankath
Copy link

elankath commented Jan 23, 2025

Testing has revealed that the current fix unfortunately does not address all data races and also slows down the performance due to the multiple NodeGroup locks needed in the CloudProvider.Refresh.

It has been determined we cannot achieve this as a CA-cloudprovider isolated fix and we need to make MCM changes too to ensure that scaledown works correctly even in cases where CA issues concurrent scaledowns for a NodeGroup and changes its mind later about which node to scale down in a future iteration.

@gardener-robot-ci-3 gardener-robot-ci-3 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jan 28, 2025
@gardener-robot-ci-1 gardener-robot-ci-1 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jan 28, 2025
@gardener-robot-ci-3 gardener-robot-ci-3 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jan 28, 2025
@gardener-robot-ci-3 gardener-robot-ci-3 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jan 28, 2025
@gardener-robot-ci-3 gardener-robot-ci-3 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jan 29, 2025
@gardener-robot-ci-2 gardener-robot-ci-2 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jan 29, 2025
@gardener-robot-ci-1 gardener-robot-ci-1 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jan 29, 2025
@gardener-robot-ci-3 gardener-robot-ci-3 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jan 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) needs/review Needs review needs/second-opinion Needs second review by someone else size/xl Size of pull request is huge (see gardener-robot robot/bots/size.py)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CA Issues invalid scale downs due to scale-down processing delay in the MCM
8 participants