SKS-1573: Refactor available hosts and vm migration code #130
Conversation
Codecov Report
@@ Coverage Diff @@
## master #130 +/- ##
==========================================
+ Coverage 53.66% 55.26% +1.60%
==========================================
Files 15 16 +1
Lines 2581 2602 +21
==========================================
+ Hits 1385 1438 +53
+ Misses 1084 1045 -39
- Partials 112 119 +7
@@ -170,7 +170,7 @@ func (r *ElfMachineReconciler) preCheckPlacementGroup(ctx *context.MachineContex

// KCP is not in scaling down/rolling update.
if !(kcputil.IsKCPRollingUpdateFirstMachine(kcp) || kcputil.IsKCPInScalingDown(kcp)) {
This relates to the issue of MHC and KCP RollingOut intersecting that we discussed last Friday: https://smartx1.slack.com/archives/C04288LAHTP/p1690784537848789?thread_ts=1690455013.021899&cid=C04288LAHTP
The current condition is as follows. kcp.Status.UpdatedReplicas == 1 means that one and only one new CP Machine has been created, but at that point the VM for that Machine has not necessarily been created yet (because one old CP Node is not ready and reconcile is stuck here).
func IsKCPRollingUpdateFirstMachine(kcp *controlplanev1.KubeadmControlPlane) bool {
- return *kcp.Spec.Replicas < kcp.Status.Replicas && kcp.Status.UpdatedReplicas == 1
+ return IsKCPRollingUpdate(kcp) && kcp.Status.UpdatedReplicas == 1
}
If we remove the constraint kcp.Status.UpdatedReplicas == 1, would that solve this scenario? Are there any side effects?
With it removed, the effect should be: as long as KCP is in RollingUpdate, if the current placement group has a free slot, add the new CP VM to the placement group; if there is no free slot, place the CP VM on the host of the last deleted old VM (without joining the placement group for now).
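The behavior proposed above can be sketched roughly as follows (all names here are illustrative, not the actual controller API):

```go
package main

import "fmt"

// placementDecision sketches the proposed rolling-update placement rule:
// while KCP is in rolling update, put the new CP VM into the placement
// group if it has a free slot; otherwise place it on the host of the
// last deleted old VM without joining the group yet.
func placementDecision(inRollingUpdate bool, groupHasFreeSlot bool, lastDeletedVMHost string) string {
	if !inRollingUpdate {
		return "wait"
	}
	if groupHasFreeSlot {
		return "join-placement-group"
	}
	return "place-on-host:" + lastDeletedVMHost
}

func main() {
	fmt.Println(placementDecision(true, false, "host-3")) // place-on-host:host-3
}
```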
@haijianyang
Let's not include other changes in this PR. Do the optimization first, then handle the rest.
@@ -170,7 +170,7 @@ func (r *ElfMachineReconciler) preCheckPlacementGroup(ctx *context.MachineContex

// KCP is not in scaling down/rolling update.
This comment needs to be more precise:
// When KCP is not in scaling down and it's not rolling update the 1st CP Machine, just return since the placement group is full.
fix
@@ -213,7 +213,7 @@ func (r *ElfMachineReconciler) preCheckPlacementGroup(ctx *context.MachineContex

// KCP is in rolling update.
// When the PlacementGroup is full and it's rolling update the 1st CP Machine, place the VM on the target host without joining the PlacementGroup and power it on. After the other CP Machines are rolled out successfully, add this 1st CP VM into the PlacementGroup.
fix
@@ -213,7 +213,7 @@ func (r *ElfMachineReconciler) preCheckPlacementGroup(ctx *context.MachineContex

// KCP is in rolling update.

if !service.ContainsUnavailableHost(hosts, usedHostsByPG.UnsortedList(), *service.TowerMemory(ctx.ElfMachine.Spec.MemoryMiB)) &&
if usedHostsByPG.Len() == usedHostsByPG.Available(*service.TowerMemory(ctx.ElfMachine.Spec.MemoryMiB)).Len() &&
What does each of these two if conditions check? I've forgotten since our last discussion.
The logic here may need adjusting.
The current code's logic is: during rolling update, every host must be available and have enough memory to create one CP; then the host of the oldest CP is assigned to the first rolled-out CP.
"During rolling update, every host must be available and have enough memory to create one CP."
Could we change this to: only the target host selected for this CP VM needs to have enough memory?
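The difference between the two rules can be sketched as follows (hypothetical host fields, not the Tower models):

```go
package main

import "fmt"

// Hypothetical host model standing in for Tower's *models.Host.
type host struct {
	name      string
	available bool
	freeMiB   int64
}

// allHostsUsable mirrors the current, stricter rule: every host used by
// the placement group must be available with enough memory for one CP.
func allHostsUsable(hosts []host, memMiB int64) bool {
	for _, h := range hosts {
		if !h.available || h.freeMiB < memMiB {
			return false
		}
	}
	return true
}

// targetHostUsable mirrors the proposed relaxation: only the host chosen
// for this CP VM needs to be available with enough memory.
func targetHostUsable(target host, memMiB int64) bool {
	return target.available && target.freeMiB >= memMiB
}

func main() {
	hosts := []host{
		{"host-1", true, 8192},
		{"host-2", false, 8192}, // one unusable host
		{"host-3", true, 8192},
	}
	fmt.Println(allHostsUsable(hosts, 4096))      // false: host-2 blocks the whole update
	fmt.Println(targetHostUsable(hosts[0], 4096)) // true: only the target host matters
}
```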
@@ -225,7 +225,7 @@ func (r *ElfMachineReconciler) preCheckPlacementGroup(ctx *context.MachineContex
	return pointer.String(hostID), err
}

ctx.Logger.V(1).Info("The placement group is full, wait for enough available hosts", "placementGroup", *placementGroup.Name, "availableHosts", availableHostSet.UnsortedList(), "usedHostsByPG", usedHostsByPG.UnsortedList())
ctx.Logger.V(1).Info("The placement group is full, wait for enough available hosts", "placementGroup", *placementGroup.Name, "availableHosts", availableHosts.String(), "usedHostsByPG", usedHostsByPG.String())
This log msg is identical to the one at line 173, which makes them hard to distinguish. If the two scenarios differ, add some distinguishing text to the msg.
fix
@@ -213,7 +213,7 @@ func (r *ElfMachineReconciler) preCheckPlacementGroup(ctx *context.MachineContex

// KCP is in rolling update.

if !service.ContainsUnavailableHost(hosts, usedHostsByPG.UnsortedList(), *service.TowerMemory(ctx.ElfMachine.Spec.MemoryMiB)) &&
if usedHostsByPG.Len() == usedHostsByPG.Available(*service.TowerMemory(ctx.ElfMachine.Spec.MemoryMiB)).Len() &&
	int(*kcp.Spec.Replicas) == usedHostsByPG.Len() {
// Only when KCP is in rolling update and the placement group is full
These two comment lines can be deleted and merged into the comment at line 214.
pkg/service/collections.go
Outdated
}

// Available returns a Hosts with available hosts.
func (s Hosts) Available(memory int64) Hosts {
Suggest a rename:
// GetHostsWithEnoughMemory returns a Hosts containing the available host which has allocatable memory no less than the specified memory.
GetHostsWithEnoughMemory(memory int64) Hosts {
This doesn't only check memory; it uses Tower's available-host check.
Hmm, that works too; I'll just adjust the func description.
FilterAvailableHostsWithEnoughMemory(memory int32)
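As a rough sketch of what the renamed filter might look like (hypothetical Host fields; the real method filters Tower host models through IsAvailableHost, which checks more than memory):

```go
package main

import "fmt"

// Hypothetical host model standing in for Tower's *models.Host.
type Host struct {
	Name              string
	Available         bool
	AllocatableMemMiB int64
}

type Hosts []Host

// FilterAvailableHostsWithEnoughMemory returns the available hosts whose
// allocatable memory is no less than the requested amount.
func (s Hosts) FilterAvailableHostsWithEnoughMemory(memory int64) Hosts {
	var out Hosts
	for _, h := range s {
		if h.Available && h.AllocatableMemMiB >= memory {
			out = append(out, h)
		}
	}
	return out
}

func main() {
	hosts := Hosts{
		{"host-1", true, 16384},
		{"host-2", true, 2048},   // not enough memory
		{"host-3", false, 16384}, // unavailable
	}
	for _, h := range hosts.FilterAvailableHostsWithEnoughMemory(4096) {
		fmt.Println(h.Name) // only host-1 qualifies
	}
}
```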
pkg/service/mock_services/vm_mock.go
Outdated
@@ -1,5 +1,5 @@
// Code generated by MockGen. DO NOT EDIT.
// Source: github.com/smartxworks/cluster-api-provider-elf/pkg/service/vm.go
// Source: /root/github/cluster-api-provider-elf/pkg/service/vm.go
This change doesn't seem right?
fix
pkg/util/kcp/kcp.go
Outdated
// When all machines are up to date, KCP controller marks MachinesSpecUpToDateCondition to true.
//
// For more information about KCP MachinesSpecUpToDateCondition and RollingUpdateInProgressReason, refer to https://github.com/kubernetes-sigs/cluster-api/blob/main/api/v1beta1/condition_consts.go
func IsKCPRollingUpdate(kcp *controlplanev1.KubeadmControlPlane) bool {
IsKCPInScalingDown() below uses "In" but IsKCPRollingUpdate() doesn't; let's make the two naming styles consistent.
fix
pkg/util/kcp/kcp.go
Outdated
//
// For more information about KCP MachinesSpecUpToDateCondition and RollingUpdateInProgressReason, refer to https://github.com/kubernetes-sigs/cluster-api/blob/main/api/v1beta1/condition_consts.go
func IsKCPRollingUpdate(kcp *controlplanev1.KubeadmControlPlane) bool {
	return conditions.IsFalse(kcp, controlplanev1.MachinesSpecUpToDateCondition) &&
Couldn't this just use ReadyCondition directly? Also, on its face MachinesSpecUpToDateCondition only says the Spec is up to date; it doesn't mean KCP is already Ready (i.e. that the RollingUpdate has completed).
conditions.SetSummary(kcp,
conditions.WithConditions(
controlplanev1.MachinesCreatedCondition,
controlplanev1.MachinesSpecUpToDateCondition,
controlplanev1.ResizedCondition,
controlplanev1.MachinesReadyCondition,
controlplanev1.AvailableCondition,
controlplanev1.CertificatesAvailableCondition,
),
)
We could change it, but there would be a risk of false positives. See the WithConditions discussed earlier for details.
My feeling is that MachinesSpecUpToDateCondition is unreliable for what this func is meant to express.
Local upgrade verification passed. Preparing to publish a test build and run it through E2E.
Verified via https://smartx1.slack.com/archives/C04288LAHTP/p1690784537848789?thread_ts=1690455013.021899&cid=C04288LAHTP and found the following issue: yishanreproduce-controlplane-f86p6 (the second CP in the rolling update) did not pass the VM-migration check logic and therefore could not be migrated. The environment is in SKS-1656.
targetHost := availableHosts.UnsortedList()[0]
usedHostsByPG := sets.Set[string]{}
for i := 0; i < len(cpElfMachines); i++ {
	if ctx.ElfMachine.Name != cpElfMachines[i].Name &&
		cpElfMachines[i].Status.PlacementGroupRef == *placementGroup.ID &&
		cpElfMachines[i].CreationTimestamp.After(ctx.ElfMachine.CreationTimestamp.Time) {
		usedHostsByPG.Insert(cpElfMachines[i].Status.HostServerRef)
	}
}
A panic was observed in E2E; so far it has occurred only once (unrelated to this PR).
ok, message := service.IsAvailableHost(host, *service.TowerMemory(ctx.ElfMachine.Spec.MemoryMiB))
if ok {
	ctx.Logger.Info("Selected the host server for VM since the placement group is full", "hostID", *vm.Host.ID, "vmRef", ctx.ElfMachine.Status.VMRef)
- Change the msg to: "Select a host to power on the VM since the placement group is full"
- Here, and in other places that print vm.Host.ID, can we also display the host name?
pkg/service/collections.go
Outdated
return s.Filter(func(h *models.Host) bool {
	ok, _ := IsAvailableHost(h, memory)
	return ok
})
}

// FilterUnavailableHostsWithoutEnoughMemory returns a Hosts containing the unavailable host which has allocatable memory less than the specified memory.
Adjust as follows:
// FilterUnavailableHostsOrWithoutEnoughMemory returns a Hosts containing the unavailable hosts or available hosts whose available memory is less than the specified memory.
func (s Hosts) FilterUnavailableHostsOrWithoutEnoughMemory(memory int64) Hosts {
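A minimal sketch of the suggested filter (hypothetical Host fields; the real code operates on Tower host models via IsAvailableHost). It is the exact complement of the "available with enough memory" filter, so together the two partition the host set:

```go
package main

import "fmt"

// Hypothetical host model standing in for Tower's *models.Host.
type Host struct {
	Name              string
	Available         bool
	AllocatableMemMiB int64
}

type Hosts []Host

// FilterUnavailableHostsOrWithoutEnoughMemory returns the hosts that are
// unavailable, or available but with less allocatable memory than requested.
func (s Hosts) FilterUnavailableHostsOrWithoutEnoughMemory(memory int64) Hosts {
	var out Hosts
	for _, h := range s {
		if !h.Available || h.AllocatableMemMiB < memory {
			out = append(out, h)
		}
	}
	return out
}

func main() {
	hosts := Hosts{
		{"host-1", true, 16384},  // usable
		{"host-2", true, 2048},   // not enough memory
		{"host-3", false, 16384}, // unavailable
	}
	for _, h := range hosts.FilterUnavailableHostsOrWithoutEnoughMemory(4096) {
		fmt.Println(h.Name) // host-2 and host-3 are unusable
	}
}
```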
// When KCP is not in scaling down and it's not rolling update the 1st CP Machine, just return since the placement group is full.
if !(kcputil.IsKCPRollingUpdateFirstMachine(kcp) || kcputil.IsKCPInScalingDown(kcp)) {
ctx.Logger.V(1).Info("KCP is not in scaling down and it's not rolling update the 1st CP Machine, the placement group is full, wait for enough available hosts", "placementGroup", *placementGroup.Name, "availableHosts", availableHosts.String(), "usedHostsByPG", usedHostsByPG.String())
// When KCP is not in rolling update and it's not in scaling down, just return since the placement group is full.
-> // When KCP is not in rolling update and not in scaling down
ctx.Logger.V(1).Info("KCP is not in scaling down and it's not rolling update the 1st CP Machine, the placement group is full, wait for enough available hosts", "placementGroup", *placementGroup.Name, "availableHosts", availableHosts.String(), "usedHostsByPG", usedHostsByPG.String())
// When KCP is not in rolling update and it's not in scaling down, just return since the placement group is full.
if !kcputil.IsKCPInRollingUpdate(kcp) && !kcputil.IsKCPInScalingDown(kcp) {
	ctx.Logger.V(1).Info("KCP is not in rolling update and it's not in scaling down, the placement group is full, wait for enough available hosts", "placementGroup", *placementGroup.Name, "availableHosts", availableHosts.String(), "usedHostsByPG", usedHostsByPG.String())
Same msg change as above.
if err != nil || hostID == "" {
	return nil, err
}
usedAndUnavailableHostsByPG := usedHostsByPG.FilterUnavailableHostsWithoutEnoughMemory(*service.TowerMemory(ctx.ElfMachine.Spec.MemoryMiB))
Use a simpler name: usedAndUnavailableHostsByPG -> unusableHosts
ctx.Logger.V(1).Info("KCP is in rolling update, the placement group is full and has unusable hosts, will wait for enough available hosts"
usedHostsByPG.Len() == usedHostsByPG.Available(*service.TowerMemory(ctx.ElfMachine.Spec.MemoryMiB)).Len() &&
	int(*kcp.Spec.Replicas) == usedHostsByPG.Len() {
ctx.Logger.Info("The placement group is full and KCP is in rolling update, skip adding VM to the placement group", "placementGroup", *placementGroup.Name, "availableHosts", availableHosts.String(), "usedHostsByPG", usedHostsByPG.String(), "vmRef", ctx.ElfMachine.Status.VMRef, "vmId", *vm.ID)
// In this case machine created by KCP rolling update can be powered on without being added to the placement group.
Change this comment to:
// In this case the machine created by KCP rolling update can be powered on without being added to the placement group, so return true and nil to let reconcileVMStatus() power it on.
@@ -415,14 +418,17 @@ func (r *ElfMachineReconciler) joinPlacementGroup(ctx *context.MachineContext, v

// Only when the KCP is in rolling update, the VM is stopped, and all the hosts used by the placement group are available,
// will the upgrade be allowed with the same number of hosts and CP nodes.
Per the Slack discussion, line 420 can be deleted.
if kcputil.IsKCPInRollingUpdate(kcp) && *vm.Status == models.VMStatusSTOPPED {
	usedAndUnavailableHostsByPG := usedHostsByPG.FilterUnavailableHostsWithoutEnoughMemory(*service.TowerMemory(ctx.ElfMachine.Spec.MemoryMiB))
	if usedAndUnavailableHostsByPG.IsEmpty() {
		ctx.Logger.Info("KCP is in rolling update, the placement group is full and has no unavailable hosts, skip adding VM to the placement group", "placementGroup", *placementGroup.Name, "availableHosts", availableHosts.String(), "usedHostsByPG", usedHostsByPG.String(), "vmRef", ctx.ElfMachine.Status.VMRef, "vmId", *vm.ID)
Also change this log to the following:
"KCP is in rolling update, the placement group is full and has no unusable hosts, so skip adding VM to the placement group and power it on"
return true, nil
ctx.Logger.Info("KCP is in rolling update, the placement group is full and has unavailable hosts, wait for enough available hosts", "placementGroup", *placementGroup.Name, "availableHosts", availableHosts.String(), "usedHostsByPG", usedHostsByPG.String(), "vmRef", ctx.ElfMachine.Status.VMRef, "vmId", *vm.ID)
unavailable hosts -> unusable hosts
wait for -> so wait for
return true, nil
ctx.Logger.Info("KCP is in rolling update, the placement group is full and has unavailable hosts, wait for enough available hosts", "placementGroup", *placementGroup.Name, "availableHosts", availableHosts.String(), "usedHostsByPG", usedHostsByPG.String(), "vmRef", ctx.ElfMachine.Status.VMRef, "vmId", *vm.ID)

return false, nil
}

if *vm.Status == models.VMStatusRUNNING || *vm.Status == models.VMStatusSUSPENDED {
Under what circumstances would line 434 be reached? Is the VM already in the placement group or not?
The first CP in a rolling update doesn't need to join the placement group before powering on.
There is already an earlier check for whether the VM is in the placement group; if it is, we never reach here.
Reaching here means the placement group is full with no available hosts, but the current VM is already powered on, so we return directly.
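The branch described above could be sketched like this (hypothetical status strings in place of Tower's models.VMStatus constants):

```go
package main

import "fmt"

// canContinueWithoutJoining sketches the tail of joinPlacementGroup as
// described above: when the placement group is full and has no free host,
// a VM that is already powered on (RUNNING or SUSPENDED) lets the
// reconcile continue (return true), while a stopped VM keeps waiting
// for an available host (return false).
func canContinueWithoutJoining(vmStatus string) bool {
	return vmStatus == "RUNNING" || vmStatus == "SUSPENDED"
}

func main() {
	fmt.Println(canContinueWithoutJoining("RUNNING")) // true
	fmt.Println(canContinueWithoutJoining("STOPPED")) // false
}
```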
return true, nil
}

// KCP is scaling out or being created.
ctx.Logger.V(1).Info("The placement group is full, wait for enough available hosts", "placementGroup", *placementGroup.Name, "availableHosts", availableHostSet.UnsortedList(), "usedHostsByPG", usedHostsByPG.UnsortedList(), "vmRef", ctx.ElfMachine.Status.VMRef, "vmId", *vm.ID)
ctx.Logger.V(1).Info("The placement group is full, wait for enough available hosts", "placementGroup", *placementGroup.Name, "availableHosts", availableHosts.String(), "usedHostsByPG", usedHostsByPG.String(), "vmRef", ctx.ElfMachine.Status.VMRef, "vmId", *vm.ID)
Change this log to "KCP is in scaling up or being created, the placement group is full, so wait for enough available hosts".
return false, err
}

targetHost := availableHosts.UnsortedList()[0]
Move this line to just before line 480 (it is not used before that point).
ctx.Logger.V(1).Info("The placement group is full, wait for enough available hosts", "placementGroup", *placementGroup.Name, "availableHosts", availableHostSet.UnsortedList(), "usedHostsByPG", usedHostsByPG.UnsortedList())
// When KCP is not in rolling update and not in scaling down, just return since the placement group is full.
if !kcputil.IsKCPInRollingUpdate(kcp) && !kcputil.IsKCPInScalingDown(kcp) {
	ctx.Logger.V(1).Info("KCP is not in rolling update and not in scaling down, the placement group is full, wait for enough available hosts", "placementGroup", *placementGroup.Name, "availableHosts", availableHosts.String(), "usedHostsByPG", usedHostsByPG.String())
Let's uniformly change the original "wait for enough available hosts" to "so wait for enough available hosts"; adding "so" makes explicit that what precedes is the cause and what follows is the effect.
@@ -460,24 +463,35 @@ func (r *ElfMachineReconciler) joinPlacementGroup(ctx *context.MachineContext, v
	return true, nil
}

// The 1st new CP ElfMachine should wait for other new CP ElfMachines to join the target PlacementGroup.
// The new CP ElfMachine should wait for other new CP ElfMachines to join the target PlacementGroup.
// The new CP ElfMachine
->
// The powered on CP ElfMachine which is not in the PlacementGroup
// and kcp.Status.UnavailableReplicas == 0.
// So we need to check if the number of CP ElfMachine is equal to kcp.Spec.Replicas.
if len(cpElfMachines) != int(*kcp.Spec.Replicas) {
	ctx.Logger.Info("The number of CP ElfMachine does not match the expected", "kcp", formatKCP(kcp), "cpElfMachines", cpElfMachineNames)
Use V(1) for this one too.
// This is the last CP ElfMachine (i.e. the 1st new CP ElfMachine) which has not been added into the target PlacementGroup.
// Migrate this VM to the target host, then it will be added into the target PlacementGroup.

ctx.Logger.V(1).Info("Start migrateVM since KCP is not in rolling update process", "targetHost", formatHost(targetHost))
Change the msg here to "Start migrating VM".
Great!
Problem
Optimize the placement-group-related code.
Solution
Other changes: removed the logic that allowed only one VM to be migrated at a time.
Testing
3-host ELF cluster, 3 CP + 1 worker cluster; verified across three cluster upgrades, with a script updating the first upgraded CP every second; no VM migration was observed.
The test build has been verified in E2E.