SKS-2345: Add support for system disk expansion #169

Merged
merged 4 commits into from
Apr 22, 2024

Contributor

@haijianyang haijianyang commented Jan 19, 2024

Issue

Support virtual machine disk expansion:

  • Specify the disk size when creating a node
  • Hot-expand the disk capacity

Change

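Disk expansion technical document

To be discussed:

  • When a node is created, should the cloud-init disk expansion step run in CAPE or in the VM template? Currently CAPE appends disk operation commands to the cloud-init generated by CAPI.
  • The retry strategy after a host-agent job fails. Currently a failed job is retried every two to three minutes.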
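One option for the retry question above is a capped exponential backoff instead of a fixed interval. The sketch below is only an illustration of that idea: `nextRetryDelay` and the 30-second base are hypothetical and are not taken from CAPE or host-agent; only the upper bound of roughly three minutes echoes the current behaviour described above.

```go
package main

import (
	"fmt"
	"time"
)

// nextRetryDelay returns how long to wait before retry number `attempt`
// (1-based). The delay doubles from `base` on every attempt and is capped
// at `max`. Hypothetical helper, not part of CAPE or host-agent.
func nextRetryDelay(attempt int, base, max time.Duration) time.Duration {
	if attempt < 1 {
		attempt = 1
	}
	delay := base
	for i := 1; i < attempt; i++ {
		delay *= 2
		if delay >= max {
			// Stop early so the delay can never overflow.
			return max
		}
	}
	if delay > max {
		return max
	}
	return delay
}

func main() {
	// Assumed values: start at 30s and never wait longer than 3 minutes.
	for attempt := 1; attempt <= 5; attempt++ {
		fmt.Printf("attempt %d: wait %s\n",
			attempt, nextRetryDelay(attempt, 30*time.Second, 3*time.Minute))
	}
}
```

With these assumed values the first retries happen quickly (30 seconds, then one and two minutes) and later ones settle at the three-minute cap, so a transient failure recovers faster without increasing the steady-state load.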

Test

  1. Create a 1 CP + 1 worker cluster with the disk capacity set to 220G (the VM template default is 200G)
apiVersion: kubesmart.smtx.io/v1alpha1
kind: KubeSmartCluster
metadata:
  name: haijian-test1
  namespace: default
spec:
  version: v1.25.15
  controlPlaneEndpoint:
    host: 1.1.1.1
    port: 6443
  network:
    managementNetworkInterface: ens4
    cni:
      name: calico
  storage:
  topology:
    controlPlane:
      name: controlplane
      replicas: 1
      nodeConfig:
        cpuCores: 8
        memoryMB: 12288
        diskGiB: 220
        cloneMode: FastClone
        network:
          nameservers: []
          devices:
          - networkType: IPV4_DHCP
    workers:
    - name: workergroup1
      replicas: 1
      nodeConfig:
        cpuCores: 8
        memoryMB: 8192
        diskGiB: 220
        cloneMode: FastClone
        network:
          ipPoolRef:
            namespace: default
            name: ip-pool-node
          nameservers: []
          devices:
          - networkType: IPV4_DHCP
  cloudProvider:
    name: cloudtower
    cloudtower:
      cloudtowerServer:
        secretRef:
          name: cloudtower-server
          namespace: default
      elfCluster: dd1f408f-7715-48c1-a817-13c3568f1d93
      elfVlan: dd1f408f-7715-48c1-a817-13c3568f1d93_4cd00407-63ca-440b-80b7-ceacfccb8d08
      vmTemplate: sks-rocky-8.8-amd64-ens4-k8s-v1.25.15-template-v3growpart01-20240117073358
      zbsVip: 10.244.0.11
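  2. Observed that the disk capacity of the created node VMs matched expectations.
    [screenshots]

  3. After the cluster was created, changed the disk capacity from 220G to 240G and observed that disk expansion was triggered:
    [screenshots]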

@jessehu
Collaborator

jessehu commented Jan 19, 2024

@Levi080513 please help review this.

@Levi080513
Contributor

@Levi080513 please help review this.

👌

}
}
if agentJob == nil {
agentJob, err = hostagent.AddNewDiskCapacityToRoot(ctx, kubeClient, ctx.ElfMachine)
Contributor

Should these agentJobs be cleaned up once they have finished executing?

Contributor Author

They can be deleted. The agent may also support automatic cleanup.
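As a rough illustration of the cleanup discussed above, the sketch below picks finished jobs that are old enough to delete. Everything in it is hypothetical: the `agentJob` type, the phase names, and the TTL are stand-ins and do not reflect the real host-agent CRD. Keeping failed jobs for debugging is an assumption of this sketch, not something decided in this thread.

```go
package main

import (
	"fmt"
	"time"
)

// jobPhase and agentJob are simplified stand-ins for the real job resource.
type jobPhase string

const (
	phaseRunning   jobPhase = "Running"
	phaseSucceeded jobPhase = "Succeeded"
	phaseFailed    jobPhase = "Failed"
)

type agentJob struct {
	Name       string
	Phase      jobPhase
	FinishedAt time.Time
}

// jobsToCleanUp returns the names of succeeded jobs that finished at least
// ttl ago. Running and failed jobs are never selected (assumption: failed
// jobs stay around so they can be inspected).
func jobsToCleanUp(jobs []agentJob, now time.Time, ttl time.Duration) []string {
	var names []string
	for _, j := range jobs {
		if j.Phase == phaseSucceeded && now.Sub(j.FinishedAt) >= ttl {
			names = append(names, j.Name)
		}
	}
	return names
}

func main() {
	now := time.Now()
	jobs := []agentJob{
		{Name: "expand-root-a", Phase: phaseSucceeded, FinishedAt: now.Add(-time.Hour)},
		{Name: "expand-root-b", Phase: phaseFailed, FinishedAt: now.Add(-time.Hour)},
		{Name: "expand-root-c", Phase: phaseRunning},
	}
	fmt.Println(jobsToCleanUp(jobs, now, 10*time.Minute))
}
```

A controller could run a selection like this at the end of each reconcile and delete the returned jobs; if the agent later gains automatic cleanup, the selection step simply becomes unnecessary.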

controllers/elfmachinetemplate_controller.go (outdated, resolved)
controllers/elfmachinetemplate_controller.go (outdated, resolved)
@jessehu jessehu changed the title [WIP] SKS-2345: Add support for expand disk [WIP] SKS-2345: Add support for system disk expansion Jan 22, 2024
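@jessehu
Collaborator

jessehu commented Jan 27, 2024

Add pre-expansion checks to the disk expansion flow, and handle scenarios such as cluster upgrades and node failures

@haijianyang Can the cluster status check be done in the KSC webhook? That way the error can be returned directly to the UI.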

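@haijianyang
Contributor Author

Add pre-expansion checks to the disk expansion flow, and handle scenarios such as cluster upgrades and node failures

@haijianyang Can the cluster status check be done in the KSC webhook? That way the error can be returned directly to the UI.

Yes, the webhook can intercept the request and check the cluster status at the time of the disk expansion. But if the cluster starts a rolling update while the expansion is in progress, there is no way to report that back.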
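A webhook check of the kind described above might look roughly like the following. This is a sketch under stated assumptions: `validateDiskResize` and its parameters are hypothetical, the single `clusterReady` flag stands in for whatever upgrade and node-health conditions KSC actually evaluates, and rejecting a shrink is an assumption of this sketch rather than a rule stated in this PR.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errShrink   = errors.New("disk capacity cannot be reduced")
	errNotReady = errors.New("cluster is not ready; disk expansion rejected")
)

// validateDiskResize decides whether a change of diskGiB from oldGiB to
// newGiB may be admitted. Hypothetical helper: clusterReady summarizes
// "not upgrading, no failed nodes" into one flag for the sake of the example.
func validateDiskResize(oldGiB, newGiB int32, clusterReady bool) error {
	switch {
	case newGiB == oldGiB:
		// Nothing changes, so the cluster state does not matter.
		return nil
	case newGiB < oldGiB:
		return errShrink
	case !clusterReady:
		return errNotReady
	default:
		return nil
	}
}

func main() {
	fmt.Println(validateDiskResize(220, 240, true))  // admitted
	fmt.Println(validateDiskResize(220, 240, false)) // rejected: cluster busy
	fmt.Println(validateDiskResize(240, 220, true))  // rejected: shrink
}
```

As noted in the reply above, a check like this only sees the cluster state at admission time. A rolling update that begins after the request is admitted still has to be handled by the controller.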


// Agent needs to wait for the node exists before it can run and execute commands.
if machineutil.IsUpdatingElfMachineResources(ctx.ElfMachine) &&
ctx.Machine.Status.NodeInfo == nil {
Contributor

Should sks-controller-manager check whether hostAgent is deployed and whether its status is healthy?

Contributor Author

I don't think that is necessary; it would make us depend on that service.

Contributor

Isn't the hot-expansion feature strongly dependent on hostAgent at this stage?

Contributor Author

@haijianyang haijianyang Apr 11, 2024

By that logic, CAPE depends on other services too; would we need to confirm that all of them are running normally as well?

If we need to check whether the agent is installed, that check should be done upstream.

Contributor

What I mean is: should the corresponding handling be added in KSC? Will VM clusters in 1.3 install hostAgent by default? @huaqing1994

Contributor

VM clusters currently do not install it by default.
In my view CAPE does not need to check; as long as the host-config-agent CR can be created, that is enough.

If the VM hot-update feature is going in, I think every cluster would have to install host-config-agent by default.
Otherwise KSC would need a precondition check, which is a bit of a hassle.

Contributor

Should the KSC webhook check whether host-config-agent is installed and healthy? If it is not installed or is unhealthy, will the cluster end up in a failed state?

Contributor

That is fine as long as the webhook can identify every kind of hot update; my concern is that the number of hot-update types will grow over time.
If the agent is not installed or its status is wrong, CAPE cannot complete its processing, and on the KSC side the cluster will become Failed.

controllers/elfmachine_controller_resources.go (outdated, resolved)
@haijianyang haijianyang force-pushed the resize-disk branch 2 times, most recently from 902e023 to f37521f Compare April 12, 2024 07:14
@haijianyang
Contributor Author

  • Changed the logic that determines whether the VM has already been started, so that the disk expansion check no longer runs before every VM start
  • Added tests

@haijianyang haijianyang changed the title [WIP] SKS-2345: Add support for system disk expansion SKS-2345: Add support for system disk expansion Apr 16, 2024
@haijianyang haijianyang removed the request for review from jessehu April 16, 2024 08:55
@haijianyang
Contributor Author

Stay compatible with the existing disk capacity setting of 0: a disk capacity of 0 means the capacity is the same as the VM template's.
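The zero-value rule above can be sketched as follows. Both helpers are hypothetical and only illustrate the semantics described in this comment; they are not the functions added by this PR.

```go
package main

import "fmt"

// effectiveDiskGiB resolves the disk size a node should have.
// A spec value of 0 means "use the VM template's size" (the legacy default).
func effectiveDiskGiB(specGiB, templateGiB int32) int32 {
	if specGiB == 0 {
		return templateGiB
	}
	return specGiB
}

// needsExpansion reports whether the VM's current disk is smaller than the
// size resolved from the spec and the template.
func needsExpansion(specGiB, templateGiB, currentGiB int32) bool {
	return effectiveDiskGiB(specGiB, templateGiB) > currentGiB
}

func main() {
	// Values from the test in this PR: template 200G, created at 220G,
	// later raised to 240G.
	fmt.Println(effectiveDiskGiB(0, 200))      // legacy spec: falls back to the template size
	fmt.Println(needsExpansion(220, 200, 200)) // freshly cloned VM must grow to 220G
	fmt.Println(needsExpansion(240, 200, 220)) // hot expansion from 220G to 240G
	fmt.Println(needsExpansion(0, 200, 200))   // legacy spec: nothing to do
}
```

Treating 0 as "same as the template" means existing clusters that never set a disk size are not picked up for expansion after the upgrade.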

@haijianyang haijianyang merged commit 939f02d into smartxworks:master Apr 22, 2024
1 check passed