
CAPM3 v1.6.0 cluster deployment using bond interfaces is failing due to missing bondXmitHashPolicy #1437

Open
baburciu opened this issue Jan 31, 2024 · 4 comments
Labels: kind/bug, lifecycle/frozen, triage/accepted

@baburciu

What steps did you take and what happened:

In Project Sylva we maintain a Helm chart, sylva-capi-cluster, that creates CAPI manifest definitions for clusters (we don't use clusterctl). Since upgrading to cluster-api-provider-metal3/releases/tag/v1.6.0 (via Kustomize and GitOps in the capm3 unit), we've noticed the following error:

Metal3DataTemplate.infrastructure.cluster.x-k8s.io "my-rke2-capm3-cp-metadata-30e3e5cc86" is invalid: spec.networkData.links.bonds[0].bondXmitHashPolicy: Unsupported value: "": supported values: "layer2", "layer3+4", "layer2+3"

when we feed a CAPI management cluster with a manifest like:

Metal3DataTemplate/my-rke2-capm3-cp-metadata-30e3e5cc86:
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3DataTemplate
metadata:
  name: my-rke2-capm3-cp-metadata-30e3e5cc86
  namespace: sylva-system
spec:
  clusterName: my-rke2-capm3
  metaData:
    ipAddressesFromIPPool:
    - kind: IPPool
      apiGroup: ipam.metal3.io
      name: my-rke2-capm3-provisioning-pool
      key: provisioningIP
    objectNames:
    - key: name
      object: machine
    - key: local-hostname
      object: machine
    - key: local_hostname
      object: machine
    prefixesFromIPPool:
    - kind: IPPool
      apiGroup: ipam.metal3.io
      name: my-rke2-capm3-provisioning-pool
      key: provisioningCIDR
    fromLabels:
    - key: longhorn
      object: baremetalhost
      label: longhorn
    fromAnnotations:
    - key: sylva_longhorn_disks
      object: baremetalhost
      annotation: sylvaproject.org/default-longhorn-disks-config
  networkData:
    links:
      ethernets:
      - id: ens1f0
        macAddress:
          fromHostInterface: ens1f0
        type: phy
      - id: ens1f1
        macAddress:
          fromHostInterface: ens1f1
        type: phy
      - id: ens2f0
        macAddress:
          fromHostInterface: ens2f0
        type: phy
      - id: ens2f1
        macAddress:
          fromHostInterface: ens2f1
        type: phy
      bonds:
      - id: bond0
        bondLinks:
          - ens1f0
          - ens1f1
        bondMode: balance-xor
        macAddress: {}
      - id: bond1
        bondLinks:
          - ens2f0
          - ens2f1
        bondMode: active-backup
        macAddress: {}
      vlans:
      - id: bond0.92
        macAddress:
          fromHostInterface: ens1f0
        vlanID: 92
        vlanLink: bond0
      - id: bond1.206
        macAddress:
          fromHostInterface: ens2f0
        vlanID: 206
        vlanLink: bond1
      - id: ens1f0.92
        macAddress:
          fromHostInterface: ens1f0
        vlanID: 92
        vlanLink: ens1f0
      - id: ens1f1.92
        macAddress:
          fromHostInterface: ens1f1
        vlanID: 92
        vlanLink: ens1f1
      - id: ens2f0.206
        macAddress:
          fromHostInterface: ens2f0
        vlanID: 206
        vlanLink: ens2f0
      - id: ens2f1.206
        macAddress:
          fromHostInterface: ens2f1
        vlanID: 206
        vlanLink: ens2f1
    networks:
      ipv4:
      - id: bond12
        ipAddressFromIPPool: my-rke2-capm3-provisioning-pool
        link: bond12
        routes:
          - gateway:
              string: 192.168.111.254
            network: 192.168.211.0
            prefix: 30
      - id: bond12.13
        ipAddressFromIPPool: my-rke2-capm3-public-pool
        link: bond12.13
        routes:
          - gateway:
              string: 192.168.2.254
            network: 0.0.0.0
            prefix: 0
    services:
      dns:
      - 1.1.1.1

Researching upstream, we found that the Metal3DataTemplate spec.networkData.links.bonds[*].bondXmitHashPolicy option was introduced upstream in https://github.com/metal3-io/cluster-api-provider-metal3/pull/708/files and landed in the v1.6.0 tag.

What follows is just a supposition, and it might very well be wrong:
Based on

// +kubebuilder:validation:Enum="layer2";"layer3+4";"layer2+3"
// Selects the transmit hash policy used for port selection in balance-xor and 802.3ad modes
// +optional
BondXmitHashPolicy string `json:"bondXmitHashPolicy"`
and the presence of the +optional marker (but no omitempty anywhere in https://github.com/metal3-io/cluster-api-provider-metal3/pull/708/files, if I read https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#optional-vs-required correctly; I'm a novice in Go), the field seems to be marked optional so that users can omit it when defining instances of the custom resource, in which case the default value (if any) or the zero value is used.

So I'm assuming the problem is that zero value: in Go, an uninitialized string field has the zero value "" (the empty string), and unless the controller logic sets it when processing the custom resource (which doesn't seem to happen, based on the referenced PR), the empty string is what gets serialized, and "" is not among the supported values "layer2", "layer3+4", "layer2+3".
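
If that reading is correct, one way to avoid serializing the empty string would be an omitempty JSON tag, which makes Go's encoding/json drop the field entirely when it holds the zero value. A sketch of what that could look like (my assumption, not the actual upstream code):

// +kubebuilder:validation:Enum="layer2";"layer3+4";"layer2+3"
// Selects the transmit hash policy used for port selection in balance-xor and 802.3ad modes
// +optional
// With omitempty, an unset field is omitted from the marshalled JSON
// instead of being emitted as "", which would fail the enum validation.
BondXmitHashPolicy string `json:"bondXmitHashPolicy,omitempty"`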

What did you expect to happen:

Well, ideally the CAPM3 controller (or the CRD) should set a default, or maybe this needs to be mentioned as a breaking change in cluster-api-provider-metal3/releases/tag/v1.6.0 🤔

For us it's not a blocker; we're just adding the field in the Helm chart (see the sketch below). But we thought it should be reported to the community, as others might run into the same situation. Or maybe clusterctl adds it?
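
For reference, a minimal sketch of the workaround as we apply it in the chart output, setting the field explicitly on the balance-xor bond (the layer3+4 value is our own choice here, not an upstream default):

bonds:
- id: bond0
  bondLinks:
    - ens1f0
    - ens1f1
  bondMode: balance-xor
  # Explicitly set; any of layer2, layer3+4, layer2+3 passes the enum validation.
  bondXmitHashPolicy: layer3+4
  macAddress: {}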


Environment:

  • Cluster-api version: v1.6.1
  • Cluster-api-provider-metal3 version: v1.6.0
  • Environment (metal3-dev-env or other): physical BMs with bonding configuration
  • Kubernetes version (kubectl version): v1.26.9 (v1.26.9+rke2r1)

/kind bug

@metal3-io-bot added the kind/bug and needs-triage labels on Jan 31, 2024
@Rozzii (Member) commented Feb 7, 2024

/triage accepted
We will discuss with some community members what would be a suitable default value, thank you for reporting this issue!

@metal3-io-bot added the triage/accepted label and removed the needs-triage label on Feb 7, 2024
@Rozzii self-assigned this on Feb 7, 2024
@tmmorin (Contributor) commented Feb 7, 2024

We will discuss with some community members what would be a suitable default value, thank you for reporting this issue!

One simple option would be to adopt the kernel default (layer2).
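
For illustration, that could be expressed with CRD-level defaulting via a kubebuilder marker, so an omitted field is defaulted by the API server at admission time (a sketch of one possible shape, not a reviewed patch):

// Default to the kernel's own default transmit hash policy.
// +kubebuilder:validation:Enum="layer2";"layer3+4";"layer2+3"
// +kubebuilder:default="layer2"
// +optional
BondXmitHashPolicy string `json:"bondXmitHashPolicy,omitempty"`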

@metal3-io-bot (Contributor) commented May 7, 2024
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@metal3-io-bot added the lifecycle/stale label on May 7, 2024
@Rozzii (Member) commented May 8, 2024

This is still a legit request.
/lifecycle frozen

@metal3-io-bot added the lifecycle/frozen label and removed the lifecycle/stale label on May 8, 2024
Project status: CAPM3 on hold / blocked