TiDB operator cannot update or contact the TiDB cluster after the TLS config is changed #5728

kos-team commented Aug 30, 2024

Bug Report

What version of Kubernetes are you using?

1.28.1

What version of TiDB Operator are you using?

1.6.0

What storage classes exist in the Kubernetes cluster and what are used for PD/TiKV pods?

standard

What's the status of the TiDB cluster pods?

What did you do?

We tried to change the TLS configuration of the TiDB cluster and found that the operator fails to contact the cluster after the tlsCluster.enabled option is changed. The bug occurs in both directions, changing the option from true to false and from false to true.
The root cause is a mismatch between the TLS configuration of the operator's PD client and the configuration the PD server is actually running with.

To reproduce this bug:

  1. First, create the certificate secrets according to the documentation: https://docs.pingcap.com/tidb-in-kubernetes/stable/enable-tls-between-components#using-cert-manager
  2. Install the TiDB cluster by applying the CR below, with tlsCluster.enabled set to true:
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: basic
spec:
  version: v8.1.0
  timezone: UTC
  pvReclaimPolicy: Retain
  enableDynamicConfiguration: true
  configUpdateStrategy: RollingUpdate
  discovery: {}
  helper:
    image: alpine:3.16.0
  tlsCluster:
    enabled: true
  pd:
    baseImage: pingcap/pd
    maxFailoverCount: 0
    replicas: 1
    requests:
      storage: "1Gi"
    config: {}
  tikv:
    baseImage: pingcap/tikv
    maxFailoverCount: 0
    # If only 1 TiKV is deployed, the TiKV region leader
    # cannot be transferred during upgrade, so we have
    # to configure a short timeout
    evictLeaderTimeout: 1m
    replicas: 1
    requests:
      storage: "1Gi"
    config:
      storage:
        # In basic examples, we set this to avoid using too much storage.
        reserve-space: "0MB"
        engine: "partitioned-raft-kv"
      rocksdb:
        # In basic examples, we set this to avoid the following error in some Kubernetes clusters:
        # "the maximum number of open file descriptors is too small, got 1024, expect greater or equal to 82920"
        max-open-files: 256
      raftdb:
        max-open-files: 256
  tidb:
    baseImage: pingcap/tidb
    maxFailoverCount: 0
    replicas: 1
    service:
      type: ClusterIP
    config: {}
  3. Change tlsCluster.enabled to false and apply the updated CR:
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: basic
spec:
  version: v8.1.0
  timezone: UTC
  pvReclaimPolicy: Retain
  enableDynamicConfiguration: true
  configUpdateStrategy: RollingUpdate
  discovery: {}
  helper:
    image: alpine:3.16.0
  tlsCluster:
    enabled: false
  pd:
    baseImage: pingcap/pd
    maxFailoverCount: 0
    replicas: 1
    requests:
      storage: "1Gi"
    config: {}
  tikv:
    baseImage: pingcap/tikv
    maxFailoverCount: 0
    # If only 1 TiKV is deployed, the TiKV region leader
    # cannot be transferred during upgrade, so we have
    # to configure a short timeout
    evictLeaderTimeout: 1m
    replicas: 1
    requests:
      storage: "1Gi"
    config:
      storage:
        # In basic examples, we set this to avoid using too much storage.
        reserve-space: "0MB"
        engine: "partitioned-raft-kv"
      rocksdb:
        # In basic examples, we set this to avoid the following error in some Kubernetes clusters:
        # "the maximum number of open file descriptors is too small, got 1024, expect greater or equal to 82920"
        max-open-files: 256
      raftdb:
        max-open-files: 256
  tidb:
    baseImage: pingcap/tidb
    maxFailoverCount: 0
    replicas: 1
    service:
      type: ClusterIP
    config: {}

What did you expect to see?
The operator should be able to reconfigure the TLS config for the TiDB cluster.

What did you see instead?
The operator fails with the following errors in its logs:

E0830 20:39:35.006167       1 pd_member_manager.go:201] failed to sync TidbCluster: [default/advanced-tidb]'s status, error: Get "https://advanced-tidb-pd.default:2379/pd/api/v1/health": EOF
E0830 20:39:35.020171       1 tidb_cluster_controller.go:144] TidbCluster: default/advanced-tidb, sync failed tidbcluster: [default/advanced-tidb]'s pd status sync failed, can not to be upgraded, requeuing
E0830 20:39:41.747446       1 pd_member_manager.go:201] failed to sync TidbCluster: [default/advanced-tidb]'s status, error: Get "https://advanced-tidb-pd.default:2379/pd/api/v1/health": EOF
E0830 20:39:41.760679       1 tidb_cluster_controller.go:144] TidbCluster: default/advanced-tidb, sync failed tidbcluster: [default/advanced-tidb]'s pd status sync failed, can not to be upgraded, requeuing

The root cause is the PD client's TLS configuration. After the tlsCluster configuration is changed, the operator builds its client from the new (desired) TLS settings, while the TiDB cluster is still running with the old ones. The health check then fails with a client-side error even though the cluster itself is healthy.

https://github.com/pingcap/tidb-operator/blob/ff467a6e7a0563b31a3ace2fe5d060774012780d/pkg/pdapi/pd_control.go#L241C1-L252C3
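
For context, the client construction at that permalink follows roughly the pattern sketched below. This is a minimal, self-contained illustration, not the operator's actual code: the TidbCluster struct and buildPDClient are hypothetical stand-ins. The point is that the scheme and transport are derived from the desired state in the CR rather than from what the PD server is currently serving.

package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
)

// Hypothetical stand-in for the operator's TidbCluster type.
type TidbCluster struct {
	Name, Namespace   string
	TLSClusterEnabled bool // mirrors tc.Spec.TLSCluster.Enabled
}

// buildPDClient sketches the selection pattern: scheme and transport come
// from the CR's desired TLS setting, not from the server's current one.
func buildPDClient(tc *TidbCluster, tlsCfg *tls.Config) (string, *http.Client) {
	scheme := "http"
	client := &http.Client{}
	if tc.TLSClusterEnabled {
		scheme = "https"
		client.Transport = &http.Transport{TLSClientConfig: tlsCfg}
	}
	url := fmt.Sprintf("%s://%s-pd.%s:2379/pd/api/v1/health", scheme, tc.Name, tc.Namespace)
	return url, client
}

func main() {
	// After flipping tlsCluster.enabled, the client no longer matches the
	// running PD server, so the health check fails (e.g. with the EOF
	// errors shown in the logs above).
	tc := &TidbCluster{Name: "basic", Namespace: "default", TLSClusterEnabled: false}
	url, client := buildPDClient(tc, nil)
	if _, err := client.Get(url); err != nil {
		fmt.Println("health check failed:", err)
	}
}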

csuzhangxc commented

@kos-team In the above example, the root cause is: after changing from TLS to non-TLS, the operator is still using a TLS-enabled client to connect to PD, right?

Can you work around it by restarting tidb-controller-manager?

kos-team commented Sep 2, 2024

@csuzhangxc The root cause is that the operator switched to a non-TLS client while PD is still running with TLS, so the calls to PD fail. I believe restarting the tidb-operator won't help.

csuzhangxc commented

> @csuzhangxc The root cause is that the operator switched to a non-TLS client while PD is still running with TLS, so the calls to PD fail. I believe restarting the tidb-operator won't help.

In your case, after updating the TidbCluster CR from TLS to non-TLS, is PD still using TLS? Is the ConfigMap of PD updated? If the ConfigMap is updated, this may be related to #5708.

kos-team commented Sep 3, 2024

The ConfigMap is not updated.
pd_member_manager returns early at this line:

if err := m.syncTidbClusterStatus(tc, oldPDSet); err != nil {

which runs before the point where the ConfigMap is updated:
cm, err := m.syncPDConfigMap(tc, oldPDSet)

As mentioned in the issue report, syncTidbClusterStatus fails because of the inconsistent client config. The client should build its TLS config from the server's current configuration, not from the desired TLS config specified in the CR.
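
To make the control flow concrete, here is a compilable sketch of that ordering. Only the two method names match the lines quoted above; the types and bodies are illustrative stand-ins, not the operator's actual code:

package main

import (
	"errors"
	"fmt"
)

// Illustrative stand-ins so the sketch compiles.
type (
	pdMemberManager struct{}
	TidbCluster     struct{}
	StatefulSet     struct{}
	ConfigMap       struct{}
)

func (m *pdMemberManager) syncTidbClusterStatus(tc *TidbCluster, set *StatefulSet) error {
	// In the failing scenario, the PD health check behind this call uses a
	// client whose TLS config no longer matches the running PD server.
	return errors.New(`Get "https://basic-pd.default:2379/pd/api/v1/health": EOF`)
}

func (m *pdMemberManager) syncPDConfigMap(tc *TidbCluster, set *StatefulSet) (*ConfigMap, error) {
	return &ConfigMap{}, nil // would render and apply PD's new (non-TLS) config
}

// The status sync runs first, so its failure returns early and the ConfigMap
// sync is never reached: PD keeps its old TLS setting while the operator's
// client already uses the new one, and reconciliation stays stuck.
func (m *pdMemberManager) sync(tc *TidbCluster, oldPDSet *StatefulSet) error {
	if err := m.syncTidbClusterStatus(tc, oldPDSet); err != nil {
		return err // early return, before the ConfigMap update
	}
	cm, err := m.syncPDConfigMap(tc, oldPDSet)
	_ = cm
	return err
}

func main() {
	m := &pdMemberManager{}
	fmt.Println(m.sync(&TidbCluster{}, &StatefulSet{}))
}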

csuzhangxc commented

Got it! Let's think about how to resolve it with minimal changes.
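
For illustration only, one possible direction along those lines (purely a sketch, not a committed design; all names here are hypothetical): when the health check fails with the scheme derived from the CR, retry once with the opposite scheme, so the status sync can still reach a PD that has not been reconfigured yet.

package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
)

// healthCheckWithFallback sketches a minimal fallback: try the scheme implied
// by the CR first, and on failure retry once with the opposite scheme, since
// PD may still be running with the previous TLS setting.
func healthCheckWithFallback(host string, tlsEnabled bool, tlsCfg *tls.Config) (*http.Response, error) {
	try := func(useTLS bool) (*http.Response, error) {
		scheme, client := "http", &http.Client{}
		if useTLS {
			scheme = "https"
			client.Transport = &http.Transport{TLSClientConfig: tlsCfg}
		}
		return client.Get(fmt.Sprintf("%s://%s:2379/pd/api/v1/health", scheme, host))
	}
	resp, err := try(tlsEnabled)
	if err != nil {
		// Retry with the opposite scheme before reporting the sync as failed.
		resp, err = try(!tlsEnabled)
	}
	return resp, err
}

func main() {
	resp, err := healthCheckWithFallback("basic-pd.default", false, nil)
	if err != nil {
		fmt.Println("health check failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("health check status:", resp.Status)
}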
