Operator dropping wrong replicas when removing older replicas from cluster layout #1545

Open
jannikbend opened this issue Oct 28, 2024 · 2 comments

@jannikbend

jannikbend commented Oct 28, 2024

Description

When removing replicas that are not the most recently added ones, the operator tries to drop the wrong replicas from the cluster (using SYSTEM DROP REPLICA '...'). This only affects the replicas registered in ClickHouse, not the StatefulSets, Pods and other k8s resources, which are removed correctly.

Given a cluster config that looks like this:

clusters:
      - name: replicated
        layout:
          shards:
            - name: "0"
              replicas:
                - name: "0-0" # Old
                - name: "0-1" # Old
                - name: "0-2" # Old
                # New replicas on dedicated nodes
                - name: "0-0-dedicated"
                  templates:
                    podTemplate: clickhouse-dedicated
                - name: "0-1-dedicated"
                  templates:
                    podTemplate: clickhouse-dedicated
                - name: "0-2-dedicated"
                  templates:
                    podTemplate: clickhouse-dedicated

When we now try to remove the old replicas "0-0", "0-1" and "0-2", the operator instead tries to drop the replicas "0-0-dedicated", "0-1-dedicated" and "0-2-dedicated". It fails in doing so because it tries to execute the SQL statements on the already removed pods (E1024 08:42:53.714011 1 connection.go:194] Exec():FAILED Exec(http://operator/:***@chi-analytics-replicated-0-0.clickhouse.svc.cluster.local:8123/) doRequest: transport failed to send a request to ClickHouse: dial tcp: lookup chi-analytics-replicated-0-0.clickhouse.svc.cluster.local on 172.17.0.10:53: no such host for SQL: SYSTEM DROP REPLICA 'chi-analytics-replicated-0-0-dedicated').

The same behaviour can be observed when adding replicas: if we add a new replica anywhere other than the end of the replicas array, no schema migration is performed for it.
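
Both symptoms above would be consistent with the replica lists being compared by position rather than by name. This is only a guess based on the observed behaviour, not something confirmed from the operator source. A minimal sketch of the difference, where `diff_by_position` and `diff_by_name` are hypothetical helpers and not operator functions:

```python
def diff_by_position(old, new):
    """Hypothetical positional diff (an assumption, not the operator's code):
    only the trailing slots of the longer list count as added or removed."""
    removed = old[len(new):]
    added = new[len(old):]
    return added, removed


def diff_by_name(old, new):
    """Name-based diff: compares replica names regardless of their position."""
    old_names, new_names = set(old), set(new)
    added = [name for name in new if name not in old_names]
    removed = [name for name in old if name not in new_names]
    return added, removed


# Replica names taken from the config in this issue.
old = ["0-0", "0-1", "0-2", "0-0-dedicated", "0-1-dedicated", "0-2-dedicated"]
new = ["0-0-dedicated", "0-1-dedicated", "0-2-dedicated"]

print(diff_by_position(old, new))  # removed: the three "-dedicated" replicas
print(diff_by_name(old, new))      # removed: "0-0", "0-1", "0-2"
```

With a positional comparison, the last three entries of the old list are treated as removed, which matches the SYSTEM DROP REPLICA target seen in the log. A name-based comparison yields the replicas we actually deleted from the config.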

See this related slack post as well: https://altinitydbworkspace.slack.com/archives/C02K1MWEK2L/p1729859400145679

Reproduction

  1. Create a cluster with replicas
clusters:
      - name: replicated
        layout:
          shards:
            - name: "0"
              replicas:
                - name: "0-0"
                - name: "0-1"
                - name: "0-2"
  2. Add new replicas
clusters:
      - name: replicated
        layout:
          shards:
            - name: "0"
              replicas:
                - name: "0-0" # Old
                - name: "0-1" # Old
                - name: "0-2" # Old
                # New replicas on dedicated nodes
                - name: "0-0-dedicated"
                  templates:
                    podTemplate: clickhouse-dedicated
                - name: "0-1-dedicated"
                  templates:
                    podTemplate: clickhouse-dedicated
                - name: "0-2-dedicated"
                  templates:
                    podTemplate: clickhouse-dedicated
  3. Remove the replicas created in step 1
clusters:
      - name: replicated
        layout:
          shards:
            - name: "0"
              replicas:
                # New replicas on dedicated nodes
                - name: "0-0-dedicated"
                  templates:
                    podTemplate: clickhouse-dedicated
                - name: "0-1-dedicated"
                  templates:
                    podTemplate: clickhouse-dedicated
                - name: "0-2-dedicated"
                  templates:
                    podTemplate: clickhouse-dedicated
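
For reference, the statements we would expect after the last step can be sketched as follows. The replica naming pattern `chi-<chi>-<cluster>-<replica>` and the values "analytics" and "replicated" are inferred from the log line above and are assumptions; `expected_drop_statements` is a hypothetical helper for illustration only.

```python
def expected_drop_statements(chi, cluster, old, new):
    """Build the SYSTEM DROP REPLICA statements for replicas that are present
    in the old layout but missing from the new one (compared by name).
    The name pattern is inferred from the log line, not from operator code."""
    kept = set(new)
    return [
        f"SYSTEM DROP REPLICA 'chi-{chi}-{cluster}-{name}'"
        for name in old
        if name not in kept
    ]


# Layouts from reproduction steps 2 and 3.
before = ["0-0", "0-1", "0-2", "0-0-dedicated", "0-1-dedicated", "0-2-dedicated"]
after = ["0-0-dedicated", "0-1-dedicated", "0-2-dedicated"]

for statement in expected_drop_statements("analytics", "replicated", before, after):
    print(statement)
```

Under these assumptions the expected targets are the three old replicas, and the statements would need to run on one of the remaining "-dedicated" hosts rather than on the removed pods.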
@hodgesrm
Member

hodgesrm commented Oct 28, 2024

Thanks @jannikbend for posting. This behavior is surprising, to say the least. I would be interested to hear what @Slach and @sunsingerus think about this case.

@sunsingerus
Collaborator

Need to check in detail. Adding/deleting replicas is quite a common scenario, but maybe the devil is in the details...
