
Re-Creating node from scratch does not copy tables for the Postgres and Kafka engines #1455

Open
Hubbitus opened this issue Jul 12, 2024 · 56 comments

@Hubbitus

Hubbitus commented Jul 12, 2024

We use your operator to manage a ClickHouse cluster. Thank you.

After a hardware failure we reset the PVC (and the ZooKeeper namespace) to re-create one ClickHouse node.

Most of the metadata, such as views, materialized views and tables with most engines (MergeTree, ReplicatedMergeTree, etc.), was successfully re-created on the node and replication started.

However, none of the Postgres- and Kafka-based engine tables were recreated.
Is this a bug, or do we need to use some commands or hacks to sync all metadata across the cluster?

@alex-zaitsev
Member

@Hubbitus, are you using the latest release (0.23.6) or an earlier one?

@Hubbitus
Author

Hubbitus commented Jul 24, 2024

@alex-zaitsev, thank you for the response.

That was with an older version; we have since updated the operator. What is the correct way to re-init a node? Is it enough to just delete the PVC of the failed node and delete the pod?

@alex-zaitsev
Member

@Hubbitus, if you want to re-init an existing node, delete the STS, PVC and PV, and start a reconcile. Do you have multiple replicas?

@Hubbitus
Author

Hubbitus commented Jul 31, 2024

@alex-zaitsev, thank you for the reply.

I understand how to delete the objects, but what do you mean by "start reconcile"?

I have two replicas, chi-gid-gid-0-0-0 and chi-gid-gid-0-1-0, and chi-gid-gid-0-0-0 is now malfunctioning. I want to re-init it from the data in chi-gid-gid-0-1-0, and that should include syncing all of:

  • metadata (all types of objects: MergeTree tables, Postgres and Kafka engines, materialized views, etc.)
  • data, populated from replica chi-gid-gid-0-1-0
  • users and all permissions on the objects

@alex-zaitsev
Member

@Hubbitus, we have released 0.23.7, which is more aggressive about re-creating the schema. You may try deleting the PVC/PV completely and letting the operator re-create the objects.
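For example, something along these lines (a minimal sketch; the namespace and claim name are the ones that appear later in this thread, adjust to your setup):

# delete the volume of the failed replica and let the operator re-create the schema
kubectl delete pvc -n gidplatform-dev default-volume-claim-chi-gid-gid-0-0-0
# the bound PV is removed automatically when its reclaim policy is Delete;
# otherwise find it and delete it explicitly
kubectl get pv | grep default-volume-claim-chi-gid-gid-0-0-0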

@Hubbitus
Author

Hubbitus commented Sep 4, 2024

@alex-zaitsev, thank you very much!
Eventually I got it updated for our cluster:

kub_dev get pods --all-namespaces -o jsonpath="{.items[*].spec['initContainers', 'containers'][*].image}" -l app=clickhouse-operator                                                                                                     
altinity/clickhouse-operator:0.23.7 altinity/metrics-exporter:0.23.7

And did the following in ArgoCD:

  • Deleted PVC default-volume-claim-chi-gid-gid-0-0-0
  • Deleted pod chi-gid-gid-0-0-0

Then the PVC was re-created.

I see the pod is up and running.

  1. But there are a lot of errors like 2024.09.04 23:50:34.382651 [ 712 ] {} <Error> Access(user directories): from: 10.42.9.104, user: data_quality: Authentication failed: Code: 192. DB::Exception: There is no user data_quality in local_directory. (UNKNOWN_USER).... So users were not copied.
  2. Tables also look like they were not synced:
SELECT hostname() as node, COUNT(*)
FROM clusterAllReplicas('{cluster}', system.tables)
WHERE database NOT IN ('INFORMATION_SCHEMA', 'information_schema', 'system')
GROUP BY node
node count()
chi-gid-gid-0-1-0 620

There is also an error in the log like: 2024.09.04 23:52:49.039132 [ 714 ] {bb628508-db8e-4cf9-8307-a13133a185c9} <Error> PredefinedQueryHandler: Code: 60. DB::Exception: Table system.operator_compatible_metrics does not exist. (UNKNOWN_TABLE), so even in the system database some tables are missing...

So on the first node I see only the information_schema tables.

@alex-zaitsev
Member

alex-zaitsev commented Sep 20, 2024

Notes:

  1. Users are not replicated by the operator, since it cannot access sensitive data (like passwords). Use CHI/XML user management or a replicated user directory (see the sketch below):
<clickhouse>
  <user_directories replace="replace">
    <users_xml>
      <path>/etc/clickhouse-server/users.xml</path>
    </users_xml>
    <replicated>
      <zookeeper_path>/clickhouse/access/</zookeeper_path>
    </replicated>
    <local_directory>
       <path>/var/lib/clickhouse/access/</path>
    </local_directory>
  </user_directories>
</clickhouse>

Note that the order is important. local_directory may be skipped if you are not using it, but keep it if there are already users defined with CREATE USER, otherwise they will disappear entirely.

  2. Tables in the system database are not replicated either, since it is assumed there are no user tables there.
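Regarding point 1, a minimal sketch of creating a user that ends up in the replicated directory (assuming the user_directories config above is active on all replicas, with the pod/namespace names used elsewhere in this thread; the password is a placeholder and data_quality is the user from the earlier authentication error):

# with users_xml first (read-only) and replicated second, a SQL-created user
# should be stored in ZooKeeper and become visible on all replicas
kubectl exec -n gidplatform-dev chi-gid-gid-0-1-0 -- clickhouse-client -q \
  "CREATE USER IF NOT EXISTS data_quality IDENTIFIED WITH sha256_password BY 'change-me'"

Grants would then be added with GRANT statements as needed.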

Everything else should work, so the operator log is needed to check what went wrong.

The correct PVC recovery sequence is:

  1. Delete PVC (or PVC and STS)
  2. Run a reconcile by adding a taskID to the CHI, for instance as in the sketch at the end of this comment

It looks like, since you deleted only the PVC and the Pod, the recovery was handled by Kubernetes (the STS), and the operator did not even know that the PVC had been recreated. So make sure you delete the STS as well. Also consider using operator-managed persistence:

spec:
  defaults:
    storageManagement:
      provisioner: Operator
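A sketch of setting the taskID without editing the manifest by hand (the value is arbitrary; any change to it triggers a new reconcile; the CHI name and namespace are the ones used later in this thread):

kubectl patch chi gid -n gidplatform-dev --type=merge -p '{"spec":{"taskID":"recover-0-0-1"}}'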

@Hubbitus
Author

Hubbitus commented Sep 21, 2024

@alex-zaitsev, thank you very much for the answer. First I would like to recover my tables, then I will deal with users.

Today I eventually received rights to see the operator pod in the kube-system namespace.
Just after deleting the PVC and pod, I see errors in the clickhouse-operator pod:

I0921 22:13:23.555553       1 worker.go:275] processReconcilePod():gidplatform-dev/chi-gid-gid-0-0-0:Delete Pod. gidplatform-dev/chi-gid-gid-0-0-0
I0921 22:13:23.686901       1 worker.go:266] processReconcilePod():gidplatform-dev/chi-gid-gid-0-0-0:Add Pod. gidplatform-dev/chi-gid-gid-0-0-0
I0921 22:13:32.391425       1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid:Found applicable templates num: 0
I0921 22:13:32.391446       1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid:Applied templates num: 0
E0921 22:13:32.394908       1 connection.go:194] Exec():FAILED Exec(http://test_operator:***@chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local:8123/) doRequest: transport failed to send a request to ClickHouse: dial tcp 10.42.9.84:8123: connect: connection refused for
SQL: SYSTEM DROP DNS CACHE
W0921 22:13:32.394938       1 retry.go:52] exec():chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local:FAILED single try. No retries will be made for Applying sqls
I0921 22:13:32.414341       1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid:Found applicable templates num: 0
I0921 22:13:32.414363       1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid:Applied templates num: 0
I0921 22:13:32.415447       1 worker.go:387] gidplatform-dev/gid/b22b39fe-b7d8-40e3-a510-e169d1ffab18:updating endpoints for CHI-1 gid
I0921 22:13:32.450485       1 worker.go:389] gidplatform-dev/gid/b22b39fe-b7d8-40e3-a510-e169d1ffab18:IPs of the CHI-1 update endpoints gidplatform-dev/gid: len: 2 [10.42.9.84 10.42.5.92]
I0921 22:13:32.464127       1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid:Found applicable templates num: 0
I0921 22:13:32.464172       1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid:Applied templates num: 0
I0921 22:13:32.466517       1 worker.go:393] gidplatform-dev/gid/f2584b3a-a25a-4f22-8dfd-72f2a5166984:Update users IPS-1
I0921 22:13:32.481724       1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/f2584b3a-a25a-4f22-8dfd-72f2a5166984:Update ConfigMap gidplatform-dev/chi-gid-common-usersd
I0921 22:13:42.168333       1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid:Found applicable templates num: 0
I0921 22:13:42.168355       1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid:Applied templates num: 0
I0921 22:13:42.190633       1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid:Found applicable templates num: 0
I0921 22:13:42.190651       1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid:Applied templates num: 0
I0921 22:13:42.191751       1 worker.go:387] gidplatform-dev/gid/ef8a0da7-09d3-4890-9a59-c760233aedb5:updating endpoints for CHI-1 gid
I0921 22:13:42.215106       1 worker.go:389] gidplatform-dev/gid/ef8a0da7-09d3-4890-9a59-c760233aedb5:IPs of the CHI-1 update endpoints gidplatform-dev/gid: len: 2 [10.42.9.84 10.42.5.92]
I0921 22:13:42.224452       1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid:Found applicable templates num: 0
I0921 22:13:42.224470       1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid:Applied templates num: 0
I0921 22:13:42.225507       1 worker.go:393] gidplatform-dev/gid/d9105257-3cfe-4596-b3bf-0f6cd6935843:Update users IPS-1
I0921 22:13:42.235027       1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/d9105257-3cfe-4596-b3bf-0f6cd6935843:Update ConfigMap gidplatform-dev/chi-gid-common-usersd

@Hubbitus
Author

Hubbitus commented Sep 29, 2024

In the meantime, I have tried to reconcile the cluster by providing:

spec:
  taskID: "click-reconcile-1"

Indeed, that seems to trigger a reconcile. Logs of the operator pod:

kubectl -n kube-system logs --selector=app=clickhouse-operator --container=clickhouse-operator --tail=1000
I0929 11:54:59.076600       1 worker.go:574] ActionPlan start---------------------------------------------:
Diff start -------------------------
modified spec items num: 1
diff item [0]:'.TaskID' = '"click-reconcile-1"'
Diff end -------------------------

ActionPlan end---------------------------------------------
I0929 11:54:59.076655       1 worker-chi-reconciler.go:89] reconcileCHI():gidplatform-dev/gid/click-reconcile-1:ActionPlan has actions - continue reconcile
I0929 11:54:59.125555       1 worker.go:663] markReconcileStart():gidplatform-dev/gid/click-reconcile-1:reconcile started, task id: click-reconcile-1
I0929 11:54:59.681288       1 worker.go:820] FOUND host: ns:gidplatform-dev|chi:gid|clu:gid|sha:0|rep:0|host:0-0
I0929 11:54:59.681436       1 worker.go:820] FOUND host: ns:gidplatform-dev|chi:gid|clu:gid|sha:0|rep:1|host:0-1
I0929 11:54:59.681607       1 worker.go:844] RemoteServersGeneratorOptions: exclude hosts: [], attributes: status: , add: true, remove: false, modify: false, found: false, exclude: true
I0929 11:54:59.859367       1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-1:Update ConfigMap gidplatform-dev/chi-gid-common-configd
I0929 11:55:00.648852       1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-1:Update ConfigMap gidplatform-dev/chi-gid-common-usersd
I0929 11:55:01.284151       1 service.go:86] CreateServiceCluster():gidplatform-dev/gid/click-reconcile-1:gidplatform-dev/cluster-gid-gid
I0929 11:55:01.294688       1 worker-chi-reconciler.go:819] PDB updated: gidplatform-dev/gid-gid
I0929 11:55:01.294746       1 worker-chi-reconciler.go:554] not found ReconcileShardsAndHostsOptionsCtxKey, use empty opts
I0929 11:55:01.294769       1 worker-chi-reconciler.go:568] starting first shard separately
I0929 11:55:01.294967       1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local]
I0929 11:55:01.305993       1 worker-chi-reconciler.go:349] getHostClickHouseVersion():Get ClickHouse version on host: 0-0 version: 24.2.1.2248
I0929 11:55:01.306072       1 worker-chi-reconciler.go:684] reconcileHost():Reconcile Host start. Host: 0-0 ClickHouse version running: 24.2.1.2248
I0929 11:55:01.897135       1 worker.go:1565] getObjectStatusFromMetas():gidplatform-dev/chi-gid-gid-0-0:cur and new objects are equal based on object version label. Update of the object is not required. Object: gidplatform-dev/chi-gid-gid-0-0
I0929 11:55:01.897345       1 worker.go:1001] worker.go:1001:excludeHost():start:exclude host start
I0929 11:55:02.047624       1 worker.go:159] shouldForceRestartHost():Host restart is not required. Host: 0-0
I0929 11:55:02.047656       1 worker.go:1170] shouldExcludeHost():Host is the same, would not be updated, no need to exclude. Host/shard/cluster: 0/0/gid
I0929 11:55:02.047669       1 worker.go:1005] worker.go:1002:excludeHost():end:exclude host end
I0929 11:55:02.047693       1 worker.go:1020] worker.go:1020:completeQueries():start:complete queries start
I0929 11:55:02.047730       1 worker.go:1220] shouldWaitQueries():Will wait for queries to complete according to CHOp config 'reconcile.host.wait.queries' setting. Host is not yet in the cluster. Host/shard/cluster: 0/0/gid
I0929 11:55:02.047779       1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local]
I0929 11:55:02.087023       1 poller.go:138] Poll():gidplatform-dev/0-0:OK gidplatform-dev/0-0
I0929 11:55:02.087048       1 worker.go:1024] worker.go:1021:completeQueries():end:complete queries end
I0929 11:55:02.248789       1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-1:Update ConfigMap gidplatform-dev/chi-gid-deploy-confd-gid-0-0
I0929 11:55:02.884163       1 worker-chi-reconciler.go:716] reconcileHost():Reconcile PVCs and check possible data loss for host: 0-0
I0929 11:55:03.458635       1 worker-chi-reconciler.go:406] worker-chi-reconciler.go:406:reconcileHostStatefulSet():start:reconcile StatefulSet start
I0929 11:55:03.458764       1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local]
I0929 11:55:03.465752       1 worker-chi-reconciler.go:349] getHostClickHouseVersion():Get ClickHouse version on host: 0-0 version: 24.2.1.2248
I0929 11:55:03.472628       1 worker-chi-reconciler.go:412] reconcileHostStatefulSet():Reconcile host: 0-0. ClickHouse version: 24.2.1.2248
I0929 11:55:03.651853       1 worker.go:159] shouldForceRestartHost():Host restart is not required. Host: 0-0
I0929 11:55:03.651943       1 worker-chi-reconciler.go:425] reconcileHostStatefulSet():Reconcile host: 0-0. Reconcile StatefulSet
I0929 11:55:03.655273       1 worker.go:1565] getObjectStatusFromMetas():gidplatform-dev/chi-gid-gid-0-0:cur and new objects are equal based on object version label. Update of the object is not required. Object: gidplatform-dev/chi-gid-gid-0-0
I0929 11:55:04.097497       1 worker-chi-reconciler.go:445] worker-chi-reconciler.go:407:reconcileHostStatefulSet():end:reconcile StatefulSet end
I0929 11:55:04.654273       1 worker-chi-reconciler.go:900] reconcileService():gidplatform-dev/gid/click-reconcile-1:Service found: gidplatform-dev/chi-gid-gid-0-0. Will try to update
I0929 11:55:04.853666       1 worker.go:1459] updateService():gidplatform-dev/gid/click-reconcile-1:Update Service success: gidplatform-dev/chi-gid-gid-0-0
I0929 11:55:05.487521       1 worker-chi-reconciler.go:922] reconcileService():gidplatform-dev/gid/click-reconcile-1:Service reconcile successful: gidplatform-dev/chi-gid-gid-0-0
I0929 11:55:05.487592       1 worker-chi-reconciler.go:461] reconcileHostService():DONE Reconcile service of the host: 0-0
I0929 11:55:05.487682       1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local]
I0929 11:55:05.495665       1 worker-chi-reconciler.go:349] getHostClickHouseVersion():Get ClickHouse version on host: 0-0 version: 24.2.1.2248
I0929 11:55:05.495739       1 poller.go:138] Poll():gidplatform-dev/0-0:OK gidplatform-dev/0-0
I0929 11:55:05.495824       1 worker-chi-reconciler.go:753] reconcileHost():Check host for ClickHouse availability before migrating tables. Host: 0-0 ClickHouse version running: 24.2.1.2248
I0929 11:55:05.495957       1 worker.go:908] migrateTables():No need to add tables on host 0 to shard 0 in cluster gid
I0929 11:55:05.496005       1 worker.go:1057] includeHost():Include into cluster host 0 shard 0 cluster gid
I0929 11:55:05.496048       1 worker.go:1124] includeHostIntoClickHouseCluster():going to include host 0 shard 0 cluster gid
I0929 11:55:05.496070       1 worker.go:844] RemoteServersGeneratorOptions: exclude hosts: [], attributes: status: , add: true, remove: false, modify: false, found: false, exclude: true
I0929 11:55:05.648655       1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-1:Update ConfigMap gidplatform-dev/chi-gid-common-configd
I0929 11:55:06.449496       1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local]
I0929 11:55:06.463606       1 worker-chi-reconciler.go:349] getHostClickHouseVersion():Get ClickHouse version on host: 0-0 version: 24.2.1.2248
I0929 11:55:06.463648       1 poller.go:138] Poll():gidplatform-dev/0-0:OK gidplatform-dev/0-0
I0929 11:55:06.463703       1 worker-chi-reconciler.go:776] reconcileHost():Reconcile Host completed. Host: 0-0 ClickHouse version running: 24.2.1.2248
I0929 11:55:07.086061       1 worker-chi-reconciler.go:797] reconcileHost():[now: 2024-09-29 11:55:07.085979541 +0000 UTC m=+530555.182385088] ProgressHostsCompleted: 1 of 2
I0929 11:55:08.084486       1 worker-chi-reconciler.go:900] reconcileService():gidplatform-dev/gid/click-reconcile-1:Service found: gidplatform-dev/clickhouse-gid. Will try to update
I0929 11:55:08.253098       1 worker.go:1459] updateService():gidplatform-dev/gid/click-reconcile-1:Update Service success: gidplatform-dev/clickhouse-gid
I0929 11:55:08.883102       1 worker-chi-reconciler.go:922] reconcileService():gidplatform-dev/gid/click-reconcile-1:Service reconcile successful: gidplatform-dev/clickhouse-gid
I0929 11:55:08.883295       1 cluster.go:84] Run query on: chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]
I0929 11:55:08.889935       1 worker-chi-reconciler.go:349] getHostClickHouseVersion():Get ClickHouse version on host: 0-1 version: 24.2.1.2248
I0929 11:55:08.890015       1 worker-chi-reconciler.go:684] reconcileHost():Reconcile Host start. Host: 0-1 ClickHouse version running: 24.2.1.2248
I0929 11:55:09.524136       1 worker.go:1572] getObjectStatusFromMetas():gidplatform-dev/chi-gid-gid-0-1:cur and new objects ARE DIFFERENT based on object version label: Update of the object is required. Object: gidplatform-dev/chi-gid-gid-0-1
I0929 11:55:09.524219       1 worker.go:1001] worker.go:1001:excludeHost():start:exclude host start
I0929 11:55:09.647870       1 worker.go:159] shouldForceRestartHost():Host restart is not required. Host: 0-1
I0929 11:55:09.647935       1 worker.go:1177] shouldExcludeHost():Host should be excluded. Host/shard/cluster: 1/0/gid
I0929 11:55:09.647982       1 worker.go:1010] excludeHost():Exclude from cluster host 1 shard 0 cluster gid
I0929 11:55:10.090456       1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid/click-reconcile-1:Found applicable templates num: 0
I0929 11:55:10.090524       1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid/click-reconcile-1:Applied templates num: 0
I0929 11:55:10.132801       1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid/click-reconcile-1:Found applicable templates num: 0
I0929 11:55:10.132824       1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid/click-reconcile-1:Applied templates num: 0
I0929 11:55:10.134283       1 worker.go:387] gidplatform-dev/gid/click-reconcile-1:updating endpoints for CHI-1 gid
I0929 11:55:10.256392       1 worker.go:1099] excludeHostFromClickHouseCluster():going to exclude host 1 shard 0 cluster gid
I0929 11:55:10.256420       1 worker.go:844] RemoteServersGeneratorOptions: exclude hosts: [], attributes: status: , add: true, remove: false, modify: false, found: false, exclude: true
I0929 11:55:10.651725       1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-1:Update ConfigMap gidplatform-dev/chi-gid-common-configd
I0929 11:55:10.847886       1 worker.go:389] gidplatform-dev/gid/click-reconcile-1:IPs of the CHI-1 update endpoints gidplatform-dev/gid: len: 2 [10.42.9.86 10.42.5.48]
I0929 11:55:10.859857       1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid/click-reconcile-1:Found applicable templates num: 0
I0929 11:55:10.859903       1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid/click-reconcile-1:Applied templates num: 0
I0929 11:55:10.862438       1 worker.go:393] gidplatform-dev/gid/click-reconcile-1:Update users IPS-1
I0929 11:55:11.249384       1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-1:Update ConfigMap gidplatform-dev/chi-gid-common-usersd
I0929 11:55:11.887237       1 worker.go:1203] shouldWaitExcludeHost():wait to exclude host fallback to operator's settings. host 1 shard 0 cluster gid
I0929 11:55:11.896425       1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:55:16.902829       1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:55:21.913913       1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:55:26.921150       1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:55:31.928701       1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:55:36.936718       1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:55:41.945459       1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:55:46.954333       1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:55:51.962841       1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:55:56.971440       1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:56:01.978083       1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:56:06.984911       1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:56:11.996098       1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:56:11.996147       1 poller.go:170] Poll():gidplatform-dev/0-1:WAIT:gidplatform-dev/0-1
I0929 11:56:17.002241       1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:56:17.002279       1 poller.go:170] Poll():gidplatform-dev/0-1:WAIT:gidplatform-dev/0-1
I0929 11:56:22.008717       1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:56:22.008762       1 poller.go:170] Poll():gidplatform-dev/0-1:WAIT:gidplatform-dev/0-1
I0929 11:56:27.015747       1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:56:27.015810       1 poller.go:170] Poll():gidplatform-dev/0-1:WAIT:gidplatform-dev/0-1
I0929 11:56:32.024632       1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:56:32.024713       1 poller.go:170] Poll():gidplatform-dev/0-1:WAIT:gidplatform-dev/0-1
I0929 11:56:37.037036       1 schemer.go:137] IsHostInCluster():The host 0-1 is outside of the cluster
I0929 11:56:37.037107       1 poller.go:138] Poll():gidplatform-dev/0-1:OK gidplatform-dev/0-1
I0929 11:56:37.037132       1 worker.go:1015] worker.go:1002:excludeHost():end:exclude host end
I0929 11:56:37.037189       1 worker.go:1020] worker.go:1020:completeQueries():start:complete queries start
I0929 11:56:37.037281       1 worker.go:1220] shouldWaitQueries():Will wait for queries to complete according to CHOp config 'reconcile.host.wait.queries' setting. Host is not yet in the cluster. Host/shard/cluster: 1/0/gid
I0929 11:56:37.037353       1 cluster.go:84] Run query on: chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]
I0929 11:56:37.041809       1 poller.go:138] Poll():gidplatform-dev/0-1:OK gidplatform-dev/0-1
I0929 11:56:37.041827       1 worker.go:1024] worker.go:1021:completeQueries():end:complete queries end
I0929 11:56:37.048773       1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-1:Update ConfigMap gidplatform-dev/chi-gid-deploy-confd-gid-0-1
I0929 11:56:37.098510       1 worker-chi-reconciler.go:716] reconcileHost():Reconcile PVCs and check possible data loss for host: 0-1
I0929 11:56:37.119348       1 worker-chi-reconciler.go:406] worker-chi-reconciler.go:406:reconcileHostStatefulSet():start:reconcile StatefulSet start
I0929 11:56:37.119427       1 cluster.go:84] Run query on: chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]
I0929 11:56:37.123489       1 worker-chi-reconciler.go:349] getHostClickHouseVersion():Get ClickHouse version on host: 0-1 version: 24.2.1.2248
I0929 11:56:37.127378       1 worker-chi-reconciler.go:412] reconcileHostStatefulSet():Reconcile host: 0-1. ClickHouse version: 24.2.1.2248
I0929 11:56:37.131620       1 worker.go:159] shouldForceRestartHost():Host restart is not required. Host: 0-1
I0929 11:56:37.131650       1 worker-chi-reconciler.go:425] reconcileHostStatefulSet():Reconcile host: 0-1. Reconcile StatefulSet
I0929 11:56:37.133351       1 worker.go:1565] getObjectStatusFromMetas():gidplatform-dev/chi-gid-gid-0-1:cur and new objects are equal based on object version label. Update of the object is not required. Object: gidplatform-dev/chi-gid-gid-0-1
I0929 11:56:37.168247       1 worker-chi-reconciler.go:445] worker-chi-reconciler.go:407:reconcileHostStatefulSet():end:reconcile StatefulSet end
I0929 11:56:37.653395       1 worker-chi-reconciler.go:900] reconcileService():gidplatform-dev/gid/click-reconcile-1:Service found: gidplatform-dev/chi-gid-gid-0-1. Will try to update
I0929 11:56:37.849923       1 worker.go:1459] updateService():gidplatform-dev/gid/click-reconcile-1:Update Service success: gidplatform-dev/chi-gid-gid-0-1
I0929 11:56:38.491295       1 worker-chi-reconciler.go:922] reconcileService():gidplatform-dev/gid/click-reconcile-1:Service reconcile successful: gidplatform-dev/chi-gid-gid-0-1
I0929 11:56:38.491349       1 worker-chi-reconciler.go:461] reconcileHostService():DONE Reconcile service of the host: 0-1
I0929 11:56:38.491418       1 cluster.go:84] Run query on: chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]
I0929 11:56:38.495556       1 worker-chi-reconciler.go:349] getHostClickHouseVersion():Get ClickHouse version on host: 0-1 version: 24.2.1.2248
I0929 11:56:38.495593       1 poller.go:138] Poll():gidplatform-dev/0-1:OK gidplatform-dev/0-1
I0929 11:56:38.495629       1 worker-chi-reconciler.go:753] reconcileHost():Check host for ClickHouse availability before migrating tables. Host: 0-1 ClickHouse version running: 24.2.1.2248
I0929 11:56:38.495686       1 worker.go:908] migrateTables():No need to add tables on host 1 to shard 0 in cluster gid
I0929 11:56:38.495706       1 worker.go:1057] includeHost():Include into cluster host 1 shard 0 cluster gid
I0929 11:56:38.495726       1 worker.go:1124] includeHostIntoClickHouseCluster():going to include host 1 shard 0 cluster gid
I0929 11:56:38.495737       1 worker.go:844] RemoteServersGeneratorOptions: exclude hosts: [], attributes: status: , add: true, remove: false, modify: false, found: false, exclude: true
I0929 11:56:38.654056       1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-1:Update ConfigMap gidplatform-dev/chi-gid-common-configd
I0929 11:56:39.689499       1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid/click-reconcile-1:Found applicable templates num: 0
I0929 11:56:39.689543       1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid/click-reconcile-1:Applied templates num: 0
I0929 11:56:39.711932       1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid/click-reconcile-1:Found applicable templates num: 0
I0929 11:56:39.711952       1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid/click-reconcile-1:Applied templates num: 0
I0929 11:56:39.713061       1 worker.go:387] gidplatform-dev/gid/click-reconcile-1:updating endpoints for CHI-1 gid
I0929 11:56:39.851639       1 cluster.go:84] Run query on: chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]
I0929 11:56:39.853763       1 worker-chi-reconciler.go:349] getHostClickHouseVersion():Get ClickHouse version on host: 0-1 version: 24.2.1.2248
I0929 11:56:39.853841       1 poller.go:138] Poll():gidplatform-dev/0-1:OK gidplatform-dev/0-1
I0929 11:56:39.853942       1 worker-chi-reconciler.go:776] reconcileHost():Reconcile Host completed. Host: 0-1 ClickHouse version running: 24.2.1.2248
I0929 11:56:40.449305       1 worker.go:389] gidplatform-dev/gid/click-reconcile-1:IPs of the CHI-1 update endpoints gidplatform-dev/gid: len: 2 [10.42.9.86 10.42.5.48]
I0929 11:56:40.460088       1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid/click-reconcile-1:Found applicable templates num: 0
I0929 11:56:40.460129       1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid/click-reconcile-1:Applied templates num: 0
I0929 11:56:40.462470       1 worker.go:393] gidplatform-dev/gid/click-reconcile-1:Update users IPS-1
I0929 11:56:40.849312       1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-1:Update ConfigMap gidplatform-dev/chi-gid-common-usersd
I0929 11:56:41.078096       1 worker-chi-reconciler.go:797] reconcileHost():[now: 2024-09-29 11:56:41.078003076 +0000 UTC m=+530649.174408624] ProgressHostsCompleted: 2 of 2
I0929 11:56:43.083018       1 worker-chi-reconciler.go:581] Starting rest of shards on workers: 1
I0929 11:56:43.249032       1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-1:Update ConfigMap gidplatform-dev/chi-gid-common-configd
I0929 11:56:43.885956       1 worker-deleter.go:43] clean():gidplatform-dev/gid/click-reconcile-1:remove items scheduled for deletion
I0929 11:56:44.481307       1 worker-deleter.go:46] clean():gidplatform-dev/gid/click-reconcile-1:List of objects which have failed to reconcile:
I0929 11:56:44.481378       1 worker-deleter.go:47] clean():gidplatform-dev/gid/click-reconcile-1:List of successfully reconciled objects:
PVC: gidplatform-dev/default-volume-claim-chi-gid-gid-0-0-0
PVC: gidplatform-dev/default-volume-claim-chi-gid-gid-0-1-0
StatefulSet: gidplatform-dev/chi-gid-gid-0-1
StatefulSet: gidplatform-dev/chi-gid-gid-0-0
Service: gidplatform-dev/chi-gid-gid-0-0
Service: gidplatform-dev/clickhouse-gid
Service: gidplatform-dev/chi-gid-gid-0-1
ConfigMap: gidplatform-dev/chi-gid-common-configd
ConfigMap: gidplatform-dev/chi-gid-common-usersd
ConfigMap: gidplatform-dev/chi-gid-deploy-confd-gid-0-0
ConfigMap: gidplatform-dev/chi-gid-deploy-confd-gid-0-1
PDB: gidplatform-dev/gid-gid
I0929 11:56:45.252969       1 worker-deleter.go:50] clean():gidplatform-dev/gid/click-reconcile-1:Existing objects:
PVC: gidplatform-dev/default-volume-claim-chi-gid-gid-0-0-0
PVC: gidplatform-dev/default-volume-claim-chi-gid-gid-0-1-0
PDB: gidplatform-dev/gid-gid
StatefulSet: gidplatform-dev/chi-gid-gid-0-0
StatefulSet: gidplatform-dev/chi-gid-gid-0-1
ConfigMap: gidplatform-dev/chi-gid-common-configd
ConfigMap: gidplatform-dev/chi-gid-common-usersd
ConfigMap: gidplatform-dev/chi-gid-deploy-confd-gid-0-0
ConfigMap: gidplatform-dev/chi-gid-deploy-confd-gid-0-1
Service: gidplatform-dev/chi-gid-gid-0-0
Service: gidplatform-dev/chi-gid-gid-0-1
Service: gidplatform-dev/clickhouse-gid
I0929 11:56:45.253123       1 worker-deleter.go:52] clean():gidplatform-dev/gid/click-reconcile-1:Non-reconciled objects:
I0929 11:56:45.253195       1 worker-deleter.go:68] worker-deleter.go:68:dropReplicas():start:gidplatform-dev/gid/click-reconcile-1:drop replicas based on AP
I0929 11:56:45.253260       1 worker-deleter.go:80] worker-deleter.go:80:dropReplicas():end:gidplatform-dev/gid/click-reconcile-1:processed replicas: 0
I0929 11:56:45.253308       1 worker.go:640] addCHIToMonitoring():gidplatform-dev/gid/click-reconcile-1:add CHI to monitoring
I0929 11:56:45.885652       1 worker.go:595] worker.go:595:waitForIPAddresses():start:gidplatform-dev/gid/click-reconcile-1:wait for IP addresses to be assigned to all pods
I0929 11:56:45.893820       1 worker.go:600] gidplatform-dev/gid/click-reconcile-1:all IP addresses are in place
I0929 11:56:45.893858       1 worker.go:673] worker.go:673:finalizeReconcileAndMarkCompleted():start:gidplatform-dev/gid/click-reconcile-1:finalize reconcile
I0929 11:56:45.904253       1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid/click-reconcile-1:Found applicable templates num: 0
I0929 11:56:45.904335       1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid/click-reconcile-1:Applied templates num: 0
I0929 11:56:45.904391       1 controller.go:617] OK update watch (gidplatform-dev/gid): {"namespace":"gidplatform-dev","name":"gid","labels":{"argocd.argoproj.io/instance":"bi-clickhouse-dev","k8slens-edit-resource-version":"v1"},"annotations":{},"clusters":[{"name":"gid","hosts":[{"name":"0-0","hostname":"chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local","tcpPort":9000,"httpPort":8123},{"name":"0-1","hostname":"chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local","tcpPort":9000,"httpPort":8123}]}]}
I0929 11:56:45.906676       1 worker.go:677] gidplatform-dev/gid/click-reconcile-1:updating endpoints for CHI-2 gid
I0929 11:56:46.249853       1 worker.go:679] gidplatform-dev/gid/click-reconcile-1:IPs of the CHI-2 finalize reconcile gidplatform-dev/gid: len: 2 [10.42.9.86 10.42.5.48]
I0929 11:56:46.261380       1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid/click-reconcile-1:Found applicable templates num: 0
I0929 11:56:46.261442       1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid/click-reconcile-1:Applied templates num: 0
I0929 11:56:46.263792       1 worker.go:683] gidplatform-dev/gid/click-reconcile-1:Update users IPS-2
I0929 11:56:46.449574       1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-1:Update ConfigMap gidplatform-dev/chi-gid-common-usersd
I0929 11:56:47.495545       1 worker.go:707] finalizeReconcileAndMarkCompleted():gidplatform-dev/gid/click-reconcile-1:reconcile completed successfully, task id: click-reconcile-1
I0929 11:56:48.077981       1 worker-chi-reconciler.go:134] worker-chi-reconciler.go:60:reconcileCHI():end:gidplatform-dev/gid/click-reconcile-1
I0929 11:56:48.078036       1 worker.go:469] worker.go:432:updateCHI():end:gidplatform-dev/gid/click-reconcile-1

Not sure what is going wrong, but on host chi-gid-gid-0-0-0 not even the databases were copied; only the single default database is present.

@Hubbitus
Author

@alex-zaitsev, could you please look into it?

@Slach
Collaborator

Slach commented Oct 14, 2024

I0929 11:55:05.495957 [worker.go:908] migrateTables():No need to add tables on host 0 to shard 0 in cluster gid

I0929 11:56:38.495686 [worker.go:908] migrateTables():No need to add tables on host 1 to shard 0 in cluster gid

@Hubbitus, does your cluster have 2 shards with only 1 replica inside each shard?

Could you share:
kubectl get chi -n gidplatform-dev gid -o yaml
without sensitive credentials?

@Hubbitus
Author

@Slach, thanks for the response.
We do not use sharding yet.

Output of kubectl get chi -n gidplatform-dev gid -o yaml:
chi.yaml.gz

@Slach
Collaborator

Slach commented Oct 17, 2024

@Hubbitus
Could you share the result of the following clickhouse-client query?
SELECT database, table, engine_full, count() c FROM cluster('all-sharded',system.tables) WHERE database NOT IN ('system','INFORMATION_SCHEMA','information_schema') GROUP BY ALL HAVING c<2

@Hubbitus
Author

Sure (limited to 10 rows, 269 in total):

database table engine_full c
datamart appmarket__public__widget PostgreSQL(appmarket_db, table = 'widget', schema = 'public') 1
datamart bonus__public__promotion PostgreSQL(bonus_db, table = 'promotion', schema = 'public') 1
sandbox gid_mt_sessions ReplicatedMergeTree('/clickhouse/tables/ad0a75c4-1aa7-4386-a542-c16c19f2b2c6/{shard}', '{replica}') ORDER BY tsEvent SETTINGS index_granularity = 8192 1
_source scs__public__story__foreign PostgreSQL(scs_db, table = 'story', schema = 'public') 1
datamart loyalty__public__level PostgreSQL(loyalty_db, table = 'level', schema = 'public') 1
_source feed__public__reaction__foreign PostgreSQL(feed_db, table = 'reaction', schema = 'public') 1
_loopback nomail_account_register ReplicatedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}') PARTITION BY tuple() ORDER BY time SETTINGS index_granularity = 8192 1
_source questionnaires__public__anketa_access_group__foreign PostgreSQL(questionnaire_db, table = 'anketa_access_group', schema = 'public') 1
sandbox gid_mt_activities ReplicatedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}') ORDER BY dtEvent SETTINGS index_granularity = 8192 1
_source lms__public__lms_user_courses_progress__foreign PostgreSQL(lms_db, table = 'lms_user_courses_progress', schema = 'public') 1

@Slach
Collaborator

Slach commented Oct 19, 2024

You shared logs for 29 Sep 2024 starting from 11:55 UTC.
Did your node lose its PVC data before or after this date?

@Hubbitus
Author

Hello.
The last logs I shared were from 15 October, after one more attempt to recover by deleting the PVC and STS.

@Slach
Collaborator

Slach commented Oct 21, 2024

@Hubbitus
#1455 (comment)
There are logs only for 29 Sep 2024.

I don't see logs from 15 Oct 2024.

I need to be sure you tried to reconcile after dropping the PVC and STS.
Did you change spec.taskID in the CHI manually to trigger a reconcile after deleting the PVC and STS?

@Hubbitus
Author

did you change in CHI spec.taskID manually to trigger reconcile after delete PVC and STS?

Yes. At the suggestion of @alex-zaitsev I introduced the taskID parameter there and increased its number on each clean attempt.

@Slach
Collaborator

Slach commented Oct 24, 2024

Please share the clickhouse-operator logs for 15 Oct related to your changes.

@Hubbitus
Author

Hubbitus commented Nov 2, 2024

Hello.

I do not have logs that old.

But I switched to a branch where taskID: "click-reconcile-3" is set, and it looks like the reconcile started automatically.
Relevant part of the output (slightly obfuscated):

kubectl -n kube-system logs --selector=app=clickhouse-operator --container=clickhouse-operator

operator.2024-11-02T17:47:14+03:00.obfuscated.log

Output of

SELECT database, table, engine_full, count() c, hostname()
FROM
	cluster('{cluster}',system.tables)
WHERE
	database NOT IN ('system','INFORMATION_SCHEMA','information_schema')
GROUP BY ALL
HAVING c<2

It contains 515 rows. The beginning of it:

database table engine_full c hostname()
datamart v_subs__public__channel_requests 1 chi-gid-gid-0-1-0
cdc api__public__reaction ReplicatedReplacingMergeTree('/clickhouse/{cluster}/cdc/tables/api__public__reaction/{shard}', '{replica}') PRIMARY KEY id ORDER BY id SETTINGS index_granularity = 8192 1 chi-gid-gid-0-1-0
_source bonus_to_gid__user_mappings Kafka(kafka_integration, kafka_topic_list = 'dev__bonus_to_gid__user_mappings', kafka_group_name = 'dev__bonus_to_gid__user_mappings') SETTINGS format_avro_schema_registry_url = 'http://gid-integration-partner-kafka.gid.team:8081' 1 chi-gid-gid-0-1-0
datamart v_calendar__public__event_type 1 chi-gid-gid-0-1-0
_raw api__public__questionnaire_result__dbt_materialized ReplicatedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}') ORDER BY id SETTINGS replicated_deduplication_window = 0, index_granularity = 8192 1 chi-gid-gid-0-1-0
datamart v_jiradatabase__public__ao_54307e_slaauditlog 1 chi-gid-gid-0-1-0
datamart v_appmarket__public__widget_notification 1 chi-gid-gid-0-1-0
_raw feed__public__feed_comment__dbt_materialized ReplicatedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}') ORDER BY id SETTINGS replicated_deduplication_window = 0, index_granularity = 8192 1 chi-gid-gid-0-1-0
_raw lms__public__lms_courses_chapters__dbt_materialized ReplicatedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}') ORDER BY id SETTINGS replicated_deduplication_window = 0, index_granularity = 8192 1 chi-gid-gid-0-1-0
datamart tmp_gazprombonus_user_bonus_to_gid_mapping_inner ReplicatedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}') ORDER BY id SETTINGS index_granularity = 1024 1 chi-gid-gid-0-1-0
datamart api__poll_vote 1 chi-gid-gid-0-1-0
_source loyalty__public__achievement__foreign PostgreSQL(loyalty_db, table = 'achievement', schema = 'public') 1 chi-gid-gid-0-1-0
datamart v_calendar__public__like 1 chi-gid-gid-0-1-0
_source calendar__public__event_type__foreign PostgreSQL(calendar_db, table = 'event_type', schema = 'public') 1 chi-gid-gid-0-1-0

@Slach
Collaborator

Slach commented Nov 2, 2024

According to the logs, you just triggered a reconcile for -0-0-0 while the STS was not deleted.

Try:

kubectl delete sts -n gidplatform-dev chi-gid-gid-0-0
kubectl delete pvc -n gidplatform-dev -l clickhouse.altinity.com/cluster=gid,clickhouse.altinity.com/shard=0,clickhouse.altinity.com/replica=0

kubectl edit chi -n gidplatform-dev gid

Edit spec.taskID to manual-4 and watch the reconcile process again. When the STS and PVC are not found during the reconcile, the operator shall propagate the schema.
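To watch the reconcile progress from the CHI side, something like this can be used (a sketch; the printed columns depend on the operator CRD's printer columns):

kubectl get chi -n gidplatform-dev gid -w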

@Hubbitus
Author

Hubbitus commented Nov 2, 2024

OK, thank you.
Doing it again:

  1. Delete STS and PVC:
$ kubectl delete sts -n gidplatform-dev chi-gid-gid-0-0
statefulset.apps "chi-gid-gid-0-0" deleted
$ kubectl delete pvc -n gidplatform-dev -l clickhouse.altinity.com/cluster=gid,clickhouse.altinity.com/shard=0,clickhouse.altinity.com/replica=0
persistentvolumeclaim "default-volume-claim-chi-gid-gid-0-0-0" deleted
  2. Pushed a commit with taskID: "click-reconcile-4" and ran a sync in ArgoCD with prune.

I think the relevant logs are:


I1102 16:33:32.889891       1 worker-chi-reconciler.go:89] reconcileCHI():gidplatform-dev/gid/click-reconcile-4:ActionPlan has actions - continue reconcile
I1102 16:33:32.934904       1 worker.go:663] markReconcileStart():gidplatform-dev/gid/click-reconcile-4:reconcile started, task id: click-reconcile-4
I1102 16:33:33.446722       1 worker.go:820] FOUND host: ns:gidplatform-dev|chi:gid|clu:gid|sha:0|rep:0|host:0-0
I1102 16:33:33.446764       1 worker.go:820] FOUND host: ns:gidplatform-dev|chi:gid|clu:gid|sha:0|rep:1|host:0-1
I1102 16:33:33.446914       1 worker.go:844] RemoteServersGeneratorOptions: exclude hosts: [], attributes: status: , add: true, remove: false, modify: false, found: false, exclude: true
I1102 16:33:33.642314       1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-4:Update ConfigMap gidplatform-dev/chi-gid-common-configd
I1102 16:33:34.443758       1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-4:Update ConfigMap gidplatform-dev/chi-gid-common-usersd
I1102 16:33:35.086985       1 service.go:86] CreateServiceCluster():gidplatform-dev/gid/click-reconcile-4:gidplatform-dev/cluster-gid-gid
I1102 16:33:35.104209       1 worker-chi-reconciler.go:819] PDB updated: gidplatform-dev/gid-gid
I1102 16:33:35.104304       1 worker-chi-reconciler.go:554] not found ReconcileShardsAndHostsOptionsCtxKey, use empty opts
I1102 16:33:35.104344       1 worker-chi-reconciler.go:568] starting first shard separately
I1102 16:33:35.104638       1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local]
E1102 16:33:35.112714       1 connection.go:145] QueryContext():FAILED Query(http://test_operator:***@chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local:8123/) doRequest: transport failed to send a request to ClickHouse: dial tcp: lookup chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local on 10.43.0.10:53: no such host for SQL: SELECT version()
W1102 16:33:35.112777       1 cluster.go:91] QueryAny():FAILED to run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local] skip to next. err: doRequest: transport failed to send a request to ClickHouse: dial tcp: lookup chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local on 10.43.0.10:53: no such host
E1102 16:33:35.112846       1 cluster.go:95] QueryAny():FAILED to run query on all hosts [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local]
W1102 16:33:35.112926       1 worker-chi-reconciler.go:345] getHostClickHouseVersion():Failed to get ClickHouse version on host: 0-0
W1102 16:33:35.112980       1 worker-chi-reconciler.go:690] reconcileHost():Reconcile Host start. Host: 0-0 Failed to get ClickHouse version: failed to query
W1102 16:33:35.692945       1 worker.go:1537] gidplatform-dev/chi-gid-gid-0-0:No cur StatefulSet available but host has an ancestor. Found deleted StatefulSet. for gidplatform-dev/chi-gid-gid-0-0

So the operator can't resolve the hostname of the node:

doRequest: transport failed to send a request to ClickHouse: dial tcp: lookup chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local on 10.43.0.10:53: no such host for SQL: SELECT version()

Indeed, the hostname chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local looks strange; ClickHouse knows it under a different name:

SELECT cluster, host_name
FROM system.clusters
WHERE cluster = 'gid'
cluster host_name
gid chi-gid-gid-0-0
gid chi-gid-gid-0-1

@Slach
Collaborator

Slach commented Nov 2, 2024

You did not share the full logs, just the first error message you found. That error message is expected: because you deleted the STS, the Kubernetes Service name will not resolve.

The hostname contains the SERVICE name, not the pod name.

Were the STS chi-gid-gid-0-0 and the PVC re-created?
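For reference, a quick way to check that they came back (a sketch using the names from this thread; the per-host Service and the STS share the chi-gid-gid-0-0 name):

kubectl get sts,svc chi-gid-gid-0-0 -n gidplatform-dev
kubectl get pvc default-volume-claim-chi-gid-gid-0-0-0 -n gidplatform-dev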

Could you share the full operator logs?

@Hubbitus
Author

Hubbitus commented Nov 2, 2024

Sure. I found many more errors in the log:
operator.2024-11-02T19:40:16+03:00.obfuscated.log

@Slach
Collaborator

Slach commented Nov 3, 2024

@sunsingerus, according to the shared logs:

The first reconcile was applied at 2024-11-02 14:44:41, while the STS + PVC still existed:

I1102 14:44:44.267254 1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local]
I1102 14:44:44.276002 1 worker-chi-reconciler.go:349] getHostClickHouseVersion():Get ClickHouse version on host: 0-0 version: 24.2.1.2248
I1102 14:44:44.276073 1 worker-chi-reconciler.go:684] reconcileHost():Reconcile Host start. Host: 0-0 ClickHouse version running: 24.2.1.2248

Second try, after STS + PVC deletion:

I1102 16:30:39.627295 1 worker.go:275] processReconcilePod():gidplatform-dev/chi-gid-gid-0-0-0:Delete Pod. gidplatform-dev/chi-gid-gid-0-0-0
I1102 16:33:32.819212 1 controller.go:572] ENQUEUE new ReconcileCHI cmd=update for gidplatform-dev/gid

I1102 16:33:32.934904 1 worker.go:663] markReconcileStart():gidplatform-dev/gid/click-reconcile-4:reconcile started, task id: click-reconcile-4

The STS and PVC were deleted:

W1102 16:33:35.692945 1 worker.go:1537] gidplatform-dev/chi-gid-gid-0-0:No cur StatefulSet available but host has an ancestor. Found deleted StatefulSet. for gidplatform-dev/chi-gid-gid-0-0
I1102 16:33:35.839840 1 worker.go:1177] shouldExcludeHost():Host should be excluded. Host/shard/cluster: 0/0/gid
I1102 16:33:35.839914 1 worker.go:1010] excludeHost():Exclude from cluster host 0 shard 0 cluster gid
I1102 16:33:37.880392 1 worker-chi-reconciler.go:716] reconcileHost():Reconcile PVCs and check possible data loss for host: 0-0

The PVC is re-created:

I1102 16:33:38.042697 1 worker-chi-reconciler.go:1251] PVC (gidplatform-dev/0-0/default-volume-claim/default-volume-claim-chi-gid-gid-0-0-0) not found and model will not be provided by the operator
W1102 16:33:38.042849 1 worker-chi-reconciler.go:1162] PVC is either newly added to the host or was lost earlier (gidplatform-dev/0-0/default-volume-claim/pvc-name-unknown-pvc-not-exist)

Migration is forced and the StatefulSet is re-created:

I1102 16:33:38.043010 1 worker-chi-reconciler.go:730] reconcileHost():Data loss detected for host: 0-0. Will do force migrate
I1102 16:33:38.043073 1 worker-chi-reconciler.go:406] worker-chi-reconciler.go:406:reconcileHostStatefulSet():start:reconcile StatefulSet start

I1102 16:33:38.440090 1 worker.go:1596] createStatefulSet():Create StatefulSet gidplatform-dev/chi-gid-gid-0-0 - started
I1102 16:33:39.086823 1 creator.go:35] createStatefulSet()
I1102 16:33:39.086858 1 creator.go:44] Create StatefulSet gidplatform-dev/chi-gid-gid-0-0

I1102 16:34:09.634311 1 worker.go:1615] createStatefulSet():Create StatefulSet gidplatform-dev/chi-gid-gid-0-0 - completed

Preparing for the table migration attempt:

I1102 16:34:11.228817 1 worker-chi-reconciler.go:753] reconcileHost():Check host for ClickHouse availability before migrating tables. Host: 0-0 ClickHouse version running: 24.2.1.2248

Trying to drop data from ZK:

@sunsingerus: 0-0-0 is empty and doesn't contain any table definitions, so I think SYSTEM DROP REPLICA 'chi-gid-gid-0-0-0' will do nothing, and this is the root cause.

I1102 16:34:11.228960 1 schemer.go:56] HostDropReplica():Drop replica: chi-gid-gid-0-0 at 0-0
I1102 16:34:11.236511 1 worker-deleter.go:414] dropReplica():Drop replica host: 0-0 in cluster: gid

Getting SQL object definitions:

I1102 16:34:12.415941 1 replicated.go:35] shouldCreateReplicatedObjects():SchemaPolicy.Shard says we need replicated objects. Should create replicated objects for the shard: [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]
I1102 16:34:12.416343 1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]
I1102 16:34:12.437761 1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]
I1102 16:34:12.717044 1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]
I1102 16:34:12.756658 1 distributed.go:39] shouldCreateDistributedObjects():Should create distributed objects in the cluster: [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]
I1102 16:34:12.756850 1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]
I1102 16:34:12.786098 1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]
I1102 16:34:13.018954 1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]

Trying to restore, which fails because the ZK data is still present:

I1102 16:34:13.035413 1 schemer.go:98] HostCreateTables():Creating replicated objects at 0-0: [_loopback _raw service _source temp ....]
E1102 16:34:13.089822 1 connection.go:194] Exec():FAILED Exec(http://test_operator:***@chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local:8123/) Code: 253, Message: Replica /clickhouse/tables/edf41bd4-46aa-4341-bed7-2e19b838e9e1/0/replicas/chi-gid-gid-0-0 already exists for SQL: CREATE TABLE IF NOT EXISTS _loopback.nomail_account_register UUID 'edf41bd4-46aa-4341-bed7-2e19b838e9e1' ....
I1102 16:34:13.089918 1 cluster.go:160] func1():chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local:Replica is already in ZooKeeper. Trying ATTACH TABLE instead

We need to choose 0-1-0 for executing SYSTEM DROP REPLICA...
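That is, something like the following, run against the healthy replica (the same command is suggested later in this thread):

kubectl exec -n gidplatform-dev chi-gid-gid-0-1-0 -- clickhouse-client -q "SYSTEM DROP REPLICA 'chi-gid-gid-0-0'"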

@Slach
Collaborator

Slach commented Nov 3, 2024

@Hubbitus, after the reconcile most of the tables should be restored (via ATTACH), but some tables were not restored, with a strange difference:

`slo_value` Decimal(15, 5)	DEFAULT	0

in zookeeper
and

`slo_value` Decimal(15, 5)	

in local SQL

E1102 16:36:08.836812       1 connection.go:194] Exec():FAILED Exec(http://test_operator:***@chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local:8123/) Code: 122, Message: Table columns structure in ZooKeeper is different from local table structure. Local columns:
columns format version: 1
14 columns:
...
Zookeeper columns:
columns format version: 1
14 columns:
...
for SQL: CREATE TABLE IF NOT EXISTS _raw.victoriametrics__slo__metrics__airflow_hour_agg_old UUID '46a1c218-9274-446d-9300-e644bbd4cc0e' ....
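One way to compare the two definitions is to inspect the columns/metadata nodes that ReplicatedMergeTree keeps in ZooKeeper for this table (a sketch, assuming the querying user can read system.zookeeper and using the table UUID from the error above):

kubectl exec -n gidplatform-dev chi-gid-gid-0-1-0 -- clickhouse-client -q "
  SELECT name, value FROM system.zookeeper
  WHERE path = '/clickhouse/tables/46a1c218-9274-446d-9300-e644bbd4cc0e/0'
    AND name IN ('columns', 'metadata')
  FORMAT Vertical"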
 

@Slach
Collaborator

Slach commented Nov 3, 2024

@Hubbitus, could you share:

kubectl exec -n gidplatform-dev chi-gid-gid-0-1-0 -- clickhouse-client -q "SHOW CREATE TABLE _raw.victoriametrics__slo__metrics__airflow_hour_agg_old FORMAT Vertical"

and

kubectl exec -n gidplatform-dev chi-gid-gid-0-0-0 -- clickhouse-client -q "SHOW CREATE TABLE _raw.victoriametrics__slo__metrics__airflow_hour_agg_old FORMAT Vertical"

@Hubbitus
Author

Hubbitus commented Nov 3, 2024

@Slach, sure (column and table comments stripped):

$ kubectl exec -n gidplatform-dev chi-gid-gid-0-1-0 -- clickhouse-client -q "SHOW CREATE TABLE _raw.victoriametrics__slo__metrics__airflow_hour_agg_old FORMAT Vertical"

Row 1:
──────
statement: CREATE TABLE _raw.victoriametrics__slo__metrics__airflow_hour_agg_old
(
    `slo_metric` LowCardinality(String),
    `slo_service` LowCardinality(String),
    `slo_namespace` LowCardinality(String),
    `slo_status` LowCardinality(String),
    `slo_method` LowCardinality(String),
    `slo_uri` LowCardinality(String),
    `slo_le` LowCardinality(String),
    `slo_event_ts` DateTime64(6, 'UTC'),
    `slo_orig_value` UInt64,
    `slo_value` UInt32,
    `slo_rec_num` UInt32,
    `slo_tags` Map(LowCardinality(String), LowCardinality(String)),
    `_row_hash_` UInt64 MATERIALIZED cityHash64(slo_metric, slo_service, slo_namespace, slo_status, slo_method, slo_uri, slo_event_ts, slo_value, slo_rec_num, slo_tags),
    `__insert_ts` DateTime64(6, 'UTC') DEFAULT now64(6, 'UTC')
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}')
ORDER BY (slo_metric, slo_service, slo_namespace, slo_event_ts)
SETTINGS index_granularity = 8192
$ kubectl exec -n gidplatform-dev chi-gid-gid-0-0-0 -- clickhouse-client -q "SHOW CREATE TABLE _raw.victoriametrics__slo__metrics__airflow_hour_agg_old FORMAT Vertical"
Received exception from server (version 24.2.1):
Code: 390. DB::Exception: Received from localhost:9000. DB::Exception: Table `victoriametrics__slo__metrics__airflow_hour_agg_old` doesn't exist. (CANNOT_GET_CREATE_TABLE_QUERY)
(query: SHOW CREATE TABLE _raw.victoriametrics__slo__metrics__airflow_hour_agg_old FORMAT Vertical)
command terminated with exit code 134

The error looks reasonable: we got the error on table creation after deleting the STS and PVC, didn't we?
Maybe some info is left over in ZooKeeper?

@Slach
Collaborator

Slach commented Nov 3, 2024

Add SETTINGS show_table_uuid_in_table_create_query_if_not_nil=1:

kubectl exec -n gidplatform-dev chi-gid-gid-0-1-0 -- clickhouse-client -q "SHOW CREATE TABLE _raw.victoriametrics__slo__metrics__airflow_hour_agg_old SETTINGS show_table_uuid_in_table_create_query_if_not_nil=1 FORMAT Vertical"

@Hubbitus
Author

Hubbitus commented Nov 3, 2024

$ kubectl exec -n gidplatform-dev chi-gid-gid-0-1-0 -- clickhouse-client -q "SHOW CREATE TABLE _raw.victoriametrics__slo__metrics__airflow_hour_agg_old FORMAT Vertical SETTINGS show_table_uuid_in_table_create_query_if_not_nil=1"
Row 1:
──────
statement: CREATE TABLE _raw.victoriametrics__slo__metrics__airflow_hour_agg_old UUID '46a1c218-9274-446d-9300-e644bbd4cc0e'
(
    `slo_metric` LowCardinality(String),
    `slo_service` LowCardinality(String),
    `slo_namespace` LowCardinality(String),
    `slo_status` LowCardinality(String),
    `slo_method` LowCardinality(String),
    `slo_uri` LowCardinality(String),
    `slo_le` LowCardinality(String),
    `slo_event_ts` DateTime64(6, 'UTC'),
    `slo_orig_value` UInt64,
    `slo_value` UInt32,
    `slo_rec_num` UInt32,
    `slo_tags` Map(LowCardinality(String), LowCardinality(String)),
    `_row_hash_` UInt64 MATERIALIZED cityHash64(slo_metric, slo_service, slo_namespace, slo_status, slo_method, slo_uri, slo_event_ts, slo_value, slo_rec_num, slo_tags),
    `__insert_ts` DateTime64(6, 'UTC') DEFAULT now64(6, 'UTC')
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}')
ORDER BY (slo_metric, slo_service, slo_namespace, slo_event_ts)
SETTINGS index_granularity = 8192

@Slach
Collaborator

Slach commented Nov 3, 2024

Try to restore the table:

CREATE TABLE IF NOT EXISTS _raw.victoriametrics__slo__metrics__airflow_hour_agg_old UUID '46a1c218-9274-446d-9300-e644bbd4cc0e' ON CLUSTER '{cluster}'
(
    `slo_metric` LowCardinality(String),
    `slo_service` LowCardinality(String),
    `slo_namespace` LowCardinality(String),
    `slo_status` LowCardinality(String),
    `slo_method` LowCardinality(String),
    `slo_uri` LowCardinality(String),
    `slo_le` LowCardinality(String),
    `slo_event_ts` DateTime64(6, 'UTC'),
    `slo_orig_value` UInt64,
    `slo_value` UInt32,
    `slo_rec_num` UInt32,
    `slo_tags` Map(LowCardinality(String), LowCardinality(String)),
    `_row_hash_` UInt64 MATERIALIZED cityHash64(slo_metric, slo_service, slo_namespace, slo_status, slo_method, slo_uri, slo_event_ts, slo_value, slo_rec_num, slo_tags),
    `__insert_ts` DateTime64(6, 'UTC') DEFAULT now64(6, 'UTC')
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}')
ORDER BY (slo_metric, slo_service, slo_namespace, slo_event_ts)
SETTINGS index_granularity = 8192
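
If the CREATE succeeds on the re-created node, a quick way to confirm that the new replica registered in ZooKeeper is to list the replicas node of this table (a sketch; the path assumes the {shard} macro resolves to 0, as it does later in this thread):

kubectl exec -n gidplatform-dev chi-gid-gid-0-0-0 -- clickhouse-client -q "SELECT name FROM system.zookeeper WHERE path='/clickhouse/tables/46a1c218-9274-446d-9300-e644bbd4cc0e/0/replicas'"

Both chi-gid-gid-0-0 and chi-gid-gid-0-1 should be listed.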

@Hubbitus
Copy link
Author

Hubbitus commented Nov 3, 2024

@Slach, why was it not restored automatically? Actually, this table can simply be deleted. Is that the solution?

@Slach
Copy link
Collaborator

Slach commented Nov 4, 2024

It tried to restore, and most of the tables were restored (am I right?)
But it looks like you applied some mutations or something else, and a few tables have a SQL structure (which the operator takes from system.tables) that differs from the current content of the zookeeper /clickhouse/tables/{uuid}/{shard}/replicas/{replica}/columns key.
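
A sketch of how to eyeball that comparison manually for one table (the UUID is taken from the SHOW CREATE output above; that the structure sits in the columns znode under the replica path is an assumption):

# the structure as seen by the server on the healthy replica
kubectl exec -n gidplatform-dev chi-gid-gid-0-1-0 -- clickhouse-client -q "SELECT create_table_query FROM system.tables WHERE uuid='46a1c218-9274-446d-9300-e644bbd4cc0e' FORMAT Vertical"

# the structure as stored in zookeeper for the same replica
kubectl exec -n gidplatform-dev chi-gid-gid-0-1-0 -- clickhouse-client -q "SELECT value FROM system.zookeeper WHERE path='/clickhouse/tables/46a1c218-9274-446d-9300-e644bbd4cc0e/0/replicas/chi-gid-gid-0-1' AND name='columns' FORMAT Vertical"

If the two column lists disagree, that would explain why the operator could not re-create this particular table.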

@Slach
Copy link
Collaborator

Slach commented Nov 4, 2024

try to upgrade clickhouse-keeper to latest and clickhouse-server to 24.8 latest LTS

@Hubbitus
Copy link
Author

Hubbitus commented Nov 4, 2024

It tried to restore, and most of the tables were restored (am I right?)

No.

Query

SELECT database, table, engine_full, count() c, hostname()
FROM
	cluster('{cluster}',system.tables)
WHERE
	database NOT IN ('system','INFORMATION_SCHEMA','information_schema')
GROUP BY ALL
HAVING c<2

Still returns 729 tables present only on host chi-gid-gid-0-1-0.

try to upgrade clickhouse-keeper to latest and clickhouse-server to 24.8 latest LTS

That is not so fast; we will try to do it next week.
But we use zookeeper, not clickhouse-keeper. Is it important to switch (could that be the problem)?

@Slach
Copy link
Collaborator

Slach commented Nov 4, 2024

ok. let's try again

# safe shutdown 
kubectl exec -n gidplatform-dev chi-gid-gid-0-0-0 -- clickhouse-client -q "SYSTEM SHUTDOWN"

# different node
kubectl exec -n gidplatform-dev chi-gid-gid-0-1-0 -- clickhouse-client -q "SYSTEM DROP REPLICA 'chi-gid-gid-0-0'"

# delete sts to propagate schema during reconcile
kubectl delete sts -n gidplatform-dev chi-gid-gid-0-0

# change spec.taskID
kubectl edit chi -n gidplatform-dev gid

# wait until the reconcile completes
watch -n 1 kubectl describe chi chi -n gidplatform-dev gid

# check tables
# share clickhouse-operator logs

@Hubbitus
Copy link
Author

Hubbitus commented Nov 5, 2024

Ok, let's try!

I have updated Clickhouse to version 24.8.6.70 (LiveView was dropped because of incompatibility).

Still using zookeeper (no switch to clickhouse-keeper yet)

New attempt:

  1. kubectl exec -n gidplatform-dev chi-gid-gid-0-0-0 -- clickhouse-client -q "SYSTEM SHUTDOWN"
  2. kubectl exec -n gidplatform-dev chi-gid-gid-0-1-0 -- clickhouse-client -q "SYSTEM DROP REPLICA 'chi-gid-gid-0-0'"
    In the pod logs there are a bunch of lines like:
2024.11.05 14:40:38.743969 [ 854 ] {edd055cb-da43-4722-a28c-d394c9a4d119} <Information> InterpreterSystemQuery: Removing replica /clickhouse/tables/60542e19-3cef-4f3b-9e24-0cd98431c112/0/replicas/chi-gid-gid-0-0, marking it as lost                                      
2024.11.05 14:40:38.791925 [ 854 ] {edd055cb-da43-4722-a28c-d394c9a4d119} <Information> InterpreterSystemQuery: Removing replica /clickhouse/tables/336feb8b-b059-42e1-894d-f84691b935ce/0/replicas/chi-gid-gid-0-0, marking it as lost                                      
2024.11.05 14:40:38.836350 [ 854 ] {edd055cb-da43-4722-a28c-d394c9a4d119} <Information> InterpreterSystemQuery: Removing replica /clickhouse/tables/7a551a1f-9721-4907-affc-b5564e1ccdde/0/replicas/chi-gid-gid-0-0, marking it as lost                                      
2024.11.05 14:40:38.880698 [ 854 ] {edd055cb-da43-4722-a28c-d394c9a4d119} <Information> InterpreterSystemQuery: Removing replica /clickhouse/tables/e7a92d37-a5dc-40ee-a79d-946373b57f10/0/replicas/chi-gid-gid-0-0, marking it as lost                                      
2024.11.05 14:40:38.905697 [ 612 ] {} <Debug> StorageKafka (tracking_stage): Pushing 0.00 rows to _source.tracking_stage (586af751-597e-4bf8-a8f5-5a226f7d8a4b) took 5015 ms.                                                                                                
2024.11.05 14:40:38.906393 [ 612 ] {} <Debug> MemoryTracker: Peak memory usage: 280.37 KiB.
$ kubectl delete sts -n gidplatform-dev chi-gid-gid-0-0
statefulset.apps "chi-gid-gid-0-0" deleted

Shouldn't the PVC have been deleted as well?

  1. Committed an update with taskID: "click-reconcile-5". In ArgoCD, called sync with the prune option.

# wait until the reconcile completes
watch -n 1 kubectl describe chi chi -n gidplatform-dev gid

Sorry, what should I be waiting for in this output?

For the command kubectl describe chi chi -n gidplatform-dev gid, the last lines are:

  Error    CreateFailed            32s    clickhouse-operator  ERROR add tables added successfully on shard/host:0/0 cluster:gid err:Code: 415, Message: Table /clickhouse/tables/62324966-8fef-4f9e-908d-a990b880d20c/0 was suddenly removed
  Info     UpdateCompleted         31s    clickhouse-operator  Update ConfigMap gidplatform-dev/chi-gid-common-configd
  Info     ProgressHostsCompleted  31s    clickhouse-operator  [now: 2024-11-05 14:49:20.814770037 +0000 UTC m=+699169.611621638] ProgressHostsCompleted: 1 of 2
  Info     ReconcileCompleted      31s    clickhouse-operator  Reconcile Host completed. Host: 0-0 ClickHouse version running: 24.8.6.70
  Info     UpdateCompleted         30s    clickhouse-operator  Update ConfigMap gidplatform-dev/chi-gid-common-usersd
  Info     UpdateCompleted         28s    clickhouse-operator  Update Service success: gidplatform-dev/clickhouse-gid
  Info     ReconcileStarted        27s    clickhouse-operator  Reconcile Host start. Host: 0-1 ClickHouse version running: 24.8.6.70
  Info     UpdateCompleted         26s    clickhouse-operator  Update ConfigMap gidplatform-dev/chi-gid-common-configd
  Info     UpdateCompleted         25s    clickhouse-operator  Update ConfigMap gidplatform-dev/chi-gid-common-usersd
Error from server (NotFound): clickhouseinstallations.clickhouse.altinity.com "chi" not found

After waiting some time: I suppose I need the line Info ReconcileCompleted 17s clickhouse-operator reconcile completed successfully, task id: click-reconcile-5, don't I?

  Info     UpdateCompleted         9m59s  clickhouse-operator  Update Service success: gidplatform-dev/chi-gid-gid-0-0
  Info     DeleteCompleted         9m58s  clickhouse-operator  Drop replica host: 0-0 in cluster: gid
  Info     CreateStarted           9m58s  clickhouse-operator  Adding tables on shard/host:0/0 cluster:gid
  Error    CreateFailed            5m30s  clickhouse-operator  ERROR add tables added successfully on shard/host:0/0 cluster:gid err:Code: 415, Message: Table /clickhouse/tables/62324966-8fef-4f9e-908d-a990b880d20c/0 was suddenly removed
  Info     ReconcileCompleted      5m29s  clickhouse-operator  Reconcile Host completed. Host: 0-0 ClickHouse version running: 24.8.6.70
  Info     ProgressHostsCompleted  5m29s  clickhouse-operator  [now: 2024-11-05 14:49:20.814770037 +0000 UTC m=+699169.611621638] ProgressHostsCompleted: 1 of 2
  Info     UpdateCompleted         5m29s  clickhouse-operator  Update ConfigMap gidplatform-dev/chi-gid-common-configd
  Info     UpdateCompleted         5m28s  clickhouse-operator  Update ConfigMap gidplatform-dev/chi-gid-common-usersd
  Info     UpdateCompleted         5m26s  clickhouse-operator  Update Service success: gidplatform-dev/clickhouse-gid
  Info     ReconcileStarted        5m25s  clickhouse-operator  Reconcile Host start. Host: 0-1 ClickHouse version running: 24.8.6.70
  Info     UpdateCompleted         5m24s  clickhouse-operator  Update ConfigMap gidplatform-dev/chi-gid-common-configd
  Info     UpdateCompleted         5m23s  clickhouse-operator  Update ConfigMap gidplatform-dev/chi-gid-common-usersd
  Info     UpdateCompleted         92s    clickhouse-operator  Update ConfigMap gidplatform-dev/chi-gid-deploy-confd-gid-0-1
  Info     UpdateCompleted         91s    clickhouse-operator  Update ConfigMap gidplatform-dev/chi-gid-common-configd
  Info     UpdateCompleted         91s    clickhouse-operator  Update Service success: gidplatform-dev/chi-gid-gid-0-1
  Info     ReconcileCompleted      89s    clickhouse-operator  Reconcile Host completed. Host: 0-1 ClickHouse version running: 24.8.6.70
  Info     UpdateCompleted         88s    clickhouse-operator  Update ConfigMap gidplatform-dev/chi-gid-common-usersd
  Info     ProgressHostsCompleted  88s    clickhouse-operator  [now: 2024-11-05 14:53:21.400360761 +0000 UTC m=+699410.197212365] ProgressHostsCompleted: 2 of 2
  Info     UpdateCompleted         86s    clickhouse-operator  Update ConfigMap gidplatform-dev/chi-gid-common-configd
  Info     ReconcileInProgress     85s    clickhouse-operator  remove items scheduled for deletion
  Info     ReconcileInProgress     84s    clickhouse-operator  add CHI to monitoring
  Info     UpdateCompleted         83s    clickhouse-operator  Update ConfigMap gidplatform-dev/chi-gid-common-usersd
  Info     ReconcileCompleted      82s    clickhouse-operator  reconcile completed successfully, task id: click-reconcile-5
Error from server (NotFound): clickhouseinstallations.clickhouse.altinity.com "chi" not found
  1. Looks like the same result: 727 tables present only on the single host chi-gid-gid-0-1-0.
  2. Full log:
    operator.2024-11-05T17:56:30+03:00.obfuscated.log

I see again a bunch of authorization errors in logs, but do not want to make any assumptions.

@Slach
Copy link
Collaborator

Slach commented Nov 6, 2024

kubectl delete sts -n gidplatform-dev chi-gid-gid-0-0
statefulset.apps "chi-gid-gid-0-0" deleted
Shouldn't the PVC have been deleted as well?

I think we need to delete the PVC+PV for the dead 0-0-0, to completely erase its data.

I found in the logs:
storageclassname: openebs-hostpath-dataplatform
It means the clickhouse data will be lost if your pod is re-scheduled to a different worker node,
or if the worker node is reinstalled from scratch.
Did you have any operations similar to those described above?
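
To check whether the hostpath PV pins the data to a single worker node, something like this could be used (a sketch; <pv-name> is the PV bound to default-volume-claim-chi-gid-gid-0-0-0):

# openebs local hostpath PVs carry a node affinity, so the data only exists on that worker
kubectl get pv <pv-name> -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms[*].matchExpressions[*].values}'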

Let's make another attempt

# safe shutdown 
kubectl exec -n gidplatform-dev chi-gid-gid-0-0-0 -- clickhouse-client -q "SYSTEM SHUTDOWN"

# different node
kubectl exec -n gidplatform-dev chi-gid-gid-0-1-0 -- clickhouse-client -q "SYSTEM DROP REPLICA 'chi-gid-gid-0-0'"

# delete sts+PV+PVC to propagate schema during reconcile
kubectl delete sts -n gidplatform-dev chi-gid-gid-0-0
kubectl delete pvc -n gidplatform-dev -l clickhouse.altinity.com/chi=gid,clickhouse.altinity.com/replica=0,clickhouse.altinity.com/shard=0

kubectl get pv | grep 0-0-0
kubectl delete pv <name of pv from previous step>


# change spec.taskID
kubectl edit chi -n gidplatform-dev gid

# wait until the reconcile is completed and look at the events, watch for Adding tables
watch -n 1 bash -c "kubectl describe chi -n gidplatform-dev gid | grep -i table"

@Hubbitus
Copy link
Author

Hubbitus commented Nov 6, 2024

It means the clickhouse data will be lost if your pod is re-scheduled to a different worker node, or if the worker node is reinstalled from scratch. Did you have any operations similar to those described above?

Yes, we use openebs and a taint on the node. Openebs is one of the fastest storage classes. And I hope there was no pod migration to another node.

Let's make another attempt

Let's do it:

$ kubectl exec -n gidplatform-dev chi-gid-gid-0-0-0 -- clickhouse-client -q "SYSTEM SHUTDOWN"

$ kubectl exec -n gidplatform-dev chi-gid-gid-0-1-0 -- clickhouse-client -q "SYSTEM DROP REPLICA 'chi-gid-gid-0-0'"
Received exception from server (version 24.8.6):
Code: 305. DB::Exception: Received from localhost:9000. DB::Exception: Can't drop replica: chi-gid-gid-0-0, because it's active. (TABLE_WAS_NOT_DROPPED)
(query: SYSTEM DROP REPLICA 'chi-gid-gid-0-0')
command terminated with exit code 49

I've tried the last command several times with the same result.
Stopped here.

@Slach
Copy link
Collaborator

Slach commented Nov 6, 2024

ok, let's change the action sequence

# safe shutdown 
kubectl exec -n gidplatform-dev chi-gid-gid-0-0-0 -- clickhouse-client -q "SYSTEM SHUTDOWN"
sleep 0.5
# delete sts+PV+PVC to propogate schema during reconcile
kubectl delete sts -n gidplatform-dev chi-gid-gid-0-0
kubectl delete pvc -n gidplatform-dev -l clickhouse.altinity.com/chi=gid,clickhouse.altinity.com/replica=0,clickhouse.altinity.com/shard=0

kubectl get pv | grep 0-0-0
kubectl delete pv <name of pv from previous step>


# clean ZK from different node
kubectl exec -n gidplatform-dev chi-gid-gid-0-1-0 -- clickhouse-client -q "SYSTEM DROP REPLICA 'chi-gid-gid-0-0'"

# change spec.taskID
kubectl edit chi -n gidplatform-dev gid

# wait until the reconcile is completed and look at the events, watch for Adding tables
watch -n 1 bash -c "kubectl describe chi -n gidplatform-dev gid | grep -i table"
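
If SYSTEM DROP REPLICA again complains that the replica is active, the old pod's ZooKeeper session has probably not expired yet. A sketch to check for the ephemeral is_active node (the UUID is taken from the error above; any replicated table of that shard would do):

kubectl exec -n gidplatform-dev chi-gid-gid-0-1-0 -- clickhouse-client -q "SELECT name FROM system.zookeeper WHERE path='/clickhouse/tables/62324966-8fef-4f9e-908d-a990b880d20c/0/replicas/chi-gid-gid-0-0' AND name='is_active'"

An empty result means the session is gone and the DROP REPLICA can be retried.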

@Hubbitus
Copy link
Author

Hubbitus commented Nov 6, 2024

+ kubectl exec -n gidplatform-dev chi-gid-gid-0-0-0 -- clickhouse-client -q 'SYSTEM SHUTDOWN'
+ read -p 'Press Enter' T
Press Enter
+ kubectl delete sts -n gidplatform-dev chi-gid-gid-0-0
statefulset.apps "chi-gid-gid-0-0" deleted
+ read -p 'Press Enter' T
Press Enter
+ kubectl delete pvc -n gidplatform-dev -l clickhouse.altinity.com/chi=gid,clickhouse.altinity.com/replica=0,clickhouse.altinity.com/shard=0
persistentvolumeclaim "default-volume-claim-chi-gid-gid-0-0-0" deleted
+ read -p 'Press Enter' T
Press Enter
+ kubectl -n gidplatform-dev delete pv pvc-b76f4aaf-c323-4c74-bbc5-682bdb35e607
Warning: deleting cluster-scoped resources, not scoped to the provided namespace
persistentvolume "pvc-b76f4aaf-c323-4c74-bbc5-682bdb35e607" deleted
+ read -p 'Press Enter' T
Press Enter
+ kubectl exec -n gidplatform-dev chi-gid-gid-0-1-0 -- clickhouse-client -q 'SYSTEM DROP REPLICA '\''chi-gid-gid-0-0'\'''
Received exception from server (version 24.8.6):
Code: 305. DB::Exception: Received from localhost:9000. DB::Exception: Can't drop replica: chi-gid-gid-0-0, because it's active. (TABLE_WAS_NOT_DROPPED)
(query: SYSTEM DROP REPLICA 'chi-gid-gid-0-0')
command terminated with exit code 49

@Hubbitus
Copy link
Author

Hubbitus commented Nov 6, 2024

Another try of kubectl exec -n gidplatform-dev chi-gid-gid-0-1-0 -- clickhouse-client -q "SYSTEM DROP REPLICA 'chi-gid-gid-0-0'" succeeded.

watch -n 1 -x bash -c "kubectl describe chi -n gidplatform-dev gid | grep -i table --color"

The last line that appeared:

  Info     CreateStarted           3m8s   clickhouse-operator  Adding tables on shard/host:0/0 cluster:gid

Then:

  Error    CreateFailed            21s    clickhouse-operator  ERROR add tables added successfully on shard/host:0/0 cluster:gid err:Code: 415, Message: Table /clickhouse/tables/62324966-8fef-4f9e-908d-a990b880d20c/0 was suddenly removed

Reconcile done: I1106 15:02:21.382386 1 worker.go:469] worker.go:432:updateCHI():end:gidplatform-dev/gid/click-reconcile-6

Now 730 tables exist only on host chi-gid-gid-0-1-0 :(

Full log:
operator.2024-11-06T18:04:12+03:00.obfuscated.log

@Slach
Copy link
Collaborator

Slach commented Nov 6, 2024

kubectl exec -n gidplatform-dev chi-gid-gid-0-1-0 -- clickhouse-client -q "SELECT * FROM system.tables WHERE uuid='62324966-8fef-4f9e-908d-a990b880d20c' FORMAT Vertical"

@Hubbitus
Copy link
Author

Hubbitus commented Nov 6, 2024

$ kubectl exec -n gidplatform-dev chi-gid-gid-0-1-0 -- clickhouse-client -q "SELECT * FROM system.tables WHERE uuid='62324966-8fef-4f9e-908d-a990b880d20c' FORMAT Vertical"
Row 1:
──────
database:                      _source
name:                          victoriametrics__slo__metrics_old
uuid:                          62324966-8fef-4f9e-908d-a990b880d20c
engine:                        ReplicatedReplacingMergeTree
is_temporary:                  0
data_paths:                    ['/var/lib/clickhouse/store/623/62324966-8fef-4f9e-908d-a990b880d20c/']
metadata_path:                 /var/lib/clickhouse/store/783/7834b023-e34a-4087-bba9-98c168870af6/victoriametrics__slo__metrics_old.sql
metadata_modification_time:    2024-09-10 08:22:16
metadata_version:              0
dependencies_database:         []
dependencies_table:            []
create_table_query:            CREATE TABLE _source.victoriametrics__slo__metrics_old (`metric` LowCardinality(String), `tags` Map(LowCardinality(String), String), `ts` UInt64, `dt` Nullable(DateTime64(6, 'UTC')) MATERIALIZED toDateTime64(ts, 6, 'UTC'), `value` Nullable(String), `_row_hash_` UInt64 MATERIALIZED cityHash64((metric, tags, ts, value)), `_insert_dt_` DateTime64(6, 'UTC') DEFAULT now64(6, 'UTC') ENGINE = ReplicatedReplacingMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}') PRIMARY KEY (_row_hash_, ts) ORDER BY (_row_hash_, ts) SETTINGS index_granularity = 8192
engine_full:                   ReplicatedReplacingMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}') PRIMARY KEY (_row_hash_, ts) ORDER BY (_row_hash_, ts) SETTINGS index_granularity = 8192
as_select:                     
partition_key:                 
sorting_key:                   _row_hash_, ts
primary_key:                   _row_hash_, ts
sampling_key:                  
storage_policy:                default
total_rows:                    .....
total_bytes:                   .....
total_bytes_uncompressed:      .....
parts:                         6
active_parts:                  6
total_marks:                   13301
lifetime_rows:                 ᴺᵁᴸᴸ
lifetime_bytes:                ᴺᵁᴸᴸ
has_own_data:                  1
loading_dependencies_database: []
loading_dependencies_table:    []
loading_dependent_database:    []
loading_dependent_table:       []

Comments stripped.

I can drop that table if it helps. But what is the reason it is not restored?

@Slach
Copy link
Collaborator

Slach commented Nov 6, 2024

kubectl exec -n gidplatform-dev chi-gid-gid-0-1-0 -- clickhouse-client -q "SELECT * FROM system.zookeeper WHERE path='/clickhouse/tables/62324966-8fef-4f9e-908d-a990b880d20c/{shard}/replicas' FORMAT Vertical"

@Hubbitus
Copy link
Author

Hubbitus commented Nov 6, 2024

$ kubectl exec -n gidplatform-dev chi-gid-gid-0-1-0 -- clickhouse-client -q "SELECT * FROM system.zookeeper WHERE path='/clickhouse/tables/62324966-8fef-4f9e-908d-a990b880d20c/{shard}/replicas' FORMAT Vertical"

Returns nothing

@Slach
Copy link
Collaborator

Slach commented Nov 6, 2024

try replacing {shard} with '0'

kubectl exec -n gidplatform-dev chi-gid-gid-0-1-0 -- clickhouse-client -q "SELECT * FROM system.zookeeper WHERE path='/clickhouse/tables/62324966-8fef-4f9e-908d-a990b880d20c/0/replicas' FORMAT Vertical"

@Hubbitus
Copy link
Author

Hubbitus commented Nov 7, 2024

$ kubectl exec -n gidplatform-dev chi-gid-gid-0-1-0 -- clickhouse-client -q "SELECT * FROM system.zookeeper WHERE path='/clickhouse/tables/62324966-8fef-4f9e-908d-a990b880d20c/0/replicas' FORMAT Vertical"
Row 1:
──────
name:  chi-gid-gid-0-1
value: 
path:  /clickhouse/tables/62324966-8fef-4f9e-908d-a990b880d20c/0/replicas

And for 1 still empty:

$ kubectl exec -n gidplatform-dev chi-gid-gid-0-1-0 -- clickhouse-client -q "SELECT * FROM system.zookeeper WHERE path='/clickhouse/tables/62324966-8fef-4f9e-908d-a990b880d20c/1/replicas' FORMAT Vertical"

@Slach
Copy link
Collaborator

Slach commented Nov 7, 2024

/1/ is expected to be empty because {shard} resolves to 0;
you have two replicas in one shard.

This is really weird: why did you get
Code: 415, Message: Table /clickhouse/tables/62324966-8fef-4f9e-908d-a990b880d20c/0 was suddenly removed

ok, let's try to create the schema on 0-0-0 completely manually

Create databases

kubectl exec -n gidplatform-dev chi-gid-gid-0-0-0 -- bash -c "clickhouse-client -q \"SELECT DISTINCT 'CREATE DATABASE IF NOT EXISTS \"' || name || '\" Engine = ' || engine_full || ';' AS create_db_query FROM cluster('all-sharded', system.databases) databases WHERE name NOT IN ('system', 'information_schema', 'INFORMATION_SCHEMA') FORMAT TSVRaw\" | clickhouse-client -mn --echo"

Create MergeTree tables

kubectl exec -n gidplatform-dev chi-gid-gid-0-0-0 -- bash -c "clickhouse-client -q \"SELECT DISTINCT replaceRegexpOne(create_table_query, 'CREATE (TABLE|VIEW|MATERIALIZED VIEW|DICTIONARY|LIVE VIEW|WINDOW VIEW)', 'CREATE \\1 IF NOT EXISTS') || ';' AS q FROM cluster('all-sharded', system.tables) WHERE database NOT IN ('system', 'information_schema', 'INFORMATION_SCHEMA') AND create_table_query != '' AND name NOT LIKE '.inner.%' AND name NOT LIKE '.inner_id.%' AND engine LIKE '%MergeTree%' SETTINGS show_table_uuid_in_table_create_query_if_not_nil=1 FORMAT TSVRaw\" | clickhouse-client -mn --echo"

Create Distributed tables

kubectl exec -n gidplatform-dev chi-gid-gid-0-0-0 -- bash -c "clickhouse-client -q \"SELECT DISTINCT replaceRegexpOne(create_table_query, 'CREATE (TABLE|VIEW|MATERIALIZED VIEW|DICTIONARY|LIVE VIEW|WINDOW VIEW)', 'CREATE \\1 IF NOT EXISTS') || ';' AS q FROM cluster('all-sharded', system.tables) WHERE database NOT IN ('system', 'information_schema', 'INFORMATION_SCHEMA') AND create_table_query != '' AND name NOT LIKE '.inner.%' AND name NOT LIKE '.inner_id.%' AND engine LIKE '%Distributed%' SETTINGS show_table_uuid_in_table_create_query_if_not_nil=1 FORMAT TSVRaw\" | clickhouse-client -mn --echo"

Create Other tables, MV, Dictionaries

kubectl exec -n gidplatform-dev chi-gid-gid-0-0-0 -- bash -c "clickhouse-client -q \"SELECT DISTINCT replaceRegexpOne(create_table_query, 'CREATE (TABLE|VIEW|MATERIALIZED VIEW|DICTIONARY|LIVE VIEW|WINDOW VIEW)', 'CREATE \\1 IF NOT EXISTS') || ';' AS q FROM cluster('all-sharded', system.tables) WHERE database NOT IN ('system', 'information_schema', 'INFORMATION_SCHEMA') AND create_table_query != '' AND name NOT LIKE '.inner.%' AND name NOT LIKE '.inner_id.%' AND engine NOT LIKE '%Distributed%' AND engine NOT LIKE '%MergeTree%' SETTINGS show_table_uuid_in_table_create_query_if_not_nil=1 FORMAT TSVRaw\" | clickhouse-client -mn --echo"

@Hubbitus
Copy link
Author

Hubbitus commented Nov 7, 2024

@Slach,
I've tried first:

+ kubectl exec -n gidplatform-dev chi-gid-gid-0-0-0 -- bash -c 'clickhouse-client -q "SELECT DISTINCT '\''CREATE DATABASE IF NOT EXISTS "'\'' || name || '\''" Engine = '\'' || engine_full || '\'';'\'' AS create_db_query FROM cluster('\''all-sharded'\'', system.databases) databases WHERE name NOT IN ('\''system'\'', '\''information_schema'\'', '\''INFORMATION_SCHEMA'\'') FORMAT TSVRaw" | clickhouse-client -mn --echo'
Code: 62. DB::Exception: Syntax error: failed at position 32 ('||'): || name ||  Engine = Atomic;
. Expected identifier. (SYNTAX_ERROR)

command terminated with exit code 62

My attempt to fix:

$ kubectl exec -n gidplatform-dev chi-gid-gid-0-0-0 -- bash -c "clickhouse-client -q \"SELECT DISTINCT 'CREATE DATABASE IF NOT EXISTS \"' || name || '\" Engine = ' || engine_full || ';' AS create_db_query FROM cluster('all-sharded', system.databases) databases WHERE name NOT IN ('system', 'information_schema', 'INFORMATION_SCHEMA') FORMAT TSVRaw\" "
CREATE DATABASE IF NOT EXISTS  || name ||  Engine = Atomic;

Correcting the quotes gives:

$ kubectl exec -n gidplatform-dev chi-gid-gid-0-0-0 -- bash -c "clickhouse-client -q \"SELECT DISTINCT 'CREATE DATABASE IF NOT EXISTS ' || name || ' Engine = ' || engine_full || ';' AS create_db_query FROM cluster('all-sharded', system.databases) databases WHERE name NOT IN ('system', 'information_schema', 'INFORMATION_SCHEMA') FORMAT TSVRaw\" "
CREATE DATABASE IF NOT EXISTS _loopback Engine = Atomic;
CREATE DATABASE IF NOT EXISTS _raw Engine = Atomic;
CREATE DATABASE IF NOT EXISTS _service_ Engine = Atomic;
CREATE DATABASE IF NOT EXISTS _source Engine = Atomic;
CREATE DATABASE IF NOT EXISTS _temp_ Engine = Atomic;
CREATE DATABASE IF NOT EXISTS _temporaldb Engine = Atomic;
CREATE DATABASE IF NOT EXISTS cdc Engine = Atomic;
CREATE DATABASE IF NOT EXISTS datamart Engine = Atomic;
CREATE DATABASE IF NOT EXISTS default Engine = Atomic;
CREATE DATABASE IF NOT EXISTS sandbox Engine = Atomic;
CREATE DATABASE IF NOT EXISTS tmp_incr Engine = Atomic;

That is without the last part | clickhouse-client -mn --echo.

But what is the reason for doing so on the same node 0? Dumping from it and applying back to it means the IF NOT EXISTS clause will do nothing.

Is it intended to dump the structure from node 1 (working) and apply it to node 0 (broken)?

@Slach
Copy link
Collaborator

Slach commented Nov 7, 2024

Is it intended to dump the structure from node 1 (working) and apply it to node 0 (broken)?

yes

@Slach
Copy link
Collaborator

Slach commented Nov 7, 2024

let's change the approach:
run a shell in 0-0-0

kubectl exec -n gidplatform-dev chi-gid-gid-0-0-0 -- bash

Databases

clickhouse-client -q "SELECT DISTINCT 'CREATE DATABASE IF NOT EXISTS \"' || name || '\" Engine = ' || engine_full || ';' AS create_db_query FROM cluster('all-sharded', system.databases) databases WHERE name NOT IN ('system', 'information_schema', 'INFORMATION_SCHEMA') FORMAT TSVRaw" | clickhouse-client -mn --echo

Create MergeTree tables

clickhouse-client -q "SELECT DISTINCT replaceRegexpOne(create_table_query, 'CREATE (TABLE|VIEW|MATERIALIZED VIEW|DICTIONARY|LIVE VIEW|WINDOW VIEW)', 'CREATE \\1 IF NOT EXISTS') || ';' AS q FROM cluster('all-sharded', system.tables) WHERE database NOT IN ('system', 'information_schema', 'INFORMATION_SCHEMA') AND create_table_query != '' AND name NOT LIKE '.inner.%' AND name NOT LIKE '.inner_id.%' AND engine LIKE '%MergeTree%' SETTINGS show_table_uuid_in_table_create_query_if_not_nil=1 FORMAT TSVRaw" | clickhouse-client -mn --echo

Create Distributed tables

clickhouse-client -q "SELECT DISTINCT replaceRegexpOne(create_table_query, 'CREATE (TABLE|VIEW|MATERIALIZED VIEW|DICTIONARY|LIVE VIEW|WINDOW VIEW)', 'CREATE \\1 IF NOT EXISTS') || ';' AS q FROM cluster('all-sharded', system.tables) WHERE database NOT IN ('system', 'information_schema', 'INFORMATION_SCHEMA') AND create_table_query != '' AND name NOT LIKE '.inner.%' AND name NOT LIKE '.inner_id.%' AND engine LIKE '%Distributed%' SETTINGS show_table_uuid_in_table_create_query_if_not_nil=1 FORMAT TSVRaw" | clickhouse-client -mn --echo

Create Other tables, MV, Dictionaries

clickhouse-client -q "SELECT DISTINCT replaceRegexpOne(create_table_query, 'CREATE (TABLE|VIEW|MATERIALIZED VIEW|DICTIONARY|LIVE VIEW|WINDOW VIEW)', 'CREATE \\1 IF NOT EXISTS') || ';' AS q FROM cluster('all-sharded', system.tables) WHERE database NOT IN ('system', 'information_schema', 'INFORMATION_SCHEMA') AND create_table_query != '' AND name NOT LIKE '.inner.%' AND name NOT LIKE '.inner_id.%' AND engine NOT LIKE '%Distributed%' AND engine NOT LIKE '%MergeTree%' SETTINGS show_table_uuid_in_table_create_query_if_not_nil=1 FORMAT TSVRaw" | clickhouse-client -mn --echo

@Hubbitus
Copy link
Author

Hubbitus commented Nov 7, 2024

@Slach,

Is it intended to dump the structure from node 1 (working) and apply it to node 0 (broken)?

yes

Then it probably should be like:

kubectl exec -n gidplatform-dev chi-gid-gid-0-1-0 -- bash -c "clickhouse-client -q \"SELECT DISTINCT 'CREATE DATABASE IF NOT EXISTS ' || name || ' Engine = ' || engine_full || ';' AS create_db_query FROM cluster('all-sharded', system.databases) databases WHERE name NOT IN ('system', 'information_schema', 'INFORMATION_SCHEMA') FORMAT TSVRaw\" " \
        | kubectl exec -i -n gidplatform-dev chi-gid-gid-0-0-0 -- bash -c "clickhouse-client -mn --echo"

But still, without a DROP DATABASE before the CREATE DATABASE it will do nothing, because, as we saw in the previous comment, the databases are already present.

But what is the purpose of such operations?
Shouldn't this be done automatically by the operator? I hope it should be.
Or do we just want to confirm that the manual way of copying objects works?

let's change the approach: run a shell in 0-0-0

kubectl exec -n gidplatform-dev chi-gid-gid-0-0-0 -- bash

Databases

clickhouse-client -q "SELECT DISTINCT 'CREATE DATABASE IF NOT EXISTS \"' || name || '\" Engine = ' || engine_full || ';' AS create_db_query FROM cluster('all-sharded', system.databases) databases WHERE name NOT IN ('system', 'information_schema', 'INFORMATION_SCHEMA') FORMAT TSVRaw" | clickhouse-client -mn --echo

That will still be executed entirely within the single node 0.

@Slach
Copy link
Collaborator

Slach commented Nov 8, 2024

Then it probably should be like:

No, I provided the command with the pipeline inside bash -c.
The last message #1455 (comment)
described commands that just need to be executed in an interactive shell.

Or do we just want to confirm that the manual way of copying objects works?

Yes, I'm looking into the clickhouse-operator source code and trying to execute the schemer.go commands manually.

@Hubbitus
Copy link
Author

Hubbitus commented Nov 8, 2024

No, I provided the command with the pipeline inside bash -c.
The last message #1455 (comment)
described commands that just need to be executed in an interactive shell.

Yes, they are piped, but inside an interactive shell on node 0. There are no cross-node connections.

@Slach
Copy link
Collaborator

Slach commented Nov 8, 2024

Look at the FROM cluster('all-sharded', system.tables) clause.
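
The all-sharded cluster is generated by the operator and places every replica into its own shard, so a query over cluster('all-sharded', system.tables) executed on 0-0-0 also reads the metadata from 0-1-0. A quick sanity check (a sketch):

kubectl exec -n gidplatform-dev chi-gid-gid-0-0-0 -- clickhouse-client -q "SELECT hostname() AS node, count() FROM cluster('all-sharded', system.tables) WHERE database NOT IN ('system','information_schema','INFORMATION_SCHEMA') GROUP BY node"

If both hostnames show up, the generated CREATE statements do include the definitions from the healthy replica.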
