Provide the ability to add custom health checks that precede and follow ClickHouse reconfiguration/upgrade #1477

hodgesrm · 2024-08-06T18:18:47Z

We had a recent discussion with a user who has the following use case. They have an operator that watches for replication lag on ClickHouse and only schedules upgrade when it's below a certain level. They were wondering if the operator could help them implement logic that amount to "guard conditions" on operator upgrades. It could work as follows:

Permit a ClickHouse replica upgrade operation to start when replication lag is below a "reasonable" level.
Confirm that the replication lag is back to a reasonable level before moving to the next replica.

In this case the guard condition is replica lag but it's likely that there would be others. They and other users would presumably want to customize the conditions.

It seems as if this could be implemented in a general way by adding a custom health check section to CHI instances that contain one or more SQL queries to run before and after upgrade. If the check(s) pass, the upgrade could proceed. If they fail the operator would wait.

There are some questions with this kind of guard that we need to contemplate.

How would other services learn the status of the cluster health checks?
What happens if replica lag increases suddenly during an upgrade? It could block an incomplete operation indefinitely.
Does it just apply to upgrade or to any reconfiguration that causes a pod restart?
Should we report these health check results in our exported metrics? It would be important to see these visually in, say, Grafana when kicking off an upgrade or diagnosing problems.

sunsingerus · 2024-08-07T07:54:15Z

This is a very interesting topic that requires detailed discussion. However, these are my brief thoughts on the matter.

How would other services learn the status of the cluster health checks?

"Basic or k8s" health checks are run by k8s and clickhouse-operator does not know results of the health checks on-the-fly. These statuses can be fetched from each Pod, though, so "other services" can monitor Pods directly. List of clickhouse pods is provided by the clickhouse-operator in CHI.Status section
Some kind of "custom health checks" explicitly described in CHI manifests may be introduced, but this requires a good amount of design. Main questions to answer should be like "should SQL-only checks be allowed or custom shell commands as well" and "how to expose results of custom checks". As to results exposure, this can be done, for example, via REST endpoint, along with metrics being already exported by the operator. Speaking about metrics arises the question "do we need additional explicit health checks or metrics can be enhanced to serve this way?"

What happens if replica lag increases suddenly during an upgrade? It could block an incomplete operation indefinitely.

Clickhouse-operator has timeouts for host reconcile processes. Some additional or "general" timeout(s) should be introduced, if we are talking about waiting for external permission to reconcile a host.

Does it just apply to upgrade or to any reconfiguration that causes a pod restart?

There is no such a special case as "upgrade" at the moment, operator performs general reconcile. I'd rather avoid narrow cases and prefer to go with more generic approach. Introducing some kind of "external hooks" may be a proper way to go. Details are to be specified additionally.

Should we report these health check results in our exported metrics? It would be important to see these visually in, say, Grafana when kicking off an upgrade or diagnosing problems.

This question is the good starting point to think on how:

To enhance metrics functionality and introduce "custom metrics" that can be specified in CHI manifest and provide flexible mechanism to run complex tasks on the clickhouse host (having in mind questions like: SQL-only or to allow shell scripts as well)
Provide hooks (general external or to custom metrics only) that will control the operator's behavior during reconcile. Say to block host reconcile until specified metrics allows to proceed.

Anyway, this is a big topic to discuss and to plan, which we can start to.

ondrej-smola · 2024-08-07T11:03:22Z

Just a quick idea - option to launch k8s job from user provided template and wait for its completion before starting ch pod reconciliation (or some condition). We might augment this job with well-known env variables (such as pod to be reconciled, all cluster pods). Important advantage could be that we will move all the configuration (secrets, job.podFailurePolicies and others) to k8s and have build-in way to watch/debug it. Job could even push metrics/logs to collector during its runtime.

com6056 · 2024-08-19T22:44:52Z

👋 Hey folks! I'm an engineer over at Mux, the company asking y'all for this feature 🙌

Overall I think what y'all are saying makes sense, for some clarity I think we only need the ability to execute a single SQL statement before progressing onto the next pod to ensure our strict latency requirements are met.

For example, a query we'd want to run before a pod rolls is something like this (used a random value threshold since I'm not 100% sure what replication lag threshold we'd want to block progress on):

SELECT value FROM system.asynchronous_metrics WHERE metric = 'ReplicasMaxAbsoluteDelay' AND value > 5

I was thinking it could be implemented similar to how Prometheus alerts work, where any data returned means a failure: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

Whenever the alert expression results in one or more vector elements at a given point in time, the alert counts as active for these elements' label sets.

It looks like we already have some logic that runs SQL before it proceeds with the reconcile here:

clickhouse-operator/pkg/controller/chi/worker-chi-reconciler.go

Line 704 in d5f265f

_ = w.completeQueries(ctx, host)

Under the hood, that calls

clickhouse-operator/pkg/model/chi/schemer/schemer.go

Lines 151 to 154 in d5f265f

    
           // HostActiveQueriesNum returns how many active queries are on the host 
        
           func (s *ClusterSchemer) HostActiveQueriesNum(ctx context.Context, host *api.ChiHost) (int, error) { 
        
           	return s.QueryHostInt(ctx, host, s.sqlActiveQueriesNum()) 
        
           }

which runs

clickhouse-operator/pkg/model/chi/schemer/sql.go

Lines 244 to 246 in d5f265f

    
           func (s *ClusterSchemer) sqlActiveQueriesNum() string { 
        
           	return `SELECT count() FROM system.processes` 
        
           }

.

We could just add another function call right above/below w.completeQueries() to execute the query provided in the custom resource (if it exists) 🎉

I think all we really need is the ability to specify a single SQL statement in the custom resource, maybe a field like healthcheckSQL or something (I hate naming things, y'all probably have better ideas for a name haha), since you can just simply chain multiple statements together in the event you need to halt reconciles based on multiple sources of data. And then in the operator we would simply check if any data was returned from the query and if so, halt progress in the operator until no data is returned (or some external timeout is hit like mentioned in a previous comment).

Curious what y'all think of this!

alex-zaitsev · 2024-08-26T07:55:53Z

@com6056, thank you for your input. As an alternative, would it make sense to have this custom SQL as a part of the readiness probe? Or even a special condition to include host to the load balancer.

I am asking because syncing the replica may take quite a lot of time. Operator has no visibility what is going on when check is failing. What is a timeout value?

Alternatively, we may add replica wait logic to operator explicitly, without a custom SQL. In this case, operator will know what is going on. It will be able not just monitor the replica delay, but also check replication queue and see if there is any progress, or host is stuck.

com6056 · 2024-08-26T16:59:05Z

I think the latter idea sounds best with the operator being in full control of the rolling based on those metrics. Ideally we don't want pods rolling if the replication delay is above a certain amount and I don't believe a readiness probe halts rolling of pods with this setup (not 100% sure how it all works in your operator since you have a special StatefulSet setup), the probe only prevents the pod from being served traffic.

sunsingerus self-assigned this Aug 15, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Provide the ability to add custom health checks that precede and follow ClickHouse reconfiguration/upgrade #1477

Provide the ability to add custom health checks that precede and follow ClickHouse reconfiguration/upgrade #1477

hodgesrm commented Aug 6, 2024 •

edited

Loading

sunsingerus commented Aug 7, 2024

ondrej-smola commented Aug 7, 2024 •

edited

Loading

com6056 commented Aug 19, 2024 •

edited

Loading

alex-zaitsev commented Aug 26, 2024

com6056 commented Aug 26, 2024

Provide the ability to add custom health checks that precede and follow ClickHouse reconfiguration/upgrade #1477

Provide the ability to add custom health checks that precede and follow ClickHouse reconfiguration/upgrade #1477

Comments

hodgesrm commented Aug 6, 2024 • edited Loading

sunsingerus commented Aug 7, 2024

ondrej-smola commented Aug 7, 2024 • edited Loading

com6056 commented Aug 19, 2024 • edited Loading

alex-zaitsev commented Aug 26, 2024

com6056 commented Aug 26, 2024

hodgesrm commented Aug 6, 2024 •

edited

Loading

ondrej-smola commented Aug 7, 2024 •

edited

Loading

com6056 commented Aug 19, 2024 •

edited

Loading