-
Notifications
You must be signed in to change notification settings - Fork 83
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Provide one-shot check mode #12
Comments
The challenge I see for implementing this enhancement is the broker replication check. Some background on that check: we found that when running Kafka 0.10.0.0, the brokers sometimes get stuck in a state where they continue to replay simple topics, but don't cooperate in the replication of topics with replication factor greater than 1. To heal this state, a simple restart of the non-cooperating broker is enough. To detect the non-cooperating broker state, this check continuously injects records in a replication check topic that is replicated by every broker of the cluster, and by this tries to kick non-cooperating brokers out of the topic's ISR. If the broker is found to be out of the ISR, it is considered unhealthy. In order to run that check, the kafka-health-check has to get the broker it checks into the replica set of the replication check topic when it is started. Similarly, it shrinks the replica set to exclude the monitored broker when the check is shut down. I don't like this check since it doesn't seem to be robust; however, I found it to be necessary in order to automatically fix these non-cooperation states. Now to come back to this enhancement: how would we go about implementing this replication check in a one-shot check mode? Or maybe we should first introduce a simple check mode excluding the replication check, and only support hat in the one-shot check mode? |
Just a little feedback:
works perfectly, thank you!! |
Sorry for the silence, things have been exciting! I think it would make sense to have a mode like this skip the replication analysis process you've described. At least for me, the use-case is in bootstrapping or sanity-checking, and this could easily be the first Kafka instance to start up. I feel like that's especially OK since one of the first things that will run after Kafka "boots" is |
@UnsignedLong The check you pasted only tests that kafka-health-check is running, not that the cluster it is observing is up yet, as far as I can tell. Getting to the latter is why I'd like this feature. |
@wilson because the endpoint returns a 500 if the status is "nook" it works great as a broker liveness probe - I have raised the "initialDelaySeconds" to 3600 to give an unclean shutdown broker enough time to recover. |
Haha I have been doing it the hard way, then. I was thinking those were 200s as well. |
Provide start options that execute the checks once. The check should wait for the broker to become healthy within a specified deadline, and report success or failure through exit codes 0 and 1 respectively.
Motivation: Container health checks have to be specified as repeatedly executed commands. Providing a one-shot check mode would allow to use
kafka-health-check
directly instead of usingcurl
to query the daemonized kafka-health-check process.See also #8 (comment)
The text was updated successfully, but these errors were encountered: