Provide one-shot check mode #12

Open
andreas-schroeder opened this issue Sep 25, 2017 · 6 comments

Comments

@andreas-schroeder
Owner

Provide start options that execute the checks once. The check should wait for the broker to become healthy within a specified deadline, and report success or failure through exit codes 0 and 1 respectively.

Motivation: Container health checks have to be specified as repeatedly executed commands. A one-shot check mode would make it possible to use kafka-health-check directly as that command, instead of using curl to query the daemonized kafka-health-check process.

See also #8 (comment)

@andreas-schroeder
Owner Author

The challenge I see for implementing this enhancement is the broker replication check.

Some background on that check: we found that when running Kafka 0.10.0.0, the brokers sometimes get stuck in a state where they continue to replay simple topics, but don't cooperate in the replication of topics with replication factor greater than 1. To heal this state, a simple restart of the non-cooperating broker is enough.

To detect the non-cooperating broker state, this check continuously injects records into a replication check topic that is replicated by every broker of the cluster, and thereby tries to kick non-cooperating brokers out of the topic's in-sync replica set (ISR). If the broker is found to be out of the ISR, it is considered unhealthy.
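The final decision step can be sketched as a simple membership test. This is an illustration only: `replicationHealthy` is a hypothetical helper, and it assumes the ISR of the replication check topic has already been fetched as a list of broker IDs.

```go
package main

// replicationHealthy reports whether brokerID appears in the in-sync
// replica (ISR) list of the replication check topic. The ISR list is
// assumed to have been fetched elsewhere; this helper is hypothetical.
func replicationHealthy(isr []int32, brokerID int32) bool {
	for _, id := range isr {
		if id == brokerID {
			return true
		}
	}
	return false
}
```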

In order to run that check, the kafka-health-check has to get the broker it checks into the replica set of the replication check topic when it is started. Similarly, it shrinks the replica set to exclude the monitored broker when the check is shut down.

I don't like this check since it doesn't seem to be robust; however, I found it to be necessary in order to automatically fix these non-cooperation states.

Now to come back to this enhancement: how would we go about implementing this replication check in a one-shot check mode?

Or maybe we should first introduce a simple check mode that excludes the replication check, and only support that in the one-shot check mode?

@UnsignedLong

UnsignedLong commented Sep 27, 2017

Just a little feedback:
Running kafka && kafka-health-check inside Kubernetes using this liveness probe:

        livenessProbe:
          httpGet:
            path: /
            port: 8000
          initialDelaySeconds: 60
          timeoutSeconds: 5

works perfectly, thank you!!

@wilson
Contributor

wilson commented Oct 4, 2017

Sorry for the silence, things have been exciting!

I think it would make sense to have a mode like this skip the replication analysis process you've described. At least for me, the use-case is in bootstrapping or sanity-checking, and this could easily be the first Kafka instance to start up.

I feel like that's especially OK since one of the first things that will run after Kafka "boots" is kafka-health-check, running in its normal mode. I want to close the timing loopholes that exist when you are just starting up a new cluster.

@wilson
Contributor

wilson commented Oct 4, 2017

@UnsignedLong The check you pasted only tests that kafka-health-check is running, not that the cluster it is observing is up yet, as far as I can tell. Getting to the latter is why I'd like this feature.

@UnsignedLong

@wilson Because the endpoint returns a 500 if the status is "nook", it works great as a broker liveness probe. I have raised "initialDelaySeconds" to 3600 to give a broker recovering from an unclean shutdown enough time.

@wilson
Contributor

wilson commented Oct 5, 2017

Haha I have been doing it the hard way, then. I was thinking those were 200s as well.
