Linstor controller crashlooping #364
Comments
Thank you. I have found and fixed the bug causing the NullPointerException. That exception might kill the thread that collects data for SpaceTracking, which I agree is not ideal, but I do not see why it should kill the entire controller. I do see the last log line, but could you please check for additional logs or more ErrorReports? Or is it possible that the machine simply runs out of memory? |
There's no last line; I think it's getting killed before the line is flushed. And no, it has plenty of memory, although I think I saw some errors about getting disconnected from peers. @WanzenBug could you comment on why the controller is killed when there are errors? I think I've seen similar behavior when it couldn't resize a volume (piraeusdatastore/piraeus-operator#345) |
It should not get killed. The only reason would be if either the liveness probe fails (unlikely, as the probe is not even using the LINSTOR API), or if the LINSTOR Controller exits. |
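One way to tell these cases apart (the process exiting on its own, an out-of-memory kill, or an external kill such as a failed probe) is to look at the container's last termination state in the pod status. The following is only a minimal sketch: it assumes the pod JSON comes from `kubectl get pod <pod> -o json`, and the container name `linstor-controller` and the helper `classify_last_termination` are hypothetical names used for illustration.

```python
def classify_last_termination(pod: dict, container: str) -> str:
    """Describe why `container` last terminated, based on the pod status JSON."""
    statuses = pod.get("status", {}).get("containerStatuses", [])
    for status in statuses:
        if status.get("name") != container:
            continue
        terminated = status.get("lastState", {}).get("terminated")
        if terminated is None:
            return "no previous termination recorded"
        reason = terminated.get("reason", "")
        exit_code = terminated.get("exitCode")
        if reason == "OOMKilled":
            return "out of memory"
        if exit_code is not None and exit_code > 128:
            # Exit codes above 128 conventionally mean "killed by signal (code - 128)",
            # e.g. 137 = SIGKILL. A failed liveness probe usually ends this way;
            # the pod events would confirm it.
            return f"killed by signal {exit_code - 128}"
        return f"process exited with code {exit_code}"
    return "container not found"


if __name__ == "__main__":
    # Hand-written sample status; the container name is an assumption.
    sample = {
        "status": {
            "containerStatuses": [
                {
                    "name": "linstor-controller",
                    "restartCount": 12,
                    "lastState": {"terminated": {"exitCode": 1, "reason": "Error"}},
                }
            ]
        }
    }
    print(classify_last_termination(sample, "linstor-controller"))
```

A plain non-zero exit code would suggest the controller exited by itself, whereas a signal-based exit code points towards the kubelet or the kernel; in the latter case `kubectl describe pod` events are the next place to look.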
To get more logs, you can use a strategy described here: piraeusdatastore/piraeus-operator#184 (comment) |
OK, I figured out the problem. |
Another error now:
|
Could I please get the image with the first error fixed, at least? |
Still stuck and can't use the storage... Help please |
The fix for the first exception For the second exception, I'm still not sure how you can get a null reference there. Would you mind sending me a database dump to my email address (see my profile) so that I can investigate a bit further? |
Ah, I missed it. Thanks!
Do you mean the kubernetes objects? All of them? |
Sure, usually I do something like these two lines:
|
Here's the attempt to run 1.24:
|
Right, thanks, we are already aware of this issue and are working on the fix for it. |
Done |
Hello again! |
I've sent the whole /var/log/linstor-controller folder to the email, is that good? |
Thanks for the reports, and yes they were helpful. It looks like you have some issues with your network, as the first few ErrorReports state:
I agree that LINSTOR should also handle this case better and not allow other components such as SpaceTracking or the autoplacer to run into NullPointerExceptions like in your other ErrorReports, but for now you should investigate the connectivity issue to "fix" the problem. We will try to find a way to improve LINSTOR's error handling in this case. |
Some satellites are not available, it's a big cluster.. Are you saying that's the problem? |
After I manually deleted all "unknown" nodes, I could mount the volumes, as this error went away:
|
I can't delete the broken nodes permanently. Once I delete the "unknown" nodes, it works, but then the operator re-adds them, even though I reduced the diskless satelliteset to just a few nodes |
This seems more like an operator issue then. Are you sure you used the right label to limit the satellite? You need to set the nodeSelector in the |
I'm still having the above issue with controller crashlooping once I have "Unknown" nodes in the cluster (and I can't delete those because of an error in the operator) |
Controller logs: