Catch dm oom #1
base: master
Conversation
// If cause is Error (OutOfMemoryError or any other error), shutdown the process
if (cause instanceof Error) {
  LOGGER.error("Unrecoverable error: shutting down.");
  System.exit(1);
Executing System.exit is a huge antipattern. It is OK-ish as a hotfix, but it shouldn't be the norm.
Instead, there should be a protocol between the thread executing this and the server/broker main thread. When this thread reaches a fatal condition like this one, it should communicate the issue to the main thread, and that thread should produce a controlled stop.
Executing a System.exit may have unpredictable consequences. For example, if this code is reached in a test, the process will die without reporting the issues. It could also happen that we kill the process in the middle of an important transaction and leave some stored data in an inconsistent (aka corrupted) state.
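A minimal sketch of what such a protocol could look like (FatalErrorSignal, requestShutdown, and awaitShutdownRequest are illustrative names, not existing Pinot APIs):

import java.util.concurrent.CountDownLatch;

// Illustrative sketch: a shared signal the handler thread trips instead of calling System.exit.
public final class FatalErrorSignal {
  private static final CountDownLatch LATCH = new CountDownLatch(1);

  // Called from the Netty handler thread when an unrecoverable Error is caught.
  public static void requestShutdown() {
    LATCH.countDown();
  }

  // Called from the server/broker main thread; blocks until a fatal condition is signalled,
  // after which the caller can perform a controlled stop (finish in-flight work, close resources, exit).
  public static void awaitShutdownRequest() throws InterruptedException {
    LATCH.await();
  }
}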
BTW, in case we actually want to do this, we can just enable these JVM flags instead of populating the code with exits:
https://stackoverflow.com/questions/12096403/java-shutting-down-on-out-of-memory-error
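For reference, the flags discussed in that thread look roughly like this (the first two require JDK 8u92 or later); this is a sketch of the option, not a recommendation for this PR:

java -XX:+ExitOnOutOfMemoryError ...            # exit the JVM on the first OutOfMemoryError
java -XX:+CrashOnOutOfMemoryError ...           # produce a crash dump instead of just exiting
java -XX:OnOutOfMemoryError="kill -9 %p" ...    # run an arbitrary command when an OOM is thrown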
I think the correct solution would be to add something like this here:
super.exceptionCaught(ctx, cause);
That would fire the default behavior, which is:
public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) throws Exception {
    ctx.fireExceptionCaught(cause);
}
which means that the next ChannelHandler will be informed. That means the error will be propagated along the handler chain and will eventually reach the initial handler added by Netty, which should break the connection.
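A sketch of what that change could look like in the handler (assuming it extends a ChannelInboundHandlerAdapter-style Netty class; the logging line is illustrative):

@Override
public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) throws Exception {
  LOGGER.error("Exception caught in channel handler", cause);
  // Propagate instead of exiting: the tail of the pipeline can then handle or close the connection.
  super.exceptionCaught(ctx, cause);
}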
BTW: I'm like 90% sure that firing the exception will close the channel. But in the Baeldung example, the channel is closed explicitly by calling ctx.close(). Maybe that is the correct solution.
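For comparison, a sketch of the explicit-close variant (again illustrative, not the actual Pinot handler):

@Override
public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
  LOGGER.error("Closing channel due to exception", cause);
  ctx.close();  // explicitly close the channel instead of relying on propagation
}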
BTW, this method is deprecated
Responses:
- Re using System.exit: agreed. I asked whether we had a way to trigger a graceful shutdown, and it seems this is the way to do it in Pinot. I spent a bit of time trying to find an alternative and could not.
- Which method was deprecated?
As far as dropping the message in readChannel0: I believe the OOM happens before that function is called, so we cannot drop it on the receiver side.
I looked a few weeks ago and could not find a way to cap the direct memory that Netty uses such that it drops an incoming message that is too big before it OOMs.
One concern I have with simply breaking the connection or recreating it is: what happens to the data in direct memory? Does Netty clean it up and reclaim it, or does it create a memory leak?
Netty uses a reference counter and we are in charge of keeping its invariants. We need to be sure that we are increasing and decreasing the counter as expected. If we are doing that correctly, Netty will release the direct memory.
I believe the OOM happens before that function is called, so we cannot drop it on the receiver side.
But if that is correct, it seems that we are configuring Netty to use heap buffers, not direct buffers. BTW, both heap and direct buffers use reference counting, so we need to keep the invariant anyway.
It seems to be the opposite. The OOM error is thrown due to allocating too much direct memory.
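A small sketch of the invariant involved (handler name is illustrative; with SimpleChannelInboundHandler Netty releases the buffer for you, while with ChannelInboundHandlerAdapter the handler owns it):

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.util.ReferenceCountUtil;

// Illustrative only: every buffer we take ownership of must eventually be released,
// otherwise the direct memory behind it is never reclaimed.
public class RawMessageHandler extends ChannelInboundHandlerAdapter {
  @Override
  public void channelRead(ChannelHandlerContext ctx, Object msg) {
    try {
      // deserialize msg (a ByteBuf) into a DataTable here
    } finally {
      ReferenceCountUtil.release(msg);  // decrement the reference count so Netty can reclaim the memory
    }
  }
}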
Netty uses a reference counter and we are in charge of keeping its invariants. We need to be sure that we are increasing and decreasing the counter as expected. If we are doing that correctly, Netty will release the direct memory.
Correct, and as far as I can tell, we only keep the Netty-managed memory for as long as it takes to deserialize the message into a DataTable value. That is, we don't create any additional references. I think our event loop has a very simple one-step chain: it gets the full byte buffer from Netty, deserializes it into a DataTable, and then sends the reference to the DataTable object to the Query Router.
It seems to be the opposite. The OOM error is thrown due to allocating too much direct memory.
Yes, I believe the OOM happens when Netty is attempting to load the message so that it can then route it to the event loop.
Starting PR for catching unrecoverable exceptions and shutting a Pinot instance down so it can be restarted.
Note: PR not opened against the base repository so that I can do testing and polishing first.