
Catch dm oom #1 (Draft)

ege-st wants to merge 4 commits into master
Conversation

@ege-st (Owner) commented Aug 25, 2023:

Starting PR for catching unrecoverable exceptions and shutting a Pinot instance down so it can be restarted.

Note: PR not opened against the base repository so that I can do testing and polishing first.

    // If the cause is an Error (OutOfMemoryError or any other Error), shut down the process
    if (cause instanceof Error) {
        LOGGER.error("Unrecoverable error: shutting down.", cause);
        System.exit(1);
    }
@gortiz commented:
Executing System.exit is a huge antipattern. It is OK-ish as a hotfix, but it shouldn't be the norm.

Instead, there should be a protocol between the thread executing this and the server/broker main thread. When this thread reaches a fatal condition like this one, it should communicate the issue to the main thread, and that thread should produce a controlled stop.

Executing a System.exit may have unpredictable consequences. For example, if this code is reached in a test, the process will die without reporting the issues. It could also happen that we kill the process in the middle of an important transaction and leave some stored data in an inconsistent (aka corrupted) state.
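
For illustration, one way such a protocol could look (every name here is hypothetical, not an existing Pinot API): the handler records the fatal cause and signals a latch, and the main thread awaits it and runs its normal shutdown sequence.

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.atomic.AtomicReference;

    // Hypothetical sketch: a worker thread signals a fatal condition; the main
    // thread performs the controlled stop. Not an existing Pinot API.
    public final class FatalErrorSignal {
        private static final CountDownLatch LATCH = new CountDownLatch(1);
        private static final AtomicReference<Throwable> CAUSE = new AtomicReference<>();

        private FatalErrorSignal() {
        }

        // Called from the Netty handler instead of System.exit(1).
        public static void raise(Throwable cause) {
            CAUSE.compareAndSet(null, cause);
            LATCH.countDown();
        }

        // The main thread blocks here and, once signalled, runs its graceful
        // shutdown sequence before exiting with a non-zero status.
        public static Throwable await() throws InterruptedException {
            LATCH.await();
            return CAUSE.get();
        }
    }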

@gortiz commented:

BTW, in case we actually want to do this, we can just enable these JVM flags instead of populating the code with exits:

https://stackoverflow.com/questions/12096403/java-shutting-down-on-out-of-memory-error
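
For reference, the flags from that thread (HotSpot; the launch command below is illustrative, not Pinot's actual start script):

    # Terminate the JVM on the first OutOfMemoryError:
    java -XX:+ExitOnOutOfMemoryError -jar server.jar
    # Or run an arbitrary command when an OOM is thrown:
    java -XX:OnOutOfMemoryError="kill -9 %p" -jar server.jar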

@gortiz commented:

I think the correct solution would be to add here something like:

    super.exceptionCaught(ctx, cause);

That would fire the default behavior, which is:

    public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) throws Exception {
        ctx.fireExceptionCaught(cause);
    }

which means that the next ChannelHandler will be informed. That means the error will be propagated along the handler chain and will eventually reach the initial handler added by Netty, which should break the connection.
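
Put together, a sketch of that suggestion (the handler and logger names are illustrative, not the actual Pinot class):

    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelInboundHandlerAdapter;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Sketch: log the fatal condition, then delegate to the default behavior so
    // the exception keeps propagating along the pipeline.
    public class ExampleHandler extends ChannelInboundHandlerAdapter {
        private static final Logger LOGGER = LoggerFactory.getLogger(ExampleHandler.class);

        @Override
        public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) throws Exception {
            if (cause instanceof Error) {
                LOGGER.error("Unrecoverable error on channel", cause);
            }
            super.exceptionCaught(ctx, cause); // equivalent to ctx.fireExceptionCaught(cause)
        }
    }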

@gortiz commented:

BTW: I'm like 90% sure that firing the exception will close the channel. But in the Baeldung example, the channel is closed explicitly by calling ctx.close(). Maybe that is the correct solution.
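
The explicit variant would mean replacing the override in the sketch above with something like:

    @Override
    public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
        LOGGER.error("Exception caught on channel, closing it", cause);
        ctx.close(); // close the channel explicitly instead of relying on propagation
    }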

@gortiz commented:

BTW, this method is deprecated.

@ege-st (Owner, Author) commented:

Responses:

  • Regarding System.exit: agreed. I asked whether we had a way to trigger a graceful shutdown, and it seems this is the way to do it in Pinot. I also spent a bit of time trying to find an alternative and could not.
  • Which method was deprecated?

@ege-st (Owner, Author) commented:

As far as dropping the message in channelRead0: I believe the OOM happens before that function is called, so we cannot drop it on the receiver side.

I looked a few weeks ago and could not find a way to cap the direct memory that Netty uses such that it drops an incoming message that is too big before it OOMs.
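
For reference, direct memory can be capped at the JVM level or via Netty's own accounting, but hitting either cap raises an error rather than dropping the oversized message, which matches the behavior described above:

    # Cap the JVM's direct memory; allocations beyond it throw OutOfMemoryError:
    java -XX:MaxDirectMemorySize=4g ...
    # Cap the direct memory tracked by Netty (in bytes); exceeding it throws
    # OutOfDirectMemoryError:
    java -Dio.netty.maxDirectMemory=4294967296 ...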

@ege-st (Owner, Author) commented:

One concern I have with simply breaking the connection or recreating it is: what happens to the data in direct memory? Does Netty clean it up and reclaim it, or does it create a memory leak?

@gortiz commented Sep 8, 2023:

Netty uses a reference counter and we are in charge of keeping its invariants. We need to be sure that we are increasing and decreasing the counter as expected. If we are doing that correctly, Netty will release the direct memory.

> I believe the OOM happens before that function is called so we cannot drop it on the receiver side.

But if that is correct, it seems that we are configuring Netty to use heap buffers, not direct buffers. BTW, both heap and direct buffers use reference counting, so we need to keep the invariant anyway.

It seems to be the opposite: the OOM error is thrown due to allocating too much direct memory.
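
A sketch of the invariant in question (an illustrative handler, not the actual Pinot code): whichever handler consumes the inbound buffer must release it exactly once.

    import io.netty.buffer.ByteBuf;
    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelInboundHandlerAdapter;
    import io.netty.util.ReferenceCountUtil;

    public class ExampleReadHandler extends ChannelInboundHandlerAdapter {
        @Override
        public void channelRead(ChannelHandlerContext ctx, Object msg) {
            ByteBuf buf = (ByteBuf) msg;
            try {
                // ... deserialize buf into a DataTable here ...
            } finally {
                // Decrement the reference count so Netty can reclaim the
                // (heap or direct) memory backing the buffer.
                ReferenceCountUtil.release(buf);
            }
        }
    }

Note that SimpleChannelInboundHandler performs this release automatically after channelRead0 returns.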

@ege-st (Owner, Author) commented:

> Netty uses a reference counter and we are in charge of keeping its invariants. We need to be sure that we are increasing and decreasing the counter as expected. If we are doing that correctly, Netty will release the direct memory.

Correct, and as far as I can tell, we only keep the Netty-managed memory for as long as it takes to deserialize the message into a DataTable value. That is, we don't create any additional references. I think our event loop has a very simple one-step chain: it gets the full byte buffer from Netty, deserializes it into a DataTable, and then sends the reference to the DataTable object to the Query Router.

> It seems to be the opposite: the OOM error is thrown due to allocating too much direct memory.

Yes, I believe the OOM happens when Netty is attempting to load the message so that it can then route it to the event loop.
