
Catch dm oom #1 (Draft)

ege-st wants to merge 4 commits into master
Conversation

@ege-st (Owner) commented Aug 25, 2023:

Starting PR for catching unrecoverable exceptions and shutting a Pinot instance down so it can be restarted.

Note: PR not opened against the base repository so that I can do testing and polishing first.

    // If the cause is an Error (OutOfMemoryError or any other Error), shut down the process
    if (cause instanceof Error) {
        LOGGER.error("Unrecoverable error: shutting down.", cause);
        System.exit(1);
    }
@gortiz commented:
Executing System.exit is a huge antipattern. It is OK-ish as a hotfix, but it shouldn't be the norm.

Instead, there should be a protocol between the thread executing this and the server/broker main thread. When this thread reaches a fatal condition like this one, it should communicate the issue to the main thread, and that thread should produce a controlled stop.

Executing a System.exit may have unpredictable consequences. For example, if this code is reached in a test, the process will die without reporting the issues. It could also happen that we kill the process in the middle of an important transaction and leave some stored data in an inconsistent (aka corrupted) state.
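
For illustration, one way such a protocol could look (every name here is hypothetical, not an existing Pinot API): the handler records the fatal cause and signals a latch, and the main thread awaits it and runs its normal shutdown sequence.

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.atomic.AtomicReference;

    // Hypothetical sketch: a worker thread signals a fatal condition; the main
    // thread performs the controlled stop. Not an existing Pinot API.
    public final class FatalErrorSignal {
        private static final CountDownLatch LATCH = new CountDownLatch(1);
        private static final AtomicReference<Throwable> CAUSE = new AtomicReference<>();

        private FatalErrorSignal() {
        }

        // Called from the Netty handler instead of System.exit(1).
        public static void raise(Throwable cause) {
            CAUSE.compareAndSet(null, cause);
            LATCH.countDown();
        }

        // The main thread blocks here and, once signalled, runs its graceful
        // shutdown sequence before exiting with a non-zero status.
        public static Throwable await() throws InterruptedException {
            LATCH.await();
            return CAUSE.get();
        }
    }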

@gortiz commented:

BTW, in case we actually want to do this, we can just enable these JVM flags instead of populating the code with exits:

https://stackoverflow.com/questions/12096403/java-shutting-down-on-out-of-memory-error
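
For reference, the flags from that thread (HotSpot; the launch command below is illustrative, not Pinot's actual start script):

    # Terminate the JVM on the first OutOfMemoryError:
    java -XX:+ExitOnOutOfMemoryError -jar server.jar
    # Or run an arbitrary command when an OOM is thrown:
    java -XX:OnOutOfMemoryError="kill -9 %p" -jar server.jar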

@gortiz commented:

I think the correct solution would be to add here something like:

    super.exceptionCaught(ctx, cause);

That would fire the default behavior, which is:

    public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) throws Exception {
        ctx.fireExceptionCaught(cause);
    }

which means that the next ChannelHandler will be informed. That means the error will be propagated along the handler chain and will eventually reach the initial handler added by Netty, which should break the connection.
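
Put together, a sketch of that suggestion (the handler and logger names are illustrative, not the actual Pinot class):

    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelInboundHandlerAdapter;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Sketch: log the fatal condition, then delegate to the default behavior so
    // the exception keeps propagating along the pipeline.
    public class ExampleHandler extends ChannelInboundHandlerAdapter {
        private static final Logger LOGGER = LoggerFactory.getLogger(ExampleHandler.class);

        @Override
        public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) throws Exception {
            if (cause instanceof Error) {
                LOGGER.error("Unrecoverable error on channel", cause);
            }
            super.exceptionCaught(ctx, cause); // equivalent to ctx.fireExceptionCaught(cause)
        }
    }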

@gortiz commented:

BTW: I'm like 90% sure that firing the exception will close the channel. But in the Baeldung example, the channel is closed explicitly by calling ctx.close(). Maybe that is the correct solution.
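
The explicit variant would mean replacing the override in the sketch above with something like:

    @Override
    public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
        LOGGER.error("Exception caught on channel, closing it", cause);
        ctx.close(); // close the channel explicitly instead of relying on propagation
    }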

@gortiz commented:

BTW, this method is deprecated.

@ege-st (Owner, Author) commented:

Responses:

  • Regarding System.exit: agreed. I asked whether we had a way to trigger a graceful shutdown, and it seems this is the way to do it in Pinot. I also spent a bit of time trying to find an alternative and could not.
  • Which method was deprecated?

@ege-st (Owner, Author) commented:

As far as dropping the message in channelRead0: I believe the OOM happens before that function is called, so we cannot drop it on the receiver side.

I looked a few weeks ago and could not find a way to cap the direct memory that Netty uses such that it drops an incoming message that is too big before it OOMs.
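
For reference, direct memory can be capped at the JVM level or via Netty's own accounting, but hitting either cap raises an error rather than dropping the oversized message, which matches the behavior described above:

    # Cap the JVM's direct memory; allocations beyond it throw OutOfMemoryError:
    java -XX:MaxDirectMemorySize=4g ...
    # Cap the direct memory tracked by Netty (in bytes); exceeding it throws
    # OutOfDirectMemoryError:
    java -Dio.netty.maxDirectMemory=4294967296 ...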

@ege-st (Owner, Author) commented:

One concern I have with simply breaking the connection or recreating it is: what happens to the data in direct memory? Does Netty clean it up and reclaim it, or does it create a memory leak?

@gortiz commented Sep 8, 2023:

Netty uses a reference counter and we are in charge of keeping its invariants. We need to be sure that we are increasing and decreasing the counter as expected. If we are doing that correctly, Netty will release the direct memory.

> I believe the OOM happens before that function is called so we cannot drop it on the receiver side.

But if that is correct, it seems that we are configuring Netty to use heap buffers, not direct buffers. BTW, both heap and direct buffers use reference counting, so we need to keep the invariant anyway.

It seems to be the opposite: the OOM error is thrown due to allocating too much direct memory.
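
A sketch of the invariant in question (an illustrative handler, not the actual Pinot code): whichever handler consumes the inbound buffer must release it exactly once.

    import io.netty.buffer.ByteBuf;
    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelInboundHandlerAdapter;
    import io.netty.util.ReferenceCountUtil;

    public class ExampleReadHandler extends ChannelInboundHandlerAdapter {
        @Override
        public void channelRead(ChannelHandlerContext ctx, Object msg) {
            ByteBuf buf = (ByteBuf) msg;
            try {
                // ... deserialize buf into a DataTable here ...
            } finally {
                // Decrement the reference count so Netty can reclaim the
                // (heap or direct) memory backing the buffer.
                ReferenceCountUtil.release(buf);
            }
        }
    }

Note that SimpleChannelInboundHandler performs this release automatically after channelRead0 returns.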

@ege-st (Owner, Author) commented:

> Netty uses a reference counter and we are in charge of keeping its invariants. We need to be sure that we are increasing and decreasing the counter as expected. If we are doing that correctly, Netty will release the direct memory.

Correct, and as far as I can tell, we only keep the Netty-managed memory for as long as it takes to deserialize the message into a DataTable value. That is, we don't create any additional references. I think our event loop has a very simple one-step chain: it gets the full byte buffer from Netty, deserializes it into a DataTable, and then sends the reference to the DataTable object to the Query Router.

> It seems to be the opposite: the OOM error is thrown due to allocating too much direct memory.

Yes, I believe the OOM happens when Netty is attempting to load the message so that it can then route it to the event loop.
