Resource leak when using IPC subscription #589

Open
erikfinnman opened this issue Aug 15, 2024 · 3 comments
Labels
bug This issue is a bug. investigating Issue is being investigated and/or work is in progress to resolve the issue. p2 This is a standard priority issue

Comments

erikfinnman commented Aug 15, 2024

Describe the bug

We have detected what appears to be a resource leak in the Greengrass nucleus related to IPC subscriptions.

The Python code below, running in a Greengrass component, repeatedly creates an IPC client, sets up a topic subscription, and then closes both the subscription and the client. When it is executed, the underlying resources in the nucleus do not appear to be freed.

Expected Behavior

Resources in the Greengrass nucleus should be released when the client is closed.

Current Behavior

Heap is eventually exhausted for the Greengrass Java process.

Reproduction Steps

    log.info("Mem-test of IPC client")
    count = 0
    while True:
        ipc_client = GreengrassCoreIPCClientV2()
        request_id = uuid.uuid1()
        response_topic = f"dummy_method-response-{request_id}"
        def response_listener(message: SubscriptionResponseMessage) -> None:
            log.info("Response listener")
        def error_listener(_: Exception) -> Union[None, bool]:
            log.info("Error listener")
            return True

        _, operation = ipc_client.subscribe_to_topic(
            topic=response_topic,
            on_stream_event=response_listener,
            on_stream_error=error_listener,
        )
        operation.close()
        ipc_client.close()
        count += 1
        if count > 10000:
            log.info("Created %s clients", count)
            count = 0
            time.sleep(1)

If the Greengrass heap is limited to something like 100 MB, memory is exhausted after about 15-20 minutes of running the above snippet in a Greengrass component, which we can observe by enabling the JVM's Native Memory Tracking feature.

The above code snippet was the most compact way we could replicate the problem we have been seeing on our production devices (there the memory leak takes several weeks to manifest, since we obviously don't create clients as frequently as in the snippet above).

Analyzing memory dumps of the JVM identifies the com.aws.greengrass.builtin.services.pubsub.PubSubIPCEventStreamAgent as the object retaining almost all memory. Digging into the references of this class reveals hundreds of thousands of objects of type java.util.concurrent.ConcurrentHashMap$Node which in turn have references to com.aws.greengrass.builtin.services.pubsub.SubscriptionTrie. It looks like this class contains the topic name of the generated subscription.

Studying the IPC documentation, I can't see anything obviously wrong with our code snippet: both the Greengrass IPC client and the subscription operation are closed. Shouldn't this free up all resources?

I raised this issue with the Nucleus team, but they state that this must be a problem in the Python SDK since the client is not disconnected: aws-greengrass/aws-greengrass-nucleus#1650

Possible Solution

No response

Additional Information/Context

I've also tried a version of the above snippet where the client is created once (outside the loop), but I get the same behavior.
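
For reference, this is roughly what that single-client variant looks like (a sketch assuming the same imports, logging setup, and listener callbacks as the snippet above; only the client construction moves out of the loop):

    # Single long-lived client; only the subscription is opened and
    # closed inside the loop.
    ipc_client = GreengrassCoreIPCClientV2()
    count = 0
    while True:
        response_topic = f"dummy_method-response-{uuid.uuid1()}"

        _, operation = ipc_client.subscribe_to_topic(
            topic=response_topic,
            on_stream_event=response_listener,
            on_stream_error=error_listener,
        )
        operation.close()

        count += 1
        if count > 10000:
            log.info("Created %s subscriptions", count)
            count = 0
            time.sleep(1)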

SDK version used

1.19.0

Environment details (OS name and version, etc.)

Linux 5.15.61-v8+ #1579 SMP PREEMPT 2022 aarch64 GNU/Linux

@erikfinnman erikfinnman added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Aug 15, 2024
@jmklix jmklix self-assigned this Aug 21, 2024

jmklix commented Sep 11, 2024

Sorry for the delay, still looking into this. Trying to verify that the server correctly gets the message that the channel is closed. If that isn't happening, then there is likely a problem with this SDK. Otherwise it might be a Greengrass bug.
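
One way to narrow this down from the client side would be to block on the close of the subscription before closing the client. The sketch below assumes that operation.close() in the Python SDK returns a concurrent.futures.Future that completes once the underlying event-stream continuation is closed (an assumption worth verifying against the SDK source). If that future resolves but the nucleus still retains the subscription, that would point at the server side:

    _, operation = ipc_client.subscribe_to_topic(
        topic=response_topic,
        on_stream_event=response_listener,
        on_stream_error=error_listener,
    )

    # Assumption: close() returns a future that resolves once the
    # stream-close message for this continuation has actually been sent.
    close_future = operation.close()
    close_future.result(timeout=10)

    ipc_client.close()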

@jmklix jmklix added investigating Issue is being investigated and/or work is in progress to resolve the issue. p2 This is a standard priority issue and removed needs-triage This issue or PR still needs to be triaged. labels Sep 11, 2024
@erikfinnman (Author)

> Sorry for the delay, still looking into this. Trying to verify that the server correctly gets the message that the channel is closed. If that isn't happening, then there is likely a problem with this SDK. Otherwise it might be a Greengrass bug.

Ok, thanks for taking the time to update the issue.

@felixeriksson

Hello, I just wanted to add some information. In the reproduction example above, a new GreengrassCoreIPCClientV2 is created and closed in every loop iteration. This is not necessary to reproduce the problem: the memory leak is also present when only one GreengrassCoreIPCClientV2 is created (before the loop) and reused for every subscription.

We've taken heap dumps of machines running a test like the one in the example. Analysing these dumps, it appears that the memory is held by the Map<String, SubscriptionTrie<K>> children member of an instance of the class SubscriptionTrie (see source). The SubscriptionTrie instance holding the memory is called listeners, which is a member of the class PubSubIPCEventStreamAgent (source).
[attached screenshot of the heap dump analysis]

It seems reasonable to presume that calling operation.close() in the Python SDK should trigger a call to the method remove() on the SubscriptionTrie, which cleans up its member object subscriptionCallbacks, but seemingly not children. Perhaps that's a clue, but I don't understand SubscriptionTrie well enough to say anything about this with certainty.
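
To make the suspected mechanism concrete, here is a simplified, hypothetical sketch in Python (not the actual SubscriptionTrie implementation) of a trie whose remove() clears a node's callbacks but never prunes the now-empty child nodes. Since the reproduction uses a unique topic per subscription, such a structure would grow without bound even though every subscription is removed:

    class TrieNode:
        """Simplified, hypothetical stand-in for the nucleus SubscriptionTrie."""

        def __init__(self):
            self.children = {}               # topic segment -> TrieNode
            self.subscription_callbacks = set()

        def add(self, topic, callback):
            node = self
            for segment in topic.split("/"):
                node = node.children.setdefault(segment, TrieNode())
            node.subscription_callbacks.add(callback)

        def remove(self, topic, callback):
            node = self
            for segment in topic.split("/"):
                node = node.children.get(segment)
                if node is None:
                    return False
            # The callback set is cleaned up ...
            node.subscription_callbacks.discard(callback)
            # ... but the now-empty chain of child nodes is never pruned,
            # so every unique topic leaves a permanent entry in `children`.
            return True


    # Each loop iteration in the reproduction uses a fresh topic name,
    # so the children map keeps one dead branch per closed subscription.
    trie = TrieNode()
    for i in range(3):
        topic = f"dummy_method-response-{i}"
        cb = lambda msg: None
        trie.add(topic, cb)
        trie.remove(topic, cb)
    print(len(trie.children))  # 3 -- grows with the number of subscriptions

If the real SubscriptionTrie.remove() behaves like this for single-use topics, that would match the growth in ConcurrentHashMap$Node objects seen in the heap dumps.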
