xDS Memory Issue - Issue w/ LrsCallState, statsTimer #2892

Open
gfrancz opened this issue Jan 28, 2025 · 0 comments

Problem description

We've discovered a memory leak in Node.js applications using gRPC with xDS (@grpc/grpc-js v1.12.5 and @grpc/grpc-js-xds v1.12.1). Applications showed a consistent pattern of memory growth, increasing by approximately 300MB within 24 hours of running, eventually leading to container crashes due to hitting memory limits.

Reproduction steps

The issue occurred consistently when xDS was enabled.

Environment

  • OS name, version and architecture: Debian 4.19.208-1 x86_64 GNU/Linux
  • Node version: v20.17.0
  • Node installation method [e.g. nvm]: nvm
  • If applicable, compiler version [e.g. clang 3.8.0-2ubuntu4]: N/A
  • Package name and version: @grpc/grpc-js v1.12.5 and @grpc/grpc-js-xds v1.12.1

Additional context

We have recently started to use xDS with gRPC, and in our Node applications we’re using the following packages:

  • @grpc/grpc-js: v1.12.5
  • @grpc/grpc-js-xds: v1.12.1

With xDS enabled, we noticed that all Node applications exhibit the same memory utilization pattern: within 24 hours of running, the memory footprint of the application increases by about 300 MB.

At some point the containers hit the max memory limit, crash and are recreated.
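For reference, about 300 MB over 24 hours works out to roughly 12.5 MB per hour. A minimal sketch of how such a rate could be estimated from two samples follows; the helper names `sampleMemory` and `growthMbPerHour` are hypothetical, and only `process.memoryUsage()` is an actual Node.js API.

```javascript
// Hypothetical helpers for estimating memory growth; not part of any gRPC package.

// Take one sample of the current process using Node's built-in API.
function sampleMemory() {
  return { timeMs: Date.now(), rssBytes: process.memoryUsage().rss };
}

// Estimate growth in MB per hour between two samples taken in time order.
function growthMbPerHour(first, last) {
  const elapsedHours = (last.timeMs - first.timeMs) / 3_600_000;
  if (elapsedHours <= 0) {
    throw new RangeError('samples must be in increasing time order');
  }
  const deltaMb = (last.rssBytes - first.rssBytes) / (1024 * 1024);
  return deltaMb / elapsedHours;
}
```

A steady positive rate across many samples, rather than a sawtooth that returns to a baseline after garbage collection, is what would point toward a leak.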

With xDS disabled, we do not observe this pattern. Disabled means that the grpc-js-xds package is still loaded; however, the endpoint used for gRPC does not use the xDS protocol. E.g.,

// Use a direct host:port target when xDS is switched off; otherwise use the xds: scheme.
process.env.WS_GRPC_XDS_OFF === 'true' ?
	`${packageName}.platform.internal:8001` :
	`xds:///${packageName}:8000`;

Below is a chart of memory utilization for one Node application. In this example, xDS was turned on for the application just before 6:00 PM on January 9th. Over the following 24 hours, memory utilization climbed abnormally. Just after 6:00 on January 10th, xDS was disabled and the containers were restarted. For the next 2+ days, the application's memory profile was normal.

Image

This is one particular Node application, but we observed this pattern on all Node applications where xDS was enabled.

After analyzing heap snapshots of affected application instances, we identified a large number of LrsCallState objects with a high total retained memory size (249 instances, with 631 MB retained in this example).

Image
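For anyone who wants to repeat this kind of analysis, one possible way to capture a heap snapshot from inside a running Node process is sketched below. This is not necessarily how the snapshots above were taken; the helper name `captureHeapSnapshot` and the file naming scheme are assumptions, while `v8.writeHeapSnapshot` is a built-in Node.js API.

```javascript
const v8 = require('node:v8');
const path = require('node:path');
const os = require('node:os');

// Hypothetical helper: write a heap snapshot of the current process into `dir`.
function captureHeapSnapshot(dir = os.tmpdir()) {
  const file = path.join(dir, `heap-${process.pid}-${Date.now()}.heapsnapshot`);
  // v8.writeHeapSnapshot is synchronous and returns the name of the file it wrote.
  return v8.writeHeapSnapshot(file);
}
```

The resulting file can be loaded in the Memory tab of Chrome DevTools, where filtering by class name (for example LrsCallState) shows instance counts and retained sizes. Writing a snapshot pauses the process and needs extra memory, so it is best done on a single instance rather than across a whole fleet.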

After reviewing the implementation, we expected to see only one instance of the LrsCallState class, because it is assigned to an attribute of XdsSingleServerClient, and we found only one instance of that client:

https://github.com/grpc/grpc-node/blob/master/packages/grpc-js-xds/src/xds-client.ts#L822

After deeper review, we found that this lrsCallState attribute is unset here:

https://github.com/grpc/grpc-node/blob/master/packages/grpc-js-xds/src/xds-client.ts#L941

And then recreated here:

https://github.com/grpc/grpc-node/blob/master/packages/grpc-js-xds/src/xds-client.ts#L956

Normally, the unset instance without references would be garbage collected; however, the LrsCallState has a NodeJS.Timeout that is created with a setInterval call here:

https://github.com/grpc/grpc-node/blob/master/packages/grpc-js-xds/src/xds-client.ts#L719

When the attribute is unset, the statsTimer is not cleared, so the timer stays active and remains referenced from the global context. Because the timer holds a back-reference to the LrsCallState instance, the LrsCallState and its associated resources are never collected, resulting in a memory leak:

Image
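The retention pattern described above can be reproduced in isolation. The following is a minimal sketch, not the grpc-js-xds code: `CallStateSketch`, `ClientSketch`, `activeTimers`, and `countTimersAfterRestarts` are hypothetical names, and the `close()` method shows only one possible way to release the timer, which may differ from the change in the draft PR.

```javascript
// Illustrative sketch only: these classes are hypothetical stand-ins,
// not the actual grpc-js-xds implementation.

// Bookkeeping so the sketch can observe which timers are still scheduled.
const activeTimers = new Set();

class CallStateSketch {
  constructor(intervalMs) {
    this.reportsSent = 0;
    // The arrow function captures `this`, so while the interval stays
    // scheduled, the timer keeps this whole instance reachable.
    this.statsTimer = setInterval(() => this.sendStats(), intervalMs);
    activeTimers.add(this.statsTimer);
  }

  sendStats() {
    this.reportsSent += 1;
  }

  // Stop the periodic work so the timer no longer references this instance.
  close() {
    clearInterval(this.statsTimer);
    activeTimers.delete(this.statsTimer);
    this.statsTimer = null;
  }
}

class ClientSketch {
  constructor() {
    this.callState = null;
  }

  startCall() {
    this.callState = new CallStateSketch(60_000);
  }

  // Leaky variant: drops the reference but leaves the interval scheduled.
  handleStreamEndLeaky() {
    this.callState = null;
  }

  // Fixed variant: clears the timer before dropping the reference.
  handleStreamEndFixed() {
    if (this.callState !== null) {
      this.callState.close();
    }
    this.callState = null;
  }
}

// Restart the call `restarts` times and report how many timers remain scheduled.
function countTimersAfterRestarts(restarts, fixed) {
  const client = new ClientSketch();
  for (let i = 0; i < restarts; i++) {
    client.startCall();
    if (fixed) {
      client.handleStreamEndFixed();
    } else {
      client.handleStreamEndLeaky();
    }
  }
  const remaining = activeTimers.size;
  // Clean up so this demo process can exit.
  for (const timer of activeTimers) clearInterval(timer);
  activeTimers.clear();
  return remaining;
}
```

In this sketch, every restart in the leaky variant leaves one more interval scheduled, and each of those intervals keeps its own instance alive. The fixed variant leaves none, so the dropped instances become eligible for garbage collection.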

We created a patched version of the grpc-js-xds package, with the changes in this draft PR:

#2891

After 3 full days of running the patched xDS client, memory utilization is much closer to the normal profile for our Node applications.
