File descriptors opened by jmx exporter getting stuck in CLOSE_WAIT state #911
Comments
We are also having this problem. We have found it can be so bad that it stops the application from running (in our case a cluster of Trino instances). Do you have any time to work on this? We had to remove it from our production environment.
The core HTTP server code is in the Prometheus prometheus-metrics-exporter-httpserver module. I have created an issue for the …
I have tested my application with the latest changes in the main branch of the exporter, which uses the latest version of the Prometheus HTTP server from the client_java repository, and with it the file descriptor connections stuck in CLOSE_WAIT state are getting cleared. Could you please confirm what changes were made so that the fds are cleared without adding a timeout?
@anilsomisetty I am not aware of any direct changes in that area. We are targeting a beta release (due to the major changes made in client_java 1.x).
@dhoard I see that, due to the changes made in this commit of the client_java code (prometheus/client_java@1966186) after the 0.16.0 release, the connections stuck in CLOSE_WAIT state are cleared after 2 minutes because they are idle. I have tested version 0.19.0 of the jmx exporter integrated with the 0.16.0 release of client_java including this commit, and the stuck CLOSE_WAIT connections are getting closed, which was not the case without this commit.
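To illustrate the behaviour described above, here is a small conceptual sketch (not the client_java implementation; the two-minute threshold, sweep interval, and bookkeeping are assumptions) of how an idle-connection reaper ends up closing sockets that are sitting in CLOSE_WAIT:

```java
import java.io.IOException;
import java.net.Socket;
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class IdleConnectionReaperSketch {
    private final Map<Socket, Instant> lastActivity = new ConcurrentHashMap<>();
    private final Duration idleTimeout = Duration.ofMinutes(2); // assumed threshold

    // Record activity whenever a connection is used.
    void touch(Socket socket) {
        lastActivity.put(socket, Instant.now());
    }

    // Periodically close connections that have been idle longer than the threshold.
    void start() {
        ScheduledExecutorService reaper = Executors.newSingleThreadScheduledExecutor();
        reaper.scheduleAtFixedRate(() -> {
            Instant cutoff = Instant.now().minus(idleTimeout);
            lastActivity.forEach((socket, last) -> {
                if (last.isBefore(cutoff)) {
                    try {
                        socket.close(); // closing releases a socket stuck in CLOSE_WAIT
                    } catch (IOException ignored) {
                    }
                    lastActivity.remove(socket);
                }
            });
        }, 30, 30, TimeUnit.SECONDS);
    }
}
```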
@anilsomisetty Interesting observation. The issue with the code in prometheus/client_java@1966186 is that the … The code in the latest release should behave the same way. Is it possible for you to build and test the main branch (unreleased) to see if you can reproduce the issue?
@dhoard I have built the main branch code of the jmx exporter, which uses client_java version v1.1.0, and I'm not able to reproduce the issue; the stuck CLOSE_WAIT connections are getting closed and cleared. In the latest code I see the ThreadPoolExecutor has been bounded by corePoolSize and maxPoolSize, so it is now bounded, am I right?
@anilsomisetty correct - the ThreadPoolExecutor is now bounded. Thanks for testing the code in main (unreleased)! Do you experience the issue with the latest release 0.20.0?
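For context on the corePoolSize/maxPoolSize point, here is a minimal self-contained sketch of attaching a bounded ThreadPoolExecutor to the JDK HttpServer; the port, pool sizes, and queue length are made-up values for illustration, not the exporter's actual configuration:

```java
import com.sun.net.httpserver.HttpServer;

import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedExecutorSketch {
    public static void main(String[] args) throws Exception {
        // Made-up sizes for illustration; the exporter's real values may differ.
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                1,                       // corePoolSize
                10,                      // maximumPoolSize
                120, TimeUnit.SECONDS,   // idle threads above the core size are reclaimed
                new LinkedBlockingQueue<>(50)); // bounded work queue

        HttpServer server = HttpServer.create(new InetSocketAddress(9404), 3);
        server.createContext("/metrics", exchange -> {
            byte[] body = "jvm_up 1.0\n".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        // Bounded: requests beyond thread + queue capacity are rejected instead of
        // growing the pool (and the number of open sockets) without limit.
        server.setExecutor(executor);
        server.start();
    }
}
```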
@dhoard yes, the issue persists in the latest release 0.20.0. Could you please tell me when a generally available official release is planned, if the beta is planned for the next 2 weeks?
Hi, we are facing the same issue. Our Java application is using: … Here are some of the stuck file descriptors: … We have around 20k of those and then need to restart the application. Is there an expected solution for this soon?
@anilsomisetty @liransliveperson I have updated the … Is it possible for you to build/test main (unreleased)?
@dhoard I'm using the main version; I have created a jar from it and I am using it. An Exception occurred while scraping metrics: java.lang.IllegalArgumentException: Duplicate labels in metric data: {brandId="10138221"} This issue does not happen with jmx_prometheus_javaagent-0.19.0.jar.
@liransliveperson interesting. The current code is … I would check your rules. The exception is stating that you are trying to add a duplicate label set - labels have to be unique within a metric.
@dhoard I have checked my rules and there are no duplicate rules. I used the same rules, once with the main version (which produces the exception) and then with jmx_prometheus_javaagent-0.19.0.jar (which does not produce the exception). The rules are exactly the same in both cases. There is some issue in your main code.
@liransliveperson This appears to be an exporter rule configuration issue that the older version did not surface. Can you provide the specific exporter rule and a list of attributes on the MBean?
@dhoard we have many rules and attributes in the MBean; it's hard to put them all in here.
@liransliveperson we don't have access to the underlying information in the JMX Exporter, but I feel we could add it to the exception when implementing prometheus/client_java#942
Here is a JUnit test that reproduces the issue: …
@dhoard is master ready to build with the detailed exception on duplicate labels?
@dhoard here are our server metrics: …
@liransliveperson I have disabled the test until the new version of the client library is released. If you want to test some code with the exception displaying more information...
The jar will be in …
@dhoard I wasn't able to build https://github.com/prometheus/client_java; it has a test error in: …
Hmm... https://app.circleci.com/pipelines/github/prometheus/client_java My test branch requires the …
Try to run ProcessMetricsTest locally.
What is your OS version? JDK version?
@dhoard I was able to deploy the test version with the detailed exception. The exception says: An Exception occurred while scraping metrics: io.prometheus.metrics.model.snapshots.DuplicateLabelsException: Duplicate labels for metric "metricsPerBrand_Acd_WaitingConvs_Value": {brandId="10138221"} Looking at our metrics, we have multiple metricsPerBrand_Acd_WaitingConvs_Value series for different brands, e.g. metricsPerBrand_Acd_WaitingConvs_Value{brandId="10138221",} 0.0, but only one for brandId 10138221: … This is the YAML rule: …
@liransliveperson Can you provide the full stacktrace?
EDIT:
Can you enable some debug logging by either defining … or … ? You should then get a debug print for every metric that is being added.
I was able to print all the labels for a metric.
And it looks like the jmx exporter sees the labels twice: An Exception occurred while scraping metrics: io.prometheus.metrics.model.snapshots.DuplicateLabelsException: Duplicate labels for metric "metricsPerBrand_BotAcd_SingleConv_DispatchTimeMs_Mean": {brandId="29422842"} All labels: {brandId="29422842"},{brandId="29422842"}. Although in JMX we have the brand 29422842 only once.
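As an aside, here is a minimal conceptual sketch (not client_java's code; the metric and label names are copied from the error above) of the constraint being violated: within a single metric, every data point must have a unique label set:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DuplicateLabelsSketch {
    public static void main(String[] args) {
        String metric = "metricsPerBrand_BotAcd_SingleConv_DispatchTimeMs_Mean";

        // Two data points that ended up with identical labels after rule rewriting.
        List<Map<String, String>> labelSets = List.of(
                Map.of("brandId", "29422842"),
                Map.of("brandId", "29422842"));

        Set<Map<String, String>> seen = new HashSet<>();
        for (Map<String, String> labels : labelSets) {
            if (!seen.add(labels)) { // add() returns false when the label set was already present
                throw new IllegalStateException(
                        "Duplicate labels for metric \"" + metric + "\": " + labels);
            }
        }
    }
}
```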
@liransliveperson can you enable the debug? (The output will go to wherever system out is captured.) Can you provide your complete YAML file?
Here is the YAML file: …
@liransliveperson can you provide the requested debug output?
@dhoard we found the issue in the rules that caused the duplicate values. We will proceed with the main version in order to solve the open-files issue.
@liransliveperson can you provide an update on the …?
@liransliveperson version 1.0.1 has been released, which should resolve the CLOSE_WAIT issue.
I am using Trino, where I am using the jmx exporter version 0.19 to expose my application metrics over an HTTP port for my Prometheus agent to collect.
I have my Prometheus agent configured to collect metrics from the exposed HTTP port with a specific timeout.
My issue: file descriptors are stuck in the CLOSE_WAIT state and their count keeps increasing; after the count reaches a certain limit, my application crashes and becomes unresponsive.
In my case, the connections opened between the jmx exporter and my Prometheus metric-collection agent get stuck in the CLOSE_WAIT state because my agent exits after its timeout when the exporter cannot provide the metrics within the specified time.
The exporter is stuck writing the metrics to the descriptor because the agent reading them on the other side has left; the write of the ByteArrayOutputStream is stuck and the file descriptor is not getting closed, at this line of code:
https://github.com/prometheus/client_java/blob/ed0d7ae3b57a3986f6531d1a37db031a331227e6/simpleclient_httpserver/src/main/java/io/prometheus/client/exporter/HTTPServer.java#L126
My agent collecting the metrics leaves after the specified timeout, but ideally the exporter should recognize that the requester has left, close the connection, and not stay stuck in the write call keeping the fd open.
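To make the TCP side of this concrete, here is a small standalone sketch (unrelated to the exporter's code) of how a server-side socket ends up in CLOSE_WAIT: the client closes its end, the server never calls close(), and the descriptor stays open until the process closes it or exits:

```java
import java.net.ServerSocket;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

public class CloseWaitSketch {
    public static void main(String[] args) throws Exception {
        List<Socket> leaked = new ArrayList<>();
        try (ServerSocket server = new ServerSocket(0)) {
            int port = server.getLocalPort();

            // Simulated "agent": connects, then gives up without waiting for a response.
            try (Socket client = new Socket("localhost", port)) {
                Socket accepted = server.accept();
                leaked.add(accepted);      // server keeps the fd and never closes it
            }                              // client closes here -> server side enters CLOSE_WAIT

            System.out.println("Leaked server-side sockets: " + leaked.size());
            Thread.sleep(60_000);          // keep the process alive so the state can be observed
        }
    }
}
```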
I understand that once the agent leaves after the timeout, the exporter cannot deliver the full metrics for that moment and the metrics for that timeframe are lost, which is fine for me.
Please help me with this.