Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

File descriptors opened by jmx exporter getting stuck in CLOSE_WAIT state #911

Closed
anilsomisetty opened this issue Feb 9, 2024 · 34 comments

Comments

@anilsomisetty
Copy link

anilsomisetty commented Feb 9, 2024

I am using trino where I am using the jmx exporter of version 19 to expose my application metrics over a http port for my prometheus agent to collect these metrics.

I have my prometheus agent configured to collect metrics from the exposed http port with a specific timeout.

My issue: File descriptors are stuck in CLOSE_WAIT state and their count keeps on increasing, after the file descriptor count reaches a specific limit my application is crashing and becoming unresponsive.

In my case what's happening is the file descriptor connections that has been opened between jmx exporter and my prometheus metric collection agent are getting stuck in CLOSE_WAIT state because my agent is exiting after the timeout and the exporter is not able to provide the metrics within specified timeout.

The exporter is stuck in writing metrics to the descriptor because on the other side agent reading the metrics has left and ByteArrayOutputStream is stuck in write call and the file descriptor is not getting closed at this line of code

https://github.com/prometheus/client_java/blob/ed0d7ae3b57a3986f6531d1a37db031a331227e6/simpleclient_httpserver/src/main/java/io/prometheus/client/exporter/HTTPServer.java#L126

My agent collecting the metrics leaves after specified timeout, but ideally the exporter should understand that the requestor has left and should close the connection and shouldn't get stuck in write call and shouldn't keep the fd open.

I understand that once the agent leaves after timeout and exporter will not be able to give it the entire metrics of that moment and the metrics will be lost at that timeframe which is fine for me.

Please help me with this.

@rwilliams-r7
Copy link

We are also having this problem. We have found it can be so bad it will stop the application running in this case a cluster of Trino instances. Do you have any time to work on this? We had to remove it from our production env.

@dhoard
Copy link
Collaborator

dhoard commented Feb 22, 2024

The core HTTP server code is in the Prometheus client_java library. It will have to expose configuration to allow setting the socket timeout. I still have concerns around the RMI socket timeouts.

prometheus-metrics-exporter-httpserver

Relevant class: https://github.com/prometheus/client_java/blob/main/prometheus-metrics-exporter-httpserver/src/main/java/io/prometheus/metrics/exporter/httpserver/HTTPServer.java

I create an issue for the client_java library (prometheus/client_java#924) Once added, the configuration can be exposed via the exporter configuration.

@anilsomisetty
Copy link
Author

I have tested my application with the latest changes available in main branch of the exporter which uses latest version of prometheus http server from client java repository and I see that when I'm using that my file descriptor connections stuck in CLOSE_WAIT state are getting cleared, could you please confirm what changes have been made that the fd's are getting cleared without adding the timeout ?
When would be the official release of the latest version of jmx exporter which is integrated with client_java version of v1.1.0 ?

@dhoard
Copy link
Collaborator

dhoard commented Feb 26, 2024

@anilsomisetty I am not aware of any direct changes in client_java v1.1.0 that would have changed the behavior regarding CLOSE_WAIT sockets.

We are targeting a beta release (due to the major changes made in client_java) of the main code in the next 2 weeks.

@anilsomisetty
Copy link
Author

@dhoard I see that due to the changes that have been made in this commit of client java code prometheus/client_java@1966186 after version 0.16.0 release the connections getting stuck in CLOSE_WAIT state are getting cleared after 2 minutes as they are in idle state.

I have tested the 0.19.0 version of jmx exporter integrated with 0.16.0 release of client java with this commit changes included and have seen that the close wait stuck connection are getting closed which was not the case without this commit changes.

@dhoard
Copy link
Collaborator

dhoard commented Feb 26, 2024

@anilsomisetty Interesting observation.

The issue with the code in prometheus/client_java@1966186 is that the ThreadPoolExecutor is unbounded.

The code in the latest release should behave the same way.

Is it possible for you build and test the main branch (unreleased) to see if you can reproduce the issue?

@anilsomisetty
Copy link
Author

anilsomisetty commented Feb 26, 2024

@dhoard I have build the main branch code of jmx exporter which is using client java version v1.1.0 and I'm not able to reproduce the issue, the stuck close wait connections are getting closed and are cleared.

In the latest code i see ThreadPoolExecutor has been bounded by corepoolsize and maxpoolsize so it makes this bounded am I right ?

@dhoard
Copy link
Collaborator

dhoard commented Feb 28, 2024

@anilsomisetty correct - the ThreadPoolExecutor is now bounded.

Thanks for testing the code in main (unreleased)!

Do you experience the issue with the latest release 0.20.0?

@anilsomisetty
Copy link
Author

anilsomisetty commented Feb 28, 2024

@dhoard yes, the issue persists in the latest release 0.20.0
As there is a default limit of max 10 threads set in this latest release when all 10 connections are getting stuck in close wait state no new connections are getting created.

Could you please tell me when is the plan to have a generally available official release if beta is planned in next 2 weeks ?

@dhoard dhoard self-assigned this Mar 8, 2024
@liransliveperson
Copy link

liransliveperson commented Apr 1, 2024

Hi, we are facing the same issue, our java application is using :
-javaagent:jmx_prometheus_javaagent-0.19.0.jar=14102
And we are getting aggregation of file descriptor files stuck in CLOSE_WAIT, we need to restart our application in production once every 2 weeks.

here are some of the file descryptor stuck :
java 19727 appuser *223u IPv4 1411652589 0t0 TCP ssor-jet135.lpdomain.com:14102->svor-mon28.lpdomain.com:59184 (CLOSE_WAIT)
java 19727 appuser *224u IPv4 1411912918 0t0 TCP ssor-jet135.lpdomain.com:14102->svor-mon28.lpdomain.com:56524 (CLOSE_WAIT)
java 19727 appuser *225u IPv4 1412503559 0t0 TCP ssor-jet135.lpdomain.com:14102->svor-mon28.lpdomain.com:53210 (CLOSE_WAIT)
java 19727 appuser *226u IPv4 1411689546 0t0 TCP ssor-jet135.lpdomain.com:14102->svor-mon28.lpdomain.com:34842 (CLOSE_WAIT)
java 19727 appuser *227u IPv4 1412056270 0t0 TCP ssor-jet135.lpdomain.com:14102->svor-mon28.lpdomain.com:55478 (CLOSE_WAIT)

We have around 20k of those and then need to restart the application.

Is there an expected solution for that soon ?

@dhoard
Copy link
Collaborator

dhoard commented Apr 1, 2024

@anilsomisetty @liransliveperson I have updated the main branch to use client_java 1.2.0, which has some fixes around closing the HttpExchange if there is an exception writing the metrics response to the client.

Is it possible for you to build/test main?

@liransliveperson
Copy link

liransliveperson commented Apr 2, 2024

@dhoard Im using the main version, I have created a jar from it and using it.
When trying to load the metrics I'm getting :

An Exception occurred while scraping metrics: java.lang.IllegalArgumentException: Duplicate labels in metric data: {brandId="10138221"}
at io.prometheus.metrics.model.snapshots.MetricSnapshot.validateLabels(MetricSnapshot.java:46)
at io.prometheus.metrics.model.snapshots.MetricSnapshot.(MetricSnapshot.java:33)
at io.prometheus.metrics.model.snapshots.UnknownSnapshot.(UnknownSnapshot.java:21)
at io.prometheus.metrics.model.snapshots.UnknownSnapshot$Builder.build(UnknownSnapshot.java:130)
at io.prometheus.jmx.JmxCollector.collect(JmxCollector.java:836)
at io.prometheus.metrics.model.registry.MultiCollector.collect(MultiCollector.java:26)
at io.prometheus.metrics.model.registry.PrometheusRegistry.scrape(PrometheusRegistry.java:72)
at io.prometheus.metrics.exporter.common.PrometheusScrapeHandler.scrape(PrometheusScrapeHandler.java:112)
at io.prometheus.metrics.exporter.common.PrometheusScrapeHandler.handleRequest(PrometheusScrapeHandler.java:53)
at io.prometheus.metrics.exporter.httpserver.MetricsHandler.handle(MetricsHandler.java:43)
at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:79)
at sun.net.httpserver.AuthFilter.doFilter(AuthFilter.java:83)
at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:82)
at sun.net.httpserver.ServerImpl$Exchange$LinkHandler.handle(ServerImpl.java:675)
at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:79)
at sun.net.httpserver.ServerImpl$Exchange.run(ServerImpl.java:647)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

This issue does not happening with jmx_prometheus_javaagent-0.19.0.jar
Could you please advise ?

@dhoard
Copy link
Collaborator

dhoard commented Apr 4, 2024

@liransliveperson interesting. The current code is main which uses client_java 1.2.0, which is completely new.

I would check your rules. The exception is stating that you are trying to add a duplicate label - label names have to be unique.

@liransliveperson
Copy link

@dhoard I have check my rules, there are no duplicate rules, I have used the same rules, one with the main version here-> that produces exception, Then with the jmx_prometheus_javaagent-0.19.0.jar that not produces the exception. The rules are exactly the same in both cases. there is some issue in your main code.

@dhoard
Copy link
Collaborator

dhoard commented Apr 5, 2024

@liransliveperson This appears to be an exporter rule configuration issue that the older version of client_java allowed (but was invalid.) An exporter rule must produce a unique metric (name + labels.) This can occur if a rule uses a subset of values as labels.

Can you provide the specific exporter rule and a list of attributes on the MBean?

@liransliveperson
Copy link

@dhoard we have many rules and attributes in the MBean its hard to put it all in here.
Is there an option to add a code in your project where we will able to know which datapoints are the duplicates?
That will help us to find the problem

@dhoard
Copy link
Collaborator

dhoard commented Apr 5, 2024

@liransliveperson we don't have access to the underlying information in the JMX Exporter, but I feel we could add it to the exception when implementing prometheus/client_java#942

@dhoard
Copy link
Collaborator

dhoard commented Apr 5, 2024

Here is a JUnit test that reproduces the issue:

package io.prometheus.jmx;

import io.prometheus.metrics.model.snapshots.Labels;
import io.prometheus.metrics.model.snapshots.MetricSnapshots;
import io.prometheus.metrics.model.snapshots.UnknownSnapshot;
import org.junit.Before;
import org.junit.Test;

import java.util.HashMap;
import java.util.Map;

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertThrows;

public class DuplicateLabels {

    Map<String, UnknownSnapshot.Builder> unknownMap;

    @Before
    public void setUp() {
        unknownMap = new HashMap<>();
    }

    @Test
    public void testDuplicateLabels() {
        UnknownSnapshot.Builder unknownBuilder =
                unknownMap.computeIfAbsent(
                        "test",
                        name ->
                                UnknownSnapshot.builder()
                                        .name("test_metric")
                                        .help("test_metric help"));
        unknownBuilder.dataPoint(
                UnknownSnapshot.UnknownDataPointSnapshot.builder()
                        .value(1.12345678)
                        .labels(
                                Labels.of(
                                        "label1", "value1"))
                        .build());


        unknownMap.put("test", unknownBuilder);

        unknownBuilder =
                unknownMap.computeIfAbsent(
                        "test",
                        name ->
                                UnknownSnapshot.builder()
                                        .name("test_metric")
                                        .help("test_metric help"));
        unknownBuilder.dataPoint(
                UnknownSnapshot.UnknownDataPointSnapshot.builder()
                        .value(2.2468)
                        .labels(
                                Labels.of(
                                        "label1", "value1"))
                        .build());

        unknownMap.put("test2", unknownBuilder);

        MetricSnapshots.Builder result = MetricSnapshots.builder();

        Exception exception = assertThrows(IllegalArgumentException.class, () -> {
            for (UnknownSnapshot.Builder unknown : unknownMap.values()) {
                UnknownSnapshot unknownSnapshot = unknown.build();
                result.metricSnapshot(unknownSnapshot);
            }
        });

        assertEquals("Duplicate labels in metric data: {label1=\"value1\"}", exception.getMessage());
    }
}

@liransliveperson
Copy link

@dhoard is the master ready for build with the detailed exception on duplicate labels ?
Im trying to build the project but in class DuplicateLabels it cant find :
import io.prometheus.metrics.model.snapshots.DuplicateLabelsException;

@liransliveperson
Copy link

liransliveperson commented Apr 9, 2024

@dhoard here is our servers metrics :
duplicate_lables.xlsx
I want able to find any duplicate labels, I put that in excel sheet, every row has the row metric label, then went to 'Conditional Formatting' -> 'Highlight cell rules' -> 'Duplicate values' And I cand find any duplicate

@dhoard
Copy link
Collaborator

dhoard commented Apr 9, 2024

@liransliveperson I have disabled the test. DuplicateLabelsException will be in a new version of client_java, which hasn't been published yet.

Until the new version of client_java is published, the exporter code doesn't have the information to provide further information.

If you want to test it some code with the exception displaying more information...

  1. clone https://github.com/prometheus/client_java and build/install it (./mvnw clean install)
  2. clone my private branch https://github.com/dhoard/jmx_exporter/tree/test-branch and build it (./mvnw clean package)

The jar will be in ./jmx_prometheus_javaagent/target/jmx_prometheus_javaagent-1.0.0-BETA.jar

@liransliveperson
Copy link

@dhoard I wasnt able to build https://github.com/prometheus/client_java it has test error in :
testGoodCase(ProcessMetricsTest.java:76)

@dhoard
Copy link
Collaborator

dhoard commented Apr 9, 2024

Hmm... client_java is showing builds passing.

https://app.circleci.com/pipelines/github/prometheus/client_java

My test branch requires the client_java libraries built off of main. You will have to resolve the client_java build issues first.

@liransliveperson
Copy link

Try to run localy ProcessMetricsTest

@dhoard
Copy link
Collaborator

dhoard commented Apr 9, 2024

What is your OS version? JDK version?

@liransliveperson
Copy link

liransliveperson commented Apr 9, 2024

@dhoard I was able to deploy the test version with the detailed exception, The exception says :

An Exception occurred while scraping metrics: io.prometheus.metrics.model.snapshots.DuplicateLabelsException: Duplicate labels for metric "metricsPerBrand_Acd_WaitingConvs_Value": {brandId="10138221"}
at io.prometheus.metrics.model.snapshots.MetricSnapshot.validateLabels(MetricSnapshot.java:46)
at io.prometheus.metrics.model.snapshots.MetricSnapshot.(MetricSnapshot.java:33)

Looking on our metrics we have multiple "metricsPerBrand_Acd_WaitingConvs_Value" for different brands :

metricsPerBrand_Acd_WaitingConvs_Value{brandId="10138221",} 0.0
metricsPerBrand_Acd_WaitingConvs_Value{brandId="87947798",} 0.0
metricsPerBrand_Acd_WaitingConvs_Value{brandId="86122336",} 0.0
metricsPerBrand_Acd_WaitingConvs_Value{brandId="12660041",} 0.0

But only one for brand Id 10138221 :
metricsPerBrand_Acd_WaitingConvs_Value{brandId="10138221",}

This is the Yaml rule:

  • pattern: 'metricsPerBrand<0=(\w+), 1=Routing, 2=Acd, 3=(AgentAvailableCapacity|WaitingConvs|UnassignedConvs|waitingConvsNotIncludesBotsOnly|waitingConvsBotsOnly|waitingConvsFromSkillBasedMapTotal|waitingConvsFromSkillBasedMapNotIncludingBotSkillTotal|unassignedConvsFromSkillBasedMapTotal)><>(\w+)'
    name: metricsPerBrand_Acd_$2_Value
    labels:
    brandId: $1

@dhoard
Copy link
Collaborator

dhoard commented Apr 10, 2024

@liransliveperson Can you provide the full stacktrace?

EDIT:

I can add some test debug in my branch.

Can you enable some debugging logging by either definiing...

  1. an environment variable JMX_PROMETHEUS_EXPORTER_DEVELOPER_DEBUG=true

or

  1. a JVM system property jmx.prometheus.exporter.developer.debug=true

... and should get a debug print for ever metric that's being added.

            // Add to samples.
            LOGGER.log(
                    FINE,
                    "add metric sample: %s %s %s %s",
                    matchedRule.name,
                    matchedRule.labelNames,
                    matchedRule.labelValues,
                    value.doubleValue());

@liransliveperson
Copy link

I was able to print all the labels for a metric.
In the jmx_prometheus_exporter.yaml I have only those rules :

  • pattern: 'metricsPerBrand<0=(\w+), 1=BotAcd, 2=singleConversationDispatch><>Mean'
    name: metricsPerBrand_BotAcd_SingleConv_DispatchTimeMs_Mean
    labels:
    brandId: $1

    • pattern: 'metricsForGraphite<0=BotAcd, 1=singleConversationDispatch><>(99thPercentile|95thPercentile|Count|Mean)'

And looks like the jmx exporter sees labels :

An Exception occurred while scraping metrics: io.prometheus.metrics.model.snapshots.DuplicateLabelsException: Duplicate labels for metric "metricsPerBrand_BotAcd_SingleConv_DispatchTimeMs_Mean": {brandId="29422842"} All Labels : {brandId="29422842"},{brandId="29422842"},.

Although in the JMX we have only once the brand 29422842.

@dhoard
Copy link
Collaborator

dhoard commented Apr 11, 2024

@liransliveperson can you enable the debug? (the output will be to where ever system out is captured) can you provide your complete YAML file?

@liransliveperson
Copy link

Here is the Yaml file
jmx_prometheus_exporter_yaml.txt

@dhoard
Copy link
Collaborator

dhoard commented Apr 12, 2024

@liransliveperson can you provide the requested debug output?

@liransliveperson
Copy link

@dhoard we found the issue in the rules that caused the duplicate values, we will proceed with the main version inorder to solve the open files issue.

@dhoard
Copy link
Collaborator

dhoard commented Apr 19, 2024

@liransliveperson can you provide an update on the CLOSE_WAIT socket issue you were experiencing? Does it still occur using code the code in main?

@dhoard
Copy link
Collaborator

dhoard commented May 31, 2024

@liransliveperson version 1.0.1 has been released which should resolve the CLOSE_WAIT socket issue.

@dhoard dhoard closed this as completed May 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants