
Too many open files #22

Open
jnehring opened this issue Apr 26, 2016 · 33 comments

@jnehring
Member

FREME live becomes unstable after running for a while with this error ("Zu viele offene Dateien" is German for "too many open files"):

ERROR   2016-04-26 08:42:24,909 [http-nio-8004-Acceptor-0] org.apache.tomcat.util.net.NioEndpoint  - Socket accept failed
java.io.IOException: Zu viele offene Dateien
        at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
        at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
        at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
        at org.apache.tomcat.util.net.NioEndpoint$Acceptor.run(NioEndpoint.java:686)
        at java.lang.Thread.run(Thread.java:745)

It happened two weeks ago and now it has happened again. The problem can be worked around by restarting freme-live.

I backed up the log files (around 500 MB of error logs) in ~/logs-open-files on freme-live. I guess the problem is a file handle or stream that is not closed properly. Once this happens around 20,000 times, freme-live fails. It is probably caused by the WRIPL requests, because no one else sends that many requests to freme-live in two weeks.

@jnehring
Member Author

jnehring commented May 9, 2016

The bug happened again. We should fix it soon.

jnehring added a commit to freme-project/basic-services that referenced this issue May 11, 2016
@jnehring
Member Author

jnehring commented May 11, 2016

I can test the issue by starting FREME on ubuntu and using this command to count the number of open file handles:

lsof -a -p 3919 | wc -l

where 3919 is the process id of freme.

The number of file handles increases with each API call that uses e-Internationalisation. I found and fixed two streams that were not closed properly, reducing the number of streams newly opened per API call from 3 to 1.

Next step: Find the last stream that is not closed properly.
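Closing streams reliably is easiest with try-with-resources, which closes the handle even when an exception is thrown mid-read. A minimal, self-contained sketch of the pattern (the class and file names are illustrative, not actual FREME code):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class StreamCloseDemo {
    // Reads a file; the stream is closed automatically, even on exceptions,
    // so no file handle is leaked per call.
    static String readAll(Path path) throws IOException {
        try (InputStream in = Files.newInputStream(path)) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } // in.close() is called here in all cases
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("freme-demo", ".txt");
        Files.write(tmp, "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(tmp)); // prints "hello"
        Files.delete(tmp);
    }
}
```

Counting with `lsof` before and after a batch of API calls, as above, shows whether a given code path still leaks.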

@ArneBinder
Contributor

ArneBinder commented May 26, 2016

This should be fixed. @jnehring, could you check this on freme-dev?
This does not solve freme-project/basic-services#32.

@ArneBinder ArneBinder reopened this May 26, 2016
@jnehring
Member Author

I tested it. The problem is fixed. Great!

@jnehring
Member Author

jnehring commented Aug 2, 2016

The bug happened again today.

@jnehring
Member Author

jnehring commented Aug 3, 2016

The bug happened again and I could solve it again by restarting the server. I could not find the reason for the problem; I cannot even reproduce it locally. Next step to find the bug: I want to run the broker on my local VM and process a larger dataset (e.g. 1,000 documents) with it, using the same pipeline that WRIPL uses.

@jnehring
Member Author

jnehring commented Aug 9, 2016

I analysed the content going to FREME. When putting the following content in /etc/modsecurity/modsecurity.conf, all content sent to FREME is logged in /var/log/apache2/modsec_audit.log:

SecRuleEngine On
SecAuditEngine On
SecAuditLog /var/log/apache2/modsec_audit.log
SecRequestBodyAccess on
SecAuditLogParts ABIJDFHZ

Through this logging I found out that some special characters lead to crashes in e-Link. Processing a pipeline with the following content leaves 10 open file handles. It was executed on a FREME test installation on freme-live.

[{"method":"POST","endpoint":"http:\/\/api.freme-project.eu:5000\/e-entity\/freme-ner\/documents","parameters":{"language":"en","dataset":"dbpedia"},"headers":{"content-type":"text\/plain","accept":"text\/turtle"},"body":"close Now Pinned to My Dublin Map.Register Now to save and share your pins. Explore 0 Login \/ Register Search OFF 0 close See & Do Insider Tips My Dublin Map What's On Login \/ Register About Us Contact Us SEARCH See & Do Free Events What's On Casement Evening Lecture Series: In Conversation: Chris Clarke with Alan Phelan Free Events Inspired by the notion behind Rising leader Roger Casements legendary words, Our Kind, a short film created by DCU graduate and Irish artist Alan Phelan, takes creative liberty with the familiar narrative and political history of the Rising, imagining a world in which Casement had instead escaped to exile in Norway. Join curator and art critic Chris Clarke for this fascinating conversation with Phelan, which will seek to uncover how the script was approached and developed, and the historical context from which he drew. Bring an open mind and prepare to discover what lessons we can learn from an alternate take on history. 
Admission Free Event Dates Pin To Map address Charlemont House, Parnell Square North Phone +353(0)12225550 Web hughlane.ie Email [email protected] LOCATION SEE WHAT'S NEARBY Special Offers Shopping Tours Visitor Attractions Activities Accommodation Eat, Drink, Nightlife What's On Close Filters Apply Filters EVENT DATES Thu, 1 September 2016 18:30 - 19:30 Events Seasonal Bram Stoker Festival 28 Oct 2016 - 31 Oct 2016 Free Events Simon Fujiwara: The Humanizer 20 May 2016 - 28 Aug 2016 Seasonal Historic Dn Laoghaire 14 Jul 2016 - 01 Sep 2016 Free Events Angela Morgan 01 Sep 2016 - 29 Sep 2016 Free Events Breaking Rainbows: Orla Barry 29 Sep 2016 - 05 Nov 2016 Insider Tips READ MORE More Than A Stage: Olympia Theatre READ MORE Hooked on Howth READ MORE Insider Tips from Dublin's Top Concierges READ MORE 20 Reasons to #LoveDublin READ MORE More Than A Stage: Olympia Theatre What's On My Dublin Map Insider Tips See & Do 2016 Visit Dublin About Us Terms & Conditions Privacy & Cookies Dublin Pass Sightseeing Card Dublin Convention Bureau Contact Us Newsletter Site Map"},{"method":"POST","endpoint":"http:\/\/api.freme-project.eu:5000\/e-link\/documents\/","parameters":{"templateid":"1"},"headers":{"content-type":"text\/turtle","accept":"application\/json+ld"}}]

Next steps:

  • Try to reproduce the error somewhere other than the freme-live server
  • Find the place that leaves the file handle open and fix it. If this works out, do a FREME release with the bugfix.
  • Open a bug issue to deal with these problems. I found a range of API requests that fail due to special characters in e-Link.
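For the record, charset mangling of the kind suspected here is easy to reproduce: bytes written as UTF-8 but decoded as ISO-8859-1 turn each non-ASCII character into two garbage characters. A minimal sketch (the sample string is illustrative, not taken verbatim from the failing request):

```java
import java.nio.charset.StandardCharsets;

public class CharsetMangleDemo {
    // Decodes UTF-8 bytes with the wrong charset, as a misconfigured
    // proxy in front of the API might do.
    static String mangle(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "D\u00fan Laoghaire"; // "Dún Laoghaire"
        System.out.println(mangle(original));   // prints "DÃºn Laoghaire"
    }
}
```

Pure-ASCII input passes through unchanged, which is why only requests containing special characters trigger the failure.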

@ghsnd
Contributor

ghsnd commented Aug 10, 2016

Maybe you can check whether the open files are connections (sockets), and what those connections are, using netstat or ss?

@jnehring
Member Author

The problem no longer occurs on freme-live. We have an Apache web server on freme-live that proxies requests to our API. This proxy was not configured properly: the module mod_xml2enc was not installed, so special characters got mangled. These special characters caused the open streams. I installed mod_xml2enc and now the number of open streams is constant.
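For reference, a sketch of the relevant Apache configuration (the module paths are assumptions and vary by distribution; on Debian/Ubuntu, `a2enmod xml2enc proxy_html` achieves the same thing):

```apache
# Load the encoding-conversion module used together with mod_proxy_html;
# without it, proxied non-ASCII content can get mangled.
LoadModule xml2enc_module     /usr/lib/apache2/modules/mod_xml2enc.so
LoadModule proxy_html_module  /usr/lib/apache2/modules/mod_proxy_html.so
```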

I leave this issue open because there is still a problem in FREME which we should find. But hopefully the broker will not crash all the time now.

@jnehring
Member Author

The bug still occurs.

@x-fran
Member

x-fran commented Aug 17, 2016

What OS is your server running?
Who is the owner/group of the files of the FREME NER application?
Please run "ulimit -a" on the command line as root and send me the output.

@ghsnd
Contributor

ghsnd commented Aug 18, 2016

And can you dump the heap of the Java process(es) when there are many open files, like so:

jmap -dump:format=b,file=<filename>.bin <pid>

where <pid> is the id of the java process.

Then, opening the file(s) in VisualVM or jvisualvm (part of the JDK) might reveal instances of streams, files, sockets, or whatever causes the problem, and where they are initialised.

If you can do this, compress the dump(s) (xz works well here) and send it/them to me or put them temporarily on Google Drive.

@jnehring
Member Author

@x-fran I sent you the data via email.

@x-fran
Member

x-fran commented Aug 18, 2016

@jnehring answered via email. Please let me know of any issues or improvements.

@ghsnd
Contributor

ghsnd commented Aug 18, 2016

It seems that there are 98 com.hp.hpl.jena.sparql.engine.http.HttpQuery instances. These contain SPARQL queries posed to a service on port 8890/sparql. Each instance initialises an org.apache.http.impl.client.DecompressingHttpClient. Each such instance initialises an org.apache.http.impl.conn.PoolingClientConnectionManager. Each pool can have a number of connections (10, I think). In any case, this causes more than 800 connections that are done (EOF) but not closed.

If this causes the problem (not sure!), then my guess is that somewhere in e-Link a Jena model, query or iterator is not closed properly.
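The suspected pattern — a per-query client that is allocated but never closed — accumulates quickly. A self-contained sketch with a counting stand-in for Jena's QueryExecution (the names are illustrative, not the actual e-Link code):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class QueryLeakDemo {
    static final AtomicInteger OPEN = new AtomicInteger();

    // Stand-in for a Jena QueryExecution: it holds an HTTP client whose
    // pooled connections are only released by close().
    static class FakeQueryExecution implements AutoCloseable {
        FakeQueryExecution() { OPEN.incrementAndGet(); }
        String execConstruct() { return "model"; }
        @Override public void close() { OPEN.decrementAndGet(); }
    }

    static void leakyQuery() { new FakeQueryExecution().execConstruct(); } // never closed

    static void safeQuery() {
        try (FakeQueryExecution qe = new FakeQueryExecution()) {
            qe.execConstruct(); // closed even if this throws
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 98; i++) leakyQuery(); // 98 leaked, like the heap dump
        safeQuery();                               // closed properly
        System.out.println("still open: " + OPEN.get()); // prints "still open: 98"
    }
}
```

With 10 pooled connections per leaked client, 98 leaked instances would account for the several hundred open-but-unclosed connections seen in the dump.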

@ghsnd
Contributor

ghsnd commented Aug 18, 2016

It's mostly about CONSTRUCT queries.

@ghsnd
Contributor

ghsnd commented Aug 18, 2016

I pushed some changes to eu.freme.eservices.elink.api.DataEnricher, where some resources were not (always) closed. The most important one is on line 92.

@jnehring You will have to trigger a Jenkins build, because I seem to have credential problems (see the console output).

@ghsnd
Contributor

ghsnd commented Aug 18, 2016

Remark: I didn't test! Hope I didn't close too much or too early :)

@jnehring
Member Author

I manually triggered the rebuild and it rebuilt the project. Tested e-Link; it still works. I will do the release and then we can see whether it fixes the bug.

@jnehring
Member Author

I have installed the new version of e-Link that @ghsnd created on freme-live during freme-project/technical-discussion#122.

Let's hope that this solves the problem.

@jnehring
Member Author

The bug is still there, but it is not as problematic as before. Since Friday FREME has been used a lot, and now there are 2,467 open files. I cannot easily determine the exact number of requests, but under last weekend's load FREME would previously have produced more open files and crashed.

@ghsnd
Contributor

ghsnd commented Aug 29, 2016

It could be that there are still other resources not being closed somewhere else in the code. If you can make a heap dump again, I will take a look.
Another explanation is that the bug was not fixed, but that it is not triggered as often as before.

Or... all partners double-check some part of the code base for resources that are not closed ;)

@jnehring
Member Author

I restarted FREME live so it does not crash, so all open files are gone. I will do another heap dump, but I need to wait a couple of days until there are some open files again.

jnehring added a commit to freme-project/basic-services that referenced this issue Aug 30, 2016
@jnehring
Member Author

I found one more stream that is not closed properly in the SPARQL converters controller and added a finally statement to it. It is a stream that writes a SPARQL SELECT query's result set to CSV, XML or JSON format.

jnehring added a commit to freme-project/basic-services that referenced this issue Aug 30, 2016
@ghsnd
Contributor

ghsnd commented Sep 1, 2016

Closed some resources here.

@jnehring
Member Author

jnehring commented Sep 1, 2016

Thanks @ghsnd. I wanted to create the heap dump today, but the too-many-open-files bug happened again, and while it is happening I cannot create the heap dump. Will try again...

@jnehring
Member Author

jnehring commented Sep 6, 2016

I created a new memory dump in a state where the system had 3,506 open files and was close to crashing. However, the latest open-stream fixes that @ghsnd and I made here are still not installed on freme-live. So next week, after the Semantics, I want to do a release and install the latest state on freme-live, because maybe the bug has already been fixed.

@jnehring
Member Author

I added a script to the crontab of freme-live that restarts freme-live every midnight as a temporary workaround for the problem.
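For reference, the nightly restart can be expressed as a crontab entry like this (the script path is hypothetical; the actual script on freme-live may differ):

```
# m h dom mon dow  command — restart FREME every day at midnight
0 0 * * * /home/freme/bin/restart-freme.sh
```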

@jnehring
Member Author

I installed a new version of the broker and FREME NER and switched off the cronjob that restarts FREME. Let's keep our fingers crossed that this solves the issue.

@jnehring
Member Author

jnehring commented Sep 27, 2016

About 20 hours after the release the number of open files was 2,300. So this problem is still not fixed. I switched the script that restarts FREME every midnight back on.

@ghsnd
Contributor

ghsnd commented Sep 28, 2016

Today I found this, since it's the number of 'pipes' that increases... To be checked soon.

@ghsnd
Contributor

ghsnd commented Sep 28, 2016

A dump of the threads reveals a lot of threads of this kind (and their number grows over time):

"I/O dispatcher 480" #639 prio=5 os_prio=0 tid=0x00007fa9b40f7800 nid=0x634b runnable [0x00007fa92ced1000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <0x000000076f614530> (a sun.nio.ch.Util$2)
        - locked <0x000000076f614520> (a java.util.Collections$UnmodifiableSet)
        - locked <0x000000076f6143e8> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:256)
        at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:105)
        at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:584)
        at java.lang.Thread.run(Thread.java:745)

This doesn't say much, but AbstractMultiworkerIOReactor is used in CloseableHttpAsyncClientBase, which is used in the Spring MVC framework and in the Spring HTTP client. To be investigated further...

@jnehring
Member Author

I read this article about JDBC database connections in high-concurrency scenarios and therefore added this configuration to freme-live. Maybe the open connections are database connections. Now there is a maximum number of open database connections, and abandoned database connections are removed after 10 seconds.

spring.datasource.max-active=200
spring.datasource.remove-abandoned=true
spring.datasource.remove-abandoned-timeout=10
spring.datasource.log-abandoned=true
