Many CLOSE_WAIT connections #2676

Open · cuihtlauac opened this issue Sep 3, 2024 · 13 comments
Labels: bug Something isn't working

@cuihtlauac (Collaborator)

Recently, the server has been getting slower and slower until it becomes unresponsive. This only affects package documentation pages.

Logging into the server and into the Docker container shows an apparently ever-increasing number of TCP connections in the CLOSE_WAIT state:

/ # netstat -tn | grep CLOSE_WAIT | wc
     3611     21666    292491
/ # netstat -tn | grep CLOSE_WAIT | wc
     3622     21732    293382
/ # netstat -tn | grep CLOSE_WAIT | wc
     3623     21738    293463
/ # netstat -tn | grep CLOSE_WAIT | wc
     3628     21768    293868
/ # netstat -tn | grep CLOSE_WAIT | wc
     3621     21726    293301
/ # netstat -tn | grep CLOSE_WAIT | wc
     3872     23232    313632

As a workaround, rebooting the machine puts the server back into a working state, but daily reboots are needed.
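
For reference, CLOSE_WAIT means the remote peer has closed its end of the connection while the local process has not yet closed its socket, so a steadily growing count suggests the server process is leaking file descriptors. As a minimal sketch (not part of the project), assuming a Linux /proc filesystem inside the container, the same count could be taken from within the OCaml process itself; `close_wait_count` is a hypothetical helper, and state code 08 is CLOSE_WAIT in /proc/net/tcp:

```ocaml
(* Sketch: count sockets in the CLOSE_WAIT state (state code 08) by reading
   /proc/net/tcp and /proc/net/tcp6, using only the OCaml standard library.
   This mirrors what `netstat -tn | grep CLOSE_WAIT | wc -l` reports. *)
let close_wait_count () =
  let count_file path =
    if not (Sys.file_exists path) then 0
    else begin
      let ic = open_in path in
      let n = ref 0 in
      (try
         ignore (input_line ic);  (* skip the header line *)
         while true do
           let fields =
             input_line ic
             |> String.split_on_char ' '
             |> List.filter (fun s -> s <> "")
           in
           match fields with
           | _sl :: _local :: _remote :: "08" :: _ -> incr n
           | _ -> ()
         done
       with End_of_file -> ());
      close_in ic;
      !n
    end
  in
  count_file "/proc/net/tcp" + count_file "/proc/net/tcp6"
```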

cuihtlauac added the bug label on Sep 3, 2024
github-project-automation bot moved this to 📋 Backlog in OCaml.org on Sep 3, 2024

@mtelvers (Contributor) commented Sep 3, 2024

Have you tried restarting the container rather than the whole machine?

@cuihtlauac (Collaborator, Author)

> Have you tried restarting the container rather than the whole machine?

No, I plan to try that next time.

@cuihtlauac (Collaborator, Author)

@aantron: Any views on this? How could OS changes affect Dream's networking layer and cause this?

@aantron (Contributor) commented Sep 3, 2024

@cuihtlauac Which OS changes are you referring to?

Definitely need to look into this regardless.

@cuihtlauac (Collaborator, Author) commented Sep 3, 2024

Inside its container, the server runs on Alpine 3.20.2; the last update to this came from PR #2609.
The container runs on Ubuntu 22.04.4.
The kernel is 5.15.0-119-generic.

@aantron (Contributor) commented Sep 3, 2024

That looks like a Docker image change. At least in terms of image names, I see that it was ocaml/opam:alpine-3.20-ocaml-4.14 before and is now ocaml/opam:alpine-3.20-ocaml-5.2. I don't know the totality of the changes between the images, but the names suggest that the relevant difference is running on OCaml 5.2 vs. 4.14, rather than an OS change. Are you certain that this was not observed before, on the 4.14 images?

@aantron (Contributor) commented Sep 3, 2024

If not, a practical solution for now might be to move back to the 4.14 images. That PR doesn't require using 5.2 for OCaml.org, it only allows it, and we switched to it eagerly -- which might be too aggressive if there is some flaw in OCaml 5.2 itself, in how Dream or its dependencies interact with the 5.2 runtime, or elsewhere in the OCaml 5.2 image.

@aantron (Contributor) commented Sep 3, 2024

It's also possible that some of the other changes in that PR are causing this somehow: opam-repository is pinned to a newer commit (there might be a fresh bug in an upstream library), or the branch of River that I proposed could somehow be responsible (which seems highly unlikely, but I've done nothing yet to rule it out empirically). I would start by running the current main of OCaml.org on 4.14 containers -- that is, all the changes from that PR and everything since, but with 4.14 containers -- and if you no longer observe the issue, we can start debugging from there. In general, I would gradually bisect this over the course of a few days if there is any reason to doubt that something in that PR specifically is causing it.

@cuihtlauac (Collaborator, Author)

This morning, some CLOSE_WAIT connections appeared. They are gone now. Looking at the logs, I found this:

04.09.24 07:18:49.420                  INFO REQ 20312 Found documentation page for https://docs-data.ocaml.org/live/p/octez-libs/20.1/doc/Epoxy_tx/Tx_rollup/MerklePV/V/argument-1-L/Anemoi/index.html.json
04.09.24 07:18:49.420                  INFO REQ 20312 GET https://docs-data.ocaml.org/live/p/octez-libs/20.1/package.json
04.09.24 07:18:49.739    dream.logger  WARN REQ 20311 Aborted by: Failure(": count_while1")
04.09.24 07:18:49.739    dream.logger  WARN Raised at Stdlib.failwith in file "stdlib.ml", line 29, characters 17-33
04.09.24 07:18:49.739    dream.logger  WARN Called from Dream_encoding__Accept.encodings in file "lib/accept.ml" (inlined), line 101, characters 16-48
04.09.24 07:18:49.739    dream.logger  WARN Called from Dream_encoding.accepted_encodings_with_weights in file "lib/dream_encoding.ml", line 113, characters 22-47
04.09.24 07:18:49.739    dream.logger  WARN Called from Dream_encoding.accepted_encodings in file "lib/dream_encoding.ml", line 128, characters 8-47
04.09.24 07:18:49.739    dream.logger  WARN Called from Dream_encoding.preferred_content_encoding in file "lib/dream_encoding.ml", line 133, characters 8-34
04.09.24 07:18:49.739    dream.logger  WARN Called from Dream_encoding.compress.(fun) in file "lib/dream_encoding.ml", line 159, characters 28-58
04.09.24 07:18:49.739    dream.logger  WARN Called from Lwt.Sequential_composition.backtrace_bind.create_result_promise_and_callback_if_deferred.callback in file "src/core/lwt.ml", line 1910, characters 16-19
04.09.24 07:18:49.739    dream.logger  WARN Re-raised at Dream_encoding.compress.(fun) in file "lib/dream_encoding.ml", lines 158-166, characters 2-62
04.09.24 07:18:49.739      dream.http ERROR REQ 20311 Failure(": count_while1")
04.09.24 07:18:49.739      dream.http ERROR REQ 20311 Raised at Stdlib__Map.Make.find in file "map.ml", line 141, characters 10-25
04.09.24 07:18:49.739      dream.http ERROR REQ 20311 Called from Logs.Tag.find in file "src/logs.ml", line 154, characters 14-32
04.09.24 07:18:49.739    dream.logger  INFO REQ 20313 GET /u/f7149bb03f3ecc0a5a88b0a0109946e2/async_rpc_kernel/v0.11.0/doc/Async_rpc_kernel/Versioned_rpc/Both_convert/One_way/index.html 10.0.1.3:38920 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
04.09.24 07:18:49.739                  INFO REQ 20313 GET https://docs-data.ocaml.org/live/u/f7149bb03f3ecc0a5a88b0a0109946e2/async_rpc_kernel/v0.11.0/doc/Async_rpc_kernel/Versioned_rpc/Both_convert/One_way/index.html.json
04.09.24 07:18:49.739    dream.logger  INFO REQ 20314 GET /p/aws-ec2/1.2/doc/Aws_ec2/Types/InstanceBlockDeviceMappingSpecificationList/index.html 10.0.1.3:52982 Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0
04.09.24 07:18:49.740                  INFO REQ 20314 GET https://docs-data.ocaml.org/live/p/aws-ec2/1.2/doc/Aws_ec2/Types/InstanceBlockDeviceMappingSpecificationList/index.html.json
04.09.24 07:18:49.740    dream.logger  INFO REQ 20315 GET /u/9a0a1ffcbe734aeb883defc48bedb76d/core_kernel/v0.11.1/doc/Core_kernel/Date/Map/index.html 10.0.1.3:52996 facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
04.09.24 07:18:49.740                  INFO REQ 20315 GET https://docs-data.ocaml.org/live/u/9a0a1ffcbe734aeb883defc48bedb76d/core_kernel/v0.11.1/doc/Core_kernel/Date/Map/index.html.json
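
The backtrace above suggests that the `count_while1` failure is raised while Dream_encoding parses the request's Accept-Encoding header, which then aborts that request. As a hedged stopgap sketch (not the project's actual code; `safe_compress` is a hypothetical name), the compression middleware could be wrapped so that such a failure becomes a 400 response instead of an aborted request:

```ocaml
(* Sketch: wrap Dream_encoding.compress so a Failure raised while handling the
   Accept-Encoding header turns into a 400 rather than aborting the request. *)
let safe_compress : Dream.middleware =
 fun handler request ->
  Lwt.catch
    (fun () -> Dream_encoding.compress handler request)
    (function
      | Failure _ ->
        (* Deliberately broad: catches any Failure escaping the wrapped handler. *)
        Dream.respond ~status:`Bad_Request "Malformed Accept-Encoding header"
      | exn -> Lwt.fail exn)
```

Whether this is the right fix, rather than hardening the parser in dream-encoding itself, is a separate question; it is only meant to keep a single bad header from producing the "Aborted by" errors shown above.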

@aantron (Contributor) commented Sep 4, 2024

It looks like one of the issues might be with Dream_encoding, cc @tmattio. However, even in the presence of any other issues, Dream shouldn't be leaking connections, so that's a separate question.

Do we know whether any of these errors are related to the fds in the CLOSE_WAIT state? Perhaps, at a minimum, Dream should print the OS fd number to its logs to help with analyzing this kind of situation.
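
In the meantime, one rough way to correlate the CLOSE_WAIT counts with the server process, without any change to Dream, might be to log the number of open socket file descriptors periodically from inside the application. This is a sketch assuming a Linux /proc filesystem; `count_socket_fds` and `monitor` are hypothetical names:

```ocaml
(* Sketch: on Linux, open sockets appear in /proc/self/fd as symlinks of the
   form "socket:[inode]"; count them and log the total once a minute. *)
let count_socket_fds () =
  Sys.readdir "/proc/self/fd"
  |> Array.to_list
  |> List.filter (fun fd ->
       match Unix.readlink ("/proc/self/fd/" ^ fd) with
       | target -> String.length target >= 7 && String.sub target 0 7 = "socket:"
       | exception _ -> false)
  |> List.length

let rec monitor () =
  Dream.log "open socket fds: %d" (count_socket_fds ());
  Lwt.bind (Lwt_unix.sleep 60.) monitor
```

Starting it with `Lwt.async (fun () -> monitor ())` before `Dream.run` would make the leak visible in the same logs as the request traffic.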

@cuihtlauac (Collaborator, Author)

> Do we know whether any of these errors are related to the fds in the CLOSE_WAIT state?

No, we don't.

@aantron (Contributor) commented Sep 5, 2024

I've added fd number logging to Dream in aantron/dream#345, now merged in. Would you be able to try running with it?

@cuihtlauac (Collaborator, Author) commented Sep 6, 2024

The count_while1 exceptions seem to be unrelated; I've created a separate issue: #2683. They are transient too, but correlated with content, and they appear without causing network or overload trouble.
