TCP Connection Ports Leaks #9529
Hello! There is an option called

When shutting down the Erlang VM you can either make it flush or not flush all open connections, how does
Hello @garazdawi,
I will check that on our application and come back to you.
We are using cowboy
You can get the value by doing
Indeed. All connections in our applications are using
It could then explain why the VM is not stopped: all the connections are still active, entirely outside the scope of the VM's shutdown procedure, but they remain available in the background until the connections are explicitly closed by the node (killing the TCP connections with
This other "incarnation" of the connection seems to be a problem; what does it really mean? When we close the socket, can it still be re-activated by the client? Furthermore, can this "incarnation" behavior also be seen when using a timeout?
When you don't use
I'm closing this; if you need any more help, feel free to re-open.
I don't think the issue here has been addressed. The problem as I understand it is that there are still socket ports existing despite the owner process being long gone:
Even if the socket lingers, I don't think the Erlang port should exist so long after its owner process has exited, to the point that it prevents shutting down the node. If my understanding is correct, how would one debug such an issue? Because clearly it doesn't happen for everyone.
How long one should wait is always a problem. If the system is not shutting down, how long would you then want to wait? gen_tcp defaults to infinity, while a gen_server:call is 5 seconds. Both have problems. The same for when a node is shutting down gracefully. It seems prudent to wait for a while for tcp/stdout buffers to be flushed, but how long should one wait?
To detect that you have this issue, I would use

Is that what you had in mind for debugging help?
If the owner process called
Yes, thank you. @niamtokik Please refer to the comment from Lukas here the next time you see it happen; this should give us more info.
Hey there, actually I'm still working on this issue. We tried many different procedures to avoid this kind of behavior, without success:
Unfortunately, even with this quite complex and complete procedure, it seems that in some particular cases connections are still active and block the VM during shutdown. Those connections are pretty slow (~5 kB/s on average) and are trying to fetch data greater than 100 MB. When the VM is in this state, we don't have any access to the console, and we need to use system tools like

@essen I saw it a few times today. I need to create
Yes. We tried different ways (see the previous procedure) and in some cases we can't

In summary, I will probably have more information to share over the next days, with the list of all the actions we took and what we found. Right now, we are still trying to find an "easy" workaround instead of killing the connections with
Forgot to add, we also modified
We are using
Doing a

It is quite easy to trigger this behaviour in a small test:

1> os:cmd("erl -detached -noshell -eval '{ok, L} = gen_tcp:listen(8080, [{reuseaddr,true},{active,false}]), gen_tcp:accept(L), receive ok -> ok end'").
[]
2> spawn(fun() -> {ok, D} = gen_tcp:connect(localhost, 8080, []), (fun F() -> gen_tcp:send(D, <<0:4096>>), F() end)() end).
<0.92.0>
3> exit(v(2), stop).
true
4> erlang:ports().
[#Port<0.0>,#Port<0.1>,#Port<0.3>,#Port<0.4>,#Port<0.5>]
5> erlang:port_info(#Port<0.5>).
[{name,"tcp_inet"},
{links,[]},
{id,40},
{connected,<0.92.0>},
{input,0},
{output,594540},
{os_pid,undefined}]
6> inet:getstat(#Port<0.5>).
{ok,[...{send_pend,8468}]}
7> init:stop(). %% Hangs

If you don't want the TCP connection to remain after you have closed it, you need to set the linger option to something.
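For reference, the linger option being discussed is the BSD-socket `SO_LINGER` option that `inet` passes through to the OS. A minimal sketch, in Python rather than Erlang because the option is language-agnostic at the kernel level (the socket setup here is illustrative only, not the author's test case):

```python
import socket
import struct

# Set up a loopback connection so we have a real TCP socket to act on.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)
cli = socket.create_connection(srv.getsockname())
peer, _ = srv.accept()

# l_onoff=1, l_linger=0: close() becomes abortive. The kernel discards any
# unsent data and sends RST instead of waiting for the peer to drain it,
# which is what {linger, {true, 0}} requests on the Erlang side.
cli.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
onoff, seconds = struct.unpack(
    "ii", cli.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER, 8))
print("l_onoff:", onoff, "l_linger:", seconds)  # l_onoff: 1 l_linger: 0

cli.close()  # returns immediately; no lingering connection is left behind
peer.close()
srv.close()
```

With this option set, the close is immediate even when the peer is slow, at the cost of discarding whatever was still queued.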
Just wanted to note that this behaviour only happens when you are trying to send data but cannot. It does not happen when you are receiving data.
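That asymmetry can be seen at the OS level. A hedged sketch (Python instead of Erlang, since the effect lives in the kernel's socket buffers): with a peer that never reads, a sender's buffer fills and the unsent bytes stay pending, which is what the non-zero `send_pend` counter in the transcript above reflects:

```python
import socket

# Loopback connection whose accepting side never calls recv().
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)
cli = socket.create_connection(srv.getsockname())
peer, _ = srv.accept()  # accepted, but we will never read from it

cli.setblocking(False)
queued = 0
blocked = False
for _ in range(100_000):
    try:
        queued += cli.send(b"\x00" * 65536)
    except BlockingIOError:
        # The send and receive buffers are full: everything accepted so far
        # is "pending", analogous to the send_pend statistic above.
        blocked = True
        break

print(f"queued {queued} bytes before the send side blocked: {blocked}")
cli.close()
peer.close()
srv.close()
```

A receiving socket has no such queue of unacknowledged outgoing data, which is why the problem only shows up on the sending side.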
Does this mean that if I set linger to

https://www.erlang.org/doc/apps/kernel/inet.html
If it does completely close the socket after the timeout (same as

Thank you for all the clarifications!!
Here is the problem. Even by setting linger to

Anyway, I think isolating those connections is important, and collecting some traces during a shutdown could probably give us more information here. One thought I had a few days ago: what if those connections are behind a firewall or another entity with the power to modify or alter the stream?
From Unix Network Programming Volume 1: The Sockets Networking API (page 203):
Again, in most of the cases we are seeing these behaviors, but it could be due to the connection quality of the peers. When a peer has a poor connection (e.g. retransmissions, latency, low speed...), it seems to behave differently.
It does not, only
Turns out I was incorrect about this. Doing an explicit close makes

The only way to guarantee that the port is closed for both explicit and implicit close is to set linger to 0.
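To make the explicit-versus-implicit distinction concrete from the peer's point of view, here is a small hypothetical sketch in Python (the helper name is made up; the behaviour described is for Linux): a plain close delivers a FIN, so the peer reads end-of-stream, while a linger-0 close delivers an RST:

```python
import socket
import struct

def peer_view(linger_zero):
    """Close one side and report what the other side's recv() observes."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    cli = socket.create_connection(srv.getsockname())
    peer, _ = srv.accept()
    if linger_zero:
        # Abortive close: discard pending data and send RST.
        cli.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                       struct.pack("ii", 1, 0))
    cli.close()
    try:
        result = "eof" if peer.recv(1024) == b"" else "data"
    except ConnectionResetError:
        result = "reset"
    peer.close()
    srv.close()
    return result

print("normal close:", peer_view(False))   # FIN: peer sees end-of-stream
print("linger=0 close:", peer_view(True))  # RST: peer sees a reset (Linux)
```

Only the linger-0 variant guarantees the local resources are released immediately in both the explicit and implicit cases.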
@garazdawi, a quick summary: it seems we also had a bug in our application. One of the processes in charge of the connection was not correctly catching

I think adding this behavior to the documentation regarding the

@essen adding

I think this issue can be closed; at least, we are using a workaround that just works now.
Cowboy has an application-level linger loop, so it could be improved by setting
Describe the bug
When shutting down an Erlang node, if an active TCP connection is still established with a remote peer, the connection is not closed, even if the process linked/connected to the port was killed. This can lead to a very slow shutdown if the connection to the remote peer has high latency or low bandwidth.
To Reproduce
We discovered this bug using `cowboy`, but it can also be reproduced with `inets:httpd`:

- start a service (`cowboy` or `inets:httpd`)
- connect with `curl` using `--limit-rate ${value}` where `${value}` is close to `0`
- inspect the leaked ports with `[ erlang:port_info(X) || X <- erlang:ports() ]`
- inspect the connections with `sudo ss -ntip` on GNU/Linux or `doas fstat -u ${user}` on OpenBSD
- kill the connections with `ss -K` on GNU/Linux

A full example as escript is present at the end of this bug report.
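The throttled-client step can also be approximated without curl. A sketch only (Python instead of an escript; the buffer sizes, payload size, and sleep interval are made-up values): a reader that drains the stream slowly keeps the sender's kernel buffer full for the whole transfer, which is the state that later blocks shutdown:

```python
import socket
import threading
import time

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)

PAYLOAD = b"\x00" * (256 * 1024)

def serve():
    conn, _ = srv.accept()
    # Shrink the send buffer so the slow reader is felt immediately.
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4096)
    conn.sendall(PAYLOAD)  # blocks while the throttled reader lags behind
    conn.close()

t = threading.Thread(target=serve)
t.start()

cli = socket.create_connection(srv.getsockname())
cli.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4096)
got = 0
while True:
    chunk = cli.recv(1024)  # tiny reads, throttled like --limit-rate
    if not chunk:
        break  # sender closed after the whole payload finally drained
    got += len(chunk)
    time.sleep(0.001)
t.join()
cli.close()
srv.close()
print("received", got, "bytes")
```

Killing the server process mid-transfer in this state is what leaves the unflushed socket behind.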
Expected behavior
When stopping `cowboy`, `inets:httpd`, or any service using TCP, the active connections should be properly closed and/or killed.

Affected versions
This bug has been reproduced on:

- `cowboy`/`inets`
- `cowboy`/`inets`
- `cowboy`/`inets`
- `inets`
Additional context
It looks like the bug is in `erts/emulator/drivers/common/inet_drv.c` and was introduced in ebbd26e. Not sure though, but during the investigation it seemed this part of the code does not receive the instruction to close the connection.

The bug can easily be reproduced with the following code:
Here are some examples of leaked ports. The port is not linked, and the owning process no longer exists (dead):
The connections can be seen using these commands:
Those ports cannot be closed using `erlang:port_close/1`. Here is another debugging session from an Erlang shell where `inets` has been stopped:

EDIT1: added ports/process info and corrected a few typos
EDIT2: added more information regarding leaked ports.