TCP Connection Ports Leaks #9529

Open
niamtokik opened this issue Mar 4, 2025 · 21 comments
Assignees
Labels
bug (Issue is reported as a bug), team:PS (Assigned to OTP team PS), team:VM (Assigned to OTP team VM)

Comments

@niamtokik

niamtokik commented Mar 4, 2025

Describe the bug

When shutting down an Erlang node, if an active TCP connection is still established with a remote peer, the connection is not closed, even if the process linked/connected to the port has been killed. This can lead to a very slow shutdown if the connection to the remote peer has high latency or low bandwidth.

To Reproduce

We discovered this bug using cowboy but it can also be reproduced with inets:httpd:

  1. start a service using TCP (e.g. cowboy or inets:httpd)
  2. configure the service to share large files (greater than 100MB but can be smaller if one has a way to control the client/server throughput)
  3. fetch a file with a web client like curl using --limit-rate ${value} where ${value} is close to 0.
  4. stop the service (or the BEAM)
  5. if the service is stopped, an unlinked port connected to an unknown process should be present. It can be listed using [ erlang:port_info(X) || X <- erlang:ports() ]
  6. if the BEAM has been stopped, it will wait until the connection is closed. The connection can be checked using sudo ss -ntip on GNU/Linux or doas fstat -u ${user} on OpenBSD.
  7. the only workaround found right now is to kill the connections using ss -K on GNU/Linux

A full example as an escript is included at the end of this bug report.

Expected behavior

When stopping cowboy, inets:httpd or any service using TCP, the active connection should be properly closed and/or killed.

Affected versions

This bug has been reproduced on:

  • Ubuntu 22.04 with OTP-24.2.1 (from packages) and cowboy/inets
  • Ubuntu 22.04 with OTP-26.2.5.9 (from sources) and cowboy/inets
  • OpenBSD 7.6 with OTP-25.3.2.13 (from packages) and cowboy/inets
  • OpenBSD 7.6 with OTP-26.2.5.3 (from packages) and inets

Additional context

The bug appears to come from erts/emulator/drivers/common/inet_drv.c and may have been introduced in ebbd26e. We are not sure, but during the investigation it seemed that this part of the code never receives the instruction to close the connection.

The bug can easily be reproduced with the following code:

#!/usr/bin/env escript

listen_port() -> 9999.

main(_Args) ->
  % prepare environment
  io:format("prepare environment~n"),
  
  io:format("create /tmp/inets directory~n"),
  file:make_dir("/tmp/inets"),
  
  io:format("create /tmp/inets/configuratio directory~n"),
  file:make_dir("/tmp/inets/configuration"),
  
  io:format("create /tmp/inets/files directory~n"),
  file:make_dir("/tmp/inets/files"),
  
  io:format("create /tmp/inets/files/100m file~n"),
  file:write_file("/tmp/inets/files/100m", crypto:strong_rand_bytes(100*1024*1024)),

  % start inets and configure it
  io:format("start inets~n"),
  application:ensure_all_started(inets),
  {ok, _P} = inets:start(httpd, [
    {port, listen_port()},
    {server_root, "/tmp/inets/configuration"},
    {document_root, "/tmp/inets/files"},
    {bind_address, "localhost"},
    {server_name, "localhost"}
  ]),
  
  % send a tick message every second to display leaked ports
  timer:send_interval(1000, self(), tick),
  
  % execute some curls from the BEAM
  timer:send_after(2000, self(), curl),
  timer:send_after(2000, self(), curl),
  timer:send_after(2000, self(), curl),
  timer:send_after(2000, self(), curl),

  % after 3s, send a stop message to stop inets
  timer:send_after(3000, self(), stop),
  loop().

loop() ->
  receive
    curl ->
      spawn_monitor(fun() ->
        Command = "curl --limit-rate 1 http://localhost:" ++ integer_to_list(listen_port()) ++ "/100m",
        io:format("start curl connection: ~p~n", [Command]),
        os:cmd(Command)
      end),
      loop();
    stop ->
      io:format("stop inets~n"),
      inets:stop(),
      application:stop(inets),
      loop();
    tick ->
      io:format("port leaks: ~w~n", [port_leak()]),
      loop();
    Msg ->
      io:format("received: ~w~n", [Msg]),
      loop()
  end.

% filter only unlinked ports
port_leak() ->
  Ports = erlang:ports(),
  PortsInfo = [ erlang:port_info(X) || X <- Ports ],
  Leaks = [ 
    X || X <- PortsInfo,
    proplists:get_value(name, X) =:= "tcp_inet",
    proplists:get_value(links, X) =:= []
  ],
  case Leaks of
    [] -> [];
    PS when is_list(PS) ->
      [
        begin
          Process = proplists:get_value(connected, P),
          ProcessInfo = erlang:process_info(Process),
          {P, {Process, ProcessInfo}}
        end
      || P <- PS
      ]
  end.

Here is an example of leaked ports. The ports are not linked, and the connected processes no longer exist (they are dead):

[{[{name,"tcp_inet"},                                                          
   {links,[]},                                                                 
   {id,80},                                                                    
   {connected,<0.101.0>},                                                      
   {input,0},                                                                  
   {output,131275},                                                            
   {os_pid,undefined}],                                                        
  {<0.101.0>,undefined}},                                                                                                                                                                                                                                                                                                    {[{name,"tcp_inet"},                                                                                                                                                                                                                                                                                                       
   {links,[]},                                                                                                                                                                                                                                                                                                                 {id,88},                                                                                                                                                                                                                                                                                                                 
   {connected,<0.102.0>},                                                                                                                                                                                                                                                                                                      {input,0},                                                                                                                                                                                                                                                                                                               
   {output,131275},                                                                                                                                                                                                                                                                                                            {os_pid,undefined}],                                                                                                                                                                                                                                                                                                     
  {<0.102.0>,undefined}},                                                                                                                                                                                                                                                                                                    {[{name,"tcp_inet"},                                                                                                                                                                                                                                                                                                       
   {links,[]},                                                                                                                                                                                                                                                                                                                 {id,96},                                                                                                                                                                                                                                                                                                                 
   {connected,<0.103.0>},                                                                                                                                                                                                                                                                                                      {input,0},                                                                                                                                                                                                                                                                                                               
   {output,131275},                                                                                                                                                                                                                                                                                                            {os_pid,undefined}],                                                                                                                                                                                                                                                                                                     
  {<0.103.0>,undefined}},                                                                                                                                                                                                                                                                                                    {[{name,"tcp_inet"},                                                                                                                                                                                                                                                                                                       
   {links,[]},                                                                                                                                                                                                                                                                                                                 {id,112},                                                                                                                                                                                                                                                                                                                
   {connected,<0.104.0>},                                                                                                                                                                                                                                                                                                   
   {input,0},                                                                  
   {output,131275},                                                            
   {os_pid,undefined}],                                                        
  {<0.104.0>,undefined}}]

The connections can be seen using these commands:

# on GNU/Linux
ss -nitp | grep beam

# on OpenBSD
fstat | grep beam | grep tcp

Those ports cannot be closed using erlang:port_close/1. Here is another debugging session from an Erlang shell where inets has been stopped:

% extract the ports
[{F,_}] = [ Z || Z = {X, Y} <- [ {X, erlang:port_info(X)} || X <- erlang:ports()],  proplists:get_value(name, Y) =:= "tcp_inet" ].

% impossible to close the port using port_close:
erlang:port_close(F).
% returns an exception:
% ** exception error: bad argument
%      in function  port_close/1
%         called as port_close(#Port<0.5>)

% impossible to connect to the port as well
erlang:port_connect(F, self()).
% returns an exception:
% ** exception error: bad argument
%      in function  port_connect/2
%         called as port_connect(#Port<0.5>,<0.130.0>)

EDIT1: added ports/process info and corrected a few typos

EDIT2: added more information regarding leaked ports.

@niamtokik added the bug label Mar 4, 2025
@garazdawi
Contributor

Hello!

There is an option called linger that can be set on a connection. How is that configured in this case?

When shutting down the Erlang VM you can either make it flush or not flush all open connections; how does erlang:halt(..., [{flush,boolean()}]) affect your scenarios?

@niamtokik
Author

Hello @garazdawi,

There is an option called linger that can be set on a connection. How is that configured in this case?

I will check that on our application and come back to you.

When shutting down the Erlang VM you can either make it flush or not flush all open connections; how does erlang:halt(..., [{flush,boolean()}]) affect your scenarios?

erlang:halt(0, [{flush, true}]). hangs, but erlang:halt(0, [{flush, false}]). stops the VM. It seems the connections are closed as well.

@niamtokik
Author

There is an option called linger that can be set on a connection. How is that configured in this case?

We are using cowboy 2.10.0 with ranch 1.8.0. It seems ranch_tcp sets the linger value to {false, 0}. I enforced this configuration in our application to be sure, but it did not change the behavior when shutting down the VM or stopping the application (including cowboy and ranch). It seems most of the connections are now correctly closed, but there are still some that take a while to be closed. I can't find which default value the BEAM uses for this parameter; is it set to {true, 0}? If that is the case, I think the cowboy and ranch documentation needs to be fixed.

@garazdawi
Contributor

You can get the value by doing inet:getopts(Connection, [linger]). On my machine the default is {false,0}.
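For example, a minimal check from the shell (the port number is only an assumption here, matching the demo listener from the escript above):

{ok, Socket} = gen_tcp:connect("localhost", 9999, [{active, false}]),
{ok, [{linger, Linger}]} = inet:getopts(Socket, [linger]),
io:format("default linger: ~p~n", [Linger]).   % typically prints {false,0}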

@niamtokik
Author

Indeed. All connections in our applications are using {false, 0} by default. From the documentation:

{false, _} - close/1 or shutdown/2 returns immediately, not waiting for data to be flushed, with closing happening in the background.

That could explain why the VM does not stop: the connections stay active in the background, completely outside the scope of the VM's shutdown procedure, until they are explicitly closed by the node (for example by killing the TCP connections with ss) or until the client terminates them on its side. Then what about {true, 0}?

{true, 0} - Aborts the connection when it is closed. Discards any data still remaining in the send buffers and sends RST to the peer. This avoids TCP's TIME_WAIT state, but leaves open the possibility that another "incarnation" of this connection being created.

The another "incarnation" of the connection seems a problem, what does it really mean? When we are closing the socket, it can still be re-activated by the client? Furthermore, does this "incarnation" behavior can also be seen when using a timeout?

@garazdawi
Contributor

The another "incarnation" of the connection seems a problem, what does it really mean? When we are closing the socket, it can still be re-activated by the client? Furthermore, does this "incarnation" behavior can also be seen when using a timeout?

When you don't use {true,0}, the TCP socket will enter the TIME_WAIT state when it is closed. This holds the connection semi-alive for a while and prevents certain types of misbehaviour. It is the default for a reason. If you do set it to {true,0}, it will skip the TIME_WAIT state and send RST to the remote end (RST is an error indication). If you search (or ask an LLM) for SO_LINGER 0 and TIME_WAIT you will get better explanations than what I can give.

@IngelaAndin added the team:PS and team:VM labels Mar 4, 2025
@garazdawi
Contributor

I'm closing this; if you need any more help, feel free to re-open.

@essen
Contributor

essen commented Mar 10, 2025

I don't think the issue here has been addressed. The problem as I understand it is that there are still socket ports existing despite the owner process being long gone:

[{[{name,"tcp_inet"},                                                          
   {links,[]},                                                                 
   {id,80},                                                                    
   {connected,<0.101.0>},                

<0.101.0> in this case is no longer there (it's no longer in links because it terminated).

Even if the socket lingers, I don't think the Erlang port should exist so long after its owner process has exited, to the point that it prevents shutting down the node.

If my understanding is correct, how would one debug such an issue? Because clearly it doesn't happen for everyone.

@garazdawi
Contributor

Even if the socket lingers, I don't think the Erlang port should exist so long after its owner process has exited, to the point that it prevents shutting down the node.

How long one should wait is always a problem. If the system is not shutting down, how long would you then want to wait? gen_tcp defaults to infinity, while a gen_server:call is 5 seconds. Both have problems.

The same for when a node is shutting down gracefully. It seems prudent to wait for a while for tcp/stdout buffers to be flushed, but how long should one wait?

If my understanding is correct, how would one debug such an issue? Because clearly it doesn't happen for everyone.

To detect that you have this issue I would use inet:getstat/1 on the stuck connection. It will show send_pend > 0, which means that it still has data it needs to send. You can then use inet to inspect that port, figure out where it is connected, and maybe see why that TCP connection is "stuck". You can also use inet:i() to list all open ports, which will help you identify what each port is connected to.
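A minimal sketch of that kind of check (the helper name is only illustrative, not an OTP function):

%% List the tcp_inet ports that still have unsent data, together with the
%% peer they are connected to.
stuck_tcp_ports() ->
    [{Port, Pend, inet:peername(Port)}
     || Port <- erlang:ports(),
        erlang:port_info(Port, name) =:= {name, "tcp_inet"},
        {ok, [{send_pend, Pend}]} <- [inet:getstat(Port, [send_pend])],
        Pend > 0].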

Is that what you had in mind for debugging help?

@essen
Contributor

essen commented Mar 11, 2025

It seems prudent to wait for a while for tcp/stdout buffers to be flushed, but how long should one wait?

If the owner process called gen_tcp:close before exiting, would the owner process be stuck as well? Or is there a difference between a close from function and a close from owner exit, in that regard?

Is that what you had in mind for debugging help?

Yes, thank you.

@niamtokik Please refer to the comment from Lukas here the next time you see it happen; this should give us more info.

@niamtokik
Author

niamtokik commented Mar 11, 2025

Hey there, I'm actually still working on this issue. We tried many different procedures to avoid this kind of behavior, without success:

  • when using {linger, {true, 0}} (or any other timeout value) we see really strange behavior, even when we set this value only on sockets controlled by ranch/cowboy: we got more timeouts and a lot of instability;

  • we tried to create a drain procedure, like the one in the official ranch documentation, to purge all the connections before exiting the node. Unfortunately, a few connections were still there and the VM was hanging. Here is one of the final procedures we developed (a sketch follows the list below):

  1. suspend our ranch listener with ranch:suspend_listener(Listener);
  2. shut down the port linked to each remaining connection using inet:shutdown(Port, write);
  3. after a first delay, close the remaining connections using inet:close(Port);
  4. at this point most of the connections are gone, but sometimes a few of them are still there, so we update the socket with inet:setopts/2, setting the linger option to {true, 0}, and then call inet:close/1 again;
  5. today (not deeply tested), we tried to kill the ranch processes directly with erlang:exit/2. I have not been able to reproduce the bug since then, but because we do not control the peers - and still do not have an easy way to recreate the scenario - we need to wait a bit.
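A minimal sketch of that procedure, assuming Listener is the ranch listener name; connection_ports/0 is a hypothetical helper returning the remaining tcp_inet ports, and the delays are arbitrary:

drain(Listener) ->
    ranch:suspend_listener(Listener),                      % 1. stop accepting new connections
    [inet:shutdown(P, write) || P <- connection_ports()],  % 2. half-close the write side
    timer:sleep(5000),                                     %    first delay
    [inet:close(P) || P <- connection_ports()],            % 3. close what remains
    timer:sleep(5000),
    [begin                                                 % 4. force-abort the stragglers
         inet:setopts(P, [{linger, {true, 0}}]),
         inet:close(P)
     end || P <- connection_ports()],
    ok.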

Unfortunately, even with this quite complex and complete procedure, it seems that in some particular cases connections are still active and block the VM during shutdown. Those connections are pretty slow (~5 kB/s on average) and are trying to fetch more than 100 MB of data.

When the VM is in this state, we don't have any access to the console, and we need to use system tools like ss, netstat and strace to get an idea of what's happening.

@essen I saw it a few times today. I need to capture pcap files to understand exactly what's happening. I'm not sure it's a network issue though; it looks like the OS process running the VM still has access to the write buffer, with some data left in it.

If the owner process called gen_tcp:close before exiting, would the owner process be stuck as well? Or is there a difference between a close from function and a close from owner exit, in that regard?

Yes. We tried different ways (see the previous procedure) and in some cases we can't shutdown/close/kill the connection. I would like to reproduce the same behavior with gen_tcp directly, without ranch, to see if it can happen there as well, but all my tests with that scenario have failed.

In summary, I will probably have more information to share over the next few days, with a list of all the actions we took and what we found. Right now, we are still trying to find an "easy" workaround instead of killing the connections with ss -K every time we need to restart our nodes.

@niamtokik
Author

Forgot to add: we also set the -shutdown_time parameter to a really small value, and it did not help. The OS process running the VM was still there even after the timeout we set. The same behavior was also present when modifying the supervisor shutdown timeouts.

@niamtokik
Author

To detect that you have this issue I would use inet:getstat/1 on the stuck connection. It will show send_pend > 0, which means that it still has data it needs to send. You can then use inet to inspect that port, figure out where it is connected, and maybe see why that TCP connection is "stuck". You can also use inet:i() to list all open ports, which will help you identify what each port is connected to.

We are using inet:getstat/1 right now to isolate the connections and try to get a better understanding of what's happening. But inet:getstat/1 exposes practically the same information as netstat or ss (at least, if I understood the implementation correctly). What is annoying, though, is that when we shut down the node with erlang:halt/1 or init:stop/0, we lose access to the console and can no longer execute these functions. In fact, we also lose the logs, so it's rather hard to give more information without resorting to dirty hacks.

@garazdawi reopened this Mar 11, 2025
@garazdawi
Contributor

If the owner process called gen_tcp:close before exiting, would the owner process be stuck as well? Or is there a difference between a close from function and a close from owner exit, in that regard?

Doing a gen_tcp:close/1 causes exactly the same behaviour as if the controlling process exits. That is, the TCP port is told to flush all its queues and then exit. The process that does the close will not block while waiting for the buffers to flush.

It is quite easy to trigger this behaviour in a small test:

1> os:cmd("erl -detached -noshell -eval '{ok, L} = gen_tcp:listen(8080, [{reuseaddr,true},{active,false}]), gen_tcp:accept(L), receive ok -> ok end'").
[]
2> spawn(fun() -> {ok, D} = gen_tcp:connect(localhost, 8080, []), (fun F() -> gen_tcp:send(D, <<0:4096>>), F() end)() end).
<0.92.0>
3> exit(v(2), stop).
true
4> erlang:ports().
[#Port<0.0>,#Port<0.1>,#Port<0.3>,#Port<0.4>,#Port<0.5>]
5> erlang:port_info(#Port<0.5>).
[{name,"tcp_inet"},
 {links,[]},
 {id,40},
 {connected,<0.92.0>},
 {input,0},
 {output,594540},
 {os_pid,undefined}]
6> inet:getstat(#Port<0.5>).
{ok,[...{send_pend,8468}]}
7> init:stop(). %% Hangs

If you don't want the TCP connection to remain after you have closed it, you need to set the linger option to something.
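For example, a minimal sketch of an abortive close (Socket is assumed to be an already-connected gen_tcp socket):

ok = inet:setopts(Socket, [{linger, {true, 0}}]),   % discard unsent data, send RST on close
ok = gen_tcp:close(Socket).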

@garazdawi
Contributor

Those connections are pretty slow (~5 kB/s on average) and are trying to fetch more than 100 MB of data.

Just wanted to note that this behaviour only happens when you are trying to send data but cannot. It does not happen when you are receiving data.

@garazdawi assigned garazdawi and unassigned jhogberg Mar 11, 2025
@essen
Contributor

essen commented Mar 11, 2025

If you don't want the TCP connection to remain after you have closed it, you need to set the linger option to something.

Does this mean that if I set linger to {true, 5} the connection will not remain after the 5s? I do not find this part of the doc entirely clear about what happens when the timeout is reached (other than the close function returning):

https://www.erlang.org/doc/apps/kernel/inet.html

{true, Time} when Time > 0 - close/1 or shutdown/2 will not return
until all queued messages for the socket have been successfully sent
or the linger timeout (Time) has been reached.

If it does completely close the socket after the timeout (same as {true, 0} but delayed), then it sounds like something Cowboy/Ranch should set and do when closing as a result of graceful shutdown, in at least some cases.

Thank you for all the clarifications!!

@niamtokik
Author

If you don't want the TCP connection to remain after you have closed it, you need to set the linger option to something.

Here is the problem: even setting linger to {true, X} seems to have no effect on some sockets. Why? I don't know. Most of the connections are correctly closed, as explained in the documentation. One thing we want to investigate is tuning the Linux TCP stack a bit to see if we can bypass this odd behavior (increasing or decreasing the read/write buffers, for example).

Anyway, I think isolating those connections is important, and collecting some traces during a shutdown could probably give us more information here. One thought I had a few days ago: what if those connections are behind a firewall or another entity with the power to modify or alter the stream?

Does this mean that if I set linger to {true, 5} the connection will not remain after the 5s? I do not find this part of the doc entirely clear about what happens when the timeout is reached (other than the close function returning):

From Unix Network Programming Volume 1: The Sockets Networking API (page 203):

If l_onoff is nonzero and l_linger is zero, TCP aborts the connection when it is closed. That is, TCP discards any data still remaining in the socket send buffer and sends an RST to the peer, not the normal four-packet connection termination sequence. This avoids TCP’s TIME_WAIT state, but in doing so, leaves open the possibility of another incarnation of this connection being created within 2MSL seconds and having old duplicate segments from the just-terminated connection being incorrectly delivered to the new incarnation [...] If l_onoff is nonzero and l_linger is nonzero, then the kernel will linger when the socket is closed. That is, if there is any data still remaining in the socket send buffer, the process is put to sleep until either: (i) all the data is sent and acknowledged by the peer TCP, or (ii) the linger time expires. If the socket has been set to nonblocking, it will not wait for the close to complete, even if the linger time is nonzero. When using this feature of the SO_LINGER option, it is important for the application to check the return value from close, because if the linger time expires before the remaining data is sent and acknowledged, close returns EWOULDBLOCK and any remaining data in the send buffer is discarded.

Again, in most cases we do see these documented behaviors, but the exceptions could be due to the connection quality of the peers. When a peer has a poor connection (e.g. retransmissions, latency, low speed...), it seems to behave differently.

@garazdawi
Contributor

Does this mean that if I set linger to {true, 5} the connection will not remain after the 5s?

It does not; only {true,0} does that. I thought that {true, N>0} would also do that, but it seems it does not. I'm trying to find out whether this is expected and, if so, why.

@garazdawi
Contributor

Doing a gen_tcp:close/1 causes exactly the same behaviour as if the controlling process exits.

Turns out I was incorrect about this. Doing an explicit close makes {linger,{true,N>1}} work as I thought it would, but when the port is closed due to process termination the option is effectively ignored until the internal buffer is empty.

The only way to guarantee that the port is closed for both explicit and implicit close is to set linger to 0. The socket backend should behave better in this regard, so you might want to try that and see if you get better behaviour.
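For reference, a hedged example of opting in to the socket backend for an individual socket (supported in recent OTP releases; the inet_backend option has to come first in the option list):

{ok, L} = gen_tcp:listen(8080, [{inet_backend, socket}, {reuseaddr, true}, {active, false}]).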

@niamtokik
Author

niamtokik commented Mar 14, 2025

@garazdawi, a quick summary: it seems we also had a bug in our application. One of the processes in charge of the connections was not correctly catching shutdown messages and was therefore killed too early in the shutdown procedure. The three steps we are using now seem to work as expected. We decided to set {linger, {true, 0}} after a timeout, to be sure most of the connections are closed safely and only the slow ones are aborted.
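A minimal sketch of that last step, under the assumption that Ports holds the remaining tcp_inet ports after the grace period (the helper name is hypothetical):

%% Abort only the connections that still have unsent data.
abort_slow(Ports) ->
    [begin
         inet:setopts(P, [{linger, {true, 0}}]),
         inet:close(P)
     end || P <- Ports,
            case inet:getstat(P, [send_pend]) of
                {ok, [{send_pend, N}]} -> N > 0;
                _ -> true
            end],
    ok.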

I think it would be good to document this behavior regarding the links and connected fields of a socket when the node is shutting down with remaining connections. It is really strange to have a port without any linked or connected process, and even stranger when the node hangs for a long time during shutdown.

@essen adding {linger, {true, 0}} (or {linger, {true, N}}) to the ranch documentation to explain this specific scenario would also be nice to have.

I think this issue can be closed; at least, we now have a workaround that just works.

@essen
Contributor

essen commented Mar 14, 2025

Cowboy has an application-level linger loop, so it could be improved by setting {linger, {true, 0}} before entering that loop, so that the socket gets closed when the process terminates. I will open a ticket in Cowboy to do that.
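A minimal sketch of that idea (this is not Cowboy's actual code; Transport follows Ranch's transport callback API and the loop assumes a passive-mode socket):

%% Make the eventual close abortive before the application-level linger loop,
%% so the port disappears as soon as the owning process terminates.
graceful_close(Transport, Socket, Timeout) ->
    Transport:setopts(Socket, [{linger, {true, 0}}]),
    Transport:shutdown(Socket, write),          % stop sending, keep reading
    linger_loop(Transport, Socket, Timeout).

linger_loop(Transport, Socket, Timeout) ->
    case Transport:recv(Socket, 0, Timeout) of
        {ok, _Data} -> linger_loop(Transport, Socket, Timeout);
        _Closed     -> Transport:close(Socket)
    end.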
