Add timeout to connect attempts to fix #19 #20

Merged
5 commits merged on Aug 1, 2025

Conversation

thomasjm
Contributor

Without connect timeouts, I get intermittent hangs on some tests I run. With this change, I ran a whole bunch of repeats and got no failures.

I don't have an exact reproducer for causing the connect call to hang. In my case, it happens when a Kubernetes Service is in the process of starting. According to some googling, hangs can happen for a variety of reasons, such as if the network simply drops packets for a destination that doesn't exist yet.
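
As a rough illustration of the idea (not the code from this PR), a blocking IO action can be bounded with `System.Timeout.timeout` from `base`. The helper name `orTimeout` and the 0.1-second limit are assumptions made for this sketch, and `threadDelay` stands in for a `connect` call that never completes. The error message mirrors the one that appears in the output later in this thread.

```haskell
import Control.Concurrent (threadDelay)
import Control.Exception (IOException, try)
import System.Timeout (timeout)

-- | Hypothetical helper: run an action, but fail with a user error if it
-- has not finished within the given number of microseconds.
orTimeout :: Int -> IO a -> IO a
orTimeout micros action = do
  result <- timeout micros action
  case result of
    Nothing    -> ioError (userError "Timeout in connect attempt")
    Just value -> pure value

main :: IO ()
main = do
  -- threadDelay (2 seconds) is a stand-in for a connect call that hangs;
  -- the 0.1-second limit is an arbitrary value chosen for the demo.
  outcome <- try (orTimeout 100000 (threadDelay 2000000))
               :: IO (Either IOException ())
  case outcome of
    Left err -> putStrLn ("Gave up: " ++ show err)
    Right () -> putStrLn "Finished in time"
```

With the `network` package, the call site would presumably look like `orTimeout micros (connect sock addr)`. Raising an `IOException` on timeout means a retry loop that already handles connection errors can treat a timed-out attempt like any other failed attempt.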

Owner

@mbg mbg left a comment

Hi @thomasjm 👋🏻

Thank you so much for reporting the issue you encountered and opening this PR to try and fix it! I hadn't run into this particular problem during any of my testing.

I approved the CI runs for your PR and it seems that most of the workflows fail because you introduced the NumericUnderscores extension which isn't available in older versions of GHC.
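
For context, `NumericUnderscores` (literals such as `5_000_000`) only exists in GHC 8.6 and later. A minimal sketch of a portable alternative is to spell the constant out as a product. The name `connectTimeoutMicros` and the five-second value are assumptions for illustration, not taken from the PR.

```haskell
-- Hypothetical timeout constant: five seconds, in microseconds.
-- Written as a product so that it compiles without NumericUnderscores
-- (that is, without the literal 5_000_000) on older GHC versions.
connectTimeoutMicros :: Int
connectTimeoutMicros = 5 * 1000 * 1000

main :: IO ()
main = print connectTimeoutMicros
```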

I am on holiday this week, so it would be good if you could change that and get the existing CI to pass before I am back next week when I can review this in full.

@mbg
Owner

mbg commented Aug 19, 2024

Also, some of the builds fail because recoveringWith now has a different signature, so that will also need to be addressed to get the CI to pass.

@thomasjm
Contributor Author

All right, done. I'm pretty sure nobody uses GHC 8.2 anymore though :P

@thomasjm thomasjm requested a review from mbg October 11, 2024 09:15
@thomasjm
Contributor Author

Friendly ping on review @mbg

@mbg
Owner

mbg commented Oct 25, 2024

Hi @thomasjm,

Sorry for the delay in getting this reviewed. It's on my radar, but I just haven't been able to find the time yet.

@thomasjm
Contributor Author

thomasjm commented Feb 7, 2025

Hi @mbg, any chance of getting this finished soon?

@mbg
Owner

mbg commented Aug 1, 2025

Really sorry for the delay here. I haven't been doing much Haskell lately, so these libraries haven't been getting the attention they deserve. However, I am trying to get back into maintaining these properly since I am using them personally again, so I am getting back to this now.

I have been trying to put together a test case for this (I have tried various combinations of nc, socat, and the macOS firewall), but also had no luck. The library as it is on main handled everything fine. This made me realise that I never asked:

  • What operating system were you using when you ran into the problem described in "Initial connect call can hang forever" (#19)? I imagine networking stacks differ here, so that one handles this case without changes to this library while others don't.
  • I assume that you are using TCP, but can you confirm that?

@thomasjm
Contributor Author

thomasjm commented Aug 1, 2025

#19 would have been on Linux with TCP.

What are you trying to do in your test case? Like I said in the original issue, I'm not entirely sure what circumstances cause a connect call to hang, but one way might be network pathology like dropped packets.

@mbg
Owner

mbg commented Aug 1, 2025

> What are you trying to do in your test case? Like I said in the original issue, I'm not entirely sure what circumstances cause a connect call to hang, but one way might be network pathology like dropped packets.

Mainly I've been trying to simulate that sort of thing, but with everything I have tested the connection attempt either got aborted immediately or the underlying networking stack's timeout worked.

For example, I added a firewall rule (on macOS) with:

echo "block drop out quick on lo0 proto tcp from any to 127.0.0.1 port 12345" | sudo pfctl -ef -

And then:

> waitTcpVerbose putStrLn retryPolicyDefault "localhost" "12345"
[retry:0] Encountered Network.Socket.connect: <socket: 15>: timeout (Operation timed out). Retrying.
[retry:1] Encountered Network.Socket.connect: <socket: 15>: timeout (Operation timed out). Retrying.
[retry:2] Encountered Network.Socket.connect: <socket: 15>: timeout (Operation timed out). Retrying.
[retry:3] Encountered Network.Socket.connect: <socket: 15>: timeout (Operation timed out). Retrying.
[retry:4] Encountered Network.Socket.connect: <socket: 15>: timeout (Operation timed out). Retrying.
[retry:5] Encountered Network.Socket.connect: <socket: 15>: timeout (Operation timed out). Retrying.
*** Exception: Network.Socket.connect: <socket: 15>: timeout (Operation timed out)

@thomasjm
Contributor Author

thomasjm commented Aug 1, 2025

I just reproduced a hang on both Linux and macOS like this:

 module Main (main) where

 import Network.Socket

 main :: IO ()
 main = do
     sock <- socket AF_INET Stream defaultProtocol

     -- Connect to 10.255.255.1 (non-routable private IP)
     -- This will cause SYN packets to be sent but dropped by routing
     let addr = SockAddrInet 80 (tupleToHostAddress (10, 255, 255, 1))

     putStrLn "Attempting connection (this will hang)..."
     connect sock addr  -- This will hang indefinitely
     putStrLn "Connected!" -- Never reached

@mbg
Owner

mbg commented Aug 1, 2025

That's great, thank you! I can confirm that this does get stuck on its own, as well as on the main branch with:

ghci> import Network.Socket
ghci> import Control.Retry
ghci> let addr = SockAddrInet 80 (tupleToHostAddress (10, 255, 255, 1))
ghci> waitSocketVerbose putStrLn retryPolicyDefault $ defaultHints{ addrAddress=addr , addrFamily = AF_INET, addrSocketType = Stream }

-- nothing happens

But with the changes in your PR:

ghci> waitSocketVerbose putStrLn retryPolicyDefault $ defaultHints{ addrAddress=addr , addrFamily = AF_INET, addrSocketType = Stream }
[retry:0] Encountered user error (Timeout in connect attempt). Retrying.
[retry:1] Encountered user error (Timeout in connect attempt). Retrying.
[retry:2] Encountered user error (Timeout in connect attempt). Retrying.
[retry:3] Encountered user error (Timeout in connect attempt). Retrying.
[retry:4] Encountered user error (Timeout in connect attempt). Retrying.
[retry:5] Encountered user error (Timeout in connect attempt). Retrying.
*** Exception: user error (Timeout in connect attempt)

@mbg
Owner

mbg commented Aug 1, 2025

I have turned this into a test case in 9d56f3d, feel free to cherry-pick that onto this branch

@thomasjm
Contributor Author

thomasjm commented Aug 1, 2025

FWIW if you wait long enough with my hanging example, it does time out. For me it took 1 minute 15 seconds on macOS, and 2 minutes 13 seconds on Linux. I'm not sure what determines this.
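
One plausible explanation, offered as an assumption rather than something verified in this thread: these durations look like the operating systems' own connection-establishment limits. On Linux, `net.ipv4.tcp_syn_retries` defaults to 6, and the tcp(7) man page describes that as retrying for up to approximately 127 seconds, which is close to the 2 minutes 13 seconds observed. The 75 seconds on macOS matches the BSD-style connection-establishment timer (`net.inet.tcp.keepinit`, which I believe defaults to 75000 ms). The sketch below only reproduces the Linux arithmetic, assuming a 1-second initial retransmission timeout that doubles after every unanswered SYN.

```haskell
-- Assumed Linux defaults: 6 SYN retries, 1-second initial timeout.
synRetries :: Int
synRetries = 6

initialTimeoutSeconds :: Int
initialTimeoutSeconds = 1

-- Waiting period after the initial SYN and after each retransmission,
-- doubling every time: 1, 2, 4, ... seconds.
backoffIntervals :: [Int]
backoffIntervals = [initialTimeoutSeconds * 2 ^ k | k <- [0 .. synRetries]]

-- Total time before the kernel gives up on the connection attempt.
totalSeconds :: Int
totalSeconds = sum backoffIntervals

main :: IO ()
main = do
  print backoffIntervals -- [1,2,4,8,16,32,64]
  print totalSeconds     -- 127
```

If this explanation holds, the "hang" is a very long kernel-level timeout rather than an unbounded wait, and an application-level timeout simply shortens it to something a retry policy can work with.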

@thomasjm
Contributor Author

thomasjm commented Aug 1, 2025

Cherry-pick done

@mbg
Owner

mbg commented Aug 1, 2025

> FWIW if you wait long enough with my hanging example, it does time out. For me it took 1 minute 15 seconds on macOS, and 2 minutes 13 seconds on Linux.

Interesting. Is that specific to the example or is that the case with the original scenario where you ran into the problem as well? (If you can reproduce that)

@mbg
Owner

mbg commented Aug 1, 2025

Also, I realised that 4d27fd7 is also needed for the test. Otherwise, without the fix, it of course also gets stuck (at least for the amount of time until the lengthy timeout kicks in that you mentioned).

@thomasjm
Contributor Author

thomasjm commented Aug 1, 2025

> Is that specific to the example or is that the case with the original scenario where you ran into the problem as well? (If you can reproduce that)

I'm not sure where it comes from. The original scenario was intermittent and difficult to reproduce since it involved a whole Kubernetes cluster.

Owner

@mbg mbg left a comment

I think that, whether or not the test case gives us the worst-case scenario of connect getting completely stuck, this is an improvement over main, where there's no timeout for the connect attempt and we rely on the underlying network stack to do the right thing.

I am happy to merge this as-is.

Thank you very much for reporting this, implementing a fix, and finding a reasonable test case. Once again sorry that it took so long for me to deal with.

@mbg mbg merged commit ae95dcb into mbg:main Aug 1, 2025
12 checks passed
@mbg mbg mentioned this pull request Aug 1, 2025
@mbg
Owner

mbg commented Aug 1, 2025

I am releasing this as part of https://github.com/mbg/network-wait/releases/tag/v0.4

@thomasjm
Contributor Author

thomasjm commented Aug 1, 2025

Thanks!
