enm goes into infinite loop with pipeline tcp connection - when tcp service is not running. #7
Comments
Thanks, will take a look.
Here the rc from enm_do_send comes back as -1. This is the function sequence that repeats: …
I've not done any Erlang port work and am new to Erlang, but I think the port driver somehow needs to be notified that the operation failed.
Anyway, you'll know it best. Please let me know if I can help.
Adding driver_failure(d->port, errno); makes it crash as opposed to hang.
Prevent a TCP socket that's not actually connected from causing the driver's enm_ready_output function to repeatedly try to send data and fail. Close the socket and indicate connection eof to the caller. Add a regression test for this case.
The problem with …
Thanks Steve. I also thought driver_failure was not harsh but was not sure about eof. Will test it next week and let you know.
Hi Steve, this fix works for us. We effectively get the closed-socket effect and are going to retry with backoff. Thanks a lot for your help. When would you be able to push it to master? Regards,
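Not from the thread, but as an illustration of the retry-with-backoff approach just mentioned: a minimal sketch assuming enm:send/2 returns an error tuple (the thread later reports {error, close}) once the driver has closed the socket; the send_with_backoff helper and the delay values are hypothetical.

send_with_backoff(Url, Socket, Msg) ->
    send_with_backoff(Url, Socket, Msg, 100).

%% retry with exponential backoff, giving up once the delay exceeds 5 seconds
send_with_backoff(Url, Socket, Msg, Delay) when Delay =< 5000 ->
    case enm:send(Socket, Msg) of
        ok ->
            {ok, Socket};
        {error, _Reason} ->
            %% the driver closed the socket; back off, reconnect, and retry
            enm:close(Socket),
            timer:sleep(Delay),
            {ok, NewSocket} = enm:push([{connect, Url}]),
            send_with_backoff(Url, NewSocket, Msg, Delay * 2)
    end;
send_with_backoff(_Url, _Socket, _Msg, _Delay) ->
    {error, gave_up}.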
Thanks for testing it. I'll need to get the change reviewed before it can go to master. I'll try to get that done as soon as possible.
Fix issue #7: avoid spinning on send to unconnected socket
I wrote a small program to test this fix. It gives me {error, close} on the second message even when the TCP server is listening properly. I tested the same code against an older build (before the patch) and it worked fine.

-module(enm_stress).
-export([stress/0]).

stress() ->
    enm:start_link(),
    Url = "tcp://127.0.0.1:9090",
    %Url = "inproc://pipeline",
    spawn(fun() -> acceptor(Url) end),
    timer:sleep(1000),
    spawn(fun() -> sender(Url) end).

acceptor(Url) ->
    {ok, Socket} = enm:pull([{bind, Url}]),
    io:format("accepting on socket ~p~n", [Socket]),
    listen(Socket).

listen(Socket) ->
    receive
        {nnpull, Socket, <<"quit">>} ->
            io:format("quitting server~n"),
            enm:close(Socket);
        {nnpull, Socket, Message} ->
            io:format("~p : ~p~n", [erlang:now(), Message]),
            listen(Socket)
    after 10000 ->
        io:format("timeout. quitting server~n"),
        enm:close(Socket)
    end.

sender(Url) ->
    {ok, Client} = enm:push([{connect, Url}]),
    send(Client, 1000).

send(Socket, 0) ->
    io:format("Quitting client~n"),
    ok = enm:send(Socket, <<"quit">>),
    timer:sleep(2000),
    enm:close(Socket);
send(Socket, Tick) ->
    io:format("sending ~p ~n", [Tick]),
    ok = enm:send(Socket, integer_to_binary(Tick)),
    timer:sleep(5), % milliseconds
    send(Socket, Tick-1).
@supershal thanks for reporting this. Sometimes when I run your test code, it runs correctly to completion, but other times it hits the same error you describe. I'll need to look into it some more.
I have a fix that preserves the behavior of the original change in this thread but also avoids the problem @supershal has seen. The fix involves counting the number of times a send hits EAGAIN with no intervening I/O.
The initial fix for issue 7 could result in sockets getting closed prematurely. Unfortunately the nanomsg API limits how we can handle this, so this fix uses a counter to determine if we're spinning in a ready_output loop, closing the socket if it hits EAGAIN 64 times with no intervening I/O.
@supershal can you try this fix and let me know if it works for you?
Thanks @vinoski for looking into it. I tested the fix using the above standalone module and it works fine. However, when I tried the fix in our application, I still see the socket-closed error even though the server is running. Our application traffic comes in bursts (1000 messages at once, with message sizes of 5-8 KB); that's where the socket closes prematurely. However, I was unable to reproduce the same issue with a standalone test (sending more than 1000 messages at once, without any delay) where the message size is very small. Please allow me a day or two; let me write a reproducible test and play with the EAGAIN count.
Thanks @supershal. I may have a better idea for fixing this, so in the meantime I'll experiment with that.
I did some further testing. I found out that the issue is with message size, not with sending messages in bursts. A message size over 1 KB breaks the pipeline after sending a few messages.
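To make that concrete, here is a hypothetical variant of the sender from the stress module above; the send_large name and the 2048-byte payload are illustrative, chosen only to be above the roughly 1 KB size described as triggering the failure.

send_large(Socket, 0) ->
    enm:close(Socket);
send_large(Socket, Tick) ->
    %% build a payload larger than ~1 KB; smaller payloads did not reproduce the problem
    Payload = binary:copy(<<"x">>, 2048),
    ok = enm:send(Socket, Payload),
    timer:sleep(5), % milliseconds, same pacing as the original sender
    send_large(Socket, Tick - 1).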
@vinoski, does nanomsg close the socket immediately, or are messages queued up/dropped and EAGAIN returned when the server is unavailable? I was wondering if returning {error, eagain} upon receiving EAGAIN would be the preferred way to handle this use case. What are the conventions? Should the client crash and re-establish a new socket, or can the client drop/queue messages itself when it receives {error, eagain} and let nanomsg keep retrying to reconnect to the server?
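For the second option in that question, a minimal sketch of client-side queueing, assuming (hypothetically) that enm:send/2 returned {error, eagain} rather than closing the socket; this only illustrates the convention being asked about, not the library's actual behavior.

%% keep messages that could not be sent in a local queue and retry them later
send_or_queue(Socket, Msg, Queue) ->
    case enm:send(Socket, Msg) of
        ok ->
            Queue;
        {error, eagain} ->
            queue:in(Msg, Queue)
    end.

%% drain the local queue once the connection recovers, stopping at the next eagain
flush_queue(Socket, Queue) ->
    case queue:out(Queue) of
        {empty, Empty} ->
            Empty;
        {{value, Msg}, Rest} ->
            case enm:send(Socket, Msg) of
                ok -> flush_queue(Socket, Rest);
                {error, eagain} -> Queue
            end
    end.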
@supershal: the main problem with returning {error, eagain} …
I'm still experimenting with an alternative fix.
@supershal and @harishdeshmukh please have a look at 5e19468 and give it a try. It adds a …
Thanks @vinoski for the fix. I was able to run my test over the fix successfully. It does not prematurely close the socket anymore. Please let me know if the following assumptions are correct.
…
@supershal, answering your questions: …
enm goes into infinite loop with pipeline tcp connection - when tcp service is not running.
Scenario:
a. Connect Url is tcp
b. The tcp service is not running
c. Run the following code (a minimal repro sketch follows below). I just did …
d. Look at top - the erlang process is taking 100% CPU.
e. CentOS - Linux version 3.10.0-123.20.1.el7.x86_64 ([email protected])
f. gcc version 4.8.2 20140120
Is it a known issue? Please let me know if it's a user error.
Thanks.
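A minimal sketch of the scenario above (the reporter's original snippet is not included in this thread); it assumes a pipeline push socket connected to a tcp:// URL where no service is listening, which is the trigger described in steps a-d.

%% connect to a TCP endpoint with no service listening, then send once;
%% with the pre-fix driver this makes the Erlang VM spin at 100% CPU
repro() ->
    enm:start_link(),
    {ok, Socket} = enm:push([{connect, "tcp://127.0.0.1:9090"}]),
    _ = enm:send(Socket, <<"hello">>),   % return value not asserted here
    timer:sleep(60000),                  % time to observe CPU usage in top
    enm:close(Socket).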