Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve connection cleanup to avoid many error messages with ioc exit on some operating systems #200

Merged

Conversation

FreddieAkeroyd
Copy link
Contributor

@FreddieAkeroyd FreddieAkeroyd commented Dec 20, 2023

Try to terminate accept thread during cleanup

@AppVeyorBot
Copy link

Build asyn 1.0.254 failed (commit 6eb168c10f by @FreddieAkeroyd)

@exzombie
Copy link
Contributor

I like the general idea. It won't work on Linux, however. It might on Darwin, or it might not; I don't have a Mac to test on. The reason is that epicsSocketDestroy() on Posix only calls close() on the socket, which is not always enough to interrupt the blocking accept(). One needs to call shutdown() before calling close().1

@jlmuir, if your're testing this, and if it works, you should expect to see one or two errors that occur when the accept() is interrupted. If it doesn't work, add shutdown(tty->fd, SHUT_RDWR); before destroying the socket. Once you've confirmed that this helps on Darwin, we can polish it to remove the remaining errors.

Footnotes

  1. Perhaps it would be appropriate to add shutdown() to epicsSocketDestroy()? Not sure.

@FreddieAkeroyd
Copy link
Contributor Author

Ah thanks, I see why linux did not error then, it just stayed blocked until the process died. Maybe adding shutdown to epicsSocketDestroy is worth doing, might be work a post to core-talk to see if it was considered but decided against. Can shutdown itself ever block though?

Another option is to make the loop non-blocking by adding a select() with timeout, so accept is only called if there is a connection to accept, and fd can then be checked every timeout

@AppVeyorBot
Copy link

Build asyn 1.0.255 failed (commit 3ab9652411 by @FreddieAkeroyd)

@jlmuir
Copy link

jlmuir commented Dec 21, 2023

@exzombie, I confirm that the three commits from @FreddieAkeroyd in this PR help on darwin-aarch64! I now only get one error message on IOC exit as you predicted:

2023/12/21 12:21:18.686 drvAsynIPServerPort:ttyCleanup P5001: shutdown socket 4

Here's the full terminal session:

$ ../../bin/darwin-aarch64/testIPServer st.cmd
< envPaths
epicsEnvSet("IOC","ioctestIPServer")
epicsEnvSet("TOP","/tmp/48/opt/synapps-6.2.1/asyn-R4-44-2")
epicsEnvSet("SUPPORT","/tmp/48/opt/synapps-6.2.1")
epicsEnvSet("SNCSEQ","/tmp/48/opt/synapps-6.2.1/seq-2-2-9")
epicsEnvSet("CALC","/tmp/48/opt/synapps-6.2.1/calc-R3-7-4")
epicsEnvSet("SSCAN","/tmp/48/opt/synapps-6.2.1/sscan-R2-11-4")
epicsEnvSet("EPICS_BASE","/tmp/48/opt/epics-base-7.0.8")
dbLoadDatabase("../../dbd/testIPServer.dbd")
testIPServer_registerRecordDeviceDriver(pdbbase)
#The following command starts a server on port 5001
drvAsynIPServerPortConfigure("P5001","localhost:5001",2,0,0,0)
serverAddr: 0.0.0.0:5001
serverPort: 5001
drvAsynIPServerPortConfigure("P5002","localhost:5002",1,0,0,0)
serverAddr: 0.0.0.0:5002
serverPort: 5002
#asynSetTraceFile("P5001",-1,"")
asynSetTraceIOMask("P5001",-1,0x2)
asynSetTraceMask("P5001",-1,0xff)
dbLoadRecords("../../db/testIPServer.db", "P=testIPServer:")
iocInit()
Starting iocInit
############################################################################
## EPICS R7.0.8
## Rev. 2023-12-18T15:15-0600
## Rev. Date build date/time:
############################################################################
iocRun: All initialization complete
ipEchoServer("P5001")
2023/12/21 12:20:57.598 P5001 -1 registerInterruptUser
seq("ipSNCServer", "P=testIPServer:, PORT=P5002")
sevr=info Sequencer release 2.2.9, compiled Mon Dec 18 22:30:46 2023
sevr=info Spawning sequencer program "ipSNCServer", thread 0x600002614180: "ipSNCServer"
sevr=info ipSNCServer[0]: all channels connected & received 1st monitor
epics> exit
2023/12/21 12:21:18.686 drvAsynIPServerPort:ttyCleanup P5001: shutdown socket 4

@MarkRivers
Copy link
Member

The appveyor Windows builds are failing because it cannot find "distutils". Does anyone know how to fix this?

@jlmuir
Copy link

jlmuir commented Dec 21, 2023

Re the AppVeyor build failures, if the version of Python in the AppVeyor build VM is >= 3.12, maybe because the distutils module was removed from Python >= 3.12? See What’s New In Python 3.12, which has more info on the removal, including a migration guide and workaround.

@MarkRivers
Copy link
Member

I found that the latest version of ci-scripts (v3.4.1) removes the dependency on distutils. I have changed asyn to use the latest version of ci-scripts, and I suspect that will fix the problem with Appveyor.

@FreddieAkeroyd
Copy link
Contributor Author

@jlmuir good to hear that works, though on reflection I think the suggestion of @exzombie to use shutdown() of the socket prior to epicsSocketDestroy would be a better way to go as it avoids arbitrary wakeups of the accept loop and also allows the sleep delay in cleanup to be as small as possible. Something like this in the ttyCleanup and no select() in the accept loop

int fd_save = tty->fd;
tty->fd = INVALID_SOCKET;
shutdown(fd_save, SHUT_RDWR); 
epicsSocketDestroy(fd_save);
epicsThreadSleep(0.1); /* wait for accept thread to terminate */  

@FreddieAkeroyd
Copy link
Contributor Author

FreddieAkeroyd commented Dec 22, 2023

Though we may need a delay between shutdown and epicsSocketDestroy if i have understood https://epics.anl.gov/core-talk/2019/msg00714.php and its follow up message correctly. I was also unaware of epicsSocketSystemCallInterruptMechanismQuery

@AppVeyorBot
Copy link

@FreddieAkeroyd
Copy link
Contributor Author

Have pushed change to use shutdown prior to socket close, I believe this should mean that windows/linux/darwin will now all behave the same and print a single accept error on ioc exit?

@AppVeyorBot
Copy link

@jlmuir
Copy link

jlmuir commented Dec 23, 2023

Tried with commit 7f89753 on darwin-aarch64, and it too eliminates the thousands of error messages (but on exit there are more messages than there were with the previous commit):

$ ../../bin/darwin-aarch64/testIPServer st.cmd
< envPaths
epicsEnvSet("IOC","ioctestIPServer")
epicsEnvSet("TOP","/tmp/48/opt/synapps-6.2.1/asyn-R4-44-2")
epicsEnvSet("SUPPORT","/tmp/48/opt/synapps-6.2.1")
epicsEnvSet("SNCSEQ","/tmp/48/opt/synapps-6.2.1/seq-2-2-9")
epicsEnvSet("CALC","/tmp/48/opt/synapps-6.2.1/calc-R3-7-4")
epicsEnvSet("SSCAN","/tmp/48/opt/synapps-6.2.1/sscan-R2-11-4")
epicsEnvSet("EPICS_BASE","/tmp/48/opt/epics-base-7.0.8")
dbLoadDatabase("../../dbd/testIPServer.dbd")
testIPServer_registerRecordDeviceDriver(pdbbase)
#The following command starts a server on port 5001
drvAsynIPServerPortConfigure("P5001","localhost:5001",2,0,0,0)
serverAddr: 0.0.0.0:5001
serverPort: 5001
drvAsynIPServerPortConfigure("P5002","localhost:5002",1,0,0,0)
serverAddr: 0.0.0.0:5002
serverPort: 5002
#asynSetTraceFile("P5001",-1,"")
asynSetTraceIOMask("P5001",-1,0x2)
asynSetTraceMask("P5001",-1,0xff)
dbLoadRecords("../../db/testIPServer.db", "P=testIPServer:")
iocInit()
Starting iocInit
############################################################################
## EPICS R7.0.8
## Rev. 2023-12-18T15:15-0600
## Rev. Date build date/time:
############################################################################
iocRun: All initialization complete
ipEchoServer("P5001")
2023/12/23 00:15:14.051 P5001 -1 registerInterruptUser
seq("ipSNCServer", "P=testIPServer:, PORT=P5002")
sevr=info Sequencer release 2.2.9, compiled Mon Dec 18 22:30:46 2023
sevr=info Spawning sequencer program "ipSNCServer", thread 0x600003bac0c0: "ipSNCServer"
sevr=info ipSNCServer[0]: all channels connected & received 1st monitor
epics> exit
2023/12/23 00:15:17.449 drvAsynIPServerPort: accept error on localhost:5002: fd=-1, Software caused connection abort
2023/12/23 00:15:17.554 drvAsynIPServerPort:ttyCleanup P5001: shutdown socket 4
2023/12/23 00:15:17.565 drvAsynIPServerPort: new connection, socket=-1 on localhost:5001
2023/12/23 00:15:17.565 drvAsynIPServerPort: accept error on localhost:5001: fd=-1, Undefined error: 0

@FreddieAkeroyd
Copy link
Contributor Author

Yes, it will print a few more messages as the select call avoided an error with accept and i didn't distinguish a select timeout (expected) from a socket close/shutdown. If we wished to avoid the accept error message on ioc exit we could use something like:

    clientFd = epicsSocketAccept(tty->fd, (struct sockaddr *) &clientAddr, &clientLen);
    if (tty->fd == INVALID_SOCKET) {
        break; /* we must be in ioc shutdown */
    }

@FreddieAkeroyd FreddieAkeroyd changed the title Invalid socket, but still prone to race conditions Improve connection cleanup to avoid many error messages with ioc exit on some operating systems Dec 24, 2023
@AppVeyorBot
Copy link

@jlmuir
Copy link

jlmuir commented Dec 29, 2023

Tried with commit 19db7d3 on darwin-aarch64, and it too eliminates the thousands of error messages on exit, and now there are only two messages on exit:

$ ../../bin/darwin-aarch64/testIPServer st.cmd
< envPaths
epicsEnvSet("IOC","ioctestIPServer")
epicsEnvSet("TOP","/tmp/48/opt/synapps-6.2.1/asyn-R4-44-2")
epicsEnvSet("SUPPORT","/tmp/48/opt/synapps-6.2.1")
epicsEnvSet("SNCSEQ","/tmp/48/opt/synapps-6.2.1/seq-2-2-9")
epicsEnvSet("CALC","/tmp/48/opt/synapps-6.2.1/calc-R3-7-4")
epicsEnvSet("SSCAN","/tmp/48/opt/synapps-6.2.1/sscan-R2-11-4")
epicsEnvSet("EPICS_BASE","/tmp/48/opt/epics-base-7.0.8")
dbLoadDatabase("../../dbd/testIPServer.dbd")
testIPServer_registerRecordDeviceDriver(pdbbase)
#The following command starts a server on port 5001
drvAsynIPServerPortConfigure("P5001","localhost:5001",2,0,0,0)
serverAddr: 0.0.0.0:5001
serverPort: 5001
drvAsynIPServerPortConfigure("P5002","localhost:5002",1,0,0,0)
serverAddr: 0.0.0.0:5002
serverPort: 5002
#asynSetTraceFile("P5001",-1,"")
asynSetTraceIOMask("P5001",-1,0x2)
asynSetTraceMask("P5001",-1,0xff)
dbLoadRecords("../../db/testIPServer.db", "P=testIPServer:")
iocInit()
Starting iocInit
############################################################################
## EPICS R7.0.8
## Rev. 2023-12-18T15:15-0600
## Rev. Date build date/time:
############################################################################
iocRun: All initialization complete
ipEchoServer("P5001")
2023/12/28 22:46:47.218 P5001 -1 registerInterruptUser
seq("ipSNCServer", "P=testIPServer:, PORT=P5002")
sevr=info Sequencer release 2.2.9, compiled Mon Dec 18 22:30:46 2023
sevr=info Spawning sequencer program "ipSNCServer", thread 0x60000077c000: "ipSNCServer"
sevr=info ipSNCServer[0]: all channels connected & received 1st monitor
epics> exit
2023/12/28 22:46:53.865 drvAsynIPServerPort:ttyCleanup P5001: shutdown socket 4
2023/12/28 22:46:53.877 drvAsynIPServerPort: terminating connection thread for localhost:5001

@FreddieAkeroyd
Copy link
Contributor Author

@jlmuir thank you for testing and a happy new year to you. So now the only messages printed are from ASYN_TRACE_FLOW, so you will see them for port P5001 where this is enabled but not for P5002 where it is not (previously there was an accept error for P5002 that was printed, but such an error is now ignored during ioc shutdown)

@exzombie
Copy link
Contributor

exzombie commented Jan 3, 2024

Looks ok to me; I'm not a fan of using sleep to synchronize threads, but that's the least intrusive approach. I've had enough "fun" trying to fix broken (in theory) locking in #171 to push for a proper solution here 😅

I was thinking a bit about what it would mean if shutdown() was added to epicsSocketDestroy(). Would it break anything? I see three things to consider:

  • The most visible change would be that threads on Linux would suddenly be unblocked on socket destruction, where they were blocked before. However, as we saw here, they get unblocked on some systems already, so the behavior would only change on Linux. This change would bring consistency, but might be a hard sell because of the prevalence of Linux systems.
  • The code that already calls shutdown() might be affected. The additional call made as part of epicsSocketDestroy() could cause errors. It does not on Linux (the second call returns success), but I don't know about other systems. Perhaps potential errors can simply be ignored, the shutdown being a "best effort" thing?
  • There should be no delay between shutdown() and close(). From what I see, the RTEMS thing is either a bug/misbehavior or a thread priority problem or both. @mdavidsaver has added a workaround to pvAccessCPP, but I wonder whether it's really needed for other code. From the comments, it looks like the situation is specific to the design of pvAccessCPP.

Given the existence of the RTEMS workaround, surely adding shutdown() to epicsSocketDestroy() has been considered before? @mdavidsaver ?

@jlmuir
Copy link

jlmuir commented Jan 8, 2024

@FreddieAkeroyd wrote:

@jlmuir thank you for testing and a happy new year to you. So now the only messages printed are from ASYN_TRACE_FLOW, so you will see them for port P5001 where this is enabled but not for P5002 where it is not (previously there was an accept error for P5002 that was printed, but such an error is now ignored during ioc shutdown)

Yes, I confirm that if I comment out the asynSetTraceMask("P5001",-1,0xff) call in st.cmd, I get no messages on exit:

$ ../../bin/darwin-aarch64/testIPServer st.cmd
< envPaths
epicsEnvSet("IOC","ioctestIPServer")
epicsEnvSet("TOP","/tmp/48/opt/synapps-6.2.1/asyn-R4-44-2")
epicsEnvSet("SUPPORT","/tmp/48/opt/synapps-6.2.1")
epicsEnvSet("SNCSEQ","/tmp/48/opt/synapps-6.2.1/seq-2-2-9")
epicsEnvSet("CALC","/tmp/48/opt/synapps-6.2.1/calc-R3-7-4")
epicsEnvSet("SSCAN","/tmp/48/opt/synapps-6.2.1/sscan-R2-11-4")
epicsEnvSet("EPICS_BASE","/tmp/48/opt/epics-base-7.0.8")
dbLoadDatabase("../../dbd/testIPServer.dbd")
testIPServer_registerRecordDeviceDriver(pdbbase)
#The following command starts a server on port 5001
drvAsynIPServerPortConfigure("P5001","localhost:5001",2,0,0,0)
serverAddr: 0.0.0.0:5001
serverPort: 5001
drvAsynIPServerPortConfigure("P5002","localhost:5002",1,0,0,0)
serverAddr: 0.0.0.0:5002
serverPort: 5002
#asynSetTraceFile("P5001",-1,"")
asynSetTraceIOMask("P5001",-1,0x2)
#asynSetTraceMask("P5001",-1,0xff)
dbLoadRecords("../../db/testIPServer.db", "P=testIPServer:")
iocInit()
Starting iocInit
############################################################################
## EPICS R7.0.8
## Rev. 2023-12-18T15:15-0600
## Rev. Date build date/time:
############################################################################
iocRun: All initialization complete
ipEchoServer("P5001")
seq("ipSNCServer", "P=testIPServer:, PORT=P5002")
sevr=info Sequencer release 2.2.9, compiled Mon Dec 18 22:30:46 2023
sevr=info Spawning sequencer program "ipSNCServer", thread 0x6000039fc000: "ipSNCServer"
sevr=info ipSNCServer[0]: all channels connected & received 1st monitor
epics> exit

@jlmuir
Copy link

jlmuir commented Feb 5, 2024

Ping? Anything else needed here? Could this be merged? Thanks!

@MarkRivers MarkRivers merged commit a94a789 into epics-modules:master Feb 5, 2024
10 checks passed
@mdavidsaver
Copy link
Contributor

Since I am late to this party, these will have to be notes for future work.

Given the existence of the RTEMS workaround, surely adding shutdown() to epicsSocketDestroy() has been considered before? @mdavidsaver ?

Not that I am aware of.

As was noted above, interrupting blocking syscalls is not especially portable to OS or socket type. The awkward epicsSocketSystemCallInterruptMechanismQuery() logic is intended for deciding how to interrupt recv() on a TCP socket, but is I hope the correct condition for accept() as well.

asynPrint(pasynUser, ASYN_TRACE_FLOW,
"drvAsynIPServerPort: terminating connection thread for %s\n",
tty->serverInfo);
break; /* we must be in ioc shutdown and ttyCleanup has been called */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... or maybe the file descriptor limit was reached and accept() has returned with errno==EMFILE.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to add a separate test for this condition when we check clientFd at https://github.com/FreddieAkeroyd/asyn/blob/19db7d33e5576393f99feeb196f88b7d321e5a95/asyn/drvAsynSerial/drvAsynIPServerPort.c#L336 or would this situation also affect tty->fd ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants