-
Notifications
You must be signed in to change notification settings - Fork 318
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pa_StopStream() should not use a finite timeout to join threads #919
Comments
See PortAudio#919 for rationale. Note that the previous Pa_StopStream() uses a bounded timeout to abort the stream if it doesn't stop within the alloted time, and then another bounded timeout to wait for the streaming thread to exit. This commit preserves the first timeout as it makes sense and appears to be safe, but gets rid of the second one so that we wait indefinitely for the streaming thread to exit.
See PortAudio#919 for rationale.
The counter argument is that if the stream freezes (as has been reported in multiple recent issues and PRs) then an infinite timeout results in your application locking up and your user having no way to save their work. (I recall we merged patches for bounded timeouts in JACK and/or ALSA recently). My position is that bounded timeouts should be used by all host APIs under all conditions. If a stream fails to stop (or close due to a timeout then it should be put into an permanently failed state so that the worst case is that it stops working and (temporarily) leaks resources, but never causes a complete lock-up of the client. |
Yes, but these timeouts are not the right fix for this. Using a timeout when joining a thread is not a proper fix. It's papering over the problem and creating new ones in the process (specifically, the thread could suddenly resume execution at any point, and if it does it is likely to trigger undefined behavior because the state it was using is gone - in fact, the entire PortAudio DLL could be gone). The correct fix is to actually figure out why things are getting stuck and fix the root cause. But it's unlikely anyone will get the opportunity to do that if PortAudio keeps hiding underlying issues like this.
If an Host API has to use a timeout (which, again, I don't think it should unless all other avenues have been exhausted), then the only safe way to do it is to apply the timeout inside the streaming thread on the particular system audio API that's blocking it (e.g. when the audio streaming thread is waiting for some event to be signaled). That I'd be fine with, because that can usually be done safely without introducing potential undefined behavior, contrary to the "well I don't know what's going on with that thread, so YOLO ¯\(ツ)/¯" approach. Ironically, from a cursory look at the code, the DirectSound Host API already seems to do that, because the only blocking call made from the streaming thread is using a timeout: portaudio/src/hostapi/dsound/pa_win_ds.c Lines 2782 to 2784 in 18a606e
Which means that even if DirectSound itself is stuck (i.e. the playback/capture cursors don't advance), the streaming thread will still be able to notice that it's being requested to stop and cleanly exit. Which means the timeout on the thread join truly is pointless - it's just opening potential for undefined behavior for no benefit. Same story for MME it seems: portaudio/src/hostapi/wmme/pa_win_wmme.c Line 2848 in 18a606e
So, for MME and DirectSound, the streaming thread is already designed in such a way that it cannot get stuck. The timeout on the thread join in
I'm not familiar with JACK or ALSA code, but if these timeouts are implemented correctly (e.g. streaming thread code timing out waiting for some audio event and then cleanly cleaning up if it's exceeded), then I'd be fine with it (with the caveat that I don't think we should hide deadlocks until we have at least tried to root cause them). Using a timeout on a thread join and then just moving on if it's exceeded is not a correctly implemented timeout. |
Yes, that would be the ideal. We should track down the root cause.
I agree. Timeouts should not be ignored. When a timeout is detected it should result in an error paTimedOut being returned to the application. That is generally doable for Pa_* functions. But it's hard when the timeout occurs in the callback thread. |
@dechamps You're jumping to a lot of conclusions here. Could you please at least try to assume good faith and see where I'm coming from? Here is the WMME PR that I was talking about: If a stream freezes due to operating system or driver behavior there is no root cause analysis possible. The best that can be done is log an error and implement a best effort error recovery. In my opinion, a timeout is better than a hung application. I agree that if there is a PortAudio bug, it should be fixed. I am not for a moment suggesting that PortAudio should paper over bugs in its own code. (How it handles a user callback that never returns is a separate question, with its own discussion). But I am assuming that PortAudio has no bugs relevant to this issue and the hangs are caused by the OS ceasing to wake PortAudio -- this is a real thing that has been observed.
I agree that timeouts need to be handled sensitively. If you read my response again you will see that I suggested putting the stream into an error state -- that would include never deleting associated storage. |
I agree that timeouts should not be ignored. They should be handled in some "correct" manner. Entering a safe "failed" (or "error" or "detached" or "zombie") state is one option. I don't think it's always as simple as returning an error though. As a rule, any function which drives a state change to quiescent state (equivalent of a destructor in C++) should always succeed to effect the state transition (in C++ this is the "destructors don't throw exceptions" rule). That would include Pa_StopStream and Pa_CloseStream. If you violate this rule you are essentially defining a code path where the stream can never be stopped or closed. Maybe you prefer that to my preferred persistent "failed" state? |
I like that rule. But if stopping the stream involves join()ing a thread and the thread is hung then what can you do? I think PA should clean up as best it can, set the state to STOPPED and then return paTimedOut. Are you suggesting that Pa_StopStream() should return paNoError when a join() times out?
Is that the state where you do not delete the structures that might be in use by the hung threads? |
I have no problem with the code in that PR. It looks correct to me. I think we are talking past each other Ross. To be clear: I am fine with an Host API using timeouts while waiting for audio events to occur (e.g. waiting for buffer space to become available), if it turns out to be necessary (looks like in the case of MME it is). That can be done without triggering undefined behavior and proper cleanup can still be done even if the timeout is hit. I am NOT fine with an Host API using timeouts to join its streaming thread, because that cannot be done without potentially triggering undefined behavior. Ross, the reason why I reacted the way I did is because you answered my original post, which is titled (emphasis mine):
And has the following as its first sentence:
Followed by code examples that show timeouts when waiting for the streaming thread to exit in some Host APIs. You answered that with:
Which I naturally interpreted as "All Host APIs should use bounded timeouts when joining the streaming thread", because that's the interpretation that seemed to make sense in the context of what I wrote. I vehemently disagree with that, and so I wrote a thorough rebuttal. But now, in your last comment you linked to a PR (#889) that is purely concerned with timeouts waiting for events within the streaming thread (which, again, is fine by me) - not the timeout used to join the streaming thread. This leaves me confused and makes me wonder if I misinterpreted your position. So, do we actually disagree on this or not? Just so that things are perfectly clear, my fixes for this issue are in #920. You'll note that I'm only touching the timeout used to join the MME and DirectSound streaming threads - I am NOT touching the timeouts used in the streaming thread to wait for buffers to become available, nor do I have any desire to. Do you object to #920 or not?
I don't think we need to have "detached" or "zombie" states. There are two kinds of timeouts we are discussing here: thread join timeouts, and timeouts inside the streaming thread waiting for some event to occur. My position, which hasn't changed, is that thread join timeouts should not exist because they cannot be implemented safely. I am fine with timeouts waiting for events in the streaming thread, because PortAudio can safely recover from those: indeed, if these occur, then a call to MME and DS currently use both kinds of timeouts. My proposal is to get rid of the former but keep the latter.
That cannot be done safely in the general case. The application may decide to unload the PortAudio DLL right after Again, I do not think PortAudio should try to cover up "hung threads". We should prevent threads from getting hung in the first place. MME and DS already have safe timeouts inside the streaming thread to prevent it from getting hung, so all that's left to do is getting rid of the unsafe timeouts on thread join, which are completely unnecessary. I am honestly still confused as to whether Ross agrees with this position or not. |
Currently, some, but not all, PortAudio Host APIs will use a finite timeout when joining the audio streaming thread in
Pa_StopStream()
. Examples include:MME:
portaudio/src/hostapi/wmme/pa_win_wmme.c
Lines 3453 to 3460 in 18a606e
DirectSound:
portaudio/src/hostapi/dsound/pa_win_ds.c
Line 3092 in 18a606e
In contrast, WASAPI uses an unbounded wait:
portaudio/src/hostapi/wasapi/pa_win_wasapi.c
Lines 4517 to 4520 in 18a606e
And so does WDM-KS:
portaudio/src/hostapi/wdmks/pa_win_wdmks.c
Lines 6289 to 6291 in 18a606e
(I have not looked into non-Windows host APIs.)
I believe it is incorrect to use a finite timeout, and this can lead to race conditions and undefined behavior. MME and DirectSound should be changed to wait forever.
The reason why I'm advocating this is because there is no telling how long the streaming thread may be blocked for. If the streaming thread suddenly resumes execution after
Pa_StopStream()
gave up, then the resources that the streaming thread is using may have been freed already - in fact, the entire PortAudio DLL may have been freed, as everything has been cleaned up as far as the caller code is concerned. This leads to hilarity in the form of undefined behavior.Now, is it likely that the streaming thread will be blocked for that long? No, but "unlikely" is not the same as "impossible", even in the absence of any deadlock/bugs. A non-real-time OS is allowed to suspend the execution of a thread indefinitely. For example, this could happen if the machine is overloaded, if the relevant memory pages have been swapped out and are taking some time to swap back in, if the process is suspended, or even if the computer is going to standby/sleep. In all these cases, the bounded timeout runs the risk of triggering undefined behavior. In my opinion this is not reasonable.
I presume that the reason why this timeout was introduced is to protect against the case where the streaming thread may be deadlocked. I don't think this makes sense because this is not something that is supposed to occur in a valid program. If the streaming thread deadlocks, then things are already going off the rails and there is no point in trying to "cover up" the problem - in fact, this is doing the developer a disservice because it only serves to hide bugs. This is precisely what happened in dechamps/FlexASIO#235, where PortAudio DirectSound ended up "hiding" a deadlock issue in user code, but PortAudio WASAPI made it obvious by deadlocking the entire app (as it should!). The behavior of waiting with a timeout and returning an error would be reasonable if the failure mode was recoverable, but strictly speaking it isn't, because (1) the code cannot know if the streaming thread is truly deadlocked and (2) even if it is, the streaming thread will linger around forever and leak various resources anyway.
For these reasons, I believe the correct behavior is to wait forever.
The text was updated successfully, but these errors were encountered: