Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash on com.apollographql.websocket on _inputStreamCallbackFunc #3390

Open
aaronbarsky opened this issue Jun 10, 2024 · 7 comments
Open

Crash on com.apollographql.websocket on _inputStreamCallbackFunc #3390

aaronbarsky opened this issue Jun 10, 2024 · 7 comments
Labels

Comments

@aaronbarsky
Copy link

Summary

Crashlytics is reporting a very rare but consistent crash on the com.apollograph.websocket queue.

Crashed: com.apollographql.websocket
EXC_BAD_ACCESS KERN_INVALID_ADDRESS 0x0000000f9d3993a0
0 libobjc.A.dylib objc_retain + 16
1 libobjc.A.dylib objc_retain_x0 + 16
2 CoreFoundation _inputStreamCallbackFunc + 60
3 CoreFoundation _signalEventSync + 216
4 CoreFoundation ___signalEventQueue_block_invoke + 28
5 libdispatch.dylib _dispatch_call_block_and_release + 32

The app is a mix of 60% foreground and 40% background.

Version

1.9.2

Steps to reproduce the behavior

Unfortunately I only have crash logs

Logs

No response

Anything else?

My hunch is that it's the FoundationStream teardown sequence:

 func cleanup() {
    if let stream = inputStream {
      stream.delegate = nil
      CFReadStreamSetDispatchQueue(stream, nil)
      stream.close()
    }

stream.delegate is an unowned(unsafe) var.

There are several codepaths where cleanup is not called on com.apollographql.websocket. When this happens, delegate can be set to nil as the CFReadStream is attempting to invoke a callback on the deallocated delegate.

@aaronbarsky aaronbarsky added bug Generally incorrect behavior needs investigation labels Jun 10, 2024
@calvincestari
Copy link
Member

Hi @aaronbarsky - you mentioned this being consistent; do you know the conditions or sequence of events that leads to the crash. Being able to have a reproduction case would be immensely helpful to resolving this.

@aaronbarsky
Copy link
Author

By consistent, I mean that the call stack is exactly the same in each case. It happens about 1/500 sessions.

The challenge is that the start of the crash is

CoreFoundation _inputStreamCallbackFunc

which makes it quite tough to write a unit test or minimally reproducible example.

@AnthonyMDev
Copy link
Contributor

We're going to really struggle to create or verify a fix for this since, as you said, its difficult to reproduce it.

There are several codepaths where cleanup is not called on com.apollographql.websocket. When this happens, delegate can be set to nil as the CFReadStream is attempting to invoke a callback on the deallocated delegate.

What codepaths are you referring to here? If there are some obvious areas where we are creating unsafe memory access, I'd suggest trying to fork the library, fix the cleanup() methods calls and ship a new version. If that prevents the crashes, we'd be happy to take a PR with the fixes you've made?

@aaronbarsky
Copy link
Author

I worked around the problem with an explicit call to closeConnection.

Previously I had

func makeWebSocketClient(connectionHeaders:JSONEncodableDictionary) -> ApolloClientProtocol {
   let configuration = WebSocketTransport.Configuration(
            reconnect: false,
            connectingPayload: connectionHeaders)

    let webSocketClient = WebSocket(
            url: ...,
            protocol: .graphql_transport_ws)

    let wsTransport = WebSocketTransport(
            websocket: webSocketClient,
            config: configuration)

    return ApolloClient(networkTransport: wsTransport, store: ApolloStore(cache: InMemoryNormalizedCache()))
}
    // use client
    apolloClient = nil

Now I keep an explicit reference to the networkTransport so I can close it before deallocating the client

   let configuration = ...
   let webSocketClient = ...
   self.wsTransport = ...
   return  apolloClient = ...
   // use client
   self.wsTransport.closeConnection()
   self.wsTransport = nil
   self.apolloClient = nil

(Note that networkTransport is internal to ApolloClient so I need to keep the extra reference for access to the closeConnection function)

Fixing it in the library would require a lot of changes, that could be complex.

The problem is that FoundationStream marks the delgate as @unowned(unsafe) instead of @weak. This means that when the delegate is deallocated, rather than skipping the callback, it crashes. To avoid this, we need to make sure that the delegate is only released after close has been called on the callback queue.

Current situation

   FoundationStreamQueue         DelegateCallbackQueue          ClientQueue
                                                                apolloClient = nil
    receiveData
    scheduleCallbackOnDelegate                                  deinit
                                                                cleanupStream
                                                                cleanup
                                                                delegate deallocated
                                 delegate.stream(handle:)
                                 Crash()

My theoretical fix would look something like

   FoundationStreamQueue         DelegateCallbackQueue          ClientQueue
                                                                apolloClient = nil
    receiveData
    scheduleCallbackOnDelegate                                  deinit
                                                                cleanupStream
                                                                cleanup
                                                                capture strong reference to delegate in work block
                                                                schedule block on StreamCallbackQueue
                                 delegate.stream(handle:)
                                 complete work block releasing last reference
                                 delegate deallocated

The tricky bit is that you can't make strong references to self in the deinit. So rather than the delegate being an instance of Apollo.FoundationStream, it would have to be a separate object with a weak reference back to FoundationStream.

I'm too busy to make an attempt on this now. Perhaps if this code doesn't change much in Apollo 2.0, I can give it another look in the future.

@calvincestari
Copy link
Member

calvincestari commented Sep 11, 2024

Thanks for the great debugging @aaronbarsky.

I'm too busy to make an attempt on this now. Perhaps if this code doesn't change much in Apollo 2.0, I can give it another look in the future.

The current StarScream-based websocket implementation will be removed from Apollo iOS; it's old, unsupported and hindering progress. Ideally we build a default websocket implementation using iOS native websocket APIs but allow users to swap in whatever websocket library they choose to use. When that happens is still undecided though. We're trying to keep the scope of 2.0 breaking changes as small as possible and it really depends how much work is needed here to achieve that.

@pixelmatrix
Copy link

I'm seeing a handful of crashes like this as well in my app

@pixelmatrix
Copy link

pixelmatrix commented Sep 12, 2024

@aaronbarsky Isn't setting the delegate to nil supposed to set the delegate back to the stream instance itself? I'm not seeing this happening in the debugger, but that's what the documentation says:

By a default, a stream object must be its own delegate; so a delegate message with an argument of nil should restore this delegate. Don’t retain the delegate to prevent retain cycles.

If that's the case, could the crash be that inputStream or outputStream are getting set to nil, thus the delegate is nil when the system attempts to call it?


On my end, the crashes seem to be happening because I call pauseWebSocketConnection when the app goes into the background, which eventually triggers cleanup, but leaves the Apollo.FoundationStream stream instance in place. The subscriptions can be pretty active, so I'm thinking the crash is happening when a pending message is queued up while we pause the connection.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants