Crash on com.apollographql.websocket on _inputStreamCallbackFunc #3390

aaronbarsky · 2024-06-10T16:55:46Z

Summary

Crashlytics is reporting a very rare but consistent crash on the com.apollograph.websocket queue.

Crashed: com.apollographql.websocket
EXC_BAD_ACCESS KERN_INVALID_ADDRESS 0x0000000f9d3993a0
0 libobjc.A.dylib objc_retain + 16
1 libobjc.A.dylib objc_retain_x0 + 16
2 CoreFoundation _inputStreamCallbackFunc + 60
3 CoreFoundation _signalEventSync + 216
4 CoreFoundation ___signalEventQueue_block_invoke + 28
5 libdispatch.dylib _dispatch_call_block_and_release + 32

The app is a mix of 60% foreground and 40% background.

Version

1.9.2

Steps to reproduce the behavior

Unfortunately I only have crash logs

Logs

No response

Anything else?

My hunch is that it's the FoundationStream teardown sequence:

 func cleanup() {
    if let stream = inputStream {
      stream.delegate = nil
      CFReadStreamSetDispatchQueue(stream, nil)
      stream.close()
    }

stream.delegate is an unowned(unsafe) var.

There are several codepaths where cleanup is not called on com.apollographql.websocket. When this happens, delegate can be set to nil as the CFReadStream is attempting to invoke a callback on the deallocated delegate.

The text was updated successfully, but these errors were encountered:

calvincestari · 2024-06-10T20:34:53Z

Hi @aaronbarsky - you mentioned this being consistent; do you know the conditions or sequence of events that leads to the crash. Being able to have a reproduction case would be immensely helpful to resolving this.

aaronbarsky · 2024-06-10T21:12:38Z

By consistent, I mean that the call stack is exactly the same in each case. It happens about 1/500 sessions.

The challenge is that the start of the crash is

CoreFoundation _inputStreamCallbackFunc

which makes it quite tough to write a unit test or minimally reproducible example.

AnthonyMDev · 2024-08-09T17:57:24Z

We're going to really struggle to create or verify a fix for this since, as you said, its difficult to reproduce it.

There are several codepaths where cleanup is not called on com.apollographql.websocket. When this happens, delegate can be set to nil as the CFReadStream is attempting to invoke a callback on the deallocated delegate.

What codepaths are you referring to here? If there are some obvious areas where we are creating unsafe memory access, I'd suggest trying to fork the library, fix the cleanup() methods calls and ship a new version. If that prevents the crashes, we'd be happy to take a PR with the fixes you've made?

aaronbarsky · 2024-09-03T17:00:10Z

I worked around the problem with an explicit call to closeConnection.

Previously I had

func makeWebSocketClient(connectionHeaders:JSONEncodableDictionary) -> ApolloClientProtocol {
   let configuration = WebSocketTransport.Configuration(
            reconnect: false,
            connectingPayload: connectionHeaders)

    let webSocketClient = WebSocket(
            url: ...,
            protocol: .graphql_transport_ws)

    let wsTransport = WebSocketTransport(
            websocket: webSocketClient,
            config: configuration)

    return ApolloClient(networkTransport: wsTransport, store: ApolloStore(cache: InMemoryNormalizedCache()))
}

    // use client
    apolloClient = nil

Now I keep an explicit reference to the networkTransport so I can close it before deallocating the client

   let configuration = ...
   let webSocketClient = ...
   self.wsTransport = ...
   return  apolloClient = ...

   // use client
   self.wsTransport.closeConnection()
   self.wsTransport = nil
   self.apolloClient = nil

(Note that networkTransport is internal to ApolloClient so I need to keep the extra reference for access to the closeConnection function)

Fixing it in the library would require a lot of changes, that could be complex.

The problem is that FoundationStream marks the delgate as @unowned(unsafe) instead of @weak. This means that when the delegate is deallocated, rather than skipping the callback, it crashes. To avoid this, we need to make sure that the delegate is only released after close has been called on the callback queue.

Current situation

   FoundationStreamQueue         DelegateCallbackQueue          ClientQueue
                                                                apolloClient = nil
    receiveData
    scheduleCallbackOnDelegate                                  deinit
                                                                cleanupStream
                                                                cleanup
                                                                delegate deallocated
                                 delegate.stream(handle:)
                                 Crash()

My theoretical fix would look something like

   FoundationStreamQueue         DelegateCallbackQueue          ClientQueue
                                                                apolloClient = nil
    receiveData
    scheduleCallbackOnDelegate                                  deinit
                                                                cleanupStream
                                                                cleanup
                                                                capture strong reference to delegate in work block
                                                                schedule block on StreamCallbackQueue
                                 delegate.stream(handle:)
                                 complete work block releasing last reference
                                 delegate deallocated

The tricky bit is that you can't make strong references to self in the deinit. So rather than the delegate being an instance of Apollo.FoundationStream, it would have to be a separate object with a weak reference back to FoundationStream.

I'm too busy to make an attempt on this now. Perhaps if this code doesn't change much in Apollo 2.0, I can give it another look in the future.

calvincestari · 2024-09-11T19:09:32Z

Thanks for the great debugging @aaronbarsky.

I'm too busy to make an attempt on this now. Perhaps if this code doesn't change much in Apollo 2.0, I can give it another look in the future.

The current StarScream-based websocket implementation will be removed from Apollo iOS; it's old, unsupported and hindering progress. Ideally we build a default websocket implementation using iOS native websocket APIs but allow users to swap in whatever websocket library they choose to use. When that happens is still undecided though. We're trying to keep the scope of 2.0 breaking changes as small as possible and it really depends how much work is needed here to achieve that.

pixelmatrix · 2024-09-12T18:15:17Z

I'm seeing a handful of crashes like this as well in my app

pixelmatrix · 2024-09-12T19:15:02Z

@aaronbarsky Isn't setting the delegate to nil supposed to set the delegate back to the stream instance itself? I'm not seeing this happening in the debugger, but that's what the documentation says:

By a default, a stream object must be its own delegate; so a delegate message with an argument of nil should restore this delegate. Don’t retain the delegate to prevent retain cycles.

If that's the case, could the crash be that inputStream or outputStream are getting set to nil, thus the delegate is nil when the system attempts to call it?

On my end, the crashes seem to be happening because I call pauseWebSocketConnection when the app goes into the background, which eventually triggers cleanup, but leaves the Apollo.FoundationStream stream instance in place. The subscriptions can be pretty active, so I'm thinking the crash is happening when a pending message is queued up while we pause the connection.

aaronbarsky added bug Generally incorrect behavior needs investigation labels Jun 10, 2024

AnthonyMDev added the apollo-websockets label Jun 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Crash on com.apollographql.websocket on _inputStreamCallbackFunc #3390

Crash on com.apollographql.websocket on _inputStreamCallbackFunc #3390

aaronbarsky commented Jun 10, 2024

calvincestari commented Jun 10, 2024

aaronbarsky commented Jun 10, 2024

AnthonyMDev commented Aug 9, 2024

aaronbarsky commented Sep 3, 2024

calvincestari commented Sep 11, 2024 •

edited

Loading

pixelmatrix commented Sep 12, 2024

pixelmatrix commented Sep 12, 2024 •

edited

Loading

Crash on com.apollographql.websocket on _inputStreamCallbackFunc #3390

Crash on com.apollographql.websocket on _inputStreamCallbackFunc #3390

Comments

aaronbarsky commented Jun 10, 2024

Summary

Version

Steps to reproduce the behavior

Logs

Anything else?

calvincestari commented Jun 10, 2024

aaronbarsky commented Jun 10, 2024

AnthonyMDev commented Aug 9, 2024

aaronbarsky commented Sep 3, 2024

calvincestari commented Sep 11, 2024 • edited Loading

pixelmatrix commented Sep 12, 2024

pixelmatrix commented Sep 12, 2024 • edited Loading

calvincestari commented Sep 11, 2024 •

edited

Loading

pixelmatrix commented Sep 12, 2024 •

edited

Loading