Storage calls seem to result in main thread deadlocks

Question

tristan-warner-smith opened this issue 2 years ago · comments

Tristan Warner-Smith commented 2 years ago

Describe the bug

We've updated from the previous SDK (Analytics iOS) to Analytics Swift and we're experiencing regular app hangs.
During a hang, pausing execution shows three threads accessing Storage functions that use the blocking syncQueue.sync.
We see:
- Storage.write as a result of a track call where we're making a call to Analytics.track(name:properties:) from the main thread
- Storage.remove from HttpClient.startBatchUpload > SegmentDestination.flush > Storage.remove
- Storage.write as a result of another track call, this one from the sync.segment.com serial queue

See attached images.

To Reproduce

The behaviour often manifests when switching to other apps and back, but it also frequently hangs the app immediately on start, leaving it unusable.

Our initialisation is:

let analytics = Analytics(
    configuration: Configuration(
        writeKey: Config.shared.segmentSourceKey
    )
    .trackApplicationLifecycleEvents(true)
    .flushAt(3)
    .flushInterval(10)
)
analytics.add(plugin: FirebaseDestination())

let engage = TwilioEngage { _, _ in }
analytics.add(plugin: engage)
analytics.add(plugin: CustomizeSegmentTrackCalls())

Initialisation is lazy on the first track event we fire.

Expected behavior

The app should never hang.

Screenshots

Platform (please complete the following information):

Library Version in use: 1.4.8
Platform being tested: iOS 17
Integrations in use: SegmentFirebase, Twilio Engage
main commit c4bb71dea0b38c179b2d87b56ee096a06ce2ea86

Additional context

The workaround we've had to adopt in the very short term, before you address this bug, is to fork the repo and make Storage.write call syncQueue.async rather than syncQueue.sync. This is obviously not ideal, as we aren't domain experts in your codebase, but in our testing this change removes all the main-thread blocking issues we encountered and still seems to send Segment events as expected.
Both track calls we make are early on in the app's initialisation.
We can share a video walking through the stack trace at the point in time of the images above, we'll want to share that privately though.
Sometimes it may just be two calls, not three but the end result is the same.
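
For readers following along, the difference between the two calls in the workaround described above can be sketched roughly as follows. This is a minimal illustration, not the SDK's code: `SketchStorage`, its queue label, and its methods are hypothetical stand-ins, and an in-memory array takes the place of the file.

```swift
import Foundation

// Hypothetical stand-in for a storage class guarded by a serial queue.
// This is NOT the Analytics Swift implementation.
final class SketchStorage {
    private let syncQueue = DispatchQueue(label: "sketch.storage.sync")
    private var lines: [String] = []

    // Blocking variant: the caller waits until the write has finished.
    func writeSync(_ line: String) {
        syncQueue.sync { lines.append(line) }
    }

    // Non-blocking variant: the caller returns immediately and the write
    // runs later, still one at a time because the queue is serial.
    func writeAsync(_ line: String) {
        syncQueue.async { self.lines.append(line) }
    }

    // Reads go through the same queue, so they see every earlier write.
    func snapshot() -> [String] {
        syncQueue.sync { lines }
    }
}

let storage = SketchStorage()
storage.writeAsync("a")
storage.writeSync("b")
storage.writeAsync("c")
print(storage.snapshot()) // ["a", "b", "c"]
```

Because the queue is serial, work items run in the order they were enqueued regardless of whether they were submitted with sync or async. What changes is only whether the calling thread waits.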

Brandon Sneed · Answer 1 · Fri Nov 24 2023 03:39:09 GMT+0800 (China Standard Time)

Thanks @tristan-warner-smith, I'll have a look at this and see if I can reproduce. If you could give me an example of the track calls you're sending, that would be helpful. There were some instances where large IO on the main thread sent to the storage class could result in something that takes long enough for the OS to kill the app. You might pull main and see if that improves things for you in the meantime.

However, your current workaround will almost certainly result in corrupt data given time. Those syncs are there on purpose to synchronize writes to a single file from potentially multiple threads.

Tristan Warner-Smith · Answer 2 · Fri Nov 24 2023 07:03:47 GMT+0800 (China Standard Time)

> However, your current workaround will almost certainly result in corrupt data given time. Those syncs are there on purpose to synchronize writes to a single file from potentially multiple threads.

I assumed as much. The issue is that we now have a release in production for which we get no telemetry and no Crashlytics crash reports, and we know that when this issue occurs the app is completely unusable.

Our users depend on us for immediate relief from pain, stress etc so we can't wait for perfect in this case.

If you can think of any better workarounds @bsneed that don't jeopardise our app's usability, we'd really appreciate it.

These are similar events emitted roughly on app start:

Name: "abcd_abcdefghijk_abcdefj"
Parameters: [
	"userId": "abcdefghijklmnopqrstuvwxyz01",
	"is_first_session": true,
	"platform": "iOS",
	"trialStart": false,
	"clicked_branch_link": false,
	"appVersion": "3.20.0"
]

Name: "abcdefghij_abcde"
Parameters: [
	"trialStart": false,
	"platform": "iOS",
	"appVersion": "3.20.0",
	"userId": "abcdefghijklmnopqrstuvwxyz01"
]

Name: "abcdefghijklmno_abcde"
Parameters: [
	"userId": "abcdefghijklmnopqrstuvwxyz01",
	"variant_id": "baseline",
	"appVersion": "3.20.0",
	"event_timestamp": "2023-11-23T22:49:46.669Z",
	"trialStart": false,
	"experiment_id": "ab_abcdef_abcdefgh_abc",
	"platform": "iOS"
]

Name: "abcdefghijklmno_abcde"
Parameters: [
	"userId": "abcdefghijklmnopqrstuvwxyz01",
	"experiment_id": "ab_abcde_ab_abcd",
	"appVersion": "3.20.0",
	"event_timestamp": "2023-11-23T22:49:46.669Z",
	"trialStart": false,
	"variant_id": "baseline",
	"platform": "iOS"
]

Name: "abcdefghijklmno_abcde"
Parameters: [
	"userId": "abcdefghijklmnopqrstuvwxyz01",
	"experiment_id": "ab_abcd_abcdefghijkl_abc_ab",
	"platform": "iOS",
	"trialStart": false,
	"variant_id": "baseline",
	"event_timestamp": "2023-11-23T22:49:46.669Z",
	"appVersion": "3.20.0"
]

Brandon Sneed · Answer 3 · Fri Nov 24 2023 07:43:21 GMT+0800 (China Standard Time)

Thanks @tristan-warner-smith! To clarify, I was just sharing information with you. You should always do (as you are already) what's best for you and your customers and I would never suggest otherwise. We'll dive into this on Monday, but since you have a repro scenario it'd be useful to check out main.

John Cragg · Answer 4 · Fri Nov 24 2023 18:18:12 GMT+0800 (China Standard Time)

Could you outline a potential scenario that represents "However, your current workaround will almost certainly result in corrupt data given time."? I.e., is it about ordering of events or losing events?

Brandon Sneed · Answer 5 · Sat Nov 25 2023 02:27:35 GMT+0800 (China Standard Time)

@johncDepop it's about two threads potentially trying to write to a JSON file at the same time. For example, if thread1 is midway through writing an event and thread2 writes an event in the middle, you might end up with {myEvent="someThin{myEvent="someOtherThing"},g"}, instead of {myEvent="someThing"},{myEvent="someOtherThing"},
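
The interleaving described above can be illustrated with a small sketch. It assumes a hypothetical `EventLog` type that appends each event in two chunks, mimicking a write that could be interrupted halfway. It uses well-formed JSON strings and an in-memory buffer instead of a file, so it only demonstrates the principle of serialising a multi-step write, not the SDK's storage code.

```swift
import Foundation

// Hypothetical event log, not part of any SDK.
final class EventLog {
    private let queue = DispatchQueue(label: "sketch.eventlog")
    private var buffer = ""

    func append(event: String) {
        // The whole two-chunk write runs as one unit on the serial queue,
        // so no other writer can insert text between the two chunks.
        queue.sync {
            let mid = event.index(event.startIndex, offsetBy: event.count / 2)
            buffer.append(contentsOf: event[..<mid])
            buffer.append(contentsOf: event[mid...])
            buffer.append(",")
        }
    }

    func contents() -> String {
        queue.sync { buffer }
    }
}

let log = EventLog()

// Eight writers, potentially on different threads, append concurrently.
DispatchQueue.concurrentPerform(iterations: 8) { i in
    log.append(event: "{\"myEvent\":\"thing\(i)\"}")
}

// Each event arrives intact; the order of events may vary between runs.
let events = log.contents().split(separator: ",")
print(events.count) // 8
```

Without the `queue.sync` wrapper, two threads mutating `buffer` at once would be a data race, and the two chunks of one event could end up separated by another event's text.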

Mo Ahmad · Answer 6 · Mon Nov 27 2023 18:44:26 GMT+0800 (China Standard Time)

Hi @bsneed, just to add to the context @tristan-warner-smith posted earlier, I thought it might be useful to add the following screenshots of the app hanging when we background and foreground the app. We've updated our fork to be in sync with main and have addressed an instance of potential large IO on the main thread being sent to the storage class on app launch, but we still see this issue. Can you please advise?

Brandon Sneed · Answer 7 · Thu Nov 30 2023 02:54:24 GMT+0800 (China Standard Time)

It appears I was incorrect regarding the sync->async thing causing issues. I've built some tests and it's working as it should despite the move to async. Change coming soon. I do think the deadlock there is a symptom of something else, such as a large amount of data being written, etc.

Brandon Sneed · Answer 8 · Thu Nov 30 2023 03:10:54 GMT+0800 (China Standard Time)

Ok, I stand corrected. In theory it'd work on a serial queue when mixing sync/async. Once I make the change, the thread sanitizer identifies a half dozen deadlocks. I won't be able to do anything about this until I have a reproduction scenario unfortunately. Given that it's all around fileIO, you might look into the content of your events.

Tristan Warner-Smith · Answer 9 · Fri Dec 01 2023 00:46:09 GMT+0800 (China Standard Time)

> It appears I was incorrect regarding the sync->async thing causing issues. I've built some tests and it's working as it should despite the move to async. Change coming soon. I do think the deadlock there is a symptom of something else, such as a large amount of data being written, etc.

I shared the exact amount of data we were passing through Segment above (subjectively it doesn't seem like a lot of event data). We've subsequently reduced it down to just 2 events tracked initially and had the same result. Note that the only related change we made to our codebase was upgrading from the previous iOS SDK to this one.

Beyond that we're using a SwiftUI-based project and initialising a Segment singleton lazily with the first track call.
We use Firebase + Firestore SDK 10.17, it likely does some writes on init but we obviously don't control that.

We'll attempt to create a repro scenario when we can prioritise the time for it. However, from what I saw, the sample project you linked only has synchronous event dispatch before app start (in the app delegate), with no post-app-start events and no async scenarios, so it doesn't seem comprehensive enough to represent the real world. We mostly interact with the Segment SDK via unstructured Tasks rather than DispatchQueue, so if there are specific thread or QoS requirements for using the SDK please let us know, as in some cases we'll be calling from MainActor.

If you can try to replicate the environment described above you may see the issue. Let us know if you make any progress, we appreciate all the help you've given us so far to help us onboard the new product. It's really key for the stability of our users.

Tristan Warner-Smith · Answer 10 · Fri Mar 01 2024 18:27:49 GMT+0800 (China Standard Time)

@bsneed I'd appreciate you revisiting this bug report. We tried updating to the latest 1.5.5 release, given the mention of concurrency fixes to Sovran-Swift, but the issue we mentioned above still manifests.

That is, the main thread deadlocks here, where Storage.syncQueue seems to run at the same QoS as DispatchQueue.main and where, as you know, calling DispatchQueue.main.sync from the main thread results in the kind of deadlock we're seeing.

On 1.5.5 we're now exclusively seeing this issue after the app has been backgrounded for ~2 minutes and brought back to the foreground.
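
As background for the DispatchQueue.main.sync comparison above: calling sync on a serial queue from code that is already running on that queue can never complete, because the queue waits for the current work item while that item waits for the queue. The sketch below shows one common way to guard against that re-entrancy using a queue-specific key. It is an illustration only, not a diagnosis of the SDK; `storageQueue`, `onStorageQueue`, and `onStorage` are hypothetical names.

```swift
import Foundation

// Hypothetical serial queue guarding some shared resource.
let storageQueue = DispatchQueue(label: "sketch.storage")

// Tag the queue so code can tell whether it is already running on it.
let onStorageQueue = DispatchSpecificKey<Bool>()
storageQueue.setSpecific(key: onStorageQueue, value: true)

// Runs `work` on the storage queue. If the caller is already on that
// queue, the work runs inline instead of calling sync again, which
// would otherwise block forever.
func onStorage<T>(_ work: () -> T) -> T {
    if DispatchQueue.getSpecific(key: onStorageQueue) == true {
        return work()
    }
    return storageQueue.sync(execute: work)
}

// The nested call would deadlock with a plain storageQueue.sync;
// with the guard it simply runs inline.
let result = onStorage {
    onStorage { 42 }
}
print(result) // 42
```

A guard like this only covers direct re-entrancy on one queue. It does not help when two different queues each wait synchronously on the other.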

Brandon Sneed · Answer 11 · Sat Mar 02 2024 01:40:23 GMT+0800 (China Standard Time)

I've got a PR coming in the next week or so that removes this whole storage mechanism and replaces it with something quite a bit simpler and less error prone. I'll reopen this until that lands and we can revisit.

Brandon Sneed · Answer 12 · Wed Mar 20 2024 05:28:42 GMT+0800 (China Standard Time)

Closing this unless something else comes up. The connected PR went out with 1.5.6.

Tristan Warner-Smith · Answer 13 · Tue Mar 26 2024 23:41:10 GMT+0800 (China Standard Time)

We've been able to migrate back to main with these changes. We've only been live two days but haven't noticed any hangs on app start so far. We're experiencing the network failures mentioned in a separate issue, but our Segment analytics seem to be unimpacted so far 🤞

Brandon Sneed · Answer 14 · Wed Mar 27 2024 00:43:32 GMT+0800 (China Standard Time)

That's great to hear! Looking into the other one in tandem.