segmentio / analytics-swift

The hassle-free way to add Segment analytics to your Swift app (iOS/tvOS/watchOS/macOS/Linux).

Repository from Github https://github.comsegmentio/analytics-swiftRepository from Github https://github.comsegmentio/analytics-swift

Storage calls seem to result in main thread deadlocks

tristan-warner-smith opened this issue · comments

Describe the bug

  • We've updated from the previous SDK Analytics iOS to Analytics Swift and we're experiencing regular app hangs
  • During a hang, pausing execution shows three threads accessing Storage functions that use the blocking syncQueue.sync.
  • We see:
    • Storage.write as a result of a track call where we're making a call to Analytics.track(name:properties:) from the main thread
    • Storage.remove from HttpClient.startBatchUpload > SegmentDestination.flush > Storage.remove
    • Storage.write as a result of another track call, this one from the sync.segment.com serial queue

See attached images.

To Reproduce

  • The behaviour manifests often when switching to other apps and back but often immediately hangs the app on start, leaving it unusable

Our initialisation is:

let analytics = Analytics(
    configuration: Configuration(
        writeKey: Config.shared.segmentSourceKey
    )
    .trackApplicationLifecycleEvents(true)
    .flushAt(3)
    .flushInterval(10)
)
analytics.add(plugin: FirebaseDestination())

let engage = TwilioEngage { _, _ in }
analytics.add(plugin: engage)
analytics.add(plugin: CustomizeSegmentTrackCalls())

Initialisation is lazy on the first track event we fire.

Expected behavior

  • The app should never hang.

Screenshots
Main thread Storage.write

Other Storage calls

Platform (please complete the following information):

  • Library Version in use: 1.4.8
  • Platform being tested: iOS 17
  • Integrations in use: SegmentFirebase, Twilio Engage
  • main commit c4bb71dea0b38c179b2d87b56ee096a06ce2ea86

Additional context

  • The workaround we've had to adopt in the very short-term before you address this bug is to fork the repo and make Storage.write call syncQueue.async rather than syncQueue.sync. This is obviously not ideal, we aren't domain experts in your codebase, but in our testing this change removes all the main-thread blocking issues we encountered and still seems to send Segment events as expected.
  • Both track calls we make are early on in the app's initialisation.
  • We can share a video walking through the stack trace at the point in time of the images above, we'll want to share that privately though.
  • Sometimes it may just be two calls, not three but the end result is the same.

Thanks @tristan-warner-smith, I'll have a look at this and see if I can reproduce. If you could give me an example of the track calls your sending, that would be helpful. There were some instances where large IO on the main thread sent to the storage class could result in something that takes long enough for the OS to kill the app. You might pull main and see if that improves things for you in the meantime.

However, your current work around will almost certainly result in corrupt data given time. Those sync's are there on purpose to synchronize writes to a single file from potentially multiple threads.

However, your current work around will almost certainly result in corrupt data given time. Those sync's are there on purpose to synchronize writes to a single file from potentially multiple threads.

  • I assumed as much, the issue is we now have a release in production that we get no telemetry for, no Crashlytics crash reports and we know when this issue occurs the app is completely unusable.

Our users depend on us for immediate relief from pain, stress etc so we can't wait for perfect in this case.

If you can think of any better workarounds @bsneed that don't jeopardise our app's usability, we'd really appreciate it.

These are similar events emitted roughly on app start:

Name: abcd_abcdefghijk_abcdefj
Parameters: [
	"userId": "abcdefghijklmnopqrstuvwxyz01",
	"is_first_session": true,
	"platform": "iOS",
	"trialStart": false,
	"clicked_branch_link": false,
	"appVersion": "3.20.0"
]

Name: "abcdefghij_abcde"
Parameters: [
	"trialStart": false,
	"platform": "iOS",
	"appVersion": "3.20.0",
	"userId": "abcdefghijklmnopqrstuvwxyz01"
]

Name: "abcdefghijklmno_abcde"
Parameters: [
	"userId": "abcdefghijklmnopqrstuvwxyz01",
	"variant_id": "baseline",
	"appVersion": "3.20.0",
	"event_timestamp": "2023-11-23T22:49:46.669Z",
	"trialStart": false,
	"experiment_id": "ab_abcdef_abcdefgh_abc",
	"platform": "iOS"
]

Name: "abcdefghijklmno_abcde"
Parameters: [
	"userId": "abcdefghijklmnopqrstuvwxyz01",
	"experiment_id": "ab_abcde_ab_abcd",
	"appVersion": "3.20.0",
	"event_timestamp": "2023-11-23T22:49:46.669Z",
	"trialStart": false,
	"variant_id": "baseline",
	"platform": "iOS"
]

Name: "abcdefghijklmno_abcde"
Parameters: [
	"userId": "abcdefghijklmnopqrstuvwxyz01",
	"experiment_id": "ab_abcd_abcdefghijkl_abc_ab",
	"platform": "iOS",
	"trialStart": false,
	"variant_id": "baseline",
	"event_timestamp": "2023-11-23T22:49:46.669Z",
	"appVersion": "3.20.0"
]

Thanks @tristan-warner-smith! To clarify, I was just sharing information with you. You should always do (as you are already) what's best for you and your customers and I would never suggest otherwise. We'll dive into this on Monday, but since you have a repro scenario it'd be useful to check out main.

Could you outline a potential scenario that represents "However, your current work around will almost certainly result in corrupt data given time." Ie: is it about ordering of events/losing events

@johncDepop it's about 2 threads trying to write to a json file at the same time potentially. ie: thread1 is mid-way through writing an event, thread2 writes an event in the middle, you might end up with {myEvent="someThin{myEvent="someOtherThing"},g"}, instead of {myEvent="someThing"},{myEvent="someOtherThing"},

Hi @bsneed, just to add to the context @tristan-warner-smith posted earlier, I thought it might be useful to add the following screenshots of the app hanging when we background and foreground the app. We've updated our fork to be in sync with main and have addressed an instance of potential large IO on the main thread being sent to the storage class on app launch, but we still see this issue. Can you please advise?

Screenshot 2023-11-27 at 09 08 52
Screenshot 2023-11-27 at 09 06 36
Screenshot 2023-11-27 at 09 08 32

It appears I was incorrect regarding the sync->async thing causing issues. I've built some tests and it's working as it should despite the move to async. Change coming soon. I do think the deadlock there is a symptom of something else, such as a large amount of data being written, etc.

Ok, I stand corrected. In theory it'd work on a serial queue when mixing sync/async. Once I make the change, the thread sanitizer identifies a half dozen deadlocks. I won't be able to do anything about this until I have a reproduction scenario unfortunately. Given that it's all around fileIO, you might look into the content of your events.

It appears I was incorrect regarding the sync->async thing causing issues. I've built some tests and it's working as it should despite the move to async. Change coming soon. I do think the deadlock there is a symptom of something else, such as a large amount of data being written, etc.

I shared the exact amount of data we were passing through Segment above (subjectively it doesn't seem like a lot of event data), we've subsequently reduced it down to just 2 events tracked initially and had the same result. Note that the only related change we made to our codebase was upgrading from the previous iOS SDK to this one.

Beyond that we're using a SwiftUI-based project and initialising a Segment singleton lazily with the first track call.
We use Firebase + Firestore SDK 10.17, it likely does some writes on init but we obviously don't control that.

We'll attempt to create a repro scenario, when we can prioritise the time for it, but the sample project you linked seemed to only have pre-app start synchronous event dispatch (in the app delegate), no post app-start events and no async scenarios from what I saw so it doesn't seem like it would be comprehensive enough to represent the real-world. We mostly interact with the Segment SDK via unstructured Tasks rather than DispatchQueue so if there are specific thread or QoS requirements to use the SDK please let us know as in some cases we'll be calling from MainActor.

If you can try to replicate the environment described above you may see the issue. Let us know if you make any progress, we appreciate all the help you've given us so far to help us onboard the new product. It's really key for the stability of our users.

@bsneed I'd appreciate you revisiting this bug report, we tried updating to the latest 1.55 release, given the mention of concurrency fixes to Sovran-Swift but the issue we mentioned above still manifests.

That of the main thread deadlocking due here where the Storage.syncQueue seems to run at the same dos as DispatchQueue.main, where as you know, calling DispatchQueue.main.sync results in the deadlock we're seeing.

On 1.55 we're now exclusively seeing this issue after the app has been backgrounded for ~2 minutes and brought back to the foreground.

image

I've got a PR coming in the next week or so that removes this whole storage mechanism and replaces it with something quite a bit simpler and less error prone. I'll reopen this until that lands and we can revisit.

Closing this unless something else comes up. The connected PR went out with 1.5.6.

We've been able to migrate back to main with these changes, we've only been live two days but not noticed any hangs on app start so far. We're experiencing the network failures mentioned in a separate issue but our Segment analytics seem to be unimpacted so far 🤞

That's great to hear! Looking into other one in tandem.