DataDog / datadog-lambda-extension


Increased Lambda duration during March 8th incident

santiagoaguiar opened this issue

During the March 8th, 2023 incident, our Lambdas' average execution duration increased significantly (roughly 2x-3x), causing an increase in concurrency and additional load across the board. These Lambdas are executed thousands of times per minute and normally take ~90 ms on average to complete. We ended up disabling the Lambda extension to restore our normal duration.
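
For context on why the duration increase translated directly into higher concurrency, here is a rough back-of-the-envelope estimate based on Little's Law (concurrency ~= arrival rate x average duration); the 2,000 invocations/minute figure is an assumption for illustration, not our real traffic:

  # Rough Little's Law estimate: concurrency ~= arrival rate x average duration.
  # The 2,000 invocations/minute figure is an assumed example, not real traffic.
  invocations_per_minute = 2_000
  arrival_rate_per_sec = invocations_per_minute / 60    # ~33 invocations/sec

  normal_duration_s = 0.090                             # ~90 ms average
  incident_duration_s = normal_duration_s * 3           # ~3x increase during the incident

  print(f"normal:   ~{arrival_rate_per_sec * normal_duration_s:.0f} concurrent executions")
  print(f"incident: ~{arrival_rate_per_sec * incident_duration_s:.0f} concurrent executions")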

Looking at https://github.com/DataDog/datadog-lambda-extension#overhead, it seems we shouldn't have seen this. My interpretation is that roughly one invocation per minute would be expected to have a longer-than-normal duration while it flushes the buffered metrics/spans, but most invocations should have kept running at their usual speed.
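
To make that interpretation concrete, here is a minimal sketch of the periodic-flush model I have in mind; the function names, the 60-second interval, and the placeholder backend call are illustrative assumptions, not the extension's actual implementation:

  import time

  FLUSH_INTERVAL_S = 60.0   # assumed flush interval, for illustration only
  _buffer = []              # metrics/spans buffered between flushes
  _last_flush = time.monotonic()

  def record(payload):
      """Buffering a payload is cheap and adds negligible latency."""
      _buffer.append(payload)

  def send_to_backend(items):
      # Placeholder for the HTTP submission to the intake; the slow part.
      pass

  def maybe_flush():
      """Only the invocation that crosses the interval pays the flush cost."""
      global _last_flush
      if time.monotonic() - _last_flush >= FLUSH_INTERVAL_S:
          send_to_backend(_buffer)
          _buffer.clear()
          _last_flush = time.monotonic()

  def do_work(event):
      # Hypothetical business logic (~90 ms in our case).
      return {"ok": True}

  def handler(event, context):
      record({"metric": "invocation", "value": 1})
      result = do_work(event)
      maybe_flush()             # most invocations return here immediately
      return result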

I wanted to check on that interpretation, see whether there are any other reasons that could have caused such an increase in average duration, and find out whether there is anything we could do to prevent this in the future, in light of today's incident.

This is our current configuration for the extension:

  enableDDTracing: false
  # logs are forwarded from CW to DD
  enableDDLogs: false
  subscribeToAccessLogs: false
  # as tracing is disabled, do not add DD context to logs
  injectLogContext: false

Thank you, and #hugops, as I bet this was a hard one!

@santiagoaguiar Thanks for reporting what you experienced during the incident! That is extremely valuable for us, as we are still actively investigating the exact impact on our serverless customers during the incident. Do you mind following up in another week or so? I believe we will have something concrete to share by then.

@santiagoaguiar We were able to identify a few places in the Datadog Agent where the existing retry and buffer logic was not optimized for serverless. We are looking into potential improvements in Q2.
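
For readers following along, here is one purely hypothetical sketch (not the Agent's actual code; the function names and parameters are assumptions) of how retry logic tuned for a long-running host Agent can inflate invocation duration during a backend outage, compared with a serverless-friendly path that bounds the time spent and re-buffers instead of blocking:

  import time
  import random

  def flush_with_blocking_retries(send, payload, max_attempts=5):
      """Host-style behavior: keep retrying with backoff until it works.
      During a backend outage this can add many seconds to a single flush."""
      for attempt in range(max_attempts):
          if send(payload):
              return True
          time.sleep(min(2 ** attempt, 30))  # exponential backoff, blocking
      return False

  def flush_with_deadline(send, payload, deadline_s=0.5):
      """Serverless-friendly behavior: cap the time spent flushing and give up,
      letting the caller re-buffer the data, so invocation duration stays flat."""
      start = time.monotonic()
      while time.monotonic() - start < deadline_s:
          if send(payload):
              return True
      return False  # caller can re-buffer the payload for a later flush

  def flaky_send(payload):
      # Stand-in for the intake endpoint mostly failing during an incident.
      return random.random() < 0.05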