DataDog / datadog-lambda-js

The Datadog AWS Lambda Library for Node


Cold start is unacceptably slow

shyouhei opened this issue · comments

Expected Behavior

This is the trace I get (using AWS X-Ray) without the Datadog-Node18-x layer:

Actual Behavior

This is the trace I actually get (note the flame graph is in seconds, not in milliseconds):

Steps to Reproduce the Problem

I have uploaded a repo to reproduce:
https://github.com/shyouhei/datadog-agent-cold-start-issue

Specifications

  • Datadog Lambda Layer version: 7.90.0
  • Node version: INIT_START Runtime Version: nodejs:18.v5 Runtime Version ARN: arn:aws:lambda:ap-northeast-1::runtime:c869d752e4ae21a3945cfcb3c1ff2beb1f160d7bcec3b0a8ef7caceae73c055f

Stacktrace

Paste here

Hi @shyouhei - thanks for reaching out! I appreciate you including a repo! I have several questions which I hope will help narrow down the possible contributing factors.

The cold start of your function went from 215ms to 910ms, but the actual function duration in the latter case was a full 3 seconds. Given that the handler code you provided is essentially empty, I presume the issue you're raising is the significantly increased function duration rather than the initialization duration specifically - is that correct?

Your reproduction case doesn't specify the configured memory, and I can't tell from your screenshot how much memory was consumed. If your function is configured with 128 MB and is also running the Datadog Lambda Extension, I'd suggest increasing the configured memory to at least 256 MB; that alone may solve the issue.
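Purely as a sketch of what that change might look like (shown here for the Serverless Framework; the function name and value are illustrative, and in a Terraform setup the equivalent is the aws_lambda_function memory_size argument):

functions:
  hello:
    handler: handler.hello
    memorySize: 256   # raise above Lambda's 128 MB default; adjust as needed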

Additionally, to rule out possible interactions with X-Ray, can you disable X-Ray and try to reproduce the issue using Datadog tracing instead?
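For reference, a rough sketch of that setup with serverless-plugin-datadog would be to leave X-Ray's active tracing off and let the plugin handle Datadog tracing (the option names are the plugin's; the values here are just illustrative):

provider:
  tracing:
    lambda: false             # leave AWS X-Ray active tracing disabled

custom:
  datadog:
    apiKey: ${env:DD_API_KEY}
    enableDDTracing: true     # Datadog APM tracing
    enableXrayTracing: false  # don't merge X-Ray traces into Datadog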

As a further troubleshooting step, it would be helpful to separate this library from the agent.
The Terraform template you've provided applies both the Datadog Lambda Extension (which is the Datadog agent) as well as this library, datadog-lambda-js. As a test, we could remove the Datadog Lambda Extension and use the Datadog Lambda Forwarder instead. This would help eliminate any post-runtime duration incurred by transmitting telemetry data from ap-northeast-1 back to us-east-1. Could you try that and see if it resolves the issue?
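A rough sketch of that test with serverless-plugin-datadog (the forwarder ARN below is only a placeholder for your own deployed Datadog Lambda Forwarder) might look like:

custom:
  datadog:
    addExtension: false       # don't attach the Datadog Lambda Extension layer
    # placeholder ARN - point this at your deployed Datadog Lambda Forwarder
    forwarderArn: arn:aws:lambda:ap-northeast-1:123456789012:function:datadog-forwarder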

If this does reduce latency, I'd suggest testing this using our new datacenter in Japan. That should help reduce geographically-induced latency.
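If you end up testing that, switching sites should just be a matter of pointing the configuration at the AP1 site - a sketch for serverless-plugin-datadog, where I'm assuming ap1.datadoghq.com as the site value for the Japan datacenter:

custom:
  datadog:
    apiKey: ${env:DD_API_KEY}
    site: ap1.datadoghq.com   # assumed site value for Datadog's Japan (AP1) datacenter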

Finally, it would be good to understand a few other factors, which I can't see from your screenshot. How much memory is this function configured with?

Thanks!

Hi @shyouhei!

I've attempted to reproduce this with Datadog tracing instead of X-Ray, and was not able to reproduce the issue. The function uses the handler code you provided, runs on nodejs16.x in ap-northeast-1, and sends telemetry data back to Datadog's US1 datacenter.

The duration of the function was 2.83ms, and it seems roughly another 100ms was required after the Lambda invocation finished to flush telemetry data back to Datadog:
[screenshot]

The cold start was around 800ms:
[screenshot]

I think the next course of action would be to try what I've done here (using Datadog tracing instead of X-Ray) and see if that resolves the issue.

I did this using the Serverless Framework, as I don't have Terraform set up, but the outcome should be identical. You can reproduce it with this template, deploying with DD_API_KEY=<yourkey> serverless deploy:

service: ap-northeast-1
frameworkVersion: '3'

provider:
  name: aws
  runtime: nodejs16.x
  region: ap-northeast-1

custom:
  datadog:
    apiKey: ${env:DD_API_KEY}

functions:
  hello:
    handler: handler.hello
    events:
      - httpApi:
          method: get
          path: /hello

plugins:
  - serverless-plugin-datadog

Thank you!

Hi @shyouhei!

Just wanted to check and see if you've had a chance to test the changes I suggested earlier. Any luck?

Thanks again!

Hello @astuyve. Thank you very much for your super quick response, and sorry for being slow on my end.

I have since contacted AWS support about this. There's no concrete answer yet, but it could be on their side. I'll also try your suggestion and will let you know when I have any updates. Thank you very much!

Hi @shyouhei!

Just wondering if you've been able to test the changes I suggested earlier - have you had any success?

Thanks!

Sorry, I was talking with AWS. No progress on my side.

Let me tentatively close this issue. I'll reopen it when I have more info. Sorry!

No need to apologize at all @shyouhei

If you ever have any questions or concerns, please do not hesitate to reach out!

Thanks!