DataDog / datadog-lambda-extension

Are +1 second cold starts on NodeJS to be expected?

eliasbrange opened this issue · comments

The documentation states that a short increase in overhead is expected, but there are no pointers as to the expected size of this increase.

We've been seeing at least 1-second increases in cold starts across the board using both NodeJS 18 and 20, which feels a bit much IMO.

I've used the https://github.com/astuyve/cold-start-benchmarker tool to measure the total round-trip time for both an instrumented and a non-instrumented Lambda function.

For testing, I've used the following minimal samples:

Handler function:

module.exports.handler = async (event) => {
  console.log(event);
  return {
    statusCode: 200,
    body: JSON.stringify(
      {
        message: "Hello there!",
      },
      null,
      2
    ),
  };
};

Serverless.yml (non-instrumented):

service: coldstart-without-dd
frameworkVersion: "3"

provider:
  name: aws
  region: eu-north-1
  runtime: nodejs20.x

functions:
  function1:
    handler: index.handler

Serverless.yml (instrumented):

service: coldstart-with-dd
frameworkVersion: "3"

plugins:
  - serverless-plugin-datadog

provider:
  name: aws
  region: eu-north-1
  runtime: nodejs20.x

custom:
  datadog:
    # api key set using DATADOG_API_KEY environment variable
    site: datadoghq.eu
    env: dev
    service: coldstart-with-dd
    version: 1.0.0

functions:
  function1:
    handler: index.handler

I'm using the latest serverless-plugin-datadog as of this writing, which uses the following layers:

arn:aws:lambda:eu-north-1:464622532012:layer:Datadog-Node20-x:104
arn:aws:lambda:eu-north-1:464622532012:layer:Datadog-Extension:52

Results

Just adding the extension through the Serverless plugin increases our cold start RTT from 250-300ms to 1200-1400ms. See the following graph of average RTT in ms. The data was gathered using https://github.com/astuyve/cold-start-benchmarker

[Graph: average cold start RTT in ms, instrumented vs. non-instrumented]

Logs

Debug logs
2023-12-21T10:23:04.396+01:00	INIT_START Runtime Version: nodejs:20.v13 Runtime Version ARN: arn:aws:lambda:eu-north-1::runtime:0593533e3a90a967b5685780855b7a89f7ba80b7d910603abf59369c6c4c7331
2023-12-21T10:23:04.712+01:00	2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | Datadog extension version : 52|Datadog environment variables: DD_API_KEY=***|DD_CAPTURE_LAMBDA_PAYLOAD=false|DD_ENV=dev|DD_LAMBDA_HANDLER=index.handler|DD_LOGS_INJECTION=false|DD_LOG_LEVEL=debug|DD_MERGE_XRAY_TRACES=false|DD_REMOTE_CONFIGURATION_ENABLED=false|DD_SERVERLESS_LOGS_ENABLED=true|DD_SERVICE=coldstart-with-dd|DD_SITE=datadoghq.eu|DD_TRACE_ENABLED=true|DD_VERSION=1.0.0|
2023-12-21T10:23:04.716+01:00	2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | Starting daemon to receive messages from runtime...
2023-12-21T10:23:04.725+01:00	2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | Unable to restore the state from file
2023-12-21T10:23:04.733+01:00	2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | No config file detected, using environment variable based configuration only
2023-12-21T10:23:04.733+01:00	2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | 'use_proxy_for_cloud_metadata' is enabled: adding cloud provider URL to the no_proxy list
2023-12-21T10:23:04.733+01:00	2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | FIPS mode is disabled
2023-12-21T10:23:04.733+01:00	2023-12-21 09:23:04 UTC | DD_EXTENSION | INFO | 0 Features detected from environment:
2023-12-21T10:23:04.734+01:00	2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | Using a SyncForwarder with a 5s timeout
2023-12-21T10:23:04.739+01:00	2023-12-21 09:23:04 UTC | DD_EXTENSION | INFO | Retry queue storage on disk is disabled
2023-12-21T10:23:04.739+01:00	2023-12-21 09:23:04 UTC | DD_EXTENSION | INFO | Creating TimeSampler #0
2023-12-21T10:23:04.739+01:00	2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | 'telemetry.dogstatsd.aggregator_channel_latency_buckets' is empty, falling back to default values
2023-12-21T10:23:04.739+01:00	2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | Forwarder started
2023-12-21T10:23:04.739+01:00	2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | Demultiplexer started
2023-12-21T10:23:04.739+01:00	2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | 'telemetry.dogstatsd.listeners_latency_buckets' is empty, falling back to default values
2023-12-21T10:23:04.739+01:00	2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | 'telemetry.dogstatsd.listeners_channel_latency_buckets' is empty, falling back to default values
2023-12-21T10:23:04.742+01:00	2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | dogstatsd-udp: 127.0.0.1:8125 successfully initialized
2023-12-21T10:23:04.742+01:00	2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | DogStatsD will run 2 workers
2023-12-21T10:23:04.744+01:00	2023-12-21 09:23:04 UTC | DD_EXTENSION | INFO | dogstatsd-udp: starting to listen on 127.0.0.1:8125
2023-12-21T10:23:04.750+01:00	2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | appsec: security monitoring is not enabled: DD_SERVERLESS_APPSEC_ENABLED is not set to true
2023-12-21T10:23:04.750+01:00	2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | Enabling telemetry collection HTTP route
2023-12-21T10:23:04.750+01:00	2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | Subscribing to Telemetry for types: [platform function extension]
2023-12-21T10:23:04.750+01:00	2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | No config file detected, using environment variable based configuration only
2023-12-21T10:23:04.750+01:00	2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | 'use_proxy_for_cloud_metadata' is enabled: adding cloud provider URL to the no_proxy list
2023-12-21T10:23:04.751+01:00	2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | otlp endpoint disabled
2023-12-21T10:23:04.751+01:00	2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | FIPS mode is disabled
2023-12-21T10:23:04.753+01:00	2023-12-21 09:23:04 UTC | DD_EXTENSION | INFO | 0 Features detected from environment:
2023-12-21T10:23:04.754+01:00	TELEMETRY Name: datadog-agent State: Subscribed Types: [Platform, Function, Extension]
2023-12-21T10:23:04.756+01:00	2023-12-21 09:23:04 UTC | DD_EXTENSION | INFO | Loaded configuration: /var/task/datadog.yaml
2023-12-21T10:23:04.756+01:00	2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | Setting DefaultEnv to "dev" (from 'env' config option)
2023-12-21T10:23:04.758+01:00	2023-12-21 09:23:04 UTC | DD_EXTENSION | INFO | Starting logs-agent...
2023-12-21T10:23:04.758+01:00	2023-12-21 09:23:04 UTC | DD_EXTENSION | INFO | logs-agent started
2023-12-21T10:23:04.758+01:00	2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | Adding AWS Logs Log Source
2023-12-21T10:23:04.760+01:00	2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | Starting ColdStartSpanCreator
2023-12-21T10:23:04.760+01:00	2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | Setting ColdStartSpanCreator on Daemon
2023-12-21T10:23:04.760+01:00	2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | finding the lambda runtime took 606.381µs. found runtime: nodejs20.x
2023-12-21T10:23:04.771+01:00	2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | serverless agent ready in 54.868115ms
2023-12-21T10:23:05.293+01:00	EXTENSION Name: datadog-agent State: Ready Events: [INVOKE, SHUTDOWN]
2023-12-21T10:23:05.294+01:00	START RequestId: 47d7f666-dda2-42eb-890f-8e43c71800e5 Version: $LATEST
2023-12-21T10:23:05.295+01:00	2023-12-21 09:23:05 UTC | DD_EXTENSION | DEBUG | Received invocation event...

Something that stands out is the time between the serverless agent ready log and EXTENSION Name: datadog-agent... log. What is happening during those 522 milliseconds?

Hi @eliasbrange,

Thanks so much for reaching out!

The short answer is that it depends 😄

What you can do today

Your current configuration enables automatic APM instrumentation, logs, and metrics – all via the Datadog extension on a zip-based Lambda function.

That's going to be on the higher end of the cold start spectrum because, effectively, you're replacing all of CloudWatch with a combination of the tracer and the extension. Users often choose to go this route because they can disable CloudWatch logs and save substantial amounts of money.

You can make a number of performance tradeoffs, including manually instrumenting your code, using the Datadog Lambda Forwarder (which replaces the extension entirely), or even using X-Ray (whose traces we can merge).
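
For illustration, here's a minimal sketch of what the Forwarder-based tradeoff could look like in serverless.yml. The option names (addExtension, forwarderArn, flushMetricsToLogs, enableXrayTracing) are taken from my reading of the serverless-plugin-datadog docs, and the Forwarder ARN is a placeholder, so double-check both against the plugin version you're using:

custom:
  datadog:
    # api key set using DATADOG_API_KEY environment variable
    site: datadoghq.eu
    env: dev
    service: coldstart-with-dd
    version: 1.0.0
    addExtension: false        # skip the Datadog Extension layer entirely
    flushMetricsToLogs: true   # ship metrics via CloudWatch logs instead of the extension
    enableXrayTracing: true    # optional: rely on X-Ray traces, which Datadog can merge
    # placeholder ARN - deploy the Datadog Forwarder and use its real ARN here
    forwarderArn: arn:aws:lambda:eu-north-1:000000000000:function:datadog-forwarder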

So what's happening during those 522 milliseconds?

Something that stands out is the time between the serverless agent ready log and EXTENSION Name: datadog-agent... log. What is happening during those 522 milliseconds?

The Lambda Extensions API is designed to allow many things (which we rely on), but also allows folks to modify the runtime environment of the function itself (which we do not rely on).

This feature means that all extensions present must fully initialize before the function runtime and code even begin being copied into the Lambda sandbox.
[Diagram: Lambda init lifecycle with extensions – extension init, followed by runtime init and function init]
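
To make that ordering concrete, here's a minimal sketch of the extension lifecycle against the public Extensions API (this is not the Datadog extension's actual code, and "demo-extension" is a made-up name):

package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
	"os"
)

func main() {
	api := os.Getenv("AWS_LAMBDA_RUNTIME_API")

	// 1. Register with the Extensions API for INVOKE and SHUTDOWN events.
	register, _ := http.NewRequest(http.MethodPost,
		fmt.Sprintf("http://%s/2020-01-01/extension/register", api),
		bytes.NewBufferString(`{"events":["INVOKE","SHUTDOWN"]}`))
	register.Header.Set("Lambda-Extension-Name", "demo-extension") // made-up name
	resp, err := http.DefaultClient.Do(register)
	if err != nil {
		log.Fatalf("register failed: %v", err)
	}
	extensionID := resp.Header.Get("Lambda-Extension-Identifier")
	resp.Body.Close()

	// 2. ...the extension's own setup happens here (the ~55ms
	//    "serverless agent ready" portion in the logs above)...

	// 3. Calling /next signals that this extension has finished initializing.
	//    Per the explanation above, the runtime and function code are only set
	//    up after that point, and the call blocks until the first INVOKE
	//    (or SHUTDOWN) event is delivered.
	for {
		next, _ := http.NewRequest(http.MethodGet,
			fmt.Sprintf("http://%s/2020-01-01/extension/event/next", api), nil)
		next.Header.Set("Lambda-Extension-Identifier", extensionID)
		event, err := http.DefaultClient.Do(next)
		if err != nil {
			log.Fatalf("next failed: %v", err)
		}
		event.Body.Close()
		// ...forward telemetry, flush, etc., then loop back to /next...
	}
}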

Answer:

My understanding is that the 522ms is Lambda copying the Node runtime + your code into the sandbox and performing static initialization.

To further illustrate this point, I added some debug points to the extension and created a demonstration application.
I wrapped the code which calls /next in two log statements:

	log.Debugf("[extension] - calling /next")
	if response, err = client.Do(request); err != nil {
		return fmt.Errorf("WaitForNextInvocation: while GET next route: %v", err)
	}
	log.Debugf("[extension] - got response from /next")

And then outside of my handler function, I logged my function init:

'use strict';
console.log('[handler] initializing function code')

module.exports.hello = async (event) => {
  return {
    statusCode: 200,
    body: JSON.stringify(
      {
        message: 'Go Serverless v1.0! Your function executed successfully!',
        input: event,
      },
      null,
      2
    ),
  };
};

When I run the code, we see that the longest gap exists between calling /next and the initialization of the handler code:
[Screenshot: debug logs showing the gap between the "calling /next" log and the handler init log]

And finally, afterwards, we receive a response, which begins the invocation.

We're constantly working on improving the performance of our tracing libraries and extension code, and that includes working with our partners at AWS as well.

I hope I've helped!
AJ

Thanks for the response.

To gather more data, I added two more functions: one with the AWS AppConfig extension and one with the Lambda Insights extension.

I haven't checked the code of those two (are they public?), but I assume they use the same registration API in a similar way.

Since my four functions use the same runtime and handler code, I'd expect all three variants with extensions to behave similarly and have the ~500 ms delay. But I am seeing the following instead:

[Graph: average cold start RTT in ms for the four variants – no extension, AppConfig, Lambda Insights, Datadog]

Those other extensions add around 100ms to the round-trip time, much less than the Datadog extension.

Could there be something more to it than "copying the node runtime + your code into the sandbox, and performing static initialization"?

Hey folks! We narrowed this down both in this issue and on Slack with a bit of code sharing. The extra duration includes the time to load and instrument the handler function itself, which will vary compared with other Lambda extensions.
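
To make "load and instrument the handler" concrete: with the layer, the function's entry point is redirected to Datadog's wrapper (the DD_LAMBDA_HANDLER=index.handler variable in the debug logs above is how it finds the original handler), which requires the tracer and wraps the handler during init. Here's a minimal sketch of the equivalent manual wrapping, assuming the datadog wrapper export described in the datadog-lambda-js README:

'use strict';
// Sketch of manual instrumentation: the layer's auto-instrumentation does the
// equivalent require-and-wrap at init time, which is part of the extra init
// duration discussed above.
const { datadog } = require('datadog-lambda-js'); // also pulls in the tracer

module.exports.handler = datadog(async (event) => {
  return {
    statusCode: 200,
    body: JSON.stringify({ message: 'Hello there!' }),
  };
});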

Feel free to reach out with any questions.

Thanks!