Use of layer + timeout causes high initialization times of next Lambda call
keithpeck opened this issue · comments
Describe the bug
When aws-otel-python layer is used, Lambda timeouts will cause high initialization times of the next Lambda call.
When our database is under high load, it can cause our Lambdas to time out. In these high-load situations we are also seeing very high Lambda initialization times of up to 19 seconds! Importantly, this initialization time is billed and is deducted from our remaining runtime: with a 30 second Lambda timeout, `context.get_remaining_time_in_millis()` reports only 11 seconds when the handler is called.
I have verified that these high initialization times occur only when the aws-otel-python layer is deployed and configured. I don't have experience with other extensions, so I can't confirm whether this is specific to the OpenTelemetry layer or something inherent to extensions in general.
Steps to reproduce
This can be reproduced with a `time.sleep` call in the Lambda handler, invoking the Lambda with enough load that execution environments are reused.
Normal initialization times for this example function are around 1.2 seconds, with outliers of around 9 seconds. CloudWatch also shows that the function, despite its 15 second timeout, had only 6 seconds left to run.
CloudFormation and Python code that can be used to reproduce the issue are shared below. Warnings:
- Code deploys a public URL
- The python script will generate 100 requests to the URL with a 256 MB Lambda with 15 second timeout.
You should expect to see printed results like this:

```
count inits: 11
count number of inits over 6 seconds: 7
min init duration: 1.2568621635437012
max init duration: 9.623291969299316
mean init duration: 6.339779875495217
mean duration of inits under 6 seconds: 1.2770129442214966
inits trace ids over 6 second ['1-6546c293-xxxxxxxxxx', '1-6546c273-xxxxxx', '1-6546c293-xxxxxxxxx', '1-6546c293-xxxxxx', '1-6546c274-xxxxxx', '1-6546c273-xxxxx', '1-6546c273-xxxxxx']
```
```yaml
AWSTemplateFormatVersion: 2010-09-09
Parameters:
  BucketName:
    Type: String
    Default: your_bucket_name
  FunctionName:
    Type: String
    Default: your_function_name
  Region:
    Type: String
    Default: us-east-1
Resources:
  Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName:
        Ref: BucketName
      PublicAccessBlockConfiguration:
        BlockPublicAcls: True
        BlockPublicPolicy: True
        IgnorePublicAcls: True
        RestrictPublicBuckets: True
  URL:
    Type: AWS::Lambda::Url
    Properties:
      AuthType: NONE
      TargetFunctionArn:
        Fn::GetAtt: [Function, Arn]
  Function:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName:
        Ref: FunctionName
      Handler: index.lambda_handler
      Runtime: python3.11
      MemorySize: 256
      Timeout: 15
      Role:
        Fn::GetAtt: [TaskRole, Arn]
      Code:
        ZipFile: |
          import time

          def lambda_handler(event, context):
              print(f"time remaining {context.get_remaining_time_in_millis()}")
              time.sleep(20)
      Environment:
        Variables:
          # You will need an OpenTelemetry collector config uploaded to the bucket
          OPENTELEMETRY_COLLECTOR_CONFIG_FILE:
            Fn::Sub: s3://${BucketName}.s3.${Region}.amazonaws.com/otel-config.yaml
          OTEL_METRICS_EXPORTER: otlp
          OTEL_TRACES_EXPORTER: none
          OTEL_EXPORTER_OTLP_TIMEOUT: 500
          OTEL_EXPORTER_OTLP_PROTOCOL: http/protobuf
          OPENTELEMETRY_EXTENSION_LOG_LEVEL: error
          OTEL_PYTHON_DISABLED_INSTRUMENTATIONS: fastapi
          AWS_LAMBDA_EXEC_WRAPPER: /opt/otel-instrument
      TracingConfig:
        Mode: Active
      Layers:
        - arn:aws:lambda:us-east-1:901920570463:layer:aws-otel-python-amd64-ver-1-20-0:2
  LogGroup:
    DependsOn: Function
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName:
        Fn::Sub: "/aws/lambda/${FunctionName}"
      RetentionInDays: 1
  PublicURLPermission:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName:
        Fn::GetAtt: [Function, Arn]
      FunctionUrlAuthType: 'NONE'
      Action: lambda:InvokeFunctionUrl
      Principal: '*'
  TaskRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName:
        Fn::Sub: "${FunctionName}Role"
      AssumeRolePolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - lambda.amazonaws.com
            Action:
              - sts:AssumeRole
      Policies:
        - PolicyName: root
          PolicyDocument:
            Version: 2012-10-17
            Statement:
              - Effect: Allow
                Action:
                  - s3:GetObject
                  - s3:ListBucket
                Resource:
                  - Fn::Sub: arn:aws:s3:::${BucketName}
                  - Fn::Sub: "arn:aws:s3:::${BucketName}/*"
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/CloudWatchLambdaInsightsExecutionRolePolicy
        - arn:aws:iam::aws:policy/AWSXrayWriteOnlyAccess
        - arn:aws:iam::aws:policy/CloudWatchLogsFullAccess
```
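The template expects an `otel-config.yaml` object in the bucket (referenced by `OPENTELEMETRY_COLLECTOR_CONFIG_FILE`). A minimal sketch of such a collector config, matching the `otlp` metrics exporter and `http/protobuf` protocol set in the environment variables above; the backend endpoint is a placeholder and must be replaced with your own:

```yaml
# Hypothetical minimal collector config uploaded to the bucket as otel-config.yaml.
receivers:
  otlp:
    protocols:
      http:

exporters:
  otlphttp:
    # Placeholder endpoint; substitute your metrics backend.
    endpoint: "https://your-otlp-backend.example.com"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlphttp]
```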
```python
from datetime import datetime, timedelta
import boto3
import json
from statistics import mean
import time
import concurrent.futures
import urllib.request

function_arn = "your function arn"
url = "your function url"
threshold = 6  # initialization duration in seconds considered bad

print("Sending web requests for 1 minute")
worker_count = 20
requests_per_worker = 4
client = boto3.client("xray")

def load_url():
    for _ in range(0, requests_per_worker):
        try:
            httprequest = urllib.request.Request(url)
            with urllib.request.urlopen(httprequest) as response:
                pass
        except Exception as e:
            print(e)  # request is expected to time out

with concurrent.futures.ThreadPoolExecutor(max_workers=worker_count) as executor:
    jobs = []
    for worker in range(0, worker_count):
        job = executor.submit(load_url)
        jobs.append(job)
    for future in concurrent.futures.as_completed(jobs):
        print("completed")
        future.result()

print("Waiting for traces..")
time.sleep(20)

start_time = datetime.utcnow() - timedelta(minutes=3)
now = datetime.utcnow()
token = None
ids = []
while True:
    # NextToken must be omitted on the first call; an empty string fails validation
    kwargs = {"NextToken": token} if token else {}
    response = client.get_trace_summaries(
        StartTime=start_time,
        EndTime=now,
        Sampling=True,
        FilterExpression=f'resource.ARN = "{function_arn}"',
        **kwargs,
    )
    for trace in response["TraceSummaries"]:
        ids.append(trace["Id"])
        print(trace["Id"])
    if response.get("NextToken"):
        token = response["NextToken"]
    else:
        break
print(f"found {len(ids)}")

traces = []
while True:
    # batch_get_traces accepts at most 5 trace ids per call
    batch = []
    for i in range(0, 5):
        if ids:
            batch.append(ids.pop())
    if batch:
        response = client.batch_get_traces(TraceIds=batch)
        traces.extend(response["Traces"])
    else:
        break

init_durations = []
long_init_durations = []
standard_init_durations = []
long_init_durations_ids = []

def recursive(node, trace):
    if node.get("name") == "Initialization":
        duration = node["end_time"] - node["start_time"]
        init_durations.append(duration)
        print(f"duration: {duration} s")
        if duration > threshold:
            long_init_durations.append(duration)
            long_init_durations_ids.append(trace["Id"])
        else:
            standard_init_durations.append(duration)
    elif node.get("subsegments"):
        for subsegment in node.get("subsegments"):
            recursive(subsegment, trace)

for trace in traces:
    for seg in trace["Segments"]:
        recursive(json.loads(seg["Document"]), trace)

print(f"count inits: {len(init_durations)}")
print(f"count number of inits over {threshold} seconds: {len(long_init_durations)}")
print(f"min init duration: {min(init_durations)}")
print(f"max init duration: {max(init_durations)}")
print(f"mean init duration: {mean(init_durations)}")
print(f"mean duration of inits under {threshold} seconds: {mean(standard_init_durations)}")
print(f"inits trace ids over {threshold} seconds {long_init_durations_ids}")
```
What version of collector/language SDK version did you use?
arn:aws:lambda:us-east-1:901920570463:layer:aws-otel-python-amd64-ver-1-20-0:2
What language layer did you use?
Python 3.11
Additional context
We attempted to mitigate this issue by setting lower statement timeouts on our DB calls, but under high DB load this timeout does not fire reliably.
I'm not sure where the data is being reported to, but the delay might be caused by the synchronous flush.
Please try the new decouple processor included in the latest release of the community collector to see if that helps.
For more details see the links below:
#959
#1046
https://github.com/open-telemetry/opentelemetry-lambda/blob/main/collector/README.md#improving-lambda-responses-times
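For reference, a sketch of what enabling the decoupling processor might look like in the collector config. This assumes a layer whose collector build includes the `decouple` component (it ships with the opentelemetry-lambda community collector); the backend endpoint is a placeholder:

```yaml
receivers:
  otlp:
    protocols:
      http:

processors:
  # Decouples telemetry export from the invocation lifecycle, so the
  # function response is not blocked on a synchronous flush.
  decouple:

exporters:
  otlphttp:
    endpoint: "https://your-otlp-backend.example.com"  # placeholder

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [decouple]
      exporters: [otlphttp]
```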