Use of layer + timeout causes high initialization times of next Lambda call
keithpeck opened this issue · comments
Describe the bug
When aws-otel-python layer is used, Lambda timeouts will cause high initialization times of the next Lambda call.
When our database is under high load, it can cause our Lambdas to time out. In these high-load situations we are also seeing very high Lambda initialization times of up to 19 seconds! Importantly, this initialization time is billed and is deducted from our remaining runtime: with a 30 second Lambda timeout, `context.get_remaining_time_in_millis()` reports only 11 seconds when the handler is called.
I have verified that these high initialization times occur only when the aws-otel-python layer is deployed and configured. I don't have experience with other extensions, so I can't confirm whether this is specific to the OpenTelemetry layer or something inherent to extensions in general.
Steps to reproduce
This can be reproduced with a `time.sleep` call in the Lambda handler, invoking the Lambda with enough load that execution environments are reused.
Normal initialization times for this example function are around 1.2 seconds, with outliers of around 9 seconds. CloudWatch also shows that the function, despite its 15 second timeout, had only 6 seconds left to run.
CloudFormation and Python code that can be used to reproduce the issue are shared below. Warnings:
- Code deploys a public URL
- The python script will generate 100 requests to the URL with a 256 MB Lambda with 15 second timeout.
You should expect to see printed results like this:

```
count inits: 11
count number of inits over 6 seconds: 7
min init duration: 1.2568621635437012
max init duration: 9.623291969299316
mean init duration: 6.339779875495217
mean duration of inits under 6 seconds: 1.2770129442214966
inits trace ids over 6 second ['1-6546c293-xxxxxxxxxx', '1-6546c273-xxxxxx', '1-6546c293-xxxxxxxxx', '1-6546c293-xxxxxx', '1-6546c274-xxxxxx', '1-6546c273-xxxxx', '1-6546c273-xxxxxx']
```
```yaml
AWSTemplateFormatVersion: 2010-09-09
Parameters:
  BucketName:
    Type: String
    Default: your_bucket_name
  FunctionName:
    Type: String
    Default: your_function_name
  Region:
    Type: String
    Default: us-east-1
Resources:
  Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName:
        Ref: BucketName
      PublicAccessBlockConfiguration:
        BlockPublicAcls: True
        BlockPublicPolicy: True
        IgnorePublicAcls: True
        RestrictPublicBuckets: True
  URL:
    Type: AWS::Lambda::Url
    Properties:
      AuthType: NONE
      TargetFunctionArn:
        Fn::GetAtt: [Function, Arn]
  Function:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName:
        Ref: FunctionName
      Handler: index.lambda_handler
      Runtime: python3.11
      MemorySize: 256
      Timeout: 15
      Role:
        Fn::GetAtt: [TaskRole, Arn]
      Code:
        ZipFile: |
          import time

          def lambda_handler(event, context):
              print(f"time remaining {context.get_remaining_time_in_millis()}")
              time.sleep(20)
      Environment:
        Variables:
          # You will need an OpenTelemetry collector config uploaded to the bucket
          OPENTELEMETRY_COLLECTOR_CONFIG_FILE:
            Fn::Sub: s3://${BucketName}.s3.${Region}.amazonaws.com/otel-config.yaml
          OTEL_METRICS_EXPORTER: otlp
          OTEL_TRACES_EXPORTER: none
          OTEL_EXPORTER_OTLP_TIMEOUT: 500
          OTEL_EXPORTER_OTLP_PROTOCOL: http/protobuf
          OPENTELEMETRY_EXTENSION_LOG_LEVEL: error
          OTEL_PYTHON_DISABLED_INSTRUMENTATIONS: fastapi
          AWS_LAMBDA_EXEC_WRAPPER: /opt/otel-instrument
      TracingConfig:
        Mode: Active
      Layers:
        - arn:aws:lambda:us-east-1:901920570463:layer:aws-otel-python-amd64-ver-1-20-0:2
  LogGroup:
    DependsOn: Function
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName:
        Fn::Sub: "/aws/lambda/${FunctionName}"
      RetentionInDays: 1
  PublicURLPermission:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName:
        Fn::GetAtt: [Function, Arn]
      FunctionUrlAuthType: 'NONE'
      Action: lambda:InvokeFunctionUrl
      Principal: '*'
  TaskRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName:
        Fn::Sub: "${FunctionName}Role"
      AssumeRolePolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - lambda.amazonaws.com
            Action:
              - sts:AssumeRole
      Policies:
        - PolicyName: root
          PolicyDocument:
            Version: 2012-10-17
            Statement:
              - Effect: Allow
                Action:
                  - s3:GetObject
                  - s3:ListBucket
                Resource:
                  - Fn::Sub: arn:aws:s3:::${BucketName}
                  - Fn::Sub: "arn:aws:s3:::${BucketName}/*"
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/CloudWatchLambdaInsightsExecutionRolePolicy
        - arn:aws:iam::aws:policy/AWSXrayWriteOnlyAccess
        - arn:aws:iam::aws:policy/CloudWatchLogsFullAccess
```
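The template expects an `otel-config.yaml` object in the bucket (referenced by `OPENTELEMETRY_COLLECTOR_CONFIG_FILE`). A minimal sketch of such a collector config, matching the `otlp` metrics exporter and `http/protobuf` protocol set in the environment variables above; the backend endpoint is a placeholder and must be replaced with your own:

```yaml
# Hypothetical minimal collector config uploaded to the bucket as otel-config.yaml.
receivers:
  otlp:
    protocols:
      http:

exporters:
  otlphttp:
    # Placeholder endpoint; substitute your metrics backend.
    endpoint: "https://your-otlp-backend.example.com"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlphttp]
```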
```python
from datetime import datetime, timedelta
import boto3
import json
from statistics import mean
import time
import concurrent.futures
import urllib.request

function_arn = "your function arn"
url = "your function url"
threshold = 6  # initialization duration in seconds considered bad

print("Sending web requests for 1 minute")
worker_count = 20
requests_per_worker = 4
client = boto3.client("xray")

def load_url():
    for _ in range(0, requests_per_worker):
        try:
            httprequest = urllib.request.Request(url)
            with urllib.request.urlopen(httprequest) as response:
                pass
        except Exception as e:
            print(e)  # request is expected to time out

with concurrent.futures.ThreadPoolExecutor(max_workers=worker_count) as executor:
    jobs = []
    for worker in range(0, worker_count):
        job = executor.submit(load_url)
        jobs.append(job)
    for future in concurrent.futures.as_completed(jobs):
        print("completed")
        future.result()

print("Waiting for traces..")
time.sleep(20)

start_time = datetime.utcnow() - timedelta(minutes=3)
now = datetime.utcnow()
token = None
ids = []
while True:
    # NextToken must be omitted on the first call; an empty string fails validation
    kwargs = {"NextToken": token} if token else {}
    response = client.get_trace_summaries(
        StartTime=start_time,
        EndTime=now,
        Sampling=True,
        FilterExpression=f'resource.ARN = "{function_arn}"',
        **kwargs,
    )
    for trace in response["TraceSummaries"]:
        ids.append(trace["Id"])
        print(trace["Id"])
    if response.get("NextToken"):
        token = response["NextToken"]
    else:
        break
print(f"found {len(ids)}")

traces = []
while True:
    # batch_get_traces accepts at most 5 trace ids per call
    batch = []
    for i in range(0, 5):
        if ids:
            batch.append(ids.pop())
    if batch:
        response = client.batch_get_traces(TraceIds=batch)
        traces.extend(response["Traces"])
    else:
        break

init_durations = []
long_init_durations = []
standard_init_durations = []
long_init_durations_ids = []

def recursive(node, trace):
    if node.get("name") == "Initialization":
        duration = node["end_time"] - node["start_time"]
        init_durations.append(duration)
        print(f"duration: {duration} s")
        if duration > threshold:
            long_init_durations.append(duration)
            long_init_durations_ids.append(trace["Id"])
        else:
            standard_init_durations.append(duration)
    elif node.get("subsegments"):
        for subsegment in node.get("subsegments"):
            recursive(subsegment, trace)

for trace in traces:
    for seg in trace["Segments"]:
        recursive(json.loads(seg["Document"]), trace)

print(f"count inits: {len(init_durations)}")
print(f"count number of inits over {threshold} seconds: {len(long_init_durations)}")
print(f"min init duration: {min(init_durations)}")
print(f"max init duration: {max(init_durations)}")
print(f"mean init duration: {mean(init_durations)}")
print(f"mean duration of inits under {threshold} seconds: {mean(standard_init_durations)}")
print(f"inits trace ids over {threshold} seconds {long_init_durations_ids}")
```
What version of collector/language SDK version did you use?
arn:aws:lambda:us-east-1:901920570463:layer:aws-otel-python-amd64-ver-1-20-0:2
What language layer did you use?
Python 3.11
Additional context
We attempted to mitigate this issue by setting lower statement timeouts on our DB calls, but under high DB load this timeout does not fire reliably.
I'm not sure where the data is being reported to, but the delay might be caused by the synchronous flush.
Please try the new decouple processor included in the latest release of the community collector to see if that helps.
For more details see the links below:
#959
#1046
https://github.com/open-telemetry/opentelemetry-lambda/blob/main/collector/README.md#improving-lambda-responses-times
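For reference, a sketch of what enabling the decoupling processor might look like in the collector config. This assumes a layer whose collector build includes the `decouple` component (it ships with the opentelemetry-lambda community collector); the backend endpoint is a placeholder:

```yaml
receivers:
  otlp:
    protocols:
      http:

processors:
  # Decouples telemetry export from the invocation lifecycle, so the
  # function response is not blocked on a synchronous flush.
  decouple:

exporters:
  otlphttp:
    endpoint: "https://your-otlp-backend.example.com"  # placeholder

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [decouple]
      exporters: [otlphttp]
```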