bluesentry / bucket-antivirus-function

Serverless antivirus for cloud storage.

ClamAV Lambda timeouts

code-memento opened this issue · comments

Hi,

We're seeing some weird behavior: the Lambda functions worked without any issues for a year or more.

Recently, though, the scan takes forever and is stopped by the Lambda timeout.

Any ideas?

Thanks and regards

2020-07-06T16:55:28.822+01:00 Attempting to create directory /tmp/clamav_defs.
2020-07-06T16:55:28.963+01:00 Not downloading older file in series: daily.cvd
2020-07-06T16:55:29.006+01:00 Downloading definition file /tmp/clamav_defs/main.cvd from s3://clamav_defs/main.cvd
2020-07-06T16:55:30.817+01:00 Downloading definition file /tmp/clamav_defs/main.cvd complete!
2020-07-06T16:55:30.817+01:00 Downloading definition file /tmp/clamav_defs/daily.cld from s3://clamav_defs/daily.cld
2020-07-06T16:55:31.913+01:00 Downloading definition file /tmp/clamav_defs/daily.cld complete!
2020-07-06T16:55:31.913+01:00 Downloading definition file /tmp/clamav_defs/bytecode.cvd from s3://clamav_defs/bytecode.cvd
2020-07-06T16:55:31.979+01:00 Downloading definition file /tmp/clamav_defs/bytecode.cvd complete!
2020-07-06T16:55:31.979+01:00 Starting clamscan of /tmp/bucket/documents/file.png.
2020-07-06T17:00:28.513+01:00 END RequestId: cd1e0053-5477-459d-a3c2-7dee7e125378
2020-07-06T17:00:28.513+01:00 REPORT RequestId: cd1e0053-5477-459d-a3c2-7dee7e125378 Duration: 300085.43 ms Billed Duration: 300000 ms Memory Size: 1024 MB Max Memory Used: 1025 MB Init Duration: 500.47 ms
2020-07-06T17:00:28.513+01:00 2020-07-06T16:00:28.512Z cd1e0053-5477-459d-a3c2-7dee7e125378 Task timed out after 300.09 seconds

I've seen the same behaviour since last Friday (3rd July).

I don't know if it's related or not, but on that same day I started having trouble when building the Docker container, with this error:

Trying other mirror.

One of the configured repositories failed (Extra Packages for Enterprise Linux 7 - x86_64),
and yum doesn't have enough cached data to continue. At this point the only
safe thing yum can do is fail. There are a few ways to work "fix" this:

    1. Contact the upstream for the repository and get them to fix the problem.

    2. Reconfigure the baseurl/etc. for the repository, to point to a working
       upstream. This is most often useful if you are using a newer
       distribution release than is supported by the repository (and the
       packages for the previous distribution release still work).

    3. Run the command with the repository temporarily disabled
           yum --disablerepo=epel ...

    4. Disable the repository permanently, so yum won't use it by default. Yum
       will then just ignore the repository until you permanently enable it
       again or use --enablerepo for temporary usage:

           yum-config-manager --disable epel
       or
           subscription-manager repos --disable=epel

    5. Configure the failing repository to be skipped, if it is unavailable.
       Note that yum will try to contact the repo. when it runs most commands,
       so will have to try and fail each time (and thus. yum will be be much
       slower). If it is a very temporary problem though, this is often a nice
       compromise:

           yum-config-manager --save --setopt=epel.skip_if_unavailable=true

failure: repodata/repomd.xml from epel: [Errno 256] No more mirrors to try.

Something changed last week and I can't work out what, why, or how to fix it so that everything starts working again.

Hi @mogusbi

Indeed, something has changed; in my case it's in the execution of the Lambda.

Your issue is related to the build phase: it can't pull from epel.

Maybe you should use another repository.

@code-memento Sorry, I should have made myself clearer - the issue pulling down epel is intermittent, so it eventually works after a few retries. I only mentioned it because it started on the same day I subsequently started seeing problems with my Lambda.

When it does eventually build and deploy, I then see the same issue as you, with the Lambda function timing out when I try to scan a file.

Hi @mogusbi

Okay, so we're in the same boat 😆.

In my case, we did not change the Lambda zip. It used to work like a charm until a week or so ago. When I cleaned the clamav_defs bucket it seemed to work for a moment, but then started to time out again. Even with the timeout set to 15 minutes, it hangs until the end.

If the Lambda did not change, and the defs are not the cause, is it related to the AWS runtime 😅?

It could well be; the Amazon Linux OS image was updated 8 days ago: https://hub.docker.com/_/amazonlinux?tab=tags

Although the release notes say it was updated last month: https://aws.amazon.com/amazon-linux-2/release-notes/

@mogusbi Do you think building with the latest amazonlinux image could solve this issue?
I think it's more related to the runtime. Moreover, I think the Lambda behaves differently when the Lambda container is reused (in that case the defs are not downloaded). Did you notice anything about this?

It hasn't fixed the problem for me

@code-memento yes, it looks like the issue only appears on a cold start. Subsequent scan requests work fine once the Lambda is warm.

@code-memento I've upped the memory of my functions from 1024 MB to 2048 MB and that appears to have fixed the issue (for now).
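
For anyone who wants to apply the same workaround outside the console, a minimal sketch of bumping the memory with boto3; the function name below is a placeholder, and if Terraform/CloudFormation already manages the function you should change it there instead:

import boto3

lambda_client = boto3.client("lambda")

# Raise the scan function's memory allocation (the function name is hypothetical).
# On Lambda, CPU share scales with the memory setting, so this also speeds up clamscan.
lambda_client.update_function_configuration(
    FunctionName="bucket-antivirus-scan",
    MemorySize=2048,  # MB
)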

It seems to work (screenshot attached): the Lambda needs approximately 1290 MB. I'll run more tests to make sure all the cases are covered.
Thanks @mogusbi

I ran many tests and it seems to do the trick; no problems so far.
Thanks @mogusbi

That's good to hear!

I'm still slightly concerned as to why it suddenly needs more memory. It would be good to get to the bottom of that, as throwing more memory at it treats the symptom but not the disease.

You can say that again! The only explanation I've found is that the ClamAV definitions have been updated, so the Lambda needs more resources for the scan.

I found this error in the update Lambda; maybe it's related:

b"ClamAV update process started at Fri Jul 10 08:32:26 2020\ndaily database available for update (local version: 25863, remote version: 25868)\nERROR: buildcld: Can't add daily.hsb to new daily.cld - please check if there is enough disk space available\nERROR: buildcld: gzclose() failed for /tmp/clamav_defs/tmp.6bd07/clamav-e2595ffff6f8a72f6094fc40802f8921.tmp\nERROR: updatedb: Incremental update failed. Failed to build CLD.\nERROR: Unexpected error when attempting to update database: daily\nWARNING: fc_update_databases: fc_update_database failed: Failed to update database (14)\nERROR: Database update process failed: Failed to update database (14)\nERROR: Update failed.\n"

@code-memento - you're right, the definition-update Lambda is failing with that same error and it is impacting the scan. Did you find a fix for it?

@Muthuveerappanv The error disappears if you delete clamav_defs.
AFAIK the /tmp folder is limited to 512MB.

You mean delete the clamav_defs on the definition S3 bucket?

@Muthuveerappanv Yes, the definition bucket.
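
For reference, a minimal sketch of clearing the definitions prefix with boto3; the bucket name and prefix below are placeholders, so substitute the values of AV_DEFINITION_S3_BUCKET and AV_DEFINITION_S3_PREFIX from your own deployment:

import boto3

s3 = boto3.resource("s3")

# Delete every object under the definitions prefix so the update Lambda
# rebuilds the definitions from scratch on its next run (names are placeholders).
bucket = s3.Bucket("my-antivirus-definitions-bucket")
bucket.objects.filter(Prefix="clamav_defs/").delete()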

Just wondering if there are alternative solutions other than deleting clamav_defs in the S3 bucket?

I would also like to know this; I dove deep into trying to figure this out last weekend.

I read somewhere that somebody mentioned loading the definitions into memory directly after download to free up /tmp, but I have no idea how to do this.

@culshaw I don't see how it can be done, as the ClamAV scan is in the end a command-line execution with different parameters.
@DimitrijeManic It's just speculation; the real issue is that the scan needs more memory (> 1024 MB). Since the code didn't change for any of us, I suspect it's caused by the defs.
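
To illustrate that point, here is a simplified sketch of what that command-line execution boils down to; the flags and paths are illustrative (based on the log output earlier in this thread), not the project's exact invocation:

import subprocess

# Simplified sketch only; the project wraps something like this in its clamav module.
# The definitions directory and target path mirror the scan log above.
result = subprocess.run(
    ["clamscan", "--stdout", "-d", "/tmp/clamav_defs", "/tmp/bucket/documents/file.png"],
    capture_output=True,
    text=True,
)
print(result.stdout)
# clamscan exit codes: 0 = clean, 1 = virus found, 2 = error.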

@code-memento I have set my Lambda to 2048 MB, but I believe the issue comes from the hard limit on the /tmp dir.

Possible solutions?

  1. Set up EFS as a storage solution
  2. Treat every update as if there is nothing in the S3 clamav_defs bucket (comment out the fetch part)

Thoughts?

This can be reproduced by adding a volume with a size limit in scripts/run-update-lambda:

#! /usr/bin/env bash

set -eu -o pipefail

#
# Run the update.lambda_handler locally in a docker container
#

rm -rf tmp/
unzip -qq -d ./tmp build/lambda.zip

NAME="antivirus-update"

# MEM and CPUS must be set in the environment before running this script;
# `set -u` above will abort otherwise.

# Simulate the Lambda /tmp dir with a 512m size restriction
docker volume create --driver local --opt type=tmpfs --opt device=tmpfs --opt o=size=512m,uid=496 clamav_defs

docker run --rm \
  -v "$(pwd)/tmp/:/var/task" \
  -v clamav_defs:/tmp \
  -e AV_DEFINITION_PATH \
  -e AV_DEFINITION_S3_BUCKET \
  -e AV_DEFINITION_S3_PREFIX \
  -e AWS_ACCESS_KEY_ID \
  -e AWS_DEFAULT_REGION \
  -e AWS_REGION \
  -e AWS_SECRET_ACCESS_KEY \
  -e AWS_SESSION_TOKEN \
  -e CLAMAVLIB_PATH \
  --memory="${MEM}" \
  --memory-swap="${MEM}" \
  --cpus="${CPUS}" \
  --name="${NAME}" \
  lambci/lambda:python3.7 update.lambda_handler

Hack workaround to not download the existing ClamAV defs in update.py:

# -*- coding: utf-8 -*-
# Upside Travel, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os

import boto3

import clamav
from common import AV_DEFINITION_PATH
from common import AV_DEFINITION_S3_BUCKET
from common import AV_DEFINITION_S3_PREFIX
from common import CLAMAVLIB_PATH
from common import get_timestamp
import shutil


def lambda_handler(event, context):
    s3 = boto3.resource("s3")
    s3_client = boto3.client("s3")

    print("Script starting at %s\n" % (get_timestamp()))

    # Clear out the definitions directory first so freshclam has enough of the
    # 512 MB /tmp space to build the new daily.cld.
    for root, dirs, files in os.walk(AV_DEFINITION_PATH):
        for f in files:
            os.unlink(os.path.join(root, f))
        for d in dirs:
            shutil.rmtree(os.path.join(root, d))

    to_download = clamav.update_defs_from_s3(
        s3_client, AV_DEFINITION_S3_BUCKET, AV_DEFINITION_S3_PREFIX
    )

    print("Skipping clamav definition download %s\n" % (get_timestamp()))
    # for download in to_download.values():
    #     s3_path = download["s3_path"]
    #     local_path = download["local_path"]
    #     print("Downloading definition file %s from s3://%s" % (local_path, s3_path))
    #     s3.Bucket(AV_DEFINITION_S3_BUCKET).download_file(s3_path, local_path)
    #     print("Downloading definition file %s complete!" % (local_path))

    clamav.update_defs_from_freshclam(AV_DEFINITION_PATH, CLAMAVLIB_PATH)
    # If main.cvd gets updated (very rare), we will need to force freshclam
    # to download the compressed version to keep file sizes down.
    # The existence of main.cud is the trigger to know this has happened.
    if os.path.exists(os.path.join(AV_DEFINITION_PATH, "main.cud")):
        os.remove(os.path.join(AV_DEFINITION_PATH, "main.cud"))
        if os.path.exists(os.path.join(AV_DEFINITION_PATH, "main.cvd")):
            os.remove(os.path.join(AV_DEFINITION_PATH, "main.cvd"))
        clamav.update_defs_from_freshclam(AV_DEFINITION_PATH, CLAMAVLIB_PATH)
    clamav.upload_defs_to_s3(
        s3_client, AV_DEFINITION_S3_BUCKET, AV_DEFINITION_S3_PREFIX, AV_DEFINITION_PATH
    )
    print("Script finished at %s\n" % get_timestamp())

@DimitrijeManic does this solution fix the Lambda timeout issue?

I have seen a similar issue recently. Compare these runs:
File size: less than 1 MB
Case 1: Lambda memory 1024 MB, timeout 10 minutes. Result: timed out after 10 minutes.
Case 2: Lambda memory 2048 MB, timeout 3 minutes. Result: succeeded after 21 seconds with 1299 MB of memory used.
So I suggest using 2048 MB instead; you can then also reduce the Lambda timeout significantly.

@wangcarlton after some digging, it seems that clamscan is a well-known memory beast.
The recent issues are no doubt caused by the increase in the number of virus definitions.

Increasing the Lambda memory to 2048 MB has resolved the timeout issue; however, the next problem is disk space in /tmp.

The Lambda will complete successfully, but this error message will be in the logs:

ClamAV update process started at Fri Jul 10 08:32:26 2020
daily database available for update (local version: 25863, remote version: 25868)
ERROR: buildcld: Can't add daily.hsb to new daily.cld - please check if there is enough disk space available
ERROR: buildcld: gzclose() failed for /tmp/clamav_defs/tmp.6bd07/clamav-e2595ffff6f8a72f6094fc40802f8921.tmp
ERROR: updatedb: Incremental update failed. Failed to build CLD.
ERROR: Unexpected error when attempting to update database: daily
WARNING: fc_update_databases: fc_update_database failed: Failed to update database (14)
ERROR: Database update process failed: Failed to update database (14)
ERROR: Update failed.

So maybe this issue is resolved and we can continue the discussion in #128?
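
If you want to confirm how much of /tmp is actually free when the update or scan runs, here is a small debugging sketch using only the standard library; it is not part of the project's code, just something you could call temporarily from the handler:

import shutil

def log_tmp_usage():
    # Print total/used/free space on the Lambda's ephemeral /tmp volume, in MB.
    total, used, free = shutil.disk_usage("/tmp")
    print("/tmp usage: total=%d MB used=%d MB free=%d MB"
          % (total // 2**20, used // 2**20, free // 2**20))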

I am using this: https://github.com/upsidetravel/bucket-antivirus-function
I guess this is a new issue that appeared after the ClamAV upgrade from 0.102.2 to 0.102.3.
I was trying to solve it today, but it seems using another directory (such as /var/task) is prohibited by AWS.
Lambda has a fixed 500 MB of /tmp storage which can't be changed:
https://aws.amazon.com/lambda/faqs/

Q: What if I need scratch space on disk for my AWS Lambda function?
Each Lambda function receives 500MB of non-persistent disk space in its own /tmp directory.

It also took me this whole afternoon to figure out that some libs (such as libprelude, etc.) need to be installed and the env path needs to be updated to run freshclam after the upgrade from 0.102.2 to 0.102.3.
I am going to migrate the Lambda to an EC2 instance (more stable and under my control) to update the definition files.

Thanks @DimitrijeManic, your snippet fixed my update.py issues (running out of space).

Increasing the memory worked for me.

Same here, increasing the memory did the trick.

Increasing the Lambda memory to 2048 MB has resolved the timeout issue; however, the next problem is disk space in /tmp.

Lambda has a fixed 500 MB of /tmp storage which can't be changed

Guys, I know it's been a long time since you wrote this. I just want to mention that it is possible to attach an EFS (Elastic File System) volume to a Lambda, and then you have nearly unlimited storage available.

https://aws.amazon.com/blogs/compute/using-amazon-efs-for-aws-lambda-in-your-serverless-applications/

just make sure to avoid this error:
aws/serverless-application-model#1631 (comment)

And note: you have to delete the files yourself after scanning.
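
A minimal sketch of wiring an existing EFS access point into the function with boto3; the function name, ARN, and mount path are placeholders, and the VPC/security-group setup described in the linked blog post still has to be in place:

import boto3

lambda_client = boto3.client("lambda")

# Attach an EFS access point to the function (all values below are placeholders).
# Lambda requires the local mount path to start with /mnt/.
lambda_client.update_function_configuration(
    FunctionName="bucket-antivirus-scan",
    FileSystemConfigs=[
        {
            "Arn": "arn:aws:elasticfilesystem:eu-west-1:123456789012:access-point/fsap-0123456789abcdef0",
            "LocalMountPath": "/mnt/clamav",
        }
    ],
)

You would then point AV_DEFINITION_PATH at the mount (e.g. a directory under /mnt/clamav) instead of /tmp/clamav_defs, so the definitions no longer compete for the /tmp space.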

The best solution for this problem is to increase the memory to 2048 MB. Thanks, folks.