Garbage Collection for Assets

Question

eladb opened this issue 5 years ago · comments

Description

Assets which are uploaded to the CDK's S3 bucket and ECR repository are never deleted. This will incur costs for users in the long term. We should come up with a story on how those should be garbage collected safely.

Initially we should offer cdk gc, which will track down unused assets (e.g. by tracing them back from deployed stacks) and offer to delete them. We can offer an option to run this automatically after every deployment (either in the CLI or through CI/CD). Later we can even offer a construct that you deploy to your environment and that does this for you.

Proposed usage:

cdk gc [ENVIRONMENT...] [--list] [--type=s3|ecr]

Examples:

This command will find all orphaned S3 and ECR assets in a specific AWS environment and will delete them:

cdk gc aws://ACCOUNT/REGION

This command will garbage collect all assets in all environments that belong to the current CDK app (if cdk.json exists):

cdk gc

Just list orphaned assets:

cdk gc --list

Roles

Role	User
Proposed by	@eladb
Author(s)	@kaizen3031593
API Bar Raiser	@njlynch
Stakeholders	@rix0rrr @nija-at

See RFC Process for details

Workflow

Author is responsible to progress the RFC according to this checklist, and
apply the relevant labels to this issue so that the RFC table in README gets
updated.

sam · Answer 1 · Wed Dec 12 2018 16:36:07 GMT+0800 (China Standard Time)

How do we generate keys when uploading? Just random?

If it were deterministic, we could use lifecycle rules of object versions: https://aws.amazon.com/about-aws/whats-new/2014/05/20/amazon-s3-now-supports-lifecycle-rules-for-versioning/

That way objects are only considered for deletion if they are not 'current'.

sam · Answer 2 · Wed Dec 12 2018 16:39:34 GMT+0800 (China Standard Time)

Now that I think about it, versioning would only help if we were regularly re-uploading files. That's probably not the case?

Elad Ben-Israel · Answer 3 · Wed Dec 12 2018 16:41:36 GMT+0800 (China Standard Time)

Lifecycle rules sound like a good option for sure. The object keys are based on the hash of the asset's contents, so as to avoid re-uploading when the content hasn't changed (code)
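
A minimal sketch of content-addressed keys, assuming SHA-256 over the raw bytes (the thread does not spell out the exact hashing scheme CDK uses). The names asset_key and needs_upload are hypothetical:

```python
import hashlib


def asset_key(content, extension=".zip"):
    """Derive an object key from the asset's bytes (assumed scheme: SHA-256 hex digest)."""
    return hashlib.sha256(content).hexdigest() + extension


def needs_upload(content, existing_keys):
    """Identical content maps to an identical key, so an existing key means no upload."""
    return asset_key(content) not in existing_keys


if __name__ == "__main__":
    uploaded = set()
    bundle = b"example lambda bundle"
    if needs_upload(bundle, uploaded):
        uploaded.add(asset_key(bundle))
    # A second deploy with unchanged content is skipped.
    print(needs_upload(bundle, uploaded))
```

Because an unchanged asset is never re-uploaded, its object never gets a newer version or a fresher timestamp, which is why purely age-based rules are discussed with caution in the replies.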

Rico Hermans · Answer 4 · Wed Dec 12 2018 17:05:53 GMT+0800 (China Standard Time)

I'm not sure this works if there are two stacks, and only one is deployed with a new version of the asset.

In my mind, the other stack should still point to the old version of the asset (should not be automatically updated), but now the asset will be aged out after a while and the undeployed stack will magically break after a certain time.

Alternative idea (but much more involved and potentially not Free Tier): a Lambda that runs every N hours, enumerates all stacks, detects the assets that are in use (from stack parameters or in some other way) and clears out the rest?

Rico Hermans · Answer 5 · Wed Dec 12 2018 17:06:49 GMT+0800 (China Standard Time)

Now that I think about it, versioning would only help if we were regularly re-uploading files. That's probably not the case?

I thought you meant expiration for non-current versions. The latest version still stays alive indefinitely.

But I think this runs afoul of reuse across stacks.

Elad Ben-Israel · Answer 6 · Wed Dec 12 2018 17:08:23 GMT+0800 (China Standard Time)

👍 on garbage collecting lambda that runs every week or so, with ability to opt out and some cost warnings on docs and online

sam · Answer 7 · Wed Dec 12 2018 17:11:10 GMT+0800 (China Standard Time)

Seems risky. What if a deployment happens while crawling?

Elad Ben-Israel · Answer 8 · Wed Dec 12 2018 17:15:25 GMT+0800 (China Standard Time)

Yeah perhaps only collect old assets (month old) and we can salt the object key such that if a whole month had passed, it will be a new object?

Thinking out loud.... requires a design

Kevin Brown · Answer 9 · Mon Jun 22 2020 09:23:43 GMT+0800 (China Standard Time)

I've got quite a few assets in my bucket now after a month or so of deploying from CD.

How do I determine which ones are in use, even manually? I can't seem to figure out the correlation between the names of the files in S3 and anything else I could use to determine what's being used. The lambdas don't point back at them in any way I can see.

I want to eventually write a script to do this safely for my use case, but absent a way of telling what's being used I'm stuck.

Nicklas Ansman · Answer 10 · Fri Feb 12 2021 03:09:06 GMT+0800 (China Standard Time)

S3 now has lifecycle rules that can automatically delete objects a number of days after creation which might be a solution too.

Mathias Lafeldt · Answer 11 · Fri Feb 12 2021 03:21:00 GMT+0800 (China Standard Time)

How do I determine which ones are in use, even manually?

The ones in use are those referenced in active CloudFormation stacks deployed via CDK.

Those stack templates will include something like this:

    "GenFeedFunc959C5085": {
      "Type": "AWS::Lambda::Function",
      "Properties": {
        "Code": {
          "S3Bucket": {
            "Fn::Sub": "cdk-xxx-assets-${AWS::AccountId}-${AWS::Region}"
          },
          "S3Key": "5946c35f6797cf45370f262c1e5992bc54d8e7dd824e3d5aa32152a2d1e85e5d.zip"
        },

S3 now has lifecycle rules that can automatically delete objects a number of days after creation which might be a solution too.

Unfortunately, that won't help since old objects might still be in use, e.g. when a Lambda wasn't deployed in a while.

(It doesn't help that all assets are stored in the same "folder" either.)

Nicklas Ansman · Answer 12 · Fri Feb 12 2021 03:22:21 GMT+0800 (China Standard Time)

Doesn't lambda copy the resources during deployment?

Mathias Lafeldt · Answer 13 · Fri Feb 12 2021 03:23:44 GMT+0800 (China Standard Time)

Doesn't lambda copy the resources during deployment?

The Lambda service will cache functions, but AFAIK there's no guarantee that they will be cached forever.

Nicklas Ansman · Answer 14 · Fri Feb 12 2021 03:31:33 GMT+0800 (China Standard Time)

I'm fairly certain that Lambda only reads the assets during deployment and that they aren't needed afterwards. You can for example deploy to lambda without using S3 for smaller assets and those aren't stored in there.

Mathias Lafeldt · Answer 15 · Fri Feb 12 2021 03:41:21 GMT+0800 (China Standard Time)

Would love to read about the behavior in the docs somewhere. That 50 MB upload limit must exist for a reason. Haven't found anything so far though.

Nicklas Ansman · Answer 16 · Fri Feb 12 2021 03:59:22 GMT+0800 (China Standard Time)

I can't find any concrete resources on this, but I haven't found any docs mentioning that you cannot delete the object after deployment. Also, my lambdas don't have any permission to read my staging bucket, nor do they mention that the object is from S3, so I doubt it's required to keep the object around.

kjpgit · Answer 17 · Thu Apr 08 2021 05:15:35 GMT+0800 (China Standard Time)

I would suggest writing s3 objects as transient/YYYY/MM/DD/HASH.zip and having a lifecycle policy to remove transient/* files after 3 days. You'd get caching for builds done in the same day. This is similar to the salted hash suggested, but a lot more explicit, observable, and not subject to collisions. Also, the hash could stay the same day to day, so as to not trigger a spurious Lambda redeploy.

The main issue here is you need to pick your date / prefix only once, and stick with that, for the whole build process. You don't want to upload files at Tue 23:59 and have a later process/function look in Wednesday for the object. Perhaps just having a --transient-asset-prefix argument to cdk deploy would be enough?

Option 2 for S3 is to check the last-modified date on objects, and just force a re-upload every N days. Then you could have a lifecycle policy to delete old objects (> N+1 days), and that shouldn't race with deploys. You'd need to be 100% sure the re-upload doesn't race with S3's lifecycle process, however. That's why I prefer the immutable objects in option 1, there is no chance of a race.
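
A rough sketch of option 1, assuming the date is captured once at the start of the build and passed to every step. The function name transient_key is hypothetical, and the --transient-asset-prefix flag mentioned above is a suggestion in this comment, not an existing CLI option:

```python
from datetime import date


def transient_key(content_hash, build_date):
    """Build a key of the form transient/YYYY/MM/DD/HASH.zip for a fixed build date."""
    return f"transient/{build_date:%Y/%m/%d}/{content_hash}.zip"


if __name__ == "__main__":
    # Capture the date once; every upload and lookup in this build reuses it,
    # so a build that crosses midnight still resolves to the same prefix.
    build_date = date.today()
    print(transient_key("5946c35f", build_date))
```

Passing build_date explicitly, rather than calling date.today() inside the function, is what avoids the Tuesday 23:59 versus Wednesday mismatch.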

ECR is different because it's not transient, it is the long term backing store for Lambda. Just spitballing here, but if there was a "transient ECR repo" with a 7 day deletion policy, you could push new builds to that, and then during cloudformation deploy, those images would then be "copied/hardlinked" to a "runtime ECR repo" with lifetime managed by cloudformation, e.g. removed upon stack update/delete. Maybe the same thing could be accomplished if cloudformation could set ECR Tags that "pin" images in the transient repo that are in use, and tagged images are excluded from lifecycle rules. However, to avoid races, builds have to push something (e.g. at least some tiny change) to the transient ECR repo to refresh the image age (at least if the image is older than a day), so it won't be deleted right before cloudformation starts tracking / pinning it.

Mark Nielsen · Answer 18 · Fri May 07 2021 02:27:28 GMT+0800 (China Standard Time)

Word of caution: doing just a time based deletion on assets is a little risky. We have had this scenario play out:

A stack is deployed and is healthy.
The stack's assets are removed from S3.
The stack is then updated at a later date (new assets in S3), but the update fails and triggers a rollback.
The rollback fails because CFN looks for the old assets in the prior template.
This puts your stack into the UPDATE_ROLLBACK_FAILED state. For us, we were able to get out of it by carefully skipping some rollback on resources. It's a rather scary state to have a stack in TBH.

So, unsure how you would even accomplish this, but ideally don't delete any S3 assets that are referenced in existing CFN templates.

Mark Nielsen · Answer 19 · Fri May 07 2021 02:33:45 GMT+0800 (China Standard Time)

So, unsure how you would even accomplish this, but ideally don't delete any S3 assets that are referenced in existing CFN templates.

Sorry, was reading quickly - sounds like the RFC is going to try to do this, so yay!

Rehan van der Merwe · Answer 20 · Fri May 07 2021 15:44:33 GMT+0800 (China Standard Time)

If the only reason is to prevent CFN from "freezing" when a rollback is required, then a deletion policy of, say, 1 month should be okay, given that we never go longer than a month between deployments.

So, assuming that we deploy much more often than once a month, we can rest assured that no old assets are referenced within the current CFN, so that if a rollback does occur, the assets will still be present within the bucket for it to complete.

Rehan van der Merwe · Answer 21 · Fri May 07 2021 15:58:58 GMT+0800 (China Standard Time)

We are having a discussion about this on the CDK slack here (link might not work in the near future): https://cdk-dev.slack.com/archives/C018XT6REKT/p1620291773488400

One solution a member (Julien Peron) is using, even though not ideal he said:

for this, i tried using tags on resources in cdk bucket in combination with life cycle policy. is not fully flawless, but may help you.
It’s quite simple;
I use one tag with 3 possible values:
-current
-previous
-outdated
At each deploy, I apply “current” value. If tag is already current, it becomes “previous”. If previous, it becomes “outdated”.
Then the lifecycle policy deletes all “outdated” tagged objects

For the resource tagging, I tag all resources in the bucket at every deployment. I don’t have much resources and as the life cycle policy deletes the old ones, it’s quite fast.
This method is just a one shot and is totally not bulletproof, but I believe there is something smart to do.
In my case, I didn’t even add any checks before tagging. Which means, even if deploy fails, resources are tagged, which is not good.
But yeah, just throwing the idea here to help.
Here is a gist with the kind of script I’m using to tag: https://gist.github.com/julienperon/a048603f50ffe092a952d39672357618

Maybe his method, if refined, could work?
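
For illustration, the tag rotation quoted above can be read as a small state machine. This is only a sketch of the description, not the linked gist; next_tag is a hypothetical name:

```python
# Transition applied to every object in the bucket at each deploy,
# following the quoted description: untagged -> current -> previous -> outdated.
TRANSITIONS = {
    None: "current",
    "current": "previous",
    "previous": "outdated",
    "outdated": "outdated",
}


def next_tag(tag):
    """Return the tag an object receives at the next deploy."""
    return TRANSITIONS[tag]


if __name__ == "__main__":
    tag = None
    for deploy in range(1, 4):
        tag = next_tag(tag)
        print(deploy, tag)
```

Read literally, an object that is still referenced but never re-uploaded also ages to "outdated" after three deploys, which may be part of why the author calls the method not bulletproof.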

Rehan van der Merwe · Answer 22 · Thu Aug 12 2021 19:07:33 GMT+0800 (China Standard Time)

We are approaching 0.5TB of assets in the staging bucket. I can only imagine how much large companies have :(

Jonathan Goldwasser · Answer 23 · Wed Feb 02 2022 00:22:36 GMT+0800 (China Standard Time)

See https://github.com/jogold/cloudstructs/blob/master/src/toolkit-cleaner/README.md for a working construct that does asset garbage collection.

Adrian Hesketh · Answer 24 · Wed Feb 02 2022 17:27:29 GMT+0800 (China Standard Time)

In my projects, we use separate AWS accounts for each environment (testing, staging, production).

The use of assets somewhat leads you down the path of one branch per environment in your CI/CD pipeline.

A typical process might be:

Commit application and infrastructure code to the main branch.
This triggers a CI workflow in Github Actions or similar.
The CI workflow builds the application code that's in the branch, and runs tests.
The workflow then assumes a deployment role using the Github OIDC connector, or already has an IAM role in the case of using AWS CodeBuild etc.
CDK builds and pushes Docker containers or Lambda entrypoints (as zip files).
The team has 3 AWS accounts, one for each branch:
- main: testing AWS account
- staging: staging AWS account
- production: production AWS account
To release code to the staging environment, the team merges or rebases from the previous branch:
- main [merged to] → staging [merged into] → production
This merge triggers the deployment to the next environment.

Problems with this approach

At each deployment stage (main, staging, production), a full cycle of build and test is executed even though it was already built at the previous stage, wasting build minutes (and dev time since they're sometimes waiting for it).
Duplicate copies of the Docker containers and Lambda zip files are built and deployed to each AWS environment as assets, wasting storage, and more build minutes / time.
The Docker container for each version is not necessarily the same, since apt-get and other commands might produce different versions of dependencies.
Discrepancies between application code package versions could creep in, e.g. Node.js applications that use a "greater than" syntax for versions ^1.0.0. We saw this with recent incidents like left-pad and, more recently, faker.js.
All of the images are in the ecr-assets repo for each account, with human-unfriendly tags that make it hard to work out which image belongs to which application. I had to write a script to find out which image matches with which product, and clean out old ones.
- https://github.com/a-h/cdk-ecr-asset-cleaner
If the team has merge commits, the commit hash is different, so version numbers may also be different.
It's hard to have a rollback mechanism, to move back to a previous version, e.g. you can't identify the previous version of the container.

Suggestions

The best thing about DockerImageAssets is the simplicity of it, and the low effort to get started. It's great to just have one command to run to build and deploy everything, but I don't see how to use it for multi-environment deploys without accepting that the built software might be different to the version you tested in another environment.

Traditional workflows involve building the application code separately from the infrastructure, which we want to avoid, because the infrastructure and software are often tightly coupled, e.g. a new feature needs to push to a newly created SQS queue, or use a new DynamoDB table.

I propose optimising for a workflow that's similar to this:

The main branch builds the application code.
The main branch then builds Dockerfiles and Lambda entrypoints.
- For Docker assets, images are built.
- For Lambda assets, zip files are built.
The built assets are then pushed to a central repo:
- For Docker assets, this might be Github Packages or similar (so that the build process doesn't need to have any access to AWS, or a separate build AWS account).
- For Lambda assets, this might be an S3 bucket, or Github Packages containing the zips.
Docker images are tagged with a version number, e.g.

export APP_VERSION="v0.0.$(git rev-list --count HEAD)-$(git rev-parse --short HEAD)"

Lambda zips would be placed in a directory of the stack and version number in S3.
The CDK code is configured to use the specific Docker image tags and Lambda zip locations of the assets that were just pushed.
The main branch CI pipeline packages the CDK code into a Docker container and tags that too.
At this point, we have a CDK Docker image, and various assets:
- project:v1.0.0-cdk
- project:v1.0.0-asset-1-code
- project:v1.0.0-asset-2-code
- s3://name/stack/v1.0.0/function-name.zip
The CDK Docker container would be configured to only deploy the matching application code containers and zips so that the application and infrastructure code don't diverge.
Deployment can then be taken care of by a separate process which just runs the project:v1.0.0-cdk Docker image.
- Only this CI process needs access to an AWS role.
- This CI process would tag Docker images in the central repository with the account ID and account alias (name), so that you could see that the image was being used in the testing / staging environment etc.
- To deploy the newly built project to another environment, e.g. staging, you'd point a deployment action in the staging environment to deploy the project:v1.0.0-cdk Docker image.

This workflow:

Reduces the number of asset builds: we only build the images and Lambda function zips once.
Reduces the amount of asset storage: we only store each asset once.
Guarantees that exactly the same assets are used in each environment.
Enables the use of ECR and similar lifecycle rules, since any images tagged with account names (testing, staging etc.) can be protected.
Makes rollback simpler, since you can redeploy past versions by deploying a different version of the CDK container.
Makes it possible to identify which images are in use, and where, by viewing the Docker image list or listing the S3 buckets for Lambda functions.
Gives friendly names to assets.

Christopher Piggott · Answer 25 · Mon Jan 09 2023 23:06:27 GMT+0800 (China Standard Time)

I think the issue with manually deleting artifacts out of the staging bucket is that there isn't an easy way to tell what's still in use. Part of the reason for this is that any single stack description .json may refer to lots of artifact .zip files including very old ones that haven't changed but are still part of the stack. This is an even worse problem if you have multiple stacks sharing a single staging directory.

I have three stacks sharing the same bootstrap/artifact directory. In retrospect this was a mistake but I didn't think to separate them when I started. Trying to work out a manual way to do this, the thought I have is to run get-template for each of your active stacks and note what resource files (.zip) are called out, for example:

  "stack1234lambdas3triggerC83C4999": {
   "Type": "AWS::Lambda::Function",
   "Properties": {
    "Code": {
     "S3Bucket": {
      "Fn::Sub": "cdk-hnb659fds-assets-${AWS::AccountId}-us-east-1"
     },
     "S3Key": "b202ff26ae16f03f3f28d7e48d3aeb9d47201b7084e2361973f7ccdb1d3b78ed.zip"
    },

I think if you collect all those S3Keys you'll have the list of what's still being called out, and you can remove the other artifact bundles without breaking CDK. There's still the problem of the old templates (I have about 400 .json template files currently), but I'm not sure whether or not you need to save those at all. It seems like a new one gets uploaded every time. My thought here is that those files are safe to delete just by date.
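
Collecting the S3Key values from one template, as described above, could be sketched like this. It assumes the template has already been fetched and parsed into a dict, and it only recognises the S3Key shape shown in the excerpt; other asset types may reference the bucket differently. The name referenced_s3_keys is hypothetical:

```python
def referenced_s3_keys(template):
    """Recursively collect every string value stored under an "S3Key" property."""
    keys = set()

    def walk(node):
        if isinstance(node, dict):
            key = node.get("S3Key")
            if isinstance(key, str):
                keys.add(key)
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for value in node:
                walk(value)

    walk(template)
    return keys


if __name__ == "__main__":
    template = {
        "Resources": {
            "stack1234lambdas3triggerC83C4999": {
                "Type": "AWS::Lambda::Function",
                "Properties": {
                    "Code": {
                        "S3Bucket": {"Fn::Sub": "cdk-hnb659fds-assets-${AWS::AccountId}-us-east-1"},
                        "S3Key": "b202ff26ae16f03f3f28d7e48d3aeb9d47201b7084e2361973f7ccdb1d3b78ed.zip",
                    }
                },
            }
        }
    }
    print(sorted(referenced_s3_keys(template)))
```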

Jonathan Weaver · Answer 26 · Sat Feb 11 2023 13:24:30 GMT+0800 (China Standard Time)

Hi - I was pointed to this issue / feature request by AWS Support. I see comments which pertain to cost concerns for the various assets being retained indefinitely, but wanted to raise another concern.

Our development team uses CDK for various services, and the indefinite retention of assets had gone unnoticed until I discovered it this morning. I operate in a security capacity, and recently implemented AWS Inspector continuous scanning on container images. Findings are configured to bubble up to Security Hub.

To my surprise, my Security Hub was flooded with findings for CVEs. I have thousands upon (tens of?) thousands of findings, specifically due to the lack of garbage collection of CDK ECR images. This is going to cause a constant game of whack-a-mole until:

I whitelist these images from the scanning (which I think is possible, and which maybe I should just go ahead and do?)
Garbage collection is implemented

Regardless, I wanted to bring another story for why garbage collection would be appreciated.

Adrian Hesketh · Answer 27 · Sun Feb 12 2023 18:33:56 GMT+0800 (China Standard Time)

@createchange - that's how I initially noticed the problem too.

ECR's cleanup is a bit silly, in that it can be configured to delete "old" containers, but doesn't have any understanding of whether they're in use or not, so I wrote a tool that only deletes containers that are not in use by Lambda and ECS Task definitions.

https://github.com/a-h/cdk-ecr-asset-cleaner

It doesn't tackle the buildup of S3 related assets, but it might be useful for you.
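
The idea of deleting only images that nothing uses can be sketched as a set difference. This is not the linked tool's code; the function name and the in-memory inputs are assumptions, and gathering the image URIs from Lambda and ECS is left out:

```python
def unused_images(repo_image_uris, lambda_image_uris, ecs_image_uris):
    """Return images in the repository that no Lambda function or ECS task definition uses."""
    in_use = set(lambda_image_uris) | set(ecs_image_uris)
    return sorted(set(repo_image_uris) - in_use)


if __name__ == "__main__":
    repo = ["repo:aaa", "repo:bbb", "repo:ccc"]
    print(unused_images(repo, lambda_image_uris=["repo:aaa"], ecs_image_uris=["repo:ccc"]))
```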

eddie-atkinson · Answer 28 · Wed Apr 26 2023 13:22:34 GMT+0800 (China Standard Time)

Has there been any movement on this? My Dev stack accumulated 144 undeleted images within a few days of people pushing test images. This involves quite significant costs which the DockerImageAsset construct does not make easy to resolve.

Sam Stephens · Answer 29 · Mon May 15 2023 09:48:42 GMT+0800 (China Standard Time)

Thanks @createchange for pointing that out. I just encountered exactly the same issue.

I'd argue that this brings some security urgency to addressing this issue. The flood of spurious security alerts for images that potentially have not been deployed for years makes AWS Inspector basically unusable to protect our CDK assets in ECR.

Note there's another problem for the security inspector, and that is that it is very difficult to attribute what CDK component actually generated a given CDK assets image. This means that if a vulnerability is reported, we have to engage in detective work to find where it should be remediated.

Vincent · Answer 30 · Thu Jun 08 2023 05:32:36 GMT+0800 (China Standard Time)

We also experience this problem: hundreds of images in ECR of which only a handful are currently being used. Please provide some way to prevent this from spinning out of control!

Niv Stolarski · Answer 31 · Sat Jul 22 2023 05:08:13 GMT+0800 (China Standard Time)

We also experience the same problem - thousands of images in ECR without any easy way to filter non-used images.

tobiasfeil · Answer 32 · Mon Aug 07 2023 19:38:02 GMT+0800 (China Standard Time)

Is this seriously not implemented? I'd rather use plain CFN then, except I've already migrated our whole stack to CDK and implemented new features in that context - only to realize that this vital piece of functionality, which I would have never expected to be missing, isn't there.

Torben · Answer 33 · Mon Aug 07 2023 19:56:00 GMT+0800 (China Standard Time)

https://docs.aws.amazon.com/cdk/api/v2/docs/app-staging-synthesizer-alpha-readme.html

Christopher Piggott · Answer 34 · Thu Sep 28 2023 21:47:12 GMT+0800 (China Standard Time)

I come back to this issue every few months to see if it has moved. I think the thing that goes wrong with time based deletion (as explained pretty well by @polothy) is that during a rollback the asset you're rolling back to might not be there any more, so the rollback will fail. So I think the thing we need to be able to do is figure out which assets are actually being used, not when we're deploying something new but right now.

If you go to CloudFormation in the console you can look at the active template. For example, one of my running stacks has a bunch of Lambda functions in it, and each one refers to a specific asset:

    "Code": {
     "S3Bucket": "cdk-hnb659fds-assets-334383426254369-us-east-1",
     "S3Key": "0a22be02f1325321515f5e129533e7f874b5128ea692082d328bf493c1e63.zip"
    },

If you have multiple stacks, what you would have to do to clean things up would be to look at ALL of your deployed stacks, and build a list of S3Keys that are referenced in any of them. Then, you could delete anything else.

My asset zip files are all lambdas. I'm not sure how other types of assets show up because I don't have any. But if I'm right about this, can we make an external cdk-gc program that does what I'm describing?

Pull all the active stacks in the account and look for S3Bucket that points to the bucket you want to clean up, and build a list of S3Keys referenced by all stacks
Delete every artifact that's not in that unified, merged list

Is that 100% safe? Somebody tell me why it's not.
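
The two steps above amount to a set difference. A minimal sketch with in-memory inputs instead of AWS calls follows; the grace period is an extra assumption, borrowed from the earlier suggestion to only collect old assets so that a deployment running during the scan is less likely to lose a freshly uploaded object. The name plan_gc is hypothetical:

```python
from datetime import datetime, timedelta, timezone


def plan_gc(bucket_objects, keys_by_stack, now, grace=timedelta(days=30)):
    """Return keys that no stack references and that are older than the grace period.

    bucket_objects maps object key -> last-modified datetime.
    keys_by_stack maps stack name -> set of S3Keys found in its deployed template.
    """
    referenced = set().union(*keys_by_stack.values())
    cutoff = now - grace
    return sorted(
        key
        for key, last_modified in bucket_objects.items()
        if key not in referenced and last_modified < cutoff
    )


if __name__ == "__main__":
    now = datetime(2023, 10, 1, tzinfo=timezone.utc)
    objects = {
        "old-unused.zip": now - timedelta(days=90),
        "old-in-use.zip": now - timedelta(days=90),
        "fresh-upload.zip": now - timedelta(hours=1),
    }
    stacks = {"stack-a": {"old-in-use.zip"}, "stack-b": set()}
    print(plan_gc(objects, stacks, now))
```

The function only produces a list; reviewing that list before deleting anything matches the --list mode proposed in the issue description.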

Hugo Lewenhaupt · Answer 35 · Mon Oct 02 2023 22:02:50 GMT+0800 (China Standard Time)

I come back to this issue every few months to see if it has moved. I think the thing that goes wrong with time based deletion (as explained pretty well by @polothy) is that during a rollback the asset you're rolling back to might not be there any more, so the rollback will fail. So I think the thing we need to be able to do is figure out which assets are actually being used, not when we're deploying something new but right now.

If you go to CloudFormation in the console you can look at the active template. For example, one of my running stacks has a bunch of Lambda functions in it, and each one refers to a specific asset:

    "Code": {
     "S3Bucket": "cdk-hnb659fds-assets-334383426254369-us-east-1",
     "S3Key": "0a22be02f1325321515f5e129533e7f874b5128ea692082d328bf493c1e63.zip"
    },

If you have multiple stacks, what you would have to do to clean things up would be to look at ALL of your deployed stacks, and build a list of S3Keys that are referenced in any of them. Then, you could delete anything else.

My asset zip files are all lambdas. I'm not sure how other types of assets show up because I don't have any. But if I'm right about this, can we make an external cdk-gc program that does what I'm describing?

Pull all the active stacks in the account and look for S3Bucket that points to the bucket you want to clean up, and build a list of S3Keys referenced by all stacks

Delete every artifact that's not in that unified, merged list

Is that 100% safe? Somebody tell me why it's not.

Seems reasonable and should take care of the rollback issues described earlier?

Marcelo Luiz Onhate · Answer 36 · Sat Oct 14 2023 03:53:16 GMT+0800 (China Standard Time)

I've made this little CLI (via npx) that implements the logic above. I have run it on a dev account we have and it is looking OK so far.

https://github.com/onhate/cdk-gc

Yuki Ito · Answer 37 · Sun Oct 15 2023 08:16:03 GMT+0800 (China Standard Time)

Why should I have assets locally as well that I can download from S3? If the upload to S3 is successful, the local stuff can be deleted, and there should be a mechanism to download from S3 when needed, maybe?

awsmjs · Answer 38 · Fri Dec 15 2023 07:05:54 GMT+0800 (China Standard Time)

Closing this ticket as it does not align with current priorities. We don't have the bandwidth to collaborate on design or implementation.

Christopher Piggott · Answer 39 · Fri Dec 15 2023 07:12:23 GMT+0800 (China Standard Time)

I completely understand.

Sam Stephens · Answer 40 · Fri Dec 15 2023 07:24:24 GMT+0800 (China Standard Time)

@awsmjs sorry, but this is pathetic. AWS has given us a solution that consumes unbounded resources (which should have never passed a design review), and then consider fixing that to not be a priority? As I've said earlier on this issue, this has security implications:

The flood of spurious security alerts for images that potentially have not been deployed for years makes AWS Inspector basically unusable to protect our CDK assets in ECR.

Christopher Piggott · Answer 41 · Fri Dec 15 2023 07:33:17 GMT+0800 (China Standard Time)

I have been thinking this might be better off as a standalone tool anyway, and the https://github.com/onhate/cdk-gc tool mentioned above looks like a good start to me. What might be nice would be to work out the kinks and then maybe merging it into the cdk as core functionality would be more palatable.

Hurtov Oleksii · Answer 42 · Fri Dec 15 2023 20:25:24 GMT+0800 (China Standard Time)

@awsmjs Seriously? We're sitting on over a terabyte of useless CDK assets, accumulated over the past year. It's costing us more than $23 a month for absolutely no reason. We're not just burning cash, but also hogging storage we don't need. In addition, there are the security aspects mentioned earlier. Considering these factors, don't you think you should prioritize addressing this issue?

Sam Stephens · Answer 43 · Sat Dec 16 2023 05:07:18 GMT+0800 (China Standard Time)

@awsmjs it's also interesting to speculate on the environmental impact of all the wasted storage caused by this decision, considering people's concern with the environmental impact of cloud computing.

Andrew · Answer 44 · Mon Dec 18 2023 16:29:52 GMT+0800 (China Standard Time)

@awsmjs, since when do the CDK team's priorities not align with Amazon Leadership Principles?

This is deliberately causing AWS customers to spend money unnecessarily.

Sholto Maud · Answer 45 · Mon Dec 18 2023 16:47:27 GMT+0800 (China Standard Time)

hey @awsmjs here is an idea ... umm ... maybe hire some more staff in either/both the cdk team and/or the AI team and fix this shit. $8k/yr for a CDK Landingzone is just dumb, now we don't even get garbage collection because what? Don't cry poor. Fix it and don't piss people off. You have responsibilities now so bad luck.

Vincent · Answer 46 · Mon Dec 18 2023 17:38:06 GMT+0800 (China Standard Time)

No need for the aggressive tone; we don't know if @awsmjs has any authority over hiring.

But I do agree that, since CDK is a tool of the for-profit AWS product, maintainers should aim higher than the average volunteer-run open-source project.

I understand that staffing is a constraint and priorities have to be made, but CDK team should then inform whoever is responsible for staffing, that the CDK team size is inadequate vis-à-vis its responsibilities. And also, put such tickets on some backlog then, instead of closing them as if they do not represent a real need from the customers who are -in the end- the ones paying for this.

Adrian Hesketh · Answer 47 · Mon Dec 18 2023 17:48:02 GMT+0800 (China Standard Time)

For S3 assets, I now have 1.8TB of junk in my test account. The staging account probably has 1/3 of that amount, and the production account 1/3 of the amount of the staging account.

So, in total, I'm probably wasting around 2.5TB of S3 storage on this project.

At $0.023 per GB per month: 2.5TB * 1024GB/TB = 2560GB, and 2560GB * $0.023 = $58.88 per month, or around $700 per year.
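
The same arithmetic as a short script, using only the figures quoted in this comment (the per-GB price is the commenter's number, not looked up here):

```python
PRICE_PER_GB_MONTH = 0.023  # USD per GB per month, as quoted in the comment
GB_PER_TB = 1024


def monthly_cost(terabytes):
    """Monthly storage cost in USD for the given number of terabytes."""
    return terabytes * GB_PER_TB * PRICE_PER_GB_MONTH


def yearly_cost(terabytes):
    """Yearly storage cost in USD: twelve monthly bills."""
    return 12 * monthly_cost(terabytes)


if __name__ == "__main__":
    print(f"{monthly_cost(2.5):.2f} per month, {yearly_cost(2.5):.2f} per year")
```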

It's not a lot of money, especially compared to CloudWatch or AWS Config, but it is a waste of some money and I don't like the idea that this will continue increasing, forever, with no prospect of being reduced.

CDK's NPM stats show 1,209,474 downloads per week. Out of those, I assume a lot are waste, but I can make a guess that 1 in 20 downloads give me a usage count of around 60k teams using CDK every week. Out of those teams, I'm probably in the top 3% of use, so that's 1800 teams with the same or greater level of use than me.

If they're in the same position, that's $700 per year * 1800 teams = around $1,260,000 of waste. I don't know what the value of other roadmap items looks like, but to me it looks like it's worth putting 2 people on it for a year - although it would be a net loss for AWS revenue. 😁
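For anyone who wants to plug in their own numbers, here is the same back-of-the-envelope estimate as a small Python sketch. The price, the 1-in-20 ratio and the 3% share are the guesses from this comment, not measured values, and the function and variable names are only for illustration.

```python
# Back-of-the-envelope estimate using the figures quoted in this comment.
PRICE_PER_GB_MONTH = 0.023  # S3 price per GB per month assumed above
GB_PER_TB = 1024


def yearly_cost_usd(terabytes: float) -> float:
    """Yearly storage cost for a constant amount of stored data."""
    return terabytes * GB_PER_TB * PRICE_PER_GB_MONTH * 12


per_team = yearly_cost_usd(2.5)       # cost of 2.5TB of stale assets per year

weekly_downloads = 1_209_474
teams = weekly_downloads / 20         # guess: 1 in 20 downloads is a team
heavy_teams = teams * 0.03            # guess: top 3% of usage
total = per_team * heavy_teams

print(f"per team per year: ${per_team:,.2f}")
print(f"teams: {teams:,.0f}, heavy teams: {heavy_teams:,.0f}")
print(f"estimated waste per year: ${total:,.0f}")
```

Without the rounding to $700 and 1800 teams, the total comes out slightly higher than $1,260,000, but it stays in the same ballpark.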

For ECR, I'm using my custom cleaner https://github.com/a-h/cdk-ecr-asset-cleaner which means I'm still at 28 Docker containers. I would like for AWS to take that implementation and use it for inspiration to build an official CDK version.

Andrew · Answer 48 · Mon Dec 18 2023 17:49:33 GMT+0800 (China Standard Time)

hey @awsmjs here is an idea ... umm ... maybe hire some more staff in either/both the cdk team and/or the AI team and fix this shit. $8k/yr for a CDK Landingzone is just dumb, now we don't even get garbage collection because what? Don't cry poor. Fix it and don't piss people off. You have responsibilities now so bad luck.

Yes. all of that +1

Roman · Answer 49 · Mon Dec 18 2023 17:49:46 GMT+0800 (China Standard Time)

I wholly agree with the majority here - this is a real pain and should be addressed ASAP.

However, the incentives are not there, as AWS has a direct benefit from wasted storage. Thus systems thinking will suggest that there is no incentive to prioritize fixing this. The only lever we have as a community is to find a counter-incentive that will apply pressure. I think posting comments will not cut it. I think it's a good start to voice your opinion, but I am just saying it won't be enough to change the status quo.

Meanwhile, for those who are hurting, and might be interested in alternative solutions today, I recommend looking at these two solutions:

ToolkitCleaner - from @jogold, a big contributor to this project. We've been using this solution for a year now, and thus far have not encountered any issues.
App Staging Synthesizer - this is an alternative synthesizer, which may not be compatible with every workflow yet, notably pipelines, but worth a shot if you can use it. I think it is fine to use at least for sandbox and throwaway accounts. It has the "Auto Delete Staging Assets on Deletion" and "Lifecycle Rules on ECR Repositories" features, which help with old assets.

Andrew · Answer 50 · Mon Dec 18 2023 17:50:39 GMT+0800 (China Standard Time)

No need for the aggressive tone; we don't know if @awsmjs has any authority over hiring.

But I do agree that, since CDK is a tool of the for-profit AWS product, maintainers should aim higher than the average volunteer-run open-source project.

I understand that staffing is a constraint and priorities have to be made, but CDK team should then inform whoever is responsible for staffing, that the CDK team size is inadequate vis-à-vis its responsibilities. And also, put such tickets on some backlog then, instead of closing them as if they do not represent a real need from the customers who are -in the end- the ones paying for this.

People have been polite for long enough. We spend millions of dollars a year with AWS. Time to put customers' interests first.

Roman · Answer 51 · Mon Dec 18 2023 18:04:40 GMT+0800 (China Standard Time)

💡 Idea for a counter-incentive

As a community, we'll create a CDK construct and a shared CloudFormation template, which will provide minimal access to get the total size of the asset bucket. I think this can be done through just a CloudWatch metric (BucketSizeBytes), filtered by a tag.

Then we will create a central account where we will keep a tally of the sizes of all of the buckets out there of all of the participants of this experiment. I understand that many will not be able to participate due to corporate and security limitations, so it's not the full picture, of course.

We will have a public website, where we will host the BIG NUMBER, and explain the situation.

This will allow us, as a community, to measure the impact of this misfeature. If the impact is significant, then I think it would be easy to get the attention of the shareholders through the grapevine (Twitter/X, blog posts, conference talks, press) and thus apply pressure on AWS to allocate more resources to fix this.
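To make the metric lookup and the tally described above a bit more concrete, here is a minimal Python sketch. It only builds the request parameters and sums up results; the function names and the participant names are hypothetical. As far as I know, the BucketSizeBytes metric is keyed by bucket name and storage type rather than by tag, so mapping a tag to the bucket would need a separate step that is not shown here.

```python
from datetime import datetime, timedelta, timezone


def bucket_size_query(bucket_name: str, now: datetime) -> dict:
    """Build keyword arguments for CloudWatch GetMetricStatistics."""
    return {
        "Namespace": "AWS/S3",
        "MetricName": "BucketSizeBytes",
        "Dimensions": [
            {"Name": "BucketName", "Value": bucket_name},
            # Assumption: only Standard storage is counted in this sketch.
            {"Name": "StorageType", "Value": "StandardStorage"},
        ],
        "StartTime": now - timedelta(days=2),
        "EndTime": now,
        "Period": 86400,  # one-day periods; storage metrics are daily
        "Statistics": ["Average"],
    }


def latest_size_bytes(datapoints: list) -> float:
    """Return the most recent Average from a list of datapoints."""
    if not datapoints:
        return 0.0
    return max(datapoints, key=lambda d: d["Timestamp"])["Average"]


def tally(sizes_by_participant: dict) -> float:
    """Sum the reported bucket sizes: the BIG NUMBER, in bytes."""
    return sum(sizes_by_participant.values())


query = bucket_size_query("my-asset-bucket", datetime.now(timezone.utc))
```

With boto3 installed and credentials configured, the query would be sent with `boto3.client("cloudwatch").get_metric_statistics(**query)`, and the `Datapoints` list of the response passed to `latest_size_bytes`.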

eddie-atkinson · Answer 52 · Mon Dec 18 2023 19:16:28 GMT+0800 (China Standard Time)

Look folks I appreciate that this is frustrating, but can we please not flame someone who's trying to do their job.

After a few months of reflection the solution for my use case of this construct was to separate the build and deploy stages of my service.

Essentially I had two CDK stacks, one of which provisioned an ECR repo to which I could apply a lifecycle policy on images, and the other which deployed my service.

Then when it came to deployment I simply deployed my first stack, parsed out the ECR repo ARN and piped it into the call to cdk deploy of the other stack using CDK config. That way no CloudFormation dependency was created between the two stacks, but I could also build and deploy an image to a service in one fell swoop. This also has the neat side effect that if you need to redeploy your service due to a config change, you don't need to rebuild your image, which was the most time-consuming part of the process for me.

This approach is more in line with the build, release, run philosophy of the 12 Factor App, which I broadly agree with.

Tomasz Trębski · Answer 53 · Wed Jan 24 2024 06:49:18 GMT+0800 (China Standard Time)

Personally, I have refined the default bootstrap solution that prepares accounts with the CDK toolkit. It includes reasonable retention policies, and I find it does a very good job.

Voyta Krizek · Answer 54 · Thu Mar 07 2024 19:01:09 GMT+0800 (China Standard Time)

It seems several solutions clean up the CDK staging bucket by listing deployed CloudFormation templates. This might not be ideal when multiple CDK projects are used in a single AWS account and we want a different setup per project.

I have tried a different approach that modifies the Default Stack Synthesizer to use an S3 object key prefix, and uses the cloud assembly output of the CDK app's synth method to determine which assets should be kept after a successful deployment.

The full description and example Java code is at https://github.com/NewTownData/events-monolith/blob/main/infrastructure/docs/clean-up.md
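For readers who do not want to read the Java code, the general idea can be sketched in a few lines of Python. This is not the linked implementation, just an illustration under assumptions: it assumes the synthesized cloud assembly contains `*.assets.json` manifests whose `files` entries list `destinations` with an `objectKey`, and the function names and the `prefix` parameter are hypothetical.

```python
import json
from pathlib import Path


def referenced_keys(cloud_assembly_dir: str) -> set:
    """Collect the S3 object keys named in the asset manifests."""
    keys = set()
    # rglob also picks up manifests in nested assembly directories.
    for manifest in Path(cloud_assembly_dir).rglob("*.assets.json"):
        data = json.loads(manifest.read_text())
        for asset in data.get("files", {}).values():
            for dest in asset.get("destinations", {}).values():
                keys.add(dest["objectKey"])
    return keys


def orphaned_keys(bucket_keys, keep, prefix=""):
    """Keys under this project's prefix that the current synth no longer uses."""
    return sorted(
        k for k in bucket_keys if k.startswith(prefix) and k not in keep
    )
```

Here `bucket_keys` would come from listing the staging bucket, and the per-project prefix keeps one project from deleting another project's assets. Deleting should only happen after a successful deployment, as described above.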

Mischa Spiegelmock · Answer 55 · Thu Mar 14 2024 09:46:26 GMT+0800 (China Standard Time)

Just speaking for myself here, I have a small team and just noticed we're wasting 500GB on CDK assets in various buckets.
I emptied the one for dev envs and everything seems to still work fine.
I think it can be solved with a lifecycle policy on the bucket. CDK itself creates the bucket so maybe it could just apply some policy. Maybe just delete something if it's over 6mo old to be conservative?
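A rough sketch of what such a policy could look like, using boto3. The 180-day value is just the 6 months mentioned above, and the function names are made up for illustration. One thing to watch: an age-based rule does not know whether a deployed stack still references an old asset, so a stack that has not been redeployed for longer than the cutoff could lose assets it still needs.

```python
def expire_old_assets_rule(days: int = 180) -> dict:
    """Lifecycle configuration that expires objects after `days` days."""
    return {
        "Rules": [
            {
                "ID": f"expire-assets-after-{days}-days",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix: whole bucket
                "Expiration": {"Days": days},
                # On a versioned bucket, also clean up old versions.
                "NoncurrentVersionExpiration": {"NoncurrentDays": days},
            }
        ]
    }


def apply_rule(bucket_name: str, days: int = 180) -> None:
    """Apply the rule; this replaces any existing lifecycle configuration."""
    import boto3  # third-party; imported here so the sketch loads without it

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket_name,
        LifecycleConfiguration=expire_old_assets_rule(days),
    )
```

Calling `apply_rule("my-asset-bucket")` with suitable credentials would set the policy on an existing bucket; having CDK apply it at bootstrap time would be a separate change.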