cloud-custodian / cloud-custodian

Rules engine for cloud security, cost optimization, and governance, DSL in yaml for policies to query, filter, and take actions on resources

Home Page:https://cloudcustodian.io

How does garbage collection get enforced?

davidclin opened this issue · comments

I recently re-purposed a use case policy described in EBS - Garbage Collect Unattached Volumes using the security-group resource and have questions on its usage.

Question 1 : marked-for-op Delete Enforcement
When I run my policy, I can see my unused security groups get tagged with the key:value pair "maid_status:Resource does not meet policy: delete@2018/05/12"

How does the deletion get enforced? How does one verify?

I have set my policy to delete after 1 day, but my marked security-groups are still present.

Question 2: marked-for-op Delete Action Granularity
Is there an option to set deletion in minutes/hours?

For example, is this supported:

    actions:
      - type: mark-for-op
        op: delete
        minutes: 1

Note: I would only use this type of granularity for testing or when I was highly confident of the consequences.

Question 3: marked-for-op Delete Action Window
Suppose my policy runs multiple times, for example (1) from a cron job on an EC2 instance, (2) as a Lambda with a CloudWatch scheduled event source (e.g. a fixed rate of minutes/hours/days, or a cron expression), or (3) manually from the terminal. If my delete action window is set to 1 day, does that window get reset every time the policy runs?

Question 4: marked-for-op Delete Action General Usage
How is the marked-for-op delete action used in practice?
Do users run their policies manually, periodically, and/or scheduled?
When run periodically or scheduled, how does one prevent the action window from getting reset after being tagged for deletion? Conversely, how does one override the action window to force it to reset?

Sorry for the noob questions. I'm starting to get the hang of this but still need some help.

This is my present policy as reference:

unused-security-group-cleanup.yml
policies:
  - name: mark-unused-security-groups-for-deletion
    resource: security-group
    description: |
      Mark unused security groups for deletion in X days.
      A mark is a tag that gets created for each
      unused security group.
      The key/value pair takes on the following attributes:
      key = maid_status
      value = 'Resource does not meet policy: delete@year/month/day'
    filters:
      - unused
      - type: value
        key: GroupName
        op: regex
        value: .*launch-wizard.*
    actions:
      - type: mark-for-op
        op: delete
        days: 1
  - name: delete-marked-security-groups
    resource: security-group
    description: |
      Delete security groups marked for deletion
    filters:
      - type: marked-for-op
        op: delete
    actions:
      - delete
      - type: notify
        template: sgroup-notify.html
        template_format: 'html'
        priority_header: '5'
        subject: 'CloudCustodian: Unused Security Groups'
        to:
          - email@address.com
        owner_absent_contact:
          - emaill@address.com
        transport:
          type: sqs
          queue: https://sqs.us-east-1.amazonaws.com/1234567890/sandbox
commented
  1. using mark-for-op requires you to run an additional policy that filters for resources that have been marked for an op. run custodian schema ec2.filters.marked-for-op for an example. you'll need to run that policy on the day you want the marked operation to execute.
  2. mark-for-op supports hours right now
  3. if a resource is marked, it shouldn't get marked again
  4. see marked-for-op in the schema command. typically you can run the policy on a cron, as the marked-for-op filter will only pull in resources that are marked for operation on that day/hour. To override the action window, simply use the remove-tag action to remove the tag that mark-for-op applied
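
To make the tag format concrete, here is a minimal Python sketch of how a mark value such as "Resource does not meet policy: delete@2018/05/12" can be composed and read back. The helper names build_mark and parse_mark are hypothetical and are not Cloud Custodian APIs; the sketch only handles the day-precision format shown in the question and assumes the policy ran on 2018/05/11.

```python
from datetime import date, datetime, timedelta

# Message text as it appears in the tag quoted in the question.
DEFAULT_MSG = "Resource does not meet policy"


def build_mark(op, days, today, msg=DEFAULT_MSG):
    """Compose a tag value of the form '<msg>: <op>@<YYYY/MM/DD>' (hypothetical helper)."""
    due = today + timedelta(days=days)
    return f"{msg}: {op}@{due.strftime('%Y/%m/%d')}"


def parse_mark(value):
    """Split a tag value back into (op, due_date) (hypothetical helper)."""
    # Everything after the last ':' is the '<op>@<date>' part.
    _, target = value.rsplit(":", 1)
    op, due_str = target.strip().split("@", 1)
    return op, datetime.strptime(due_str, "%Y/%m/%d").date()


# Assumed run date: marking on 2018/05/11 with days=1 yields a 2018/05/12 mark.
value = build_mark("delete", days=1, today=date(2018, 5, 11))
print(value)
print(parse_mark(value))
```

The point of the sketch is that the tag is plain data: it records an op and a date, and does nothing until a second policy reads it.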

Thanks for the explanation.

When was the mark-for-op support for hours introduced?

This is the version of Cloud Custodian I'm running:

(custodian)$ custodian version
0.8.28.2

I'm getting an error when I validate my policy specifying hours for the mark-for-op action.
The schema is displaying days only.

custodian schema security-group.actions.mark-for-op
(custodian)$ custodian schema security-group.actions.mark-for-op
Schema
------
{
    "additionalProperties": false,
    "required": [
        "type"
    ],
    "type": "object",
    "properties": {
        "msg": {
            "type": "string"
        },
        "tag": {
            "type": "string"
        },
        "type": {
            "enum": [
                "mark-for-op"
            ]
        },
        "days": {
            "exclusiveMinimum": false,
            "minimum": 0,
            "type": "integer"
        },
        "op": {
            "type": "string"
        }
    }
}
custodian validate foo.yml
(custodian)$ custodian validate unused-security-group-cleanup.yml
2018-05-14 17:28:44,650: custodian.commands:ERROR Configuration invalid: unused-security-group-cleanup.yml
2018-05-14 17:28:44,651: custodian.commands:ERROR Additional properties are not allowed ('hours' was unexpected)

Failed validating u'additionalProperties' in schema[11]:
{u'additionalProperties': False,
u'properties': {'days': {u'exclusiveMinimum': False,
u'minimum': 0,
u'type': u'integer'},
'msg': {u'type': u'string'},
'op': {u'type': u'string'},
'tag': {u'type': u'string'},
u'type': {u'enum': [u'mark-for-op']}},
u'required': [u'type'],
u'type': u'object'}

On instance:
{'hours': 1, 'op': 'delete', 'type': 'mark-for-op'}
2018-05-14 17:28:44,651: custodian.commands:ERROR mark-unused-security-groups-for-deletion

commented

it's in master right now, it was implemented 19 days ago, didn't realize it was that new! it'll be in the next release

Thanks for the confirmation!

Does anyone know when the next release is and how it gets communicated to the community?

I'm also curious if anyone knows what the numbers in the version represent.

For example, what does 0.8.28.2 mean?

Coming back to this original thread, please consider adding "minutes" support for the mark-for-op action.

Customers, 3rd party integrators, and partners (such as myself) who are in early development, testing, or doing Proof-of-Concepts for business/technical decision makers can't afford to wait a day, much less an hour, to see an outcome (or, in my case, the absence of an expected result). The ability to get quick "feedback" (e.g. in minutes versus hours/days) allows me to diagnose mark-for-op policies with higher confidence, catch typos/human errors more quickly, and demonstrate a minimally viable product to business leaders more quickly.

At the time of this writing, I still don't know why my mark-for-op policy isn't working or the mechanism by which the timer gets enforced. I thought there might be a CloudWatch rule that would get instantiated, but I don't see any evidence of that. Then I figured, maybe a cron job gets spawned to do the garbage collection. I really have no idea.

Having to wait a day (or even an hour) to see if a change I make makes any difference just doesn't feel right.

I'm open to suggestions if using "minutes" is the wrong approach to testing marked-for-op related policies.

commented

Thanks for the feedback, I think the docs could use some clarification, as this seems somewhat confusing.

The next packaged release should be coming soon, the best way to keep in the loop is to either watch this repo/pypi or to subscribe to the rss feed for it: https://github.com/capitalone/cloud-custodian/releases.atom

version numbering is semver

minute precision for mark-for-op was added in #2368, which will also be in the next release. To use newer features earlier you can clone the repo and install the package via developer mode:

$ make install
$ source bin/activate
(cloud-custodian) $ custodian run ...

I think there's just some general confusion on how mark-for-op operates. mark-for-op will tag a resource with an op@date tag. In order to execute the op specified in the tag, a corresponding policy will need to be run on that date (or hour/minute after the next release):

  - name: delete-marked-security-groups
    resource: security-group
    description: |
      Delete security groups marked for deletion
    filters:
      - type: marked-for-op
        op: delete
    actions:
      - delete

How you run this policy on the specified date is up to the user. If you mark a resource for deletion tomorrow, you can run the policy from your local machine, set up a cron job on an instance that checks for marked resources every day, use mode: periodic to deploy a periodic lambda in the account, etc.
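
As a rough illustration of why nothing happens until the second policy is run, the following Python sketch models the check a marked-for-op style filter performs at run time. The function is_due is a hypothetical stand-in, not Cloud Custodian's implementation; it assumes the day-precision tag format quoted earlier in this thread and uses the on-or-before comparison the maintainers describe further down.

```python
from datetime import date, datetime


def is_due(tag_value, op, run_date):
    """Return True when the tag names `op` and its date is on or before `run_date`.

    Hypothetical helper: it models the filter's decision, not the real code.
    """
    _, target = tag_value.rsplit(":", 1)
    tag_op, due_str = target.strip().split("@", 1)
    due = datetime.strptime(due_str, "%Y/%m/%d").date()
    return tag_op == op and due <= run_date


tag = "Resource does not meet policy: delete@2018/05/12"

# The tag is only evaluated when a policy run happens; there is no timer.
# Each date below stands for one manual, cron, or periodic-lambda run.
for run_date in (date(2018, 5, 11), date(2018, 5, 12), date(2018, 5, 14)):
    print(run_date, is_due(tag, "delete", run_date))
```

Whatever scheduler is used only decides when this check is evaluated; the mark itself never triggers anything.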

Thanks for the answers and the great explanation of how mark-for-op and marked-for-op are implemented.

Now I get it! 👍

As a suggestion, I recommend updating the use case example so the policies visually appear as separate files.

As a new user, I blindly copied/pasted the example and treated everything as a single policy and expected Cloud Custodian to magically take care of my intent -- hence the confusion that ensued. Now I'm better informed and won't cross that bridge again.

Now...back to my use case.

Based on my new understanding of mark-for-op and marked-for-op, I was able to make progress by splitting my policy into two and running them separately.

It appears I can set the days to "0" then run my marked-for-op delete policy separately without any issues. It works great and as expected! 👍

I also observed that it deleted sgroups that were marked with dates prior to today.

If I didn't want that behavior, is there any way to scope the blast radius so actions are only taken on the date/time specified?

Thanks again for the great support. Loving this product.

commented

yes, resources that are marked for deletion on or before the current date will be deleted once the marked-for-op policy runs.

Right now there isn't a way to limit it to delete only on the day of. Typically the pattern that's followed is to mark a resource for action n days in the future and send a corresponding notification with the mark action. Then, in the days leading up to the day of deletion, run a policy to unmark resources if they have become compliant in the meantime. Also, it would be a good practice to run an unmark policy before a deletion policy just in case the resource is remediated the day of.

👍
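
The ordering described above (unmark first, then delete) can be simulated with a small Python sketch. Everything here is hypothetical: the resource dicts, the in_use flag, and the functions unmark_compliant and select_for_delete are stand-ins for the unmark and delete policies, not Cloud Custodian code.

```python
from datetime import date


def unmark_compliant(resources):
    """Clear the mark on resources that are back in use (stand-in for an unmark policy)."""
    for r in resources:
        if r["in_use"]:
            r["delete_on"] = None


def select_for_delete(resources, run_date):
    """Pick resources whose mark date is on or before the run date."""
    return [
        r["id"]
        for r in resources
        if r["delete_on"] is not None and r["delete_on"] <= run_date
    ]


# Hypothetical inventory: one still unused, one remediated, one not yet due.
resources = [
    {"id": "sg-unused", "in_use": False, "delete_on": date(2018, 5, 12)},
    {"id": "sg-remediated", "in_use": True, "delete_on": date(2018, 5, 12)},
    {"id": "sg-future", "in_use": False, "delete_on": date(2018, 5, 20)},
]

# Run the unmark step first so a resource remediated on the day of is spared.
unmark_compliant(resources)
print(select_for_delete(resources, date(2018, 5, 12)))
```

If the two steps ran in the opposite order, sg-remediated would already have been selected for deletion before its mark was cleared.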

Hi,

Is there any schema written for deleting all the resources in the AWS account?

Looking forward to your responses.