aws-solutions / instance-scheduler-on-aws

A cross-account and cross-region solution that allows customers to automatically start and stop EC2 and RDS Instances

Home Page: https://aws.amazon.com/solutions/implementations/instance-scheduler-on-aws/


Report Negative Scheduling Patterns

neilromatowski0373 opened this issue · comments

Hi,

I have come across an issue with a few instances that shut down but failed to restart. With the help of support, we used CloudTrail to isolate the cause of the failure to restart: a permissions issue, with the service role not having access to the KMS key assigned to the storage.

The logs did show an unexpected state, see below. The instance was expected to be running at the time but was, and remained, in a stopped state. This DEBUG message was written to the log every hour, on each attempt (the check is executed hourly).

DEBUG : Desired state for instance from schedule "office-hours-uk" is running, last desired state was running, actual state is stopped

I have attached a copy of the Event Log.

log-events-viewer-result.csv

I am looking for a way to report on this type of pattern, maybe to SNS, so that we can proactively react to any potential issues within the linked account.
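One way to surface this pattern could be to scan exported log events for the DEBUG message quoted above and count how often a schedule reports "running / running / stopped". The following is a minimal sketch, not part of the solution itself: the regular expression assumes the message format is exactly as quoted in this issue, and the function names and the `threshold` parameter are hypothetical. Because the quoted message carries no instance ID, the sketch can only group by schedule name.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Optional

# Assumption: the message format matches the DEBUG line quoted in this issue.
STATE_LINE = re.compile(
    r'Desired state for instance from schedule "(?P<schedule>[^"]+)" '
    r"is (?P<desired>\w+), "
    r"last desired state was (?P<last_desired>\w+), "
    r"actual state is (?P<actual>\w+)"
)


def parse_state_line(line: str) -> Optional[Dict[str, str]]:
    """Extract the schedule name and the three states, or None if no match."""
    match = STATE_LINE.search(line)
    return match.groupdict() if match else None


def count_unstarted(lines: Iterable[str]) -> Counter:
    """Count, per schedule, messages where the schedule wants 'running'
    (now and at the last check) but the instance is 'stopped'."""
    counts: Counter = Counter()
    for line in lines:
        states = parse_state_line(line)
        if states is None:
            continue
        if (
            states["desired"] == "running"
            and states["last_desired"] == "running"
            and states["actual"] == "stopped"
        ):
            counts[states["schedule"]] += 1
    return counts


def schedules_to_report(lines: Iterable[str], threshold: int = 3) -> List[str]:
    """Return schedules seen in this state at least `threshold` times."""
    return sorted(s for s, n in count_unstarted(lines).items() if n >= threshold)
```

The returned schedule names could then be sent to an SNS topic, for example with boto3's `publish` call on an SNS client. Note that the same message also appears when an instance was stopped manually, so a repeat threshold only narrows the candidates; it does not prove a start failure.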

image

Hi @neilromatowski0373

Instances not being able to start due to KMS permissions is a common issue that is described in our Implementation Guide (IG) here:
https://docs.aws.amazon.com/solutions/latest/instance-scheduler-on-aws/troubleshooting.html#encrypted-ec2-instances-not-starting

As for the debug log you are referring to, this is actually a perfectly normal state of the solution. The log refers to three distinct values used in a scheduling decision: the current state of the schedule ("desired state"), the state of the schedule during the last execution ("last desired state"), and the current state of the instance ("actual state").

Seeing that the current and previous states of the schedule are both "running" and the instance's actual state is "stopped" indicates to the scheduler that the instance was stopped for some reason outside of normal scheduling. This can be either due to a start failure as is the case in your situation, or more commonly, due to the instance being stopped by manual user intervention outside of the normal operation of the schedule. Normal behavior in this scenario is to take no action (we attempt to avoid overwriting manual user action), but you can override this behavior by setting the "enforced" flag on the schedule to true. This will cause the scheduler to always take action when the current schedule state differs from the actual instance state regardless of whether we detect the instance as having been started/stopped by manual action.
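The decision rule described above can be sketched roughly as follows. This is a simplified model written for illustration, not the solution's actual code: the function name `decide_action`, its parameters, and the assumption that states are only "running" or "stopped" are all hypothetical.

```python
from typing import Optional


def decide_action(
    desired: str,
    last_desired: str,
    actual: str,
    enforced: bool = False,
) -> Optional[str]:
    """Return "start", "stop", or None (take no action).

    desired      -- current state of the schedule
    last_desired -- state of the schedule during the last execution
    actual       -- current state of the instance
    enforced     -- the schedule's "enforced" flag
    """
    if desired == actual:
        # The instance already matches the schedule.
        return None

    schedule_changed = desired != last_desired
    if not schedule_changed and not enforced:
        # The schedule has not changed, yet the instance differs: it was
        # changed outside normal scheduling (manual action or a failed
        # start), so the manual change is left alone.
        return None

    return "start" if desired == "running" else "stop"
```

With the values from the log line in this issue, `decide_action("running", "running", "stopped")` returns `None`, while passing `enforced=True` returns `"start"`.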

We are currently planning to improve the clarity of these debug messages in an upcoming release.

Hi @CrypticCabub

Thank you for your prompt and considered response. I appreciate that. I think we need to build in a check to ensure that the role has permissions to KMS as part of our initial set-up. Stop the problem from occurring in the first place :). As for the logic, it makes sense not to 'intervene'. I'm not keen on enforcing the schedule, as there could be valid reasons for an instance being stopped.

I was hoping that we could leverage a centralized log or view to report on scheduler state, maybe as an extension to the App Insights dashboard.

image

We have a large number of linked accounts. Admittedly we are still at a very early stage of our instance scheduling implementation, but we have been on the back foot, reacting to users telling us that their instances (the KMS scenario ones) have gone down but are not coming back on schedule.

Thank you for taking the time to listen and get back to me.

No problem! It looks like you are already using the per-schedule metrics, which should be able to provide some of the info you are looking for. As I indirectly mentioned in my previous response, we are currently evaluating ways we can improve observability for the solution and would love any additional feedback on how our customers want to be able to observe/monitor the solution, as well as what they are currently doing for this purpose.

Closing due to inactivity. Please reopen if you have any further questions/feedback.