aws / amazon-ssm-agent

An agent to enable remote management of your EC2 instances, on-premises servers, or virtual machines (VMs).

Home Page: https://aws.amazon.com/systems-manager/

Windows/EC2: ssm-document-worker process remains after service stop, resulting in "cannot access IPC file" errors + DeliveryTimedOut

rgoltz opened this issue · comments

Describe the bug

  • At the moment we see a recurring but intermittent issue with our SSM Agents running on Windows (regular EC2 instances).
  • We are using SSM Agent to execute Systems Manager Documents via scheduled Associations (= Run Command).
  • When we hit the issue, the affected target shows the Detailed Status "DeliveryTimedOut" for the Association Execution in State Manager in the AWS Console.

Current Behavior

  • As stated before, we apply a Document to a number of Windows targets. A small subset of them ends in Detailed Status: DeliveryTimedOut
    [Screenshot: 00_AwsConsoleTimeout]

  • When we checked the local SSM Agent on Windows, we found the following pattern across the affected EC2 instances:

a) First, we checked amazon-ssm-agent.log and found the following messages and errors:

2024-05-20 04:01:15 INFO [CredentialRefresher] Credentials ready
2024-05-20 04:01:15 INFO [CredentialRefresher] Next credential rotation will be in 29.999736375 minutes
2024-05-20 04:10:20 ERROR [amazon-ssm-agent] Error occurred while removing the IPC file: remove C:\ProgramData\Amazon\SSM\InstanceData\i-123456789awsIdEc2\channels\health\surveyor-20240519200113-476: The process cannot access the file because it is being used by another process.
2024-05-20 04:10:44 ERROR [amazon-ssm-agent] message C:\ProgramData\Amazon\SSM\InstanceData\i-123456789awsIdEc2\channels\health\respondent-20240519200116-473 failed to read: open C:\ProgramData\Amazon\SSM\InstanceData\i-123456789awsIdEc2\channels\health\respondent-20240519200116-473: The process cannot access the file because it is being used by another process. 
2024-05-20 04:11:21 ERROR [amazon-ssm-agent] Error occurred while removing the IPC file: remove C:\ProgramData\Amazon\SSM\InstanceData\i-123456789awsIdEc2\channels\health\surveyor-20240519200113-477: The process cannot access the file because it is being used by another process.
2024-05-20 04:12:23 ERROR [amazon-ssm-agent] Error occurred while removing the IPC file: remove C:\ProgramData\Amazon\SSM\InstanceData\i-123456789awsIdEc2\channels\health\surveyor-20240519200113-478: The process cannot access the file because it is being used by another process.
2024-05-20 04:31:15 INFO EC2RoleProvider Successfully connected with instance profile role credentials
2024-05-20 04:31:16 INFO [CredentialRefresher] Credentials ready
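
To check several instances for the same pattern, the file-in-use errors can be pulled out of the log with a short script. The following is a minimal sketch in Python; the regular expression is an assumption derived only from the two ERROR shapes in the excerpt above (it is not an official log format), and it assumes the file paths contain no spaces. The function name `find_locked_ipc_files` is hypothetical.

```python
import re

# Assumption: pattern derived from the ERROR lines in the excerpt above.
# It expects "remove <path>: ..." or "open <path>: ..." with a path
# that contains no spaces.
LOCK_ERROR = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) ERROR \[amazon-ssm-agent\] .*?"
    r"(?:remove|open) (?P<path>\S+): The process cannot access the file"
)


def find_locked_ipc_files(lines):
    """Return (timestamp, path) pairs for every file-in-use error line."""
    hits = []
    for line in lines:
        match = LOCK_ERROR.search(line)
        if match:
            hits.append((match.group("ts"), match.group("path")))
    return hits
```

Feeding it the lines of amazon-ssm-agent.log (for example via `open(path, encoding="utf-8", errors="replace")`) yields one entry per locked surveyor/respondent file, which makes it easy to see when the errors started.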

b) If we then stop the Windows service "Amazon SSM Agent", the "ssm-document-worker" process remains in the Task Manager process list!
[Screenshot: 01_StoppedWithRunningWorkerSSM]

c) If we now start the Windows service "Amazon SSM Agent" again, the [headless] "ssm-document-worker" process remains permanently. A second "ssm-document-worker" only shows up once a Document is executed, so there are sometimes two "ssm-document-worker" processes. The issue with DeliveryTimedOut remains:
[Screenshot: 02_AfterStopStartSSM]

It seems this remaining (zombie) "ssm-document-worker" process is locking some internal files and blocking further execution of documents/run commands on this target!
Now, again, all Associations/Run Commands sent to an instance in this state stay "Pending" for a long time and then end in Failed with "DeliveryTimedOut".
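
The check from steps b) and c) can also be done without Task Manager. Below is a minimal Python sketch that asks the Windows `tasklist` command for worker processes and parses its CSV output. It assumes the process image is named `ssm-document-worker.exe`; the helper names `parse_tasklist_csv` and `list_worker_pids` are hypothetical.

```python
import csv
import io
import subprocess

# Assumption: image name of the worker process as shown in Task Manager.
WORKER_IMAGE = "ssm-document-worker.exe"


def parse_tasklist_csv(output):
    """Extract worker PIDs from `tasklist /FO CSV /NH` output."""
    pids = []
    for row in csv.reader(io.StringIO(output)):
        # Data rows have the shape: image, PID, session name, session#, memory.
        # The "INFO: No tasks are running ..." notice has a single field
        # and is skipped by the length check.
        if len(row) >= 2 and row[0].lower() == WORKER_IMAGE and row[1].isdigit():
            pids.append(int(row[1]))
    return pids


def list_worker_pids():
    """Windows only: return the PIDs of all running worker processes."""
    result = subprocess.run(
        ["tasklist", "/FI", f"IMAGENAME eq {WORKER_IMAGE}", "/FO", "CSV", "/NH"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_tasklist_csv(result.stdout)
```

If `list_worker_pids()` returns a non-empty list while the "Amazon SSM Agent" service is stopped, the instance is in the state described above.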

Workaround:

d) We need to stop the Windows service "Amazon SSM Agent" and kill the "ssm-document-worker" process via Task Manager using "End task". Afterwards we start the Windows service "Amazon SSM Agent" again and apply the association again. It works right away (since the permanent, headless "ssm-document-worker" process is gone). In short: stop the service and kill any remaining ssm-document-worker processes.
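
For more than a handful of instances, the manual steps in d) could be scripted. The following Python sketch only illustrates the order of operations and is not an official fix. It assumes the service behind the display name "Amazon SSM Agent" is registered as `AmazonSSMAgent` and that the worker image is `ssm-document-worker.exe`; the function names are hypothetical. It must run in an elevated session on the affected Windows instance.

```python
import subprocess

# Assumptions: service name and worker image name, as described in the lead-in.
SERVICE_NAME = "AmazonSSMAgent"
WORKER_IMAGE = "ssm-document-worker.exe"


def workaround_commands():
    """Return the three workaround steps in the order they must run."""
    return [
        ["net", "stop", SERVICE_NAME],            # stop the agent service
        ["taskkill", "/F", "/IM", WORKER_IMAGE],  # kill any leftover workers
        ["net", "start", SERVICE_NAME],           # start the agent service again
    ]


def apply_workaround(run=subprocess.run):
    """Run each step and return a list of (command, return code) pairs.

    Non-zero return codes are reported rather than raised, because a step
    can fail harmlessly (for example, taskkill when no worker is left).
    """
    results = []
    for cmd in workaround_commands():
        completed = run(cmd, capture_output=True, text=True)
        results.append((cmd, completed.returncode))
    return results
```

Calling `apply_workaround()` and checking the returned codes shows which step, if any, did not succeed. The association still has to be applied again afterwards, as described above.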

Expected Behavior:

The instances with those Documents/Associations have been running for many, many months without this bug pattern. We assume it may have started with the update from 3.3.380.0 to 3.3.418.0 (in our case on 15 May 2024), but we are not sure about this. At least we see a growing number of issues. That said, we do not expect these DeliveryTimedOut results at all as long as the Windows service is in status Running.

OS Version / Host

OS: Microsoft Windows Server 2019 Datacenter (Platform-Version: 10.0.17763)
Host: EC2 Instance with IMDSv1 (Managed-Instance)

SSM Agent Version

Amazon SSM Agent Version: 3.3.418.0

Other information

I've opened AWS case 171620678600565 with the SSM team. We are sharing full logs and more details (region, instance ID, etc.) in that case. Feel free to request more details here as well - I'll do my best to upload them in an anonymized way.