Windows/EC2: ssm-document-worker process remains after service stop, causing IPC file access errors + DeliveryTimedOut
rgoltz opened this issue · comments
Describe the bug
- At the moment we see a recurring but intermittent issue with our SSM Agents running on Windows (normal EC2 instances).
- We are using SSM Agent to execute Systems Manager Documents via scheduled Associations (= Run Command).
- When we hit the issue, the Association Execution in State Manager (AWS Console) shows the Detailed Status "DeliveryTimedOut" for the affected target.
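The affected targets can also be listed without clicking through the console. The following is only a minimal sketch in Python, assuming boto3 is installed and credentials are configured. The function names are made up for illustration, and the association and execution IDs have to be supplied by the caller.

```python
def timed_out_targets(targets):
    """Return the resource IDs whose DetailedStatus is DeliveryTimedOut."""
    return [
        t["ResourceId"]
        for t in targets
        if t.get("DetailedStatus") == "DeliveryTimedOut"
    ]


def fetch_targets(association_id, execution_id):
    """Collect all targets of one association execution (needs boto3)."""
    import boto3  # imported lazily so the filter above works without boto3

    ssm = boto3.client("ssm")
    targets, token = [], None
    while True:
        kwargs = {"AssociationId": association_id, "ExecutionId": execution_id}
        if token:
            kwargs["NextToken"] = token
        page = ssm.describe_association_execution_targets(**kwargs)
        targets.extend(page.get("AssociationExecutionTargets", []))
        token = page.get("NextToken")
        if not token:
            return targets
```

Calling `timed_out_targets(fetch_targets(association_id, execution_id))` would then give the instance IDs to look at on the Windows side.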
Current Behavior
- As stated before, we apply a Document to a number of Windows targets. A small share of them end up with the Detailed Status DeliveryTimedOut.
- When we checked the local SSM Agent on Windows, we found the following pattern across the affected EC2 instances:
a) First, we checked amazon-ssm-agent.log and found the following errors:
2024-05-20 04:01:15 INFO [CredentialRefresher] Credentials ready
2024-05-20 04:01:15 INFO [CredentialRefresher] Next credential rotation will be in 29.999736375 minutes
2024-05-20 04:10:20 ERROR [amazon-ssm-agent] Error occurred while removing the IPC file: remove C:\ProgramData\Amazon\SSM\InstanceData\i-123456789awsIdEc2\channels\health\surveyor-20240519200113-476: The process cannot access the file because it is being used by another process.
2024-05-20 04:10:44 ERROR [amazon-ssm-agent] message C:\ProgramData\Amazon\SSM\InstanceData\i-123456789awsIdEc2\channels\health\respondent-20240519200116-473 failed to read: open C:\ProgramData\Amazon\SSM\InstanceData\i-123456789awsIdEc2\channels\health\respondent-20240519200116-473: The process cannot access the file because it is being used by another process.
2024-05-20 04:11:21 ERROR [amazon-ssm-agent] Error occurred while removing the IPC file: remove C:\ProgramData\Amazon\SSM\InstanceData\i-123456789awsIdEc2\channels\health\surveyor-20240519200113-477: The process cannot access the file because it is being used by another process.
2024-05-20 04:12:23 ERROR [amazon-ssm-agent] Error occurred while removing the IPC file: remove C:\ProgramData\Amazon\SSM\InstanceData\i-123456789awsIdEc2\channels\health\surveyor-20240519200113-478: The process cannot access the file because it is being used by another process.
2024-05-20 04:31:15 INFO EC2RoleProvider Successfully connected with instance profile role credentials
2024-05-20 04:31:16 INFO [CredentialRefresher] Credentials ready
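To spot this pattern on many instances, the log can be scanned for the IPC file-lock errors shown above. This is a rough sketch in Python. The regular expression is derived only from the log lines in this report, so other agent versions may format the message differently, and `find_ipc_lock_errors` is a hypothetical helper name.

```python
import re

# Matches the "being used by another process" errors on the health channel
# IPC files (surveyor-* / respondent-*) as they appear in the excerpt above.
IPC_ERROR = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) ERROR \[amazon-ssm-agent\] .*"
    r"\\channels\\health\\(?P<file>(?:surveyor|respondent)-[\d-]+).*"
    r"being used by another process"
)


def find_ipc_lock_errors(lines):
    """Return (timestamp, ipc_file_name) for every locked-IPC-file error."""
    hits = []
    for line in lines:
        match = IPC_ERROR.search(line)
        if match:
            hits.append((match.group("ts"), match.group("file")))
    return hits
```

It can be fed with the lines of amazon-ssm-agent.log, for example via `find_ipc_lock_errors(open(path, encoding="utf-8", errors="replace"))`, where `path` points at the agent log on the instance.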
b) If we then stop the Windows service "Amazon SSM Agent", the "ssm-document-worker" process remains in the Task Manager process list!
c) If we now start the Windows service "Amazon SSM Agent" again, the [headless] "ssm-document-worker" process remains permanently. A second "ssm-document-worker" only shows up once a Document is executed, so there are sometimes two "ssm-document-worker" processes. The issue with DeliveryTimedOut remains:
It seems this remaining (zombie) "ssm-document-worker" process is locking some internal files and blocking further execution of documents/Run Commands on this target!
From then on, all Associations/Run Commands sent to an instance in this state stay "Pending" for a long time and then end in Failed with "DeliveryTimedOut".
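The leftover process from b) and c) can also be checked for from a script instead of Task Manager. A small sketch in Python, assuming the worker's image name is `ssm-document-worker.exe` and using the standard Windows `tasklist` command with CSV output; the helper names are hypothetical.

```python
import csv
import io
import subprocess

WORKER_IMAGE = "ssm-document-worker.exe"  # assumed image name of the worker


def parse_tasklist_csv(output):
    """Return (image_name, pid) tuples from `tasklist /FO CSV /NH` output."""
    rows = []
    for row in csv.reader(io.StringIO(output)):
        # The "no tasks are running" notice has no numeric PID column; skip it.
        if len(row) >= 2 and row[1].isdigit():
            rows.append((row[0], int(row[1])))
    return rows


def list_document_workers():
    """Ask Windows for all running worker processes (Windows only)."""
    completed = subprocess.run(
        ["tasklist", "/FI", f"IMAGENAME eq {WORKER_IMAGE}", "/FO", "CSV", "/NH"],
        capture_output=True,
        text=True,
    )
    return parse_tasklist_csv(completed.stdout)
```

If `list_document_workers()` still returns an entry while the "Amazon SSM Agent" service is stopped, the instance is in the state described above.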
Workaround:
d) We need to stop the Windows service "Amazon SSM Agent" and kill the "ssm-document-worker" process via Task Manager using "End task". Afterwards we start the Windows service "Amazon SSM Agent" again and apply the association again. It works right away (since the permanent, headless "ssm-document-worker" process is gone). In short: stop the service and kill any remaining ssm-document-worker processes.
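The manual steps in d) could be scripted roughly like this. It is only a sketch in Python, assuming the service is registered as `AmazonSSMAgent` and the worker image is `ssm-document-worker.exe`; it must run in an elevated session, and the function names are made up for illustration.

```python
import subprocess

SERVICE = "AmazonSSMAgent"                # assumed Windows service name
WORKER_IMAGE = "ssm-document-worker.exe"  # assumed image name of the worker


def workaround_commands():
    """The three steps of the workaround, in order."""
    return [
        ["net", "stop", SERVICE],
        ["taskkill", "/F", "/IM", WORKER_IMAGE],
        ["net", "start", SERVICE],
    ]


def apply_workaround(run=subprocess.run):
    """Run each step and return (command, return code) pairs."""
    results = []
    for cmd in workaround_commands():
        # taskkill exits non-zero when no worker is left over; that is fine,
        # so the return code is only recorded and not treated as fatal.
        completed = run(cmd, capture_output=True, text=True)
        results.append((cmd, completed.returncode))
    return results
```

Since the service is stopped first, any worker still running at the second step should be a leftover one, which is why the sketch kills by image name rather than by PID.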
Expected Behavior:
The instances ran these Documents/Associations for many months without this bug pattern. We assume it may have started with the update from 3.3.380.0 to 3.3.418.0 (in our case on 15 May 2024), but we are not sure about this. At least we see a growing number of occurrences. That said, we do not expect any DeliveryTimedOut results at all while the Windows service is in status Running.
OS Version / Host
OS: Microsoft Windows Server 2019 Datacenter (Platform-Version: 10.0.17763)
Host: EC2 Instance with IMDSv1 (Managed-Instance)
SSM Agent Version
Amazon SSM Agent Version: 3.3.418.0
Other information
I've opened AWS case 171620678600565 with the SSM team. We are sharing full logs and more details (region, instance ID, etc.) in that case. Feel free to request more details here as well; I'll do my best to upload them in an anonymized way.