aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).

Home Page: https://aws.amazon.com/about-aws/whats-new/containers/

[EKS] [SSM] Install SSM agent on EKS AMI

pierresteiner opened this issue

The new managed EKS workers (https://aws.amazon.com/fr/about-aws/whats-new/2019/11/amazon-eks-adds-support-for-provisioning-and-managing-kubernetes-worker-nodes/) can only be accessed using an SSH key.

It would be much more flexible if we could use SSM to connect to them when needed. Until now, we were able to install the agent using a UserData script, but this is no longer an option for managed workers.

Tell us about your request
Manage EKS workers through the SSM agent

Which service(s) is this request for?
EKS (managed workers)

Are you currently working around this issue?
No workaround, besides not using the service.

For what it is worth, I currently work around this by having a UserData section in the worker node group creation that looks like this:

      UserData:
        Fn::Base64:
          !Sub |
            #!/bin/bash -xe
            yum install -y https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/linux_amd64/amazon-ssm-agent.rpm

You can also run the SSM agent as a DaemonSet, as per https://github.com/mumoshu/kube-ssm-agent, but I am not that fond of running containers with hostNetwork: true and privileged: true. Installing SSM with UserData also allows you to debug startup issues in case the node never joins the cluster.

But it would definitely be convenient to have the SSM agent pre-installed in the AMI and have a flag available to start it (or not) via the bootstrap script.

@dlaidlaw thanks, but that won't work for managed node_groups AFAIK: #596

@davidkarlsen Understood. This is one of the reasons we do not use managed workers. Another important one being the requirement to harden the instance as per CIS Hardening Blueprints. Some people also like to have vulnerability scanning agents and anti-virus software installed.

We recently introduced EKS with managed node groups in our company, but now we are stuck without SSM. We don't open SSH in our organization, and the only way to manage nodes is via SSM with federated users.
At least please provide a prepackaged image with the SSM agent if we cannot use user data.

@Viren you could work around that by deploying the SSM agent as DaemonSet until managed node groups allow customizations

With the release of the official EKS Best Practices Guide for Security, I hope this issue will get more attention: https://aws.github.io/aws-eks-best-practices/hosts/#minimize-access-to-worker-nodes

As mentioned in all the other issues that duplicate this one, rather than forcing everyone to run the SSM agent by putting it into the AMI, just install it using a DaemonSet because:

  1. It's a more "k8s" way to do it
  2. CPU/memory resources are accounted for properly in the cluster
  3. You can manage updates to SSM agent nicely

https://github.com/mumoshu/kube-ssm-agent

Also, would be cool to create a chart for this tool and put it in eks-charts

I don't think folks are forcing everyone to run SSM. We're looking for an option to enable it.

@max-rocket-internet While I like the DaemonSet idea, it can't offer the full utility of the SSM Agent, specifically the ability to use State Manager to configure aspects of the host. For that to work, the container would need access to the host's root filesystem. Of course, you could create a host mount from / to /mnt (or similar) in the container, but State Manager can't currently be configured to chroot into an alternate root filesystem.

I don't think folks are forcing everyone to run SSM

It was mentioned putting it in the AMI, that part I'm not keen on 🙂

We're looking for an option to enable it.

Fair enough! Makes sense. I think this is covered in #596

For that to work, the container would need access to the host's root filesystem

I'm not super familiar with State Manager, but many DaemonSets mount host directories. It's very common for system management tools: e.g., log collectors mount /var/log/pods and /var/lib/docker/containers, and Sysdig mounts /proc, /dev, /boot, /lib/modules, etc.

State Manager can't currently be configured to chroot

I don't think chroot is involved when running a container with host directories mounted. They are just mounted into the container like --volume in docker. I could be wrong though 🙂

@max-rocket-internet Ordinarily, the SSM Agent expects that it is running on the host, not in a container. And so when synchronizing state according to SSM Documents per State Manager, it expects all paths on which it is operating to be host paths: /usr is the host's /usr, /etc is the host's /etc, and so on.

As you mention, you can mount host volumes on a case-by-case basis into a container. But you can't mount the entire host filesystem as-is at the root of the container (i.e., / on the host is / inside the container). You could mount the host's root volume into a container as, say, /host, then chroot /host, and it would look like you were in the host's root volume, but SSM Agent doesn't support such behavior right now.
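To make the /host idea concrete, here is a minimal sketch of a Pod that mounts the host's root volume at /host and runs a single command under chroot. The Pod name, the image, and the command are illustrative assumptions, not anything SSM Agent provides or supports; it only shows the mount-then-chroot pattern described above.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: host-root-demo          # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: shell
      image: debian:buster      # any image that ships a chroot binary would do
      # After chroot, paths such as /etc resolve to the host's files.
      command: ["chroot", "/host", "/bin/sh", "-c", "cat /etc/os-release"]
      securityContext:
        privileged: true        # broad privileges; treat as a debugging aid only
      volumeMounts:
        - name: host-root
          mountPath: /host
  volumes:
    - name: host-root
      hostPath:
        path: /                 # the host's root filesystem
        type: Directory
```

The container itself still has its own root; only the process started through chroot sees the host's paths, which is why an agent that does not perform that step keeps operating on the container's filesystem.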

Running as a DaemonSet won't help if you're trying to debug an issue of the node not joining the cluster. The node will not have received the DaemonSet spec from the kube-apiserver.

Until the current issue (#593), #596, and #585 have been addressed, managed node groups are not an option for clusters that have both a no-ssh security requirement and a requirement for remote terminal access to the nodes via SSM.

It would be helpful to add a warning about this to the Managed Node Group documentation.

Hey all,

Our aim is to keep the EKS AMI as minimal as possible. Given that managed node groups now support launch templates (#585) and custom user data (#596), it's straightforward to install the SSM agent at node boot time. In fact, it's the exact example we used in the launch blog.
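As a rough sketch of the boot-time approach, the CloudFormation fragment below shows launch template user data that installs and starts the agent. The resource name NodeLaunchTemplate and the MIME boundary string are placeholders, and the sketch assumes the node's instance role already grants SSM permissions. Managed node groups that use the EKS optimized Amazon Linux AMI expect user data in MIME multi-part format, which is why the script is wrapped.

```yaml
# Hypothetical resource name; only the UserData portion matters here.
NodeLaunchTemplate:
  Type: AWS::EC2::LaunchTemplate
  Properties:
    LaunchTemplateData:
      UserData:
        Fn::Base64: |
          MIME-Version: 1.0
          Content-Type: multipart/mixed; boundary="==BOUNDARY=="

          --==BOUNDARY==
          Content-Type: text/x-shellscript; charset="us-ascii"

          #!/bin/bash -xe
          # Install the agent from the Amazon Linux 2 repositories,
          # then enable it at boot and start it immediately.
          yum install -y amazon-ssm-agent
          systemctl enable --now amazon-ssm-agent

          --==BOUNDARY==--
```

Because no AMI is specified in the template, EKS merges its own bootstrap section into the user data, so the script above should not call the bootstrap script itself.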

Is there still an ask for the agent to be baked into the AMI, or is user data support sufficient to meet the feature request as outlined in this issue?

My solution is a daemonset that installs a systemd unit on the host which installs the SSM agent (and a couple other configurations we need):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: eks-host-config
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: eks-host-config
  template:
    metadata:
      name: eks-host-config
      labels:
        app: eks-host-config
    spec:
      initContainers:
        - name: ssm-install-unit
          image: debian:buster
          command:
            - sh
            - -c
            - |
              set -x

              # Add unit file to install the SSM agent
              cat >/etc/systemd/system/install-ssm.service <<EOF
              [Unit]
              Description=Install the SSM agent

              [Service]
              Type=oneshot
              ExecStart=/bin/sh -c "yum install -y amazon-ssm-agent; systemctl enable amazon-ssm-agent; systemctl start amazon-ssm-agent"

              [Install]
              WantedBy=multi-user.target
              EOF

              systemctl daemon-reload
              systemctl enable install-ssm.service
              systemctl start install-ssm.service



              # Add unit file to increase inotify watches.  Some CI jobs which use inotify fail without this.
              cat >/etc/systemd/system/increase-inotify-watches.service <<EOF
              [Unit]
              Description=Increase inotify watches

              [Service]
              Type=oneshot
              ExecStart=/bin/sh -c "echo 'fs.inotify.max_user_watches = 524288' >/etc/sysctl.d/50-increase-inotify-watches.conf; sysctl --system"

              [Install]
              WantedBy=multi-user.target
              EOF

              systemctl daemon-reload
              systemctl enable increase-inotify-watches.service
              systemctl start increase-inotify-watches.service

              # Enable the docker bridge so that docker-in-docker
              # (dind) works for CI operations.  This is equivalent to
              # setting --enable-docker-bridge in the EKS userdata
              # script.  See https://github.com/awslabs/amazon-eks-ami/commit/0db49b4ed7e1d0198f9c1d9ccaab3ed2ecca8cd0
              cat >/etc/systemd/system/enable-docker-bridge.service <<EOF
              [Unit]
              Description=Enable the docker bridge

              [Service]
              Type=oneshot
              ExecStart=/bin/sh -c "if ! grep docker0 /etc/docker/daemon.json; then cp /etc/docker/daemon.json /tmp/; jq '.bridge=\"docker0\" | .\"live-restore\"=false' </tmp/daemon.json >/etc/docker/daemon.json; fi; if ! ip link show docker0; then systemctl restart docker; fi"

              [Install]
              WantedBy=multi-user.target
              EOF

              systemctl daemon-reload
              systemctl enable enable-docker-bridge.service
              systemctl start enable-docker-bridge.service

          volumeMounts:
            - name: etc-docker
              mountPath: /etc/docker
            - name: run-systemd
              mountPath: /run/systemd
            - name: etc-systemd
              mountPath: /etc/systemd/system
            - name: libgcrypt
              mountPath: /usr/lib/x86_64-linux-gnu/libgcrypt.so.11
            - name: bin-systemctl
              mountPath: /bin/systemctl
          resources:
            limits:
              cpu: 100m
              memory: 256Mi
            requests:
              cpu: 100m
              memory: 256Mi
      containers:
        - name: pause
          image: gcr.io/google_containers/pause
      volumes:

        # docker config for enabling the bridge
        - name: etc-docker
          hostPath:
            path: /etc/docker
            type: Directory

        # systemd's runtime socket
        - name: run-systemd
          hostPath:
            path: /run/systemd
            type: Directory

        # location for custom unit files
        - name: etc-systemd
          hostPath:
            path: /etc/systemd/system
            type: Directory

        # systemctl command + required lib
        - name: libgcrypt
          hostPath:
            path: /lib64/libgcrypt.so.11
            type: File
        - name: bin-systemctl
          hostPath:
            path: /bin/systemctl
            type: File

It's not super elegant but it seems to work so far and it keeps EKS satisfied that my node groups are still upgradeable.

@mikestef9 As soon as we create a new version of the launch template with customized userdata, the EKS console denies us the AMI upgrades since our configuration has diverged.

From our docs, "Existing node groups that do not use launch templates cannot be updated directly. Rather, you must create a new node group with a launch template to do so."

@mikestef9 as I mentioned on another issue about SSM and EKS AMIs, not having the SSM agent is inconsistent both with the default AL2 AMI and with the stated intention to "keep the EKS AMI as minimal as possible".

The EKS AMI still includes an SSH server. I could buy the "minimal" argument if the idea is also to remove the SSH server and include no remote access tooling at all unless the user installs it.

If it feels wrong to remove SSH (and my guess is it will), then we have to ask why. Is it just about remote access? Then the default option provided should be the most auditable and most secure access route, which at this point seems to be SSM. Unless it's not?

@mikestef9 great news, I've been waiting for a while for launch configurations on managed node groups, will give it a try with the ssm agent.

I want to add that DaemonSets aren't the solution.

We want SSM for SSH replacement, inventory, hardening, and patch sets.

The problem with daemonsets is that it runs a container. Hardening? Hardens the container. Inventory? Inventory of the container.

Using an alternative DaemonSet to run host-level stuff requires permissions that are not feasible and is subject to race conditions.

The problem with daemonsets is that it runs a container. Hardening? Hardens the container. Inventory? Inventory of the container.

This is not necessarily true. Privileged containers can escape to the host by calling nsenter -t 1. And, in fact, this is exactly how tools like https://github.com/kvaps/kubectl-node-shell work.

As long as the entrypoint for the agent does the right thing, it should work just fine.
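The nsenter approach can be sketched as a one-shot Pod like the one below. This is not the manifest kubectl-node-shell generates; the Pod name, the image, the nodeName value, and the command run on the host are all assumptions chosen for illustration. The key pieces are hostPID: true, so that PID 1 is the host's init process, and a privileged container that enters that process's namespaces.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: node-shell-demo                 # hypothetical name
spec:
  nodeName: ip-10-0-0-1.ec2.internal    # placeholder; set to the node to inspect
  hostPID: true                         # share the host PID namespace
  restartPolicy: Never
  containers:
    - name: nsenter
      image: debian:buster              # ships nsenter as part of util-linux
      securityContext:
        privileged: true
      # Enter the namespaces of host PID 1, then run a command on the host.
      command:
        - nsenter
        - --target
        - "1"
        - --mount
        - --uts
        - --ipc
        - --net
        - --pid
        - --
        - systemctl
        - status
        - amazon-ssm-agent
```

Anything run after the `--` executes in the host's namespaces, which is what lets a containerized entrypoint operate on the node rather than on the container.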

None of this is a good replacement for simply adding SSM to the AMI, considering that the base, ECS, Beanstalk, and other AMIs all have it ready to go.

Aside from that, the DaemonSet is hacky anyway: it mounts crontab and injects a per-minute cron job to install the RPM, meaning that it's constantly attempting the install even when it doesn't need to:

        command: ["/bin/bash","-c","echo '* * * * * root rpm -e amazon-ssm-agent-2.3.1550.0-1.x86_64 && yum install -y https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/linux_amd64/amazon-ssm-agent.rpm && systemctl restart amazon-ssm-agent && rm -rf /etc/cron.d/ssmstart' > /etc/cron.d/ssmstart && echo 'Successfully installed SSM agent'"]

from here

We will be adding the SSM agent to the EKS AL2 AMI in a future release. Moved to "We're working on it".

I've run into a workload on our cluster that messes with the node enough that kubelet dies and the node enters NodeNotReady state. Not sure why yet, but it's not super relevant for this comment. I'm noting it here because when using a managed node group with no ssh key there's literally no way to access the node that I'm aware of to debug the issues and having SSM installed as an escape hatch would be great. That's just to say that I'm looking forward to the agent being installed by default!

Hey all,

The SSM agent is now installed and enabled by default in the latest release of the EKS Optimized Amazon Linux AMI:

https://github.com/awslabs/amazon-eks-ami/releases/tag/v20210621
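Even with the agent preinstalled, the node's instance role still needs SSM permissions before the instance registers with Systems Manager. A minimal CloudFormation sketch, assuming a hypothetical role resource named NodeInstanceRole, attaches the AmazonSSMManagedInstanceCore managed policy alongside the usual EKS worker policies:

```yaml
# Hypothetical resource name; adapt to however the node role is defined.
NodeInstanceRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal:
            Service: ec2.amazonaws.com
          Action: sts:AssumeRole
    ManagedPolicyArns:
      # Standard managed policies for EKS worker nodes
      - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
      - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
      - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
      # Lets the SSM agent register the instance and open sessions
      - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
```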