aws-observability / aws-otel-collector

AWS Distro for OpenTelemetry Collector (see ADOT Roadmap at https://github.com/orgs/aws-observability/projects/4)

Home Page:https://aws-otel.github.io/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

ADOT EKS add-on documentation is missing important parts

tgraupne opened this issue · comments

Describe the bug
The EKS add-on documentation on the official AWS page is linking to this Getting Started Guide:
https://aws-otel.github.io/docs/getting-started/adot-eks-add-on

When following this guide, no metrics are send to CloudWatch and the adot-collector is showing warnings.

Steps to reproduce
I followed the aforementioned guide.

  1. Create EKS add-on with aws eks create-addon
  2. I deployed the OpenTelemetryCollector custom resource.

What did you expect to see?
I expected that the official EKS add-on configures all necessary components to send metrics and logs to CloudWatch.

What did you see instead?
No metrics were sent to CloudWatch and the adot-collector showed warning.

Additional context
After some hours of online research, I analysed the kubernetes resources created by the adot-operator and discovered differences to the maintained helm charts.

I noticed, that the following resources were missing:

  1. Service Accounts
  2. Cluster Role
  3. Cluster Role Binding
  4. environment values
  5. volumes

Moreover, I found out I needed to use eksctl to create a Service Account / IAM Role combination. I attached the following policy: arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy.

Eventually, I used the following manifest file:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: adot-collector-cluster-role
rules:
  - apiGroups: [""]
    resources: ["pods", "nodes", "endpoints"]
    verbs: ["list", "watch", "get"]
  - apiGroups: ["apps"]
    resources: ["replicasets"]
    verbs: ["list", "watch", "get"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["list", "watch"]
  - apiGroups: [""]
    resources: ["nodes/proxy"]
    verbs: ["get"]
  - apiGroups: [""]
    resources: ["nodes/stats", "configmaps", "events"]
    verbs: ["create", "get"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["update"]
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["otel-container-insight-clusterleader"]
    verbs: ["get","update", "create"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["create","get", "update"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    resourceNames: ["otel-container-insight-clusterleader"]
    verbs: ["get","update", "create"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: adot-collector-cluster-role-binding
subjects:
  - kind: ServiceAccount
    name: adot-collector
    namespace: opentelemetry-operator-system
roleRef:
  kind: ClusterRole
  name: adot-collector-cluster-role
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: adot-collector
  namespace: opentelemetry-operator-system
spec:
  mode: daemonset
  serviceAccount: adot-collector
  securityContext:
    runAsUser: 0
    runAsGroup: 0
  env:
    - name: K8S_NODE_NAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
    - name: HOST_IP
      valueFrom:
        fieldRef:
          fieldPath: status.hostIP
    - name: HOST_NAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
    - name: K8S_NAMESPACE
      valueFrom:
          fieldRef:
            fieldPath: metadata.namespace
  volumes:
    - name: rootfs
      hostPath:
        path: /
    - name: dockersock
      hostPath:
        path: /var/run/docker.sock
    - name: varlibdocker
      hostPath:
        path: /var/lib/docker
    - name: containerdsock
      hostPath:
        path: /run/containerd/containerd.sock
    - name: sys
      hostPath:
        path: /sys
    - name: devdisk
      hostPath:
        path: /dev/disk/
  volumeMounts:
    - name: rootfs
      mountPath: /rootfs
      readOnly: true
    - name: dockersock
      mountPath: /var/run/docker.sock
      readOnly: true
    - name: containerdsock
      mountPath: /run/containerd/containerd.sock
    - name: varlibdocker
      mountPath: /var/lib/docker
      readOnly: true
    - name: sys
      mountPath: /sys
      readOnly: true
    - name: devdisk
      mountPath: /dev/disk
      readOnly: true
    
  config: |
    extensions:
      health_check:

    receivers:
      awscontainerinsightreceiver:

    processors:
      batch/metrics:
        timeout: 60s

    exporters:
      awsemf:
        namespace: ContainerInsights
        log_group_name: '/aws/containerinsights/{ClusterName}/performance'
        log_stream_name: '{NodeName}'
        log_retention: 30
        resource_to_telemetry_conversion:
          enabled: true
        dimension_rollup_option: NoDimensionRollup
        parse_json_encoded_attr_values: [Sources, kubernetes]
        metric_declarations:

          # node metrics
          - dimensions: [[NodeName, InstanceId, ClusterName]]
            metric_name_selectors:
              - node_cpu_utilization
              - node_memory_utilization
              - node_network_total_bytes
              - node_cpu_reserved_capacity
              - node_memory_reserved_capacity
              - node_number_of_running_pods
              - node_number_of_running_containers
          - dimensions: [[ClusterName]]
            metric_name_selectors:
              - node_cpu_utilization
              - node_memory_utilization
              - node_network_total_bytes
              - node_cpu_reserved_capacity
              - node_memory_reserved_capacity
              - node_number_of_running_pods
              - node_number_of_running_containers
              - node_cpu_usage_total
              - node_cpu_limit
              - node_memory_limit

          # pod metrics
          - dimensions: [[PodName, Namespace, ClusterName]]
            metric_name_selectors:
              - pod_status
              - pod_cpu_utilization
              - pod_memory_utilization
              - pod_network_rx_bytes
              - pod_network_tx_bytes
              - pod_cpu_reserved_capacity
              - pod_memory_reserved_capacity
              - pod_number_of_container_restarts
              - pod_cpu_utilization_over_pod_limit
              - pod_memory_utilization_over_pod_limit

          # cluster metrics
          - dimensions: [[ClusterName]]
            metric_name_selectors:
              - cluster_node_count
              - cluster_failed_node_count

          # node fs metrics
          - dimensions: [[NodeName, InstanceId, ClusterName], [ClusterName]]
            metric_name_selectors:
              - node_filesystem_utilization

    service:
      pipelines:
        metrics:
          receivers: [awscontainerinsightreceiver]
          processors: [batch/metrics]
          exporters: [awsemf]

      extensions: [health_check]

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 30 days.

This issue was closed because it has been marked as stale for 30 days with no activity.