ADOT EKS add-on documentation is missing important parts
tgraupne opened this issue
Describe the bug
The EKS add-on documentation on the official AWS page links to this Getting Started Guide:
https://aws-otel.github.io/docs/getting-started/adot-eks-add-on
When following this guide, no metrics are sent to CloudWatch and the adot-collector shows warnings.
Steps to reproduce
I followed the aforementioned guide.
- Created the EKS add-on with aws eks create-addon
- Deployed the OpenTelemetryCollector custom resource.
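For reference, the add-on creation step might look roughly like the sketch below. The add-on name adot is the one the ADOT add-on is published under; the cluster name my-cluster and the run helper with its DRY_RUN switch are hypothetical and only there so the commands can be previewed before they are executed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical placeholder; replace with your own cluster name.
CLUSTER_NAME="${CLUSTER_NAME:-my-cluster}"

# Hypothetical helper: prints the command when DRY_RUN=1 (the default),
# executes it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Install the ADOT operator as an EKS add-on.
run aws eks create-addon --cluster-name "$CLUSTER_NAME" --addon-name adot

# Query the add-on status; it should eventually report ACTIVE.
run aws eks describe-addon --cluster-name "$CLUSTER_NAME" --addon-name adot \
  --query addon.status --output text
```

Run it with DRY_RUN=0 to execute the commands against the cluster instead of printing them.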
What did you expect to see?
I expected that the official EKS add-on configures all necessary components to send metrics and logs to CloudWatch.
What did you see instead?
No metrics were sent to CloudWatch and the adot-collector showed warnings.
Additional context
After some hours of online research, I analysed the Kubernetes resources created by the adot-operator and discovered differences from the maintained Helm charts.
I noticed that the following resources were missing:
- Service Accounts
- Cluster Role
- Cluster Role Binding
- environment variables
- volumes
Moreover, I found out I needed to use eksctl to create a Service Account / IAM Role combination. I attached the following policy: arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy.
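A minimal sketch of that eksctl step is shown below. The service account name and namespace match the manifest in this issue and the policy ARN is the one named above; the cluster name my-cluster and the run helper with its DRY_RUN switch are hypothetical. It assumes the cluster does not yet have an IAM OIDC provider associated, which eksctl needs for IAM roles for service accounts.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical placeholder; replace with your own cluster name.
CLUSTER_NAME="${CLUSTER_NAME:-my-cluster}"

# Hypothetical helper: prints the command when DRY_RUN=1 (the default),
# executes it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Associate an IAM OIDC provider with the cluster (needed once per cluster).
run eksctl utils associate-iam-oidc-provider --cluster "$CLUSTER_NAME" --approve

# Create the service account together with an IAM role that carries the
# CloudWatch policy, so the collector pods can write to CloudWatch.
run eksctl create iamserviceaccount \
  --name adot-collector \
  --namespace opentelemetry-operator-system \
  --cluster "$CLUSTER_NAME" \
  --attach-policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy \
  --approve
```

If a service account with that name already exists in the namespace, eksctl's --override-existing-serviceaccounts flag may be needed as well.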
Eventually, I used the following manifest file:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: adot-collector-cluster-role
rules:
  - apiGroups: [""]
    resources: ["pods", "nodes", "endpoints"]
    verbs: ["list", "watch", "get"]
  - apiGroups: ["apps"]
    resources: ["replicasets"]
    verbs: ["list", "watch", "get"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["list", "watch"]
  - apiGroups: [""]
    resources: ["nodes/proxy"]
    verbs: ["get"]
  - apiGroups: [""]
    resources: ["nodes/stats", "configmaps", "events"]
    verbs: ["create", "get"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["update"]
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["otel-container-insight-clusterleader"]
    verbs: ["get", "update", "create"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["create", "get", "update"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    resourceNames: ["otel-container-insight-clusterleader"]
    verbs: ["get", "update", "create"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: adot-collector-cluster-role-binding
subjects:
  - kind: ServiceAccount
    name: adot-collector
    namespace: opentelemetry-operator-system
roleRef:
  kind: ClusterRole
  name: adot-collector-cluster-role
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: adot-collector
  namespace: opentelemetry-operator-system
spec:
  mode: daemonset
  serviceAccount: adot-collector
  securityContext:
    runAsUser: 0
    runAsGroup: 0
  env:
    - name: K8S_NODE_NAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
    - name: HOST_IP
      valueFrom:
        fieldRef:
          fieldPath: status.hostIP
    - name: HOST_NAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
    - name: K8S_NAMESPACE
      valueFrom:
        fieldRef:
          fieldPath: metadata.namespace
  volumes:
    - name: rootfs
      hostPath:
        path: /
    - name: dockersock
      hostPath:
        path: /var/run/docker.sock
    - name: varlibdocker
      hostPath:
        path: /var/lib/docker
    - name: containerdsock
      hostPath:
        path: /run/containerd/containerd.sock
    - name: sys
      hostPath:
        path: /sys
    - name: devdisk
      hostPath:
        path: /dev/disk/
  volumeMounts:
    - name: rootfs
      mountPath: /rootfs
      readOnly: true
    - name: dockersock
      mountPath: /var/run/docker.sock
      readOnly: true
    - name: containerdsock
      mountPath: /run/containerd/containerd.sock
    - name: varlibdocker
      mountPath: /var/lib/docker
      readOnly: true
    - name: sys
      mountPath: /sys
      readOnly: true
    - name: devdisk
      mountPath: /dev/disk
      readOnly: true
  config: |
    extensions:
      health_check:
    receivers:
      awscontainerinsightreceiver:
    processors:
      batch/metrics:
        timeout: 60s
    exporters:
      awsemf:
        namespace: ContainerInsights
        log_group_name: '/aws/containerinsights/{ClusterName}/performance'
        log_stream_name: '{NodeName}'
        log_retention: 30
        resource_to_telemetry_conversion:
          enabled: true
        dimension_rollup_option: NoDimensionRollup
        parse_json_encoded_attr_values: [Sources, kubernetes]
        metric_declarations:
          # node metrics
          - dimensions: [[NodeName, InstanceId, ClusterName]]
            metric_name_selectors:
              - node_cpu_utilization
              - node_memory_utilization
              - node_network_total_bytes
              - node_cpu_reserved_capacity
              - node_memory_reserved_capacity
              - node_number_of_running_pods
              - node_number_of_running_containers
          - dimensions: [[ClusterName]]
            metric_name_selectors:
              - node_cpu_utilization
              - node_memory_utilization
              - node_network_total_bytes
              - node_cpu_reserved_capacity
              - node_memory_reserved_capacity
              - node_number_of_running_pods
              - node_number_of_running_containers
              - node_cpu_usage_total
              - node_cpu_limit
              - node_memory_limit
          # pod metrics
          - dimensions: [[PodName, Namespace, ClusterName]]
            metric_name_selectors:
              - pod_status
              - pod_cpu_utilization
              - pod_memory_utilization
              - pod_network_rx_bytes
              - pod_network_tx_bytes
              - pod_cpu_reserved_capacity
              - pod_memory_reserved_capacity
              - pod_number_of_container_restarts
              - pod_cpu_utilization_over_pod_limit
              - pod_memory_utilization_over_pod_limit
          # cluster metrics
          - dimensions: [[ClusterName]]
            metric_name_selectors:
              - cluster_node_count
              - cluster_failed_node_count
          # node fs metrics
          - dimensions: [[NodeName, InstanceId, ClusterName], [ClusterName]]
            metric_name_selectors:
              - node_filesystem_utilization
    service:
      pipelines:
        metrics:
          receivers: [awscontainerinsightreceiver]
          processors: [batch/metrics]
          exporters: [awsemf]
      extensions: [health_check]
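Applying and checking a manifest like the one above could be sketched as follows. The file name adot-collector.yaml and the run helper with its DRY_RUN switch are hypothetical; the namespace, service account name and log group prefix are taken from the manifest.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical file name for the manifest shown above.
MANIFEST="${MANIFEST:-adot-collector.yaml}"
NAMESPACE="opentelemetry-operator-system"

# Hypothetical helper: prints the command when DRY_RUN=1 (the default),
# executes it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Apply the ClusterRole, ClusterRoleBinding and OpenTelemetryCollector.
run kubectl apply -f "$MANIFEST"

# The service account should carry an eks.amazonaws.com/role-arn annotation
# if it was created together with an IAM role.
run kubectl get serviceaccount adot-collector -n "$NAMESPACE" -o yaml

# One collector pod per node is expected because mode is daemonset.
run kubectl get pods -n "$NAMESPACE"

# The awsemf exporter is configured to write to this log group prefix.
run aws logs describe-log-groups --log-group-name-prefix /aws/containerinsights/
```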
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 30 days.
This issue was closed because it has been marked as stale for 30 days with no activity.