grafana / pyroscope

Continuous Profiling Platform. Debug performance issues down to a single line of code

Home Page: https://grafana.com/oss/pyroscope/

Repository from GitHub: https://github.com/grafana/pyroscope

Profiling issue for Java pods with auto-instrumentation method

Vaibhav-1995 opened this issue

Describe the bug

I am using Java profiling with Alloy (the auto-instrumentation method) to enable profiling on Java pods within the cluster. I deployed Pyroscope and Alloy separately using their Helm charts and added the configuration below to the Alloy ConfigMap for Java profiling, as provided at the following link:

https://github.com/grafana/pyroscope/tree/main/examples/grafana-agent-auto-instrumentation/java/kubernetes

However, profiling starts on only a few seemingly random Java pods, not on all of them. I am unable to identify why profiling is not enabled on all Java pods.

Expected behavior

As per the documentation at the link below, all prerequisites are in place on the Alloy side in the Helm chart. It does not appear to be an issue on the Alloy end, since some pods do start profiling and their data is visible in Grafana:

https://grafana.com/docs/alloy/latest/reference/components/pyroscope/pyroscope.java/

The expected behaviour is that all Java pods start profiling once the configuration below is added to the Alloy ConfigMap.

Environment

  • Infrastructure: Kubernetes EKS
  • Deployment tool: helm

Additional Context

content: |

  logging {
    level  = "debug"
    format = "logfmt"
  }

  // Discovers all kubernetes pods.
  // Relies on serviceAccountName=grafana-alloy in the pod spec for permissions.
  discovery.kubernetes "pods" {
    role = "pod"
  }

  // Discovers all processes running on the node.
  // Relies on a security context with elevated permissions for the alloy container (running as root).
  // Relies on hostPID=true on the pod spec, to be able to see processes from other pods.
  discovery.process "all" {
    // Merges kubernetes and process data (using container_id), to attach kubernetes labels to discovered processes.
    join = discovery.kubernetes.pods.targets
  }
  // Drops non-java processes and adjusts labels.    
  discovery.relabel "java" {
    targets = discovery.process.all.targets
    // Drops non-java processes.
    rule {
      source_labels = ["__meta_process_exe"]
      action = "keep"
      regex = ".*/java$"
    }
    // Sets up the service_name using the namespace and container names.
    rule {
      source_labels = ["__meta_kubernetes_namespace", "__meta_kubernetes_pod_container_name"]
      target_label = "service_name"
      separator = "/"
    }
    // Sets up kubernetes labels (labels with the __ prefix are ultimately dropped).
    rule {
      action = "replace"
      source_labels = ["__meta_kubernetes_pod_node_name"]
      target_label = "node"
    }
    rule {
      action = "replace"
      source_labels = ["__meta_kubernetes_namespace"]
      target_label = "namespace"
    }
    rule {
      action = "replace"
      source_labels = ["__meta_kubernetes_pod_name"]
      target_label = "pod"
    }
    rule {
      action = "replace"
      source_labels = ["__meta_kubernetes_pod_container_name"]
      target_label = "container"
    }
    // Sets up the cluster label.
    // Relies on a pod-level annotation with the "cluster_name" name.
    // Alternatively it can be set up using external_labels in pyroscope.write (see the sketch after this config).
    rule {
      action = "replace"
      source_labels = ["__meta_kubernetes_pod_annotation_cluster_name"]
      target_label = "cluster"
    }
  }

  // Attaches the Pyroscope profiler to the processes returned by the discovery.relabel component.
  // Relies on a security context with elevated permissions for the alloy container (running as root).
  // Relies on hostPID=true on the pod spec, to be able to access processes from other pods.
  pyroscope.java "java" {
    profiling_config {
      interval = "15s"
      alloc = "512k"
      cpu = true
      lock = "10ms"
      sample_rate = 100
    }
    forward_to = [pyroscope.write.local.receiver]
    targets = discovery.relabel.java.output
  }
    
  pyroscope.write "local" {
    // Send profiles to the locally running Pyroscope instance.
    endpoint {
      url = "http://xxx-xxx-pyroscope-distributor.observability-pyroscope-dev.svc.cluster.local:4040"
    }
    external_labels = {
      "static_label" = "static_label_value",
    }
  }
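
As noted in the comment inside the discovery.relabel "java" block, the cluster label can alternatively be set through external_labels in pyroscope.write instead of relying on the pod-level cluster_name annotation. A minimal sketch of that variant, assuming the same endpoint as above and with a placeholder cluster name:

  pyroscope.write "local" {
    // Send profiles to the locally running Pyroscope instance.
    endpoint {
      url = "http://xxx-xxx-pyroscope-distributor.observability-pyroscope-dev.svc.cluster.local:4040"
    }
    // Sets the cluster label on all forwarded profiles, replacing the
    // __meta_kubernetes_pod_annotation_cluster_name relabel rule above.
    // "my-eks-cluster" is a placeholder value.
    external_labels = {
      "cluster" = "my-eks-cluster",
    }
  }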

Please provide the following:

  • attach alloy logs from a node.
  • specify which pods are profiled and which pods are not profiled but expected to be profiled
  • base docker image for the failing pods and/or JVM version and vendor

Hi @korniltsev
Thanks for your reply.

Please find the details below:

  • Attach Alloy logs from a node: below are the two main error logs appearing in the Alloy pod

  • ts=2024-12-02T09:30:02.574990354Z level=error component_path=/ component_id=pyroscope.java.java pid=4118021 err="failed to reset: failed to read jfr file: open /proc/4118021/root/tmp/asprof-186018-4118021.jfr: no such file or directory"

  • ts=2024-12-02T04:58:02.01712108Z level=error component_path=/ component_id=pyroscope.java.java pid=716979 err="failed to start: asprof failed to run: asprof failed to run /tmp/alloy-asprof-glibc-ed25bbf0083bff602254601eb6c4a927823d988f/bin/asprof: exit status 255 Target JVM failed to load /tmp/alloy-asprof-glibc-ed25bbf0083bff602254601eb6c4a927823d988f/bin/../lib/libasyncProfiler.so\n"

  • specify which pods are profiled and which pods are not profiled but expected to be profiled

  • Pods of open-source components such as OpenMetadata, ClickHouse ZooKeeper, and Trino (which are Java based) are profiled, but the pods of our custom Java applications are not profiled (see the isolation sketch after this list)

  • base docker image for the failing pods and/or JVM version and vendor

  • The base image used to build the custom Java applications is Red Hat Universal Base Image 9 for JDK 11 and 17
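
One way to narrow this down (an optional, illustrative tweak, not part of the current config) is an extra keep rule in the discovery.relabel "java" block that temporarily restricts profiling to the failing custom-application pods; the pod-name regex below is a placeholder:

  // Hypothetical debugging rule, added alongside the existing rules in discovery.relabel "java".
  rule {
    source_labels = ["__meta_kubernetes_pod_name"]
    action        = "keep"
    // Placeholder: replace with the actual pod-name prefix of one failing custom application.
    regex         = "my-custom-app-.*"
  }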

Hi @korniltsev
Any update?

I'm sorry I did not have time to look into this yet. I may have time to look into this next week.

CC @aleks-p just in case :) feel free to look in to this as well if you want to.

Hi Team,
Any update? Got stuck on this.

Hi @korniltsev @aleks-p
Could you please share an update, if any?

Hi,

Any update?