grafana / pyroscope

Continuous Profiling Platform. Debug performance issues down to a single line of code

Home Page: https://grafana.com/oss/pyroscope/

Repository from GitHub: https://github.com/grafana/pyroscope

Profiling issue for Java pods with auto-instrumentation method

Vaibhav-1995 opened this issue

Describe the bug

I am using Java profiling with Alloy (the auto-instrumentation method) to enable profiling on Java pods within the cluster. I deployed Pyroscope and Alloy separately using their Helm charts and added the configuration below to the Alloy ConfigMap for Java profiling, as provided at the following link:

https://github.com/grafana/pyroscope/tree/main/examples/grafana-agent-auto-instrumentation/java/kubernetes

However, profiling starts on only a few seemingly random Java pods, not on all of them. I am unable to identify why profiling is not enabled on all Java pods.

Expected behavior

As per the documentation at the link below, all prerequisites are in place on the Alloy side in the Helm chart. It does not appear to be an issue on the Alloy end, since some pods do start profiling and their data is visible in Grafana:

https://grafana.com/docs/alloy/latest/reference/components/pyroscope/pyroscope.java/

The expected behaviour is that all Java pods start profiling once the configuration below is added to the Alloy ConfigMap.

Environment

  • Infrastructure: Kubernetes EKS
  • Deployment tool: helm

Additional Context

content: |

  logging {
    level  = "debug"
    format = "logfmt"
  }

  // Discovers all kubernetes pods.
  // Relies on serviceAccountName=grafana-alloy in the pod spec for permissions.
  discovery.kubernetes "pods" {
    role = "pod"
  }

  // Discovers all processes running on the node.
  // Relies on a security context with elevated permissions for the alloy container (running as root).
  // Relies on hostPID=true on the pod spec, to be able to see processes from other pods.
  discovery.process "all" {
    // Merges kubernetes and process data (using container_id), to attach kubernetes labels to discovered processes.
    join = discovery.kubernetes.pods.targets
  }
  // Drops non-java processes and adjusts labels.    
  discovery.relabel "java" {
    targets = discovery.process.all.targets
    // Drops non-java processes.
    rule {
      source_labels = ["__meta_process_exe"]
      action = "keep"
      regex = ".*/java$"
    }
    // Sets up the service_name using the namespace and container names.
    rule {
      source_labels = ["__meta_kubernetes_namespace", "__meta_kubernetes_pod_container_name"]
      target_label = "service_name"
      separator = "/"
    }
    // Sets up kubernetes labels (labels with the __ prefix are ultimately dropped).
    rule {
      action = "replace"
      source_labels = ["__meta_kubernetes_pod_node_name"]
      target_label = "node"
    }
    rule {
      action = "replace"
      source_labels = ["__meta_kubernetes_namespace"]
      target_label = "namespace"
    }
    rule {
      action = "replace"
      source_labels = ["__meta_kubernetes_pod_name"]
      target_label = "pod"
    }
    rule {
      action = "replace"
      source_labels = ["__meta_kubernetes_pod_container_name"]
      target_label = "container"
    }
    // Sets up the cluster label.
    // Relies on a pod-level annotation with the "cluster_name" name.
    // Alternatively it can be set up using external_labels in pyroscope.write (see the sketch after this config).
    rule {
      action = "replace"
      source_labels = ["__meta_kubernetes_pod_annotation_cluster_name"]
      target_label = "cluster"
    }
  }

  // Attaches the Pyroscope profiler to the processes returned by the discovery.relabel component.
  // Relies on a security context with elevated permissions for the alloy container (running as root).
  // Relies on hostPID=true on the pod spec, to be able to access processes from other pods.
  pyroscope.java "java" {
    profiling_config {
      interval = "15s"
      alloc = "512k"
      cpu = true
      lock = "10ms"
      sample_rate = 100
    }
    forward_to = [pyroscope.write.local.receiver]
    targets = discovery.relabel.java.output
  }
    
  pyroscope.write "local" {
    // Send profiles to the locally running Pyroscope instance.
    endpoint {
      url = "http://xxx-xxx-pyroscope-distributor.observability-pyroscope-dev.svc.cluster.local:4040"
    }
    external_labels = {
      "static_label" = "static_label_value",
    }
  }
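
As noted in the comment inside the discovery.relabel "java" block, the cluster label can alternatively be set through external_labels in pyroscope.write instead of relying on the pod-level cluster_name annotation. A minimal sketch of that variant, assuming the same endpoint as above and with a placeholder cluster name:

  pyroscope.write "local" {
    // Send profiles to the locally running Pyroscope instance.
    endpoint {
      url = "http://xxx-xxx-pyroscope-distributor.observability-pyroscope-dev.svc.cluster.local:4040"
    }
    // Sets the cluster label on all forwarded profiles, replacing the
    // __meta_kubernetes_pod_annotation_cluster_name relabel rule above.
    // "my-eks-cluster" is a placeholder value.
    external_labels = {
      "cluster" = "my-eks-cluster",
    }
  }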

Please provide the following:

  • attach alloy logs from a node.
  • specify which pods are profiled and which pods are not profiled but expected to be profiled
  • base docker image for the failing pods and/or JVM version and vendor

Hi @korniltsev
Thanks for your reply.

Please find the details below:

  • Attach Alloy logs from a node: below are the two main error logs appearing in the Alloy pod

  • ts=2024-12-02T09:30:02.574990354Z level=error component_path=/ component_id=pyroscope.java.java pid=4118021 err="failed to reset: failed to read jfr file: open /proc/4118021/root/tmp/asprof-186018-4118021.jfr: no such file or directory"

  • ts=2024-12-02T04:58:02.01712108Z level=error component_path=/ component_id=pyroscope.java.java pid=716979 err="failed to start: asprof failed to run: asprof failed to run /tmp/alloy-asprof-glibc-ed25bbf0083bff602254601eb6c4a927823d988f/bin/asprof: exit status 255 Target JVM failed to load /tmp/alloy-asprof-glibc-ed25bbf0083bff602254601eb6c4a927823d988f/bin/../lib/libasyncProfiler.so\n"

  • specify which pods are profiled and which pods are not profiled but expected to be profiled

  • Pods of open-source components such as OpenMetadata, ClickHouse ZooKeeper, and Trino (which are Java based) are profiled, but the pods of our custom Java applications are not profiled (see the isolation sketch after this list)

  • base docker image for the failing pods and/or JVM version and vendor

  • The base image used to build the custom Java applications is Red Hat Universal Base Image 9 for JDK 11 and 17
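
One way to narrow this down (an optional, illustrative tweak, not part of the current config) is an extra keep rule in the discovery.relabel "java" block that temporarily restricts profiling to the failing custom-application pods; the pod-name regex below is a placeholder:

  // Hypothetical debugging rule, added alongside the existing rules in discovery.relabel "java".
  rule {
    source_labels = ["__meta_kubernetes_pod_name"]
    action        = "keep"
    // Placeholder: replace with the actual pod-name prefix of one failing custom application.
    regex         = "my-custom-app-.*"
  }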

Hi @korniltsev
Any update?

I'm sorry I did not have time to look into this yet. I may have time to look into this next week.

CC @aleks-p just in case :) feel free to look in to this as well if you want to.

Hi Team,
Any update? Got stuck on this.

Hi @korniltsev @aleks-p
Could you please share an update, if any?

Hi,

Any update?