open-telemetry / opentelemetry-collector

OpenTelemetry Collector

Home Page: https://opentelemetry.io

Suspected memory leak in batch processor

davidgargti20 opened this issue · comments

While using the batch processor for logs, the collector works fine for a while, but then a suspected memory leak occurs and memory rises exponentially. No abnormality is recorded in the collector's debug logs when this happens.
It would be helpful if anyone could point out the possible cause of this in the batch processor.
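Enabling the collector's own telemetry metrics (for example otelcol_processor_batch_batch_send_size and otelcol_exporter_queue_size) can help show where data is piling up before memory climbs. A minimal sketch, assuming a v0.5x-era service.telemetry schema:

service:
  telemetry:
    metrics:
      level: detailed
      address: 0.0.0.0:8888  # scrape this endpoint to watch the otelcol_* self-metrics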

Can you share your configuration file?

@fatsheep9146

apiVersion: v1
kind: ConfigMap
metadata:
  name: collector-config
data:
  collector.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    
    exporters:
      googlecloud:
        retry_on_failure:
          enabled: true
        project: codenation-186008
        log:
          default_log_name: app
      logging:
        loglevel: debug
    
    processors:
      memory_limiter:
        check_interval: 1s
        limit_mib: 500
        spike_limit_mib: 100
      batch:
        send_batch_size: 50000
        timeout: 60s
      attributes:
        actions:
          - key: gcp.trace_sampled
            value: true
            action: upsert
      transform/1:
        logs:
          queries:
            - set(attributes["traceId"],trace_id.string)
            - set(attributes["service.instance.id"],resource.attributes["service.instance.id"])
            - set(attributes["service.name"],resource.attributes["service.name"])
            - set(attributes["k8s-pod/run"],resource.attributes["k8s-pod/run"])
            - set(attributes["k8s.cluster.name"],resource.attributes["k8s.cluster.name"])
            - set(attributes["host.name"],resource.attributes["host.name"])
            - set(attributes["container.id"],resource.attributes["container.id"])
            - set(attributes["cloud.region"],resource.attributes["cloud.region"])
      groupbyattrs:
        keys:
          - traceId
      transform/2:
        logs:
          queries:
            - keep_keys(resource.attributes, "")
      transform/3:
        logs:
          queries:
            - set(resource.attributes["severity"],"ERROR") where severity_text=="ERROR"
            - set(resource.attributes["severity"],"ERROR") where severity_text=="Error"
      filter:
        logs:
          include:
            match_type: strict
            resource_attributes:
              - Key: severity
                Value: ERROR
      tail_sampling:
        decision_wait: 20s
        policies:
          - name: error_otel_status
            type: status_code
            status_code:
              status_codes:
                - ERROR
          - name: error_http_status
            type: numeric_attribute
            numeric_attribute:
              key: http.status_code
              min_value: 400
    
    service:
      telemetry:
        logs:
          level: debug
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch, attributes, tail_sampling]
          exporters: [googlecloud]
        logs:
          receivers: [ otlp ]
          processors: [ batch,attributes, transform/1,transform/2, groupbyattrs, transform/3,filter ]
          exporters: [ logging,googlecloud ] 
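One thing worth noting in this config: memory_limiter is defined under processors but is not referenced in either pipeline, and the batch processor is set to send_batch_size: 50000 with a 60s timeout, so it can hold a large amount of log data in memory before flushing. For comparison only, a sketch of the logs pipeline with the memory limiter placed first (not a confirmed fix):

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch, attributes, transform/1, transform/2, groupbyattrs, transform/3, filter]
      exporters: [logging, googlecloud]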

Why do you believe the problem is caused by the batch processor? Does removing it from the processors list solve the problem? Did you try moving it to the end?

@dmitryax yes, removing the batch processor solved the problem.

@davidgargti20 can you try moving it further down the list of processors and see if the issue goes away?
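For reference, moving the batch processor to the end of the logs pipeline in the config above would look roughly like this (a sketch; all other sections unchanged):

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [attributes, transform/1, transform/2, groupbyattrs, transform/3, filter, batch]
      exporters: [logging, googlecloud]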

@dmitryax it didn't go away; I tried that earlier.

While using the batch processor for logs, the collector works fine for a while, but then a suspected memory leak occurs and memory rises exponentially.

Does this mean the memory growth only happens in the logs pipeline, not the metrics pipeline?

Does the volume of logs sent to the collector change rapidly when the memory growth happens?

BTW, which language SDK do you use to export log data to the collector over the OTLP protocol? As far as I know, the Go SDK does not implement logs yet.

I have encountered a similar issue. Not only the batch processor but also the hostmetrics receiver/scraperhelper seems to have a memory leak. I am using hostmetrics with a 0.1-second collection interval to exaggerate the issue.

See my pprof heap results below.
[pprof heap profile screenshots: heap2, heap4]

Could you share your collector config file? @gen-xu

@fatsheep9146

We had some secrets in the config, but the following should reproduce the unbounded increase in memory usage.

extensions:
    health_check:
    pprof:
      endpoint: "0.0.0.0:1777"
      block_profile_fraction: 3
      mutex_profile_fraction: 5

receivers:
    hostmetrics:
        collection_interval: 0.1s
        scrapers:
            cpu:
            load:
            memory:
            paging:
            process:
            processes:
            network:
            disk:
            filesystem:
    otlp:
        protocols:
            grpc:
            http:

processors:
    batch:
        timeout: 1s

exporters:
    kafka:
        brokers:
            - localhost:9092
        protocol_version: "3.0.0"
        producer:
            max_message_bytes: 10000000
            flush_max_messages: 16
        metadata:
            retry:
                max: 30
                backoff: 3s
    logging:
        loglevel: info

service:
    pipelines:
        logs:
            receivers:
                - otlp
            processors:
                - batch
            exporters:
                - kafka
                - logging
        traces:
            receivers:
                - otlp
            processors:
                - batch
            exporters:
                - kafka
                - logging
        metrics:
            receivers:
                - hostmetrics
                - otlp
            processors:
                - batch
            exporters:
                - kafka
                - logging
    extensions:
      - health_check
      - pprof

It is worth noting that there are many errored metric points; I am not sure whether that could leave some dangling objects.

Aug 15 03:02:08 ubuntu otelcol-contrib[3140977]: 2022-08-15T03:02:08.697Z        error        scraperhelper/scrapercontroller.go:197        Error scraping metrics        {"kind": "receiver", "name": "hostmetrics", "pipeline": "metrics", "error": "error reading process name for pid 2: readlink /proc/2/exe: no such file or directory; error reading process name for pid 3: readlink /proc/3/exe: no such file or directory; error reading process name for pid 4: readlink /proc/4/exe: no such file or directory; error reading process name for pid 5: readlink /proc/5/exe: no such file or directory; error reading process name for pid 7: readlink /proc/7/exe: no such file or directory; error reading process name for pid 9: readlink /proc/9/exe: no such file or directory; error reading process name for pid 10: readlink /proc/10/exe: no such file or directory;
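Those scrape errors come from the process scraper failing to read /proc/<pid>/exe for kernel threads. Assuming a contrib build that supports it, they can be silenced with the process scraper's mute_process_name_error option; a sketch:

receivers:
  hostmetrics:
    collection_interval: 0.1s
    scrapers:
      process:
        mute_process_name_error: true  # assumption: option is available in this contrib version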

The collector's own metrics might also be helpful here:

# HELP otelcol_exporter_enqueue_failed_log_records Number of log records failed to be added to the sending queue.
# TYPE otelcol_exporter_enqueue_failed_log_records counter
otelcol_exporter_enqueue_failed_log_records{exporter="kafka",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 0
otelcol_exporter_enqueue_failed_log_records{exporter="logging",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 0
# HELP otelcol_exporter_enqueue_failed_metric_points Number of metric points failed to be added to the sending queue.
# TYPE otelcol_exporter_enqueue_failed_metric_points counter
otelcol_exporter_enqueue_failed_metric_points{exporter="kafka",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 0
otelcol_exporter_enqueue_failed_metric_points{exporter="logging",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 0
# HELP otelcol_exporter_enqueue_failed_spans Number of spans failed to be added to the sending queue.
# TYPE otelcol_exporter_enqueue_failed_spans counter
otelcol_exporter_enqueue_failed_spans{exporter="kafka",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 0
otelcol_exporter_enqueue_failed_spans{exporter="logging",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 0
# HELP otelcol_exporter_queue_capacity Fixed capacity of the retry queue (in batches)
# TYPE otelcol_exporter_queue_capacity gauge
otelcol_exporter_queue_capacity{exporter="kafka",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 5000
# HELP otelcol_exporter_queue_size Current size of the retry queue (in batches)
# TYPE otelcol_exporter_queue_size gauge
otelcol_exporter_queue_size{exporter="kafka",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 16
# HELP otelcol_exporter_sent_metric_points Number of metric points successfully sent to destination.
# TYPE otelcol_exporter_sent_metric_points counter
otelcol_exporter_sent_metric_points{exporter="kafka",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 2.167837e+06
otelcol_exporter_sent_metric_points{exporter="logging",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 2.400797e+06
# HELP otelcol_process_cpu_seconds Total CPU user and system time in seconds
# TYPE otelcol_process_cpu_seconds counter
otelcol_process_cpu_seconds{service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 151.60000000000002
# HELP otelcol_process_memory_rss Total physical memory (resident set size)
# TYPE otelcol_process_memory_rss gauge
otelcol_process_memory_rss{service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 3.44559616e+08
# HELP otelcol_process_runtime_heap_alloc_bytes Bytes of allocated heap objects (see 'go doc runtime.MemStats.HeapAlloc')
# TYPE otelcol_process_runtime_heap_alloc_bytes gauge
otelcol_process_runtime_heap_alloc_bytes{service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 1.9367716e+08
# HELP otelcol_process_runtime_total_alloc_bytes Cumulative bytes allocated for heap objects (see 'go doc runtime.MemStats.TotalAlloc')
# TYPE otelcol_process_runtime_total_alloc_bytes counter
otelcol_process_runtime_total_alloc_bytes{service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 4.7210411432e+10
# HELP otelcol_process_runtime_total_sys_memory_bytes Total bytes of memory obtained from the OS (see 'go doc runtime.MemStats.Sys')
# TYPE otelcol_process_runtime_total_sys_memory_bytes gauge
otelcol_process_runtime_total_sys_memory_bytes{service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 3.19738952e+08
# HELP otelcol_process_uptime Uptime of the process
# TYPE otelcol_process_uptime counter
otelcol_process_uptime{service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 96.713802677
# HELP otelcol_processor_batch_batch_send_size Number of units in the batch
# TYPE otelcol_processor_batch_batch_send_size histogram
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="10"} 0
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="25"} 0
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="50"} 0
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="75"} 0
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="100"} 0
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="250"} 0
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="500"} 0
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="750"} 0
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="1000"} 0
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="2000"} 0
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="3000"} 0
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="4000"} 0
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="5000"} 0
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="6000"} 0
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="7000"} 0
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="8000"} 0
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="9000"} 263
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="10000"} 268
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="20000"} 268
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="30000"} 268
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="50000"} 268
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="100000"} 268
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",le="+Inf"} 268
otelcol_processor_batch_batch_send_size_sum{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 2.400797e+06
otelcol_processor_batch_batch_send_size_count{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 268
# HELP otelcol_processor_batch_batch_size_trigger_send Number of times the batch was sent due to a size trigger
# TYPE otelcol_processor_batch_batch_size_trigger_send counter
otelcol_processor_batch_batch_size_trigger_send{processor="batch",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 268
# HELP otelcol_receiver_accepted_metric_points Number of metric points successfully pushed into the pipeline.
# TYPE otelcol_receiver_accepted_metric_points counter
otelcol_receiver_accepted_metric_points{receiver="hostmetrics",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",transport=""} 2.405277e+06
# HELP otelcol_receiver_refused_metric_points Number of metric points that could not be pushed into the pipeline.
# TYPE otelcol_receiver_refused_metric_points counter
otelcol_receiver_refused_metric_points{receiver="hostmetrics",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2",transport=""} 0
# HELP otelcol_scraper_errored_metric_points Number of metric points that were unable to be scraped.
# TYPE otelcol_scraper_errored_metric_points counter
otelcol_scraper_errored_metric_points{receiver="hostmetrics",scraper="cpu",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 0
otelcol_scraper_errored_metric_points{receiver="hostmetrics",scraper="disk",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 0
otelcol_scraper_errored_metric_points{receiver="hostmetrics",scraper="filesystem",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 0
otelcol_scraper_errored_metric_points{receiver="hostmetrics",scraper="load",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 0
otelcol_scraper_errored_metric_points{receiver="hostmetrics",scraper="memory",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 0
otelcol_scraper_errored_metric_points{receiver="hostmetrics",scraper="network",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 0
otelcol_scraper_errored_metric_points{receiver="hostmetrics",scraper="paging",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 0
otelcol_scraper_errored_metric_points{receiver="hostmetrics",scraper="process",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 278613
otelcol_scraper_errored_metric_points{receiver="hostmetrics",scraper="processes",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 0
# HELP otelcol_scraper_scraped_metric_points Number of metric points successfully scraped.
# TYPE otelcol_scraper_scraped_metric_points counter
otelcol_scraper_scraped_metric_points{receiver="hostmetrics",scraper="cpu",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 1074
otelcol_scraper_scraped_metric_points{receiver="hostmetrics",scraper="disk",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 7518
otelcol_scraper_scraped_metric_points{receiver="hostmetrics",scraper="filesystem",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 2148
otelcol_scraper_scraped_metric_points{receiver="hostmetrics",scraper="load",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 3225
otelcol_scraper_scraped_metric_points{receiver="hostmetrics",scraper="memory",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 1074
otelcol_scraper_scraped_metric_points{receiver="hostmetrics",scraper="network",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 5375
otelcol_scraper_scraped_metric_points{receiver="hostmetrics",scraper="paging",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 3222
otelcol_scraper_scraped_metric_points{receiver="hostmetrics",scraper="process",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 1.159584e+06
otelcol_scraper_scraped_metric_points{receiver="hostmetrics",scraper="processes",service_instance_id="091c8cd5-f7f6-4d31-8c40-3f2ec3e9c59e",service_version="0.57.2"} 2148