grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.

Home page: https://grafana.com/oss/tempo/

Code(499) desc = context canceled

kosov73 opened this issue

Describe the bug

When I request a large trace, I get a query error:

(screenshot of the query error)

When I request a small trace, there are no errors:

(screenshot of the small trace loading without errors)

Querier log

***
level=info ts=2024-05-15T17:09:56.545922291Z caller=tempodb.go:335 org_id=single-tenant msg="searching for trace in block" findTraceID=00000000000000000b5e89b6afd13b28 block=7f252c82-6cef-41ca-a90c-d929b1de1bae found=true 
level=info ts=2024-05-15T17:09:57.410773318Z caller=tempodb.go:335 org_id=single-tenant msg="searching for trace in block" findTraceID=00000000000000000b5e89b6afd13b28 block=4ad92767-2e90-488f-b50b-e6e94ae08f45 found=true
level=info ts=2024-05-15T17:09:57.29174733Z caller=tempodb.go:335 org_id=single-tenant msg="searching for trace in block" findTraceID=00000000000000000b5e89b6afd13b28 block=5c0c1b61-1978-4510-87f8-7e7771ca46e1 found=true
***

Query frontend log

level=info ts=2024-05-15T17:11:50.207159709Z caller=handler.go:109 tenant=single-tenant method=GET traceID=785906df70703c47 url="/api/traces/b5e89b6afd13b28?start=1715790980&end=1715794880" duration=29.993346907s status=500 err="rpc error: code = Code(499) desc = context canceled" response_size=0

Environment:

Infrastructure: Kubernetes
Deployment tool: helm: tempo-distributed 1.9.5, appVersion: 2.4.1

What parameter needs to be changed for the trace to load?

Config:

storage:
  trace:
    backend: s3
    s3:
      bucket: tempo
      endpoint: "minio.local"
      access_key: "access_key"
      secret_key: "secret_key"
      insecure: false
      region: us-east-1
    pool:
      queue_depth: 70000

ingester:
  replicas: 6
  config:
    replication_factor: 6
    max_block_bytes: 100000000

distributor:
  replicas: 1
  autoscaling:
    enabled: false
    minReplicas: 1
    maxReplicas: 3
  config:
    log_received_spans:
      enabled: true

compactor:
  replicas: 1
  config:
    compaction:
      block_retention: 96h
      compacted_block_retention: 1h
      compaction_window: 1h

querier:
  replicas: 2
  config:
    frontend_worker:
      grpc_client_config:
        max_send_msg_size: 504857600
    trace_by_id:
      query_timeout: 200s
    search:
      query_timeout: 200s
      prefer_self: 20
      # external_hedge_requests_at: 0
      # external_hedge_requests_up_to: 0
    max_concurrent_queries: 30

queryFrontend:
  replicas: 1
  config:
    search:
      concurrent_jobs: 2000
      target_bytes_per_job: 904857600
    trace_by_id:
      query_shards: 150
      # hedge_requests_at: 0
      # hedge_requests_up_to: 0

traces:
  jaeger:
    grpc:
      enabled: true
      receiverConfig:
        endpoint: 0.0.0.0:14250
        max_recv_msg_size_mib: 100000

server:
  httpListenPort: 3100
  logLevel: info
  logFormat: logfmt
  grpc_server_max_recv_msg_size: 504857600
  grpc_server_max_send_msg_size: 504857600
  http_server_read_timeout: 200s
  http_server_write_timeout: 200s

global_overrides:
  ingestion_rate_strategy: global
  max_bytes_per_trace: 504857600
  max_traces_per_user: 504857600
  ingestion_burst_size_bytes: 504857600
  ingestion_rate_limit_bytes: 504857600
  max_bytes_per_tag_values_query: 50000000

It appears the query is timing out. I would consider increasing trace_by_id -> query_shards and scaling up your queriers.

You can also use the Tempo CLI to pull the trace directly from the blocks to get information about it:

https://grafana.com/docs/tempo/latest/operations/tempo_cli/#query-trace-summary-command
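For reference, a sketch of what that could look like against the S3/MinIO backend and tenant from the config above (flag names and backend options should be verified against the linked docs; credentials can be supplied via the backend options or by pointing --config-file at your Tempo config):

# sketch: summarize the failing trace straight from the object-store blocks
tempo-cli query trace-summary b5e89b6afd13b28 single-tenant \
  --backend=s3 --bucket=tempo --s3-endpoint=minio.local

The summary reports things like the number of blocks the trace is found in, its span count, and its size, which helps tell a genuinely huge trace apart from a query-path bottleneck.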

I increased trace_by_id -> query_shards to 1000 and increased the number of queriers to 4. It got better, but the ingesters and queriers started crashing with OOMKiller, even though each had 25 GB of memory.

ingester:
  replicas: 6
  config:
    replication_factor: 6
querier:
  replicas: 4
queryFrontend:
  replicas: 1
  config:
    search:
      concurrent_jobs: 2000
      target_bytes_per_job: 904857600
    trace_by_id:
      query_shards: 1000

Set replication_factor = 3. I have only very vague ideas about how Tempo will behave with RF6.
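Applied to the values above, that suggestion would look something like this (a sketch; the replica count is kept from the config in this issue):

ingester:
  replicas: 6              # unchanged
  config:
    replication_factor: 3  # write each trace's spans to 3 of the 6 ingesters instead of all 6

With RF6 every ingester holds a copy of every trace, so both ingestion memory and trace-by-ID fan-out are much heavier than with the usual RF3.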