Code(499) desc = context canceled
kosov73 opened this issue
Describe the bug
When I request a large trace, I get a query error.
When I request a small trace, there are no errors.
Querier log
***
level=info ts=2024-05-15T17:09:56.545922291Z caller=tempodb.go:335 org_id=single-tenant msg="searching for trace in block" findTraceID=00000000000000000b5e89b6afd13b28 block=7f252c82-6cef-41ca-a90c-d929b1de1bae found=true
level=info ts=2024-05-15T17:09:57.410773318Z caller=tempodb.go:335 org_id=single-tenant msg="searching for trace in block" findTraceID=00000000000000000b5e89b6afd13b28 block=4ad92767-2e90-488f-b50b-e6e94ae08f45 found=true
level=info ts=2024-05-15T17:09:57.29174733Z caller=tempodb.go:335 org_id=single-tenant msg="searching for trace in block" findTraceID=00000000000000000b5e89b6afd13b28 block=5c0c1b61-1978-4510-87f8-7e7771ca46e1 found=true
***
Query frontend log
level=info ts=2024-05-15T17:11:50.207159709Z caller=handler.go:109 tenant=single-tenant method=GET traceID=785906df70703c47 url="/api/traces/b5e89b6afd13b28?start=1715790980&end=1715794880" duration=29.993346907s status=500 err="rpc error: code = Code(499) desc = context canceled" response_size=0
Environment:
Infrastructure: Kubernetes
Deployment tool: Helm (chart tempo-distributed 1.9.5, appVersion 2.4.1)
Config:
What parameter needs to be changed for the trace to load?
storage:
  trace:
    backend: s3
    s3:
      bucket: tempo
      endpoint: "minio.local"
      access_key: "access_key"
      secret_key: "secret_key"
      insecure: false
      region: us-east-1
    pool:
      queue_depth: 70000
ingester:
  replicas: 6
  config:
    replication_factor: 6
    max_block_bytes: 100000000
distributor:
  replicas: 1
  autoscaling:
    enabled: false
    minReplicas: 1
    maxReplicas: 3
  config:
    log_received_spans:
      enabled: true
compactor:
  replicas: 1
  config:
    compaction:
      block_retention: 96h
      compacted_block_retention: 1h
      compaction_window: 1h
querier:
  replicas: 2
  config:
    frontend_worker:
      grpc_client_config:
        max_send_msg_size: 504857600
    trace_by_id:
      query_timeout: 200s
    search:
      query_timeout: 200s
      prefer_self: 20
      # external_hedge_requests_at: 0
      # external_hedge_requests_up_to: 0
    max_concurrent_queries: 30
queryFrontend:
  replicas: 1
  config:
    search:
      concurrent_jobs: 2000
      target_bytes_per_job: 904857600
    trace_by_id:
      query_shards: 150
      # hedge_requests_at: 0
      # hedge_requests_up_to: 0
traces:
  jaeger:
    grpc:
      enabled: true
      receiverConfig:
        endpoint: 0.0.0.0:14250
        max_recv_msg_size_mib: 100000
server:
  httpListenPort: 3100
  logLevel: info
  logFormat: logfmt
  grpc_server_max_recv_msg_size: 504857600
  grpc_server_max_send_msg_size: 504857600
  http_server_read_timeout: 200s
  http_server_write_timeout: 200s
global_overrides:
  ingestion_rate_strategy: global
  max_bytes_per_trace: 504857600
  max_traces_per_user: 504857600
  ingestion_burst_size_bytes: 504857600
  ingestion_rate_limit_bytes: 504857600
  max_bytes_per_tag_values_query: 50000000
It appears the query is timing out. I would consider increasing trace_by_id -> query_shards and scaling up your queriers.
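In the tempo-distributed values above, that change would look roughly like the sketch below. The numbers are illustrative starting points, not tuned recommendations:

queryFrontend:
  config:
    trace_by_id:
      query_shards: 400   # illustrative: more shards = smaller, more parallel sub-requests
querier:
  replicas: 4             # illustrative: extra queriers to absorb the additional shards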
You can also use the Tempo CLI to pull the trace directly from the blocks to get information about it:
https://grafana.com/docs/tempo/latest/operations/tempo_cli/#query-trace-summary-command
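A minimal sketch of that invocation, using the trace ID and tenant from the logs above and the S3 backend from your config (credentials and the MinIO endpoint can also be supplied via the CLI's backend flags or a config file; exact flag names vary by version, so check tempo-cli --help):

tempo-cli query trace-summary b5e89b6afd13b28 single-tenant --backend=s3 --bucket=tempo

The summary reports things like the number of blocks the trace spans, its span count, and its size, which helps tell whether you are hitting a size limit or simply fanning out over too many blocks.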
I increased trace_by_id -> query_shards to 1000 and increased the number of queriers to 4. It got better, but the ingesters and queriers started crashing with OOMKiller, even though each had 25 GB of memory.
ingester:
  replicas: 6
  config:
    replication_factor: 6
querier:
  replicas: 4
queryFrontend:
  replicas: 1
  config:
    search:
      concurrent_jobs: 2000
      target_bytes_per_job: 904857600
    trace_by_id:
      query_shards: 1000
Set replication_factor = 3. I have only very vague ideas on how Tempo will behave with RF6.
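Assuming the same tempo-distributed values layout as above, the change is just:

ingester:
  replicas: 6
  config:
    replication_factor: 3   # each span is written to 3 ingesters instead of 6

RF3 is the commonly deployed setting; with RF6, every span is stored six times and trace-by-id queries pull back that many duplicate copies from the ingesters, which plausibly contributes to the memory pressure you are seeing.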