cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.

Home Page: https://cortexmetrics.io/


Calls to bucket storage on GCS fail after upgrading to 1.16.0

sivadeepN opened this issue

Describe the bug

Calls to bucket storage are failing from different components after upgrading from 1.14.1 to 1.16.0. I didn't find any extra configuration required in the latest versions. Can someone help here? I'm pasting some logs in the thread.

ts=2024-03-20T08:45:51.503282442Z caller=cortex.go:444 level=error msg="module failed" module=alertmanager err="invalid service state: Failed, expected: Running, failure: failed to load alertmanager configurations for owned users: failed to fetch alertmanager config for user 01aa839c-559e-450e-a852-bfe01aac701c: Get \"https://storage.googleapis.com/test_alertmanager_api/alerts/01aa839c-559e-450e-a852-bfe01aac701c\": http2: server sent GOAWAY and closed the connection; LastStreamID=3, ErrCode=COMPRESSION_ERROR, debug=\"hpack_truncated_block\""

To Reproduce

  1. Upgrade Cortex from 1.14.3 to 1.16.0

Expected behavior
Cortex should keep working as before

Environment:

  • Infrastructure: Kubernetes, GCS
  • Deployment tool: Helm

Additional Context

Compactor:
ts=2024-03-14T08:14:16.378140292Z caller=compactor.go:644 level=error component=compactor msg="failed to discover users from bucket" err="Get \"https://storage.googleapis.com/storage/v1/b/test_blocks_integration/o?alt=json&delimiter=%2F&endOffset=&fields=nextPageToken%2Cprefixes%2Citems%28name%29&includeTrailingDelimiter=false&pageToken=&prefix=&prettyPrint=false&projection=full&startOffset=&versions=false\": http2: server sent GOAWAY and closed the connection; LastStreamID=1, ErrCode=COMPRESSION_ERROR, debug=\"hpack_truncated_block\""

Ingester:
[cortex-aggregator-v2-ingester-1 ingester] ts=2024-03-20T09:26:17.038736639Z caller=ingester.go:2321 level=warn msg="shipper failed to synchronize TSDB blocks with the storage" user=23835d6d-b09c-49bc-a6af-5ff74270da4c uploaded=0 err="check exists: Get \"https://storage.googleapis.com/storage/v1/b/test_blocks_integration/o/23835d6d-b09c-49bc-a6af-5ff74270da4c%2F01HSDHF39NA9ARYR4AWTW8BV2R%2Fmeta.json?alt=json&prettyPrint=false&projection=full\": http2: server sent GOAWAY and closed the connection; LastStreamID=1, ErrCode=COMPRESSION_ERROR, debug=\"hpack_truncated_block\""
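All of these failures surface at the HTTP/2 layer (GOAWAY with ErrCode=COMPRESSION_ERROR) on connections to storage.googleapis.com. As a diagnostic, Go's net/http can be kept on HTTP/1.1 via the GODEBUG environment variable; a minimal sketch of that pod-spec change is below (the container name and image tag are placeholders, and whether it takes effect depends on how the GCS client configures its transport):

```yaml
# Diagnostic sketch only: GODEBUG=http2client=0 disables HTTP/2 in Go's
# default HTTP transport, so requests to storage.googleapis.com fall back
# to HTTP/1.1. If the GOAWAY/COMPRESSION_ERROR disappears, the problem is
# in the HTTP/2 path. Container name and image tag are placeholders.
containers:
  - name: compactor
    image: quay.io/cortexproject/cortex:v1.16.0
    env:
      - name: GODEBUG
        value: http2client=0
```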

Config:

```yaml
alertmanager:
  cluster:
    listen_address: 0.0.0.0:9094
    peers: cortex-aggregator-v2-alertmanager-http-metrics-headless.cortex-2.svc.cluster.local:9094
  data_dir: /data
  enable_api: true
  external_url: /api/prom/alertmanager
alertmanager_storage:
  backend: gcs
  gcs:
    bucket_name: test_alertmanager_api
api:
  prometheus_http_prefix: /prometheus
  response_compression_enabled: true
auth_enabled: true
blocks_storage:
  backend: gcs
  bucket_store:
    bucket_index:
      enabled: true
    chunks_cache:
      backend: memcached
      memcached:
        addresses: dns+cortex-aggregator-v2-memcached-blocks-${POD_ZONE:a}.cortex-2.svc.cluster.local:11211
        max_async_buffer_size: 500000
        max_async_concurrency: 500
        max_get_multi_batch_size: 500
        max_get_multi_concurrency: 1000
        max_idle_connections: 500
        timeout: 15s
    index_cache:
      backend: memcached
      memcached:
        addresses: dns+cortex-aggregator-v2-memcached-blocks-index-${POD_ZONE:a}.cortex-2.svc.cluster.local:11211
        max_async_buffer_size: 500000
        max_async_concurrency: 500
        max_get_multi_batch_size: 500
        max_get_multi_concurrency: 1000
        max_idle_connections: 500
        max_item_size: 10485760
        timeout: 15s
    metadata_cache:
      backend: memcached
      memcached:
        addresses: dns+cortex-aggregator-v2-memcached-blocks-metadata.cortex-2.svc.cluster.local:11211
    sync_dir: /data/tsdb-sync
  gcs:
    bucket_name: test_blocks_integration
  tsdb:
    dir: /data/tsdb
    max_exemplars: 10000
    retention_period: 6h
compactor:
  block_deletion_marks_migration_enabled: false
  sharding_enabled: true
  sharding_ring:
    kvstore:
      consul:
        host: consul:8500
      store: consul
distributor:
  pool:
    health_check_ingesters: true
  remote_timeout: 2s
  ring:
    kvstore:
      store: memberlist
  shard_by_all_labels: true
frontend:
  grpc_client_config:
    grpc_compression: gzip
  log_queries_longer_than: 10s
  max_outstanding_per_tenant: 500
frontend_worker:
  frontend_address: cortex-aggregator-v2-query-frontend-headless:9095
  grpc_client_config:
    backoff_config:
      max_period: 10s
      max_retries: 2
      min_period: 100ms
    grpc_compression: gzip
ingester:
  lifecycler:
    availability_zone: ${POD_ZONE}
    final_sleep: 30s
    heartbeat_period: 15s
    join_after: 30s
    num_tokens: 256
    observe_period: 10s
    ring:
      heartbeat_timeout: 1m
      kvstore:
        consul:
          consistent_reads: false
          host: consul:8500
          http_client_timeout: 20s
        prefix: collectors/
        store: consul
      replication_factor: 3
      zone_awareness_enabled: true
    tokens_file_path: /data/tokens
ingester_client:
  grpc_client_config:
    grpc_compression: gzip
    max_recv_msg_size: 104857600
    max_send_msg_size: 16777216
limits:
  enforce_metric_name: true
  ingestion_burst_size: 50000
  ingestion_rate: 350000
  ingestion_rate_strategy: global
  max_cache_freshness: 5m
  max_fetched_chunks_per_query: 2000000
  max_fetched_series_per_query: 100000
  max_global_series_per_user: 5600000
  max_label_name_length: 1024
  max_label_names_per_series: 100
  max_query_lookback: 4536h
  max_series_per_metric: 50000
  max_series_per_user: 0
  reject_old_samples: false
  reject_old_samples_max_age: 168h
memberlist:
  abort_if_cluster_join_fails: false
  bind_port: 7946
  join_members:
    - cortex-aggregator-v2-distributor-memberlist
querier:
  active_query_tracker_dir: /data/cortex/querier
  batch_iterators: true
  ingester_streaming: true
  lookback_delta: 2m
  max_concurrent: 20
  max_samples: 50000000
  query_ingesters_within: 6h
  query_store_after: 5h55m
  store_gateway_addresses: ""
  store_gateway_client:
    grpc_compression: gzip
  timeout: 2m
query_range:
  align_queries_with_step: true
  cache_results: true
  max_retries: 2
  results_cache:
    cache:
      background:
        writeback_buffer: 10000
        writeback_goroutines: 10
      memcached:
        batch_size: 1024
        expiration: 3h
        parallelism: 100
      memcached_client:
        service: cortex-aggregator-v2-memcached-frontend
        timeout: 1s
  split_queries_by_interval: 24h
ruler:
  alertmanager_refresh_interval: 2m
  alertmanager_url: http://_http-metrics._tcp.cortex-aggregator-v2-alertmanager-http-metrics-headless/api/prom/alertmanager
  enable_alertmanager_discovery: true
  enable_api: true
  enable_sharding: true
  evaluation_interval: 30s
  ring:
    kvstore:
      consul:
        host: consul:8500
      prefix: rulers/
      store: consul
  rule_path: /data/rules
  ruler_client:
    grpc_compression: gzip
ruler_storage:
  backend: gcs
  gcs:
    bucket_name: test_ruler_api
runtime_config:
  file: /etc/cortex/overrides/overrides.yaml
  period: 10s
server:
  grpc_listen_port: 9095
  grpc_server_max_concurrent_streams: 100
  grpc_server_max_recv_msg_size: 10485760
  grpc_server_max_send_msg_size: 4194304
  http_listen_port: 80
  log_level: warn
storage:
  engine: blocks
store_gateway:
  sharding_enabled: true
  sharding_ring:
    kvstore:
      consul:
        host: consul:8500
      store: consul
    tokens_file_path: /data/tokens
tracing:
  otel:
    oltp_endpoint: grafana-agent-traces.tempo.svc.cluster.local:4317
  type: otel
```

Hey @sivadeepN, personally I don't use GCS, so I'm unsure about this error, but we do have other users on GCS who experience no issues with the same GCS bucket client.

I'm wondering if it's because the 1.16 release still uses an old version of the GCS client library. Can you try the latest master image on one of your containers and see if the error is still there?

quay.io/cortexproject/cortex:master-065e382
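If you're deploying via Helm, a sketch of overriding the image on just one component might look like this (the values layout below is an assumption about your chart, not a confirmed key path; adapt it to your own values file):

```yaml
# Hypothetical Helm values fragment: run only the compactor on the master
# build while the other components stay on the v1.16.0 release image.
compactor:
  image:
    repository: quay.io/cortexproject/cortex
    tag: master-065e382
```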

Even updating to the master image didn't work; this would be a production blocker for all users on GCS.