cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.

Home Page: https://cortexmetrics.io/


Calls to bucket storage on GCS fail after upgrading to 1.16.0

sivadeepN opened this issue

Describe the bug

Calls to bucket storage are failing from different components after upgrading from 1.14.1 to 1.16.0. I didn't find any extra configuration required in the latest versions. Can someone help here? I'm pasting some logs in the thread.

ts=2024-03-20T08:45:51.503282442Z caller=cortex.go:444 level=error msg="module failed" module=alertmanager err="invalid service state: Failed, expected: Running, failure: failed to load alertmanager configurations for owned users: failed to fetch alertmanager config for user 01aa839c-559e-450e-a852-bfe01aac701c: Get \"https://storage.googleapis.com/test_alertmanager_api/alerts/01aa839c-559e-450e-a852-bfe01aac701c\": http2: server sent GOAWAY and closed the connection; LastStreamID=3, ErrCode=COMPRESSION_ERROR, debug=\"hpack_truncated_block\""

To Reproduce

  1. Upgrade Cortex from 1.14.3 to 1.16.0

Expected behavior
Cortex should keep working as before

Environment:

  • Infrastructure: Kubernetes, GCS
  • Deployment tool: Helm

Additional Context

Compactor:
ts=2024-03-14T08:14:16.378140292Z caller=compactor.go:644 level=error component=compactor msg="failed to discover users from bucket" err="Get \"https://storage.googleapis.com/storage/v1/b/test_blocks_integration/o?alt=json&delimiter=%2F&endOffset=&fields=nextPageToken%2Cprefixes%2Citems%28name%29&includeTrailingDelimiter=false&pageToken=&prefix=&prettyPrint=false&projection=full&startOffset=&versions=false\": http2: server sent GOAWAY and closed the connection; LastStreamID=1, ErrCode=COMPRESSION_ERROR, debug=\"hpack_truncated_block\""

Ingester:
[cortex-aggregator-v2-ingester-1 ingester] ts=2024-03-20T09:26:17.038736639Z caller=ingester.go:2321 level=warn msg="shipper failed to synchronize TSDB blocks with the storage" user=23835d6d-b09c-49bc-a6af-5ff74270da4c uploaded=0 err="check exists: Get \"https://storage.googleapis.com/storage/v1/b/test_blocks_integration/o/23835d6d-b09c-49bc-a6af-5ff74270da4c%2F01HSDHF39NA9ARYR4AWTW8BV2R%2Fmeta.json?alt=json&prettyPrint=false&projection=full\": http2: server sent GOAWAY and closed the connection; LastStreamID=1, ErrCode=COMPRESSION_ERROR, debug=\"hpack_truncated_block\""
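All of these failures surface at the HTTP/2 layer (GOAWAY with ErrCode=COMPRESSION_ERROR) on connections to storage.googleapis.com. As a diagnostic, Go's net/http can be kept on HTTP/1.1 via the GODEBUG environment variable; a minimal sketch of that pod-spec change is below (the container name and image tag are placeholders, and whether it takes effect depends on how the GCS client configures its transport):

```yaml
# Diagnostic sketch only: GODEBUG=http2client=0 disables HTTP/2 in Go's
# default HTTP transport, so requests to storage.googleapis.com fall back
# to HTTP/1.1. If the GOAWAY/COMPRESSION_ERROR disappears, the problem is
# in the HTTP/2 path. Container name and image tag are placeholders.
containers:
  - name: compactor
    image: quay.io/cortexproject/cortex:v1.16.0
    env:
      - name: GODEBUG
        value: http2client=0
```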

Config:

```yaml
alertmanager:
  cluster:
    listen_address: 0.0.0.0:9094
    peers: cortex-aggregator-v2-alertmanager-http-metrics-headless.cortex-2.svc.cluster.local:9094
  data_dir: /data
  enable_api: true
  external_url: /api/prom/alertmanager
alertmanager_storage:
  backend: gcs
  gcs:
    bucket_name: test_alertmanager_api
api:
  prometheus_http_prefix: /prometheus
  response_compression_enabled: true
auth_enabled: true
blocks_storage:
  backend: gcs
  bucket_store:
    bucket_index:
      enabled: true
    chunks_cache:
      backend: memcached
      memcached:
        addresses: dns+cortex-aggregator-v2-memcached-blocks-${POD_ZONE:a}.cortex-2.svc.cluster.local:11211
        max_async_buffer_size: 500000
        max_async_concurrency: 500
        max_get_multi_batch_size: 500
        max_get_multi_concurrency: 1000
        max_idle_connections: 500
        timeout: 15s
    index_cache:
      backend: memcached
      memcached:
        addresses: dns+cortex-aggregator-v2-memcached-blocks-index-${POD_ZONE:a}.cortex-2.svc.cluster.local:11211
        max_async_buffer_size: 500000
        max_async_concurrency: 500
        max_get_multi_batch_size: 500
        max_get_multi_concurrency: 1000
        max_idle_connections: 500
        max_item_size: 10485760
        timeout: 15s
    metadata_cache:
      backend: memcached
      memcached:
        addresses: dns+cortex-aggregator-v2-memcached-blocks-metadata.cortex-2.svc.cluster.local:11211
    sync_dir: /data/tsdb-sync
  gcs:
    bucket_name: test_blocks_integration
  tsdb:
    dir: /data/tsdb
    max_exemplars: 10000
    retention_period: 6h
compactor:
  block_deletion_marks_migration_enabled: false
  sharding_enabled: true
  sharding_ring:
    kvstore:
      consul:
        host: consul:8500
      store: consul
distributor:
  pool:
    health_check_ingesters: true
  remote_timeout: 2s
  ring:
    kvstore:
      store: memberlist
  shard_by_all_labels: true
frontend:
  grpc_client_config:
    grpc_compression: gzip
  log_queries_longer_than: 10s
  max_outstanding_per_tenant: 500
frontend_worker:
  frontend_address: cortex-aggregator-v2-query-frontend-headless:9095
  grpc_client_config:
    backoff_config:
      max_period: 10s
      max_retries: 2
      min_period: 100ms
    grpc_compression: gzip
ingester:
  lifecycler:
    availability_zone: ${POD_ZONE}
    final_sleep: 30s
    heartbeat_period: 15s
    join_after: 30s
    num_tokens: 256
    observe_period: 10s
    ring:
      heartbeat_timeout: 1m
      kvstore:
        consul:
          consistent_reads: false
          host: consul:8500
          http_client_timeout: 20s
        prefix: collectors/
        store: consul
      replication_factor: 3
      zone_awareness_enabled: true
    tokens_file_path: /data/tokens
ingester_client:
  grpc_client_config:
    grpc_compression: gzip
    max_recv_msg_size: 104857600
    max_send_msg_size: 16777216
limits:
  enforce_metric_name: true
  ingestion_burst_size: 50000
  ingestion_rate: 350000
  ingestion_rate_strategy: global
  max_cache_freshness: 5m
  max_fetched_chunks_per_query: 2000000
  max_fetched_series_per_query: 100000
  max_global_series_per_user: 5600000
  max_label_name_length: 1024
  max_label_names_per_series: 100
  max_query_lookback: 4536h
  max_series_per_metric: 50000
  max_series_per_user: 0
  reject_old_samples: false
  reject_old_samples_max_age: 168h
memberlist:
  abort_if_cluster_join_fails: false
  bind_port: 7946
  join_members:
    - cortex-aggregator-v2-distributor-memberlist
querier:
  active_query_tracker_dir: /data/cortex/querier
  batch_iterators: true
  ingester_streaming: true
  lookback_delta: 2m
  max_concurrent: 20
  max_samples: 50000000
  query_ingesters_within: 6h
  query_store_after: 5h55m
  store_gateway_addresses: ""
  store_gateway_client:
    grpc_compression: gzip
  timeout: 2m
query_range:
  align_queries_with_step: true
  cache_results: true
  max_retries: 2
  results_cache:
    cache:
      background:
        writeback_buffer: 10000
        writeback_goroutines: 10
      memcached:
        batch_size: 1024
        expiration: 3h
        parallelism: 100
      memcached_client:
        service: cortex-aggregator-v2-memcached-frontend
        timeout: 1s
  split_queries_by_interval: 24h
ruler:
  alertmanager_refresh_interval: 2m
  alertmanager_url: http://_http-metrics._tcp.cortex-aggregator-v2-alertmanager-http-metrics-headless/api/prom/alertmanager
  enable_alertmanager_discovery: true
  enable_api: true
  enable_sharding: true
  evaluation_interval: 30s
  ring:
    kvstore:
      consul:
        host: consul:8500
      prefix: rulers/
      store: consul
  rule_path: /data/rules
  ruler_client:
    grpc_compression: gzip
ruler_storage:
  backend: gcs
  gcs:
    bucket_name: test_ruler_api
runtime_config:
  file: /etc/cortex/overrides/overrides.yaml
  period: 10s
server:
  grpc_listen_port: 9095
  grpc_server_max_concurrent_streams: 100
  grpc_server_max_recv_msg_size: 10485760
  grpc_server_max_send_msg_size: 4194304
  http_listen_port: 80
  log_level: warn
storage:
  engine: blocks
store_gateway:
  sharding_enabled: true
  sharding_ring:
    kvstore:
      consul:
        host: consul:8500
      store: consul
    tokens_file_path: /data/tokens
tracing:
  otel:
    oltp_endpoint: grafana-agent-traces.tempo.svc.cluster.local:4317
  type: otel
```

Hey @sivadeepN, personally I don't use GCS, so I'm unsure about this error, but we do have other users on GCS who experience no issues with the same GCS bucket client.

I'm wondering if it's because the 1.16 release still uses an old version of the GCS client library. Can you try the latest master image on one of your containers and see if the error is still there?

quay.io/cortexproject/cortex:master-065e382
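If you're deploying via Helm, a sketch of overriding the image on just one component might look like this (the values layout below is an assumption about your chart, not a confirmed key path; adapt it to your own values file):

```yaml
# Hypothetical Helm values fragment: run only the compactor on the master
# build while the other components stay on the v1.16.0 release image.
compactor:
  image:
    repository: quay.io/cortexproject/cortex
    tag: master-065e382
```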

Even updating to the master image didn't work; this would be a production blocker for all users on GCS.