cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.

Home Page: https://cortexmetrics.io/

Allow runtime config to be optionally loaded with strict or non-strict unmarshaling

anna-tran opened this issue

Issue

Currently, Cortex crashes when the runtime config YAML contains an unknown field. When updating user limits, a single misspelled field name for a single user (which the YAML decoder then treats as an unknown field in the Cortex limits struct) can bring the whole service down.

Proposal

Add a boolean option to the Cortex config that makes unmarshaling of the runtime config strict or non-strict, and have that value determine whether strict mode is enabled here.

Enabling non-strict mode allows Cortex to ignore unknown fields when unmarshaling, so a bad config for a single user does not affect the availability of the Cortex service for other users.
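The difference between the two modes can be sketched as follows. This is only an illustration under stated assumptions: it uses the Go standard library's `encoding/json` with `DisallowUnknownFields` as a stand-in for the YAML decoder so that it runs without third-party modules, and the `Limits` struct, the `max_series` field, and the `decodeLimits` helper are hypothetical names, not Cortex code.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// Limits is a hypothetical, heavily reduced per-tenant limits struct.
type Limits struct {
	MaxSeries int `json:"max_series"`
}

// decodeLimits decodes data into Limits. When strict is true, unknown
// field names are rejected; when false, they are silently ignored.
func decodeLimits(data []byte, strict bool) (Limits, error) {
	var l Limits
	dec := json.NewDecoder(bytes.NewReader(data))
	if strict {
		dec.DisallowUnknownFields()
	}
	err := dec.Decode(&l)
	return l, err
}

func main() {
	// "max_seris" is a deliberate typo for "max_series".
	bad := []byte(`{"max_seris": 5000}`)

	_, err := decodeLimits(bad, true)
	fmt.Println("strict mode returned an error:", err != nil)

	l, err := decodeLimits(bad, false)
	fmt.Println("non-strict mode returned an error:", err != nil)
	fmt.Println("MaxSeries after non-strict decode:", l.MaxSeries)
}
```

In strict mode the typo surfaces as an error; in non-strict mode the decode succeeds and the misspelled value is simply dropped.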

I am not a fan of non-strict mode. Currently, if the config is updated incorrectly, a metric is generated, which can be used for alerting. As long as no Cortex processes are restarted, they keep the old config and everything keeps working. If we are using Kubernetes and enough pods have crashed, the PDB can protect us from too many restarts while we attend to the alert.
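The behaviour described in this comment (a failed reload is reported, and a running process keeps its previous config) can be sketched roughly as below. It is a minimal sketch, not the Cortex implementation: the `manager` type, its `reloadFailures` counter (a stand-in for a real metric), and the JSON decoder used in place of YAML are all assumptions made for illustration.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// Limits is a hypothetical, heavily reduced limits struct.
type Limits struct {
	MaxSeries int `json:"max_series"`
}

// manager is a hypothetical holder for the last successfully loaded config.
type manager struct {
	current        Limits
	reloadFailures int // stand-in for a metric that alerting could watch
}

// reload strictly decodes data. On failure it leaves the current config
// untouched and only records the failure.
func (m *manager) reload(data []byte) error {
	var next Limits
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&next); err != nil {
		m.reloadFailures++
		return err
	}
	m.current = next
	return nil
}

func main() {
	m := &manager{}
	_ = m.reload([]byte(`{"max_series": 1000}`))   // valid config
	err := m.reload([]byte(`{"max_seris": 5000}`)) // typo: rejected
	fmt.Println("reload failed:", err != nil)
	fmt.Println("config still in use:", m.current.MaxSeries)
	fmt.Println("failures recorded:", m.reloadFailures)
}
```

A freshly started process has no previous config to fall back on, which is presumably why the failure shows up as a crash on restart rather than during a live reload.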

With non-strict mode, we might have broken things for a user and there is no alert for that. It makes it hard to reason about the health of the config.

If I have a typo in the active_series field, for example, Cortex will ignore this field in non-strict mode and then use the default value for active_series? That sounds even worse than crashing Cortex pods.

To answer your question @yeya24, yes, a typo in the active_series field will force Cortex to use the default value.
I think these are fair points. We can use the Kubernetes PDB to protect Cortex from loading a bad runtime configuration, and create a mechanism on our service side to roll back a bad configuration version.
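The silent fallback to the default value confirmed above can be sketched like this. The example is an illustration only: the `Limits` struct, the `defaultActiveSeries` constant and its value, and the `loadNonStrict` helper are hypothetical, and `encoding/json` again stands in for the YAML decoder so that the example runs with the standard library alone.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Limits is a hypothetical limits struct; the field name follows the thread.
type Limits struct {
	ActiveSeries int `json:"active_series"`
}

// defaultActiveSeries is a made-up default used only for this example.
const defaultActiveSeries = 100000

// loadNonStrict starts from the defaults and overlays whatever known
// fields the input contains. Unknown fields are ignored without error.
func loadNonStrict(data []byte) (Limits, error) {
	l := Limits{ActiveSeries: defaultActiveSeries}
	err := json.Unmarshal(data, &l)
	return l, err
}

func main() {
	// The operator meant to set active_series to 5 but misspelled the key.
	l, err := loadNonStrict([]byte(`{"active_seris": 5}`))
	fmt.Println("error:", err)
	fmt.Println("effective ActiveSeries:", l.ActiveSeries)
}
```

No error is reported and the tenant ends up on the default rather than the intended value, which is the failure mode the comments above object to.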