Allow runtime config to be optionally loaded with strict or non-strict unmarshaling
anna-tran opened this issue · comments
Issue
Currently, Cortex crashes when the runtime config YAML contains an unknown field. When updating user limits, a misspelled field name for a single user (which the YAML parser treats as an unknown field in the Cortex limits struct) can bring the whole service down.
Proposal
Add a boolean value to the Cortex config that makes unmarshaling of the runtime config strict or non-strict, and have that value determine whether strict mode is enabled here.
Enabling non-strict mode allows Cortex to ignore unknown fields when unmarshaling, so a bad config for a single user does not affect the Cortex service availability for other users.
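To make the difference between the two modes concrete, here is a minimal sketch. It is not Cortex code: the `Limits` struct and the `decodeLimits` helper are hypothetical, and it uses JSON with Go's standard `encoding/json` decoder as a stand-in for the YAML unmarshaling so that it runs without external dependencies. The `strict` argument plays the role of the proposed boolean.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// Limits is a hypothetical, heavily reduced stand-in for a per-user limits struct.
type Limits struct {
	ActiveSeries int `json:"active_series"`
}

// decodeLimits decodes raw into a Limits value. When strict is true,
// unknown fields cause an error; otherwise they are silently ignored.
func decodeLimits(raw []byte, strict bool) (Limits, error) {
	var l Limits
	dec := json.NewDecoder(bytes.NewReader(raw))
	if strict {
		dec.DisallowUnknownFields()
	}
	err := dec.Decode(&l)
	return l, err
}

func main() {
	// "active_serie" is a deliberate typo of "active_series".
	raw := []byte(`{"active_serie": 500}`)

	// Strict: the misspelled field is rejected with an error.
	_, err := decodeLimits(raw, true)
	fmt.Println("strict error:", err)

	// Non-strict: the misspelled field is dropped and no error is returned.
	l, err := decodeLimits(raw, false)
	fmt.Println("non-strict:", l.ActiveSeries, err)
}
```

In strict mode the typo surfaces as an error the caller has to handle; in non-strict mode the load succeeds and the misspelled field is dropped without any signal.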
I am not a fan of non-strict. Currently, if the config is updated incorrectly, a metric is generated, which can be used for alerting. If no Cortex processes are restarted, they keep the old config and everything keeps working. If we are using Kubernetes, once enough pods have crashed, the PDB can protect us from having too many restarts while we attend to the alert.
With non-strict, we might have broken things for a user with no alert for it. That makes it hard to reason about the health of the config.
If I have a typo in the active_series field, for example, then in non-strict mode Cortex will ignore this field and use the default value for active_series? This sounds even worse than crashing Cortex pods.
To answer your question @yeya24: yes, a typo in the active_series field will force Cortex to use the default value.
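A small sketch of that fallback, under stated assumptions: `UserLimits`, `defaultActiveSeries`, and `loadUserLimits` are hypothetical names, the default of 1000 is made up for illustration, and JSON via the standard library stands in for the YAML runtime config. The struct is pre-populated with the default before a non-strict decode, so a misspelled key leaves the default in place and returns no error.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// UserLimits is a hypothetical, reduced per-user limits struct.
type UserLimits struct {
	ActiveSeries int `json:"active_series"`
}

// defaultActiveSeries is an assumed default chosen only for this example.
const defaultActiveSeries = 1000

// loadUserLimits starts from the default and applies the user's overrides.
// json.Unmarshal ignores unknown keys, which mimics non-strict mode.
func loadUserLimits(raw []byte) (UserLimits, error) {
	l := UserLimits{ActiveSeries: defaultActiveSeries}
	err := json.Unmarshal(raw, &l)
	return l, err
}

func main() {
	// The operator intended to set 500 but misspelled the key.
	l, err := loadUserLimits([]byte(`{"active_serie": 500}`))
	fmt.Println(l.ActiveSeries, err)
}
```

The load reports success while the user keeps the default limit instead of the intended 500, which is the silent misconfiguration raised above.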
I think these are fair points. We can use the Kubernetes PDB to protect Cortex from loading a bad runtime configuration, and create a mechanism on our service side to roll back a bad configuration version.