Do not cache so many prometheus metrics
olevski opened this issue · comments
Amalthea has a lot of metrics because it creates a new metric for each combination of user and project id.
I did not know this, but Prometheus caches all of these metrics even after they have been scraped, until .remove is called on the metric with the label values that should be removed from the cache.
Without removing them the cache keeps growing, and at some point Amalthea was publishing 100k-200k metrics at a time.
So we should periodically call .remove on metrics that we do not need anymore. The easiest way to do this is to call it 1 minute after a session is killed: after a session is removed, we call remove for the user id and project id combination of that session.
If we call remove immediately, Prometheus will not have a chance to scrape the metrics that were generated, which is why we need a delay. This delay can be configurable, with a default of 1 minute.
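A rough sketch of the delayed removal described above, not Amalthea's actual code. It assumes the metrics are declared with the labels user_id and project_id in that order, and that each metric exposes remove(*labelvalues) as the labelled metrics in the Python prometheus_client do. The helper name schedule_metric_removal and the delay_seconds parameter are made up for illustration.

```python
import threading
from typing import Protocol, Sequence


class LabelledMetric(Protocol):
    """Any object with a prometheus_client-style remove(*labelvalues) method."""

    def remove(self, *labelvalues: str) -> None: ...


def schedule_metric_removal(
    metrics: Sequence[LabelledMetric],
    user_id: str,
    project_id: str,
    delay_seconds: float = 60.0,  # hypothetical config option, default 1 minute
) -> threading.Timer:
    """Drop the (user_id, project_id) series from every metric after a delay.

    The delay gives Prometheus a chance to scrape the final values of a
    session before they disappear from the client's cache.
    """

    def _remove() -> None:
        for metric in metrics:
            try:
                # Label values must be passed in the order the labels were declared.
                metric.remove(user_id, project_id)
            except KeyError:
                # Assumption: some client versions raise KeyError when the
                # series does not exist; treat that as already removed.
                pass

    timer = threading.Timer(delay_seconds, _remove)
    timer.daemon = True  # do not keep the process alive just for cleanup
    timer.start()
    return timer
```

The handler that reacts to a session being deleted would call this once per session, passing the list of labelled metrics and that session's user id and project id. A timer per session is the simplest option. If sessions are deleted at a very high rate, a single periodic cleanup task that processes a queue of pending removals may be a better fit.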