google / trillian

A transparent, highly scalable and cryptographically verifiable data store.

Add convenience metric

pgporada opened this issue · comments

We rely on Grafana, Prometheus, and Alertmanager for our monitoring stack. When metrics are ingested, they carry a logid label that is not human-friendly, as in entries_added{logid="abcdef1234567890abc"}. According to the docs, an optional display name can be set per shard. It would be a big help if Trillian exported a metric that contained the display_name, tree_state, and tree_type.

A metric such as this would be perfect: shard_information{logid="abcdef1234567890abc",display_name="2020",tree_state="active",tree_type="LOG"}

With this proposed metric, we would be able to use Prometheus' group_left and make the rest of our metrics human-friendly.

Here's an example of what we currently do in Grafana. Configuring this across multiple dashboards is time-consuming, requires manual intervention whenever a new shard is added, and is prone to error. Additionally, because the mapping is done in Grafana, the human-friendly name can't be sent to a Prometheus alert.
[Screenshot: shard-mapping]

I've considered writing a database exporter to generate such a metric, but I think it would be better built into Trillian itself.

Perhaps this already exists and I've missed it. If not, thank you for considering it.

Internally, we expose these log names as another metric of the logsigner binary. On our dashboards we join it with the logserver and logsigner metrics. We could do something similar here. Does your setup allow you to join metrics from different processes?

An alternative approach is to bake the logid->name mapping into your monitoring stack rather than Trillian metrics. I.e. you would have some (very short) map of {123->"2019", 456->"2020", ...}, and whenever you see a logid label you would automatically add a logname label from this mapping before it gets to the graphing/alerting phase. Is that possible in your stack?
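To make the mapping idea concrete, here is a minimal Go sketch of the relabeling step, assuming labels are available as a plain string map before graphing or alerting. The names logNames and withLogName are hypothetical and do not come from Trillian or any monitoring tool; a real stack would more likely do this in its own relabeling or join mechanism.

```go
package main

import "fmt"

// logNames is a hypothetical, hand-maintained mapping from log ID to a
// human-friendly name, as in {123->"2019", 456->"2020"}.
var logNames = map[string]string{
	"123": "2019",
	"456": "2020",
}

// withLogName returns a copy of labels with a "logname" label added when the
// "logid" label is present and known. Labels without a known logid are
// copied through unchanged.
func withLogName(labels, names map[string]string) map[string]string {
	out := make(map[string]string, len(labels)+1)
	for k, v := range labels {
		out[k] = v
	}
	if name, ok := names[labels["logid"]]; ok {
		out["logname"] = name
	}
	return out
}

func main() {
	in := map[string]string{"logid": "456"}
	fmt.Println(withLogName(in, logNames)["logname"])
}
```

The drawback raised earlier in the thread still applies: the map has to be updated by hand whenever a shard is added.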

I realise that one of the reasons we don't have this metric externally (yet) is that it's computed differently from the others. All metrics in Trillian are set in-line when their value is known. The "log name / type" metric is collected in response to the monitoring system's "pull" queries, not in-line. Does Prometheus have such "callback" kinds of metrics? We could look at supporting them at the interface level.

By the looks of it, metrics like CounterFunc and GaugeFunc could be helpful, but they seem to be constrained to not have any non-constant labels. Is there a workaround?

As a workaround, I set up a MySQL data source and used the following Grafana variable config.

SELECT CAST(TreeId as CHAR) as __value, DisplayName as __text FROM Trees;

This gives a human-readable mapping for each logid, for example:

2023: abcsometrillianinternalidentifier123