mozilla / bigquery-etl

Bigquery ETL

Home Page:https://mozilla.github.io/bigquery-etl


Telemetry Dev Cycle: active metric count is wrong

badboy opened this issue · comments

The current count of active metrics is wrong in the Dev Cycle Dashboard.
The whole calculation seems wrong right now.
Note how the firefox-ios channels have nearly no active metrics listed (23 out of 403 today for release).

Here's why:

  1. It all starts with the probe-info service, which is ingested into a table here: https://github.com/mozilla/bigquery-etl/blob/b5294fcdbcad505068852fac1f22c24ca9e00b8f/sql/moz-fx-data-shared-prod/telemetry_dev_cycle_external/glean_metrics_stats_v1/query.py
  • That sets the first_seen_date and last_seen_date fields for that metric. Note: these are when the metric definition was added and when it was last modified. This is NOT when we saw data for it.
  2. The derived table does some transformations, probably to handle version-based expiry (that's what I understood from it, is that right?), and matches that against the release dates of an app. This later ends up picking the last_seen_date of that metric as the expired_date (because the actual expiry date is long after the last_seen_date).
  3. The Explore will generate SQL that counts a metric as "active" if it does NOT have an expired_date: COUNT(CASE WHEN (glean_metrics_stats.expired_date IS NULL) THEN 1 ELSE NULL END) AS glean_metrics_stats_active_metrics
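To make the effect of the last step concrete, here is a minimal Python sketch of the same counting rule. The rows are invented illustration data, not real probe-info output, and `count_active` is a hypothetical helper that only mirrors the COUNT(CASE ...) expression quoted above.

```python
from datetime import date
from typing import Optional, TypedDict


class MetricRow(TypedDict):
    metric: str
    expired_date: Optional[date]


def count_active(rows: list[MetricRow]) -> int:
    """Mirror of COUNT(CASE WHEN expired_date IS NULL THEN 1 ELSE NULL END)."""
    return sum(1 for row in rows if row["expired_date"] is None)


# Invented rows. The second metric is assumed to never expire, but its
# expired_date was filled with its last_seen_date, as described in the list above.
rows: list[MetricRow] = [
    {"metric": "example.metric_a", "expired_date": None},
    {"metric": "example.metric_b", "expired_date": date(2024, 1, 10)},
]

active = count_active(rows)
print(active)
```

Any metric whose expired_date was filled in from last_seen_date drops out of this count, even if its definition is still in the source tree.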

I assume the "active" metrics count was meant to report those metrics for which we have actually seen data recently?
In that case just the probe-info service won't be enough.
GLAM also does something similar and it's bitten us before where we reported something as inactive/unavailable because it wasn't there often enough.

IMO the metrics_stats table should just report when a metric expires, and the dashboard should just show non-expired (=active) metrics, regardless of whether we have seen data for them in the past couple of days.

cc @lelilia @kik-kik as you did the initial PR #4730


commented

Thank you very much for the feedback!

Yes, I get the first_seen_date and last_seen_date from the probe-info. The matching to versions is only done for those fields where the expiration is given as a version number; most of them are given as a date or "never".

I see that a lot of the metrics "expire" under my current definition in the past couple of days, so my threshold is definitely off. But the way I understood it, a metric can be set to expire "never", and then the collecting part is removed from the code, so it's actually expired even though it doesn't say so in the expiry part of probe-info? That is why I'm matching it the way I do: if the last seen is way before the expiry, it's expired; if the expiry is before now, it's expired; else it's still active.
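The matching described above could be sketched roughly as follows. This is an illustration of the comment's wording, not the actual query: `expired_date_heuristic` and `threshold_days` are hypothetical names, the real threshold is not stated in this thread, and "way before" is assumed here to mean "last seen more than `threshold_days` before today". Version-based expiry is left out.

```python
from datetime import date, timedelta
from typing import Optional


def expired_date_heuristic(
    last_seen: date,
    expires: Optional[date],  # None stands for expiry "never"
    today: date,
    threshold_days: int,  # hypothetical; the real value is not given in the thread
) -> Optional[date]:
    """Return the date a metric is treated as expired, or None if still active."""
    # Rule 1 (assumed reading): not seen within the threshold, so treat the
    # last_seen date as the expiry date.
    if today - last_seen > timedelta(days=threshold_days):
        return last_seen
    # Rule 2: the declared expiry date has already passed.
    if expires is not None and expires < today:
        return expires
    # Otherwise the metric is still considered active.
    return None


# A never-expiring metric whose definition was last touched a month ago.
result = expired_date_heuristic(
    last_seen=date(2024, 1, 1),
    expires=None,
    today=date(2024, 2, 1),
    threshold_days=7,
)
print(result)
```

Under this reading, a never-expiring metric gets an expired_date as soon as its last_seen date falls outside the threshold.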

So would your suggestion be to increase the threshold of "have seen it in the last x days", to just go with the expiration in probe-info, or to query another table for more info?

The last_seen in the probe-info is when the definition in the metrics.yaml was last touched. It does not express whether we still see data for that metric.
There's a field, in-source, which states whether the definition still exists in the metrics.yaml.
So never-expiring metrics should still be active if in-source: true.
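A minimal sketch of that rule, under stated assumptions: `is_active` is a hypothetical helper, `in_source` stands for the in-source field, and `expires` is None for "never". Treating every metric with in-source: false as inactive, and treating a metric as expired on its expiry date itself, are assumptions made here and are not confirmed in this thread. Version-based expiry is not handled.

```python
from datetime import date
from typing import Optional


def is_active(expires: Optional[date], in_source: bool, today: date) -> bool:
    """Active means: definition still in metrics.yaml and not yet expired."""
    # Assumption: a definition removed from metrics.yaml is never active.
    if not in_source:
        return False
    # expires=None models "never"; otherwise the metric is active strictly
    # before its expiry date (boundary handling is an assumption).
    return expires is None or expires > today


today = date(2024, 2, 1)
print(is_active(None, True, today))               # never-expiring, still in source
print(is_active(None, False, today))              # never-expiring, removed from source
print(is_active(date(2024, 1, 15), True, today))  # date-based expiry already passed
```

Note that last_seen_date does not appear in this check at all, so a definition that has not been touched for a long time stays active.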