cloudflare / daphne

Implementation of DAP

Report metrics are over-counted if `try_put_agg_share_span()` is retried

cjpatton opened this issue

Suppose `handle_agg_job_cont_req()` marks a report as rejected due to a VDAF prep failure and increments the corresponding Prometheus metric. If, on the subsequent call to `try_put_agg_share_span()`, another report turns out to be a replay, the whole sequence is retried: the first report is marked rejected once more and the Prometheus metric is incremented again.
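A toy model of the over-counting (every name and type below is an illustrative stand-in, not daphne's actual API): because the rejected-report counter is bumped as a side effect inside the protocol logic, re-running that logic on retry counts the same rejection twice.

```rust
use std::collections::HashSet;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct ReportId(u8);

struct Metrics {
    rejected: u64,
}

// Stand-in for handle_agg_job_cont_req(): re-runs VDAF preparation for
// the whole job and bumps the rejected-report metric as a side effect.
fn handle_agg_job_cont_req(
    reports: &[ReportId],
    replays: &HashSet<ReportId>,
    metrics: &mut Metrics,
) -> Vec<ReportId> {
    let mut span = Vec::new();
    for &r in reports {
        if replays.contains(&r) {
            continue; // replays are skipped on the retry...
        }
        if r.0 == 1 {
            // ...but report 1, which always fails VDAF prep in this toy,
            // is rejected and counted again.
            metrics.rejected += 1;
        } else {
            span.push(r);
        }
    }
    span
}

fn main() {
    let reports = [ReportId(1), ReportId(2)];
    let mut metrics = Metrics { rejected: 0 };
    let mut replays = HashSet::new();

    // First attempt: report 1 is rejected (metric = 1), and suppose the
    // put fails because the storage layer discovers report 2 is a replay.
    let _ = handle_agg_job_cont_req(&reports, &replays, &mut metrics);
    replays.insert(ReportId(2));

    // Retry: report 1 is rejected and counted a second time (metric = 2).
    let _ = handle_agg_job_cont_req(&reports, &replays, &mut metrics);

    assert_eq!(metrics.rejected, 2); // over-counted: should be 1
}
```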

One idea to address this: dump the retry logic and split up `try_put_agg_share_span()` so that we can do the following (a rough sketch follows the list):

  • replays := subset of reports in the job that exist in ReportsProcessed
  • agg_share_span, agg_job_resp := handle_agg_job_cont_req(state, agg_job_cont_req, replays)
  • commit agg_share_span to AggregateStore
  • commit replays to ReportsProcessed
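A rough sketch of that split, assuming all types and method names below are hypothetical stubs rather than daphne's real API. The point is that replays become an input to the protocol logic, which then runs exactly once per request, so no retry (and no double-counting) can occur:

```rust
// All types and method names below are illustrative stubs, not daphne's API.
struct State;
struct AggJobContReq;
struct AggJobResp;
struct AggShareSpan;
struct ReportsProcessed;
struct AggregateStore;
type ReportId = u64;

impl ReportsProcessed {
    /// Return the subset of the job's reports that were already processed.
    fn find_replays(&self, _req: &AggJobContReq) -> Vec<ReportId> {
        Vec::new()
    }
    /// Durably record the job's reports as processed.
    fn commit(&mut self, _req: &AggJobContReq) {}
}

impl AggregateStore {
    /// Durably merge the aggregate share span.
    fn commit(&mut self, _span: AggShareSpan) {}
}

/// Protocol logic, run exactly once per request: replays are an input,
/// not something discovered mid-commit.
fn handle_agg_job_cont_req(
    _state: &State,
    _req: &AggJobContReq,
    _replays: &[ReportId],
) -> (AggShareSpan, AggJobResp) {
    (AggShareSpan, AggJobResp)
}

fn process_cont_req(
    state: &State,
    req: &AggJobContReq,
    reports_processed: &mut ReportsProcessed,
    aggregate_store: &mut AggregateStore,
) -> AggJobResp {
    // 1. replays := subset of reports in the job that exist in ReportsProcessed.
    let replays = reports_processed.find_replays(req);
    // 2. Run the protocol logic once, with replays known up front.
    let (agg_share_span, agg_job_resp) = handle_agg_job_cont_req(state, req, &replays);
    // 3. Commit agg_share_span to AggregateStore.
    aggregate_store.commit(agg_share_span);
    // 4. Commit replays to ReportsProcessed.
    reports_processed.commit(req);
    agg_job_resp
}

fn main() {
    let _resp = process_cont_req(&State, &AggJobContReq, &mut ReportsProcessed, &mut AggregateStore);
}
```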

This might also help with this bug: https://github.com/cloudflare/daphne/blob/main/daphne_worker/src/roles/aggregator.rs#L452-L454

This doesn't work, however, because it leads to a race condition between Durable Object (DO) transactions across aggregation jobs. @mendess's suggestion is to instead move the metrics out of the protocol logic (`vdaf/mod.rs`) and into the roles logic, as sketched below.
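A minimal sketch of that suggestion, with hypothetical names throughout: the protocol logic returns per-report outcomes as plain data, and only the roles layer touches the Prometheus counters, after the commit succeeds, so a re-run attempt is never double-counted.

```rust
// All names below are hypothetical, not daphne's actual API.
enum ReportOutcome {
    Aggregated,
    RejectedVdafPrepError,
    Replayed,
}

#[derive(Default)]
struct Metrics {
    rejected_vdaf_prep: u64,
    replayed: u64,
}

/// Protocol logic (the vdaf/mod.rs side): pure with respect to metrics.
fn prep_reports(report_count: usize) -> Vec<ReportOutcome> {
    (0..report_count)
        .map(|i| {
            if i == 0 {
                ReportOutcome::RejectedVdafPrepError // pretend report 0 fails prep
            } else {
                ReportOutcome::Aggregated
            }
        })
        .collect()
}

fn main() {
    // Roles logic: run the protocol, commit, then count outcomes once.
    let mut metrics = Metrics::default();
    let outcomes = prep_reports(3);

    // ... commit the aggregate share span and replay set here; count
    // outcomes only once the commit has succeeded ...

    for outcome in &outcomes {
        match outcome {
            ReportOutcome::RejectedVdafPrepError => metrics.rejected_vdaf_prep += 1,
            ReportOutcome::Replayed => metrics.replayed += 1,
            ReportOutcome::Aggregated => {}
        }
    }
    assert_eq!(metrics.rejected_vdaf_prep, 1); // counted exactly once
}
```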