Report metrics are over-counted if `try_put_agg_share_span()` is retried
cjpatton opened this issue
Suppose `handle_agg_job_cont_req()` marks a report rejected due to VDAF prep failure and increments the corresponding Prometheus metric. If, on the subsequent call to `try_put_agg_share_span()`, another report is marked as replayed, then the first report will be marked rejected once more and the Prometheus metric will be incremented again.
One idea to address this: dump the retry logic and split up `try_put_agg_share_span()` so that we can do the following:
- `replays` := subset of reports in the job that exist in `ReportsProcessed`
- `agg_share_span`, `agg_job_resp` := `handle_agg_job_cont_req(state, agg_job_cont_req, replays)`
- commit `agg_share_span` to `AggregateStore`
- commit `replays` to `ReportsProcessed`
This might also help with this bug: https://github.com/cloudflare/daphne/blob/main/daphne_worker/src/roles/aggregator.rs#L452-L454
This doesn't work because it leads to a race condition between DO transactions across aggregation jobs. @mendess's suggestion is to instead move the metrics out of the protocol logic (`vdaf/mod.rs`) and into the roles logic.