Report metrics are over-counted if `try_put_agg_share_span()` is retried
cjpatton opened this issue
Suppose `handle_agg_job_cont_req()` marks a report rejected due to VDAF prep failure and increments the corresponding Prometheus metric. If, on the subsequent call to `try_put_agg_share_span()`, another report is marked as replayed, then the first report will be marked rejected once more and the Prometheus metric will be incremented again.
One idea to address this: dump the retry logic and split up `try_put_agg_share_span()` so that we can do the following:
- `replays` := subset of reports in the job that exist in `ReportsProcessed`
- `agg_share_span`, `agg_job_resp` := `handle_agg_job_cont_req(state, agg_job_cont_req, replays)`
- commit `agg_share_span` to `AggregateStore`
- commit `replays` to `ReportsProcessed`
This might also help with this bug: https://github.com/cloudflare/daphne/blob/main/daphne_worker/src/roles/aggregator.rs#L452-L454
This doesn't work because it leads to a race condition between DO transactions across aggregation jobs. @mendess's suggestion is to instead move the metrics out of the protocol logic (`vdaf/mod.rs`) and into the roles logic.