mozilla / libprio

INACTIVE - A C library implementing a basic version of the Prio system for private aggregation. https://crypto.stanford.edu/prio/


Add option for differential privacy

henrycg opened this issue

Optionally, Prio servers should add Laplace-distributed noise to their shares of the final statistic before publishing them. In this way, the final output of the system can be differentially private w.r.t. the data of any particular client.
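
For intuition, here is a minimal sketch of the sampling step (not a proposal for the real implementation: it uses floating point and a non-cryptographic RNG, both of which are problematic, as discussed further down in this thread). Each server would draw one such sample, round it to an integer, and add it to its share of the aggregate before publishing.

```c
/* Illustrative sketch only: textbook inverse-CDF Laplace sampling in
 * double precision. drand48() stands in for a proper CSPRNG, and
 * floating-point noise has known pitfalls (see the least-significant-bits
 * discussion below), so a hardened implementation should not sample
 * this way. */
#include <math.h>
#include <stdlib.h>

/* Draw one sample from Laplace(0, b) via the inverse CDF. */
static double
laplace_sample(double b)
{
  double u = drand48() - 0.5; /* uniform on (-0.5, 0.5) */
  return -b * copysign(1.0, u) * log(1.0 - 2.0 * fabs(u));
}
```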

Google's differential privacy team has released implementations of discrete Laplace and Gaussian noise: https://github.com/google/differential-privacy/blob/main/cc/algorithms/distributions.cc

They are written in C++ and released under the Apache 2.0 license.

My two favored approaches would be:

  1. Call the differential-privacy library externally as a noise source on the servers, e.g. in prio-processor. Then we could compile both libraries independently. This seems to be the most straightforward option.

  2. Port the necessary code and tests to C so they integrate cleanly with libprio. This would be more work, but would lead to more portable and leaner code.

The existing code provides almost all of the functionality that is needed, apart from a wrapper that outputs bitmasks of arbitrary length instead of floats or ints, and from scaling the noise correctly when it is used with the fixed-point encoding (see #106).
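
To make the scaling concern concrete, here is a rough sketch (the constant and helper name are hypothetical, not part of libprio or of the encoding proposed in #106): if encoded values carry f fractional bits, noise calibrated in real-valued units has to be multiplied by 2^f before it is added to the encoded aggregate, otherwise the effective noise scale shrinks by that factor.

```c
/* Hypothetical sketch: FRAC_BITS and noise_to_fixed_point() are
 * illustrative names only. */
#include <math.h>
#include <stdint.h>

#define FRAC_BITS 15 /* example: real value x is encoded as round(x * 2^15) */

/* Bring a real-valued noise sample onto the same scale as encoded values. */
static int64_t
noise_to_fixed_point(double noise)
{
  return (int64_t)llround(noise * (double)(1 << FRAC_BITS));
}
```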

I would prefer to integrate Gaussian noise first, though, since it seems to be the most useful for interoperability with other frameworks.

@henrycg @rhelmer any opinions?

Thanks for reviving this issue! A couple of quick points:

  • For the types of statistics that libprio supports (sums), it seems like Laplace noise is strictly better than Gaussian. In particular, I think that with Laplace noise we can get \epsilon-DP, whereas with Gaussian we would be settling for the weaker (\epsilon, \delta)-DP, which also makes the analysis less clean. I am no differential-privacy expert, though; I am mostly relying on the Dwork & Roth book for this (the standard calibration is sketched right after this list).

  • However we implement DP noise generation, we should make sure to take care of the pitfalls mentioned in this paper.
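
For reference, here is the standard calibration from the Dwork & Roth book for a sum where each client's contribution is bounded by B, so the sensitivity is \Delta = B (the Gaussian bound below assumes \epsilon < 1):

```latex
% Laplace mechanism: pure epsilon-DP
\tilde{S} = S + \mathrm{Lap}\!\left(\tfrac{\Delta}{\epsilon}\right)
  \quad\Rightarrow\quad \epsilon\text{-DP}

% Gaussian mechanism: approximate (epsilon, delta)-DP, for \epsilon < 1
\tilde{S} = S + \mathcal{N}(0, \sigma^2),\quad
  \sigma \ge \tfrac{\Delta}{\epsilon}\sqrt{2\ln(1.25/\delta)}
  \quad\Rightarrow\quad (\epsilon, \delta)\text{-DP}
```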

Of the two options you mention, Option 2 does seem like it would be better for usability and portability but, as you say, would require more work and care in the implementation.

If we go with Option 1, I would want to make sure that we keep a clean separation between libprio and the external DP library. One way to do that would be to have the server config take as input a pointer to a function (e.g., "sample_noise") that maps void to mp_int. Each server would add sample_noise() to its share of the aggregate output before publishing these shares. I'm not sure that this is a great idea for usability, so I'm open to other approaches too.
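
To make that concrete, here is a rough sketch of what such a hook could look like. All names here are hypothetical, not existing libprio API; it only assumes the NSS MPI and SECStatus types that libprio already depends on, and the include paths will depend on how those headers are laid out.

```c
/* Hypothetical sketch of a noise hook in the server config; none of these
 * names exist in libprio today. */
#include "mpi.h"      /* mp_int, mp_init, mp_addmod, mp_clear */
#include "seccomon.h" /* SECStatus */

/* Caller-supplied sampler: writes one noise value, already scaled and
 * reduced modulo the field prime, into `out`. Could be backed by the
 * google/differential-privacy library, a subprocess, etc. */
typedef SECStatus (*sample_noise_fn)(mp_int *out, void *ctx);

typedef struct {
  sample_noise_fn sample_noise;
  void *noise_ctx; /* opaque state owned by the external noise source */
} NoiseSource;

/* Add one noise sample to this server's share of the aggregate, mod the
 * field prime, just before the share is published. */
static SECStatus
add_noise_to_share(const NoiseSource *ns, mp_int *share, const mp_int *modulus)
{
  SECStatus rv = SECFailure;
  mp_int noise;

  if (mp_init(&noise) != MP_OKAY)
    return SECFailure;
  if (ns->sample_noise(&noise, ns->noise_ctx) != SECSuccess)
    goto cleanup;
  if (mp_addmod(share, &noise, modulus, share) != MP_OKAY)
    goto cleanup;
  rv = SECSuccess;

cleanup:
  mp_clear(&noise);
  return rv;
}
```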

You are correct that Gaussian noise gives us a weaker notion of DP, but its properties under composition are well studied, which makes it widely used, e.g. in differentially private stochastic gradient descent (DP-SGD) for deep learning.

The great thing here is that DP-SGD needs gradient clipping, which we can enforce with Prio's range proofs. Then libprio could be used as a differentially private backend with secure aggregation for federated learning (which is my plan)!

So if we use Gaussian noise, we get a weaker notion of DP, but more versatility.

Regarding "least significant bits", the discrete distributions from the abover repo were built to mitigate exactly this attack.

If I go with Option 1, your above implementation advice will be very useful!

Also, if I go with Option 1, making both kinds of noise available should be relatively little extra effort, so that's another argument for it. I'll think a bit more about the implications of both approaches and come back with updates.

Since new work on Prio is happening at libprio-rs, I will close this issue.