google-parfait / tensorflow-federated

An open-source framework for machine learning and other computations on decentralized data.


TFF differential privacy model gets stuck in learning process

deepquantum88 opened this issue · comments

aggregation_factory = tff.learning.model_update_aggregator.dp_aggregator(
    noise_multiplier, clients_per_round)

sampling_prob = clients_per_round / total_clients

learning_process = tff.learning.algorithms.build_unweighted_fed_avg(
    my_model_fn,
    client_optimizer_fn=lambda: tf.keras.optimizers.SGD(0.01),
    server_optimizer_fn=lambda: tf.keras.optimizers.SGD(1.0, momentum=0.9),
    model_aggregator=aggregation_factory)
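For context, the sampling_prob above is just the per-round client sampling ratio that typically feeds into DP accounting; a pure-Python sanity check (the client counts below are made up for illustration, not from the reporter's setup):

```python
def client_sampling_prob(clients_per_round, total_clients):
    """Per-round probability that any given client is sampled."""
    if not 0 < clients_per_round <= total_clients:
        raise ValueError("clients_per_round must be in (0, total_clients]")
    return clients_per_round / total_clients

# Hypothetical counts, for illustration only.
print(client_sampling_prob(50, 1000))  # 0.05
```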

python=3.9.7
TF=2.11.0
TFF=0.48.0

The training gets stuck in the learning process built by tff.learning.algorithms.build_unweighted_fed_avg. Can you please help?

This looks similar to #3756. If what you are running has a call to tff.backends.native.set_sync_local_cpp_execution_context, then I think it should be removed.

@zcharles8 Thank you for your response. I have not called tff.backends.native.set_sync_local_cpp_execution_context.

I am using the TFF tutorial for image classification, with the changes shown in the code above.

Are you running this in colab? If so then you'll probably need to upgrade to TFF v0.52.0, which re-enabled colab support.

I am not running in colab. I am using TFF 0.48.0 version on my system.

Yeah, unfortunately TFF versions less than 0.52.0 generally don't have compatibility with colab. While 0.48.0 can be pip installed in colab, the execution stack doesn't work with colab (hence the indefinite hang). You'll need to upgrade to fix this.

I am not working in colab. On my system I am working with TFF 0.48.0, and I am not able to install any TFF version above 0.48.0.

Can the hang be fixed in TFF 0.48.0 on a local system?

Ah, sorry, I misread your comment. My recommendation would be to upgrade to TFF 0.52.0, I'm not sure if we have any mechanisms to fix things on older TFF versions.

Can you provide the full details of how you are running things? There's a template that loads when you file a bug that would be really helpful if you could fill out in detail. Otherwise it's extremely difficult to diagnose what the issue is. I've reproduced the template below.


Describe the bug
A clear and concise description of what the bug is. It is often helpful to
provide a link to a colab notebook that
reproduces the bug.

Environment (please complete the following information):

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
  • Python package versions (e.g., TensorFlow Federated, TensorFlow):
  • Python version:
  • Bazel version (if building from source):
  • CUDA/cuDNN version:
  • What TensorFlow Federated execution stack are you using?

Note: You can collect the Python package information by running pip3 freeze
from the command line, and most of the other information can be collected using
TensorFlow's environment capture script.

Expected behavior
A clear and concise description of what you expected to happen.

Additional context
Add any other context about the problem here.

The issue is that when I try to install TFF 0.52.0 on my system, pip reports that there is no matching distribution for TFF 0.52.0. It only shows versions up to 0.48.0.

TF=2.11.0
Python=3.9.7
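One common cause of pip's "no matching distribution" error is that no wheel was published for the local Python version or platform, so it is worth checking the interpreter version against the release's documented requirement (consult the TFF release notes for the actual floor). A quick, generic local check (the 3.10 floor below is a placeholder assumption, not TFF's documented requirement):

```python
import sys

def python_at_least(major, minor):
    """Return True if the running interpreter is at least major.minor."""
    return sys.version_info[:2] >= (major, minor)

# Placeholder floor; check the TFF release notes for the real requirement.
if not python_at_least(3, 10):
    print("Interpreter may be too old for newer TFF wheels:", sys.version)
```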

There's more information in the template that would be really helpful here. Including:

  • What is your OS platform and distribution?
  • What CUDA versions?
  • Are you using any non-default TFF execution stacks?
  • A minimal repro of the behavior you are referring to.

DISTRIB_ID=CentOS
CentOS Linux release 7.9.2009 (Core)

The code works fine with Python 3.9 up to TFF 0.48.0, but I am not able to install TFF 0.52.0.
Sorry, I did not understand "Are you using any non-default TFF execution stacks?"

@zcharles8 @ZacharyGarrett I updated to TFF 0.52.0 and TF 2.11.0 on my Ubuntu system.

When I execute state = learning_process.initialize() in the TFF + differential privacy code, the execution hangs and does not proceed further.

Can you please help?

Even when I run the same file on Google Colab, it hangs at the same point.

Can you print the result of running ldd --version on your system?

@michaelreneer here's the result
ldd (Ubuntu GLIBC 2.35-0ubuntu3.1) 2.35
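For anyone else following along, the check being requested is simply the following (assuming a glibc-based Linux where ldd is available; TFF's precompiled C++ runtime needs a sufficiently recent glibc, and the exact minimum varies by release):

```shell
# Print the first line of the glibc/ldd version report.
ldd --version | head -n 1
```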

@zcharles8 @ZacharyGarrett

I am working on my Linux Ubuntu system.
I have installed Python 3.9.0, TensorFlow 2.11.0, and TFF 0.52.0.

When I import the libraries (TFF), it throws a TypeError: unhashable type: 'list'.

I also tried lower versions of TFF and versions above 0.52.0, but the error remains the same.

I cannot run on colab because I have other dependencies as well, so I want to install and run everything on my own system.
Can you please help with this?

@zcharles8 @michaelreneer can you please help with this?
I even tried TFF version 0.61.0, but it still gets stuck at

data_frame = pd.DataFrame()
rounds = 100
clients_per_round = 50

for noise_multiplier in [0.0, 0.5, 0.75, 1.0]:
    print(f'Starting training with noise multiplier: {noise_multiplier}')
    data_frame = train(rounds, noise_multiplier, clients_per_round, data_frame)
    print()

When you say it's stuck, what do you mean? Is it just taking a long amount of time? In particular, can you try reducing clients_per_round to something like 2, and rounds to something small like 3? That way we can see whether it is actually hanging indefinitely, or just slow.
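To make the "hanging vs. just slow" question concrete, a generic standard-library helper like the one below can time-box any call (this is a sketch, not TFF-specific; you would pass it e.g. lambda: learning_process.initialize()):

```python
import concurrent.futures
import time

def finishes_within(fn, timeout_s):
    """Return True if fn() completes within timeout_s seconds.

    Note: a genuinely hung call keeps its worker thread alive, so the
    process may still need to be killed afterwards; this only reports
    whether the call returned in time.
    """
    ex = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = ex.submit(fn)
    try:
        future.result(timeout=timeout_s)
        return True
    except concurrent.futures.TimeoutError:
        return False
    finally:
        ex.shutdown(wait=False)

print(finishes_within(lambda: time.sleep(0.01), 1.0))  # True
print(finishes_within(lambda: time.sleep(2), 0.2))     # False
```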

As for the unhashable list error - I think we would need a full stack trace. In particular, this sounds like lists getting passed in as keys to a dictionary, but where?

I reduced the clients, but it still gets stuck in

learning_process = tff.learning.algorithms.build_unweighted_fed_avg(
    my_model_fn,
    client_optimizer_fn=lambda: tf.keras.optimizers.SGD(0.01),
    server_optimizer_fn=lambda: tf.keras.optimizers.SGD(1.0, momentum=0.9),
    model_aggregator=aggregation_factory)

and does not proceed further for several hours, apparently indefinitely. I am executing it in a VirtualBox Ubuntu VM.

@zcharles8

Can you check that (1) you can run simpler TFF computations (eg. if you follow the examples in https://www.tensorflow.org/federated/tutorials/building_your_own_federated_learning_algorithm#federated_computations, do these terminate?)

and (2) that the dataset is loading correctly? Eg. after running https://www.tensorflow.org/federated/tutorials/federated_learning_with_differential_privacy#download_and_preprocess_the_federated_emnist_dataset, can you do something like

train_data.create_tf_dataset_for_client(train_data.client_ids[0])

Basically, I'm trying to figure out what call exactly is hanging.

@zcharles8 the first statement does not terminate.
The second statement works perfectly.

I do not understand what the reason could be for the first statement not terminating.

@zcharles8 Instead of following the TFF + differential privacy tutorial, I tried switching to the differential privacy aggregators directly.

I am trying to use them in the simple TFF tutorial for image classification: https://www.tensorflow.org/federated/tutorials/federated_learning_for_image_classification

I replaced only this part:

training_process = tff.learning.algorithms.build_weighted_fed_avg(
    model_fn,
    client_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=0.02),
    server_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=1.0))

with

dp_mean = tff.learning.dp_aggregator(noise_multiplier=0.2, clients_per_round=10)

training_process = tff.learning.algorithms.build_unweighted_fed_avg(
    model_fn,
    client_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=0.02),
    server_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=1.0),
    model_aggregator=dp_mean,
    use_experimental_simulation_loop=False)

That worked, but on initializing it gets stuck in an indefinite loop at

train_state = training_process.initialize()

Please help.

The fact that even a simple computation like

@tff.federated_computation(tff.FederatedType(tf.float32, tff.CLIENTS))
def get_average_temperature(client_temperatures):
  return tff.federated_mean(client_temperatures)

get_average_temperature([68.5, 70.3, 69.8])

hangs indefinitely suggests a problem in the TFF installation itself. One last thing to help verify this. Can you try running

tff.federated_computation(lambda: 'Hello, World!')()

If this hangs too, then it's likely the installation of TFF. I would strongly recommend trying to re-install TFF, using the latest available version.

@zcharles8 it hangs too. I tried installing TFF 0.48.0, but that still did not work.

Is there any way that tff.learning.dp_aggregator(noise_multiplier=0.2, clients_per_round=10) can work with a lower version?

I think we're up to v0.61.0 or something like that. Is there a reason you can't use that version?

As for earlier versions - I'm not sure. You're welcome to see what the corresponding tutorial looked like in v0.48.0. We keep a record of all the versions as tags on github, and generally try to keep the tutorials up-to-date with the versions.

@zcharles8 Because the university server cannot support more than v0.48.0.
I then tried installing Ubuntu in VirtualBox on my system and installed v0.52.0, but the differential privacy tutorial hangs there.

It worked only on colab, but I need to install the TFQ nightly version, which I am not able to install on colab.

Is TFQ = TensorFlow Quantum? It's very possible that there are incompatible dependencies between the two. Moreover, I don't know off the top of my head if TFQ will work in the context of a TFF computation.

@zcharles8 Yes, that is TensorFlow Quantum. Yes, it worked, and I developed algorithms combining the two.
But it gets stuck when I try to work with differential privacy, TFF, and TFQ together.

Based on your responses above, the problem is likely that TFF isn't installed correctly, not because of any specific differential privacy code in TFF. Again, there is no guarantee that TFQ will work in the context of a tff.federated_computation. Adding support is a nice feature request, but one that we likely do not have the capacity to add.

In light of this, I don't think there are any specific recommendations I can give. You could potentially look at the various versions of TFF and TFQ and try to see if there is some version of both that have compatible dependencies (eg. TensorFlow versions, python versions, numpy versions, etc.). This could be difficult though.
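One way to start that dependency audit locally is to dump the installed versions of the relevant packages with the standard library (the package names below are the usual PyPI distribution names, which can differ from import names, so adjust as needed):

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(dist_name):
    """Return the installed version string, or None if not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None

# Assumed PyPI names for the packages discussed in this thread.
for dist in ("tensorflow", "tensorflow-federated", "tensorflow-quantum", "numpy"):
    print(dist, installed_version(dist))
```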

@zcharles8 Thank you for your prompt response.

I am trying another way, using the custom approach from https://www.tensorflow.org/federated/tutorials/composing_learning_algorithms

Can I use the TFF differential privacy aggregator in the above custom code?

dp_mean = tff.learning.dp_aggregator(noise_multiplier=0.2, clients_per_round=10)
The tutorial you link is fully compatible with DP aggregators. Again, the problem is that something about your installation of TFF is making it so that all federated computations hang indefinitely. This really needs to be solved by re-installing TFF, ideally a newer version.

@zcharles8 Thank you for your prompt response. I installed a newer version after installing Ubuntu natively and removing VirtualBox, and it worked.

I am trying another way, using the custom approach from https://www.tensorflow.org/federated/tutorials/composing_learning_algorithms

Can I use the TFF differential privacy aggregator in the above custom code?

dp_mean = tff.learning.dp_aggregator(noise_multiplier=0.2, clients_per_round=10)

You can replace the aggregator factory in https://www.tensorflow.org/federated/tutorials/composing_learning_algorithms#defining_the_building_blocks with whatever aggregator you want (including a DP aggregator).

You can also add such aggregators to the standard FedAvg API: https://www.tensorflow.org/federated/api_docs/python/tff/learning/algorithms/build_weighted_fed_avg

Given that you said that it works, I'm going to mark this as resolved. If you have other bugs, please file them in a separate issue.