google-parfait / tensorflow-federated

An open-source framework for machine learning and other computations on decentralized data.

Every round of an FL Computation returns the same accuracy

drakstik opened this issue

Here is my code. It is inspired by the MNIST demo, but I use a different dataset and a random forest model:

import collections

import tensorflow as tf
import tensorflow_decision_forests as tfdf
import tensorflow_federated as tff

def batch_data(data_shard, bs=32, pref=25):
    dataset = tf.data.Dataset.from_tensor_slices(collections.OrderedDict(
        x=tf.constant([data_shard[0]], dtype=tf.float64),
        y=tf.constant([[data_shard[1]]], dtype=tf.int64)
    ))
    return dataset.shuffle(20).repeat(5).batch(bs).prefetch(pref)

federated_data = []
for (client_name, data) in clients.items():
    federated_data.append(batch_data(data))
   
def create_keras_model():
  return tfdf.keras.RandomForestModel(num_trees=50, max_depth=10)

def model_fn():
  keras_model = create_keras_model()
  return tff.learning.models.from_keras_model(
      keras_model,
      input_spec=federated_data[0].element_spec,
      loss=tf.keras.losses.BinaryCrossentropy(),
      metrics=[tf.keras.metrics.BinaryAccuracy()])
      
training_process = tff.learning.algorithms.build_weighted_fed_avg(
    model_fn,
    client_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=0.02),
    server_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=1.0))
    
train_state = training_process.initialize()

NUM_ROUNDS = 11
for round_num in range(2, NUM_ROUNDS):
  result = training_process.next(train_state, federated_data[:500])
  train_state = result.state
  train_metrics = result.metrics
  print(train_state)
  print('round {:2d}, metrics={}'.format(round_num, train_metrics))

When I run the code, I get the same accuracy over and over again and my train_state is always like this:

LearningAlgorithmState(global_model_weights=ModelWeights(trainable=[], non_trainable=[False]), distributor=(), client_work=(), aggregator=OrderedDict([('value_sum_process', ()), ('weight_sum_process', ())]), finalizer=[0])

When I run print(training_process.initialize.type_signature.formatted_representation()), I get:

( -> <
  global_model_weights=<
    trainable=<>,
    non_trainable=<
      bool
    >
  >,
  distributor=<>,
  client_work=<>,
  aggregator=<
    value_sum_process=<>,
    weight_sum_process=<>
  >,
  finalizer=<
    int64
  >
>@SERVER)

Why are my accuracy and train_state not changing after each call to next?

The dataset and model are super simple, so maybe TFF is just converging quickly? Even so, shouldn't the train_state at least change?

Please help!

According to the output snippets you pasted (very helpful, thank you!) the TFF model you are constructing has no trainable weights. Thus, each call to next does not change the model. TFF algorithms assume that the trainable weights represent the weights you would like to train in FL.
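
One quick way to confirm this (a rough sanity check, reusing your create_keras_model from above) is to look at the Keras model's trainable variables before wrapping it for TFF; the TFDF forest should expose none, while even a tiny Dense model does:

import tensorflow as tf
import tensorflow_decision_forests as tfdf

# The TFDF forest has no TF variables, so FedAvg has nothing to average.
rf_model = tfdf.keras.RandomForestModel(num_trees=50, max_depth=10)
print(len(rf_model.trainable_variables))  # expected: 0

# A plain Dense model does have trainable variables (kernel and bias).
dense_model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(14,))])
print(len(dense_model.trainable_variables))  # expected: 2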

I should note that it also isn't clear exactly what algorithm you're attempting to implement. As noted by TFDF, "decision forests are not trained in batches the same way neural networks are." (link). Thus, I think you'd have to reconsider how TFDF operates, and how you'd like to generalize that to the federated setting.

One possibility (that I have only thought about in a cursory manner) would be to use RandomForestModel.fit locally within each client, and then "average" the resulting forests somehow (though how one would do this is not obvious to me, I suspect there is research on this though). You could then implement that algorithm in TFF.
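
As a very rough, non-federated starting point for that idea (only a sketch, and it assumes your clients dict maps client names to (features, labels) pairs the way your batch_data code suggests), the per-client step might look like:

import tensorflow as tf
import tensorflow_decision_forests as tfdf

# Train one forest per client locally; how to aggregate the resulting forests
# into a single global model is the open question discussed above.
client_models = {}
for client_name, (x, y) in clients.items():
    local_ds = tf.data.Dataset.from_tensor_slices((x, y)).batch(32)
    local_model = tfdf.keras.RandomForestModel(num_trees=50, max_depth=10)
    local_model.fit(local_ds)  # the forest is fit in one local pass, not in rounds
    client_models[client_name] = local_model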

Thanks for the reply; I had actually figured the same thing. Your quote about decision forests not being trained in batches is echoed here as well: "Unlike backpropagation, the training of RF does not 'transmit' the loss gradient from its output to its input." (link) TFF is best suited for sequential-style models, i.e. neural networks and the like.

This makes sense, so we adopted a neural network instead. Thankfully, our RF model was easily replaced by an NN that worked just as well. My original model was tfdf.keras.RandomForestModel(num_trees=50, max_depth=10), which does not expose "transmittable" parameters, so I'm not sure how that interacts with TFF behind the scenes.

We changed our RF model to this instead:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def create_compiled_keras_model():
    model = Sequential([
        Dense(64, activation='relu', input_shape=(14,)),
        Dense(32, activation='relu'),
        Dense(1, activation='sigmoid')
    ])
    return model

def model_fn():
    keras_model = create_compiled_keras_model()
    return tff.learning.models.from_keras_model(
        keras_model,
        input_spec=federated_train_data[0].element_spec,
        loss=tf.keras.losses.BinaryCrossentropy(),
        metrics=[tf.keras.metrics.BinaryAccuracy()])

training_process = tff.learning.algorithms.build_weighted_fed_avg(
    model_fn,
    client_optimizer_fn=lambda: tf.keras.optimizers.Adam(learning_rate=0.001),
    server_optimizer_fn=lambda: tf.keras.optimizers.Adam(learning_rate=0.001))

Now if we print training_process.initialize.type_signature.formatted_representation(), we get:

( -> <
  global_model_weights=<
    trainable=<
      float32[14,64],
      float32[64],
      float32[64,32],
      float32[32],
      float32[32,1],
      float32[1]
    >,
    non_trainable=<>
  >,
  distributor=<>,
  client_work=<>,
  aggregator=<
    value_sum_process=<>,
    weight_sum_process=<>
  >,
  finalizer=<
    int64,
    float32[14,64],
    float32[14,64],
    float32[64],
    float32[64],
    float32[64,32],
    float32[64,32],
    float32[32],
    float32[32],
    float32[32,1],
    float32[32,1],
    float32[1],
    float32[1]
  >
>@SERVER)
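
To sanity-check that this version actually learns, we run the same loop as before (just a sketch, assuming federated_train_data is built the same way as federated_data earlier); with trainable Dense weights in place, the metrics should now change from round to round:

train_state = training_process.initialize()

NUM_ROUNDS = 11
for round_num in range(1, NUM_ROUNDS):
  result = training_process.next(train_state, federated_train_data)
  train_state = result.state
  print('round {:2d}, metrics={}'.format(round_num, result.metrics))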

There might be a better way, since we are seeing a lot of variation in the results compared to TFDF. What do you think? This was probably only my second time working with NNs, so I'm not great at understanding what could cause the variability. I'll leave this open and post our results here regarding the variability, or if we find a better way to implement an NN for our DF.

So there is actually no variation in the results, my bad. I think I was doing some weird stuff with the input datasets. So yeah, lesson learned: use sequential models and learn how to represent RF models as neural networks. Good luck everyone!