lucfra / FAR-HO

Gradient based hyperparameter optimization & meta-learning package for TensorFlow


HyperGradient Computation Methods Are Not Isolated...

haamoon opened this issue

Hi,

I wrote the following code to compare the hyper-gradients computed by the ReverseHG and ForwardHG methods in the same file:

### ReverseHG
# The HyperOptimizer uses ReverseHG as its default hypergradient method.
farho = far.HyperOptimizer()
hypergradient = farho.hypergradient
run = farho.minimize(val_loss, oo_optim, tr_loss, io_optim)
# (hypergradient, hyperparameter) pairs for the registered hyperparameters
grads_hvars = [hypergradient.hgrads_hvars(hyper_list=hll)
    for opt, hll in farho._h_optim_dict.items()]
# T inner iterations; _skip_hyper_ts=True leaves the hyperparameters untouched
run(T, inner_objective_feed_dicts=tr_supplier, outer_objective_feed_dicts=val_supplier, _skip_hyper_ts=True)
grads_hvars_val = ss.run(grads_hvars, _opt_fd(farho._global_step, val_supplier))
print(grads_hvars_val)


### ForwardHG
# Same setup, but with ForwardHG passed explicitly to the HyperOptimizer.
hypergradient_fwd = far.ForwardHG()
farho_fwd = far.HyperOptimizer(hypergradient=hypergradient_fwd)
run_fwd = farho_fwd.minimize(val_loss, oo_optim, tr_loss, io_optim)
grads_hvars_fwd = [hypergradient_fwd.hgrads_hvars(hyper_list=hll)
    for opt, hll in farho_fwd._h_optim_dict.items()]
run_fwd(T, inner_objective_feed_dicts=tr_supplier, outer_objective_feed_dicts=val_supplier, _skip_hyper_ts=True)
grads_hvars_fwd_val = ss.run(grads_hvars_fwd, _opt_fd(farho_fwd._global_step, val_supplier))
print(grads_hvars_fwd_val)

They receive identical inputs and compute the hyper-gradient for the same hyper-variable (_skip_hyper_ts=True, so the hyperparameter remains unchanged), but for some reason their outputs are quite different. I noticed that if I run them in separate files (with a fixed random seed), or run the ForwardHG block before the ReverseHG block, their outputs are similar. I cannot see how the Reverse and Forward hyper-gradient computations can affect each other, as they don't share any variables. Could you please explain how these two methods can be run in the same file?

I have also attached the complete Python code for this experiment.

cp.py.zip
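For readers without the attachment, the two snippets above assume a surrounding setup roughly like the sketch below. Every name here (tr_loss, val_loss, io_optim, oo_optim, tr_supplier, val_supplier, ss, T, and the toy data) is a placeholder chosen to match the snippets, not the actual code from cp.py.zip.

### Hypothetical setup (placeholders only)
import numpy as np
import tensorflow as tf
import far_ho as far

# toy regression data
rng = np.random.RandomState(0)
tr_x, tr_y = rng.randn(20, 10).astype('float32'), rng.randn(20, 1).astype('float32')
val_x, val_y = rng.randn(20, 10).astype('float32'), rng.randn(20, 1).astype('float32')

x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 1])
out = tf.contrib.layers.fully_connected(x, 1, activation_fn=None)
weights = tf.trainable_variables()  # weights and bias of the layer above

# L2 regularization weight treated as the hyperparameter to optimize
lam = far.get_hyperparameter('lambda', initializer=0.01)
mse = tf.losses.mean_squared_error(y, out)
tr_loss = mse + lam * tf.add_n([tf.nn.l2_loss(w) for w in weights])
val_loss = mse

io_optim = far.GradientDescentOptimizer(0.1)  # inner optimizer must come from far_ho
oo_optim = tf.train.AdamOptimizer()           # outer optimizer can be any tf optimizer

T = 10  # number of inner iterations
tr_supplier = lambda step=None: {x: tr_x, y: tr_y}
val_supplier = lambda step=None: {x: val_x, y: val_y}

ss = tf.InteractiveSession()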

Hi Haamoon,

I believe the issue is caused by the (different) random initialization of the model weights. Each time you run either run_fwd or run, the model parameters are re-initialized according to the random initializer of tensorflow.contrib.layers.fully_connected (by default I believe it should be xavier_initializer). This makes the hypergradients different (as they should be). If you initialize the parameters (the weights of the network) with a constant value, the issue should be solved. (I did not write the code with the idea of using the two methods together, so it is possible that some auxiliary variable is wrongly reinitialized when you call the other method. I will check!)
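For instance, assuming the model is built with tensorflow.contrib.layers.fully_connected (as in the sketch above), one way to remove the randomness is to pass constant initializers; a minimal sketch, with arbitrary values:

# Constant initializers so that ReverseHG and ForwardHG start from exactly
# the same model parameters instead of a fresh Xavier draw on each build.
out = tf.contrib.layers.fully_connected(
    x, 1, activation_fn=None,
    weights_initializer=tf.constant_initializer(0.1),
    biases_initializer=tf.zeros_initializer())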

Another thing: by calling hypergradient.hgrads_hvars, new nodes are added to the graph to make the final computations for calculating the gradients, which is not necessary since they were already created in the HyperOptimizer.minimize function. To retrieve the list of hypergradients, use far.hypergradients().
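For example, something along these lines (a sketch based on the placeholder names used above; whether a feed dict is needed when evaluating them depends on the method and on your graph):

# far.hypergradients() returns the hypergradient tensors that
# HyperOptimizer.minimize already created; evaluate them after run(...).
hgs = far.hypergradients()
hg_vals = ss.run(hgs, feed_dict=val_supplier())
print(list(zip(far.hyperparameters(), hg_vals)))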

Let me know if this helps!

Cheers,
Luca

You are right: with a fixed initialization the gradients are the same, and I could access the hyper-gradients with the far.hypergradients method. Thanks for the help Luca!