google-deepmind / acme

A library of reinforcement learning components and agents

MPO temperature updates

henry-prior opened this issue

Hi,

I have a question about the MPO implementation, specifically the temperature parameter used for the importance weights. Based on the derivation in the papers, I've gathered that the temperature should be the optimal value of the dual function at each iteration, and given its convexity we could make more of an effort to fully optimize it during each step, e.g. "via a few steps of gradient descent on \eta for each batch" (right below the formula for \eta on page 4 of "Relative Entropy Regularized Policy Iteration"). This is also mentioned below equation 4 on page 4 of "A Distributional View on Multi-Objective Policy Optimization".
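For concreteness, here is a minimal sketch of what I mean by "a few steps of gradient descent on \eta" for a single batch (my own names, not Acme's API), using the sample-based E-step dual from the paper:

```python
# Hedged sketch: a few gradient steps on the temperature eta for one batch,
# using g(eta) = eta * eps + eta * mean_s log( (1/K) * sum_j exp(Q(s, a_j)/eta) ).
import jax
import jax.numpy as jnp


def temperature_dual(log_eta, q_values, epsilon):
  # Parameterize eta via softplus to keep it positive.
  eta = jax.nn.softplus(log_eta)
  # q_values: [num_action_samples, batch_size], sampled from the target policy.
  log_mean_exp = (jax.scipy.special.logsumexp(q_values / eta, axis=0)
                  - jnp.log(q_values.shape[0]))
  return eta * epsilon + eta * jnp.mean(log_mean_exp)


def optimize_temperature(log_eta, q_values, epsilon, num_steps=5, lr=1e-2):
  # "A few steps of gradient descent on eta for each batch."
  dual_grad = jax.grad(temperature_dual)
  for _ in range(num_steps):
    log_eta = log_eta - lr * dual_grad(log_eta, q_values, epsilon)
  return log_eta
```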

In the Acme implementation this is absent, so I'm curious whether this is still considered a useful or necessary aspect of the algorithm by researchers at DeepMind. If you'd like, I'm happy to add (possibly optional) functionality for it. I've been messing around with it locally for a bit and can see how it's a bit tricky in the current architecture, but I have something that may work while staying clean and not breaking the current design. In my own JAX implementation of MPO I use SciPy's SLSQP optimizer on the temperature, which works well, but that may be a bit difficult in Acme given that it requires you to be outside of JIT compilation and able to pull DeviceArray values back to the Python process. In my testing it wasn't any slower than a gradient optimizer, but it does break up the asynchronous dispatch, which could create noticeable bottlenecks.
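Roughly, the SLSQP version looks like this (an illustrative sketch rather than my actual code): the dual and its gradient can be jitted, but the SciPy call has to run on the host, which is where the device-to-host transfer and the break in asynchronous dispatch happen.

```python
import numpy as np
import jax
import jax.numpy as jnp
from scipy.optimize import minimize


@jax.jit
def dual_value_and_grad(eta, q_values, epsilon):
  def dual(eta):
    log_mean_exp = (jax.scipy.special.logsumexp(q_values / eta, axis=0)
                    - jnp.log(q_values.shape[0]))
    return eta * epsilon + eta * jnp.mean(log_mean_exp)
  return jax.value_and_grad(dual)(eta)


def solve_temperature(eta_init, q_values, epsilon):
  def fun(x):
    value, grad = dual_value_and_grad(jnp.asarray(x[0]), q_values, epsilon)
    # Device-to-host transfer happens here, breaking up asynchronous dispatch.
    return float(value), np.asarray([grad], dtype=np.float64)
  result = minimize(fun, x0=np.array([eta_init]), jac=True,
                    method='SLSQP', bounds=[(1e-6, None)])
  return float(result.x[0])
```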

Curious to hear about the decision making here either way; Acme is an amazing project with a great design!

Hi henry-prior,

Indeed, one could spend more effort optimizing the dual, but I would question whether the benefits are worth the added core complexity, especially since, as with the other objectives, the dual function is only approximated on the sampled batch. For this reason it seems appropriate to use SGD/Adam-style optimization strategies. Since you've done the comparison, though, I'd love to see whether the performance varies significantly one way or the other!

Having said that, I do agree with you that being slightly more careful with the dual parameters is important, which is why we use a separate optimizer with its own learning rate to ensure the duals are such that the desired constraints are satisfied on average. In fact, we track these by logging, e.g., kl_q_rel, the relative E-step constraint (and kl_mean_rel for the M-step constraint), which should stay close to 1.
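Concretely, the relative values are just the measured KLs divided by their epsilons, so a well-tuned dual keeps them near 1 on average. Roughly (illustrative, not our exact code):

```python
import jax.numpy as jnp


def dual_diagnostics(kl_q, kl_mean, epsilon_q, epsilon_mean):
  return {
      'kl_q_rel': jnp.mean(kl_q) / epsilon_q,           # E-step constraint
      'kl_mean_rel': jnp.mean(kl_mean) / epsilon_mean,  # M-step (mean) constraint
  }
```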

Abbas may have different opinions though, so I'll ping him in case he wants to add a couple of pennies.

Thanks for the important question! Happy Acming!

Bobak

Hi Bobak,

Thanks for your insight here! We're definitely on the same page about the setup for optimizing the dual parameters; I'm just considering taking a few more gradient steps during each training epoch. After thinking a bit more about this, I realized that in my implementation I optimize the temperature before calculating the weights, which differs from how it's described in "Relative Entropy Regularized Policy Iteration": right after the quote I shared, the authors mention taking those gradient steps after the weight calculation, i.e. the temperature used for the weight calculation should come from the previous step. Maybe not hugely important, but I'll use this approach when making the comparison for Acme.
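To be concrete about the ordering (a rough sketch with my own names, not Acme's API; positivity of the temperature isn't enforced here):

```python
import jax
import jax.numpy as jnp


def temperature_dual(eta, q_values, epsilon):
  # Sample-based E-step dual: eta * eps + eta * mean_s log mean_a exp(Q/eta).
  log_mean_exp = (jax.scipy.special.logsumexp(q_values / eta, axis=0)
                  - jnp.log(q_values.shape[0]))
  return eta * epsilon + eta * jnp.mean(log_mean_exp)


def e_step(q_values, eta, epsilon, lr=1e-2, num_steps=5):
  # 1. Importance weights are computed with the *previous* step's temperature.
  weights = jax.nn.softmax(q_values / eta, axis=0)
  # 2. Only then is the temperature updated for the next step.
  for _ in range(num_steps):
    eta = eta - lr * jax.grad(temperature_dual)(eta, q_values, epsilon)
  return weights, eta
```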

Thanks for pointing out the logged KL between the target and non-parametric policies, I'll definitely keep that in mind when experimenting.

Going to kick off some runs and will share comparisons. If there are any envs/tasks you'd particularly like to see let me know. I'll start with Humanoid-Stand.

Henry

Hey @bshahr following up here with some initial results. Right off the bat, I don't see a real benefit to taking multiple gradient steps on the temperature parameter, based on minimal testing. The caveat is exactly that the testing is minimal, and the exact setup and hyperparameters could make a difference. I'll detail other setups I'd like to test at the end of this comment.

Here are some plots. Cartpole results use the first 20 random seeds with 100,000 environment steps, and humanoid results are only on seed=0 with 2,000,000 environment steps. I'm using my own compute here, so I want to experiment more before running additional humanoid seeds.

[Plots: cartpol_swingup_results_221001, humanoid_stand_results_221001]

First, to share my code: master...henry-prior:acme:full-optimization-of-temperature

I took a pretty naive approach to start, which minimizes changes to the current training setup by:

  1. Changing the optimizer for the temperature parameters to SGD while keeping a single dual_optimizer object, using optax.masked and optax.chain (see the sketch after this list).
  2. Passing the optimizer and its state into the loss function at each step, then "walking along" the gradient descent there and accumulating the losses at each point without modifying the parameter.
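Roughly, the optimizer wiring for step 1 looks like this (a sketch; the dual-parameter pytree below is a placeholder rather than Acme's actual structure):

```python
import jax.numpy as jnp
import optax

action_dim = 6  # placeholder action dimensionality

dual_params = {
    'log_temperature': jnp.zeros(()),
    'log_alpha_mean': jnp.zeros((action_dim,)),
    'log_alpha_stddev': jnp.zeros((action_dim,)),
}

# Boolean masks selecting the temperature leaf vs. everything else.
temperature_mask = {k: k == 'log_temperature' for k in dual_params}
other_mask = {k: not v for k, v in temperature_mask.items()}

# SGD drives the temperature, Adam handles the remaining dual variables, but
# the learner still sees a single optimizer and a single optimizer state.
dual_optimizer = optax.chain(
    optax.masked(optax.sgd(learning_rate=1e-2), temperature_mask),
    optax.masked(optax.adam(learning_rate=1e-2), other_mask),
)
dual_opt_state = dual_optimizer.init(dual_params)
```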

This means that the temperature parameters are still updated the same way as before, on the gradients of the losses. I've modified the learning rate for the temperature parameters so that the total step size is comparable to what it was previously, but the gradient steps are more fine-grained. This may be too strong a constraint, and it may make more sense to increase the learning rate a bit more.

What I'd like to try next:

  1. Most simply, do some hyperparameter search on the learning rate of the new temperature optimizer and the number of steps.
  2. Use Adam as the optimizer for the temperature while still taking multiple gradient steps. This is a bit more complicated, and the reason I chose not to do it first is that it makes the code messier and creates what I see as some faux pas: the parameter and the optimizer state would need to be updated inside the loss module and then returned as auxiliary outputs. Adam also can't be used in the setup I currently have because the math doesn't work out the same way for the accumulated losses: with SGD, the sum of the successive updates along the descent equals a single update on the gradient of the sum of the losses along the descent, but this doesn't hold for Adam because its update depends on the previous gradients through its moment estimates (see the sketch after this list). I still think this may be useful to test, though, and I can probably figure out a way to make it cleaner, but I'm still thinking about it.
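To spell out the linearity point in item 2 (with \alpha the learning rate, L the dual loss, and the intermediate iterates treated as constants, which is what the loss accumulation does):

```latex
% K steps of SGD with learning rate \alpha:
\theta_{k+1} = \theta_k - \alpha \nabla L(\theta_k)
\quad\Longrightarrow\quad
\theta_K - \theta_0 = -\alpha \sum_{k=0}^{K-1} \nabla L(\theta_k),
% i.e. exactly one SGD step on the accumulated loss \sum_k L(\theta_k),
% with each gradient taken at its own (detached) iterate \theta_k.
% Adam's update, -\alpha \, \hat{m}_k / (\sqrt{\hat{v}_k} + \epsilon), is not
% linear in the per-step gradients, so the sum of K Adam updates is not one
% Adam update on the summed gradient.
```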

Ok great! Thanks for the confirmation, Henry! This matches my expectation. I'll close the issue now, but feel free to update us all with your future findings. 📈🙌🏽