aravindr93 / mjrl

Reinforcement learning algorithms for MuJoCo tasks

linear policy?

zafarali opened this issue · comments

Hi, I got here after reading your paper "Towards Generalization and Simplicity in Continuous Control" and I was wondering if the file https://github.com/aravindr93/mjrl/blob/master/mjrl/policies/gaussian_mlp.py was the one used in your paper? The paper claims to use linear policies, but this network appears to be an MLP.

Hi @zafarali, thanks for the interest. The paper indeed uses a linear policy. For the paper, we used a bare numpy implementation without any autograd (I am planning to make this public as well, but will likely not do any further development in that repo). To get linear policies with this repo, you can simply strip off the hidden layers in the MLP policy. Please let me know if further clarification is needed.
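For illustration, here is a minimal sketch of what such a stripped-down policy could look like in plain PyTorch. This is not the repo's actual API; the class and method names below are hypothetical. The action mean is a single affine map of the observation, with a learnable state-independent log standard deviation, which is all a linear Gaussian policy needs.

```python
import torch
import torch.nn as nn

class LinearGaussianPolicy(nn.Module):
    """Hypothetical sketch of a linear Gaussian policy:
    a ~ N(W * obs + b, exp(log_std)^2), i.e. an MLP with zero hidden layers."""

    def __init__(self, obs_dim, act_dim, init_log_std=0.0):
        super().__init__()
        # Single affine map from observations to action means (no hidden layers).
        self.mean = nn.Linear(obs_dim, act_dim)
        # State-independent log standard deviation, one value per action dimension.
        self.log_std = nn.Parameter(torch.full((act_dim,), init_log_std))

    def forward(self, obs):
        return self.mean(obs), self.log_std

    def sample_action(self, obs):
        # Reparameterized Gaussian sample around the linear mean.
        mean, log_std = self.forward(obs)
        return mean + torch.exp(log_std) * torch.randn_like(mean)
```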

Thanks for your reply!

So is it safe to say that no GPUs were needed? How long did your experiments take?

I found this surrogate loss in the REINFORCE code; can you elaborate on it a bit? What does "CPI" stand for? I am unsure whether using this surrogate re-creates the vanilla REINFORCE algorithm (i.e., will the likelihood ratio equal 1)?

def CPI_surrogate(self, observations, actions, advantages):
    advantages = advantages / (np.max(advantages) + 1e-8)
    adv_var = Variable(torch.from_numpy(advantages).float(), requires_grad=False)
    old_dist_info = self.policy.old_dist_info(observations, actions)
    new_dist_info = self.policy.new_dist_info(observations, actions)
    LR = self.policy.likelihood_ratio(new_dist_info, old_dist_info)
    surr = torch.mean(LR*adv_var)
    return surr

Yes, no GPUs are needed for the OpenAI Gym continuous control tasks. This is standard practice in other code bases as well (e.g. rllab).

The experiments can take anywhere from 5 minutes to 2 hours on a single good workstation, depending on the complexity of the task (from the simple swimmer to the complex humanoid).

CPI stands for Conservative Policy Iteration (paper: http://www.cs.cmu.edu/~./jcl/papers/aoarl/Final.pdf), which is the paper that originally proposed this surrogate. The gradient of this surrogate is identical to the REINFORCE gradient, as shown in that paper. The LR will indeed equal 1 when using on-policy samples. This general code structure is useful for other algorithms that use off-policy samples (e.g. PPO, Q-Prop).
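To make the on-policy equivalence concrete, here is a small self-contained check in plain PyTorch (a sketch with a toy linear Gaussian policy, not the repo's code): when the "old" and "new" parameters coincide, the likelihood ratio is exactly 1, and the gradient of mean(LR * A) matches the REINFORCE gradient of mean(log pi(a|s) * A).

```python
import torch

torch.manual_seed(0)
obs_dim, act_dim, n = 4, 2, 32

# Toy linear Gaussian policy: theta holds the mean weights; std is fixed here.
theta = torch.randn(obs_dim, act_dim, requires_grad=True)
theta_old = theta.detach().clone()   # on-policy: old parameters equal new parameters
log_std = torch.zeros(act_dim)

# Sample a batch of (obs, action, advantage) from the "old" policy.
obs = torch.randn(n, obs_dim)
actions = obs @ theta_old + torch.exp(log_std) * torch.randn(n, act_dim)
adv = torch.randn(n)

def log_prob(params):
    # log pi(a | s) under a diagonal Gaussian with mean obs @ params.
    dist = torch.distributions.Normal(obs @ params, torch.exp(log_std))
    return dist.log_prob(actions).sum(dim=-1)

# CPI surrogate: mean( pi_new(a|s) / pi_old(a|s) * A ), with pi_old held constant.
LR = torch.exp(log_prob(theta) - log_prob(theta_old))
print(torch.allclose(LR, torch.ones_like(LR)))               # True: LR == 1 on-policy
surr_grad = torch.autograd.grad(torch.mean(LR * adv), theta)[0]

# REINFORCE surrogate: mean( log pi(a|s) * A ).
reinforce_grad = torch.autograd.grad(torch.mean(log_prob(theta) * adv), theta)[0]
print(torch.allclose(surr_grad, reinforce_grad, atol=1e-6))  # True: identical gradients
```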

Thanks, this is super useful!

@zafarali Sorry for the delay. Here is the linear policy code, which simply removes the hidden layers from the neural network: https://github.com/aravindr93/mjrl/blob/master/mjrl/policies/gaussian_linear.py

Also, here is an example that compares linear policies with neural networks and demonstrates that linear policies train faster for some problems: https://github.com/aravindr93/mjrl/blob/master/examples/linear_nn_comparison.py