openai / baselines

OpenAI Baselines: high-quality implementations of reinforcement learning algorithms

Gumbel Distribution and Differentiability

mm1212345 opened this issue · comments

Hey there!
I am currently working my way through the process of sampling an action from a categorical distribution. In order to sample from the distribution defined by the logits, Gumbel noise is added to the logits, which is the reason for the double log. Correct?
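To illustrate the double log, here is a minimal NumPy sketch of the Gumbel-max trick (the logits values are just an example, not from baselines): adding Gumbel noise `-log(-log(u))` to the logits and taking the argmax produces exact samples from the categorical distribution `softmax(logits)`.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([2.0, 0.5, -1.0])        # example unnormalized log-probabilities
probs = np.exp(logits) / np.exp(logits).sum()  # the target categorical distribution

# Gumbel-max trick: argmax of logits plus Gumbel(0, 1) noise.
# If u ~ Uniform(0, 1), then -log(-log(u)) ~ Gumbel(0, 1) -- hence the double log.
n = 200_000
u = rng.uniform(size=(n, logits.size))
samples = np.argmax(logits - np.log(-np.log(u)), axis=-1)

# Empirical frequencies should match softmax(logits).
freq = np.bincount(samples, minlength=logits.size) / n
print(np.round(freq, 3))
print(np.round(probs, 3))
```

Running this shows the empirical frequencies converging to the softmax probabilities, confirming that the double-log term is exactly the Gumbel noise needed for unbiased sampling.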

But still, the action is chosen with `tf.argmax(self.logits - tf.log(-tf.log(u)), axis=-1)`. Isn't it the case that the argmax operation makes the whole sampling process non-differentiable?
What else am I misunderstanding?
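The argmax is indeed non-differentiable; policy-gradient methods typically differentiate the log-probability of the sampled action rather than backpropagating through the sample itself. For cases where a gradient through the sample is wanted, a standard workaround (shown here as a NumPy sketch, not baselines' approach) is the Gumbel-softmax relaxation: replace the hard argmax with a temperature-controlled softmax over the same Gumbel-perturbed logits.

```python
import numpy as np

def gumbel_softmax(logits, tau, rng):
    """Differentiable relaxation of Gumbel-max sampling.

    Returns a probability vector that approaches a one-hot sample
    as the temperature tau approaches 0.
    """
    u = rng.uniform(size=logits.shape)
    g = -np.log(-np.log(u))          # Gumbel(0, 1) noise, same double log
    z = (logits + g) / tau
    z = z - z.max()                  # subtract max for numerical stability
    y = np.exp(z)
    return y / y.sum()

rng = np.random.default_rng(1)
logits = np.array([2.0, 0.5, -1.0])  # example logits, not from baselines

soft = gumbel_softmax(logits, tau=1.0, rng=rng)   # smooth, fully differentiable
hard = gumbel_softmax(logits, tau=0.01, rng=rng)  # nearly one-hot
print(np.round(soft, 3))
print(np.round(hard, 3))
```

At high temperature every entry of the output is strictly positive, so gradients flow to all logits; as `tau` shrinks the output concentrates on a single index, recovering the argmax behavior in the limit.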