Implementation of dilated form of MMD
Atan03 opened this issue
Hello, I am currently replicating the experiments on DOGDA and DOMWU described in the paper[^1]. My goal is to modify the existing code based on the dilated version of the MMD algorithm to incorporate a specific adjustment. It seems that setting $\alpha = 0$ in MMD causes it to degenerate into MD.
However, I have some uncertainties about the gradient computation step in the dilated version, especially regarding the subtraction of `num_children`. I understand that the computed gradient is related to the dilated entropy of the sequence-form strategy, but I am still struggling to fully comprehend it.
Your assistance in clarifying this matter would be highly appreciated.
Despite not having complete clarity, I have followed the previously mentioned idea and completed the implementation of DOMWU (equivalent to MMD with $\alpha = 0$ and negative entropy as the dgf).
Please forgive my limited familiarity with dilated-form updating. If I intend to implement DOGDA (equivalent to MMD with $\alpha = 0$ and the l2 norm as the dgf), do I only need to replace the term involving negative entropy in MMD with the l2 norm and replace the softmax with a projection onto the probability simplex?
Once again, your assistance in understanding this would be greatly appreciated, even though I recognize that it is more of a theoretical question than a coding implementation issue.
[^1]: Last-iterate Convergence in Extensive-Form Games
Hi @Atan03! I'm glad you are using my code and am happy to answer your question.
> It seems that setting $\alpha = 0$ in MMD causes it to degenerate into MD.
Exactly! MMD with $\alpha = 0$ reduces to standard mirror descent (MD).
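For reference, in my notation (following the MMD paper, not code from this repo), the MMD step at a single decision point with stepsize $\eta$, magnet $\rho$, and dgf $\psi$ is

$$\pi_{t+1} = \operatorname*{argmin}_{\pi \in \Delta}\; \eta\,\langle g_t, \pi\rangle + \eta\alpha\, B_\psi(\pi; \rho) + B_\psi(\pi; \pi_t),$$

where $B_\psi$ is the Bregman divergence of $\psi$; setting $\alpha = 0$ drops the magnet term and leaves the usual MD update.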
> do I only need to replace the term involving negative entropy in MMD with L2-norm and replace the softmax with projection onto the probability simplex?
It is true that you would need to change both the negative entropy and softmax terms, but you would also need to change the gradient computation, specifically the part you pointed out regarding `num_children`. The subtraction of `num_children` is a trick that is only valid for dilated entropy, not for dilation with the l2 norm.
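For the softmax-to-projection part, here is a minimal sketch, not from this repo (the function name and example values are mine), of the standard sorting-based Euclidean projection onto the simplex (Duchi et al., 2008) that would take the softmax's place under the l2 dgf:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex,
    via the O(n log n) sorting method of Duchi et al. (2008)."""
    u = np.sort(v)[::-1]                 # sort descending
    css = np.cumsum(u)
    ks = np.arange(1, len(v) + 1)
    # Largest k with u_k - (css_k - 1) / k > 0
    k = ks[u - (css - 1.0) / ks > 0][-1]
    tau = (css[k - 1] - 1.0) / k
    return np.maximum(v - tau, 0.0)

# One descent step at a single decision point: with the l2 dgf,
# pi' = Proj(pi - eta * grad) replaces the softmax update of the
# entropy dgf.
pi, grad, eta = np.array([0.5, 0.3, 0.2]), np.array([1.0, -0.5, 0.2]), 0.1
print(project_simplex(pi - eta * grad))  # sums to 1, all entries >= 0
```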
Below I can run through the math on why this is the case.
Say we want to compute the partial derivative of the dilated dgf with respect to a sequence $(s,a)$, that is $\frac{\partial \psi(x)}{\partial x_{(s,a)}}$. Only two pieces of the dilated dgf depend on $x_{(s,a)}$:

- The dilated entropy at state $s$
- All the children states of the sequence, $C(s,a)$. These are all the possible states that can be reached after selecting action $a$ at state $s$.

Mathematically we have:

$$\psi(x) = \sum_{s'} x_{p(s')}\, \psi_{s'}\!\left(\frac{x_{(s',\cdot)}}{x_{p(s')}}\right), \qquad
\frac{\partial \psi(x)}{\partial x_{(s,a)}}
= \underbrace{\frac{\partial}{\partial x_{(s,a)}}\left[x_{p(s)}\, \psi_{s}\!\left(\frac{x_{(s,\cdot)}}{x_{p(s)}}\right)\right]}_{(1)}
+ \underbrace{\sum_{s' \in C(s,a)} \frac{\partial}{\partial x_{(s,a)}}\left[x_{(s,a)}\, \psi_{s'}\!\left(\frac{x_{(s',\cdot)}}{x_{(s,a)}}\right)\right]}_{(2)}$$
I've introduced some new notation, here is a breakdown:

- $x_{(s,\cdot)}$ is the slice of the sequence form corresponding to state $s$
- $p(s)$ is the parent sequence of state $s$
- $\psi_s$ is the local dgf at state $s$; in our case we are picking negative entropy
- $\pi_s = \frac{x_{(s,\cdot)}}{x_{p(s)}}$ is the policy at state $s$
The first part (1), by the chain rule, just reduces to the partial derivative of negative entropy (or whatever dgf you use) with respect to the policy at state $s$:

$$\frac{\partial}{\partial x_{(s,a)}}\left[x_{p(s)}\, \psi_s(\pi_s)\right] = x_{p(s)} \cdot \frac{\partial \psi_s(\pi_s)}{\partial \pi_{(s,a)}} \cdot \frac{1}{x_{p(s)}} = \frac{\partial \psi_s(\pi_s)}{\partial \pi_{(s,a)}},$$

which for negative entropy is $\log \pi_{(s,a)} + 1$.
The second part is more interesting for us. If we just look at one child state $s' \in C(s,a)$, the product and chain rules give:

$$\frac{\partial}{\partial x_{(s,a)}}\left[x_{(s,a)}\, \psi_{s'}\!\left(\frac{x_{(s',\cdot)}}{x_{(s,a)}}\right)\right]
= \psi_{s'}(\pi_{s'}) - \nabla \psi_{s'}(\pi_{s'})^{\top} \pi_{s'}$$
If we plug in negative entropy as $\psi_{s'}$, i.e. $\psi_{s'}(\pi) = \sum_b \pi_b \log \pi_b$ with $\nabla \psi_{s'}(\pi)_b = \log \pi_b + 1$, the whole expression collapses:

$$\psi_{s'}(\pi_{s'}) - \sum_b \pi_{(s',b)}\left(\log \pi_{(s',b)} + 1\right) = \psi_{s'}(\pi_{s'}) - \psi_{s'}(\pi_{s'}) - 1 = -1$$
Therefore, each child contributes exactly $-1$ to the partial derivative, which is why the entropy code can simply subtract `num_children` $= |C(s,a)|$. With the l2 norm this cancellation does not occur, so the child terms have to be computed explicitly.
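To make the contrast concrete, here is a minimal sketch (all names and data layouts are hypothetical, not the repo's actual structures) of the generic per-sequence gradient of the dilated dgf, following the two-part decomposition above:

```python
import numpy as np

def neg_entropy(pi):
    return float(np.sum(pi * np.log(pi)))

def neg_entropy_grad(pi):
    return np.log(pi) + 1.0

def l2(pi):
    return float(0.5 * np.sum(pi ** 2))

def l2_grad(pi):
    return pi

def dilated_grad_at(s, a, x, x_par, children, psi, psi_grad):
    """Partial derivative of the dilated dgf w.r.t. sequence (s, a).

    x[s]     : np.array slice of the sequence form at infostate s
    x_par[s] : float value of the parent sequence of s
    children : dict mapping (s, a) -> list of child infostates
    """
    pi_s = x[s] / x_par[s]
    g = psi_grad(pi_s)[a]                      # part (1)
    for sp in children.get((s, a), []):
        pi_sp = x[sp] / x[s][a]
        # part (2): psi(pi) - grad(pi) . pi; equals -1 for neg entropy,
        # which is why that code can just subtract num_children.
        g += psi(pi_sp) - psi_grad(pi_sp) @ pi_sp
    return g

# Toy check: root infostate "r" with two actions; action 0 leads to "c".
x = {"r": np.array([0.6, 0.4]), "c": np.array([0.3, 0.3])}
x_par = {"r": 1.0}
children = {("r", 0): ["c"]}
print(dilated_grad_at("r", 0, x, x_par, children,
                      neg_entropy, neg_entropy_grad))  # log(0.6) + 1 - 1
```

With `neg_entropy`, every loop iteration adds exactly $-1$, so the loop collapses to subtracting `len(children[(s, a)])`, i.e. `num_children`; with `l2`, each child contributes $-\tfrac{1}{2}\lVert\pi_{s'}\rVert^2$, which genuinely depends on $\pi_{s'}$ and must be computed.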
Thank you for your comprehensive explanation! It has been immensely helpful, and I now fully understand the formula. Following your guidance, I successfully implemented OGDA on an NFG and achieved convergence results. I am now closing this issue.