[Question] invalid action mask in interior action selection?

honglu2875 opened this issue 2 years ago

Hello! First, thank you for this repo. I learned a lot about JAX and MuZero from reading the code. I plan to use this library in our project soon, hopefully to solve some math problems.

Just a question about the invalid action mask. I noticed that you apply the invalid action mask at root nodes but not at interior nodes. Is leaving out the interior action mask part of being rule-agnostic (there is a sentence in the MuZero paper saying MuZero "learns" the rules)? Or is it still recommended to go ahead and rewrite a few functions to add masks for the interior action selection?

I should be able to rewrite gumbel_muzero_interior_action_selection to achieve this, but since the API only offers a mask for roots, I'm asking to make sure that adding one for interior nodes is still good practice and doesn't go against the purpose of the algorithm.

If masking interior actions is recommended, do you plan to expand the features in that direction? I would love to help if there is an opportunity.

Ivo Danihelka · Answer 1 · Tue Sep 06 2022 05:36:32 GMT+0800 (China Standard Time)

Thanks for asking.
If you know which interior actions are invalid, you can make the policy network give them zero probability, for example by setting their logits to -1e9. This should be enough when using Gumbel MuZero (or Gumbel AlphaZero).
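
To illustrate the suggestion above, here is a minimal JAX sketch of masking logits before they are handed to the search. The helper name `mask_invalid_logits` and the example arrays are hypothetical, and how you obtain the invalid-action mask for an interior node (for example, from a learned or simulated environment model) is left as an assumption.

```python
import jax
import jax.numpy as jnp


def mask_invalid_logits(logits, invalid_actions, masked_value=-1e9):
    """Replaces the logits of invalid actions with a large negative value.

    Args:
      logits: float array of shape [batch, num_actions].
      invalid_actions: array of shape [batch, num_actions], nonzero where
        the action is invalid (hypothetical input, not provided by the search).
      masked_value: the logit assigned to invalid actions.
    """
    return jnp.where(invalid_actions.astype(bool), masked_value, logits)


# Hypothetical example: action 1 is invalid.
logits = jnp.array([[1.0, 2.0, 0.5, -1.0]])
invalid_actions = jnp.array([[0, 1, 0, 0]])

masked_logits = mask_invalid_logits(logits, invalid_actions)
probs = jax.nn.softmax(masked_logits, axis=-1)
# The masked action receives zero probability after the softmax;
# the remaining probabilities still sum to one.
```

In a setup like the one discussed here, the masked logits would be what the recurrent function returns as prior logits for interior nodes, so the search never needs a separate interior mask.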

Honglu Fan · Answer 2 · Tue Sep 06 2022 09:57:14 GMT+0800 (China Standard Time)

Ah, indeed!
But does this mean the invalid action mask for the root is redundant, or was it designed for other purposes?

Ivo Danihelka · Answer 3 · Tue Sep 06 2022 16:26:35 GMT+0800 (China Standard Time)

The invalid action mask at the root seems useful in a few cases:
a) For UCT-like algorithms that do not use the policy network.
b) The sequential halving in Gumbel MuZero also needs to know the number of valid actions. We considered detecting that with a threshold on the action log-probabilities, but using the explicit invalid action mask is less magical there.
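
As a rough sketch of point (b), the snippet below contrasts counting valid root actions from an explicit mask with guessing the count from a log-probability threshold. Both helper functions and the threshold value are hypothetical and are not taken from the library. The example shows how a confident policy can make the threshold miscount a legitimately valid action.

```python
import jax
import jax.numpy as jnp


def num_valid_from_mask(invalid_actions):
    """Counts valid actions per batch element from an explicit mask."""
    return jnp.sum(1 - invalid_actions, axis=-1)


def num_valid_from_threshold(logits, log_prob_threshold=-20.0):
    """Guesses the number of valid actions from the log-probabilities.

    The threshold is an arbitrary, hypothetical choice.
    """
    log_probs = jax.nn.log_softmax(logits, axis=-1)
    return jnp.sum(log_probs > log_prob_threshold, axis=-1)


# Action 2 is invalid (logit -1e9). Action 1 is valid but very unlikely
# under a confident policy.
logits = jnp.array([[30.0, 0.0, -1e9]])
invalid_actions = jnp.array([[0, 0, 1]])

count_from_mask = num_valid_from_mask(invalid_actions)
count_from_threshold = num_valid_from_threshold(logits)
# The mask reports two valid actions. The threshold reports only one,
# because the log-probability of action 1 is about -30.
```

The explicit mask gives the right count regardless of how peaked the policy is, which is one way to read the "less magical" remark.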

Honglu Fan · Answer 4 · Tue Sep 06 2022 16:29:04 GMT+0800 (China Standard Time)

Gotcha, thanks!