vwxyzjn / invalid-action-masking

Source Code for A Closer Look at Invalid Action Masking in Policy Gradient Algorithms


Masking removed still behaves to some extent

sycz00 opened this issue

sycz00 commented

Hey,

Not sure if this is the proper way of asking you something, but I'll give it a try.
According to your paper, you found that invalid action masking still works quite well even when the mask is removed after training has finished. As far as I understood, you compared this against the method of giving a reward penalty for executing an invalid action.

In my case I am experiencing exactly the opposite: training with invalid action masking does not produce reasonable distributions once the mask is removed, whereas training with a negative reward penalty does, although the reward-penalty results are still not satisfying either.
This doesn't bother me too much in practice, since I have access to the mask at inference time as well.
Question:
Do you have any insight into what could be the issue? Training with invalid action masking accelerates training by a factor of 10, the training pipeline looks fine, and the number of discrete actions is not "too big" (8 actions). What also puzzles me is that you prove the masked policy gradient is a valid gradient and should therefore be correct for backpropagation.
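For reference, here is a minimal sketch of the kind of masking I have in mind (assuming invalid-action logits are pushed to a large negative value before the softmax; the helper name and the 1 = valid mask convention are just for illustration, not necessarily your implementation):

```python
import torch
from torch.distributions import Categorical

def masked_categorical(logits, action_mask):
    """Hypothetical helper: build a Categorical over valid actions only,
    by replacing invalid-action logits with a large negative value."""
    # action_mask: 1 for valid actions, 0 for invalid ones (my convention here)
    masked_logits = torch.where(
        action_mask.bool(),
        logits,
        torch.full_like(logits, -1e8),  # ~zero probability after the softmax
    )
    return Categorical(logits=masked_logits)

# 8 discrete actions, as in my setup; the mask values are just an example
logits = torch.randn(1, 8)
mask = torch.tensor([[1, 1, 0, 1, 0, 0, 1, 1]])
dist = masked_categorical(logits, mask)
action = dist.sample()
log_prob = dist.log_prob(action)  # gradient only flows through valid-action logits
```

With this setup the policy is trained against the masked distribution, and the issue is that sampling from the unmasked logits after training gives poor behavior.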

Thanks in advance if you have any ideas or hints.
Greetings,
Fabian