airsplay / R2R-EnvDrop

PyTorch Code of NAACL 2019 paper "Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout"


reward quantification

chijames opened this issue · comments

Hi,

Thanks for the code. Can you please explain a bit more about why we need to quantize the reward in agent.py? I did not see this in the paper. Thanks.

Hi,

The discrete reward is a standard technique in reinforcement learning with hand-crafted rewards. One example is Visual Attention for multiple object detection.
Using raw distances as rewards would give roughly similar results but is not robust: the accuracy fluctuates during training and has a large variance w.r.t. different random seeds.

In general, most RL methods are over-sensitive to the scale and distribution of the reward. Hence it is common to see reward normalization techniques in code, e.g., reward * alpha (to scale the reward distribution) or ln(reward) / exp(reward) / 1/reward (to change the family of distributions).

The discrete reward is yet another technique to normalize the distribution of rewards, one that is insensitive to both the scale and the distribution family of the raw reward.
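For concreteness, here is a minimal Python sketch of the two kinds of step reward being compared; the function names and the exact thresholds are illustrative assumptions, not the precise code in agent.py:

```python
def raw_reward(last_dist, dist):
    """Raw shaped reward: how much closer the agent got to the goal (in meters)."""
    return last_dist - dist

def quantized_reward(last_dist, dist):
    """Discretized version: only the sign of the progress is kept."""
    progress = last_dist - dist
    if progress > 0.0:
        return 1.0       # moved closer to the goal
    elif progress < 0.0:
        return -1.0      # moved away from the goal
    else:
        return 0.0       # no change in distance to the goal

# The quantized reward is insensitive to the scale of the environment:
print(raw_reward(10.0, 4.0), quantized_reward(10.0, 4.0))    # 6.0  vs 1.0
print(raw_reward(10.0, 9.75), quantized_reward(10.0, 9.75))  # 0.25 vs 1.0
```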

However, there is no free lunch! The discrete reward introduces a "positive weight cycle": a loop in the graph along which the accumulated reward is positive, so the agent can keep collecting reward by circling it. This issue is addressed by our RL + IL method.
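To see how such a cycle can arise under quantization, consider a hypothetical loop in the navigation graph where the agent creeps toward the goal in several small steps and then jumps back to its start node in one large step (the distances below are made up for illustration):

```python
# Hypothetical distances-to-goal along a loop that returns to its start node:
# three small steps toward the goal, then one large step back to the start.
dists = [10.0, 9.5, 9.0, 8.5, 10.0]   # made-up numbers, back to 10.0 at the end

raw = sum(prev - cur for prev, cur in zip(dists, dists[1:]))
quantized = sum(1.0 if prev > cur else -1.0 if prev < cur else 0.0
                for prev, cur in zip(dists, dists[1:]))

print(raw)        # 0.0 -> the raw shaped reward telescopes to zero over a cycle
print(quantized)  # 2.0 -> the agent earns +2 per lap just by looping
```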

Hope these help!

Thanks for your thorough explanation!

@airsplay

Using raw distances as rewards would give roughly similar results but is not robust.

I don't understand how this phenomenon happens. Could you point me to some work about it?

@ZhuFengdaaa
Hmm, there might not be a paper that mentions it explicitly, but I can try to explain it.

The stability of RL depends (to some extent) on the distribution of rewards. With discretization, the distribution is reduced to a well-understood Bernoulli distribution. In contrast, the distribution of distances on a graph depends on the characteristics of that graph, so a few normalization techniques need to be applied, e.g., taking the log of the reward if its distribution belongs to some specific exponential family.

However, you do not always know the characteristics of the graph, so a biased reward normalization can make RL training unstable.
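As a rough illustration of this last point, the sketch below compares the raw distance-delta reward in two hypothetical environments with very different step-size distributions against the sign-quantized reward; the distributions are made-up assumptions, not measurements from R2R:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical environments whose per-step changes in distance-to-goal
# follow very different distributions (e.g., small rooms vs. long corridors).
deltas_small = rng.normal(loc=0.2, scale=0.3, size=10_000)        # made-up statistics
deltas_large = rng.lognormal(mean=1.0, sigma=0.8, size=10_000) - 2.0

for name, deltas in [("small-scale graph", deltas_small),
                     ("large-scale graph", deltas_large)]:
    raw = deltas                 # raw shaped reward = progress in meters
    quantized = np.sign(deltas)  # discretized reward in {-1, 0, +1}
    print(f"{name}: raw mean/std = {raw.mean():.2f}/{raw.std():.2f}, "
          f"quantized mean/std = {quantized.mean():.2f}/{quantized.std():.2f}")

# The raw reward statistics shift with the graph, so a fixed normalization
# (scaling, log, etc.) tuned on one environment can be biased on another;
# the quantized reward keeps the same bounded support in both.
```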